A catalog of side effects
Optimizing compilers like to keep track of each IR instruction's effects. An instruction's effects vary wildly, from having no effects at all, to writing a specific variable, to completely unknown (writing all state). This post can be thought of as a continuation of What I talk about when I talk about IRs, specifically the section about asking the right questions. When we talk about effects, we should ask the right questions: not what opcode is this? but instead what effects does this opcode have?

Different compilers represent and track these effects differently. I've been thinking about how to represent these effects all year, so I have been doing some reading. In this post I will give some summaries of the landscape of approaches. Please feel free to suggest more.

Internal IR effect tracking is similar to the programming language notion of algebraic effects in type systems, but internally, compilers keep track of finer-grained effects. Effects such as "writes to a local variable", "writes to a list", or "reads from the stack" indicate which instructions can be re-ordered, duplicated, or removed entirely.

For example, consider some pseudocode in a made-up language that stands in for a snippet of compiler IR: a read of one object's field followed by a write to another object's field. The goal of effects is to communicate to the compiler whether, for example, these two IR instructions can be re-ordered. The second instruction might write to a location that the first one reads. But it also might not! This is about knowing whether the two names alias, that is, whether they are different names that refer to the same object. We can sometimes answer that question directly, but often it's cheaper to compute an approximate answer: could they even alias? It's possible that the two objects have different types, meaning that (as long as you have strict aliasing) the load and store operations that implement these reads and writes by definition touch different locations. And if they look at disjoint locations, there need not be any explicit order enforced.
Different compilers keep track of this information differently. The null effect analysis gives up and says "every instruction is maximally effectful" and therefore "we can't re-order or delete any instructions". That's probably fine for a first stab at a compiler, where you will get a big speed-up purely from strength reductions. Over-approximations of effects should always be valid. But at some point you start wanting to do dead code elimination (DCE), or common subexpression elimination (CSE), or load/store elimination, or to move instructions around, and you start wondering how to represent effects. That's where I am right now. So here's a catalog of different compilers I have looked at recently.

There are two main ways I have seen to represent effects: bitsets and heap range lists. We'll look at one example compiler for each, talk a bit about tradeoffs, then give a bunch of references to other major compilers.

We'll start with Cinder, a Python JIT, because that's what I used to work on. Cinder tracks heap effects for its high-level IR (HIR) in instr_effects.h. Pretty much everything happens in one central function, which is expected to know everything about what effects the given instruction might have. The data representation is a bitset representation of a lattice called an AliasClass, defined in alias_class.h. Each bit in the bitset represents a distinct location in the heap: reads from and writes to each of these locations are guaranteed not to affect any of the other locations. It is defined with an X-macro.

Note that each bit implicitly represents a set: the bit for list elements does not refer to a specific list index, but to the infinite set of all possible list indices. It's any list index. Still, every list index is completely disjoint from, say, every entry in a global variable table. (And, to be clear, an object in a list might be the same as an object in a global variable table. The objects themselves can alias.
But the thing being written to or read from, the thing being side-effected, is the container.)

Like other bitset lattices, it's possible to union the sets by or-ing the bits. It's possible to query for overlap by and-ing the bits. If this sounds familiar, it's because (as the repo notes) it's a similar idea to Cinder's type lattice representation. Like other lattices, there is both a bottom element (no effects) and a top element (all possible effects). Union operations naturally hit a fixpoint at the top element, and intersection operations naturally hit a fixpoint at the bottom element. All of this together lets the optimizer ask and answer questions such as: where might this instruction write? where does this instruction borrow its input from? (because CPython is reference counted and incref implies ownership) do these two instructions' write destinations overlap?

Let's take a look at an (imaginary) IR version of the code snippet in the intro and see what analyzing it might look like in the optimizer. You can imagine a fake IR where one instruction declares that it reads from the tuple heap and another declares that it writes to the list heap. Because tuple and list pointers cannot be cast into one another and therefore cannot alias, these are disjoint heaps in our bitset. The intersection of their alias classes is the bottom element, therefore these memory operations can never interfere! They can (for example) be re-ordered arbitrarily.

In Cinder, these memory effects could in the future be used for instruction re-ordering, but today they are mostly used in two places: the refcount insertion pass and DCE. DCE involves first finding the set of instructions that need to be kept around because they are useful/important/have effects. There are some other checks in the Cinder DCE, but the effects query is right there at the core of it!

Now that we have seen the bitset representation of effects and an implementation in Cinder, let's take a look at a different representation and an implementation in JavaScriptCore. I keep coming back to How I implement SSA form by Fil Pizlo, one of the significant contributors to JavaScriptCore (JSC). In particular, I keep coming back to the Uniform Effect Representation section.
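Before diving into JSC, the Cinder-style bitset lattice described above can be sketched in a few lines (the alias-class names here are invented, not Cinder's real ones):

```python
# Hedged sketch of a bitset alias-class lattice. Each bit is a disjoint
# region of the heap; union is bitwise-or, overlap is bitwise-and.
from enum import IntFlag

class AliasClass(IntFlag):
    EMPTY      = 0       # bottom: no effects
    LIST_ITEM  = 1 << 0  # any list element
    TUPLE_ITEM = 1 << 1  # any tuple element
    GLOBAL     = 1 << 2  # any global-table entry
    OTHER      = 1 << 3  # everything else
    ANY = LIST_ITEM | TUPLE_ITEM | GLOBAL | OTHER  # top: all effects

def may_interfere(a: AliasClass, b: AliasClass) -> bool:
    # Overlap query: intersect by and-ing the bits.
    return (a & b) != AliasClass.EMPTY

# A tuple read and a list write touch disjoint heap regions...
assert not may_interfere(AliasClass.TUPLE_ITEM, AliasClass.LIST_ITEM)
# ...but the top element interferes with everything.
assert may_interfere(AliasClass.ANY, AliasClass.LIST_ITEM)
```

Or-ing any class with `ANY` yields `ANY` and and-ing any class with `EMPTY` yields `EMPTY`, which is exactly the fixpoint behavior of the top and bottom lattice elements.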
This notion of "abstract heaps" felt very… well, abstract. Somehow more abstract than the bitset representation. The pre-order and post-order integer pair as a way to represent nested heap effects just did not click. It didn't make any sense until I actually went spelunking in JavaScriptCore and found one of several implementations—because, you know, JSC is six compilers in a trenchcoat [citation needed]. DFG, B3, DOMJIT, and probably others all have their own abstract heap implementations. We'll look at DOMJIT mostly because it's a smaller example and also because it illustrates something else that's interesting: builtins. We'll come back to builtins in a minute.

Let's take a look at how DOMJIT structures its abstract heaps: a YAML file. It's a hierarchy: each heap is a subheap of its parent, and so on up the tree. A write to any heap is also a write to every heap that contains it. Sibling heaps, on the other hand, are unrelated and disjoint.

To get a feel for this, I wired up a simplified version of ZJIT's bitset generator (for types!) to read a YAML document and generate a bitset as Rust code. It's not a fancy X-macro, but it's a short and flexible Ruby script. Then I took the DOMJIT abstract heap generator—also, funnily enough, a short Ruby script—modified the output format slightly, and had it generate its int pairs. The output comes with a little diagram, which is super helpful for readability.

Any empty range represents empty heap effects: if the start and end are the same number, there are no effects. There is no one canonical empty value, but any empty range could be normalized to one. Maybe this was obvious to you, dear reader, but this pre-order/post-order thing is about nested ranges! Seeing the output of the generator laid out clearly like this made it make a lot more sense to me.

What about checking overlap? The implementation in JSC is a short interval-overlap check. (See also How to check for overlapping intervals and Range overlap in two compares for more fun.)
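The nesting trick can be sketched directly (the heap names below are invented stand-ins, not DOMJIT's actual hierarchy): a depth-first walk assigns each heap a (pre, post) integer pair such that a child's range nests inside its parent's, which turns both "is a subheap of" and "could these overlap" into a two-compare interval test:

```python
# Hedged sketch of JSC-style abstract heaps as integer ranges.

def number_heaps(tree, counter=None, out=None):
    # tree: {name: subtree-dict}; assigns (pre, post) ranges in one DFS pass.
    if counter is None:
        counter, out = [0], {}
    for name, children in tree.items():
        start = counter[0]
        number_heaps(children, counter, out)
        counter[0] += 1  # reserve a slot for the heap itself
        out[name] = (start, counter[0])
    return out

def overlaps(a, b):
    # Two half-open ranges overlap iff each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

heaps = number_heaps({
    "World": {
        "DOM":   {"Node": {}, "Event": {}},
        "Stack": {},
    },
})

assert overlaps(heaps["World"], heaps["Node"])      # ancestor contains child
assert not overlaps(heaps["Node"], heaps["Event"])  # siblings are disjoint
```

A write to a parent heap overlaps every descendant's range, and siblings get adjacent, non-overlapping ranges, which matches the "write to any subheap is a write to the heaps above it" rule.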
While bitsets are a dense representation (you have to hold every bit), they are very compact and very precise. You can hold a set of 64 or 128 bits in a register or two, and the union and intersection operations are very cheap. With int ranges, it's a little more complicated. An imprecise union of two ranges can take the maximal range that covers both. To get a more precise union, you have to keep track of both ranges. In the worst case, if you want efficient arbitrary queries, you need to store your int ranges in an interval tree.

So what gives? I asked Fil: if both bitsets and int ranges answer the same question, why use int ranges? He said that it's more flexible long-term: bitsets get expensive as soon as you need more than 128 bits (you might need to heap-allocate them!) whereas ranges have no such ceiling. But doesn't holding sequences of ranges require heap allocation? Well, no. As Fil writes in his SSA post:

The purpose of the effect representation baked into the IR is to provide a precise always-available baseline for alias information that is super easy to work with. […] you can have instructions report that they read/write multiple heaps […] you can have a utility function that produces such lists on demand. It's important to note that this doesn't actually involve any allocation of lists.

JSC does this very clever thing where they have "functors" that they pass in as arguments that compress/summarize what they want out of an instruction's effects.

Let's take a look at how the DFG (for example) uses these heap ranges in analysis. The DFG is structured in such a way that it can make use of the DOMJIT heap ranges directly, which is neat. Note that in the example below, the DOMJIT heap range is a thin wrapper over the DFG compiler's own equivalent. clobberize is the function that calls these functors (read and write functors, in this case) for each effect that the given IR instruction declares. I've pulled some relevant snippets out of clobberize, which is quite long, that I think are interesting.
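Before looking at those snippets, the functor idea itself can be sketched in a few lines (a heavily simplified, invented mini-IR, not JSC's actual code): instead of materializing lists of effect ranges, an instruction's effects are enumerated by invoking read/write callbacks, so queries never allocate:

```python
# Hedged sketch of functor-style effect enumeration over (pre, post) ranges.

HEAPS = {"Heap": (0, 4), "Locals": (1, 2), "Globals": (2, 3)}

def clobberize(op, read, write):
    # Report each heap the instruction reads or writes via the functors.
    if op == "GetLocal":
        read(HEAPS["Locals"])
    elif op == "SetGlobal":
        write(HEAPS["Globals"])
    elif op == "Call":
        read(HEAPS["Heap"])
        write(HEAPS["Heap"])  # unknown call: could touch anything

def writes_overlap(op, target):
    # The write functor summarizes effects on the fly; no list is built.
    hits = []
    clobberize(op,
               read=lambda h: None,
               write=lambda h: hits.append(h[0] < target[1] and target[0] < h[1]))
    return any(hits)

assert not writes_overlap("GetLocal", HEAPS["Locals"])  # no writes at all
assert writes_overlap("SetGlobal", HEAPS["Globals"])
assert writes_overlap("Call", HEAPS["Locals"])          # Heap covers Locals
```

Each analysis pass supplies its own functors, so the same effect declarations serve DCE, load elimination, and re-ordering queries without any shared list representation.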
First, some instructions (constants, here) have no effects. There's some utility in one of the calls there that I didn't fully understand. Then there are some instructions that conditionally have effects depending on the use types of their operands.[1] Taking the absolute value of an Int32 or a Double is effect-free, but otherwise the instruction looks like it can run arbitrary code. Some run-time IR guards that might cause side exits are annotated as such: they write to a dedicated heap. Local variable instructions read specific heaps indexed by what looks like the local index, though I'm not sure. This means accesses to two different locals won't alias!

Instructions that allocate can't be re-ordered, it looks like; they both read and write the same heap. This probably limits the amount of allocation sinking that can be done.

Then there's the DOM call node, which is the builtins stuff I was talking about. We'll come back to that after the code block. (Remember that these heap ranges are very similar to DOMJIT's, with a couple more details—and in some cases they even contain DOMJIT heap ranges!) This node is the way for the DOM APIs in the browser—a significant chunk of the builtins, which are written in C++—to communicate what they do to the optimizing compiler. Without any annotations, the JIT has to assume that a call into C++ could do anything to the JIT state. Bummer! But because, for example, a DOM getter annotates what memory it reads from and what it doesn't write to, the JIT can optimize around it better—or even remove the access completely. It means the JIT can reason about calls to known builtins the same way that it reasons about normal JIT opcodes. (Incidentally, it looks like it doesn't even make a C call, but is instead inlined as a little memory-read snippet using a JIT builder API. Neat.)

Last, we'll look at Simple, which has a slightly different take on all of this. Simple is Cliff Click's pet Sea of Nodes (SoN) project to try and showcase the idea to the world—outside of a HotSpot C2 context.
This one is a little harder for me to understand, but it looks like each translation unit has a node that doles out different classes of memory nodes, one per alias class. Each IR node then takes data dependencies on whatever effect nodes it might use. Alias classes are split up based on the paper Type-Based Alias Analysis (PDF): "Our approach is a form of TBAA similar to the 'FieldTypeDecl' algorithm described in the paper." The Simple project is structured into sequential implementation stages, and alias classes come into the picture in Chapter 10.

Because I spent a while spelunking through other implementations to see how other projects did this, here is a list of the projects I looked at. Mostly, they use bitsets.

HHVM, a JIT for the Hack language, also uses a bitset for its memory effects. See for example alias-class.h and memory-effects.h. HHVM has a couple of places that use this information, such as a definition-sinking pass, alias analysis, DCE, store elimination, refcount opts, and more. If you are wondering why the HHVM representation looks similar to the Cinder representation, it's because some former HHVM engineers such as Brett Simmers also worked on Cinder!

Android's ART Java runtime also uses a bitset for its effect representation: a very compact class called SideEffects in nodes.h. (Note that I am linking an ART fork on GitHub as a reference, but the upstream code is hosted on googlesource.) The side effects are used in loop-invariant code motion, global value numbering, write barrier elimination, scheduling, and more.

CoreCLR mostly uses a bitset for its effects class. This one is interesting, though, because it also splits out effects specifically to include sets of local variables.

V8 is also about six completely different compilers in a trenchcoat. Turboshaft uses a struct in operations.h called OpEffects, which is two bitsets for the reads and writes of effects. This is used in value numbering as well as in a bunch of other small optimization passes they call "reducers".
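The two-bitset idea can be sketched like this (a hedged simplification with invented bit names, not V8's actual struct): keeping reads and writes in separate sets matters because two reads of the same location never conflict, while a read and a write do:

```python
# Hedged sketch of a reads/writes effect struct and a re-ordering query.

MEMORY, CONTROL = 1 << 0, 1 << 1  # invented effect bits

class Effects:
    def __init__(self, reads=0, writes=0):
        self.reads, self.writes = reads, writes

def can_swap(a, b):
    # Safe to re-order iff neither op writes what the other reads or writes.
    return (a.writes & (b.reads | b.writes)) == 0 and \
           (b.writes & (a.reads | a.writes)) == 0

load = Effects(reads=MEMORY)
store = Effects(writes=MEMORY)

assert can_swap(load, load)      # read/read: fine
assert not can_swap(load, store) # read/write conflict: order matters
```

A single combined bitset would conservatively forbid swapping the two loads; splitting reads from writes recovers that precision at the cost of one extra word per instruction.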
Maglev also has a bitset-like thing in its IR nodes that is used in its various reducers, with effect query methods on it. Until recently, V8 also used Sea of Nodes as its IR representation, which tracks side effects more explicitly in the structure of the IR itself. Guile Scheme looks like it has a custom tagging scheme.

Both bitsets and int ranges are perfectly cromulent ways of representing heap effects for your IR. The Sea of Nodes approach is also probably okay, since it powers HotSpot C2 and (for a time) V8. Remember to ask the right questions of your IR when doing analysis.

Thank you to Fil Pizlo for writing his initial GitHub Gist and sending me on this journey, and thank you to Chris Gregory, Brett Simmers, and Ufuk Kayserilioglu for feedback on making some of the explanations more helpful.

[1] This is because the DFG compiler does this interesting thing where they track and guard the input types on use vs having types attached to the input's own def. It might be a clean way to handle shapes inside the type system while also allowing the type+shape of an object to change over time (which it can do in many dynamic language runtimes). ↩