GreatReads - Blog Aggregator · Phoenix Framework


Max Bernstein 2 weeks ago

Partial static single information form

In compilers, static single information form (SSI) is a common extension to static single assignment form (SSA). It was introduced by C. Scott Ananian in 1999 in his MS thesis (PDF) 1 . SSI extends your existing SSA intermediate representation by discovering facts from your existing program and reifying them as path-dependent/flow-sensitive IR nodes. That might sound complicated, but at least the basic idea is pretty natural. I talk a little bit about it in What I talk about when I talk about IRs and I’ll rehash here in more depth, starting with some motivating examples. Consider this admittedly contrived example: We should be able to learn from the comparison that in some branches in the IR, is positive. In that region, we can add a new IR instruction that attaches that knowledge right in the instruction’s type field (yay, sparseness!) and then rewrite uses of to now use . Because we’ve done that, our (imaginary) optimization rule that gets rid of on known-positive integers can kick in, and we can delete the invocation of . Yay, optimization! But a couple of questions remain, at least for me: We’ll go through them, starting with the compiler pipeline. The original SSI paper starts with (I think?) SSA form and places some number of new refinement nodes based on conditionals. I have admittedly not tried very hard, but the into-SSI algorithms look complicated and kind of heavyweight. As a reward, you get “linear” into-SSI time complexity. But I am a humble compiler engineer, and I don’t have the time to go through and load all of this into my head. Instead what I have seen done and have been doing is to take a shortcut: build partial SSI during SSA construction 2 . Most of the time this is from bytecode, but it could also be from some other non-SSA IR. In any case, this is an excellent shortcut for two reasons: This is pretty compelling. We can learn from the bytecode with a very small amount of marginal new complexity. See my implementation in ZJIT , for example. 
All it really does is modify the abstract interpreter state when building SSA out of , , and bytecode instructions to take into account the new refined values. This is fine for branches that are already in the user’s source program but sometimes optimization, especially of dynamic languages, adds new branches that were not there before. And sometimes these branches get added much later, long after SSA construction. What then? Can we do something similar and rely on existing infrastructure? Implicit in this “can we do it” is the assumption that your IR tracks data dependencies from use to corresponding def, but not from def to uses. Sea of Nodes (at least the Simple implementation), is an IR that tracks both directions all the time for easier rewriting. Many IRs do not do this, so we will continue assuming that there’s no “easy way out”. JIT optimization of dynamic language compilers often adds synthetic instructions to the IR that enforce pre-conditions. These guards allow optimizing happy/fast path cases in JIT code while leaving the interpreter as a fallback. For example, we might be able to optimize two back-to-back instructions (a very dynamic operation in the world of ideas, but fast when concretely implemented using object shapes) from: which is very generic and involves calling into C code that might raise an exception, to something more like: which is much faster (assuming shape stability at run-time). There’s an irritating problem, though, which is that we have a bunch of duplicate instructions littered around the IR now because our optimizer worked on each instruction individually. Kind of a “template optimizer” situation. Now we need some pass to clean up the detritus. Global value numbering (GVN) will do a good job of de-duplicating instructions. It should notice that we already have an instruction that looks like called and rewrite into . That’s great because we have de-duplicated the guard. 
GVN may not get everything, though; if some instructions later use , they will not get rewritten to instead use the output of these new guard instructions. To do that, we need to add some kind of pass or augment GVN with some canonicalization feature. That canonicalization would handle rewriting operands to use the “latest version” of some value, so to speak. See the canonicalization section of Chris Fallin’s excellent aegraphs blog post for more (and of course the (currently block-local) implementation in ZJIT ). Where I’m going with all of this, though, is that you may already have some dominance-based instruction rewriting mechanism in your compiler, either as part of GVN or separately! And you can use this to do a very low code into-partial-SSI in the middle of your optimizer. This means you could very well get away with inserting instructions in successor blocks of conditionals and get the into-SSI “for free”. That’s up to you. There’s a trade-off between compile-time and run-time, especially in JITs. Inserting more instructions and rewriting more times may slow down your compiler. It’s a cheap lunch, not a free one. I don’t know. I don’t have a good grasp of how this “partial SSI” compares to the “full SSI”. I don’t plan on implementing full SSI in the near future. I will note that this partial SSI approach doesn’t do two things: I can’t tell what impact this has. Like Simple, TruffleRuby is built on a Sea of Nodes IR (Graal). Chris Seaton has an excellent blog post about TruffleRuby’s use of “stamp nodes” (“Pi nodes” 3 ). The function does a lot of heavy lifting, I think because Graal tracks uses. Cinder mostly inserts instructions in the HIR builder, before into-SSA, and then lets the SSA construction take care of things. That’s where I learned this trick, actually. Here is one example of refining the type of the matched operand when building IR for pattern matching. Luau is working on something like this, but for their type checker. 
Chatting with someone on their team is actually part of the reason I got motivated to write this post. Android ART looks like it has HBoundType and inserts them in reference type propagation . This handles class checks, null checks, and instanceof checks. Last, I want to talk a little bit about some interesting reasoning you can do when you have two implementations of something that you can switch between. For example, JIT (+ interpreter), or aliasing and non-aliasing cases in C code, or the weirdo NULL-UB reasoning LLVM can do to C code, things like that. In ZJIT, we currently insert s opportunistically in “easy” cases when building our HIR from the interpreter bytecode. For example, if in the bytecode there is a branch that compares some value with , it will have two outgoing control-flow edges: one block where is definitely , and one block where is definitely not . In each of these control-flow edges, we can insert corresponding type refinement hints. That’s pretty standard. But we can also do weirder stuff. CRuby has a notion of heap objects vs immediate objects. Many (most?) objects are heap objects. However, integer , for example is not allocated on the heap but instead represented by a tagged bit pattern that pretends to be an address: the whole value is encoded in the pointer itself. We encode this knowledge in the HIR’s type system: “heapness” and “immediateness” each get a bit in the type lattice . We use this in the optimizer to reason about effects , among other things. We can’t know a lot of the time what type a thing is, so we pessimistically type most objects flowing through bytecode as . This type encapsulates the entire world of possible values that could go on the stack or in a local variable. On most heap objects, with only a few exceptions, you can write instance variables (fields, attributes, whatever you want to call them). You can never write an instance variable to an immediate. 
This means that if we observe the following pattern in the bytecode: Then after building and emitting HIR for the opcode, we can upgrade the type of from a to a . We can do this because if it weren’t a heap-allocated object, we would have left the compiled code and entered the interpreter. This is another SSI-type thing you can do in your compiler.

Uhh I guess the conclusion is that you don’t have to do full SSI and partial SSI is available and not too scary? Does your compiler do this? Reader, please write in.

…and optimized in 2002 (PDF), revisited in 2009 (PDF), implemented in LLVM in 2010 (PDF), investigated in 2017 for abstract compilation (PDF), and probably more. The 2009 paper by Boissinot, Brisk, Darte, and Rastello even shows that both Ananian and Singer’s papers have bugs, while perhaps unintentionally also making an excellent pun about the literature being “sparse”. ↩

This blog post is different from what the LLVM paper (PDF) calls partial SSI. Partial for different reasons. Maybe it’s not even single information anymore. ↩

Today I learned that this terminology comes from the ABCD paper (PDF). ↩

Where/when in the compiler pipeline do we insert and remove these type refinements?
Do we need to refine after every conditional?
Do we need to implement the whole into-SSI and out-of-SSI algorithms from all the complicated-looking papers?

It lets me cleanly separate adding the type refinements (pretty straightforward) from the hard part of doing all of the operand rewriting and phi placement and marking and all manner of other nonsense.
In addition to separating the concerns, the hard part is already done by SSA construction. We can actually just skip it! SSA construction handles phi placement, operand rewriting, all of it. It probably fits neatly into a naive or a Braun-style (PDF) construction.
It doesn’t split variables with a new sigma node, and it generally inserts the refine node within the target block rather than above the branch
(For only) It doesn’t insert new phi nodes; it just leaves both IR nodes available and, instead of re-merging, drops them
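To make the "build partial SSI during SSA construction" shortcut from this post a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Builder` class, the `IsPositive`, `RefineType`, and `Abs` opcodes, and the dictionary that stands in for the abstract interpreter's local-variable state are invented here and are not ZJIT's or Cinder's actual API. The only point is that the refinement is a single extra instruction plus one update to the state handed to the taken successor.

```python
from itertools import count


class Builder:
    """Toy SSA builder; block names and opcodes are hypothetical."""

    def __init__(self):
        self.ids = count()
        self.blocks = {}  # block name -> list of (dest, op, operands)

    def emit(self, block, op, *operands):
        dest = f"v{next(self.ids)}"
        self.blocks.setdefault(block, []).append((dest, op, operands))
        return dest

    def branch_if_positive(self, block, env, var, then_block, else_block):
        """Lower `if var > 0` and return the local state for each successor."""
        cond = self.emit(block, "IsPositive", env[var])
        self.emit(block, "Branch", cond, then_block, else_block)
        then_env = dict(env)
        # RefineType is a made-up instruction: same run-time value, narrower type.
        then_env[var] = self.emit(then_block, "RefineType", env[var], "Positive")
        # The fall-through side learns nothing new in this sketch.
        return then_env, dict(env)


b = Builder()
env = {"x": b.emit("entry", "Param", "x")}
then_env, else_env = b.branch_if_positive("entry", env, "x", "then", "else")
# A later read of `x` in the then-block picks up the refined value with no
# separate rewriting pass: the builder's state already points at it.
result = b.emit("then", "Abs", then_env["x"])
```

Because later reads go through the environment, the operand rewriting falls out of ordinary SSA construction, which is the property the post relies on.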

Backend


Max Bernstein 1 month ago

Value numbering

Welcome back to compiler land. Today we’re going to talk about value numbering , which is like SSA, but more. Static single assignment (SSA) gives names to values: every expression has a name, and each name corresponds to exactly one expression. It transforms programs like this: where the variable is assigned more than once in the program text, into programs like this: where each assignment to has been replaced with an assignment to a new fresh name. It’s great because it makes clear the differences between the two expressions. Though they textually look similar, they compute different values. The first computes 1 and the second computes 2. In this example, it is not possible to substitute in a variable and re-use the value of , because the s are different. But what if we see two “textually” identical instructions in SSA? That sounds much more promising than non-SSA because the transformation into SSA form has removed (much of) the statefulness of it all. When can we re-use the result? Identifying instructions that are known at compile-time to always produce the same value at run-time is called value numbering . To understand value numbering, let’s extend the above IR snippet with two more instructions, v3 and v4. In this new snippet, v3 looks the same as v1: adding v0 and 1. Assuming our addition operation is some ideal mathematical addition, we can absolutely re-use v1; no need to compute the addition again. We can rewrite the IR to something like: This is kind of similar to the destructive union-find representation that JavaScriptCore and a couple other compilers use, where the optimizer doesn’t eagerly re-write all uses but instead leaves a little breadcrumb / instruction 1 . We could then run our copy propagation pass (“union-find cleanup”?) and get: Great. But how does this happen? How does an optimizer identify reusable instruction candidates that are “textually identical”? Generally, there is no actual text in the IR . 
One popular solution is to compute a hash of each instruction. Then any instructions with the same hash (that also compare equal, in case of collisions) are considered equivalent. This is called hash-consing . When trying to figure all this out, I read through a couple of different implementations. I particularly like the Maxine VM implementation. For example, here is the (hashing) and functions for most binary operations, slightly modified for clarity: The rest of the value numbering implementation assumes that if a function returns 0, it does not wish to be considered for value numbering. Why might an instruction opt-out of value numbering? An instruction might opt out of value numbering if it is not “pure”. Some instructions are not pure. Purity is in the eye of the beholder, but in general it means that an instruction does not interact with the state of the outside world, except for trivial computation on its operands. (What does it mean to de-duplicate/cache/reuse ?) A load from an array object is also not a pure operation 2 . The load operation implicitly relies on the state of the memory. Also, even if the array was known-constant, in some runtime systems, the load might raise an exception. Changing the source location where an exception is raised is generally frowned upon. Languages such as Java often have requirements about where exceptions are raised codified in their specifications. We’ll work only on pure operations for now, but we’ll come back to this later. We do often want to optimize impure operations as well! We’ll start off with the simplest form of value numbering, which operates only on linear sequences of instructions, like basic blocks or traces. Let’s build a small implementation of local value numbering (LVN). We’ll start with straight-line code—no branches or anything tricky. Most compiler optimizations on control-flow graphs (CFGs) iterate over the instructions “top to bottom” 3 and it seems like we can do the same thing here too. 
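As a rough sketch of the local value numbering pass this section is building toward, the following Python uses a hash map keyed on opcode and operands plus a union-find style forwarding pointer. The `Instr` class, the `PURE_OPS` set, and the `param`/`add` opcodes are assumptions made up for this example; they are not Maxine's code or the Toy Optimizer's exact classes. Instructions whose key is `None` opt out, which plays the role of the "returns 0" convention mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

PURE_OPS = {"add", "mul"}  # assumption: which toy opcodes count as pure


@dataclass(eq=False)  # identity-based equality, so instructions can be dict keys
class Instr:
    op: str
    args: tuple = ()
    forwarded: Optional["Instr"] = None  # union-find parent pointer

    def find(self):
        instr = self
        while instr.forwarded is not None:
            instr = instr.forwarded
        return instr

    def make_equal_to(self, other):
        self.find().forwarded = other


def value_key(instr):
    """Return a hashable key, or None if the instruction opts out."""
    if instr.op not in PURE_OPS:
        return None
    # Look through forwarding pointers so operands compare by representative.
    operands = tuple(a.find() if isinstance(a, Instr) else a for a in instr.args)
    return (instr.op, operands)


def local_value_numbering(block):
    seen = {}
    for instr in block:  # top to bottom
        key = value_key(instr)
        if key is None:
            continue
        earlier = seen.get(key)
        if earlier is not None:
            instr.make_equal_to(earlier)  # leave a breadcrumb, no eager rewrite
        else:
            seen[key] = instr


v0 = Instr("param")
v1 = Instr("add", (v0, 1))
v2 = Instr("add", (v1, 1))
v3 = Instr("add", (v0, 1))  # same computation as v1
v4 = Instr("add", (v3, 1))  # same as v2 once v3 is forwarded to v1
local_value_numbering([v0, v1, v2, v3, v4])
```

Looking through `find()` when building the key is what lets `v4` match `v2` even though its operand is written as `v3`.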
From what we’ve seen so far optimizing our made-up IR snippet, we can do something like this: The find-and-replace, remember, is not a literal find-and-replace, but instead something like: (if you have been following along with the toy optimizer series) This several-line function (as long as you already have a hash map and a union-find available to you) is enough to build local value numbering! And real compilers are built this way, too. If you don’t believe me, take a look at this slightly edited snippet from Maxine’s value numbering implementation. It has all of the components we just talked about: iterating over instructions, map lookup, and some substitution. This alone will get you pretty far. Code generators of all shapes tend to leave messy repeated computations all over their generated code and this will make short work of them. Sometimes, though, your computations are spread across control flow—over multiple basic blocks. What do you do then? Computing value numbers for an entire function is called global value numbering (GVN) and it requires dealing with control flow (if, loops, etc). I don’t just mean that for an entire function, we run local value numbering block-by-block. Global value numbering implies that expressions can be de-duplicated and shared across blocks. Let’s tackle control flow case by case. First is the simple case from above: one block. In this case, we can go top to bottom with our value numbering and do alright. The second case is also reasonable to handle: one block flowing into another. In this case, we can still go top to bottom. We just have to find a way to iterate over the blocks. If we’re not going to share value maps between blocks, the order doesn’t matter. But since the point of global value numbering is to share values, we have to iterate them in topological order (reverse post order (RPO)). This ensures that predecessors get visited before successors. If you have , we have to visit first and then . 
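For reference, one common way to get a reverse post-order is a single depth-first search that records each block after its successors and then reverses the list. This is a minimal sketch assuming the CFG is given as a dictionary from block name to successor names; the diamond graph below is a made-up example.

```python
def reverse_post_order(successors, entry):
    """Return the blocks reachable from entry in reverse post-order."""
    visited = set()
    post_order = []

    def visit(block):
        visited.add(block)
        for succ in successors.get(block, []):
            if succ not in visited:
                visit(succ)
        post_order.append(block)  # recorded only after all successors

    visit(entry)
    return post_order[::-1]


# Hypothetical diamond-shaped CFG: A branches to B and C, which both reach D.
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
order = reverse_post_order(CFG, "A")
```

The recursive form is fine for a sketch; a production compiler would more likely use an explicit stack to avoid deep recursion on large functions.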
Because of how SSA works and how CFGs work, the second block can “look up” into the first block and use the values from it. To get global value numbering working, we have to copy ’s value map before we start processing so we can re-use the instructions. Maybe something like: Then the expressions can accrue across blocks. can re-use the already-computed from because it is still in the map. …but this breaks as soon as you have control-flow splits. Consider the following shape graph: We’re going to iterate over that graph in one of two orders: A B C or A C B. In either case, we’re going to be adding all this stuff into the value map from one block (say, B) that is not actually available to its sibling block (say, C). When I say “not available”, I mean “would not have been computed before”. This is because we execute either A then B or A then C. There’s no world in which we execute B then C. But alright, look at a third case where there is such a world: a control-flow join. In this diagram, we have two predecessor blocks B and C each flowing into D. In this diagram, B always flows into D and also C always flows into D. So the iterator order is fine, right? Well, still no. We have the same sibling problem as before. B and C still can’t share value maps. We also have a weird question when we enter D: where did we come from? If we came from B, we can re-use expressions from B. If we came from C, we can re-use expressions from C. But we cannot in general know which predecessor block we came from. The only block we know for sure that we executed before D is A. This means we can re-use A’s value map in D because we can guarantee that all execution paths that enter D have previously gone through A. This relationship is called a dominator relationship and this is the key to one style of global value numbering that we’re going to talk about in this post. A block can always use the value map from any other block that dominates it. 
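The dominator rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: instructions are `(dest, op, args)` tuples, every op is treated as pure, and the immediate-dominator table and block order are supplied by hand rather than computed. The function name `gvn` and the block contents are invented for the example.

```python
def gvn(blocks, idom, rpo):
    """blocks: name -> [(dest, op, args)]; returns dest -> replacement name."""
    replacement = {}
    value_maps = {}
    for name in rpo:  # dominators are visited before the blocks they dominate
        parent = idom.get(name)
        # Start from a copy of the immediate dominator's map (empty at entry).
        table = dict(value_maps[parent]) if parent is not None else {}
        for dest, op, args in blocks[name]:
            # Canonicalize operands through replacements made so far.
            key = (op, tuple(replacement.get(a, a) for a in args))
            if key in table:
                replacement[dest] = table[key]
            else:
                table[key] = dest
        value_maps[name] = table
    return replacement


# Hypothetical diamond: A branches to B and C, which both flow into D.
BLOCKS = {
    "A": [("v1", "add", ("x", "y"))],
    "B": [("v2", "mul", ("x", "y"))],
    "C": [("v3", "mul", ("x", "y"))],
    "D": [("v4", "add", ("x", "y")), ("v5", "mul", ("x", "y"))],
}
IDOM = {"A": None, "B": "A", "C": "A", "D": "A"}
result = gvn(BLOCKS, IDOM, ["A", "B", "C", "D"])
```

In D, the `add` is replaced by A's `v1` because A dominates D, while the `mul` is left alone: it was computed in B and in C, but neither sibling dominates D, so neither value is guaranteed to exist on every path.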
For completeness’ sake, in the diamond diagram, A dominates each of B and C, too. We can compute dominators a couple of ways 4 , but that’s a little bit out of scope for this blog post. If we assume that we have dominator information available in our CFG, we can use that for global value numbering. And that’s just what, you guessed it, Maxine VM does. It iterates over all blocks in reverse post-order, doing local value numbering, threading through value maps from dominator blocks. In this case, their method gets the immediate dominator: the “closest” dominator block of all the blocks that dominate the current one. And that’s it! That’s the core of Maxine’s GVN implementation. I love how short it is. For not very much code, you can remove a lot of duplicate pure SSA instructions.

This does still work with loops, but with some caveats. From p7 of Briggs GVN:

“The φ-functions require special treatment. Before the compiler can analyze the φ-functions in a block, it must previously have assigned value numbers to all of the inputs. This is not possible in all cases; specifically, any φ-function input whose value flows along a back edge (with respect to the dominator tree) cannot have a value number. If any of the parameters of a φ-function have not been assigned a value number, then the compiler cannot analyze the φ-function, and it must assign a unique, new value number to the result.”

It also talks about eliminating useless phis, which is optional, but would strengthen the global value numbering pass: it makes more information transparent.

But what if we want to handle impure instructions? Languages such as Java allow for reading fields from the / object within methods as if the field were a variable name. This makes code like the following common: Each of these references to and is an implicit reference to or , which is semantically a field load off an object.
You can see it in the bytecode (thanks, Matt Godbolt): When straightforwardly building an SSA IR from the JVM bytecode for this method, you will end up with a bunch of IR that looks like this: Pretty much the same as the bytecode. Even though no code in the middle could modify the field (which would require a re-load), we still have a duplicate load. Bummer. I don’t want to re-hash this too much but it’s possible to fold Load and store forwarding into your GVN implementation by either: See, there’s nothing fundamentally stopping you from tracking the state of your heap at compile-time across blocks. You just have to do a little more bookkeeping. In our dominator-based GVN implementation, for example, you can: Not so bad. Maxine doesn’t do global memory tracking, but they do a limited form of load-store forwarding while building their HIR from bytecode: see GraphBuilder which uses the MemoryMap to help track this stuff. At least they would not have the same duplicate instructions in the example above! We’ve now looked at one kind of value numbering and one implementation of it. What else is out there? Apparently, you can get better results by having a unified hash table (p9 of Briggs GVN ) of expressions, not limiting the value map to dominator-available expressions. Not 100% on how this works yet. They note: Using a unified hash-table has one important algorithmic consequence. Replacements cannot be performed on-line because the table no longer reflects availability. Which is the first time that it occurred to me that hash-based value numbering with dominators was an approximation of available expression analysis. There’s also a totally different kind of value numbering called value partitioning (p12 of Briggs GVN ). See also a nice blog post about this by Allen Wang from the Cornell compiler course . I think this mostly replaces the hashing bit, and you still need some other thing for the available expressions bit. 
Ben Titzer and Seth Goldstein have some good slides from CMU, where they talk about the worklist dataflow approach. Apparently this is slower but gets you more available expressions than just looking to dominator blocks. I wonder how much it differs from dominator+unified hash table.

While Maxine uses hash table cloning to copy value maps from dominator blocks, there are also compilers such as Cranelift that use scoped hash maps to track this information more efficiently. (Though Amanieu notes that you may not need a scoped hash map and instead can tag values in your value map with the block they came from, ignoring non-dominating values with a quick check. The dominance check makes sense but I haven’t internalized how this affects the set of available expressions yet.)

You may be wondering if this kind of algorithm even helps at all in a dynamic language JIT context. Surely everything is too dynamic, right? Actually, no! The JIT hopes to eliminate a lot of method calls and dynamic behaviors, replacing them with guards, assumptions, and simpler operations. These strength reductions often leave behind a lot of repeated instructions. Just the other day, Kokubun filed a value-numbering-like PR to clean up some of the waste. ART has a recent blog post about speeding up GVN.

Go forth and give your values more numbers.

There’s been an ongoing discussion with Phil Zucker on SSI, GVN, acyclic egraphs, and scoped union-find. TODO summarize

Commutativity; canonicalization
Seeding alternative representations into the GVN
Aegraphs and union-find during GVN
https://github.com/bytecodealliance/rfcs/blob/main/accepted/cranelift-egraph.md
https://github.com/bytecodealliance/wasmtime/issues/9049
https://github.com/bytecodealliance/wasmtime/issues/4371

Writing this post is roughly the time when I realized that the whole time I was wondering why Cinder did not use union-find for rewriting, it actually did!
Optimizing instruction by replacing with followed by copy propagation is equivalent to union-find. ↩

In some forms of SSA, like heap-array SSA or sea of nodes, it’s possible to more easily de-duplicate loads because the memory representation has been folded into (modeled in) the IR. ↩

The order is a little more complicated than that: reverse post-order (RPO). And there’s a paper called “A Simple Algorithm for Global Data Flow Analysis Problems” that I don’t yet have a PDF for that claims that RPO is optimal for solving dataflow problems. ↩

There’s the iterative dataflow way (described in the Cooper paper (PDF)), Lengauer-Tarjan (PDF), the Engineered Algorithm (PDF), hybrid/Semi-NCA approach (PDF), … ↩

initialize a map from instruction numbers to instruction pointers
for each instruction
if wants to participate in value numbering
if ’s value number is already in the map, replace all pointers to in the rest of the program with the corresponding value from the map
otherwise, add to the map

doing load-store forwarding as part of local value numbering and clearing memory information from the value map at the end of each block, or
keeping track of effects across blocks

track heap write effects for each block
at the start of each block B, union all of the “kill” sets for every block back to its immediate dominator
finally, remove the stuff that got killed from the dominator’s value map

V8 Hydrogen

Backend

Tutorial

Java


Max Bernstein 2 months ago

Using Perfetto in ZJIT

Originally published on Rails At Scale . Look! A trace of slow events in a benchmark! Hover over the image to see it get bigger. Now read on to see what the slow events are and how we got this pretty picture. The first rule of just-in-time compilers is: you stay in JIT code. The second rule of JIT is: you STAY in JIT code! When control leaves the compiled code to run in the interpreter—what the ZJIT team calls either a “side-exit” or a “deopt”, depending on who you talk to—things slow down. In a well-tuned system, this should happen pretty rarely. Right now, because we’re still bringing up the compiler and runtime system, it happens more than we would like. We’re reducing the number of exits over time. We can track our side-exit reduction progress with , which, on process exit, prints out a tidy summary of the counters for all of the bad stuff we track. It’s got side-exits. It’s got calls to C code. It’s got calls to slow-path runtime helpers. It’s got everything. Here is a chopped-up sample of stats output for the Lobsters benchmark, which is a large Rails app: (I’ve cut out significant chunks of the stats output and replaced them with because it’s overwhelming the first time you see it.) The first thing you might note is that the thing I just described as terrible for performance is happening over twelve million times . The second thing you might notice is that despite this, we’re staying in JIT code seemingly a high percentage of the time. Or are we? Is 80% high? Is a 4.5% class guard miss ratio high? What about 11% for shapes? It’s hard to say. The counters are great because they’re quick and they’re reasonably stable proxies for performance. There’s no substitute for painstaking measurements on a quiet machine but if the counter for Bad Slow Thing goes down (and others do not go up), we’re probably doing a good job. But they’re not great for building intuition. For intuition, we want more tangible feeling numbers. We want to see things. 
The third thing is that you might ask yourself “self, where are these exits coming from?” Unfortunately, counters cannot tell you that. For that, we want stack traces. This lets us know where in the guest (Ruby) code triggers an exit. Ideally also we would want some notion of time: we would want to know not just where these events happen but also when. Are the exits happening early, at application boot? At warmup? Even during what should be steady state application time? Hard to say. So we need more tools.

Thankfully, Perfetto exists. Perfetto is a system for visualizing and analyzing traces and profiles that your application generates. It has both a web UI and a command-line UI. We can emit traces for Perfetto and visualize them there. Take a look at this sample ZJIT Perfetto trace generated by running Ruby with 1 . What do you see?

I see a couple arrows on the left. Arrows indicate “instant” point-in-time events. Then I see a mess of purple to the right of that until the end of the trace. Hover over an arrow. Find out that each arrow is a side-exit. Scream silently. But it’s a friendly arrow. It tells you what the side-exit reason is. If you click it, it even tells you the stack trace in the pop-up panel on the bottom. If we click a couple of them, maybe we can learn more. We can also zoom by mousing over the track, holding Ctrl, and scrolling. That will let us look closer. But there are so many…

Fortunately, Perfetto also provides a SQL interface to the traces. We can write a query to aggregate all of the side exit events from the table and line them up with the topmost method from the backtrace arguments in the table: This pulls up a query box at the bottom showing us that there are a couple big hotspots: It even has a helpful option to export the results as a Markdown table so I can paste (an edited version) into this blog post: Looks like we should figure out why we’re having shape misses so much and that will clear up a lot of exits.
(Hint: it’s because once we make our first guess about what we think the object shape will be, we don’t re-assess… yet .) This has been a taste of Perfetto. There’s probably a lot more to explore. Please join the ZJIT Zulip and let us know if you have any cool tracing or exploring tricks. Now I’ll explain how you too can use Perfetto from your system. Adding support to ZJIT was pretty straightforward. The first thing is that you’ll need some way to get trace data out of your system. We write to a file with a well-known location ( ), but you could do any number of things. Perhaps you can stream events over a socket to another process, or to a server that aggregates them, or store them internally and expose a webserver that serves them over the internet, or… anything, really. Once you have that, you need a couple lines of code to emit the data. Perfetto accepts a number of formats. For example, in his excellent blog post , Tristan Hume opens with such a simple snippet of code for logging Chromium Trace JSON-formatted events (lightly modified by me): This snippet is great. It shows, end-to-end, writing a stream of one event. It is a complete (X) event, as opposed to either: It was enough to get me started. Since it’s JSON, and we have a lot of side exits, the trace quickly ballooned to 8GB large for a several second benchmark. Not great. Now, part of this is our fault—we should side exit less—and part of it is just the verbosity of JSON. Thankfully, Perfetto ingests more compact binary formats, such as the Fuchsia trace format . In addition to being more compact, FXT even supports string interning. After modifying the tracer to emit FXT, we ended with closer to 100MB for the same benchmark. We can reduce further by sampling —not writing every exit to the trace, but instead every K exits (for some (probably prime) K). This is why we provide the option. Check out the trace writer implementation from the point this article was written. 
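To illustrate the two ideas above together (instant events in the Chromium Trace Event JSON format, and sampling only every K-th exit), here is a small Python sketch. It is not ZJIT's trace writer, which the post says emits FXT; the `SampledTraceWriter` class, the `side_exit` event name, and the `example_reason` argument are all invented for this example. The field names `name`, `ph`, `s`, `ts`, `pid`, `tid`, and `args` are the ones the Trace Event Format defines for instant events, with `ts` in microseconds.

```python
import io
import json


class SampledTraceWriter:
    """Writes every k-th event as a JSON instant event in one JSON array."""

    def __init__(self, out, sample_every=1):
        self.out = out
        self.sample_every = sample_every
        self.seen = 0
        self.written = 0
        out.write("[\n")

    def instant(self, name, ts_us, pid=1, tid=1, args=None):
        self.seen += 1
        if self.seen % self.sample_every != 0:
            return  # dropped by sampling
        event = {
            "name": name,
            "ph": "i",  # instant event: a point in time with no duration
            "s": "t",   # scope: thread
            "ts": ts_us,
            "pid": pid,
            "tid": tid,
            "args": args or {},
        }
        if self.written:
            self.out.write(",\n")
        self.out.write(json.dumps(event))
        self.written += 1

    def close(self):
        self.out.write("\n]\n")


buf = io.StringIO()
writer = SampledTraceWriter(buf, sample_every=3)
for i in range(10):
    writer.instant("side_exit", ts_us=i * 10, args={"reason": "example_reason"})
writer.close()
events = json.loads(buf.getvalue())
```

With ten events and `sample_every=3`, only the third, sixth, and ninth are written, which is the 1/K reduction the post describes. A real tracer would likely also record K in the trace so counts can be scaled back up.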
We could trace: Visualizations are awesome. Get your data in the right format so you can ask the right questions easily. Thanks for Perfetto! Also, looks like visualizations are now available in Perfetto canary. Time to go make some fun histograms… This is also sampled/strobed, so not every exit is in there. This is just 1/K of them for some K that I don’t remember. ↩ two discrete timestamped begin (B) and end (E) events that book-end something, or an instant (i) event that has no duration, or a couple other event types in the Chromium Trace Event Format doc When methods get compiled How big the generated code is How long each compile phase takes When (and where) invalidation events happen When (and where) allocations happen from JITed code Garbage collection events This is also sampled/strobed, so not every exit is in there. This is just 1/K of them for some K that I don’t remember. ↩
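To make the emitting side of this post concrete, here is a minimal Python sketch of a Chromium Trace JSON writer with 1/K sampling. It is not ZJIT's trace writer: the `TraceWriter` class, its method names, and the `reason` argument are made up for illustration. Only the event fields (`name`, `ph`, `ts`, `dur`, `pid`, `tid`, `s`, `args`) come from the Chromium Trace Event Format.

```python
import io
import json


class TraceWriter:
    """Writes Chromium Trace Event Format JSON, keeping every k-th event."""

    def __init__(self, out, sample_every=1):
        self.out = out
        self.sample_every = sample_every
        self.seen = 0      # events offered to the writer
        self.written = 0   # events actually emitted
        self.out.write("[\n")

    def instant(self, name, ts_us, args=None, pid=1, tid=1):
        # "ph": "i" is an instant event; "s": "t" scopes it to the thread.
        self._emit({"name": name, "ph": "i", "s": "t", "ts": ts_us,
                    "pid": pid, "tid": tid, "args": args or {}})

    def complete(self, name, ts_us, dur_us, pid=1, tid=1):
        # "ph": "X" is a complete event: a start timestamp plus a duration.
        self._emit({"name": name, "ph": "X", "ts": ts_us, "dur": dur_us,
                    "pid": pid, "tid": tid})

    def _emit(self, event):
        self.seen += 1
        if (self.seen - 1) % self.sample_every != 0:
            return  # sampled out
        if self.written:
            self.out.write(",\n")
        self.out.write(json.dumps(event))
        self.written += 1

    def close(self):
        self.out.write("\n]\n")


# Hypothetical usage: ten side exits, keeping one in three.
buf = io.StringIO()
writer = TraceWriter(buf, sample_every=3)
for i in range(10):
    writer.instant("side_exit", ts_us=i * 100, args={"reason": "example"})
writer.close()
events = json.loads(buf.getvalue())
```

The sampling here is a plain modulo counter, which is about the simplest thing that could work. A binary format such as FXT would replace `json.dumps` with a packed encoding but could keep the same sampling logic.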


Max Bernstein 3 months ago

Type-based alias analysis in the Toy Optimizer

Another entry in the Toy Optimizer series . Last time, we did load-store forwarding in the context of our Toy Optimizer. We managed to cache the results of both reads from and writes to the heap—at compile-time! We were careful to mind object aliasing: we separated our heap information into alias classes based on what offset the reads/writes referenced. This way, if we didn’t know if object and aliased, we could at least know that different offsets would never alias (assuming our objects don’t overlap and memory accesses are on word-sized slots). This is a coarse-grained heuristic. Fortunately, we often have much more information available at compile-time than just the offset, so we should use it. I mentioned in a footnote that we could use type information, for example, to improve our alias analysis. We’ll add a lightweight form of type-based alias analysis (TBAA) (PDF) in this post. We return once again to Fil Pizlo land, specifically How I implement SSA form . We’re going to be using the hierarchical heap effect representation from the post in our implementation, but you can use your own type representation if you have one already. This representation divides the heap into disjoint regions by type. Consider, for example, that objects and objects do not overlap. A pointer is never going to alias an pointer. They can therefore be reasoned about separately. But sometimes you don’t have perfect type information available. If you have in your language an base class of all objects, then the heap overlaps with, say, the heap. So you need some way to represent that too—just having an enum doesn’t work cleanly. Here is an example simplified type hierarchy: Where might represent different parts of the runtime’s data structures, and could be further segmented into , , etc. Fil’s idea is that we can represent each node in that hierarchy with a tuple of integers (inclusive, exclusive) that represent the pre- and post-order traversals of the tree. 
Or, if tree traversals are not engraved into your bones, they represent the range of all the nested objects within them. Then the “does this write interfere with this read” check—the aliasing check—is a range overlap query. Here’s a perhaps over-engineered Python implementation of the range and heap hierarchy based on the Ruby generator and C++ runtime code from JavaScriptCore: Where kicks off the tree-numbering scheme. Fil’s implementation also covers a bunch of abstract heaps such as SSAState and Control because his is used for code motion and whatnot. That can be added on later but we will not do so in this post. So there you have it: a type representation. Now we need to use it in our load-store forwarding. Recall that our load-store optimization pass looks like this: At its core, it iterates over the instructions, keeping a representation of the heap at compile-time. Reads get cached, writes get cached, and writes also invalidate the state of compile-time information about fields that may alias. In this case, our may alias asks only if the offsets overlap. This means that the following unit test will fail: This test is expecting the write to to still remain cached even though we wrote to the same offset in —because we have annotated as being an and as being a . If we account for type information in our alias analysis, we can get this test to pass. After doing a bunch of fussing around with the load-store forwarding (many rewrites), I eventually got it down to a very short diff: If we don’t have any type/alias information, we default to “I know nothing” ( ) for each object. Then we check range overlap. The boolean logic in looks a little weird, maybe. But we can also rewrite (via DeMorgan’s law) as: So, keeping all the cached field state about fields that are known by offset and by type not to alias. Maybe that is clearer (but not as nice a diff). Note that the type representation is not so important here! 
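For readers who want something runnable, here is a small sketch of the range numbering and the overlap query. It is an assumption-laden reconstruction, not the code from this post or from JavaScriptCore: the `HeapRange` and `AbstractHeap` names and the `Any`/`Object`/`Array`/`String`/`Other` hierarchy are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HeapRange:
    start: int  # inclusive
    end: int    # exclusive

    def is_empty(self):
        return self.start == self.end

    def overlaps(self, other):
        # Empty ranges overlap nothing; otherwise two compares suffice.
        if self.is_empty() or other.is_empty():
            return False
        return self.start < other.end and other.start < self.end


@dataclass
class AbstractHeap:
    name: str
    children: list = field(default_factory=list)
    heap_range: HeapRange = None

    def add(self, name):
        child = AbstractHeap(name)
        self.children.append(child)
        return child

    def compute(self, start=0):
        # Number the subtree: a leaf takes one slot, a parent spans its children.
        current = start
        if not self.children:
            current += 1
        for child in self.children:
            current = child.compute(current)
        self.heap_range = HeapRange(start, current)
        return current


# Hypothetical hierarchy: Any -> {Object -> {Array, String}, Other}
any_heap = AbstractHeap("Any")
obj = any_heap.add("Object")
array = obj.add("Array")
string = obj.add("String")
other = any_heap.add("Other")
any_heap.compute()
```

With this numbering, `Array` and `String` get adjacent one-slot ranges, `Object` spans both, and `Any` spans everything, so "may these alias?" becomes a question about whether two integer ranges intersect.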
You could use a bitset version of the type information if you want. The important things are that you can cheaply construct types and check overlap between them.

Nice, now our test passes! We can differentiate between memory accesses on objects of different types. But what if we knew more?

Sometimes we know where an object came from. For example, we may have seen it get allocated in the trace. If we saw an object’s allocation, we know that it does not alias (for example) any object that was passed in via a parameter. We can use this kind of information to our advantage. For example, in the following made up IR snippet: We know that (among other facts) doesn’t alias or because we have seen its allocation site. I saw this in the old V8 IR Hydrogen’s lightweight alias analysis 1 :

There is plenty of other useful information such as:

- If we know at compile-time that object A has 5 at offset 0 and object B has 7 at offset 0, then A and B don’t alias (thanks, CF)
  - In the RPython JIT in PyPy, this is used to determine if two user (Python) objects don’t alias because we know the contents of the user (Python) class field
- Object size (though perhaps that is a special case of the above bullet)
- Field size/type
- Deferring alias checks to run-time
  - Have a branch

If you have other fun ones, please write in.

We only handle loads and stores in our optimizer. Unfortunately, this means we may accidentally cache stale information. Consider: what happens if a function call (or any other opaque instruction) writes into an object we are tracking? The conservative approach is to invalidate all cached information on a function call. This is definitely correct, but it’s a bummer for the optimizer. Can we do anything?

Well, perhaps we are calling a well-known function or a specific IR instruction. In that case, we can annotate it with effects in the same abstract heap model: if the instruction does not write, or only writes to some heaps, we can at least only partially invalidate our heap. However, if the function is unknown or otherwise opaque, we need at least more advanced alias information and perhaps even (partial) escape analysis.

Consider: even if an instruction takes no operands, we have no idea what state it has access to. If it writes to any object A, we cannot safely cache information about any other object B unless we know for sure that A and B do not alias. And we don’t know what the instruction writes to. So we may only know we can cache information about B because it was allocated locally and has not escaped.

Some runtimes such as ART pre-compute all of their alias information in a bit matrix. This makes more sense if you are using alias information in a full control-flow graph, where you might need to iterate over the graph a few times. In a trace context, you can do a lot in one single pass, so there is no need to make a matrix.

As usual, this is a toy IR and a toy optimizer, so it’s hard to say how much faster it makes its toy programs. In general, though, there is a dial for analysis and optimization that goes between precision and speed. This is a happy point on that dial, only a tiny incremental analysis cost bump above offset-only invalidation, but for higher precision. I like that tradeoff.

Also, it is very useful in JIT compilers where generally the managed language is a little better-behaved than a C-like language. Somewhere in your IR there will be a lot of duplicate loads and stores from a strength reduction pass, and this can clean up the mess.

Thanks for joining as I work through a small use of type-based alias analysis for myself. I hope you enjoyed. Thank you to Chris Gregory for helpful feedback.

I made a fork of V8 to go spelunk around the Hydrogen IR. I reset the V8 repo to the last commit before they deleted it in favor of their new Sea of Nodes based IR called TurboFan. ↩
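The invalidation rule from this post (drop a cached field only if it may alias the store by offset and by type) can be sketched as follows. This is a standalone illustration, not the Toy Optimizer's actual diff: the cache layout, the function names, and the `ARRAY`/`STRING`/`ANY` ranges are all assumptions.

```python
def ranges_overlap(a, b):
    # Half-open (start, end) ranges overlap when their intersection is non-empty.
    return max(a[0], b[0]) < min(a[1], b[1])


# Hypothetical type ranges; ANY is the "I know nothing" default.
ARRAY = (0, 1)
STRING = (1, 2)
ANY = (0, 3)


def invalidate(cache, store_offset, store_type):
    """Return the cache entries that survive a store to store_offset.

    cache maps (object_name, offset) -> (type_range, cached_value).
    """
    survivors = {}
    for (obj, offset), (type_range, value) in cache.items():
        may_alias = (offset == store_offset
                     and ranges_overlap(type_range, store_type))
        if not may_alias:
            survivors[(obj, offset)] = (type_range, value)
    return survivors


def invalidate_demorgan(cache, store_offset, store_type):
    # Same filter, with the condition rewritten via De Morgan's law:
    # keep entries that differ in offset or whose types cannot overlap.
    return {
        key: (type_range, value)
        for key, (type_range, value) in cache.items()
        if key[1] != store_offset or not ranges_overlap(type_range, store_type)
    }


cache = {
    ("a", 0): (ARRAY, 5),
    ("b", 0): (STRING, 7),
    ("c", 8): (ANY, 9),
    ("d", 0): (ANY, 1),
}
# A store to offset 0 of some Array-typed object.
survivors = invalidate(cache, store_offset=0, store_type=ARRAY)
```

Here `b` survives because strings and arrays live in disjoint ranges, `c` survives because its offset differs, and `d` is dropped because an object of unknown type might be an array.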


Max Bernstein 4 months ago

A multi-entry CFG design conundrum

The ZJIT compiler compiles Ruby bytecode (YARV) to machine code. It starts by transforming the stack machine bytecode into a high-level graph-based intermediate representation called HIR. We use a more or less typical 1 control-flow graph (CFG) in HIR. We have a compilation unit, , which has multiple basic blocks, . Each block contains multiple instructions, . HIR is always in SSA form, and we use the variant of SSA with block parameters instead of phi nodes. Where it gets weird, though, is our handling of multiple entrypoints. See, YARV handles default positional parameters (but not default keyword parameters) by embedding the code to compute the defaults inside the callee bytecode. Then callers are responsible for figuring out what offset in the bytecode they should start running the callee, depending on the amount of arguments the caller provides. 2 In the following example, we have a function that takes two optional positional parameters and . If neither is provided, we start at offset . If just is provided, we start at offset . If both are provided, we can start at offset . (See the jump table debug output: ) Unlike in Python, where default arguments are evaluated at function creation time , Ruby computes the default values at function call time . For this reason, embedding the default code inside the callee makes a lot of sense; we have a full call frame already set up, so any exception handling machinery or profiling or … doesn’t need special treatment. Since the caller knows what arguments it is passing, and often to what function, we can efficiently support this in the JIT. We just need to know what offset in the compiled callee to call into. The interpreter can also call into the compiled function, which just has a stub to do dispatch to the appropriate entry block. This has led us to design the HIR to support multiple function entrypoints . 
Instead of having just a single entry block, as most control-flow graphs do, each of our functions now has an array of function entries: one for the interpreter, at least one for the JIT, and more for default parameter handling. Each of these entry blocks is separately callable from the outside world. Here is what the (slightly cleaned up) HIR looks like for the above example: If you’re not a fan of text HIR, here is an embedded clickable visualization of HIR thanks to our former intern Aiden porting Firefox’s Iongraph : (You might have to scroll sideways and down and zoom around. Or you can open it in its own window .) Each entry block also comes with block parameters which mirror the function’s parameters. These get passed in (roughly) the System V ABI registers. This is kind of gross. We have to handle these blocks specially in reverse post-order (RPO) graph traversal. And, recently, I ran into an even worse case when trying to implement the Cooper-style “engineered” dominator algorithm: if we walk backwards in block dominators, the walk is not guaranteed to converge. All non-entry blocks are dominated by all entry blocks, which are only dominated by themselves. There is no one “start block”. So what is there to do? Approach 1 is to keep everything as-is, but handle entry blocks specially in the dominator algorithm too. I’m not exactly sure what would be needed, but it seems possible. Most of the existing block infra could be left alone, but it’s not clear how much this would “spread” within the compiler. What else in the future might need to be handled specially? Approach 2 is to synthesize a super-entry block and make it a predecessor of every interpreter and JIT entry block. Inside this approach there are two ways to do it: one ( 2.a ) is to fake it and report some non-existent block. Another ( 2.b ) is to actually make a block and a new instruction that is a quasi-jump instruction. 
In this approach, we would either need to synthesize fake block arguments for the JIT entry block parameters or add some kind of new instruction that reads the argument i passed in. (suggested by Iain Ireland, as seen in the IBM COBOL compiler)

Approach 3 is to duplicate the entire CFG per entrypoint. This would return us to having one entry block per CFG at the expense of code duplication. It handles the problem pretty cleanly but then forces code duplication. I think I want the duplication to be opt-in instead of having it be the only way we support multiple entrypoints. What if it increases memory too much? The specialization probably would make the generated code faster, though. (suggested by Ben Titzer)

None of these approaches feel great to me. The probable candidate is 2.b where we have instructions. That gives us flexibility to also later add full specialization without forcing it.

Cameron Zwarich also notes that this is an analogue to the common problem people have when implementing the reverse: postdominators. This is because often functions have multiple return IR instructions. He notes the usual solution is to transform them into branches to a single return instruction.

Do you have this problem? What does your compiler do?

We use extended basic blocks (EBBs), but this doesn’t matter for this post. It makes dominators and predecessors slightly more complicated (now you have dominating instructions), but that’s about it as far as I can tell. We’ll see how they fare in the face of more complicated analysis later. ↩

Keyword parameters have some mix of caller/callee presence checks in the callee because they are passed in un-ordered. The caller handles simple constant defaults whereas the callee handles anything that may raise. Check out Kevin Newton’s awesome overview. ↩
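To see why a super-entry block (Approach 2) makes the dominator computation well-behaved, here is a rough Python sketch of the Cooper-Harvey-Kennedy iterative algorithm run on a multi-entry graph. It is not ZJIT code: the block names, the `super_entry` label, and the example graph are hypothetical, and a real implementation would work on block IDs rather than strings.

```python
def reverse_postorder(succs, root):
    seen, order = set(), []

    def visit(node):
        seen.add(node)
        for succ in succs.get(node, []):
            if succ not in seen:
                visit(succ)
        order.append(node)

    visit(root)
    order.reverse()
    return order


def immediate_dominators(succs, entries):
    # Approach 2: a synthetic super-entry precedes every real entry block.
    SUPER = "super_entry"
    graph = dict(succs)
    graph[SUPER] = list(entries)

    rpo = reverse_postorder(graph, SUPER)
    index = {node: i for i, node in enumerate(rpo)}
    preds = {node: [] for node in rpo}
    for node in rpo:
        for succ in graph.get(node, []):
            preds[succ].append(node)

    idom = {SUPER: SUPER}

    def intersect(a, b):
        # Walk both fingers up the dominator tree until they meet.
        while a != b:
            while index[a] > index[b]:
                a = idom[a]
            while index[b] > index[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for node in rpo[1:]:
            processed = [p for p in preds[node] if p in idom]
            new_idom = processed[0]
            for pred in processed[1:]:
                new_idom = intersect(pred, new_idom)
            if idom.get(node) != new_idom:
                idom[node] = new_idom
                changed = True
    return idom


# Hypothetical function with two optional parameters, as in the example above.
succs = {
    "interp_entry": ["compute_a", "compute_b", "body"],
    "jit_entry_0": ["compute_a"],
    "jit_entry_1": ["compute_b"],
    "jit_entry_2": ["body"],
    "compute_a": ["compute_b"],
    "compute_b": ["body"],
    "body": ["exit"],
}
entries = ["interp_entry", "jit_entry_0", "jit_entry_1", "jit_entry_2"]
idom = immediate_dominators(succs, entries)
```

Because every walk up the dominator tree now ends at the single synthetic root, `intersect` always terminates. In this example every entry block and every block reachable from more than one entry ends up with `super_entry` as its immediate dominator, while `exit` is dominated by `body` as usual.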


Max Bernstein 5 months ago

The GDB JIT interface

GDB is great for stepping through machine code to figure out what is going on. It uses debug information under the hood to present you with a tidy backtrace and also determine how much machine code to print when you type . This debug information comes from your compiler. Clang, GCC, rustc, etc. all produce debug data in a format called DWARF and then embed that debug information inside the binary (ELF, Mach-O, …) when you do or equivalent.

Unfortunately, this means that by default, GDB has no idea what is going on if you break in a JIT-compiled function. You can step instruction-by-instruction and whatnot, but that’s about it. This is because the current instruction pointer is nowhere to be found in any of the existing debug info tables from the host runtime code, so your terminal is filled with . See this example from the V8 docs:

Fortunately, there is a JIT interface to GDB. If you implement a couple of functions in your JIT and run them every time you finish compiling a function, you can get the debugging niceties for your JIT code too. See again a V8 example:

Unfortunately, the GDB docs are somewhat sparse. So I went spelunking through a bunch of different projects to try and understand what is going on.

GDB expects your runtime to expose a function called and a global variable called . GDB automatically adds its own internal breakpoints at this function, if it exists. Then, when you compile code, you call this function from your runtime. In slightly more detail:

1. Compile a function in your JIT compiler. This gives you a function name, maybe other metadata, an executable code address, and a code size
2. Generate an entire ELF/Mach-O/… object in-memory (!) for that one function, describing its name, code region, maybe other DWARF metadata such as line number maps
3. Write a linked list node that points at your object (“symfile”)
4. Link it into the linked list
5. Call , which gives GDB control of the process so it can pick up the new function’s metadata
6. Optionally, break into (or crash inside) one of your JITed functions
7. At some point, later, when your function gets GCed, unregister your code by editing the linked list and calling again

This is why you see compiler projects such as V8 including large swaths of code just to make object files:

- CoreCLR/.NET
- JavaScriptCore
- ART, which looks like it does something smart about grouping the JIT code entries together ( ), but I’m not sure exactly what it does
- TomatoDotNet
- a minimal example
- It looks like Dart used to have support for this but has since removed it

Because this is a huge hassle, GDB also has a newer interface that does not require making an ELF/Mach-O/…+DWARF object. This new interface requires writing a binary format of your choice. You make the writer and you make the reader. Then, when you are in GDB, you load your reader as a shared object. The reader must implement the interface specified by GDB: The function pointer does the bulk of the work and is responsible for matching code ranges to function names, line numbers, and more. Here are some details from Sanjoy Das.

Only a few runtimes implement this interface. Most of them stub out the and function pointers:

- yk write, yk read
- asmjit-utilities write, asmjit-utilities read
- Erlang/OTP write, Erlang/OTP read
- FEX write, FEX read
- buxn-jit write, buxn-jit read
- box64 write, box64 read

I think it also requires at least the reader to proclaim it is GPL via the macro .

Since I wrote about the perf map interface recently, I have it on my mind. Why can’t we reuse it in GDB? I suppose it would be possible to try and upstream a patch to GDB to support the Linux perf map interface for JITs. After all, why shouldn’t it be able to automatically pick up symbols from ? That would be great baseline debug info for “free”.

In the meantime, maybe it is reasonable to create a re-usable custom debug reader:

- When registering code, write the address and name to as you normally would
- Write the filename as the symfile (does this make the magic number?)
- Have the debug info reader just parse the perf map file

It would be less flexible than what both the DWARF and custom readers support: it would only be able to handle filename and code region. No embedding source code for GDB to display in your debugger. But maybe that is okay for a partial solution?

Update: Here is my small attempt at such a plugin.

V8 notes in their GDB JIT docs that because the JIT interface is a linked list and we only keep a pointer to the head, we get O(n²) behavior. Bummer. This becomes especially noticeable since they register additional code objects not just for functions, but also trampolines, cache stubs, etc.

Since GDB expects the code pointer in your symbol object file not to move, you have to make sure to have a stable symbol file pointer and stable executable code pointer. To make this happen, V8 disables its moving GC.

Additionally, if your compiled function gets collected, you have to make sure to unregister the function. Instead of doing this eagerly, ART treats the GDB JIT linked list as a weakref and periodically removes dead code entries from it.
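The "just parse the perf map file" idea is small enough to sketch. A real GDB custom debug reader has to be a native shared object, so the Python below only illustrates the parsing and address lookup such a reader would perform; the function names and the sample entries are made up, and it assumes code regions do not overlap.

```python
import bisect


def parse_perf_map(text):
    """Parse perf map lines of the form 'START SIZE symbolname'."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # START and SIZE are hex without 0x; the name is the rest of the line.
        start_hex, size_hex, name = line.split(" ", 2)
        entries.append((int(start_hex, 16), int(size_hex, 16), name))
    entries.sort()
    return entries


def symbolize(entries, address):
    """Return the name of the code region containing address, or None."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return None
    start, size, name = entries[i]
    return name if address < start + size else None


# Hypothetical perf map contents for two JIT-compiled functions.
sample = "7f0000001000 40 Object#foo\n7f0000001040 1a block in Integer#times\n"
entries = parse_perf_map(sample)
```

Splitting on at most two spaces keeps symbol names with spaces intact, which matters because the name is defined as the rest of the line.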


Max Bernstein 5 months ago

How to annotate JITed code for perf/samply

Brief one today. I got asked “does YJIT/ZJIT have support for [Linux] perf?” The answer is yes, and it also works with samply (including on macOS!), because both understand the perf map interface.

This is the entirety of the implementation in ZJIT 1 : Whenever you generate a function, append a one-line entry consisting of to . Per the Linux docs linked above, START and SIZE are hex numbers without 0x. symbolname is the rest of the line, so it could contain special characters.

You can now happily run or and have JIT frames be named in the output. We hide this behind the flag to avoid file I/O overhead when we don’t need it.

Perf map is the older way to interact with perf: a newer, more complicated way involves generating a “dump” file and then ing it.

We actually use , which I noticed today is wrong. leaves in the , and it shouldn’t; instead use . ↩
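For comparison, here is roughly what the writing side looks like in Python. This is a sketch, not ZJIT's implementation: the helper names are hypothetical, and the only facts relied on are the `/tmp/perf-<pid>.map` location and the `START SIZE symbolname` line format with hex numbers and no `0x` prefix.

```python
import os


def perf_map_path(pid=None):
    # perf looks for /tmp/perf-<pid>.map for the profiled process.
    return f"/tmp/perf-{os.getpid() if pid is None else pid}.map"


def format_perf_map_entry(start, size, name):
    # START and SIZE are hex without a 0x prefix; the name runs to end of line.
    return f"{start:x} {size:x} {name}\n"


def register_function(start, size, name, path=None):
    # Append so that earlier entries from this process are preserved.
    with open(path or perf_map_path(), "a") as f:
        f.write(format_perf_map_entry(start, size, name))
```

Note the `x` format spec: Python's `hex()` and the `#x` spec both add a `0x` prefix, which the perf map format does not want.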


Max Bernstein 6 months ago

A catalog of side effects

Optimizing compilers like to keep track of each IR instruction’s effects. An instruction’s effects vary wildly from having no effects at all, to writing a specific variable, to completely unknown (writing all state).

This post can be thought of as a continuation of What I talk about when I talk about IRs, specifically the section talking about asking the right questions. When we talk about effects, we should ask the right questions: not what opcode is this? but instead what effects does this opcode have?

Different compilers represent and track these effects differently. I’ve been thinking about how to represent these effects all year, so I have been doing some reading. In this post I will give some summaries of the landscape of approaches. Please feel free to suggest more.

Internal IR effect tracking is similar to the programming language notion of algebraic effects in type systems, but internally, compilers keep track of finer-grained effects. Effects such as “writes to a local variable”, “writes to a list”, or “reads from the stack” indicate what instructions can be re-ordered, duplicated, or removed entirely.

For example, consider the following pseudocode for some made-up language that stands in for a snippet of compiler IR: The goal of effects is to communicate to the compiler if, for example, these two IR instructions can be re-ordered. The second instruction might write to a location that the first one reads. But it also might not! This is about knowing if and alias, that is, if they are different names that refer to the same object.

We can sometimes answer that question directly, but often it’s cheaper to compute an approximate answer: could they even alias? It’s possible that and have different types, meaning that (as long as you have strict aliasing) the and operations that implement these reads and writes by definition touch different locations. And if they look at disjoint locations, there need not be any explicit order enforced.
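As a tiny illustration of the re-ordering question, here is a Python sketch that gives each instruction a read set and a write set and asks whether two instructions commute. The alias classes (`LIST_ITEM`, `TUPLE_ITEM`, `GLOBAL_VAR`) and the `Effects`/`can_reorder` names are invented for this example; no particular compiler is being quoted.

```python
from dataclasses import dataclass

# Hypothetical alias classes: one bit per disjoint kind of heap location.
EMPTY = 0
LIST_ITEM = 1 << 0
TUPLE_ITEM = 1 << 1
GLOBAL_VAR = 1 << 2
ANY = LIST_ITEM | TUPLE_ITEM | GLOBAL_VAR


@dataclass(frozen=True)
class Effects:
    reads: int = EMPTY
    writes: int = EMPTY


def can_reorder(a, b):
    # Two instructions commute when neither writes a location
    # that the other reads or writes.
    a_touches = a.reads | a.writes
    b_touches = b.reads | b.writes
    return (a.writes & b_touches) == 0 and (b.writes & a_touches) == 0


load_tuple = Effects(reads=TUPLE_ITEM)
store_list = Effects(writes=LIST_ITEM)
load_list = Effects(reads=LIST_ITEM)
unknown_call = Effects(reads=ANY, writes=ANY)  # maximally effectful
```

Over-approximating is always safe here: marking an instruction as `ANY` can only turn a "yes, re-order" into a "no", never the other way around.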
Different compilers keep track of this information differently. The null effect analysis gives up and says “every instruction is maximally effectful” and therefore “we can’t re-order or delete any instructions”. That’s probably fine for a first stab at a compiler, where you will get a big speed up purely based on strength reductions. Over-approximations of effects should always be valid. But at some point you start wanting to do dead code elimination (DCE), or common subexpression elimination (CSE), or loads/store elimination, or move instructions around, and you start wondering how to represent effects. That’s where I am right now. So here’s a catalog of different compilers I have looked at recently. There are two main ways I have seen to represent effects: bitsets and heap range lists. We’ll look at one example compiler for each, talk a bit about tradeoffs, then give a bunch of references to other major compilers. We’ll start with Cinder , a Python JIT, because that’s what I used to work on. Cinder tracks heap effects for its high-level IR (HIR) in instr_effects.h . Pretty much everything happens in the function, which is expected to know everything about what effects the given instruction might have. The data representation is a bitset representation of a lattice called an and that is defined in alias_class.h . Each bit in the bitset represents a distinct location in the heap: reads from and writes to each of these locations are guaranteed not to affect any of the other locations. Here is the X-macro that defines it: Note that each bit implicitly represents a set: does not refer to a specific list index, but the infinite set of all possible list indices. It’s any list index. Still, every list index is completely disjoint from, say, every entry in a global variable table. (And, to be clear, an object in a list might be the same as an object in a global variable table. The objects themselves can alias. 
But the thing being written to or read from, the thing being side effected, is the container.)

Like other bitset lattices, it’s possible to union the sets by or-ing the bits. It’s possible to query for overlap by and-ing the bits. If this sounds familiar, it’s because (as the repo notes) it’s a similar idea to Cinder’s type lattice representation.

Like other lattices, there is both a bottom element (no effects) and a top element (all possible effects): Union operations naturally hit a fixpoint at and intersection operations naturally hit a fixpoint at . All of this together lets the optimizer ask and answer questions such as:

Let’s take a look at an (imaginary) IR version of the code snippet in the intro and see what analyzing it might look like in the optimizer. Here is the fake IR: You can imagine that declares that it reads from the heap and declares that it writes to the heap. Because tuple and list pointers cannot be cast into one another and therefore cannot alias, these are disjoint heaps in our bitset. Therefore , therefore these memory operations can never interfere! They can (for example) be re-ordered arbitrarily.

In Cinder, these memory effects could in the future be used for instruction re-ordering, but they are today mostly used in two places: the refcount insertion pass and DCE. DCE involves first finding the set of instructions that need to be kept around because they are useful/important/have effects. So here is what the Cinder DCE looks like: There are some other checks in there but is right there at the core of it!

Now that we have seen the bitset representation of effects and an implementation in Cinder, let’s take a look at a different representation and an implementation in JavaScriptCore.

I keep coming back to How I implement SSA form by Fil Pizlo, one of the significant contributors to JavaScriptCore (JSC). In particular, I keep coming back to the Uniform Effect Representation section.
This notion of “abstract heaps” felt very… well, abstract. Somehow more abstract than the bitset representation. The pre-order and post-order integer pair as a way to represent nested heap effects just did not click.

It didn’t make any sense until I actually went spelunking in JavaScriptCore and found one of several implementations, because, you know, JSC is six compilers in a trenchcoat [citation needed]. DFG, B3, DOMJIT, and probably others all have their own abstract heap implementations. We’ll look at DOMJIT mostly because it’s a smaller example and also illustrates something else that’s interesting: builtins. We’ll come back to builtins in a minute.

Let’s take a look at how DOMJIT structures its abstract heaps: a YAML file. It’s a hierarchy. is a subheap of is a subheap of… and so on. A write to any is a write to is a write to … Sibling heaps are unrelated: and , for example, are disjoint.

To get a feel for this, I wired up a simplified version of ZJIT’s bitset generator (for types!) to read a YAML document and generate a bitset. It generated the following Rust code: It’s not a fancy X-macro, but it’s a short and flexible Ruby script.

Then I took the DOMJIT abstract heap generator (also, funnily enough, a short Ruby script), modified the output format slightly, and had it generate its int pairs: It already comes with a little diagram, which is super helpful for readability.

Any empty range(s) represent empty heap effects: if the start and end are the same number, there are no effects. There is no one value, but any empty range could be normalized to .

Maybe this was obvious to you, dear reader, but this pre-order/post-order thing is about nested ranges! Seeing the output of the generator laid out clearly like this made it make a lot more sense for me.

What about checking overlap? Here is the implementation in JSC: (See also How to check for overlapping intervals and Range overlap in two compares for more fun.)
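To convince yourself that the two representations answer the same question, here is a small Python sketch that walks one hierarchy and assigns every heap both a bitset and a half-open int range, then checks that the two overlap tests agree. It is not the ZJIT or DOMJIT generator (those are Ruby scripts), and the hierarchy and names are hypothetical.

```python
def number_heaps(tree, root="Any"):
    """Assign each heap in a nested dict both a bitset and a (start, end) range."""
    bits, ranges = {}, {}
    next_leaf = 0

    def walk(name, children):
        nonlocal next_leaf
        start = next_leaf
        if not children:
            next_leaf += 1  # each leaf gets one bit and one integer slot
        for child_name, grandchildren in children.items():
            walk(child_name, grandchildren)
        ranges[name] = (start, next_leaf)
        # A heap's bitset is the union of the leaf bits in [start, end).
        bits[name] = ((1 << (next_leaf - start)) - 1) << start

    walk(root, tree)
    return bits, ranges


def bits_overlap(a, b):
    return (a & b) != 0


def ranges_overlap(a, b):
    # Non-empty intersection of two half-open ranges; empty ranges never overlap.
    return max(a[0], b[0]) < min(a[1], b[1])


# Hypothetical hierarchy: Any -> {Object -> {Array, String}, Other}
HIERARCHY = {"Object": {"Array": {}, "String": {}}, "Other": {}}
bits, ranges = number_heaps(HIERARCHY)
agree = all(
    bits_overlap(bits[a], bits[b]) == ranges_overlap(ranges[a], ranges[b])
    for a in bits
    for b in bits
)
```

For a single node of the hierarchy the two encodings carry the same information. They start to differ once you want to describe an arbitrary set of heaps, which is where the tradeoffs discussed next come in.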
While bitsets are a dense representation (you have to hold every bit), they are very compact and they are very precise. You can hold any number of combinations of 64 or 128 bits in a single register. The union and intersection operations are very cheap. With int ranges, it’s a little more complicated. An imprecise union of and can take the maximal range that covers both and . To get a more precise union, you have to keep track of both. In the worst case, if you want efficient arbitrary queries, you need to store your int ranges in an interval tree. So what gives? I asked Fil if both bitsets and int ranges answer the same question, why use int ranges? He said that it’s more flexible long-term: bitsets get expensive as soon as you need over 128 bits (you might need to heap allocate them!) whereas ranges have no such ceiling. But doesn’t holding sequences of ranges require heap allocation? Well, despite Fil writing this in his SSA post: The purpose of the effect representation baked into the IR is to provide a precise always-available baseline for alias information that is super easy to work with. […] you can have instructions report that they read/write multiple heaps […] you can have a utility function that produces such lists on demand. It’s important to note that this doesn’t actually involve any allocation of lists. JSC does this very clever thing where they have “functors” that they pass in as arguments that compress/summarize what they want to out of an instruction’s effects. Let’s take a look at how the DFG (for example) uses these heap ranges in analysis. The DFG is structured in such a way that it can make use of the DOMJIT heap ranges directly, which is neat. Note that in the example below is a thin wrapper over the DFG compiler’s own equivalent: is the function that calls these functors ( or in this case) for each effect that the given IR instruction declares. I’ve pulled some relevant snippets of , which is quite long, that I think are interesting. 
First, some instructions (constants, here) have no effects. There’s some utility in the call but I didn’t understand fully. Then there are some instructions that conditionally have effects depending on the use types of their operands. 1 Taking the absolute value of an Int32 or a Double is effect-free but otherwise looks like it can run arbitrary code. Some run-time IR guards that might cause side exits are annotated as such—they write to the heap. Local variable instructions read specific heaps indexed by what looks like the local index but I’m not sure. This means accessing two different locals won’t alias! Instructions that allocate can’t be re-ordered, it looks like; they both read and write the . This probably limits the amount of allocation sinking that can be done. Then there’s , which is the builtins stuff I was talking about. We’ll come back to that after the code block. (Remember that these operations are very similar to DOMJIT’s with a couple more details—and in some cases even contain DOMJIT s!) This node is the way for the DOM APIs in the browser—a significant chunk of the builtins, which are written in C++—to communicate what they do to the optimizing compiler. Without any annotations, the JIT has to assume that a call into C++ could do anything to the JIT state. Bummer! But because, for example, annotates what memory it reads from and what it doesn’t write to, the JIT can optimize around it better—or even remove the access completely. It means the JIT can reason about calls to known builtins the same way that it reasons about normal JIT opcodes. (Incidentally it looks like it doesn’t even make a C call, but instead is inlined as a little memory read snippet using a JIT builder API. Neat.) Last, we’ll look at Simple, which has a slightly different take on all of this. Simple is Cliff Click’s pet Sea of Nodes (SoN) project to try and showcase the idea to the world—outside of a HotSpot C2 context. 
This one is a little harder for me to understand but it looks like each translation unit has a that doles out different classes of memory nodes for each alias class. Each IR node then takes data dependencies on whatever effect nodes it might use. Alias classes are split up based on the paper Type-Based Alias Analysis (PDF): “Our approach is a form of TBAA similar to the ‘FieldTypeDecl’ algorithm described in the paper.” The Simple project is structured into sequential implementation stages and alias classes come into the picture in Chapter 10. Because I spent a while spelunking through other implementations to see how other projects did this, here is a list of the projects I looked at. Mostly, they use bitsets. HHVM, a JIT for the Hack language, also uses a bitset for its memory effects. See for example: alias-class.h and memory-effects.h. HHVM has a couple places that use this information, such as a definition-sinking pass, alias analysis, DCE, store elimination, refcount opts, and more. If you are wondering why the HHVM representation looks similar to the Cinder representation, it’s because some former HHVM engineers such as Brett Simmers also worked on Cinder! (Note that I am linking an ART fork on GitHub as a reference, but the upstream code is hosted on googlesource.) Android’s ART Java runtime also uses a bitset for its effect representation. It’s a very compact class called in nodes.h. The side effects are used in loop-invariant code motion, global value numbering, write barrier elimination, scheduling, and more. CoreCLR mostly uses a bitset for its class. This one is interesting though because it also splits out effects specifically to include sets of local variables ( ). V8 is also about six completely different compilers in a trenchcoat. Turboshaft uses a struct in operations.h called which is two bitsets for reads/writes of effects. This is used in value numbering as well as a bunch of other small optimization passes they call “reducers”.
Maglev also has this thing called in their IR nodes that also looks like a bitset and is used in their various reducers. It has effect query methods on it such as and . Until recently, V8 also used Sea of Nodes as its IR representation, which also tracks side effects more explicitly in the structure of the IR itself. Guile Scheme looks like it has a custom tagging scheme type thing. Both bitsets and int ranges are perfectly cromulent ways of representing heap effects for your IR. The Sea of Nodes approach is also probably okay since it powers HotSpot C2 and (for a time) V8. Remember to ask the right questions of your IR when doing analysis. Thank you to Fil Pizlo for writing his initial GitHub Gist and sending me on this journey and thank you to Chris Gregory, Brett Simmers, and Ufuk Kayserilioglu for feedback on making some of the explanations more helpful. This is because the DFG compiler does this interesting thing where they track and guard the input types on use vs having types attached to the input’s own def. It might be a clean way to handle shapes inside the type system while also allowing the type+shape of an object to change over time (which it can do in many dynamic language runtimes). ↩ where might this instruction write? (because CPython is reference counted and incref implies ownership) where does this instruction borrow its input from? do these two instructions’ write destinations overlap?
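As a rough illustration of the bitset versus int range trade-off discussed in this post, here is a minimal Ruby sketch of both representations. Everything in it is hypothetical: the heap names, the `HeapRange` struct, and the helper methods are made up for the example and are not taken from JSC or any of the projects listed above.

```ruby
# Bitset representation: one bit per abstract heap (names are made up).
HEAP_LOCALS = 1 << 0
HEAP_FIELDS = 1 << 1
HEAP_ARRAYS = 1 << 2

# Two effect sets may alias if they share any bit.
def bitset_overlap?(a, b)
  (a & b) != 0
end

# Union is exact: the result holds precisely the heaps in either set.
def bitset_union(a, b)
  a | b
end

# Int range representation: a half-open range [from, to) of heap numbers.
HeapRange = Struct.new(:from, :to) do
  def overlaps?(other)
    from < other.to && other.from < to
  end

  # Imprecise union: the smallest single range covering both inputs.
  # Any heaps numbered between the two inputs get swept in as well.
  def hull(other)
    HeapRange.new([from, other.from].min, [to, other.to].max)
  end
end
```

The bitset union stays exact but caps the number of heaps at the word size, while the range hull has no ceiling but can over-approximate, which is the imprecision described above.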



Max Bernstein 8 months ago

Walking around the compiler

Walking around outside is good for you. [ citation needed ] A nice amble through the trees can quiet inner turbulence and make complex engineering problems disappear. Vicki Boykis wrote a post, Walking around the app , about a more proverbial stroll. In it, she talks about constantly using your production application’s interface to make sure the whole thing is cohesively designed with few rough edges. She also talks about walking around other parts of the implementation of the application, fixing inconsistencies, complex machinery, and broken builds. Kind of like picking up someone else’s trash on your hike. That’s awesome and universally good advice for pretty much every software project. It got me thinking about how I walk around the compiler. There’s a certain class of software project that transforms data—compression libraries, compilers, search engines—for which there’s another layer of “walking around” you can do. You have the code, yes, but you also have non-trivial output . By non-trivial, I mean an output that scales along some quality axis instead of something semi-regular like a JSON response. For compression, it’s size. For compilers, it’s generated code. You probably already have some generated cases checked into your codebase as tests. That’s awesome. I think golden tests are fantastic for correctness and for people to help understand. But this isolated understanding may not scale to more complex examples. How does your compiler handle, for example, switch-case statements in loops? Does it do the jump threading you expect it to? Maybe you’re sitting there idly wondering while you eat a cookie, but maybe that thought would only have occurred to you while you were scrolling through the optimizer. Say you are CF Bolz-Tereick and you are paging through PyPy IR. 
You notice some IR that looks like: “Huh”, you say to yourself, “surely the optimizer can reason that running on the result of is redundant!” But some quirk in your optimizer means that it does not. Maybe it used to work, or maybe it never did. But this little stroll revealed a bug with a quick fix (adding a new peephole optimization function): Now, thankfully, your IR looks much better: and you can check this in as a tidy test case: Fun fact: this was my first exposure to the PyPy project. CF walked me through fixing this bug 1 live at ECOOP 2022! I had a great time. If checking (and, later, testing) your assumptions is tricky, this may be a sign that your library does not expose enough of its internal state to developers. This may present a usability impediment that prevents you from immediately checking your assumptions or suspicions. For an excellent source of inspiration, see Kate’s tweets about program internals . Even if it does provide a flag like to print to the console, maybe this is hard to run from a phone 2 or a friend’s computer. For that, you may want friendlier tools . The right kind of tool invites exploration. Matthew Godbolt built the first friendly compiler explorer tool I used, the Compiler Explorer (“Godbolt”). It allows inputting programs into your web browser in many different languages and immediately seeing the compiled result. It will even execute your programs, within reason. This is a powerful tool: This combination lowers the barrier to check things tremendously . Now, sometimes you want the reverse: a Compiler Explorer -like thing in your terminal or editor so you don’t have to break flow. I unfortunately have not found a comparable tool. In addition to the immediate effects of being able to spot-check certain inputs and outputs, continued use of these tools builds long-term intuition about the behavior of the compiler. It builds mechanical sympathy . 
I haven’t written a lot about mechanical sympathy other than my grad school statement of purpose (PDF) and a few brief internet posts, so I will leave you with that for now. Your compiler likely compiles some applications and you can likely get access to the IR for the functions in that application. Scroll through every function’s optimized IR. If there are too many, maybe the top N functions’ IRs. See what can be improved. Maybe you will see some unexpected patterns. Even if you don’t notice anything in May, that could shift by August because of compiler advancements or a cool paper that you read in the intervening months. One time I found a bizarre reference counting bug that was causing copy-on-write and potential memory issues by noticing that some objects that should have been marked “immortal” in the IR were actually being refcounted. The bug was not in the compiler, but far away in application setup code—and yet it was visible in the IR. My conclusion is similar to Vicki’s. Put some love into your tools. Your colleagues will notice. Your users will notice. It might even improve your mood. Thank you to CF for feedback on the post. The actual fix that checks for and rewrites to . ↩ Just make sure to log off and touch grass. ↩




Max Bernstein 9 months ago

Linear scan with lifetime holes

In my last post , I explained a bit about how to retrofit SSA onto the original linear scan algorithm. I went over all of the details for how to go from low-level IR to register assignments—liveness analysis, scheduling, building intervals, and the actual linear scan algorithm. Basically, we made it to 1997 linear scan, with small adaptations for allocating directly on SSA. This time, we’re going to retrofit lifetime holes . Lifetime holes come into play because a linearized sequence of instructions is not a great proxy for storing or using metadata about a program originally stored as a graph. According to Linear Scan Register Allocation on SSA Form (PDF, 2010): The lifetime interval of a virtual register must cover all parts where this register is needed, with lifetime holes in between. Lifetime holes occur because the control flow graph is reduced to a list of blocks before register allocation. If a register flows into an -block, but not into the corresponding -block, the lifetime interval has a hole for the -block. Lifetime holes come from Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. Figure 1, though not in SSA form, is a nice diagram for understanding how lifetime holes may occur. Unfortunately, the paper contains a rather sparse plaintext description of their algorithm that I did not understand how to apply to my concrete allocator. Thankfully, other papers continued this line of research in (at least) 2002, 2005, and 2010. We will piece snippets from those papers together to understand what’s going on. Let’s take a look at the sample IR snippet from Wimmer2010 to illustrate how lifetime holes form: Virtual register R12 is not used between position 28 and 34. For this reason, Wimmer’s interval building algorithm assigns it the interval . Note how the interval has two disjoint ranges with space in the middle. 
Our simplified interval building algorithm from last time gave us—in the same notation—the interval (well, in our modified snippet). This simplified interval only supports one range with no lifetime holes. Ideally we would be able to use the physical register assigned to R12 for another virtual register in this empty slot! For example, maybe R14 or R15, which have short lifetimes that completely fit into the hole. Another example is a control-flow diamond. In this example, B1 jumps to either B3 or B2, which then merge at B4. Virtual register R0 is defined in B1 and only used in one of the branches, B3. It’s also not used in B4—if it were used in B4, it would be live in both B2 and B3! Once we schedule it, the need for lifetime holes becomes more apparent: Since B2 gets scheduled (in this case, arbitrarily) before B3, there’s a gap where R0—which is completely unused in B2—would otherwise take up space in our simplified interval form. Let’s fix that by adding some lifetime holes. Even though we are adding some gaps between ranges, each interval still gets assigned one location for its entire life . It’s just that in the gaps, we get to put other smaller intervals, like lichen growing between bricks. To get lifetime holes, we have to modify our interval data structure a bit. Our interval currently only supports a single range: We can change this to support multiple ranges by changing just one character !!! Har har. Okay, so we now have an array of instead of just a single . But now we have to implement the methods differently. We’ll start with . The start state of an interval is an empty array of ranges: Because we’re iterating backwards through the blocks and backwards through instructions in each block, we’ll be starting with instruction 38 and working our way linearly backwards until 16. This means that we’ll see later uses before earlier uses, and uses before defs. In order to keep the array in sorted order, we need to add each new range to the front. 
This is O(n) in an array, so use a deque or linked list. (Alternatively, push to the end and then reverse them afterwards.) We keep the ranges in sorted order because it makes keeping them disjoint easier, as we’ll see in and . Let’s start with since it’s very similar to the previous version: has a couple more cases, but we’ll go through them step by step. First, a quick check that the range is the right way ‘round: Then we have a straightforward case: if we don’t have any ranges yet, add a brand new one: But if we do have ranges, this new range might be totally subsumed by the existing first range. This happens if a virtual register is live for the entirety of a block and also used inside that block. The uses that cause an don’t add any new information: Another case is that the new range has a partial overlap with the existing first range. This happens when we’re adding ranges for all of the live-out virtual registers; the range for the predecessor block (say ) will abut the range for the successor block (say ). We merge these ranges into one big range (say, ): The last case is the case that gives us lifetime holes and happens when the new range is already completely disjoint from the existing first range. That is also a straightforward case: put the new range in at the start of the list. This is all fine and good. I added this to the register allocator to test out the lifetime hole finding but kept the rest the same (changed the APIs slightly so the interval could pretend it was still one big range). The tests passed. Neat! I also verified that the lifetime holes were what we expected. This means our function works unmodified with the new implementation. That makes sense, given that we copied the implementation off of Wimmer2010, which can deal with lifetime holes. Now we would like to use this new information in the register allocator.
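The cases above can be sketched roughly as follows. This is a reconstruction under stated assumptions, not the post’s actual code: the names `LiveRange`, `Interval`, `add_range`, and `set_from` are hypothetical, ranges are half-open `[from, to)`, and the positions in the usage example are made up. It also assumes ranges arrive in the backwards order described above, so a new range always begins at or before the start of the current first range.

```ruby
# Half-open range [from, to) of instruction positions.
LiveRange = Struct.new(:from, :to)

class Interval
  attr_reader :ranges

  def initialize
    @ranges = [] # kept sorted by start position and pairwise disjoint
  end

  # Assumes callers walk blocks and instructions backwards, so each new
  # range begins at or before the start of the current first range.
  def add_range(from, to)
    raise ArgumentError, "range is backwards" if to < from

    first = @ranges.first
    if first.nil?
      # No ranges yet: add a brand new one.
      @ranges.unshift(LiveRange.new(from, to))
    elsif first.from <= from && to <= first.to
      # Totally subsumed by the first range: no new information.
    elsif to >= first.from
      # Overlaps or abuts the first range: merge into one big range.
      first.from = [first.from, from].min
      first.to = [first.to, to].max
    else
      # Completely disjoint: prepend, leaving a lifetime hole in between.
      @ranges.unshift(LiveRange.new(from, to))
    end
  end

  # A definition shortens the earliest range to begin at the def.
  # Assumes at least one range (from a use) has already been recorded.
  def set_from(from)
    @ranges.first.from = from
  end
end

interval = Interval.new
interval.add_range(34, 38) # later use, seen first
interval.add_range(36, 37) # subsumed, ignored
interval.add_range(30, 34) # abuts [34, 38), merged into [30, 38)
interval.add_range(20, 28) # disjoint, so [28, 30) becomes a hole
interval.set_from(22)      # definition at position 22
p interval.ranges.map(&:to_a) # => [[22, 28], [30, 38]]
```

Prepending with `unshift` is O(n) on a Ruby array; as noted above, a deque or a push-then-reverse would avoid that cost.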
It took a little bit of untangling, but the required modifications to support lifetime holes in the register assignment phase are not too invasive. To get an idea of the difference, I took the original Poletto1999 (PDF) algorithm and rewrote it in the style of the Mössenböck2002 (PDF) algorithm. For example, here is Poletto1999: And here it is again, reformatted a bit. The implicit and sets that don’t get names in Poletto1999 now get names. is inlined and gets a new name: Now we can pick out all of the bits of Mössenböck2002 that look like they are responsible for dealing with lifetime holes. For example, the algorithm now has a fourth set, . This set holds intervals that have holes that contain the current interval’s start position. These intervals are assigned registers that are potential candidates for the current interval to live (more on this in a sec). I say potential candidates because in order for them to be a home for the current interval, an inactive interval has to be completely disjoint from the current interval. If they overlap at all—in any of their ranges—then we would be trying to put two virtual registers into one physical register at the same program point. That’s a bad compile. We have to do a little extra bookkeeping in because now one physical register can be assigned to more than one interval that is still in the middle of being processed (active and inactive sets). If we choose to spill, we have to make sure that all conflicting uses of the register (intervals that overlap with the current interval) get reassigned locations. Note that this begins to depart from strictly linear (time) linear scan: the set is bounded not by the number of physical registers but instead by the number of virtual registers. Mössenböck2002 notes that the size of the set is generally very small, though, so “linear in practice”. EDIT: After re-reading Wimmer2010, I noticed that they say: […] introduced non-linear parts. 
Two of them are highlighted in Figure 6 where the set of inactive intervals is iterated. The set can contain an arbitrary number of intervals since it is not bound by the number of physical registers. Testing the current interval for intersection with all of them can therefore be expensive. When the lifetime intervals are created from code in SSA form, this test is not necessary anymore: All intervals in inactive start before the current interval, so they do not intersect with the current interval at their definition. They are inactive and thus have a lifetime hole at the current position, so they do not intersect with the current interval at its definition. SSA form therefore guarantees that they never intersect [7], making the entire loop that tests for intersection unnecessary. Unfortunately, splitting of intervals leads to intervals that no longer adhere to the SSA form properties because it destroys SSA form. Therefore, the intersection test cannot be omitted completely; it must be performed if the current interval has been split off from another interval. Which indicates to me that we may actually be able to leave off that loop over the inactive intervals after all? Unclear. I’ll have to come back and mess with this later. I left out the parts about register weights that are heuristics to improve register allocation. They are not core to supporting lifetime holes. You can add them back in if you like. Here is a text diff to make it clear what changed: This reformatting and diffing made it much easier for me to reason about what specifically had to be changed. There’s just one thing left after register assignment: resolution and SSA deconstruction. I’m pretty sure we can actually just keep the resolution the same. In our function, we are only making sure that the block arguments get parallel-moved into the block parameters. That hasn’t changed. 
Wimmer2010 says: Linear scan register allocation with splitting of lifetime intervals requires a resolution phase after the actual allocation. Because the control flow graph is reduced to a list of blocks, control flow is possible between blocks that are not adjacent in the list. When the location of an interval is different at the end of the predecessor and at the start of the successor, a move instruction must be inserted to resolve the conflict. That’s great news for us: we don’t do splitting. An interval, though it has lifetime holes, still only ever has one location for its entire life. So once an interval begins, we don’t need to think about moving its contents. So I was actually overly conservative in the previous post, which I have amended! Mössenböck2002 also tackles register constraints with this notion of “fixed intervals”—intervals that have been pre-allocated physical registers. Since I eventually want to use “register hinting” from Wimmer2005 and Wimmer2010, I’m going to ignore the fixed interval part of Mössenböck2002 for now. It seems like they work nicely together. We added lifetime holes to our register allocator without too much effort. This better maps the graph-like nature of the IR onto the linear sequence of instructions and should get us some better allocation for short-lived virtual registers. Maybe next time we will add interval splitting , which will help us a) address ABI constraints more cleanly in function calls and b) remove the dependence on reserving a scratch register.
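To make the disjointness requirement for inactive intervals concrete, here is a small sketch of an intersection test between two intervals that each hold several ranges. It assumes each interval is a sorted array of disjoint half-open `[from, to)` pairs; the function name `intervals_overlap?` and all positions are hypothetical.

```ruby
# Returns true if any range of a intersects any range of b.
# Both inputs are sorted arrays of disjoint half-open [from, to) pairs.
def intervals_overlap?(a, b)
  i = 0
  j = 0
  while i < a.length && j < b.length
    a_from, a_to = a[i]
    b_from, b_to = b[j]
    return true if a_from < b_to && b_from < a_to

    # Advance whichever range ends first; it cannot overlap anything later.
    if a_to <= b_to
      i += 1
    else
      j += 1
    end
  end
  false
end

holey = [[20, 28], [34, 38]]             # made-up interval with a hole at [28, 34)
p intervals_overlap?(holey, [[29, 33]])  # => false (fits entirely in the hole)
p intervals_overlap?(holey, [[26, 30]])  # => true  (collides with [20, 28))
```

Because both lists are sorted, the two-pointer walk visits each range at most once, so a single test is linear in the total number of ranges.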


Max Bernstein 9 months ago

Linear scan register allocation on SSA

Much of the code and education that resulted in this post happened with Aaron Patterson. The fundamental problem in register allocation is to take an IR that uses virtual registers (as many as you like) and rewrite it to use a finite number of physical registers and stack space 1 . This is an example of a code snippet using virtual registers: And here is the same example after it has been passed through a register allocator (note that Rs changed to Ps): Each virtual register was assigned a physical place: R1 to the stack, R2 to P0, R3 to P1, and R4 also to P0 (since we weren’t using R2 anymore). People use register allocators like they use garbage collectors: it’s an abstraction that can manage your resources for you, maybe with some cost. When writing the back-end of a compiler, it’s probably much easier to have a separate register-allocator-in-a-box than manually managing variable lifetimes while also considering all of your different target architectures. How do JIT compilers do register allocation? Well, “everyone knows” that “every JIT does its own variant of linear scan” 2 . This bothered me for some time because I’ve worked on a couple of JITs and still didn’t understand the backend bits. There are a couple different approaches to register allocation, but in this post we’ll focus on linear scan of SSA. I started reading Linear Scan Register Allocation on SSA Form (PDF, 2010) by Wimmer and Franz after writing A catalog of ways to generate SSA. Reading alone didn’t make a ton of sense; I ended up with a lot of very frustrated margin notes. I started trying to implement it alongside the paper. As it turns out, though, there is a rich history of papers in this area that it leans on really heavily. I needed to follow the chain of references! For example, here is a lovely explanation of the process, start to finish, from Christian Wimmer’s Master’s thesis (PDF, 2004). There it is, all laid out at once.
It’s very refreshing when compared to all of the compact research papers. I didn’t realize that there were more than one or two papers on linear scan. So this post will also incidentally serve as a bit of a survey or a history of linear scan—as best as I can figure it out, anyway. If you were in or near the room where it happened, please feel free to reach out and correct some parts. Throughout this post, we’ll use an example SSA code snippet from Wimmer2010, adapted from phi-SSA to block-argument-SSA. Wimmer2010’s code snippet is between the arrows and we add some filler (as alluded to in the paper): Virtual registers start with R and are defined either with an arrow or by a block parameter. Because it takes a moment to untangle the unfamiliar syntax and draw the control-flow graph by hand, I’ve also provided the same code in graphical form. Block names (and block parameters) are shaded with grey. We have one entry block, , that is implied in Wimmer2010. Its only job is to define and for the rest of the CFG. Then we have a loop between and with an implicit fallthrough. Instead of doing that, we instead generate a conditional branch with explicit jump targets. This makes it possible to re-order blocks as much as we like. The contents of are also just to fill in the blanks from Wimmer2010 and add some variable uses. Our goal for the post is to analyze this CFG, assign physical locations (registers or stack slots) to each virtual register, and then rewrite the code appropriately. For now, let’s rewind the clock and look at how linear scan came about. Linear scan register allocation (LSRA) has been around for awhile. It’s neat because it does the actual register assignment part of register allocation in one pass over your low-level IR. (We’ll talk more about what that means in a minute.) It first appeared in the literature in tcc: A System for Fast, Flexible, and High-level Dynamic Code Generation (PDF, 1997) by Poletto, Engler, and Kaashoek. 
(Until writing this post, I had never seen this paper. It was only on a re-read of the 1999 paper (below) that I noticed it.) In this paper, they mostly describe a staged variant of C called ‘C (TickC), for which a fast register allocator is quite useful. Then came a paper called Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. It adds some optimizations (lifetime holes, binpacking) to the algorithm presented in Poletto1997. Then came the first paper I read, and I think the paper everyone refers to when they talk about linear scan: Linear Scan Register Allocation (PDF, 1999) by Poletto and Sarkar. In this paper, they give a fast alternative to graph coloring register allocation, especially motivated by just-in-time compilers. In retrospect, it seems to be a bit of a rehash of the previous two papers. Linear scan (1997, 1999) operates on live ranges instead of virtual registers. A live range is a pair of integers [start, end) (end is exclusive) that begins when the register is defined and ends when it is last used. This means that there is an assumption that the order for instructions in your program has already been fixed into a single linear sequence! It also means that you have given each instruction a number that represents its position in that order. This may or may not be a surprising requirement depending on your compilers background. It was surprising to me because I often live in control flow graph fantasy land where blocks are unordered and instructions sometimes float around. But if you live in a land of basic blocks that are already in reverse post order, then it may be less surprising. In non-SSA-land, these live ranges are different from the virtual registers: they represent some kind of lifetimes of each version of a virtual register. For an example, consider the following code snippet: There are two definitions of and they each live for different amounts of time: In fact, the ranges are completely disjoint.
It wouldn’t make sense for the register allocator to consider variables, because there’s no reason the two s should necessarily live in the same physical register. In SSA land, it’s a little different: since each virtual register only has one definition (by, uh, definition), live ranges are an exact 1:1 mapping with virtual registers. We’ll focus on SSA for the remainder of the post because this is what I am currently interested in. The research community seems to have decided that allocating directly on SSA gives more information to the register allocator 3 . Linear scan starts at the point in your compiler process where you already know these live ranges, meaning that you have already done some kind of analysis to build a mapping. In this blog post, we’re going to back up to the point where we’ve just built our SSA low-level IR and have yet to do any work on it. We’ll do all of the analysis from scratch. Part of this analysis is called liveness analysis. The result of liveness analysis is a mapping of that tells you which virtual registers (remember, since we’re in SSA, instruction==vreg) are alive (used later) at the beginning of the basic block. This is called a live-in set. For example: We compute liveness by working backwards: a variable is live from the moment it is backwardly-first used until its definition. In this case, at the end of B2, nothing is live. If we step backwards to the , we see a use: R16 becomes live. If we step once more, we see its definition (so R16 is no longer live) but now we see a use of R14 and R15, which become live. This leaves us with R14 and R15 being live-in to B2. This live-in set becomes B1’s live-out set because B1 is B2’s predecessor. We continue in B1. We could continue backwards linearly through the blocks. In fact, I encourage you to do it as an exercise. You should have a (potentially empty) set of registers per basic block.
It gets more interesting, though, when we have branches: what does it mean when two blocks’ live-in results merge into their shared predecessor? If we have two blocks A and B that are successors of a block C, the live-in sets get unioned together. (Figure: control-flow graph in which block C branches to blocks A and B.) That is, if there were some register R0 live-in to B and some register R1 live-in to A, both R0 and R1 would be live-out of C. They may also be live-in to C, but that entirely depends on the contents of C. Since the total number of virtual registers is nonnegative and is finite for a given program, it seems like a good lattice for an abstract interpreter. That’s right, we’re doing AI. In this liveness analysis, we’ll: We store gen, kill, and live-in sets as bitsets, using some APIs conveniently available on Ruby’s Integer class. Like most abstract interpretations, it doesn’t matter what order we iterate over the collection of basic blocks for correctness, but it does matter for performance. In this case, iterating backwards ( ) converges much faster than forwards ( ): We could also use a worklist here, and it would be faster, but eh. Repeatedly iterating over all blocks is fine for now. The Wimmer2010 paper skips this liveness analysis entirely by assuming some computed information about your CFG: where loops start and end. It also requires all loop blocks be contiguous. Then it makes variables defined before a loop and used at any point inside the loop live for the whole loop. By having this information available, it folds the liveness analysis into the live range building, which we’ll instead do separately in a moment. The Wimmer approach sounded complicated and finicky. Maybe it is, maybe it isn’t. So I went with a dataflow liveness analysis instead. If it turns out to be the slow part, maybe it will matter enough to learn about this loop tagging method. For now, we will pick a schedule for the control-flow graph.
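Before moving on, here is a minimal sketch of the dataflow liveness analysis described above: per-block gen and kill sets, then repeated backwards passes until nothing changes. It is an illustration under assumptions, not the post’s code. The `Insn` and `Block` structs and the `analyze_liveness` name are hypothetical, virtual registers are plain integers used as bit indices, successors are referenced by name, and block arguments passed at a jump are assumed to appear in that jump’s `uses`.

```ruby
# Hypothetical IR: virtual registers are small integers. Each instruction
# optionally defines one register (dst) and uses some others.
Insn = Struct.new(:dst, :uses)
Block = Struct.new(:name, :params, :insns, :succs)

# blocks: array of Block in scheduled order. Returns name => live-in bitset.
def analyze_liveness(blocks)
  gen = {}
  kill = {}
  blocks.each do |block|
    g = 0
    k = 0
    block.params.each { |param| k |= 1 << param } # block params are defs
    block.insns.each do |insn|
      # A use is upward-exposed only if nothing earlier in the block defined it.
      insn.uses.each { |use| g |= 1 << use if k[use].zero? }
      k |= 1 << insn.dst if insn.dst
    end
    gen[block.name] = g
    kill[block.name] = k
  end

  live_in = Hash.new(0)
  changed = true
  while changed
    changed = false
    blocks.reverse_each do |block|
      # live-out is the union of the successors' live-in sets.
      live_out = block.succs.reduce(0) { |acc, succ| acc | live_in[succ] }
      new_in = gen[block.name] | (live_out & ~kill[block.name])
      if new_in != live_in[block.name]
        live_in[block.name] = new_in
        changed = true
      end
    end
  end
  live_in
end
```

`Integer#[]` reads a single bit and `|`, `&`, and `~` do set union, intersection, and complement, so an arbitrary-precision Ruby Integer serves as the bitset with no extra data structure.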
In order to build live ranges, you have to have some kind of numbering system for your instructions, otherwise a live range’s start and end are meaningless. We can write a function that fixes a particular block order (in this case, reverse post-order) and then assigns each block and instruction a number in a linear sequence. You can think of this as flattening or projecting the graph: A couple interesting things to note: Even though we have extra instructions, it looks very similar to the example in the Wimmer2010 paper. Since we’re not going to be messing with the order of the instructions within a block anymore, all we have to do going forward is make sure that we iterate through the blocks in . Finally, we have all that we need to compute live ranges. We’ll more or less copy the algorithm to compute live ranges from the Wimmer2010 paper. We’ll have two main differences: I know I said we were going to be computing live ranges. So why am I presenting you with a function called ? That’s because early in the history of linear scan (Traub1998!), people moved from having a single range for a particular virtual register to having multiple disjoint ranges. This collection of multiple ranges is called an interval and it exists to free up registers in the context of branches. For example, in our IR snippet (above), R12 is defined in B2 as a block parameter, used in B3, and then not used again until some indeterminate point in B4. (Our example uses it immediately in an add instruction to keep things short, but pretend the second use is some time away.) The Wimmer2010 paper creates a lifetime hole between 28 and 34, meaning that the interval for R12 (called i12) is . Interval holes are not strictly necessary; they exist to generate better code. So for this post, we’re going to start simple and assume 1 interval == 1 range. We may come back later and add additional ranges, but that will require some fixes to our later implementation.
We’ll note where we think those fixes should happen. Anyway, here is the mostly-copied annotated implementation of BuildIntervals from the Wimmer2010 paper: Another difference is that since we’re using block parameters, we don’t really have this thing. That’s just the block argument. The last difference is that since we’re skipping the loop liveness hack, we don’t modify a block’s set as we iterate through instructions. I know we said we’re building live ranges, so our class only has one on it. This is Ruby’s built-in range, but it’s really just being used as a tuple of integers here. Note that there’s some implicit behavior happening here: For example, if we have and someone calls , we end up with . There’s no gap in the middle. And if we have and someone calls , we end up with . After figuring out from scratch some of these assumptions about what the interval/range API should and should not do, Aaron and I realized that there was some actual code for in a different, earlier paper: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer. Unfortunately, many other versions of this PDF look absolutely horrible (like bad OCR) and I had to do some digging to find the version linked above. Finally we can start thinking about doing some actual register assignment. Let’s return to the 90s. Because we have faithfully kept 1 interval == 1 range, we can re-use the linear scan algorithm from Poletto1999 (which looks, at a glance, to be the same as 1997). I recommend looking at the PDF side by side with the code. We have tried to keep the structure very similar. Note that unlike in many programming languages these days, in the algorithm description represents a set , not a (hash-)map. In our Ruby code, we represent as an array: Internalizing this took us a bit. It is mostly a three-state machine: We would like to come back to this and incrementally modify it as we add lifetime holes to intervals. 
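For reference alongside the paper, here is a compact sketch of Poletto1999-style linear scan with one range per interval. It is a reading aid under assumptions rather than the post’s implementation: `LiveInterval`, `linear_scan`, the `P0`-style register names, and the `stack0`-style slot names are all hypothetical, and ranges are treated as half-open `[from, to)`, so an interval that ends exactly where another starts is expired first.

```ruby
# One half-open range [from, to) per interval, as in the simplified allocator.
LiveInterval = Struct.new(:name, :from, :to)

# Returns a hash mapping interval name => "P<n>" register or "stack<n>" slot.
def linear_scan(intervals, num_registers)
  free = (0...num_registers).map { |n| "P#{n}" }
  active = []   # intervals currently holding a register, sorted by end
  location = {}
  next_slot = 0

  intervals.sort_by(&:from).each do |cur|
    # ExpireOldIntervals: free registers of intervals that ended already.
    while (oldest = active.first) && oldest.to <= cur.from
      active.shift
      free.push(location[oldest.name])
    end

    if free.empty?
      # SpillAtInterval: spill whichever of cur and the furthest-ending
      # active interval ends later.
      victim = active.last
      if victim && victim.to > cur.to
        location[cur.name] = location[victim.name]
        location[victim.name] = "stack#{next_slot}"
        active.pop
        active << cur
        active.sort_by!(&:to)
      else
        location[cur.name] = "stack#{next_slot}"
      end
      next_slot += 1
    else
      location[cur.name] = free.shift
      active << cur
      active.sort_by!(&:to)
    end
  end
  location
end
```

Re-sorting `active` after every insertion keeps the sketch short; a real allocator would more likely insert in sorted position.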
I finally understood, very late in the game, that Poletto1999 linear scan assigns only one location per virtual register. Ever. It’s not that every virtual register gets a shot in a register and then gets moved to a stack slot (that would be interval splitting, and hopefully we get to that later). If a register gets spilled, it’s in a stack slot from beginning to end. I only found this out accidentally after trying to figure out a bug (that wasn’t a bug) due to a lovely sentence in Optimized Interval Splitting in a Linear Scan Register Allocator (PDF, 2005) by Wimmer and Mössenböck: “However, it cannot deal with lifetime holes and does not split intervals, so an interval has either a register assigned for the whole lifetime, or it is spilled completely. In particular, it is not possible to implement the algorithm without reserving a scratch register: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately.” Let’s take a look at the code snippet again. Here it is before register allocation, using virtual registers: Let’s run it through register allocation with incrementally decreasing numbers of physical registers available. We get the following assignments: Some other things to note: If you have a register free, choosing which register to allocate is a heuristic! It is tunable. There is probably some research out there that explores the space. In fact, you might even consider not allocating a register greedily. What might that look like? I have no idea. Spilling the interval with the furthest endpoint is a heuristic! You can pick any active interval you want. In Register Spilling and Live-Range Splitting for SSA-Form Programs (PDF, 2009) by Braun and Hack, for example, they present the MIN algorithm, which spills the interval with the furthest next use. 
This requires slightly more information and takes slightly more time than the default heuristic but apparently generates much better code. Also, block ordering? You guessed it. Heuristic. Here is an example “slideshow” I generated by running linear scan with 2 registers. Use the arrow keys to navigate forward and backward in time 4 . At this point we have register assignments: we have a hash table mapping intervals to physical locations. That’s great but we’re still in SSA form: labelled code regions don’t have block arguments in hardware. We need to write some code to take us out of SSA and into the real world. We can use a modified Wimmer2010 as a great starting point here. It handles more than we need right now (interval splitting) but we can simplify. Because we don’t split intervals, we know that every interval live at the beginning of a block is either: For this reason, we only handle the second case in our SSA resolution. If we added lifetime holes or interval splitting, we would have to go back to the full Wimmer SSA resolution. This means that we’re going to iterate over every outbound edge from every block. For each edge, we’re going to insert some parallel moves. This already looks very similar to the RESOLVE function from Wimmer2010. Unfortunately, Wimmer2010 basically shrugs off the move-ordering step with an “eh, it’s already in the literature” comment. What’s not made clear, though, is that this particular subroutine has been the source of a significant number of bugs in the literature. Only recently did some folks roll through and suggest (proven!) fixes: This sent us down a deep rabbit hole of trying to understand what bugs occur, when, and how to fix them. We implemented both the Leroy and the Boissinot algorithms. We found differences between Boissinot2009, Boissinot2010, and the SSA book implementation following those algorithms. We found Paul Sokolovsky’s implementation with bugfixes. We found Dmitry Stogov’s unmerged pull request to the same repository to fix another bug. 
We looked at Benoit Boissinot’s thesis again and emailed him some questions. He responded! And then he even put up an amended version of his algorithm in Rust with tests and fuzzing. All this is to say that this is still causing people grief and, though I understand page limits, I wish parallel moves were not handwaved away. We ended up with this implementation which passes all of the tests from Sokolovsky’s repository as well as the example from Boissinot’s thesis (though, as we discussed in the email, the example solution in the thesis is incorrect 5 ). Leroy’s algorithm, which is shorter, passes almost all the tests—in one test case, it uses one more temporary variable than Boissinot’s does. We haven’t spent much time looking at why. Whatever algorithm you choose, you now have a way to parallel move some registers to some other registers. You have avoided the “swap problem”. That’s great. You can generate an ordered list of instructions from a tangled graph. But where do you put them? What about the “lost copy” problem? As it turns out, we still need to handle critical edge splitting. Let’s consider what it means to insert moves at an edge between blocks when the surrounding CFG looks a couple of different ways. These are the four (really, three) cases we may come across. In Case 1, if we only have two neighboring blocks A and B, we can insert the moves into either block. It doesn’t matter: at the end of A or at the beginning of B are both fine. In Case 2, if A has two successors, then we should insert the moves at the beginning of B. That way we won’t be mucking things up for the edge . In Case 3, if B has two predecessors, then we should insert the moves at the end of A. That way we won’t be mucking things up for the edge . Case 4 is the most complicated. There is no extant place in the graph we can insert moves. If we insert in A, we mess things up for . If we insert in , we mess things up for . Inserting in or doesn’t make any sense. What is there to do? 
As it turns out, Case 4 is called a critical edge . And we have to split it. We can insert a new block E along the edge and put the moves in E! That way they still happen along the edge without affecting any other blocks. Neat. In Ruby code, that looks like: Adding a new block invalidates the cached , so we also need to recompute that. We could also avoid that by splitting critical edges earlier, before numbering. Then, when we arrive in , we can clean up branches to empty blocks! (See also Nick’s post on critical edge splitting , which also links to Faddegon’s thesis, which I should at least skim.) And that’s it, folks. We have gone from virtual registers in SSA form to physical locations. Everything’s all hunky-dory. We can just turn these LIR instructions into their very similar looking machine equivalents, right? Not so fast… You may have noticed that the original linear scan paper does not mention calls or other register constraints. I didn’t really think about it until I wanted to make a function call. The authors of later linear scan papers definitely noticed, though; Wimmer2005 writes the following about Poletto1999: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately. Fun. We will start off by handling calls and method parameters separately, we will note that it’s not amazing code, and then we will eventually implement the later papers, which handle register constraints more naturally. We’ll call this new function after register allocation but before SSA resolution. We do it after register allocation so we know where each virtual register goes but before resolution so we can still inspect the virtual register operands. 
Its goal is to do a couple of things: We’ll also remove the operands since we’re placing them in special registers explicitly now. (Unfortunately, this sidesteps handling the less-fun bit of calls in ABIs where after the 6th parameter, they are expected on the stack. It also completely ignores ABI size constraints.) Now, you may have noticed that we don’t do anything special for the incoming params of the function we’re compiling! That’s another thing we have to handle. Thankfully, we can handle it with yet another parallel move (wow!) at the end of . Again, this is yet another kind of thing where some of the later papers have much better ergonomics and also much better generated code. But this is really cool! If you have arrived at this point with me, we have successfully made it to 1997 and that is nothing to sneeze at. We have even adapted research from 1997 to work with SSA, avoiding several significant classes of bugs along the way. We have just built an enormously complex machine. Even out the gate, with the original linear scan, there is a lot of machinery. It’s possible to write tests that spot check sample programs of all shapes and sizes but it’s very difficult to anticipate every possible edge case that will appear in the real world. Even if the original algorithm you’re using has been proven correct, your implementation may have subtle bugs due to (for example) having slightly different invariants or even transcription errors. We have all these proof tools at our disposal: we can write an abstract interpreter that verifies properties of one graph, but it’s very hard (impossible?) to scale that to sets of graphs. Maybe that’s enough, though. In one of my favorite blog posts, Chris Fallin writes about writing a register allocation verifier based on abstract interpretation. It can verify one concrete LIR function at a time. It’s fast enough that it can be left on in debug builds. 
This means that a decent chunk of the time (tests, CI, maybe a production cluster) we can get a very clear signal that every register assignment that passes through the verifier satisfies some invariants. Furthermore, we are not limited to Real World Code. With the advent of fuzzing, one can imagine an always-on fuzzer that tries to break the register allocator. A verifier can then catch bugs that come from exploring this huge search space. Some time after finding Chris’s blog post, I also stumbled across the very same thing in V8! I find this stuff so cool. I’ll also mention Boissinot’s Rust code again because it does something similar for parallel moves. It’s possible to do linear scan allocation in reverse, at least on traces without control-flow. See for example The Solid-State Register Allocator, the LuaJIT register allocator, and Reverse Linear Scan Allocation is probably a good idea. By doing linear scan this way, it is also possible to avoid computing liveness and intervals. I am not sure if this works on programs with control-flow, though. We built a register allocator that works on SSA. Hopefully next time we will add features such as lifetime holes, interval splitting, and register hints. The full Ruby code listing is not (yet?) publicly available under the Apache 2 license. UPDATE: See the post on lifetime holes. Thanks to Waleed Khan and Iain Ireland for giving feedback on this post. It’s not just about registers, either. In 2016, Facebook engineer Dave legendarily used linear-scan register allocation to book meeting rooms. ↩ Well. As I said on one of the social media sites earlier this year, “All AOT compilers are alike; each JIT compiler is fucked up in its own way.” JavaScript: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer notes: Our allocator relies on static single assignment form, which simplifies data flow analysis and tends to produce short live intervals. 
Register allocation for programs in SSA-form (PDF, 2006) by Hack, Grund, and Goos notes that interference graphs for SSA programs are chordal and can be optimally colored in quadratic time. SSA Elimination after Register Allocation (PDF, 2008) by Pereira and Palsberg notes: One of the main advantages of SSA based register allocation is the separation of phases between spilling and register assignment. Cliff Click (private communication, 2025) notes: It’s easier. Got it already, why lose it […] spilling always uses use/def and def/use edges. This is inspired by Rasmus Andersson ’s graph coloring visualization that I saw some years ago. ↩ The example in the thesis is to sequentialize the following parallel copy: The solution in the thesis is: but we think this is incorrect. Solving manually, Aaron and I got: which is what the code gives us, too. ↩
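For readers who want to poke at the parallel move problem discussed in this post, here is a small Python sketch of one way to sequentialize a parallel copy. To be clear about assumptions: this is neither Leroy’s nor Boissinot’s algorithm and it is not our Ruby implementation. It is a simpler “emit what is safe, then break a cycle with a temporary” loop, and the names sequentialize, apply_moves, and the "tmp" location are all invented for this example.

```python
def sequentialize(moves, tmp="tmp"):
    """Order a parallel copy so it can run one move at a time.

    moves is a list of (dst, src) pairs that conceptually all happen at
    once; each dst may appear only once. tmp is a scratch location that
    is assumed not to be used by any move.
    """
    pending = {dst: src for dst, src in moves if dst != src}
    result = []
    while pending:
        # A move is safe to emit when no pending move still reads its dst.
        sources = set(pending.values())
        ready = [dst for dst in pending if dst not in sources]
        if ready:
            for dst in ready:
                result.append((dst, pending.pop(dst)))
            continue
        # Only cycles remain. Save one destination in tmp and make its
        # readers read tmp instead; that destination is then safe to write.
        dst = next(iter(pending))
        result.append((tmp, dst))
        for other in list(pending):
            if pending[other] == dst:
                pending[other] = tmp
    return result


def apply_moves(sequence, env):
    """Run moves one after another on a copy of env (for checking)."""
    env = dict(env)
    for dst, src in sequence:
        env[dst] = env[src]
    return env
```

A swap such as [("a", "b"), ("b", "a")] comes out as three moves through tmp, which is the “swap problem” handled in the most boring way possible. A production version would also want to pick the temporary more carefully than a single fixed name.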


Max Bernstein 9 months ago

Liveness analysis with Datalog

After publishing Linear scan register allocation on SSA , I had a nice call with Waleed Khan where he showed me how to Datalog. He thought it might be useful to try implementing liveness analysis as a Datalog problem. We started off with the Wimmer2010 CFG example from that post, sketching out manually which variables were live out of each block: R10 out of B1, R12 out of B2, etc. The graph from Wimmer2010 has come back! Remember, we’re using block arguments instead of phis, so defines R10 and R11 before the first instruction in B1. Then we tried to formulate liveness as a Datalog relation. Liveness is normally (at least for me) defined in terms of two relations: live-in and live-out. Live-out is “what is needed” from all of the successors of a block and live-in is the “what is needed” summary for a block. So, in fake math notation: where each of the component parts of that expression represent sets of variables: We ended up computing the live-in sets for blocks in the register allocator post but then using the live-out sets instead. So today let’s compute both live-in and live-out sets with Datalog! Datalog is a logic programming language. It probably looks and feels different from every other programming language you have used… except for maybe SQL. It might feel similar to SQL, except SQL has a certain order to it that Datalog does not. We’ll be using Souffle here because Waleed mentioned it and also I learned a bit about it in my databases class. The thing you do first is define your relations, which is what Datalog calls a table. In this case, if we want to compute liveness information, we have to know information about what a block uses, defines, and what successors it has. First, the thing you have to know about Datalog, is that it’s kind of like the opposite of array programming. We’re going to express things about sets by expressing facts about individual items in a set. For example, we’re not going to say “this block B4 uses [R10, R12, R16]”. 
We’re going to say three separate facts: “B4 uses R10”, “B4 uses R12”, “B4 uses R16”. You can think about it like each relation being a database table where each parameter is a column name. Here are the relations for block uses, block defs, and which blocks follow other blocks: Where here means string. We can then embed some facts inline. For example, this says “A defines R0 and R1 and uses R0”: You can also provide facts as a TSV but this file format is so irritating to construct manually and has given me silently wrong answers in Souffle before so I am not doing that for this example. You can, for your edification, manually encode all the use/def/successor facts from the previous post into Souffle—or you can copy this chunk into your file: We can declare our live-in and live-out relations similarly to our use/def/succ relations. We mark them as being so that Souffle presents us with the results. Now it’s time to define our relations. You may notice that the Souffle definitions look very similar to our earlier definitions. This is no mistake; Datalog was created for dataflow and graph problems. We’ll start with live-out: We read this left to right as “a variable is live-out of block if block is a successor of and is live-in to ”. The defines the left side in terms of the right side. The comma between and means it’s a conjunction— and . Where’s the union? Well, remember what I said about array programming? We’re not thinking in terms of sets. We’re thinking one program variable at a time. As Souffle executes our relations, will incrementally build up a table. It’s also a little weird to program in this style because wasn’t textually defined anywhere like a parameter or a variable. You kind of have to think of as connector, a binder, a foreign key—what have you. It’s a placeholder. (I don’t know how to explain this well. Sorry.) Then we can define live-in. This on the surface looks more complicated but I think that is only because of Souffle’s choice of syntax. 
It reads as “a variable is live-in to if it is either live-out of or used in , and not defined in . The semicolons are disjunctions— or —and the exclamation points negations— not . These relations look endlessly mutually recursive but you have to keep in mind that we’re not running functions on data, exactly. We’re declaratively expressing definitions of rules—relations. in the body of is not calling a function but instead making a query—is the row in the table ? Datalog builds the tables until saturation. Now we can run Souffle! We tell it to dump to standard output with but you could just as easily have it dump each output relation in its own separate file in the current directory by specifying . That’s neat. We got nicely formatted tables and it only took us two lines of code! Let’s compare to our Ruby code from the previous post to underscore the point: This is because we have separated the iteration-to-fixpoint bit from the main bit of the dataflow analysis: the equation. If we let Datalog do the data movement for us, we can work on defining the rules—and only the rules. This is probably why, in the fullness of time, many static analysis and compiler tools end up growing some kind of embedded (partial) Datalog engine. Call it Scholz’s tenth rule. Souffle also has the ability to compile to C++, which gives you two nice things: I don’t have any experience with this API. This is when Waleed mentioned offhandedly that he had heard about some embedded Rust datalog called Ascent . The front page of the Ascent website is a really great sell if you show up thinking “gee, I wish I had Datalog to use in my Rust program”. Right out the gate, you get reasonable-enough Datalog syntax via a proc macro. For example, here is the canonical path example for Souffle: and in Ascent: We weren’t sure if the Souffle liveness would port cleanly to Rust, but it sure did! It even lets you use your own datatypes instead of just (which the front-page example uses). 
Notice how we don’t have an input or output annotation like we did in Datalog. That’s because this is designed to be embedded in an existing program, which probably doesn’t want to deal with the disk (or at least wants to read/write in its own format). Ascent lets us give it some vectors of data and then at the end lets us read some vectors of data too. Then we need only run two commands, both of which worked with zero issues, and see the results. It’s not a fancy-looking table, but it’s very close to my program, which is neat. This is similar to embedding Souffle in C++ and then calling the C++. One difference, though, is the Souffle process has two steps. It’s a slight build system complication. But this isn’t meant to be a Datalog comparison post! Can we model all of linear scan this way? Maybe. I’m new to all this stuff. Ascent also seems to support lattices, which means we can use it to do abstract interpretation on some cool domains. Maxime Chevalier-Boisvert and I prototyped loupe, an interprocedural type analysis in Rust. We had to build our own iterate-to-fixpoint engine, which was non-trivial. I wonder how it would look to build something similar on top of Ascent. I kind of want to check out Frank McSherry’s datatoad. That’s all for now, folks. Just a couple of Datalog snippets. Happy hacking.
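If you want to see what the Datalog engine is doing on our behalf, here is a rough Python sketch of the same two rules evaluated by naive iteration to a fixpoint. It follows the rules as described in prose in this post (live-out comes from the live-in of successors; live-in is live-out or used, and not defined). The function name liveness and the three-block CFG in the usage note are hypothetical, not the Wimmer2010 example.

```python
def liveness(blocks, use, define, succ):
    """Compute (live_in, live_out) as dicts from block name to set of vars.

    use, define: dict from block name to the set of variables it uses/defines.
    succ: dict from block name to an iterable of successor block names.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate until nothing changes ("saturation")
        changed = False
        for b in blocks:
            # live_out(b, v) :- succ(b, s), live_in(s, v).
            out = set().union(*(live_in[s] for s in succ.get(b, ())))
            # live_in(b, v) :- (live_out(b, v); use(b, v)), !def(b, v).
            inn = (out | use.get(b, set())) - define.get(b, set())
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out
```

For example, with blocks A, B, and C, where A defines R0 and R1 and uses R0, B defines R2 and uses R1, C uses R0 and R2, and the edges are A to B, B to B, and B to C, this reports R0 and R1 as live out of A. All of the bookkeeping in the while loop is the part that Souffle and Ascent let us delete.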


Max Bernstein 9 months ago

Compiling a Lisp: Closure conversion

first – previous EDIT: /u/thunderseethe correctly points out that this is closure conversion, not lambda lifting, so I have adjusted the post title from “lambda lifting” to “closure conversion” accordingly. Thanks! I didn’t think this day would come, but I picked up the Ghuloum tutorial (PDF) again and I got a little bit further. There’s just one caveat: I have rewritten the implementation in Python. It’s available in the same repo in compiler.py . It’s brief, coming in at a little over 300 LOC + tests (compared to the C version’s 1200 LOC + tests). I guess there’s another caveat, too, which is that the Python version has no S-expression reader. But that’s fine: consider it an exercise for you, dear reader. That’s hardly the most interesting part of the tutorial. Oh, and I also dropped the instruction encoding. I’m doing text assembly now. Womp womp. Anyway, converting the lambdas as required in the paper requires three things: We have two forms that can bind variables: and . This means that we need to recognize the names in those special expressions and modify the environment. What environment, you ask? Well, I have this little class. We keep the same dict for the entire recursive traversal of the program, but we modify at each binding site and only at lambdas. To illustrate how they are used, let’s fill in some sample expressions: , , and : Well, okay, sure, we don’t actually need to think about variable names when we are dealing with simple constants. So let’s look at variables: We don’t want to actually transform the variable uses, just add some metadata about their uses. If we have some variable bound by a or a expression, we can leave it alone. Otherwise, we need to mark it. There’s one irritating special case here which is that we don’t want to consider (for example) as a free variable: it is a special language primitive. So we consider and the others as always bound. Armed with this knowledge, we can do our first recursive traversal: expressions. 
Since they have recursive parts and don’t bind any variables, they are the second-simplest form for this converter. This test doesn’t tell us much yet (other than adding an empty and not raising an exception). But it will soon. Let’s think about what does. It’s a bunch of features in a trench coat: To handle the closure conversion, we have to reason about all three. First, the lambda binds its parameters as new names. In fact, those are the only bound variables in a lambda. Consider: is a free variable in that lambda! We’ll want to transform that lambda into: Even if were bound by some outside the lambda, it would be free in the lambda: That means we don’t thread through the parameter to the lambda body; we don’t care what names are bound outside the lambda. We also want to keep track of the set of variables that are free inside the lambda: we’ll need them to create a form. Therefore, we also pass in a new set for the lambda body’s set. So far, all of this environment wrangling gives us: There’s also in there because any variable free in a lambda expression is also free in the current expression—well, except for the variables that are currently bound. Last, we’ll make a form and a form. The gets appended to the global list with a new label and the label gets threaded through to the . This is finicky! I think my first couple of versions were subtly wrong for different reasons. Tests help a lot here. For every place in the code where I mess with or in a recursive call, I tried to have a test that would fail if I got it wrong. Now let’s talk about the other binder. Let’s think about what does by examining a confusing let expression: Inside this expression, there are two s. One of them is bound inside the let, but the other is free inside the let! This is because evaluates all of its bindings without access to the bindings as they are being built up (for that, we would need ). 
So this must mean that: Which gives us, in code: Last, and somewhat boringly, we have function calls. The only thing to call out is again handling these always-bound primitive operators like , which we don’t want to have a : Now that we have these new , and forms we have to compile them into assembly. Compiling closure forms is very similar to allocating a string or a vector. In the first cell, we want to put a pointer to the code that backs the closure (this will be some label like ). We can get a reference to that using , since it will be a label in the assembly. Then we write it to the heap. Then for each free variable, we go find out where it’s defined. Since we know by construction that these are all strings, we don’t need to worry about having weird recursion issues around keeping track of a moving heap pointer. Instead, we know it’s always going to be an indirect from the stack or from the current closure. Then we write that to the heap. Then, since a closure is an object, we need to give it a tag. So we tag it with because I felt cute. You could also use or . We store the result in because that’s our compiler contract. Last, we bump the heap pointer by the size of the closure. So compiles to: and if we had a closure variable, for example : One nicety of emitting text assembly is that I can add inline comments very easily. That’s what my function is for: it just prefixes a . …wait, hold on, why are we reading from for a closure variable? That doesn’t make any sense, right? That’s because while we are reading off the closure, we are reading from a tagged pointer. Since we know the index into the closure and also the tag at compile-time, we can fold them into one neat indirect. Now let’s call some closures…! I’ll start by showing the code for because it’s a good stepping stone toward (nice job, Dr Ghuloum!). 
The main parts are: I think in my last version (the C version) I did this recursively because looping felt challenging to do neatly in C with the data structures I had built, but since this is Python and the wild west, we’re looping. A lot of this carries over exactly to , with a couple of differences: I think the stack adjustment math was by far and away the most irritating thing to get right here. Oh, and also remembering to untag the closure when trying to call it. So compiles to: Not bad for a 300-line compiler! I think that’s all there is for today, folks. We got closures, free variable analysis, and indirect function calls. That’s pretty good. Happy hacking!
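As a recap of the free variable rules from this post, here is a small standalone Python sketch. It is not the code from compiler.py: it only computes the set of free variables rather than rewriting the program, it represents programs as nested Python lists with symbols as strings, and the names free_vars and PRIMITIVES (and the exact set of primitives) are assumptions made for this example.

```python
# Primitive operators are treated as always bound (hypothetical list).
PRIMITIVES = frozenset({"+", "-", "*", "car", "cdr", "cons"})


def free_vars(expr, bound=frozenset()):
    """Return the set of variable names that occur free in expr."""
    if isinstance(expr, str):  # a variable reference
        if expr in bound or expr in PRIMITIVES:
            return set()
        return {expr}
    if not isinstance(expr, list) or not expr:
        return set()  # constants have no free variables
    head = expr[0]
    if head == "lambda":
        _, params, body = expr
        # Inside the lambda only its parameters are bound. Whatever is
        # free in there is free out here too, unless we bind it.
        return free_vars(body, frozenset(params)) - bound
    if head == "let":
        _, bindings, body = expr
        free = set()
        for _name, value in bindings:
            # Binding values cannot see the names being bound by this let.
            free |= free_vars(value, bound)
        names = frozenset(name for name, _value in bindings)
        return free | free_vars(body, bound | names)
    # "if" is a keyword; in a call, the callee position is an expression.
    parts = expr[1:] if head == "if" else expr
    return set().union(*(free_vars(part, bound) for part in parts))
```

With this encoding, free_vars(["lambda", ["x"], ["+", "x", "y"]]) reports y as free, and the confusing let from earlier, written as ["let", [["x", "x"]], "x"], reports the binding-value x as free while the body x is bound.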


Max Bernstein 9 months ago

How to use snprintf

The printf family of functions has this little-known feature: snprintf can tell you what size your buffer should be. In cases where you don’t have a fixed upper bound, this is really useful. For example, you can ask for the required size first and then allocate a buffer one byte larger than the reported size, because the man pages say: “The functions snprintf() and vsnprintf() write at most size bytes (including the terminating null byte (‘\0’)) to str.” If you like, check out this tiny header-only library I wrote a couple of years ago and promptly forgot about. Go forth and please stop manually computing buffer sizes.


Max Bernstein 11 months ago

ClassDistribution from S6 JIT is really neat

One unassuming week of September 2022, Google DeepMind dropped a fully-fledged CPython JIT called S6 squashed to one commit. I had heard nothing of its development even though I was working on Cinder at the time and generally heard about new JIT efforts. I started poking at it. The README has some excellent structural explanation of how they optimize Python, including a nice introduction to hidden classes (also called shapes, layouts, and maps elsewhere). Hidden classes are core to making dynamic language runtimes fast: they allow for what is normally a hashtable lookup to become an integer comparison and a memory load. They rely on the assumption that even in a dynamic language, programmers are not very creative, and therefore for a given location in the code (PC), the number of types seen will be one, or at least small. See a great tutorial by CF Bolz-Tereick on how to build a hidden-class-based object model. Hidden classes give you the ability to more quickly read from objects, but you, the runtime implementor, have to decide what kind of cache you want to use. Should you have a monomorphic cache? Or a polymorphic cache? In an interpreter, a common approach is to do some kind of state-machine-based bytecode rewriting. Your generic opcodes (load an attribute, load a method, add) start off unspecialized, specialize to monomorphic when they first observe a hidden class HC, rewrite themselves to polymorphic when they observe the next hidden class HC’, and may again rewrite themselves to megamorphic (the sad case) when they see the K+1th hidden class. Pure interpreters take this approach because they want to optimize as they go and the unit of optimization is normally (PDF) one opcode at a time. One interesting observation here is that while the bytecode rewriting is used to help interpreter performance, you can reuse this specialized bytecode and its cache contents as a source of profiling information when the JIT kicks in. 
It’s a double use, which is a win for storage and run-time overhead. In an optimizing JIT world that cares a little less about interpreter/baseline compiler performance, the monomorphic/polymorphic split may look a little different: If you go for monomorphic and that code never sees any other hidden class, you’ve won big: the generated code is small and generally you can use these very strong type assumptions from having burned them into the code from the beginning. If you’re wrong, though, and that ends up being a polymorphic site in the code, you lose on performance: it will be constantly jumping into the interpreter. If you go for polymorphic but the code is mostly monomorphic, then you mostly just lose on peak performance. Your code may need to walk the cmp+jcc chain in the JIT and the operation’s inferred type in your IR will not be as fine-grained as the monomorphic case. But you might side-exit less into the interpreter, which is nice. But “polymorphic” and “megamorphic” are very coarse summaries of the access patterns at that site. Yes, side exits are slow, but if a call site S is specialized only for hidden class HC and mostly sees HC but sometimes sees HC’, that’s probably fine! We can take a few occasional side exits if the primary case is fast. Let’s think about the information our caches give us right now: But we want more information than that: we want to know if the access patterns are skewed in some way. What if at some PC the interpreter sees 100x hidden class A and only 2x hidden class B? This would unfortunately look like a boring polymorphic cache. Or, maybe more interesting, what if we have a megamorphic site but one class more or less dominates? This would unfortunately look like a total bummer case even though it might be salvageable. If only we had a nice data structure for this… S6 has this small C++ class called ClassDistribution that the interpreter uses to register what hidden classes it sees during execution profiling. 
It dispenses with the implicit seen order that a polymorphic cache keeps in its cmp-jcc chain and instead uses two fixed-size (they chose K=4) parallel arrays: and . Every time the interpreter captures a profile, it calls , which increments the corresponding count associated with that ID. There are a couple of interesting things that this function does: That is not much more additional space and it gets you a totally different slice of the picture than a “normal” IC and bytecode rewriting. I find the bubbling up, the other count, and the running difference especially fun. After a while, some bit of policy code decides that it’s time to switch execution modes for a given function and compile. The compiler would like to make use of this profile information. Sure, it can fiddle around with it in its raw state, but the S6 devs found a better API that random compiler passes can consume: the . The is another very small C++ class. It has only three fields: the class IDs from the (but not their counts), a field, and a field. We don’t need their counts because that’s not really the question the optimizer should be asking. The thing the optimizer actually wants to know is “how should I speculate at this PC?” and it can outsource the mechanism for that to the ’s kind (and the information implicit in the ordering of the class IDs, where the hottest class ID is in index 0). The kind can be one of five options: Empty , Monomorphic , Polymorphic , SkewedMegamorphic , and Megamorphic , each of which imply different things about how to speculate. Empty, monomorphic and polymorphic are reasonably straightforward (did we see 0, 1, or <= K class IDs?) but SkewedMegamorphic is where it gets interesting. Their heuristic for if a megamorphic PC is skewed is if the class ID in bucket 0—the most popular class ID—is over 75% of the total recorded events. This means that the optimizer still has a shot at doing something interesting at the given PC. 
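To get a feel for the data structure, here is a toy Python model of the idea as I have described it above. This is emphatically not S6’s C++ code: the method names add and summarize, the use of None for an empty slot, and the exact bubbling rule are my own guesses for illustration, and the running difference is left out entirely. Only K=4, the five kinds, and the 75% threshold come from the description above.

```python
from enum import Enum

K = 4  # number of tracked class IDs (S6 chose K=4)


class Kind(Enum):
    EMPTY = "Empty"
    MONOMORPHIC = "Monomorphic"
    POLYMORPHIC = "Polymorphic"
    SKEWED_MEGAMORPHIC = "SkewedMegamorphic"
    MEGAMORPHIC = "Megamorphic"


class ClassDistribution:
    def __init__(self):
        self.class_ids = [None] * K  # parallel arrays: IDs ...
        self.counts = [0] * K        # ... and how often each was seen
        self.other_count = 0         # events for classes that did not fit

    def add(self, class_id):
        for i in range(K):
            if self.class_ids[i] is None:  # first free slot: start tracking
                self.class_ids[i], self.counts[i] = class_id, 1
                return
            if self.class_ids[i] == class_id:
                self.counts[i] += 1
                # Bubble hotter classes toward index 0.
                while i > 0 and self.counts[i] > self.counts[i - 1]:
                    j = i - 1
                    self.class_ids[i], self.class_ids[j] = self.class_ids[j], self.class_ids[i]
                    self.counts[i], self.counts[j] = self.counts[j], self.counts[i]
                    i = j
                return
        self.other_count += 1  # all K slots are taken by other classes

    def summarize(self):
        """Return (kind, class IDs ordered hottest first)."""
        ids = [c for c in self.class_ids if c is not None]
        total = sum(self.counts) + self.other_count
        if not ids:
            kind = Kind.EMPTY
        elif self.other_count == 0:
            kind = Kind.MONOMORPHIC if len(ids) == 1 else Kind.POLYMORPHIC
        elif self.counts[0] > 0.75 * total:
            kind = Kind.SKEWED_MEGAMORPHIC
        else:
            kind = Kind.MEGAMORPHIC
        return kind, ids
```

Feeding this 100 events for one class followed by a handful of one-off classes produces a SkewedMegamorphic summary with the dominant class at index 0, which is exactly the “total bummer case that might be salvageable” from earlier.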
I wonder why they didn’t also have SkewedPolymorphic. I think that’s because for polymorphic PCs, they inline the entire compare-jump chain eagerly, which puts the check for the most popular ID in the first position. Still, I think there is potentially room to decide to monomorphize a polymorphic call site. There’s some ad-hoc checking for this kind of thing in , for example to specialize where is historically either a or a .

Also, sadly, they did not get to implementing SkewedMegamorphic before the project shut down, so they only handle monomorphic and polymorphic cases all across the optimizer. Ah well.

Some of the time-shift profiling that you can do with a ClassDistribution seems really cool and I had not seen it before. It feels like it could help with the issues brought up in Why Aren’t More Users More Happy With Our VMs? Maybe. Understanding the behavior of a program through a tiny lens over a small snapshot of time is challenging.

(Kind of a “bits and bobbles” section a la Phil Zucker. I’m trying it out.)

FeedbackVector in V8. See blog post by Benedikt Meurer, which explains how they profile generic instruction operands using a feedback lattice.

Speculation in JavaScriptCore, which continues to be a fantastic resource for fast runtime development. In it, Fil argues that the cost of speculating wrong is so high that you better be darn sure that is true in

See a blog post by Jan de Mooij and a blog post by Matthew Gaudet on CacheIR in SpiderMonkey (and paper! (PDF)) → helpful for trial inlining?

See warp improvement blog post

Tracing is just different

Basic block versioning

What if we had more context? Info from caller

Max Bernstein 11 months ago

Heating water from afar

Please do not take anything you read in any of my posts (but especially not this post) as engineering advice.

My parents’ house has an on-demand water heater for energy efficiency reasons. This has a small drawback: you have to press a button to prime the water heater and then wait 2-5 minutes before showering. This turns into a somewhat bigger drawback for the one room for which it’s just Not Possible to wire up a button. The water heater company hypothetically sells a plug-and-play wireless solution for this sort of thing, but that is seemingly incompatible with my parents’ walls (???).

Thankfully, I have a playbook for this. There’s just one problem: how do I make a Raspberry Pi talk to the water heater? I investigated a couple of different approaches: but after a couple of discussions with my dad and a support tech from the company, we determined that we should instead emulate a button press. To find out what that means, let’s take a look at a very sketchy wiring diagram:

This means we would have to add a new “button” and have it briefly connect two wires. Because the last time I actually touched some wires was over 10 years ago in robotics and I don’t want to start any fires, I reach out to the usual suspects: Tom and Logan. They inform me that the thing I am looking for is called a relay and that companies sell pre-built relay hats for the Pi. Super. I ended up buying:

At some point I sit down to build the thing and realize that I don’t actually know how relays work. The relay I bought had three ports: NC, NO, COM. After some searching, I figure out that I want one wire in NO (“normally open”) and one in COM (“common”). This means that the relay, when activated, will close the circuit.
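For the curious, “briefly connect two wires” with a relay on NO/COM looks roughly like the sketch below. This is not the script from this post: the function name `press_button`, the pin number, the half-second hold, and the assumption that driving the pin high energizes the relay are all made up for illustration and depend on the relay hat. The `gpio` argument is expected to behave like the `RPi.GPIO` module (`setmode`, `setup`, `output`, `cleanup`), and passing it in keeps the logic testable off the Pi.

```python
import time


def press_button(gpio, pin, hold_seconds=0.5):
    """Emulate a button press: close the relay briefly, then open it again.

    `gpio` is any object with the RPi.GPIO-style interface used below.
    `pin` and `hold_seconds` are hypothetical values, not from the post.
    """
    gpio.setmode(gpio.BCM)       # use Broadcom pin numbering
    gpio.setup(pin, gpio.OUT)    # the relay control pin is an output
    try:
        gpio.output(pin, gpio.HIGH)  # assumed: high energizes the relay (NO closes)
        time.sleep(hold_seconds)     # hold the "button" down for a moment
        gpio.output(pin, gpio.LOW)   # release: the circuit opens again
    finally:
        gpio.cleanup()               # always release the pins, even on error
```

On the Pi itself, usage would be something like `import RPi.GPIO as GPIO` followed by `press_button(GPIO, pin=26)`, where 26 is a placeholder for whichever pin the relay hat actually uses.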
I downloaded the sample code from the company that sells the relay hats and realized that it is an extremely thin (~10 LOC) wrapper over the existing Python GPIO library provided and pre-installed by Raspberry Pi, so I just manually inlined it:

If you read my previous post (linked above), you will know that this is, of course, a CGI script that is triggered on a website button press:

All of the rest of the software is the same as in the previous post. Very boring stuff: httpd, systemd.

Hopefully nothing goes wrong. But if it does and I need to administer this device from afar, I also set up Tailscale (no, this is not an ad; just happy).

The total bill for this came to ~$40 or so, which isn’t half bad. It could probably be done for 35 cents using an old microcontroller and a paperclip or something but I wanted an exceptionally boring (to me) approach.

That’s all for now. Thanks for reading!

Hardware

Tutorial