Linear scan register allocation on SSA
Much of the code and education that resulted in this post happened with Aaron Patterson.

The fundamental problem in register allocation is to take an IR that uses virtual registers (as many as you like) and rewrite it to use a finite number of physical registers and stack space 1. This is an example of a code snippet using virtual registers: And here is the same example after it has been passed through a register allocator (note that Rs changed to Ps): Each virtual register was assigned a physical place: R1 to the stack, R2 to P0, R3 to P1, and R4 also to P0 (since we weren’t using R2 anymore).

People use register allocators like they use garbage collectors: it’s an abstraction that can manage your resources for you, maybe with some cost. When writing the back-end of a compiler, it’s probably much easier to have a separate register-allocator-in-a-box than to manually manage variable lifetimes while also considering all of your different target architectures.

How do JIT compilers do register allocation? Well, “everyone knows” that “every JIT does its own variant of linear scan” 2. This bothered me for some time because I’ve worked on a couple of JITs and still didn’t understand the backend bits. There are a couple of different approaches to register allocation, but in this post we’ll focus on linear scan on SSA.

I started reading Linear Scan Register Allocation on SSA Form (PDF, 2010) by Wimmer and Franz after writing A catalog of ways to generate SSA. Reading alone didn’t make a ton of sense—I ended up with a lot of very frustrated margin notes. So I started trying to implement it alongside the paper. As it turns out, though, there is a rich history of papers in this area that it leans on really heavily. I needed to follow the chain of references!

For example, here is a lovely explanation of the process, start to finish, from Christian Wimmer’s Master’s thesis (PDF, 2004). There it is, all laid out at once. It’s very refreshing when compared to all of the compact research papers. I didn’t realize that there were more than one or two papers on linear scan. So this post will also incidentally serve as a bit of a survey or a history of linear scan—as best as I can figure it out, anyway. If you were in or near the room where it happened, please feel free to reach out and correct some parts.

Throughout this post, we’ll use an example SSA code snippet from Wimmer2010, adapted from phi-SSA to block-argument-SSA. Wimmer2010’s code snippet is between the arrows and we add some filler (as alluded to in the paper): Virtual registers start with R and are defined either with an arrow or by a block parameter.

Because it takes a moment to untangle the unfamiliar syntax and draw the control-flow graph by hand, I’ve also provided the same code in graphical form. Block names (and block parameters) are shaded with grey. We have one entry block, , that is implied in Wimmer2010. Its only job is to define and for the rest of the CFG. Then we have a loop between and with an implicit fallthrough. Instead of doing that, we generate a conditional branch with explicit jump targets. This makes it possible to re-order blocks as much as we like. The contents of are also just there to fill in the blanks from Wimmer2010 and add some variable uses.

Our goal for the post is to analyze this CFG, assign physical locations (registers or stack slots) to each virtual register, and then rewrite the code appropriately. For now, let’s rewind the clock and look at how linear scan came about.
Linear scan register allocation (LSRA) has been around for a while. It’s neat because it does the actual register assignment part of register allocation in one pass over your low-level IR. (We’ll talk more about what that means in a minute.)

It first appeared in the literature in tcc: A System for Fast, Flexible, and High-level Dynamic Code Generation (PDF, 1997) by Poletto, Engler, and Kaashoek. (Until writing this post, I had never seen this paper. It was only on a re-read of the 1999 paper (below) that I noticed it.) In this paper, they mostly describe a staged variant of C called ‘C (TickC), for which a fast register allocator is quite useful.

Then came a paper called Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. It adds some optimizations (lifetime holes, binpacking) to the algorithm presented in Poletto1997.

Then came the first paper I read, and I think the paper everyone refers to when they talk about linear scan: Linear Scan Register Allocation (PDF, 1999) by Poletto and Sarkar. In this paper, they give a fast alternative to graph coloring register allocation, especially motivated by just-in-time compilers. In retrospect, it seems to be a bit of a rehash of the previous two papers.

Linear scan (1997, 1999) operates on live ranges instead of virtual registers. A live range is a pair of integers [start, end) (end is exclusive) that begins when the register is defined and ends when it is last used. This means that there is an assumption that the order of instructions in your program has already been fixed into a single linear sequence! It also means that you have given each instruction a number that represents its position in that order. This may or may not be a surprising requirement depending on your compiler background. It was surprising to me because I often live in control flow graph fantasy land where blocks are unordered and instructions sometimes float around. But if you live in a land of basic blocks that are already in reverse post order, then it may be less surprising.

In non-SSA-land, these live ranges are different from the virtual registers: they represent the lifetime of each version of a virtual register. For an example, consider the following code snippet: There are two definitions of and they each live for different amounts of time: In fact, the ranges are completely disjoint. It wouldn’t make sense for the register allocator to consider variables, because there’s no reason the two s should necessarily live in the same physical register.

In SSA land, it’s a little different: since each virtual register only has one definition (by, uh, definition), live ranges are an exact 1:1 mapping with virtual registers. We’ll focus on SSA for the remainder of the post because this is what I am currently interested in. The research community seems to have decided that allocating directly on SSA gives more information to the register allocator 3.

Linear scan starts at the point in your compiler process where you already know these live ranges—where you have already done some kind of analysis to build a mapping. In this blog post, we’re going to back up to the point where we’ve just built our SSA low-level IR and have yet to do any work on it. We’ll do all of the analysis from scratch. Part of this analysis is called liveness analysis.
The result of liveness analysis is a mapping that tells you, for each basic block, which virtual registers (remember, since we’re in SSA, instruction==vreg) are alive (used later) at the beginning of that block. This is called a live-in set. For example: We compute liveness by working backwards: a variable is live from the moment it is first used (walking backwards) until its definition. In this case, at the end of B2, nothing is live. If we step backwards to the , we see a use: R16 becomes live. If we step once more, we see its definition—R16 is no longer live—but now we see a use of R14 and R15, which become live. This leaves us with R14 and R15 being live-in to B2. This live-in set becomes B1’s live-out set because B1 is B2’s predecessor. We continue in B1.

We could continue backwards linearly through the blocks. In fact, I encourage you to do it as an exercise. You should have a (potentially empty) set of registers per basic block. It gets more interesting, though, when we have branches: what does it mean when two blocks’ live-in results merge into their shared predecessor? If we have two blocks A and B that are successors of a block C, the live-in sets get unioned together. That is, if there were some register R0 live-in to B and some register R1 live-in to A, both R0 and R1 would be live-out of C. They may also be live-in to C, but that entirely depends on the contents of C.

Since the set of virtual registers in a given program is finite, sets of live registers make a nice finite lattice for an abstract interpreter. That’s right, we’re doing AI. In this liveness analysis, we’ll: We store gen, kill, and live-in sets as bitsets, using some APIs conveniently available on Ruby’s Integer class. Like most abstract interpretations, it doesn’t matter what order we iterate over the collection of basic blocks for correctness, but it does matter for performance. In this case, iterating backwards ( ) converges much faster than forwards ( ): We could also use a worklist here, and it would be faster, but eh. Repeatedly iterating over all blocks is fine for now.

The Wimmer2010 paper skips this liveness analysis entirely by assuming some computed information about your CFG: where loops start and end. It also requires all loop blocks to be contiguous. Then it makes variables defined before a loop and used at any point inside the loop live for the whole loop. By having this information available, it folds the liveness analysis into the live range building, which we’ll instead do separately in a moment. The Wimmer approach sounded complicated and finicky. Maybe it is, maybe it isn’t. So I went with a dataflow liveness analysis instead. If it turns out to be the slow part, maybe it will matter enough to learn about this loop tagging method.

For now, we will pick a schedule for the control-flow graph. In order to build live ranges, you have to have some kind of numbering system for your instructions, otherwise a live range’s start and end are meaningless. We can write a function that fixes a particular block order (in this case, reverse post-order) and then assigns each block and instruction a number in a linear sequence. You can think of this as flattening or projecting the graph: A couple of interesting things to note: Even though we have extra instructions, it looks very similar to the example in the Wimmer2010 paper. Since we’re not going to be messing with the order of the instructions within a block anymore, all we have to do going forward is make sure that we iterate through the blocks in that order.
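To make the dataflow part of this concrete, here is a minimal sketch of the backward liveness fixpoint. It assumes hypothetical Block helpers (successors, plus gen and kill bitsets stored as Ruby Integers) rather than this post’s actual classes:

```ruby
# A minimal sketch of backward liveness analysis. `gen`, `kill`, and
# `successors` are assumed helpers on a hypothetical Block class:
#   gen  = registers used in the block before any (re)definition
#   kill = registers defined in the block
def compute_live_in(blocks)
  live_in = Hash.new(0) # block => bitset of virtual register numbers
  changed = true
  while changed
    changed = false
    blocks.reverse_each do |block|
      # live-out of a block is the union of its successors' live-in sets
      live_out = block.successors.reduce(0) { |acc, succ| acc | live_in[succ] }
      # dataflow equation: live_in = gen | (live_out & ~kill)
      new_live_in = block.gen | (live_out & ~block.kill)
      if new_live_in != live_in[block]
        live_in[block] = new_live_in
        changed = true
      end
    end
  end
  live_in
end
```

Iterating the blocks in reverse just makes the fixpoint converge in fewer passes; the result is the same either way.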
Finally, we have all that we need to compute live ranges. We’ll more or less copy the algorithm to compute live ranges from the Wimmer2010 paper. We’ll have two main differences: I know I said we were going to be computing live ranges. So why am I presenting you with a function called ? That’s because early in the history of linear scan (Traub1998!), people moved from having a single range for a particular virtual register to having multiple disjoint ranges. This collection of multiple ranges is called an interval and it exists to free up registers in the context of branches.

For example, in our IR snippet (above), R12 is defined in B2 as a block parameter, used in B3, and then not used again until some indeterminate point in B4. (Our example uses it immediately in an add instruction to keep things short, but pretend the second use is some time away.) The Wimmer2010 paper creates a lifetime hole between 28 and 34, meaning that the interval for R12 (called i12) is . Interval holes are not strictly necessary—they exist to generate better code. So for this post, we’re going to start simple and assume 1 interval == 1 range. We may come back later and add additional ranges, but that will require some fixes to our later implementation. We’ll note where we think those fixes should happen.

Anyway, here is the mostly-copied annotated implementation of BuildIntervals from the Wimmer2010 paper: Another difference is that since we’re using block parameters, we don’t really have this thing. That’s just the block argument. The last difference is that since we’re skipping the loop liveness hack, we don’t modify a block’s set as we iterate through instructions.

I know we said we’re building live ranges, so our class only has one on it. This is Ruby’s built-in range, but it’s really just being used as a tuple of integers here. Note that there’s some implicit behavior happening here: For example, if we have and someone calls , we end up with . There’s no gap in the middle. And if we have and someone calls , we end up with .

After figuring out from scratch some of these assumptions about what the interval/range API should and should not do, Aaron and I realized that there was some actual code for in a different, earlier paper: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer. Unfortunately, many other versions of this PDF look absolutely horrible (like bad OCR) and I had to do some digging to find the version linked above.

Finally we can start thinking about doing some actual register assignment. Let’s return to the 90s. Because we have faithfully kept 1 interval == 1 range, we can re-use the linear scan algorithm from Poletto1999 (which looks, at a glance, to be the same as 1997). I recommend looking at the PDF side by side with the code. We have tried to keep the structure very similar. Note that unlike in many programming languages these days, in the algorithm description represents a set, not a (hash-)map. In our Ruby code, we represent as an array: Internalizing this took us a bit. It is mostly a three-state machine: We would like to come back to this and incrementally modify it as we add lifetime holes to intervals.
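To give a feel for the shape of that state machine, here is a minimal sketch of Poletto-style linear scan under our 1 interval == 1 range assumption. The Interval class (with start and stop accessors for the [start, stop) range) and the string stack-slot names are hypothetical stand-ins, not this post’s actual code:

```ruby
# A rough sketch of Poletto1999-style linear scan. `intervals` is assumed to
# be sorted by increasing start point, and we assume at least one physical
# register. Each interval gets exactly one location, forever.
def linear_scan(intervals, num_registers)
  assignment = {}                 # interval => physical register or stack slot
  active = []                     # currently-live intervals, sorted by end point
  free = (0...num_registers).to_a # pool of free physical registers
  next_stack_slot = 0

  intervals.each do |interval|
    # ExpireOldIntervals: drop active intervals that ended before this one
    # starts and hand their registers back to the free pool.
    expired, active = active.partition { |old| old.stop <= interval.start }
    expired.each { |old| free.push(assignment[old]) }

    if free.empty?
      # SpillAtInterval: the classic heuristic spills whichever interval ends last.
      spill = active.max_by(&:stop)
      if spill.stop > interval.stop
        assignment[interval] = assignment[spill]      # steal the register
        assignment[spill] = "stack#{next_stack_slot}" # spilled for its whole life
        active = active - [spill] + [interval]
      else
        assignment[interval] = "stack#{next_stack_slot}"
      end
      next_stack_slot += 1
    else
      assignment[interval] = free.shift
      active << interval
    end
    active.sort_by!(&:stop)
  end
  assignment
end
```

Here active plays the role of the paper’s active list, kept sorted by increasing end point so that expiring old intervals stays cheap.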
I finally understood, very late in the game, that Poletto1999 linear scan assigns only one location per virtual register. Ever. It’s not that every virtual register gets a shot in a register and then gets moved to a stack slot—that would be interval splitting, and hopefully we get to that later—if a register gets spilled, it’s in a stack slot from beginning to end. I only found this out accidentally after trying to figure out a bug (that wasn’t a bug) due to a lovely sentence in Optimized Interval Splitting in a Linear Scan Register Allocator (PDF, 2005) by Wimmer and Mössenböck: However, it cannot deal with lifetime holes and does not split intervals, so an interval has either a register assigned for the whole lifetime, or it is spilled completely.

In particular, it is not possible to implement the algorithm without reserving a scratch register: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately.

Let’s take a look at the code snippet again. Here it is before register allocation, using virtual registers: Let’s run it through register allocation with incrementally decreasing numbers of physical registers available. We get the following assignments:

Some other things to note:

If you have a register free, choosing which register to allocate is a heuristic! It is tunable. There is probably some research out there that explores the space. In fact, you might even consider not allocating a register greedily. What might that look like? I have no idea.

Spilling the interval with the furthest endpoint is a heuristic! You can pick any active interval you want. In Register Spilling and Live-Range Splitting for SSA-Form Programs (PDF, 2009) by Braun and Hack, for example, they present the MIN algorithm, which spills the interval with the furthest next use. This requires slightly more information and takes slightly more time than the default heuristic but apparently generates much better code.

Also, block ordering? You guessed it. Heuristic.

Here is an example “slideshow” I generated by running linear scan with 2 registers. Use the arrow keys to navigate forward and backward in time 4.

At this point we have register assignments: we have a hash table mapping intervals to physical locations. That’s great but we’re still in SSA form: labelled code regions don’t have block arguments in hardware. We need to write some code to take us out of SSA and into the real world.

We can use a modified Wimmer2010 as a great starting point here. It handles more than we need to right now—interval splitting—but we can simplify. Because we don’t split intervals, we know that every interval live at the beginning of a block is either: For this reason, we only handle the second case in our SSA resolution. If we added lifetime holes or interval splitting, we would have to go back to the full Wimmer SSA resolution.

This means that we’re going to iterate over every outbound edge from every block. For each edge, we’re going to insert some parallel moves. This already looks very similar to the RESOLVE function from Wimmer2010. Unfortunately, Wimmer2010 basically shrugs this off with an “eh, it’s already in the literature” comment. What’s not made clear, though, is that this particular subroutine has been the source of a significant number of bugs in the literature. Only recently did some folks roll through and suggest (proven!) fixes: This sent us down a deep rabbit hole of trying to understand what bugs occur, when, and how to fix them.
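The most basic of these pitfalls is the swap problem: a parallel copy in which two locations exchange values cannot be lowered naively, one move at a time. A tiny illustration (a hypothetical example, not one of the test cases mentioned below):

```ruby
# The "swap problem" in miniature: the parallel copy {a <- b, b <- a} is
# supposed to happen "all at once".
moves = { a: :b, b: :a } # dest => src
regs  = { a: 1, b: 2 }

# Naive sequentialization clobbers b's old value before it is read:
moves.each { |dest, src| regs[dest] = regs[src] }
regs # => { a: 2, b: 2 }  (wrong; a correct lowering would give { a: 2, b: 1 })
```

A correct lowering has to notice the cycle and break it, for example with a temporary location.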
We implemented both the Leroy and the Boissinot algorithms. We found differences between Boissinot2009, Boissinot2010, and the SSA book implementation following those algorithms. We found Paul Sokolovsky’s implementation with bugfixes. We found Dmitry Stogov’s unmerged pull request to the same repository to fix another bug. We looked at Benoit Boissinot’s thesis again and emailed him some questions. He responded! And then he even put up an amended version of his algorithm in Rust with tests and fuzzing. All this is to say that this is still causing people grief and, though I understand page limits, I wish parallel moves were not handwaved away.

We ended up with this implementation, which passes all of the tests from Sokolovsky’s repository as well as the example from Boissinot’s thesis (though, as we discussed in the email, the example solution in the thesis is incorrect 5). Leroy’s algorithm, which is shorter, passes almost all the tests—in one test case, it uses one more temporary variable than Boissinot’s does. We haven’t spent much time looking at why.

Whatever algorithm you choose, you now have a way to parallel move some registers to some other registers. You have avoided the “swap problem”. That’s great. You can generate an ordered list of instructions from a tangled graph. But where do you put them? What about the “lost copy” problem? As it turns out, we still need to handle critical edge splitting.

Let’s consider what it means to insert moves at an edge between blocks when the surrounding CFG looks a couple of different ways. These are the four (really, three) cases we may come across.

In Case 1, if we only have two neighboring blocks A and B, we can insert the moves into either block. It doesn’t matter: at the end of A or at the beginning of B are both fine.

In Case 2, if A has two successors, then we should insert the moves at the beginning of B. That way we won’t be mucking things up for the edge .

In Case 3, if B has two predecessors, then we should insert the moves at the end of A. That way we won’t be mucking things up for the edge .

Case 4 is the most complicated. There is no extant place in the graph where we can insert moves. If we insert in A, we mess things up for . If we insert in , we mess things up for . Inserting in or doesn’t make any sense. What is there to do?

As it turns out, Case 4 is called a critical edge. And we have to split it. We can insert a new block E along the edge and put the moves in E! That way they still happen along the edge without affecting any other blocks. Neat. In Ruby code, that looks like: Adding a new block invalidates the cached block order and numbering, so we also need to recompute that. We could also avoid that by splitting critical edges earlier, before numbering. Then, when we arrive in , we can clean up branches to empty blocks! (See also Nick’s post on critical edge splitting, which also links to Faddegon’s thesis, which I should at least skim.)
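As a rough sketch of that splitting step, assuming hypothetical Block/CFG helpers (blocks, successors, predecessors, new_block, append_jump, replace_successor) rather than this post’s actual classes:

```ruby
# A hedged sketch of critical edge splitting; every helper used here is a
# hypothetical stand-in for whatever the real CFG API looks like.
def split_critical_edges(cfg)
  cfg.blocks.each do |pred|
    pred.successors.dup.each do |succ|
      # Case 4: the predecessor has multiple successors AND the successor has
      # multiple predecessors, so there is nowhere safe to put the moves.
      next unless pred.successors.size > 1 && succ.predecessors.size > 1
      edge_block = cfg.new_block
      edge_block.append_jump(succ)             # the new block only jumps to succ
      pred.replace_successor(succ, edge_block) # assumed to fix up both directions
    end
  end
end
```

The parallel moves for that edge then go into the new block, so nothing else in the CFG is disturbed.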
And that’s it, folks. We have gone from virtual registers in SSA form to physical locations. Everything’s all hunky-dory. We can just turn these LIR instructions into their very similar looking machine equivalents, right? Not so fast… You may have noticed that the original linear scan paper does not mention calls or other register constraints. I didn’t really think about it until I wanted to make a function call.

The authors of later linear scan papers definitely noticed, though; Wimmer2005 writes the following about Poletto1999: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately. Fun.

We will start off by handling calls and method parameters separately, we will note that it’s not amazing code, and then we will eventually implement the later papers, which handle register constraints more naturally. We’ll call this new function after register allocation but before SSA resolution. We do it after register allocation so we know where each virtual register goes but before resolution so we can still inspect the virtual register operands. Its goal is to do a couple of things: We’ll also remove the operands since we’re placing them in special registers explicitly now. (Unfortunately, this sidesteps handling the less-fun bit of calls in ABIs where, after the 6th parameter, they are expected on the stack. It also completely ignores ABI size constraints.)

Now, you may have noticed that we don’t do anything special for the incoming params of the function we’re compiling! That’s another thing we have to handle. Thankfully, we can handle it with yet another parallel move (wow!) at the end of . Again, this is yet another kind of thing where some of the later papers have much better ergonomics and also much better generated code.

But this is really cool! If you have arrived at this point with me, we have successfully made it to 1997 and that is nothing to sneeze at. We have even adapted research from 1997 to work with SSA, avoiding several significant classes of bugs along the way.

We have just built an enormously complex machine. Even out of the gate, with the original linear scan, there is a lot of machinery. It’s possible to write tests that spot-check sample programs of all shapes and sizes, but it’s very difficult to anticipate every possible edge case that will appear in the real world. Even if the original algorithm you’re using has been proven correct, your implementation may have subtle bugs due to (for example) having slightly different invariants or even transcription errors. We have all these proof tools at our disposal: we can write an abstract interpreter that verifies properties of one graph, but it’s very hard (impossible?) to scale that to sets of graphs. Maybe that’s enough, though.

In one of my favorite blog posts, Chris Fallin writes about writing a register allocation verifier based on abstract interpretation. It can verify one concrete LIR function at a time. It’s fast enough that it can be left on in debug builds. This means that a decent chunk of the time (tests, CI, maybe a production cluster) we can get a very clear signal that every register assignment that passes through the verifier satisfies some invariants.

Furthermore, we are not limited to Real World Code. With the advent of fuzzing, one can imagine an always-on fuzzer that tries to break the register allocator. A verifier can then catch bugs that come from exploring this huge search space. Some time after finding Chris’s blog post, I also stumbled across the very same thing in V8! I find this stuff so cool. I’ll also mention Boissinot’s Rust code again because it does something similar for parallel moves.

It’s possible to do linear scan allocation in reverse, at least on traces without control-flow.
See for example The Solid-State Register Allocator, the LuaJIT register allocator, and Reverse Linear Scan Allocation is probably a good idea. By doing linear scan this way, it is also possible to avoid computing liveness and intervals. I am not sure if this works on programs with control-flow, though.

We built a register allocator that works on SSA. Hopefully next time we will add features such as lifetime holes, interval splitting, and register hints. The full Ruby code listing is available under the Apache 2 license. UPDATE: See the post on lifetime holes.

Thanks to Waleed Khan and Iain Ireland for giving feedback on this post.

It’s not just about registers, either. In 2016, Facebook engineer Dave legendarily used linear-scan register allocation to book meeting rooms. ↩

Well. As I said on one of the social media sites earlier this year, “All AOT compilers are alike; each JIT compiler is fucked up in its own way.”

Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer notes: Our allocator relies on static single assignment form, which simplifies data flow analysis and tends to produce short live intervals. Register allocation for programs in SSA-form (PDF, 2006) by Hack, Grund, and Goos notes that interference graphs for SSA programs are chordal and can be optimally colored in quadratic time. SSA Elimination after Register Allocation (PDF, 2008) by Pereira and Palsberg notes: One of the main advantages of SSA based register allocation is the separation of phases between spilling and register assignment. Cliff Click (private communication, 2025) notes: It’s easier. Got it already, why lose it […] spilling always uses use/def and def/use edges.

This is inspired by Rasmus Andersson’s graph coloring visualization that I saw some years ago. ↩

The example in the thesis is to sequentialize the following parallel copy: The solution in the thesis is: but we think this is incorrect. Solving manually, Aaron and I got: which is what the code gives us, too. ↩