Max Bernstein 3 weeks ago

Walking around the compiler

Walking around outside is good for you. [citation needed] A nice amble through the trees can quiet inner turbulence and make complex engineering problems disappear. Vicki Boykis wrote a post, Walking around the app, about a more proverbial stroll. In it, she talks about constantly using your production application's interface to make sure the whole thing is cohesively designed with few rough edges. She also talks about walking around other parts of the implementation of the application, fixing inconsistencies, complex machinery, and broken builds. Kind of like picking up someone else's trash on your hike. That's awesome and universally good advice for pretty much every software project. It got me thinking about how I walk around the compiler.

There's a certain class of software project that transforms data—compression libraries, compilers, search engines—for which there's another layer of "walking around" you can do. You have the code, yes, but you also have non-trivial output. By non-trivial, I mean an output that scales along some quality axis instead of something semi-regular like a JSON response. For compression, it's size. For compilers, it's generated code. You probably already have some generated cases checked into your codebase as tests. That's awesome. I think golden tests are fantastic for correctness and for helping people understand. But this isolated understanding may not scale to more complex examples. How does your compiler handle, for example, switch-case statements in loops? Does it do the jump threading you expect it to? Maybe you're sitting there idly wondering while you eat a cookie, but maybe that thought would only have occurred to you while you were scrolling through the optimizer.

Say you are CF Bolz-Tereick and you are paging through PyPy IR. You notice some IR that looks suspicious. "Huh", you say to yourself, "surely the optimizer can reason that the second operation is redundant!" But some quirk in your optimizer means that it does not. Maybe it used to work, or maybe it never did. But this little stroll revealed a bug with a quick fix (adding a new peephole optimization function). Now, thankfully, your IR looks much better, and you can check this in as a tidy test case. Fun fact: this was my first exposure to the PyPy project. CF walked me through fixing this bug [1] live at ECOOP 2022! I had a great time.

If checking (and, later, testing) your assumptions is tricky, this may be a sign that your library does not expose enough of its internal state to developers. This may present a usability impediment that prevents you from immediately checking your assumptions or suspicions. For an excellent source of inspiration, see Kate's tweets about program internals. Even if it does provide a flag to print to the console, maybe this is hard to run from a phone [2] or a friend's computer. For that, you may want friendlier tools.

The right kind of tool invites exploration. Matt Godbolt built the first friendly compiler explorer tool I used, the Compiler Explorer ("Godbolt"). It lets you input programs in many different languages into your web browser and immediately see the compiled result. It will even execute your programs, within reason. This is a powerful combination, and it lowers the barrier to checking things tremendously. Now, sometimes you want the reverse: a Compiler Explorer-like thing in your terminal or editor so you don't have to break flow. I unfortunately have not found a comparable tool.
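For flavor, here's the general shape of such a peephole rule—a made-up sketch in Python, not PyPy's actual code or API, since the original IR snippets didn't survive extraction:

```python
# A made-up IR and a made-up rule: applying an idempotent operation to
# the result of the same operation is redundant, so reuse the first result.
def peephole_idempotent(ops):
    replacements = {}  # maps a dropped op to the existing value it equals
    out = []
    for op in ops:
        op.args = [replacements.get(a, a) for a in op.args]
        src = op.args[0] if op.args else None
        if src is not None and getattr(src, "name", None) == op.name:
            replacements[op] = src  # drop op; later uses see src instead
        else:
            out.append(op)
    return out
```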
In addition to the immediate effects of being able to spot-check certain inputs and outputs, continued use of these tools builds long-term intuition about the behavior of the compiler. It builds mechanical sympathy. I haven't written a lot about mechanical sympathy other than my grad school statement of purpose (PDF) and a few brief internet posts, so I will leave you with that for now.

Your compiler likely compiles some applications and you can likely get access to the IR for the functions in that application. Scroll through every function's optimized IR. If there are too many, maybe the top N functions' IRs. See what can be improved. Maybe you will see some unexpected patterns. Even if you don't notice anything in May, that could shift by August because of compiler advancements or a cool paper that you read in the intervening months. One time I found a bizarre reference counting bug that was causing copy-on-write and potential memory issues by noticing that some objects that should have been marked "immortal" in the IR were actually being refcounted. The bug was not in the compiler, but far away in application setup code—and yet it was visible in the IR.

My conclusion is similar to Vicki's. Put some love into your tools. Your colleagues will notice. Your users will notice. It might even improve your mood. Thank you to CF for feedback on the post.

1. The actual fix checks for the redundant pattern and rewrites it. ↩
2. Just make sure to log off and touch grass. ↩

Max Bernstein 1 month ago

Linear scan with lifetime holes

In my last post, I explained a bit about how to retrofit SSA onto the original linear scan algorithm. I went over all of the details for how to go from low-level IR to register assignments—liveness analysis, scheduling, building intervals, and the actual linear scan algorithm. Basically, we made it to 1997 linear scan, with small adaptations for allocating directly on SSA. This time, we're going to retrofit lifetime holes.

Lifetime holes come into play because a linearized sequence of instructions is not a great proxy for storing or using metadata about a program originally stored as a graph. According to Linear Scan Register Allocation on SSA Form (PDF, 2010): the lifetime interval of a virtual register must cover all parts where this register is needed, with lifetime holes in between. Lifetime holes occur because the control flow graph is reduced to a list of blocks before register allocation. If a register flows into an else-block, but not into the corresponding if-block, the lifetime interval has a hole for the if-block.

Lifetime holes come from Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. Figure 1, though not in SSA form, is a nice diagram for understanding how lifetime holes may occur. Unfortunately, the paper contains a rather sparse plaintext description of their algorithm that I did not understand how to apply to my concrete allocator. Thankfully, other papers continued this line of research in (at least) 2002, 2005, and 2010. We will piece snippets from those papers together to understand what's going on.

Let's take a look at the sample IR snippet from Wimmer2010 to illustrate how lifetime holes form. Virtual register R12 is not used between position 28 and 34. For this reason, Wimmer's interval building algorithm assigns it an interval with two disjoint ranges and space in the middle. Our simplified interval building algorithm from last time gave us—in the same notation—one contiguous interval (well, in our modified snippet). This simplified interval only supports one range with no lifetime holes. Ideally we would be able to use the physical register assigned to R12 for another virtual register in this empty slot! For example, maybe R14 or R15, which have short lifetimes that completely fit into the hole.

Another example is a control-flow diamond. In this example, B1 jumps to either B3 or B2, which then merge at B4. Virtual register R0 is defined in B1 and only used in one of the branches, B3. It's also not used in B4—if it were used in B4, it would be live in both B2 and B3! Once we schedule it, the need for lifetime holes becomes more apparent: since B2 gets scheduled (in this case, arbitrarily) before B3, there's a gap where R0—which is completely unused in B2—would otherwise take up space in our simplified interval form. Let's fix that by adding some lifetime holes.

Even though we are adding some gaps between ranges, each interval still gets assigned one location for its entire life. It's just that in the gaps, we get to put other smaller intervals, like lichen growing between bricks. To get lifetime holes, we have to modify our interval data structure a bit. Our interval currently only supports a single range. We can change this to support multiple ranges by changing just one character!!! Har har. Okay, so we now have an array of ranges instead of just a single range. But now we have to implement the methods differently (a sketch of the new data structure is below).
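Here's roughly what that looks like in Ruby. This is a sketch with names of my choosing, not necessarily the post's exact code; the one-character change is pluralizing the range field:

```ruby
class Interval
  attr_reader :ranges  # was a single Range; now sorted, disjoint Ranges

  def initialize
    @ranges = []
  end

  # The interval still starts at its first definition and stops at its
  # last use; the holes live between the ranges.
  def start = @ranges.first.begin
  def stop  = @ranges.last.end

  # A position is covered only if some range contains it; positions in
  # a hole are free for other intervals.
  def covers?(pos) = @ranges.any? { |r| r.cover?(pos) }

  # Two intervals conflict only if some pair of their ranges overlaps.
  def overlaps?(other)
    @ranges.any? do |r|
      other.ranges.any? { |s| r.begin < s.end && s.begin < r.end }
    end
  end
end
```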
The start state of an interval is an empty array of ranges. Because we're iterating backwards through the blocks and backwards through instructions in each block, we'll be starting with instruction 38 and working our way linearly backwards until 16. This means that we'll see later uses before earlier uses, and uses before defs. In order to keep the array in sorted order, we need to add each new range to the front. This is O(n) in an array, so use a deque or linked list. (Alternatively, push to the end and then reverse them afterwards.) We keep the ranges in sorted order because it makes keeping them disjoint easier, as we'll see. The range-adding method has a couple more cases than before, but we'll go through them step by step.

First, a quick check that the range is the right way 'round. Then we have a straightforward case: if we don't have any ranges yet, add a brand new one. But if we do have ranges, this new range might be totally subsumed by the existing first range. This happens if a virtual register is live for the entirety of a block and also used inside that block; the extra uses don't add any new information. Another case is that the new range has a partial overlap with the existing first range. This happens when we're adding ranges for all of the live-out virtual registers; the range for the predecessor block will abut the range for the successor block, and we merge these ranges into one big range. The last case is the case that gives us lifetime holes, and happens when the new range is already completely disjoint from the existing first range. That is also a straightforward case: put the new range in at the start of the list.

This is all fine and good. I added this to the register allocator to test out the lifetime hole finding but kept the rest of the allocator the same (changed the APIs slightly so the interval could pretend it was still one big range). The tests passed. Neat! I also verified that the lifetime holes were what we expected. This means our interval-building function works unmodified with the new implementation. That makes sense, given that we copied the implementation off of Wimmer2010, which can deal with lifetime holes.

Now we would like to use this new information in the register allocator. It took a little bit of untangling, but the required modifications to support lifetime holes in the register assignment phase are not too invasive. To get an idea of the difference, I took the original Poletto1999 (PDF) algorithm and rewrote it in the style of the Mössenböck2002 (PDF) algorithm. In the reformatted version, the implicit sets that don't get names in Poletto1999 now get names, and one helper is inlined and gets a new name.

Now we can pick out all of the bits of Mössenböck2002 that look like they are responsible for dealing with lifetime holes. For example, the algorithm now has a fourth set, inactive. This set holds intervals that have holes that contain the current interval's start position. These intervals are assigned registers that are potential candidates for the current interval to live in (more on this in a sec). I say potential candidates because in order for them to be a home for the current interval, an inactive interval has to be completely disjoint from the current interval. If they overlap at all—in any of their ranges—then we would be trying to put two virtual registers into one physical register at the same program point. That's a bad compile.
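In sketch form (my names, not the paper's pseudocode), the register-availability check becomes something like this:

```ruby
# A register is available to `cur` if no active interval holds it and
# every inactive interval holding it is fully disjoint from `cur`
# (i.e. `cur` fits entirely inside that interval's holes).
def available_registers(cur, active, inactive, all_registers, location)
  taken = active.map { |interval| location[interval] }
  inactive.each do |interval|
    taken << location[interval] if interval.overlaps?(cur)
  end
  all_registers - taken
end
```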
We have to do a little extra bookkeeping when spilling because now one physical register can be assigned to more than one interval that is still in the middle of being processed (the active and inactive sets). If we choose to spill, we have to make sure that all conflicting uses of the register (intervals that overlap with the current interval) get reassigned locations. Note that this begins to depart from strictly linear (time) linear scan: the inactive set is bounded not by the number of physical registers but instead by the number of virtual registers. Mössenböck2002 notes that the size of the set is generally very small, though, so "linear in practice".

EDIT: After re-reading Wimmer2010, I noticed that they say: "[…] introduced non-linear parts. Two of them are highlighted in Figure 6 where the set of inactive intervals is iterated. The set can contain an arbitrary number of intervals since it is not bound by the number of physical registers. Testing the current interval for intersection with all of them can therefore be expensive. When the lifetime intervals are created from code in SSA form, this test is not necessary anymore: All intervals in inactive start before the current interval, so they do not intersect with the current interval at their definition. They are inactive and thus have a lifetime hole at the current position, so they do not intersect with the current interval at its definition. SSA form therefore guarantees that they never intersect [7], making the entire loop that tests for intersection unnecessary. Unfortunately, splitting of intervals leads to intervals that no longer adhere to the SSA form properties because it destroys SSA form. Therefore, the intersection test cannot be omitted completely; it must be performed if the current interval has been split off from another interval." Which indicates to me that we may actually be able to leave off that loop over the inactive intervals after all? Unclear. I'll have to come back and mess with this later.

I left out the parts about register weights that are heuristics to improve register allocation. They are not core to supporting lifetime holes. You can add them back in if you like. A text diff between the two versions makes it clear what changed; this reformatting and diffing made it much easier for me to reason about what specifically had to be changed.

There's just one thing left after register assignment: resolution and SSA deconstruction. I'm pretty sure we can actually just keep the resolution the same. In our resolution function, we are only making sure that the block arguments get parallel-moved into the block parameters. That hasn't changed. Wimmer2010 says: "Linear scan register allocation with splitting of lifetime intervals requires a resolution phase after the actual allocation. Because the control flow graph is reduced to a list of blocks, control flow is possible between blocks that are not adjacent in the list. When the location of an interval is different at the end of the predecessor and at the start of the successor, a move instruction must be inserted to resolve the conflict." That's great news for us: we don't do splitting. An interval, though it has lifetime holes, still only ever has one location for its entire life. So once an interval begins, we don't need to think about moving its contents. So I was actually overly conservative in the previous post, which I have amended! Mössenböck2002 also tackles register constraints with this notion of "fixed intervals"—intervals that have been pre-allocated physical registers.
Since I eventually want to use "register hinting" from Wimmer2005 and Wimmer2010, I'm going to ignore the fixed interval part of Mössenböck2002 for now. It seems like they would work nicely together. We added lifetime holes to our register allocator without too much effort. This better maps the graph-like nature of the IR onto the linear sequence of instructions and should get us better allocation for short-lived virtual registers. Maybe next time we will add interval splitting, which will help us a) address ABI constraints more cleanly in function calls and b) remove the dependence on reserving a scratch register.

Max Bernstein 2 months ago

Linear scan register allocation on SSA

Much of the code and education that resulted in this post happened with Aaron Patterson. The fundamental problem in register allocation is to take an IR that uses virtual registers (as many as you like) and rewrite it to use a finite number of physical registers and stack space [1]. Consider a code snippet using virtual registers and the same snippet after it has been passed through a register allocator (note that Rs changed to Ps): each virtual register gets assigned a physical place—R1 to the stack, R2 to P0, R3 to P1, and R4 also to P0 (since we weren't using R2 anymore). (A made-up reconstruction of such a before/after pair appears at the end of this section.) People use register allocators like they use garbage collectors: it's an abstraction that can manage your resources for you, maybe with some cost. When writing the back-end of a compiler, it's probably much easier to have a separate register-allocator-in-a-box than manually managing variable lifetimes while also considering all of your different target architectures.

How do JIT compilers do register allocation? Well, "everyone knows" that "every JIT does its own variant of linear scan" [2]. This bothered me for some time because I've worked on a couple of JITs and still didn't understand the backend bits. There are a couple different approaches to register allocation, but in this post we'll focus on linear scan on SSA. I started reading Linear Scan Register Allocation on SSA Form (PDF, 2010) by Wimmer and Franz after writing A catalog of ways to generate SSA. Reading alone didn't make a ton of sense—I ended up with a lot of very frustrated margin notes. I started trying to implement it alongside the paper. As it turns out, though, there is a rich history of papers in this area that it leans on really heavily. I needed to follow the chain of references! For example, there is a lovely explanation of the process, start to finish, in Christian Wimmer's Master's thesis (PDF, 2004). There it is, all laid out at once. It's very refreshing when compared to all of the compact research papers. I didn't realize that there were more than one or two papers on linear scan. So this post will also incidentally serve as a bit of a survey or a history of linear scan—as best as I can figure it out, anyway. If you were in or near the room where it happened, please feel free to reach out and correct some parts.

Throughout this post, we'll use an example SSA code snippet from Wimmer2010, adapted from phi-SSA to block-argument-SSA. Wimmer2010's code snippet is between the arrows and we add some filler (as alluded to in the paper). Virtual registers start with R and are defined either with an arrow or by a block parameter. Because it takes a moment to untangle the unfamiliar syntax and draw the control-flow graph by hand, I've also provided the same code in graphical form. Block names (and block parameters) are shaded with grey. We have one entry block that is implied in Wimmer2010. Its only job is to define R10 and R11 for the rest of the CFG. Then we have a loop with an implicit fallthrough. Instead of doing that, we instead generate a conditional branch with explicit jump targets. This makes it possible to re-order blocks as much as we like. The contents of the last block are also just to fill in the blanks from Wimmer2010 and add some variable uses. Our goal for the post is to analyze this CFG, assign physical locations (registers or stack slots) to each virtual register, and then rewrite the code appropriately. For now, let's rewind the clock and look at how linear scan came about.
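Since the original before/after snippets didn't survive, here is an illustrative stand-in consistent with the assignments described above (the instructions themselves are made up):

```
; before                      ; after (R1 -> stack, R2 -> P0, R3 -> P1, R4 -> P0)
R1 <- 1                       Stack(0) <- 1
R2 <- 2                       P0 <- 2
R3 <- add R2, 3               P1 <- add P0, 3
R4 <- add R3, R1              P0 <- add P1, Stack(0)
ret R4                        ret P0
```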
Linear scan register allocation (LSRA) has been around for a while. It's neat because it does the actual register assignment part of register allocation in one pass over your low-level IR. (We'll talk more about what that means in a minute.) It first appeared in the literature in tcc: A System for Fast, Flexible, and High-level Dynamic Code Generation (PDF, 1997) by Poletto, Engler, and Kaashoek. (Until writing this post, I had never seen this paper. It was only on a re-read of the 1999 paper (below) that I noticed it.) In this paper, they mostly describe a staged variant of C called 'C (TickC), for which a fast register allocator is quite useful. Then came a paper called Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. It adds some optimizations (lifetime holes, binpacking) to the algorithm presented in Poletto1997. Then came the first paper I read, and I think the paper everyone refers to when they talk about linear scan: Linear Scan Register Allocation (PDF, 1999) by Poletto and Sarkar. In this paper, they give a fast alternative to graph coloring register allocation, especially motivated by just-in-time compilers. In retrospect, it seems to be a bit of a rehash of the previous two papers.

Linear scan (1997, 1999) operates on live ranges instead of virtual registers. A live range is a pair of integers [start, end) (end is exclusive) that begins when the register is defined and ends when it is last used. This means that there is an assumption that the order of instructions in your program has already been fixed into a single linear sequence! It also means that you have given each instruction a number that represents its position in that order. This may or may not be a surprising requirement depending on your compilers background. It was surprising to me because I often live in control flow graph fantasy land where blocks are unordered and instructions sometimes float around. But if you live in a land of basic blocks that are already in reverse post order, then it may be less surprising.

In non-SSA-land, these live ranges are different from the virtual registers: they represent some kind of lifetimes of each version of a virtual register. For example, consider a snippet with two definitions of the same variable: each definition lives for a different amount of time, and in fact the two ranges can be completely disjoint. It wouldn't make sense for the register allocator to operate on whole variables, because there's no reason the two definitions should necessarily live in the same physical register. In SSA land, it's a little different: since each virtual register only has one definition (by, uh, definition), live ranges are an exact 1:1 mapping with virtual registers. We'll focus on SSA for the remainder of the post because this is what I am currently interested in. The research community seems to have decided that allocating directly on SSA gives more information to the register allocator [3].

Linear scan starts at the point in your compiler process where you already know these live ranges—where you have already done some kind of analysis to build that mapping. In this blog post, we're going to back up to the point where we've just built our SSA low-level IR and have yet to do any work on it. We'll do all of the analysis from scratch. Part of this analysis is called liveness analysis.
The result of liveness analysis is a mapping that tells you which virtual registers (remember, since we're in SSA, instruction==vreg) are alive (used later) at the beginning of each basic block. This is called a live-in set. We compute liveness by working backwards: a variable is live from the moment it is backwardly-first used until its definition. In our example, at the end of B2, nothing is live. If we step backwards to the last instruction, we see a use: R16 becomes live. If we step once more, we see its definition—R16 is no longer live—but now we see a use of R14 and R15, which become live. This leaves us with R14 and R15 being live-in to B2. This live-in set becomes B1's live-out set because B1 is B2's predecessor. We continue in B1.

We could continue backwards linearly through the blocks. In fact, I encourage you to do it as an exercise. You should have a (potentially empty) set of registers per basic block. It gets more interesting, though, when we have branches: what does it mean when two blocks' live-in results merge into their shared predecessor? If we have two blocks A and B that are successors of a block C, the live-in sets get unioned together. That is, if there were some register R0 live-in to B and some register R1 live-in to A, both R0 and R1 would be live-out of C. They may also be live-in to C, but that entirely depends on the contents of C.

Since the total number of virtual registers is nonnegative and is finite for a given program, it seems like a good lattice for an abstract interpreter. That's right, we're doing AI. In this liveness analysis, we store gen, kill, and live-in sets as bitsets, using some APIs conveniently available on Ruby's Integer class. Like most abstract interpretations, it doesn't matter what order we iterate over the collection of basic blocks for correctness, but it does matter for performance. In this case, iterating backwards converges much faster than forwards. We could also use a worklist here, and it would be faster, but eh. Repeatedly iterating over all blocks is fine for now. (A sketch of the whole loop appears below.)

The Wimmer2010 paper skips this liveness analysis entirely by assuming some computed information about your CFG: where loops start and end. It also requires all loop blocks to be contiguous. Then it makes variables defined before a loop and used at any point inside the loop live for the whole loop. By having this information available, it folds the liveness analysis into the live range building, which we'll instead do separately in a moment. The Wimmer approach sounded complicated and finicky. Maybe it is, maybe it isn't. So I went with a dataflow liveness analysis instead. If it turns out to be the slow part, maybe it will matter enough to learn about this loop tagging method.

For now, we will pick a schedule for the control-flow graph. In order to build live ranges, you have to have some kind of numbering system for your instructions, otherwise a live range's start and end are meaningless. We can write a function that fixes a particular block order (in this case, reverse post-order) and then assigns each block and instruction a number in a linear sequence. You can think of this as flattening or projecting the graph. A couple of interesting things to note: even though we have extra instructions, it looks very similar to the example in the Wimmer2010 paper; and since we're not going to be messing with the order of the instructions within a block anymore, all we have to do going forward is make sure that we iterate through the blocks in that same order.
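Here's a compact sketch of the dataflow loop described above—structure mine, using Ruby Integers as bitsets, and assuming each block knows its gen set (upward-exposed uses), kill set (definitions), and successors:

```ruby
# Iterate to a fixed point, visiting blocks in reverse to converge faster.
# live_in maps each block to an Integer bitset; bit i set means Ri is live.
def compute_liveness(blocks)
  live_in = Hash.new(0)
  changed = true
  while changed
    changed = false
    blocks.reverse_each do |block|
      live_out = block.successors.reduce(0) { |acc, succ| acc | live_in[succ] }
      new_live_in = (live_out & ~block.kill) | block.gen
      if new_live_in != live_in[block]
        live_in[block] = new_live_in
        changed = true
      end
    end
  end
  live_in
end
```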
Finally, we have all that we need to compute live ranges. We'll more or less copy the algorithm to compute live ranges from the Wimmer2010 paper, with a couple of main differences. I know I said we were going to be computing live ranges. So why is the function called BuildIntervals? That's because early in the history of linear scan (Traub1998!), people moved from having a single range for a particular virtual register to having multiple disjoint ranges. This collection of multiple ranges is called an interval and it exists to free up registers in the context of branches. For example, in our IR snippet (above), R12 is defined in B2 as a block parameter, used in B3, and then not used again until some indeterminate point in B4. (Our example uses it immediately in an add instruction to keep things short, but pretend the second use is some time away.) The Wimmer2010 paper creates a lifetime hole between 28 and 34, meaning that the interval for R12 (called i12) has two disjoint ranges. Interval holes are not strictly necessary—they exist to generate better code. So for this post, we're going to start simple and assume 1 interval == 1 range. We may come back later and add additional ranges, but that will require some fixes to our later implementation. We'll note where we think those fixes should happen.

Anyway, we mostly copied the annotated implementation of BuildIntervals from the Wimmer2010 paper. Another difference is that since we're using block parameters, we don't really have the phi-input bookkeeping; that's just the block argument. The last difference is that since we're skipping the loop liveness hack, we don't modify a block's live set as we iterate through instructions. I know we said we're building live ranges, so our Interval class only has one range on it. This is Ruby's built-in Range, but it's really just being used as a tuple of integers here. Note that there's some implicit behavior happening here: adding a new range to an interval that already has one just widens the existing range—there's no gap in the middle—and adding a range that is already covered changes nothing. After figuring out from scratch some of these assumptions about what the interval/range API should and should not do, Aaron and I realized that there was some actual code for this in a different, earlier paper: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer. Unfortunately, many other versions of this PDF look absolutely horrible (like bad OCR) and I had to do some digging to find the version linked above.

Finally we can start thinking about doing some actual register assignment. Let's return to the 90s. Because we have faithfully kept 1 interval == 1 range, we can re-use the linear scan algorithm from Poletto1999 (which looks, at a glance, to be the same as 1997). I recommend looking at the PDF side by side with the code. We have tried to keep the structure very similar. Note that unlike in many programming languages these days, active in the algorithm description represents a set, not a (hash-)map. In our Ruby code, we represent active as an array. Internalizing this took us a bit. It is mostly a three-state machine. We would like to come back to this and incrementally modify it as we add lifetime holes to intervals. I finally understood, very late in the game, that Poletto1999 linear scan assigns only one location per virtual register. Ever.
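For reference, here is the Poletto1999 structure as a Ruby sketch (transcribed from the paper's pseudocode, with my own helper names like new_stack_slot):

```ruby
# Walk intervals in order of increasing start point; `active` holds
# intervals that currently occupy a register. One location per interval,
# forever: a spilled interval lives on the stack from beginning to end.
def linear_scan(intervals, num_registers)
  location = {}
  active = []
  free = (0...num_registers).to_a
  intervals.sort_by(&:start).each do |interval|
    # ExpireOldIntervals: retire actives that ended before this start.
    active, expired = active.partition { |old| old.stop > interval.start }
    expired.each { |old| free.push(location[old]) }
    if free.empty?
      # SpillAtInterval heuristic: spill whichever interval ends last.
      spill = active.max_by(&:stop)
      if spill.stop > interval.stop
        location[interval] = location[spill]  # steal its register
        location[spill] = new_stack_slot      # spill goes to the stack
        active.delete(spill)
        active << interval
      else
        location[interval] = new_stack_slot
      end
    else
      location[interval] = free.pop
      active << interval
    end
  end
  location
end
```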
It's not that every virtual register gets a shot in a register and then gets moved to a stack slot—that would be interval splitting, and hopefully we get to that later—if a register gets spilled, it's in a stack slot from beginning to end. I only found this out accidentally, after trying to figure out a bug (that wasn't a bug), due to a lovely sentence in Optimized Interval Splitting in a Linear Scan Register Allocator (PDF, 2005) by Wimmer and Mössenböck: "However, it cannot deal with lifetime holes and does not split intervals, so an interval has either a register assigned for the whole lifetime, or it is spilled completely. In particular, it is not possible to implement the algorithm without reserving a scratch register: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately."

Let's take a look at the code snippet again, before register allocation, using virtual registers, and run it through register allocation with incrementally decreasing numbers of physical registers available. Some other things to note: if you have a register free, choosing which register to allocate is a heuristic! It is tunable. There is probably some research out there that explores the space. In fact, you might even consider not allocating a register greedily. What might that look like? I have no idea. Spilling the interval with the furthest endpoint is a heuristic! You can pick any active interval you want. In Register Spilling and Live-Range Splitting for SSA-Form Programs (PDF, 2009) by Braun and Hack, for example, they present the MIN algorithm, which spills the interval with the furthest next use. This requires slightly more information and takes slightly more time than the default heuristic but apparently generates much better code. Also, block ordering? You guessed it. Heuristic. Here is an example "slideshow" I generated by running linear scan with 2 registers. Use the arrow keys to navigate forward and backward in time [4].

At this point we have register assignments: we have a hash table mapping intervals to physical locations. That's great, but we're still in SSA form: labelled code regions don't have block arguments in hardware. We need to write some code to take us out of SSA and into the real world. We can use a modified Wimmer2010 as a great starting point here. It handles more than we need to right now—interval splitting—but we can simplify. Because we don't split intervals, we know that every interval live at the beginning of a block either keeps its single location across the edge (nothing to do) or is a block parameter. For this reason, we only handle the second case in our SSA resolution. If we added interval splitting, we would have to go back to the full Wimmer SSA resolution. This means that we're going to iterate over every outbound edge from every block. For each edge, we're going to insert some parallel moves. This already looks very similar to the RESOLVE function from Wimmer2010. Unfortunately, Wimmer2010 basically shrugs this off with an "eh, it's already in the literature" comment. What's not made clear, though, is that this particular subroutine has been the source of a significant amount of bugs in the literature. Only recently did some folks roll through and suggest (proven!) fixes. This sent us down a deep rabbit hole of trying to understand what bugs occur, when, and how to fix them.
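To make the problem concrete, here is one simple way to sequentialize a parallel move—a sketch of the general idea, not Leroy's or Boissinot's exact algorithm:

```ruby
# moves: array of [src, dst] pairs with distinct dsts. Returns an ordered
# list of simple moves with the same effect as doing them all at once.
def sequentialize(moves)
  moves = moves.map(&:dup).reject { |src, dst| src == dst }
  out = []
  until moves.empty?
    i = moves.index { |_, dst| moves.none? { |src, _| src == dst } }
    if i
      out << moves.delete_at(i)  # dst isn't needed as a source: safe to emit
    else
      # Every dst is still needed as a source: only cycles remain. Save
      # one endangered value into a temporary, then redirect its readers.
      _, dst = moves.first
      out << [dst, :tmp]
      moves.each { |move| move[0] = :tmp if move[0] == dst }
    end
  end
  out
end

# A swap comes out as: y -> tmp, x -> y, tmp -> x.
# sequentialize([[:x, :y], [:y, :x]])
```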
We implemented both the Leroy and the Boissinot algorithms. We found differences between Boissinot2009, Boissinot2010, and the SSA book implementation following those algorithms. We found Paul Sokolovsky's implementation with bugfixes. We found Dmitry Stogov's unmerged pull request to the same repository to fix another bug. We looked at Benoit Boissinot's thesis again and emailed him some questions. He responded! And then he even put up an amended version of his algorithm in Rust with tests and fuzzing. All this is to say that this is still causing people grief and, though I understand page limits, I wish parallel moves were not handwaved away. We ended up with an implementation which passes all of the tests from Sokolovsky's repository as well as the example from Boissinot's thesis (though, as we discussed in the email, the example solution in the thesis is incorrect [5]). Leroy's algorithm, which is shorter, passes almost all the tests—in one test case, it uses one more temporary variable than Boissinot's does. We haven't spent much time looking at why.

Whatever algorithm you choose, you now have a way to parallel move some registers to some other registers. You have avoided the "swap problem". That's great. You can generate an ordered list of instructions from a tangled graph. But where do you put them? What about the "lost copy" problem? As it turns out, we still need to handle critical edge splitting. Let's consider what it means to insert moves at an edge between blocks when the surrounding CFG looks a couple of different ways. These are the four (really, three) cases we may come across. In Case 1, if we only have two neighboring blocks A and B, we can insert the moves into either block. It doesn't matter: at the end of A or at the beginning of B are both fine. In Case 2, if A has two successors, then we should insert the moves at the beginning of B. That way we won't be mucking things up for A's other outgoing edge. In Case 3, if B has two predecessors, then we should insert the moves at the end of A. That way we won't be mucking things up for B's other incoming edge. Case 4 is the most complicated. There is no existing place in the graph where we can insert moves. If we insert in A, we mess things up for A's other successor. If we insert in B, we mess things up for B's other predecessor. Inserting anywhere else doesn't make any sense. What is there to do?

As it turns out, such an edge is called a critical edge. And we have to split it. We can insert a new block E along the edge and put the moves in E! That way they still happen along the edge without affecting any other blocks. Neat. Adding a new block invalidates the cached block numbering, so we also need to recompute that. We could also avoid that by splitting critical edges earlier, before numbering. Then, when we arrive at code emission, we can clean up branches to empty blocks! (See also Nick's post on critical edge splitting, which also links to Faddegon's thesis, which I should at least skim.)

And that's it, folks. We have gone from virtual registers in SSA form to physical locations. Everything's all hunky-dory. We can just turn these LIR instructions into their very similar looking machine equivalents, right? Not so fast… You may have noticed that the original linear scan paper does not mention calls or other register constraints. I didn't really think about it until I wanted to make a function call.
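Before we get to calls, here's the gist of that case analysis in Ruby—a sketch standing in for the elided code, with method names of my choosing:

```ruby
# Decide where the moves for the edge pred -> succ go; split critical edges.
def move_insertion_block(pred, succ, cfg)
  if pred.successors.size <= 1
    pred                          # Cases 1 and 3: append moves at the end of pred
  elsif succ.predecessors.size <= 1
    succ                          # Case 2: prepend moves at the beginning of succ
  else
    # Case 4, a critical edge: make a new block E on the edge and put
    # the moves there. This invalidates any cached block numbering.
    edge_block = cfg.new_block
    pred.retarget_branch(from: succ, to: edge_block)
    edge_block.append_jump(succ)
    edge_block
  end
end
```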
The authors of later linear scan papers definitely noticed, though; Wimmer2005 writes the following about Poletto1999: "When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately." Fun. We will start off by handling calls and method parameters separately, we will note that it's not amazing code, and then we will eventually implement the later papers, which handle register constraints more naturally. We'll call this new function after register allocation but before SSA resolution. We do it after register allocation so we know where each virtual register goes, but before resolution so we can still inspect the virtual register operands. Its goal is to do a couple of things around call arguments and return values. We'll also remove the operands since we're placing them in special registers explicitly now. (Unfortunately, this sidesteps handling the less-fun bit of calls in ABIs where, after the 6th parameter, they are expected on the stack. It also completely ignores ABI size constraints.)

Now, you may have noticed that we don't do anything special for the incoming params of the function we're compiling! That's another thing we have to handle. Thankfully, we can handle it with yet another parallel move (wow!) at the end of the entry block. Again, this is yet another kind of thing where some of the later papers have much better ergonomics and also much better generated code. But this is really cool! If you have arrived at this point with me, we have successfully made it to 1997 and that is nothing to sneeze at. We have even adapted research from 1997 to work with SSA, avoiding several significant classes of bugs along the way.

We have just built an enormously complex machine. Even out of the gate, with the original linear scan, there is a lot of machinery. It's possible to write tests that spot-check sample programs of all shapes and sizes, but it's very difficult to anticipate every possible edge case that will appear in the real world. Even if the original algorithm you're using has been proven correct, your implementation may have subtle bugs due to (for example) having slightly different invariants or even transcription errors. We have all these proof tools at our disposal: we can write an abstract interpreter that verifies properties of one graph, but it's very hard (impossible?) to scale that to sets of graphs. Maybe that's enough, though. In one of my favorite blog posts, Chris Fallin writes about writing a register allocation verifier based on abstract interpretation. It can verify one concrete LIR function at a time. It's fast enough that it can be left on in debug builds. This means that a decent chunk of the time (tests, CI, maybe a production cluster) we can get a very clear signal that every register assignment that passes through the verifier satisfies some invariants. Furthermore, we are not limited to Real World Code. With the advent of fuzzing, one can imagine an always-on fuzzer that tries to break the register allocator. A verifier can then catch bugs that come from exploring this huge search space. Some time after finding Chris's blog post, I also stumbled across the very same thing in V8! I find this stuff so cool. I'll also mention Boissinot's Rust code again because it does something similar for parallel moves. It's possible to do linear scan allocation in reverse, at least on traces without control-flow.
See for example The Solid-State Register Allocator, the LuaJIT register allocator, and Reverse Linear Scan Allocation is probably a good idea. By doing linear scan this way, it is also possible to avoid computing liveness and intervals. I am not sure if this works on programs with control-flow, though.

We built a register allocator that works on SSA. Hopefully next time we will add features such as lifetime holes, interval splitting, and register hints. The full Ruby code listing is now available under the Apache 2 license. UPDATE: See the post on lifetime holes. Thanks to Waleed Khan and Iain Ireland for giving feedback on this post.

1. It's not just about registers, either. In 2016, Facebook engineer Dave legendarily used linear-scan register allocation to book meeting rooms. ↩
2. Well. As I said on one of the social media sites earlier this year, "All AOT compilers are alike; each JIT compiler is fucked up in its own way." ↩
3. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mössenböck and Pfeiffer notes: "Our allocator relies on static single assignment form, which simplifies data flow analysis and tends to produce short live intervals." Register allocation for programs in SSA-form (PDF, 2006) by Hack, Grund, and Goos notes that interference graphs for SSA programs are chordal and can be optimally colored in quadratic time. SSA Elimination after Register Allocation (PDF, 2008) by Pereira and Palsberg notes: "One of the main advantages of SSA based register allocation is the separation of phases between spilling and register assignment." Cliff Click (private communication, 2025) notes: "It's easier. Got it already, why lose it […] spilling always uses use/def and def/use edges." ↩
4. This is inspired by Rasmus Andersson's graph coloring visualization that I saw some years ago. ↩
5. The example in the thesis is to sequentialize a particular parallel copy. We think the solution given in the thesis is incorrect. Solving manually, Aaron and I got a different answer, which is what the code gives us, too. ↩

Max Bernstein 2 months ago

Liveness analysis with Datalog

After publishing Linear scan register allocation on SSA, I had a nice call with Waleed Khan where he showed me how to Datalog. He thought it might be useful to try implementing liveness analysis as a Datalog problem. We started off with the Wimmer2010 CFG example from that post, sketching out manually which variables were live out of each block: R10 out of B1, R12 out of B2, etc. The graph from Wimmer2010 has come back! Remember, we're using block arguments instead of phis, so the entry block defines R10 and R11 before the first instruction in B1.

Then we tried to formulate liveness as a Datalog relation. Liveness is normally (at least for me) defined in terms of two relations: live-in and live-out. Live-out is "what is needed" from all of the successors of a block and live-in is the "what is needed" summary for a block. So, in fake math notation:

live-out(B) = union of live-in(S) for each successor S of B
live-in(B) = (live-out(B) ∪ use(B)) − def(B)

where each of the component parts of that expression represents a set of variables. We ended up computing the live-in sets for blocks in the register allocator post but then using the live-out sets instead. So today let's compute both live-in and live-out sets with Datalog!

Datalog is a logic programming language. It probably looks and feels different from every other programming language you have used… except for maybe SQL. It might feel similar to SQL, except SQL has a certain order to it that Datalog does not. We'll be using Souffle here because Waleed mentioned it and also I learned a bit about it in my databases class. The thing you do first is define your relations, which is what Datalog calls a table. In this case, if we want to compute liveness information, we have to know information about what a block uses, defines, and what successors it has.

First, the thing you have to know about Datalog is that it's kind of like the opposite of array programming. We're going to express things about sets by expressing facts about individual items in a set. For example, we're not going to say "this block B4 uses [R10, R12, R16]". We're going to say three separate facts: "B4 uses R10", "B4 uses R12", "B4 uses R16". You can think about it like each relation being a database table where each parameter is a column name. We declare relations for block uses, block defs, and which blocks follow other blocks, where symbol means string. We can then embed some facts inline—for example, "A defines R0 and R1 and uses R0". You can also provide facts as a TSV, but this file format is so irritating to construct manually and has given me silently wrong answers in Souffle before, so I am not doing that for this example. You can, for your edification, manually encode all the use/def/successor facts from the previous post into Souffle; the sketch below shows the shape of the file.

We can declare our live-in and live-out relations similarly to our use/def/succ relations. We mark them as output so that Souffle presents us with the results. Now it's time to define our relations. You may notice that the Souffle definitions look very similar to our earlier definitions. This is no mistake; Datalog was created for dataflow and graph problems. We'll start with live-out. We read the rule left to right as "a variable v is live-out of block b if block s is a successor of b and v is live-in to s". The :- defines the left side in terms of the right side. The comma between the two clauses means it's a conjunction—an "and". Where's the union? Well, remember what I said about array programming? We're not thinking in terms of sets. We're thinking one program variable at a time.
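Here is a reconstruction of that Souffle file. The declarations and rules follow the prose exactly; the handful of example facts is abbreviated and made up rather than the post's full fact dump:

```
.decl use(b: symbol, v: symbol)
.decl def(b: symbol, v: symbol)
.decl succ(b: symbol, s: symbol)

// A few made-up facts; the real file encodes the whole Wimmer2010 CFG.
def("A", "R0"). def("A", "R1"). use("A", "R0").
succ("A", "B1").

.decl live_in(b: symbol, v: symbol)
.decl live_out(b: symbol, v: symbol)
.output live_in
.output live_out

// v is live-out of b if some successor s of b has v live-in.
live_out(b, v) :- succ(b, s), live_in(s, v).

// v is live-in to b if it is live-out of b or used in b, and not defined in b.
live_in(b, v) :- (live_out(b, v); use(b, v)), !def(b, v).
```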
As Souffle executes our relations, live_out will incrementally build up a table. It's also a little weird to program in this style because v wasn't textually defined anywhere like a parameter or a variable. You kind of have to think of v as a connector, a binder, a foreign key—what have you. It's a placeholder. (I don't know how to explain this well. Sorry.) Then we can define live-in. This on the surface looks more complicated, but I think that is only because of Souffle's choice of syntax. It reads as "a variable v is live-in to b if it is either live-out of b or used in b, and not defined in b". The semicolons are disjunctions ("or") and the exclamation points negations ("not"). These relations look endlessly mutually recursive, but you have to keep in mind that we're not running functions on data, exactly. We're declaratively expressing definitions of rules—relations. live_in in the body of the live_out rule is not calling a function but instead making a query: is the row in the live_in table? Datalog builds the tables until saturation.

Now we can run Souffle! We tell it to dump to standard output, but you could just as easily have it dump each output relation in its own separate file in the current directory. That's neat. We got nicely formatted tables and it only took us two lines of code! Compare that to our Ruby code from the previous post, which is much longer, to underscore the point: we have separated the iteration-to-fixpoint bit from the main bit of the dataflow analysis—the equation. If we let Datalog do the data movement for us, we can work on defining the rules, and only the rules. This is probably why, in the fullness of time, many static analysis and compiler tools end up growing some kind of embedded (partial) Datalog engine. Call it Scholz's tenth rule.

Souffle also has the ability to compile to C++, which gives you some nice things, including an API you can embed in a larger program. I don't have any experience with this API. This is when Waleed mentioned offhandedly that he had heard about some embedded Rust datalog called Ascent. The front page of the Ascent website is a really great sell if you show up thinking "gee, I wish I had Datalog to use in my Rust program". Right out the gate, you get reasonable-enough Datalog syntax via a proc macro; the front page shows the canonical path example, which ports directly from Souffle.

We weren't sure if the Souffle liveness would port cleanly to Rust, but it sure did! It even lets you use your own datatypes instead of just the integers that the front-page example uses. Notice how we don't have an input or output annotation like we did in Datalog. That's because this is designed to be embedded in an existing program, which probably doesn't want to deal with the disk (or at least wants to read/write in its own format). Ascent lets us give it some vectors of data and then at the end lets us read some vectors of data too. Then we need only build and run—both of which worked with zero issues—and see the results. It's not a fancy looking table, but it's very close to my program, which is neat. This is similar to embedding Souffle in C++ and then calling the C++. One difference, though, is the Souffle process has two steps. It's a slight build system complication.

But this isn't meant to be a Datalog comparison post! Can we model all of linear scan this way? Maybe. I'm new to all this stuff. Ascent also seems to support lattices, which means we can use it to do abstract interpretation on some cool domains. Maxime Chevalier-Boisvert and I prototyped loupe, an interprocedural type analysis in Rust. We had to build our own iterate-to-fixpoint engine, which was non-trivial.
I wonder how it would look to build something similar on top of Ascent. I kind of want to check out Frank McSherry's datatoad. That's all for now, folks. Just a couple Datalog snippets. Happy hacking.

Max Bernstein 2 months ago

Compiling a Lisp: Closure conversion

EDIT: /u/thunderseethe correctly points out that this is closure conversion, not lambda lifting, so I have adjusted the post title from "lambda lifting" to "closure conversion" accordingly. Thanks!

I didn't think this day would come, but I picked up the Ghuloum tutorial (PDF) again and I got a little bit further. There's just one caveat: I have rewritten the implementation in Python. It's available in the same repo in compiler.py. It's brief, coming in at a little over 300 LOC + tests (compared to the C version's 1200 LOC + tests). I guess there's another caveat, too, which is that the Python version has no S-expression reader. But that's fine: consider it an exercise for you, dear reader. That's hardly the most interesting part of the tutorial. Oh, and I also dropped the instruction encoding. I'm doing text assembly now. Womp womp.

Anyway, converting the lambdas as required in the paper requires three things. We have two forms that can bind variables: lambda and let. This means that we need to recognize the names in those special expressions and modify the environment. What environment, you ask? Well, I have this little class. We keep the same dict for the entire recursive traversal of the program, but we modify the bound set at each binding site and make a fresh free set only at lambdas.

To illustrate how they are used, let's fill in some simple sample expressions first. Well, okay, sure, we don't actually need to think about variable names when we are dealing with simple constants. So let's look at variables. We don't want to actually transform the variable uses, just add some metadata about their uses. If we have some variable bound by a lambda or a let expression, we can leave it alone. Otherwise, we need to mark it. There's one irritating special case here, which is that we don't want to consider (for example) a built-in operator as a free variable: it is a special language primitive. So we consider it and the other primitives as always bound.

Armed with this knowledge, we can do our first recursive traversal. Since these expressions have recursive parts and don't bind any variables, they are the second-simplest form for this converter. This test doesn't tell us much yet (other than producing an empty free set and not raising an exception). But it will soon.

Let's think about what lambda does. It's a bunch of features in a trench coat. To handle the closure conversion, we have to reason about all three. First, the lambda binds its parameters as new names. In fact, those are the only bound variables in a lambda. Consider a lambda that refers to a variable defined outside it: that variable is free in the lambda! We'll want to transform that lambda into a form that names its free variables explicitly. Even if the variable were bound by some let outside the lambda, it would still be free in the lambda. That means we don't thread the bound set through to the lambda body; we don't care what names are bound outside the lambda. We also want to keep track of the set of variables that are free inside the lambda: we'll need them to create a closure form. Therefore, we also pass in a new set to use as the lambda body's free set. So far, all of this environment wrangling gives us a lambda case like the sketch below. The free variables also propagate into the enclosing expression's free set, because any variable free in a lambda expression is also free in the current expression—well, except for the variables that are currently bound. Last, we'll make a code form and a closure form. The code gets appended to the global list with a new label, and the label gets threaded through to the closure. This is finicky! I think my first couple of versions were subtly wrong for different reasons. Tests help a lot here. For every place in the code where I mess with the bound or free sets in a recursive call, I tried to have a test that would fail if I got it wrong.
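Here's a sketch of that lambda case—my names, not necessarily the repo's code, but it follows the bookkeeping just described:

```python
def convert_lambda(params, body, bound, free, labels):
    # The lambda's parameters are its only bound variables; the body
    # gets a brand-new free set, independent of the enclosing scopes.
    body_free = set()
    new_body = convert(body, bound=set(params), free=body_free, labels=labels)
    # Anything free in the lambda is free in the enclosing expression,
    # unless the enclosing expression has it bound.
    free.update(body_free - bound)
    # Emit a code form under a fresh label and a closure that captures
    # the free variables; gensym_label is a hypothetical helper.
    label = gensym_label()
    labels[label] = ["code", list(params), sorted(body_free), new_body]
    return ["closure", label, *sorted(body_free)]
```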
Now let's talk about the other binder. Let's think about what let does by examining a confusing let expression: one where the same name appears twice. One of the occurrences is bound inside the let, but the other is free inside the let! This is because let evaluates all of its bindings without access to the bindings as they are being built up (for that, we would need let*). That determines how the bound and free sets thread through, which gives us the let case in code. Last, and somewhat boringly, we have function calls. The only thing to call out is again handling the always-bound primitive operators, which we don't want to treat as closure calls.

Now that we have these new code and closure forms, we have to compile them into assembly. Compiling closure forms is very similar to allocating a string or a vector. In the first cell, we want to put a pointer to the code that backs the closure (this will be some label in the assembly). We can get a reference to that label and write it to the heap. Then, for each free variable, we go find out where it's defined. Since we know by construction that these are all strings, we don't need to worry about having weird recursion issues around keeping track of a moving heap pointer. Instead, we know it's always going to be an indirect load from the stack or from the current closure. Then we write that to the heap. Then, since a closure is an object, we need to give it a tag; I picked the tag value because I felt cute, and you could use others. We store the result in the result register because that's our compiler contract. Last, we bump the heap pointer by the size of the closure.

One nicety of emitting text assembly is that I can add inline comments very easily: I have a little helper function that just prefixes a comment character. …wait, hold on, why are we reading from that odd offset for a closure variable? That doesn't make any sense, right? That's because while we are reading off the closure, we are reading from a tagged pointer. Since we know the index into the closure and also the tag at compile-time, we can fold them into one neat indirect load.

Now let's call some closures…! I'll start with direct calls to labels because they're a good stepping stone toward closure calls (nice job, Dr Ghuloum!). The main parts are argument passing and the call itself. I think in my last version (the C version) I did this recursively, because looping felt challenging to do neatly in C with the data structures I had built, but since this is Python and the wild west, we're looping. A lot of this carries over exactly to closure calls, with a couple differences. I think the stack adjustment math was far and away the most irritating thing to get right here. Oh, and also remembering to untag the closure when trying to call it.

Not bad for a 300 line compiler! I think that's all there is for today, folks. We got closures, free variable analysis, and indirect function calls. That's pretty good. Happy hacking!

Max Bernstein 2 months ago

How to use snprintf

The snprintf family of functions (snprintf, vsnprintf, …) has a little-known feature for telling you what size your buffer should be: call the function with a NULL buffer and a size of 0, and the return value is the number of characters it would have written. In cases where you don't have a fixed upper bound, this is really useful. I have a + 1 in the example because the man pages say:

The functions snprintf() and vsnprintf() write at most size bytes (including the terminating null byte ('\0')) to str.

If you like, check out this tiny header-only library I wrote a couple of years ago and promptly forgot about. Go forth and please stop manually computing buffer sizes.
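The post's example code didn't survive extraction; here is a reconstruction of the standard idiom it describes (the format string and names are mine):

```c
#include <stdio.h>
#include <stdlib.h>

char *greeting(const char *name) {
    /* First pass: NULL/0 makes snprintf just compute the length. */
    int size = snprintf(NULL, 0, "hello, %s!", name);
    if (size < 0) return NULL;
    /* +1 for the terminating null byte, which the count excludes. */
    char *buf = malloc(size + 1);
    if (buf == NULL) return NULL;
    snprintf(buf, size + 1, "hello, %s!", name);
    return buf;
}
```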

Max Bernstein 3 months ago

ClassDistribution from S6 JIT is really neat

One unassuming week of September 2022, Google DeepMind dropped a fully-fledged CPython JIT called S6, squashed to one commit. I had heard nothing of its development even though I was working on Cinder at the time and generally heard about new JIT efforts. I started poking at it. The README has some excellent structural explanation of how they optimize Python, including a nice introduction to hidden classes (also called shapes, layouts, and maps elsewhere).

Hidden classes are core to making dynamic language runtimes fast: they allow what is normally a hashtable lookup to become an integer comparison and a memory load. They rely on the assumption that even in a dynamic language, programmers are not very creative, and therefore for a given location in the code (PC), the number of types seen will be 1 or small. See a great tutorial by CF Bolz-Tereick on how to build a hidden class based object model.

Hidden classes give you the ability to more quickly read from objects, but you, the runtime implementor, have to decide what kind of cache you want to use. Should you have a monomorphic cache? Or a polymorphic cache? In an interpreter, a common approach is to do some kind of state-machine-based bytecode rewriting. Your generic opcodes (load an attribute, load a method, add) start off unspecialized, specialize to monomorphic when they first observe a hidden class HC, rewrite themselves to polymorphic when they observe the next hidden class HC’, and may again rewrite themselves to megamorphic (the sad case) when they see the K+1th hidden class (sketched in code below). Pure interpreters take this approach because they want to optimize as they go and the unit of optimization is normally (PDF) one opcode at a time. One interesting observation here is that while the bytecode rewriting is used to help interpreter performance, you can reuse this specialized bytecode and its cache contents as a source of profiling information when the JIT kicks in. It’s a double use, which is a win for storage and run-time overhead.

In an optimizing JIT world that cares a little less about interpreter/baseline compiler performance, the monomorphic/polymorphic split may look a little different. If you go for monomorphic and that code never sees any other hidden class, you’ve won big: the generated code is small and you can generally use the very strong type assumptions that come from having burned the hidden class into the code from the beginning. If you’re wrong, though, and that site ends up being polymorphic, you lose on performance: it will be constantly jumping into the interpreter. If you go for polymorphic but the code is mostly monomorphic, then you mostly just lose on peak performance. Your code may need to walk the cmp+jcc chain in the JIT and the operation’s inferred type in your IR will not be as fine-grained as in the monomorphic case. But you might side-exit less into the interpreter, which is nice.

But “polymorphic” and “megamorphic” are very coarse summaries of the access patterns at a site. Yes, side exits are slow, but if a call site S is specialized only for hidden class HC and mostly sees HC but sometimes sees HC’, that’s probably fine! We can take a few occasional side exits if the primary case is fast. The information our caches give us right now is essentially just which hidden classes were seen. But we want more than that: we want to know if the access patterns are skewed in some way. What if at some PC the interpreter sees 100x hidden class A and only 2x hidden class B? This would unfortunately look like a boring polymorphic cache.
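As promised, a toy Python sketch of that rewriting state machine. In a real interpreter the cache lives in rewritten bytecode (or patched machine code), not a Python object, and every name here is made up:

```python
class HiddenClass:
    def __init__(self, layout):           # layout: attribute name -> slot
        self.layout = layout
    def offset_of(self, attr):
        return self.layout[attr]

class Obj:
    def __init__(self, hidden_class, slots):
        self.hidden_class = hidden_class
        self.slots = slots

class AttributeCache:
    """Per-site cache that rewrites itself as it observes hidden classes:
    unspecialized -> monomorphic -> polymorphic -> megamorphic."""
    K = 4  # max specialized entries before giving up

    def __init__(self, attr):
        self.attr = attr
        self.entries = []        # [(hidden class, slot offset), ...]
        self.megamorphic = False

    def load(self, obj):
        if not self.megamorphic:
            for hc, offset in self.entries:
                if obj.hidden_class is hc:    # hit: one compare...
                    return obj.slots[offset]  # ...and one load
            if len(self.entries) < self.K:    # miss: specialize further
                offset = obj.hidden_class.offset_of(self.attr)
                self.entries.append((obj.hidden_class, offset))
                return obj.slots[offset]
            self.megamorphic = True           # saw the K+1th hidden class
        # Sad case: fall back to the generic lookup forever.
        return obj.slots[obj.hidden_class.offset_of(self.attr)]
```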
Another interesting case: what if we have a megamorphic site where one class more or less dominates? This would unfortunately look like a total bummer case even though it might be salvageable. If only we had a nice data structure for this…

S6 has a small C++ class called ClassDistribution that the interpreter uses to register what hidden classes it sees during execution profiling. It dispenses with the implicit seen order that a polymorphic cache keeps in its cmp-jcc chain and instead uses two fixed-size (they chose K=4) parallel arrays: one of class IDs and one of counts. Every time the interpreter captures a profile, it calls the distribution’s add function, which increments the count associated with that class ID. There are a couple of interesting things that this function does: it bubbles frequently-seen entries up toward the front, it lumps class IDs that don’t fit into an “other” count, and it keeps a running difference. (I sketch the parallel-array idea in code at the end of this post.) That is not much additional space and it gets you a totally different slice of the picture than a “normal” IC and bytecode rewriting. I find the bubbling up, the other count, and the running difference especially fun.

After a while, some bit of policy code decides that it’s time to switch execution modes for a given function and compile. The compiler would like to make use of this profile information. Sure, it can fiddle around with it in its raw state, but the S6 devs found a better API that random compiler passes can consume: a summary. The summary is another very small C++ class. It has only three fields: the class IDs from the ClassDistribution (but not their counts), a kind field, and one more small field. We don’t need the counts because that’s not really the question the optimizer should be asking. The thing the optimizer actually wants to know is “how should I speculate at this PC?” and it can outsource the mechanism for that to the summary’s kind (and the information implicit in the ordering of the class IDs, where the hottest class ID is in index 0). The kind can be one of five options: Empty, Monomorphic, Polymorphic, SkewedMegamorphic, and Megamorphic, each of which implies different things about how to speculate. Empty, monomorphic, and polymorphic are reasonably straightforward (did we see 0, 1, or <= K class IDs?) but SkewedMegamorphic is where it gets interesting. Their heuristic for whether a megamorphic PC is skewed is whether the class ID in bucket 0—the most popular class ID—accounts for over 75% of the total recorded events. This means that the optimizer still has a shot at doing something interesting at the given PC.

I wonder why they didn’t also have SkewedPolymorphic. I think that’s because for polymorphic PCs, they inline the entire compare-jump chain eagerly, which puts the check for the most popular ID in the first position. Still, I think there is potentially room to decide to monomorphize a polymorphic call site. There’s some ad-hoc checking for this kind of thing elsewhere in the optimizer, for example to specialize an operation whose operand is historically one of two known types. Also, sadly, they did not get to implementing SkewedMegamorphic before the project shut down, so they only handle the monomorphic and polymorphic cases across the optimizer. Ah well.

Some of the time-shift profiling that you can do with a ClassDistribution seems really cool and I had not seen it before. It feels like it could help with the issues brought up in Why Aren’t More Users More Happy With Our VMs?. Maybe. Understanding the behavior of a program through a tiny lens over a small snapshot of time is challenging.

(Kind of a “bits and bobbles” section à la Phil Zucker. I’m trying it out.)
- FeedbackVector in V8: see the blog post by Benedikt Meurer, which explains how they profile generic instruction operands using a feedback lattice.
- Speculation in JavaScriptCore, which continues to be a fantastic resource for fast runtime development. In it, Fil argues that the cost of speculating wrong is so high that you had better be darn sure your speculation holds.
- CacheIR in SpiderMonkey: see a blog post by Jan de Mooij and a blog post by Matthew Gaudet (and paper! (PDF)) → helpful for trial inlining? See the Warp improvement blog post.
- Tracing is just different.
- Basic block versioning.
- What if we had more context? Info from the caller.
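To make the parallel-array idea concrete, here is a toy Python rendition. The real thing is a C++ class inside S6; the bubbling policy and the skew test below are my reconstruction of the post’s description, not a faithful port:

```python
class ClassDistribution:
    K = 4  # number of (class ID, count) buckets

    def __init__(self):
        self.class_ids = [0] * self.K  # parallel arrays
        self.counts = [0] * self.K
        self.other = 0  # events whose class ID didn't fit in any bucket

    def add(self, class_id):
        for i in range(self.K):
            if self.class_ids[i] == class_id:
                self.counts[i] += 1
                # Bubble hot entries toward index 0 so that bucket 0
                # always holds the most popular class ID.
                if i > 0 and self.counts[i] > self.counts[i - 1]:
                    self._swap(i, i - 1)
                return
            if self.counts[i] == 0:  # empty bucket: claim it
                self.class_ids[i] = class_id
                self.counts[i] = 1
                return
        self.other += 1

    def _swap(self, i, j):
        self.class_ids[i], self.class_ids[j] = \
            self.class_ids[j], self.class_ids[i]
        self.counts[i], self.counts[j] = self.counts[j], self.counts[i]

    def is_skewed_megamorphic(self):
        # The 75% heuristic: megamorphic, but bucket 0 dominates.
        total = sum(self.counts) + self.other
        return self.other > 0 and total > 0 and \
            self.counts[0] / total > 0.75
```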

Max Bernstein 3 months ago

Heating water from afar

Please do not take anything you read in any of my posts (but especially not this post) as engineering advice. My parents’ house has an on-demand water heater for energy efficiency reasons. This has a small drawback: you have to press a button to prime the water heater and then wait 2-5 minutes before showering. This turns into a somewhat bigger drawback for the one room for which it’s just Not Possible to wire up a button. The water heater company hypothetically sells a plug-and-play wireless solution for this sort of thing, but that is seemingly incompatible with my parents’ walls (???). Thankfully, I have a playbook for this. There’s just one problem: how do I make a Raspberry Pi talk to the water heater?

I investigated a couple of different approaches, but after a couple of discussions with my dad and a support tech from the company, we determined that we should instead emulate a button press. A very sketchy wiring diagram told us what that means: we would have to add a new “button” and have it briefly connect two wires. Because the last time I actually touched some wires was over 10 years ago in robotics and I don’t want to start any fires, I reach out to the usual suspects: Tom and Logan. They inform me that the thing I am looking for is called a relay and that companies sell pre-built relay hats for the Pi. Super. I ended up buying a relay hat and some odds and ends.

At some point I sit down to build the thing and realize that I don’t actually know how relays work. The relay I bought had three ports: NC, NO, COM. After some searching, I figure out that I want one wire in NO (“normally open”) and one in COM (“common”). This means that the relay, when activated, will close the circuit. I downloaded the sample code from the company that sells the relay hats and realized that it is an extremely thin (~10 LOC) wrapper over the existing Python GPIO library provided and pre-installed by Raspberry Pi, so I just manually inlined it (a sketch of the result is below). If you read my previous post (linked above), you will know that this is, of course, a CGI script that is triggered on a website button press.

All of the rest of the software is the same as in the previous post. Very boring stuff: httpd, systemd. Hopefully nothing goes wrong. But if it does and I need to administer this device from afar, I also set up Tailscale (no, this is not an ad; I’m just happy). The total bill for this came to ~$40 or so, which isn’t half bad. It could probably be done for 35 cents using an old microcontroller and a paperclip or something, but I wanted an exceptionally boring (to me) approach. That’s all for now. Thanks for reading!
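A minimal sketch of what that inlined GPIO code might look like, assuming the stock RPi.GPIO library and a made-up BCM pin number for the relay channel:

```python
import time

import RPi.GPIO as GPIO

RELAY_PIN = 26  # hypothetical BCM pin; use whatever your relay hat maps to

GPIO.setmode(GPIO.BCM)
GPIO.setup(RELAY_PIN, GPIO.OUT)
try:
    # Close the NO-COM circuit briefly, like a human pressing the button.
    GPIO.output(RELAY_PIN, GPIO.HIGH)
    time.sleep(0.5)
    GPIO.output(RELAY_PIN, GPIO.LOW)
finally:
    GPIO.cleanup()
```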

Max Bernstein 3 months ago

Travel notes: PLDI Seoul

I went to PLDI in Seoul, South Korea. I had a nice time. Here are some unstructured thoughts. It was great to see old familiar faces, meet new people, and it was unexpected (though not unwelcome) to have multiple people approach me and say they read my blog. Hello, dear readers!

Vegetarian food is tricky, but if you look for a “temple meal” or restaurants specifically called out as vegetarian friendly (or even full vegan), you will be okay. A lot of food that otherwise looks vegetarian probably contains fish products. If you are okay with a bit of a grey area, the Myeong-dong night market has a good fried potato cheese stick thing and a cream cheese garlic bread. Go to a chicken and beer place with friends. We spent a good deal of time looking for a cute bar vibe (and did find one secret bar!) but the best bang for buck is definitely fries and 4000 Won Cass beer.

I had a nice time walking the Cheonggyecheon river (creek?). It’s separated from traffic, has fish and birds, and many people enjoy sitting alongside it. Don’t go on a rainy day because it is also where rainwater flows to, but on an overcast or sunny day it’s great. Not many people online suggest the city wall walk—in fact, I did not see it mentioned on any website I looked at—but I stumbled across it coming out of the Cheonggyecheon walk and in search of a 7-Eleven. It’s a nice 15 mile circular tour of old Seoul. I didn’t do the whole thing because it was already so late in the day but I will next time. It’s a lovely way to escape the huge buildings and constant traffic and see the small homes where people actually live. Bring a cold beverage (corn silk tea is nice). It’s hilly. On the city wall walk by Naksan there is a cute cafe called cafe cogito. I had ice cream there and looked out over the city. It’s a lovely place to sit and cool off. The neighbors have a cat that goes adventuring around.

After dinner, I stumbled across yet another thing that I had somehow not seen mentioned anywhere: Jongno! I accidentally wandered into the Ikseon-dong Hanok village at night, where there are restaurants and shops in a reconstructed stone-and-timber village. As I departed for the train, I also wandered into the night market on Donhwamun-ro 11-gil, which was packed and also looked excellent. It had tables and seating, giving it a very different feel than Myeong-dong. Pity I had already eaten… this looked like the place to be.

Getting to and from the airport (ICN) is irritating. You can do one straight shot from ICN to Seoul Station on the AREX train (which I recommend if it’s in budget for you and/or your employer; the slower train is fine too), but if you need to transfer anywhere, it’s up, down, and around in circles in the train station. There is a lot of transfer friction. Until next time, South Korea!

Max Bernstein 4 months ago

What I talk about when I talk about IRs

I have a lot of thoughts about the design of compiler intermediate representations (IRs). In this post I’m going to try and communicate some of those ideas and why I think they are important. The overarching idea is being able to make decisions with only local information. That comes in a couple of different flavors. We’ll assume that we’re compiling a method at a time, instead of something more trace-like (tracing, tracelets, basic block versioning, etc).

A function will normally have some control-flow (conditionals, loops, early returns), any amount of jumping around within a function. Let’s look at an example function in a language with advanced control-flow constructs. Most compilers will deconstruct this control flow, with its many nested expressions, into simple comparisons and jumps. In order to resolve jump targets in your compiler, you may have some notion of labels (in this case, words ending with a colon). This looks kind of like a pseudo-assembly language. It has its high-level language features decomposed into many smaller instructions. It also has implicit fallthrough between labeled sections (for example, from one labeled section into the next). I mention these things because they, like the rest of the ideas in this post, are points in an IR design space. Representing code this way is an explicit choice, not a given.

For example, one could make the jumps explicit by adding an unconditional jump at the end of each labeled section. As soon as we add that instruction, the code becomes position-independent: as long as we start with the entry label, the chunks of code between labels could be ordered any which way: they are addressed by name and have no implicit ordering requirements. This may seem arbitrary, but it gives the optimizer more flexibility. If some optimization rule decides, for example, that a branch may rarely be taken, it can freely re-order the target toward the end of the function (or even onto a different page!) so that more hot code can be on a single cache line.

Explicit jumps and labels turn the code from a strictly linear assembly into a control-flow graph (CFG). Each sequence of code without internal control-flow is called a basic block and is a vertex in this graph. The directed edges represent jumps between blocks. See for example a crude GraphViz representation. We’re actually kind of looking at extended basic blocks (EBBs), which allow for multiple control exits per block but only one control entry. A strict basic block representation of the same code would, in text form, give each block exactly one terminator (control-flow instruction), with (in this case) 0 or 2 targets. Opinions differ about the merits and issues of extended vs normal basic blocks. Most compilers I see use normal basic blocks. In either case, bringing the IR into a graph form gives us an advantage: thanks to Cousot and Cousot, our favorite power couple, we know how to do abstract interpretation on graphs, and we can use this to build an advanced optimizer. See, for example, my intro to abstract interpretation post.

Some IRs are stack based. For concatenative languages or some newer JIT compilers, IRs are formatted in such a way that each opcode reads its operands from a stack and writes its outputs to a stack. This is reminiscent of a point-free coding style in languages such as Haskell or OCaml. In this style, there is an implicit shared state: the stack. Dataflow is explicit (pushes and pops) and instructions can only be rearranged if the stack structure is preserved. This requires some non-local reasoning: to move an instruction, one must also rearrange the stack.
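As a concrete illustration, here is a toy stack-IR evaluator in Python (the opcode spelling is made up) showing why moving an instruction means re-plumbing the stack around it:

```python
# A stack-based IR for `(a + b) * c`: operands flow through an implicit
# stack, so dataflow is positional rather than named.
code = ["push a", "push b", "add", "push c", "mul"]

def run(code, env):
    stack = []
    for op in code:
        if op.startswith("push"):
            stack.append(env[op.split()[1]])
        elif op == "add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "mul":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs * rhs)
    return stack.pop()

# Swapping any two instructions changes what is on the stack when each
# one executes; that is exactly the non-local reasoning described above.
print(run(code, {"a": 1, "b": 2, "c": 3}))  # 9
```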
By contrast, in a register-based IR, things are more explicit. Instructions take named inputs and produce named outputs. Instructions can be slightly more easily moved around (modulo effects) as long as their inputs remain defined. Local variables do not exist. The stack does not exist. Everything is IR “variables”. The constraints (names being defined) are part of the IR.

This gets a little bit tricky if it’s possible to define a name multiple times. What does a name mean in an instruction that uses it? Which definition does it refer to? In order to reason about that instruction, we have to keep around some context. This is non-trivial: requiring compiler writers to constantly truck around side-tables and update them as they do analysis is slow and error-prone. Fortunately, if we enforce some interesting rules, we can push that analysis work into one pass up-front…

Static single assignment (SSA) was introduced by a bunch of folks at IBM (see my blog post about the different implementations). In SSA-based IRs, each variable can only be defined once. Put another way, a variable is its defining instruction; alternately, a variable and its defining instruction are addressed by the same name. An IR where a name has two definitions is not valid SSA; to turn it into SSA, we use a different name for each definition. This is related to the unique name assumption or the global names property: names do not depend on context. Now we can identify each different instruction by the variable it defines. This is useful in analysis…

I’d be remiss if I did not mention continuation-passing style (CPS) based IRs (and in fact, I had forgotten in the original draft of the post). As an IR, CPS is normally used in the analysis and optimization of functional programs, for example in the OCaml and Racket compilers. It is not required, however; MLton, for example, uses SSA in its compiler for Standard ML. SSA and CPS can more or less represent the same programs, but they can each feel a natural fit for different languages (and different compiler authors). I don’t feel qualified to say much more here. For a more informed opinion, check out Andy Wingo’s approaching cps soup, especially the benefits and drawbacks near the end.

Speaking of CPS, I took a class with Olin Shivers and he described abstract interpretation as “automated theorem finding”. Unlike theorem provers such as Lean and Rocq, where you have to manually prove the properties you want, static analysis finds interesting facts that already exist in your program (and optimizers use them to make your program faster). Your static analysis pass(es) can annotate your IR nodes with little bits of information: types, known constants, value ranges, and so on. If your static analysis is over SSA, then generally the static analysis is easier and (potentially) storing facts is easier. This is due to a property called sparseness. Where a static analysis over non-SSA programs has to store facts about all variables at all program points, an analysis over SSA need only store facts about all variables, independent of context. I sometimes describe this as “pushing time through the IR” but I don’t know that that makes a ton of sense.

Potentially more subtle here is that we could represent an IR snippet as a list of tuples, where instructions are related via some other table (say, a “variable environment”). Instead, though, we could allocate an object for each instruction and let them refer to one another by pointer (or index, if using Rust or something).
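A sketch of that pointer-linked representation (the class and field names are hypothetical):

```python
class Instr:
    def __init__(self, op, operands=(), type_=None):
        self.op = op
        self.operands = list(operands)  # direct references to other Instrs
        self.type = type_               # analysis results live on the node

# v0 = const 1; v1 = const 2; v2 = add v0, v1
v0 = Instr("const", (), type_="int")
v1 = Instr("const", (), type_="int")
v2 = Instr("add", (v0, v1))
# An optimizer asks type questions by reading a field off the operand,
# with no side-table in sight:
assert v2.operands[0].type == "int"
```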
Then they directly refer to one another (no need for a side-table), which might be faster and more compact. We can re-create nice names as needed for printing. Then, when optimizing, we look up the type information of an operand by directly reading a field (something like operand.type).

Another thing to note: when you start adding type information to your IR, you’re going to start asking type information questions in your analysis. Questions such as “what type is this instruction?”, where “type” could span a semilattice, and even refer to a specific run-time object by its pointer. In that case, it’s important to ask the right questions. For example: constant instructions are likely not the only opcodes that could produce specific objects; if you have a guard-like instruction, for example, that burns a specific expected pointer into the generated code, the type (and therefore the pointer) will come from that instruction. The big idea is that types represent a different slice of your IR than the opcodes and should be treated as such.

Anyway, SSA only stores type information about instructions and does not encode information that we might later learn in the IR. With basic SSA, there’s not a good way to encode refinements…

Static single information (SSI) form gives us new ways to encode metadata about instructions (variables). It was introduced by C. Scott Ananian in 1999 in his MS thesis (PDF). (I also discussed it briefly in the Scrapscript IR post.) Consider an SSA program (represented as pseudo-Python) in which some variable is undefined at the top, then defined as an integer. Then we do something interesting: we split control flow based on the run-time value of that variable. We can use this split to add new and interesting information to it. For a non-sparse analysis, we can record some fact on the side. That’s fine. When doing a dataflow analysis, we can keep track of the fact that in one branch the variable is nonnegative, and in the other it is negative. This is neat: we can then determine that all paths through the function return a positive integer. Importantly, the refinement does not override the existing known type of the variable. Instead, it is a refinement: a set intersection. A lattice meet. The middle bit of a Venn diagram containing two overlapping circles.

If we want to keep our information sparse, though, we have to add a new definition to the IR. This is complicated (choose which variables to split, replace all uses of the old name, keep the IR valid SSA, etc) but gives us new places to store information inside the IR. It means that every time we refer to the nonnegative split of the variable, we know that it is nonnegative, and every time we refer to the negative split, we know that it is negative. This information is independent of context! I should note that you can get a lot of the benefits of SSI without going “full SSI”. There is no need to split every variable at every branch, nor add a special new merge instruction.
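A pseudo-Python reconstruction of the kind of program being described (read_int and the x1/x2 split names are hypothetical):

```python
def f():
    x = read_int()    # x: int, value unknown
    if x >= 0:
        # SSI inserts a new definition here: x1 = refine(x, nonnegative)
        return x + 1  # every use of x1 is known nonnegative -> result >= 1
    else:
        # ...and here: x2 = refine(x, negative)
        return -x     # every use of x2 is known negative -> result >= 1
```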
Okay, so we can encode a lot of information very sparsely in the IR. That’s neat. It’s powerful. But we should also be mindful that even in this very sparse representation, we are encoding information implicitly that we may not want to: execution order. In a traditional CFG representation, the instructions are already scheduled, or ordered. Normally this comes from the programmer in the original source form and is faithfully maintained. We get data use edges in an IR like SSA, but the control information is left implicit. Some forms of IR, however, seek to reify both data and control dependencies into the IR itself. One such IR design is sea of nodes (SoN), which was originally designed by Cliff Click during his PhD.

In sea of nodes, every instruction gets its own vertex in the graph. Instructions have use edges to their operands, which can be either data or some other ordering property (control, effects, etc). The main idea is that IR nodes are by default unordered and are only ordered later, after effect analysis has removed a bunch of use edges. Per Vognsen also notes that there is another motivating example of sea of nodes: in the previous SSI example, the refined use cannot be validly hoisted above the check. In a “normal” IR, this is implicit in the ordering. In a sea of nodes world, this is explicitly marked with an edge from the refinement to the check. I think Graal, for example, calls these nodes “Pi nodes”. I think I need to re-read the original paper, read a modern implementation (I get the feeling it’s not done exactly the same way anymore), and then go write more about it later. For now, see Simple, by Cliff Click and friends. It is an implementation in Java and a little book to go with it. Design neighbors include value dependence graphs (VDG), value state dependence graphs (VSDG), region value state dependence graphs (RVSDG), and program dependence graphs (PDG).

Speaking of Cliff Click, I once heard/read something he said that sounded really interesting. Roughly, it was “elaborate the full semantics of the operation into the IR and let the optimizer sort it out”. That is, “open code” or “inline” your semantics. For example, don’t emit code for a generic add operation that you later specialize; instead, emit code that replicates the written semantics of the operation, whatever that is for your local language. This can include optimistic fast paths (see the sketch below). This has the advantage that you may end up with fewer specialized rewrite rules because constant propagation and branch folding take care of these specializations “for free”. You can even attach probabilities to more or less likely branches to offer outlining hints in case all of this is never specialized. Sure, the downside of this is that the generated IR might be bigger, so your optimizer might be slower—or worse, your resulting generated code at the end might be bigger. But outlining, deduplication (functionalization?), and probably some other clever methods can help here.

Similarly, Maxime Chevalier-Boisvert and Marc Feeley write about this (PDF) in the context of basic block versioning. If the runtime’s generic add function is written in IR, then callers to that function can specialize “through it” by calling it in different basic block contexts. That more or less gets you call-site specialization “for free”. See Figure 4 from their paper (lightly edited by me), where I think dollar-prefixed variable names indicate special functions known to the compiler. This is nice if you are starting a runtime from scratch or have resources to devote to re-writing chunks of the runtime in your IR. Then, even in a method JIT, you can get your inlined language semantics by function (partial) inlining. There’s probably more in this vein to be explored right now and probably more to invent in the future, too. There are some other potentially interesting concepts to think about here as well.

Thank you to Chris Fallin, Hunter Goldstein, and Per Vognsen for valuable feedback on drafts of this post.
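Here is a sketch of the open-coding contrast, with IR rendered as Python tuples; the opcode names are made up:

```python
# Opaque version: the optimizer cannot see inside generic_add.
before = [
    ("v2", "generic_add", "v0", "v1"),
]

# Open-coded version: the dynamic-add semantics are spelled out, so if a
# later pass proves v0 and v1 are ints, branch folding deletes everything
# but the int_add.
after = [
    ("v2", "is_int", "v0"),
    ("v3", "is_int", "v1"),
    ("v4", "and", "v2", "v3"),
    ("branch", "v4", "fast", "slow"),  # could carry a likelihood hint
    # fast: ("v5", "int_add", "v0", "v1")
    # slow: ("v6", "call_runtime", "generic_add", "v0", "v1")
]
```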

Max Bernstein 1 year ago

Sorry for marking all the posts as unread

I noticed that the URLs were all a little off (had two slashes instead of one) and went in and fixed it. I did not think everyone's RSS software was going to freak out the way it did. PS: this is a special RSS-only post that is not visible on the site. Enjoy.
