mcyoung, 1 month ago

Why SSA?

If you’ve read anything about compilers in the last two decades or so, you have almost certainly heard of SSA compilers, a popular architecture featured in many optimizing compilers, including ahead-of-time compilers such as LLVM, GCC, Go, CUDA (and various shader compilers), Swift [1], and MSVC [2], and just-in-time compilers such as HotSpot C2 [3], V8 [4], SpiderMonkey [5], LuaJIT, and the Android Runtime [6]. SSA is hugely popular, to the point that most compiler projects no longer bother with other IRs for optimization [7]. This is because SSA is incredibly nimble at the kinds of program analysis and transformation that compiler optimizations want to perform on your code. But why? Many of my friends who don’t do compilers say that compilers seem like opaque magical black boxes, and SSA, as it often appears in the literature, is impenetrably complex. But it’s not! SSA is actually very simple once you forget everything you think your programs are actually doing. We will develop the concept of SSA form, build a simple SSA IR, prove facts about it, and design some optimizations on it. I have previously written about the granddaddy of all modern SSA compilers, LLVM. This article is about SSA in general, and won’t really have anything to do with LLVM. However, it may be helpful to read that article to make some of the things in this article feel more concrete. SSA is a property of intermediate representations (IRs), primarily used by compilers for optimizing imperative code that targets a register machine. Register machines are computers that feature a fixed set of registers that can be used as the operands for instructions: this includes virtually all physical processors, including CPUs, GPUs, and weird things like DSPs.
SSA is most frequently found in compiler middle-ends, the optimizing component between the frontend (which deals with the surface language programmers write, and lowers it into the middle-end’s IR) and the backend (which takes the optimized IR and lowers it into the target platform’s assembly). SSA IRs, however, often bear little resemblance to the surface language they lower out of, or the assembly language they target. This is because neither of those representations makes it easy for a compiler to intuit optimization opportunities. Imperative code consists of a sequence of operations that mutate the executing machine’s state to produce a desired result. For example, consider the following C program: This program returns the same value no matter what its input is, so we can optimize it down to this: But how would you write a general algorithm to detect that all of the operations cancel out? You’re forced to keep program order in mind to perform the necessary dataflow analysis, following each variable’s mutations through the program. But this isn’t very general, and traversing all of those paths makes the search space for large functions very big. Instead, you would like to rewrite the program so that each variable is gradually replaced with the expression that calculates its most recent value, like this: Then we can replace each occurrence of a variable with its right-hand side recursively… Then fold the constants together… And finally, we see that we’re returning a constant, and can replace the whole body with it. All the other variables are now unused, so we can delete them. The reason this works so well is that we took a function with mutation and converted it into a combinatorial circuit, a type of digital logic circuit that has no state, and which is very easy to analyze. The dependencies between nodes in the circuit (corresponding to primitive operations such as addition or multiplication) are obvious from its structure.
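To make the substitute-then-fold idea concrete, here is a hedged Python sketch over a tiny expression language of nested tuples. The representation and names are my own, not the article’s IR:

```python
import operator

# An expression is an int constant, a variable name, or a tuple (op, lhs, rhs).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def substitute(expr, defs):
    """Recursively replace each variable with its defining expression."""
    if isinstance(expr, str):
        return substitute(defs[expr], defs) if expr in defs else expr
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        return (op, substitute(lhs, defs), substitute(rhs, defs))
    return expr  # an integer constant

def fold(expr):
    """Constant-fold: evaluate any subexpression whose operands are constants."""
    if isinstance(expr, tuple):
        op, lhs, rhs = expr[0], fold(expr[1]), fold(expr[2])
        if isinstance(lhs, int) and isinstance(rhs, int):
            return OPS[op](lhs, rhs)
        return (op, lhs, rhs)
    return expr

# x = 2 + 3; y = x * 4; return y - 20  →  return 0
defs = {"x": ("+", 2, 3), "y": ("*", "x", 4)}
print(fold(substitute(("-", "y", 20), defs)))  # → 0
```

Once every variable is replaced by its defining expression, the function is exactly the kind of stateless circuit the text describes, and folding it is a single recursive walk.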
For example, consider the following circuit diagram for a one-bit multiplier: This graph representation of a program has two huge benefits: The powerful tools of graph theory can be used to algorithmically analyze the program and discover useful properties, such as operations that are independent of each other or whose results are never used. The operations are not ordered with respect to each other except when there is a dependency; this is useful for reordering operations, something compilers really like to do. The reason combinatorial circuits are the best circuits is that they are directed acyclic graphs (DAGs), which admit really nice algorithms. For example, longest path in a general graph is NP-hard (and because P ≠ NP [8], has complexity O(2^n)). However, if the graph is a DAG, it admits an O(n) solution! To understand this benefit, consider another program: Suppose we wanted to replace each variable with its definition like we did before. We can’t just replace each variable with the expression that defines it, though, because we would wind up with a different program! Now we pick up an extra term, because the squaring operation is no longer unused! We can put this into circuit form, but it requires inserting new variables for every mutation. And we can’t do this when complex control flow is involved! So all of our algorithms would need to carefully account for mutations and program order, meaning that we don’t get to use the nice graph algorithms without careful modification. SSA stands for “static single assignment”, and was developed in the 80s as a way to enhance existing three-address code (where every statement has the form a = b ⊕ c) so that every program was circuit-like, using a procedure very similar to the one described above. The SSA invariant states that every variable in the program is assigned to by precisely one operation.
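The renaming procedure behind the SSA invariant can be sketched for straight-line code: give each assignment a fresh version of its destination, and rewrite uses to refer to the latest version. The representation and names below are mine, a minimal sketch rather than a real pass:

```python
from collections import defaultdict

def to_ssa(code):
    """Rename straight-line three-address code so each destination register
    is assigned exactly once. code: list of (dest, op, src1, src2), where
    sources are variable names or integer immediates."""
    version = defaultdict(int)  # version 0 of a variable is its incoming value

    def use(v):
        # Rewrite a use to refer to the latest version of the variable.
        return f"{v}{version[v]}" if isinstance(v, str) else v

    out = []
    for dest, op, a, b in code:
        a, b = use(a), use(b)       # read the current versions first...
        version[dest] += 1          # ...then mint a fresh register for dest
        out.append((f"{dest}{version[dest]}", op, a, b))
    return out

# x is assigned twice, so it becomes x1 and x2:
prog = [("x", "+", "a", 1), ("x", "*", "x", 2), ("y", "-", "x", 3)]
for inst in to_ssa(prog):
    print(inst)
```

After renaming, every register has exactly one def, so the “extra term” problem from the squaring example disappears: substituting a definition can never change which value a use observes.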
If every operation in the program is visited once, they form a combinatorial circuit. Transformations are required to respect this invariant. In circuit form, a program is a graph where operations are nodes, and “registers” (which is what variables are usually called in SSA) are edges (specifically, each output of an operation corresponds to a register). But, again, control flow. We can’t hope to circuitize a loop, right? The key observation of SSA is that most parts of a program are circuit-like. A basic block is a maximal circuital component of a program. Simply put, it is a sequence of non-control-flow operations, plus a final terminator operation that transfers control to another basic block. The basic blocks themselves form a graph, the control flow graph, or CFG. This formulation of SSA is sometimes called SSA-CFG [9]. This graph is not a DAG in general; however, separating the program into basic blocks conveniently factors out the “non-DAG” parts of the program, allowing for simpler analysis within basic blocks. There are two equivalent formalisms for SSA-CFG. The traditional one uses special “phi” operations (often called phi nodes, which is what I will call them here) to link registers across basic blocks. This is the formalism LLVM uses. A more modern approach, used by MLIR, is block arguments: each basic block specifies parameters, like a function, and blocks transferring control flow to it must pass arguments of those types to it. Let’s look at some code. First, consider the following C function, which calculates Fibonacci numbers using a loop. How might we express this in an SSA-CFG IR? Let’s start inventing our SSA IR! It will look a little bit like LLVM IR, since that’s what I’m used to looking at. Every block ends in a goto, which transfers control to one of several possible blocks. In the process, it calls that block with the given arguments. One can think of a basic block as a tiny function which tail-calls [10] into other basic blocks in the same function.
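As a sketch (the syntax and all names here are my own invention, loosely LLVM-flavored, not the article’s actual listing), a block-argument IR for the loop-based Fibonacci function might look something like this:

```
fn @fib(%n: i32) -> i32 {
entry:
  goto %loop(0, 1, 0)           ; a = 0, b = 1, i = 0

%loop(%a: i32, %b: i32, %i: i32):
  %cond = lt %i, %n
  goto if %cond then %body() else %exit()

%body:
  %sum = add %a, %b             ; the next Fibonacci number
  %i2  = add %i, 1
  goto %loop(%b, %sum, %i2)     ; pass fresh registers back to the loop head

%exit:
  ret %a
}
```

Note that %body freely references %loop’s parameters; as discussed later, many IRs let a block implicitly use registers defined in its dominators.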
aside Phi Nodes LLVM IR is… older, so it uses the older formalism of phi nodes. “Phi” comes from “phony”, because a phi is an operation that doesn’t do anything; it just links registers from predecessors. A phi node is essentially a switch-case on the predecessors, each case selecting a register from that predecessor (or an immediate). For example, our loop block has two predecessors: the implicit entry block and the loop body. In a phi-node IR, instead of taking a block argument, the loop block would specify a phi whose value is the value from whichever block jumped to this one. This can be awkward to type out by hand and to read, but it is a more convenient representation for describing algorithms (just “add a phi node” instead of “add a parameter and a corresponding argument”) and for the in-memory representation; it is otherwise completely equivalent. It’s a bit easier to understand the transformation from C to our IR if we first rewrite the C to use goto instead of a for loop: However, we still have mutation in the picture, so this isn’t SSA. To get into SSA, we need to replace every assignment with a new register, and somehow insert block arguments… The above IR code is already partially optimized; the named variables in the C program have been lifted out of memory and into registers. If we represent each named variable in our C program with a pointer, we can avoid needing to put the program into SSA form immediately. This technique is used by frontends that lower into LLVM, like Clang. We’ll enhance our IR by adding a declaration for functions which defines scratch space on the stack for the function to use. Each stack slot produces a pointer that we can load from and store to. Our Fibonacci function would now look like so: Any time we reference a named variable, we load from its stack slot, and any time we assign it, we store to that slot. This is very easy to lower into from C, but the code sucks because it’s doing lots of unnecessary pointer operations.
How do we get from this to the register-only function I showed earlier? aside Program Order We want program order not to matter for the purposes of reordering, but as we’ve written the code here, program order does matter: loads depend on prior stores, but stores don’t produce a value that can be used to link the two operations. We can restore order-independence by introducing operands representing an “address space”: loads and stores take an address space as an argument, and stores return a new address space. An address space represents the state of some region of memory. Loads and stores are independent when they are not connected by an address-space argument. This type of enhancement is used by Go’s SSA IR, for example. However, it adds a layer of complexity to the examples, so instead I will hand-wave it away. Now we need to prove some properties about CFGs that are important for the definition and correctness of our optimization passes. First, some definitions. The predecessors (or “preds”) of a basic block are the set of blocks with an outgoing edge to that block. A block may be its own predecessor. Some literature calls these “direct” or immediate predecessors. For example, the preds of the loop block in our example are the entry block (the special name for the function entry-point) and the loop body. The successors (no, not “succs”) of a basic block are the set of blocks with an outgoing edge from that block. A block may be its own successor. The successors of the loop block are the body and exit blocks; the successors are listed in the loop block’s goto. If a block A is a transitive pred of a block B, we say that A weakly dominates B, or that it is a weak dominator of B. For example, the entry, loop, and body blocks all weakly dominate the exit block. However, this is not usually an especially useful relationship. Instead, we want to speak of dominators: A block A is a dominator of B (or A dominates B) if every pred of B is dominated by A, or if B is A itself. Equivalently, the dominator set of B is the intersection of the dominator sets of its preds, plus B itself.
The dominance relation has some nice order properties that are necessary for defining the core graph algorithms of SSA. We only consider CFGs which are flowgraphs, that is, CFGs in which all blocks are reachable from the root block, which has no preds. This is necessary to eliminate some pathological graphs from our proofs. Importantly, we can always ask for an acyclic path [11] from the root to any block B. An equivalent way to state the dominance relationship is that every path from the root to B contains all of B’s dominators. proposition A dominates B iff every path from the root to B contains A. First, assume every root-to-B path contains A. If B is A, we’re done. Otherwise, we need to prove that each predecessor of B is dominated by A; we do this by induction on the length of acyclic paths from the root. Consider a pred P of B that is not A, and any acyclic path p from the root to P; by appending B to it, we obtain an acyclic path p′ from the root to B, which must contain A. Because neither the last nor the second-to-last element of p′ is A, A must lie within p, which is shorter than p′. Thus, by induction, A dominates P, and therefore A dominates B. Going the other way, suppose A dominates B, and consider a path p from the root to B. The second-to-last element of p is a pred of B; if it is A, we are done. Otherwise, consider the path p′ made by deleting B from the end. That pred is dominated by A, and p′ is shorter than p, so we can proceed by induction as above. Onto those nice properties. Dominance allows us to take an arbitrarily complicated CFG and extract from it a DAG, composed of blocks ordered by dominance. The dominance relation is a partial order. Dominance is reflexive and transitive by definition, so we only need to show that two distinct blocks can’t dominate each other. Suppose distinct blocks A and B dominate each other. Pick an acyclic path p from the root to B. Because A dominates B, there is a prefix p′ of this path ending in A. But because B dominates A, some prefix p″ of p′ ends in B.
But now p must contain B twice, contradicting that it is acyclic. This allows us to write A ≤ B when A dominates B. There is an even more refined graph structure that we can build out of dominators, which follows immediately from the partial order theorem. The dominators of a basic block are totally ordered by the dominance relation. Suppose A and B both dominate C, but neither dominates the other. Then there must exist acyclic paths from the root to C which contain both, but in different orders. Take the prefix of the first path ending at A (which does not contain B) and the suffix of the second path starting at A (which also does not contain B). Concatenating these yields a path from the root to C that does not contain B, a contradiction. This tells us that the DAG we get from the dominance relation is actually a tree, rooted at the entry block. The parent of a node in this tree is called its immediate dominator. Computing dominators can be done iteratively: the dominator set of a block is the intersection of the dominator sets of its preds, plus the block itself. This algorithm runs in quadratic time. A better algorithm is the Lengauer-Tarjan algorithm. It is relatively simple, but explaining how to implement it is a bit out of scope for this article; I found a nice treatment of it here. What’s important is that we can compute the dominator tree without breaking the bank, and given any node, we can ask for its immediate dominator. Using immediate dominators, we can introduce the final, important property of dominators. The dominance frontier of a block A is the set of all blocks not dominated by A that have at least one pred which A does dominate. These are points where control flow merges from distinct paths: one containing A and one not. In our example, the dominance frontier of the loop body is the loop block, whose preds are the entry block and the body. There are many ways to calculate dominance frontiers, but with a dominator tree in hand, we can do it like this: algorithm Dominance Frontiers. For each block B with more than one pred, and for each pred P of B: add B to the dominance frontier of P and of each of P’s dominators in turn, stopping upon encountering B’s immediate dominator.
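The iterative dominator computation and the frontier walk can be sketched in Python. The dict-of-preds representation, the block names, and the example diamond CFG are all my own illustrative choices, not from the article:

```python
def dominators(preds, entry):
    """Quadratic fixed-point computation: dom(b) = ∩ dom(preds of b) ∪ {b}."""
    blocks = list(preds)
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            new = set.intersection(*(dom[p] for p in preds[b])) | {b}
            if new != dom[b]:
                dom[b], changed = new, True
    return dom

def idom(dom, b, entry):
    """The immediate dominator is the closest strict dominator: among b's
    strict dominators, the one whose own dominator set is largest."""
    if b == entry:
        return None
    return max(dom[b] - {b}, key=lambda d: len(dom[d]))

def frontiers(preds, dom, entry):
    """For each join point, walk each pred's dominator chain upward, stopping
    at the join's immediate dominator; everything visited gains the join."""
    df = {b: set() for b in preds}
    for b in preds:
        if len(preds[b]) < 2:
            continue  # only join points land in frontiers
        for p in preds[b]:
            runner = p
            while runner != idom(dom, b, entry):
                df[runner].add(b)
                runner = idom(dom, runner, entry)
    return df

# Diamond CFG: entry → l, r; l, r → join.
preds = {"entry": [], "l": ["entry"], "r": ["entry"], "join": ["l", "r"]}
dom = dominators(preds, "entry")
fronts = frontiers(preds, dom, "entry")
print(fronts)  # join is in the frontier of both l and r, but not of entry
```

The diamond shows the definition at work: entry dominates join (so join is not in entry’s frontier), while l and r each dominate a pred of join without dominating join itself.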
We need to prove that every block examined by the algorithm winds up in the correct frontiers. First, we check that every examined block is added to the correct frontier. Suppose the walk from a pred P of B adds B to the frontier of D; then D dominates P (it is P or one of P’s dominators). D cannot dominate B: B’s strict dominators are its immediate dominator and that block’s dominators, and the walk stops before reaching those. So D dominates a pred of B but not B itself, which is exactly the condition for B to be in D’s frontier. Second, we check that every frontier is complete. Consider a block D with B in its frontier. Then D must be among the dominators of some pred P of B, and it must be strictly dominated by B’s immediate dominator; otherwise, D would dominate B (and B would thus not be in its frontier). The walk up from P therefore visits D before stopping, and B gets added to D’s frontier. You might notice that all of these algorithms are quadratic. This is actually a very good time complexity for a compilers-related graph algorithm. Cubic and quartic algorithms are not especially uncommon, and yes, your optimizing compiler’s time complexity is probably cubic or quartic in the size of the program! Ok. Let’s construct an optimization. We want to figure out if we can replace a load from a pointer with the most recent store to that pointer. This will allow us to fully lift values out of memory by cancelling out store/load pairs. This will make use of yet another implicit graph data structure. The dataflow graph is the directed graph made up of the internal circuit graphs of each basic block, connected along block arguments. To follow a use-def chain is to walk this graph forward from an operation to discover operations that potentially depend on it, or backwards to find operations it potentially depends on. It’s important to remember that the dataflow graph, like the CFG, does not have a well-defined “up” direction. Navigating it and the CFG requires the dominator tree. One other important thing to remember here is that every instruction in a basic block always executes if the block executes.
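Within a single basic block, program order makes the store-to-load replacement easy: one forward scan forwards each store’s value to later loads of the same pointer. A hedged Python sketch, with an instruction encoding of my own invention:

```python
def forward_stores(block):
    """One forward pass over a basic block. Instructions are encoded as
    ('store', ptr, src_reg) or ('load', dest_reg, ptr). Returns the block
    with redundant loads removed, plus a map from each deleted load's
    destination to the register that should replace its uses."""
    last_store = {}  # ptr -> register most recently stored through it
    replaced = {}    # deleted load dest -> replacement register
    out = []
    for inst in block:
        if inst[0] == "store":
            _, ptr, src = inst
            last_store[ptr] = src
            out.append(inst)
        elif inst[0] == "load" and inst[2] in last_store:
            # The load's value is whatever was last stored; drop the load.
            replaced[inst[1]] = last_store[inst[2]]
        else:
            out.append(inst)  # a load with no visible store stays put
    return out, replaced

block = [("store", "%p", "%a"), ("load", "%x", "%p"),
         ("store", "%p", "%b"), ("load", "%y", "%p"),
         ("load", "%z", "%q")]
new, repl = forward_stores(block)
print(repl)  # %x ← %a, %y ← %b; %z survives (no store to %q in this block)
```

The hard part, as the following sections show, is doing the same thing across block boundaries, where “the most recent store” depends on which path control flow took.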
In much of this analysis, we need to appeal to “program order” to select the last load in a block, but we are always able to do so. This is an important property of basic blocks that makes them essential for constructing optimizations. For a given store, we want to identify all loads that depend on it. We can follow the use-def chain of the stored-to pointer to find which blocks contain loads that potentially depend on the store. First, we can eliminate loads within the store’s own basic block: replace all loads after the store (but before any other stores to the same pointer, in program order) with the store’s operand. If the store is not the last store to the pointer in this block, we’re done. Otherwise, follow the use-def chain of the pointer to successors which use it, i.e., successors whose goto case has the pointer as at least one argument. Recurse into those successors, now replacing the pointer of interest with the successor parameters that were set to it (more than one argument may carry the pointer). If a successor loads from one of the registers holding the pointer, replace all such loads that occur before any store to it. We also now need to send the stored value into the successor somehow. This is where we run into something of a wrinkle. If the successor has exactly one predecessor, we need to add a new block argument to pass whichever register is holding the stored value (which exists by induction). If the value is already passed in by another argument, we can use that one. However, if the successor has multiple predecessors, we need to make sure that every path from the store’s block to it sends the value, and canonicalizing those paths will be tricky. Worse still, if the successor is in the store block’s dominance frontier, a different store could be contributing to that load! For this reason, dataflow from stores to loads is not a great strategy. Instead, we’ll look at dataflow from loads backwards to stores (in general, dataflow from uses to defs tends to be more useful), which we can use to augment the above forward dataflow analysis and sidestep the complex issues around dominance frontiers. Let’s analyze loads instead. For each load, we want to determine all stores that could potentially contribute to its value.
We can find those stores as follows: We want to be able to determine which register in a given block corresponds to the value of the loaded pointer, and then find the last store to it in that block. To do this, we’ll flood-fill the CFG backwards in BFS order. This means that we’ll follow preds (through the use-def chain) recursively, visiting each pred before visiting their preds, and never revisiting a basic block (except that we may need to come back to the load’s own block at the end). Determining the “equivalent” [12] of the pointer in a pred can be done recursively: while examining a block, follow the def of the pointer. If it is a block parameter, then for each pred, the equivalent is the corresponding argument in that pred’s goto case. Using this information, we can collect all stores that the load potentially depends on. If a predecessor stores to the pointer, we add the last such store in that block (in program order) to our set of stores, and do not recurse into its preds (because this store overwrites all past stores). Note that we may revisit the load’s own block in this process, and collect a store from it if one occurs in the block. This is necessary in the case of loops. The result is a set of (block, store) pairs. In the process, we also collected the set of all blocks visited, which are the blocks we will need to plumb a value through. This process is called memory dependency analysis, and is a key component of many optimizations. Not all contributing operations are stores. Some may be references to globals (which we’re disregarding), or function arguments, or the results of a function call (which means we probably can’t lift this load). If, for example, a pointer gets traced all the way back to a function argument, there is a code path which loads from a pointer whose stores we can’t see. It may also trace back to a stack slot that is potentially never stored to. This means there is a code path that can potentially load uninitialized memory. Like LLVM, we can assume this is not observable behavior, so we can discount such dependencies.
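The backward flood-fill can be sketched under strong simplifying assumptions: one pointer of interest, and no block-argument renaming of it along the way. The encoding and all names are mine:

```python
from collections import deque

def memory_deps(preds, stores, load_block):
    """Backward BFS flood-fill from the block containing the load. stores[b]
    lists store IDs to the pointer of interest in block b, in program order.
    Returns (contributing (block, store) pairs, blocks visited)."""
    deps, visited = set(), set()
    queue = deque(preds[load_block])
    while queue:
        b = queue.popleft()
        if b in visited:
            continue
        visited.add(b)
        if stores[b]:
            deps.add((b, stores[b][-1]))  # the last store wins; stop here
        else:
            queue.extend(preds[b])        # no store: keep walking backwards
    return deps, visited

# entry stores s0; the loop body stores s1; the loop head contains the load.
preds = {"entry": [], "head": ["entry", "body"], "body": ["head"]}
stores = {"entry": ["s0"], "head": [], "body": ["s1"]}
deps, seen = memory_deps(preds, stores, "head")
print(sorted(deps))  # both stores reach the load, one per incoming edge
```

Note how the loop edge makes both the entry’s store and the body’s store contribute to the same load, which is exactly why the value must later arrive through a block argument.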
If all of the dependencies are uninitialized loads, we can potentially delete not just the load, but also operations which depend on it (reverse dataflow analysis is the origin of so-called “time-traveling” UB). Now that we have the full set of dependency information, we can start lifting loads. Loads can be safely lifted when all of their dependencies are stores in the current function, or dependencies we can disregard thanks to UB in the surface language (such as null loads or uninitialized loads). There is a lot of fuss in this algorithm about plumbing values through block arguments. A lot of IRs make a simplifying change, where every block implicitly receives the registers of its dominators as block arguments. I am keeping the fuss because it makes it clearer what’s going on, but in practice, most of this plumbing, except at dominance frontiers, would happen in the background. Suppose we can safely lift some load. Now we need to plumb the stored values down to the load. For each block in the visited set (all other blocks are ignored unless stated otherwise), we will build two mappings: one giving the register that holds the pointer’s equivalent in that block, and one giving the value that the pointed-to memory must hold in that block. Prepare a work queue, with the block of each contributing store in it initially. Pop a block from the queue. For each successor (in the visited set): if the value mapping isn’t already defined there, add a block argument for it, and have the popped block pass its value to that argument. If the successor hasn’t been visited yet, and isn’t the block containing the load we’re deleting, add it to the queue. Once we’re done, in the block that contains the load, we can replace all loads of the pointer that occur before any store to it with the plumbed value. There are cases where this whole process can be skipped by applying a “peephole” optimization. For example, stores followed by loads within the same basic block can be optimized away locally, leaving the heavyweight analysis for cross-block store/load pairs. Here’s the result of doing dependency analysis on our Fibonacci function.
Each load is annotated with the blocks and stores it depends on. Let’s look at the first load. Its contributing stores are in the entry block and the loop body. So we add a new parameter: in the entry block, we pass the value stored there, while in the body, we pass the stored register. What about L4? Its contributing stores are also in those two blocks, but one of them isn’t a pred of the loop block. The loop block is also in the subgraph for this load, though. So, starting from the entry block, we add a new parameter to the loop block and feed the stored value (an immediate this time) through it. Now looking at the body, we see there is already a parameter for this load, so we just pass the stored register as that argument. Now we process the loop block, which pushes the body onto the queue. The body gets a new parameter, which is fed the loop block’s own parameter. We do not re-process the loop block, even though it also appears in the body’s gotos, because we already visited it. After doing this for the other two loads, we get this: After lifting, if we know that a stack slot’s pointer does not escape (i.e., none of its uses winds up going into a function call [13], a write to a global, or a write to a pointer that escapes), we can delete every store to that pointer. If we delete every store to a stack slot, we can delete the stack slot altogether (there should be no loads left for that stack slot at this point). This analysis is simple because it assumes pointers do not alias in general. Alias analysis is necessary for more accurate dependency analysis; it is needed, for example, for lifting loads of struct fields through subobject pointers, and for dealing with pointer arithmetic in general. However, our dependency analysis is robust to passing different pointers as arguments to the same block from different predecessors. This is the case that is specifically handled by all of the fussing about with dominance frontiers. This robustness ultimately comes from SSA’s circuital nature. Similarly, this analysis needs to be tweaked to deal with something like a select (a ternary, essentially).
Selects of pointers need to be replaced with selects of the loaded values, which means we need to do the lifting transformation “all at once”: lifting some liftable loads will leave the IR in an inconsistent state until all of them have been lifted. Many optimizations will make a mess of the CFG, so it’s useful to have simple passes that “clean up” the mess left by transformations. Here are some easy examples. If an operation’s result has zero uses, and the operation has no side effects, it can be deleted. This then lets us delete operations it depended on whose results now have no uses. Doing this is very simple, due to the circuital nature of SSA: collect all instructions whose outputs have zero uses, and delete them. Then examine the defs of their operands; if those operations now have no uses, delete them, and recurse. This bubbles up all the way to block arguments. Deleting block arguments is a bit trickier, but we can use a work queue to do it: Put all of the blocks into a work queue. Pop a block from the queue and run unused-result elimination on its operations. If the block now has parameters with no uses, remove those parameters, and, for each pred, delete the corresponding arguments to this block. Then place those preds into the work queue (since some of their operations may have lost their last use). Repeat until the queue is empty. There are many CFG configurations that are redundant and can be simplified to reduce the number of basic blocks. For example, unreachable code can help delete blocks. Other optimizations may cause the goto at the end of a block to be empty (because all of its successors were optimized away). We treat an empty goto as unreachable (since it has no cases!), so we can delete every operation in the block up to the last non-pure operation. If we delete every instruction in the block, we can delete the block entirely, and delete it from its preds’ gotos.
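The unused-result elimination described above might be sketched like this, assuming (my own simplification) that every instruction except `ret` is side-effect-free, and that each register has exactly one def, per the SSA invariant:

```python
def eliminate_unused(insts):
    """insts maps a register to its defining (op, *operands). Operands that
    are registers must be keys of insts; anything else is an immediate."""
    uses = {r: 0 for r in insts}
    roots = set()
    for r, (op, *args) in insts.items():
        for a in args:
            if a in uses:
                uses[a] += 1
        if op == "ret":
            roots.add(r)  # side-effecting: never deleted, keeps operands alive
    queue = [r for r in insts if uses[r] == 0 and r not in roots]
    while queue:
        r = queue.pop()
        _, *args = insts.pop(r)      # delete the dead instruction...
        for a in args:
            if a in uses:
                uses[a] -= 1         # ...and release its operands,
                if uses[a] == 0 and a in insts:
                    queue.append(a)  # which may now be dead too
    return insts

insts = {
    "%a": ("const", 1),
    "%b": ("const", 2),
    "%c": ("add", "%a", "%b"),   # only used by the dead %d
    "%d": ("mul", "%c", "%c"),   # never used
    "%ret": ("ret", "%b"),
}
live = eliminate_unused(insts)
print(sorted(live))  # %d, %c, and %a are deleted in cascade
```

The cascade is the point: deleting %d drops %c’s use count to zero, which in turn kills %a, exactly the “bubbling up” the text describes.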
This is a form of dead code elimination, or DCE, which combines with the previous optimization to aggressively delete redundant code. Some jumps are redundant. For example, if a block has exactly one pred and one successor, the pred’s case for that block can be wired directly to the successor. Similarly, if two blocks are each other’s unique predecessor/successor, they can be fused, creating a single block by connecting the input blocks’ circuits directly, instead of through a goto. If we have a ternary operation, we can do more sophisticated fusion. If a block has two successors, both of which share the same unique successor, and those successors consist only of gotos, we can fuse all four blocks, replacing the CFG diamond with a select. In terms of C, this is this transformation: LLVM’s CFG simplification pass is very sophisticated and can eliminate complex forms of control flow. I am hoping to write more about SSA optimization passes. This is a very rich subject, and viewing optimizations in isolation is a great way to understand how a sophisticated optimization pipeline is built out of simple, dumb components. It’s also a practical application of graph theory that shows just how powerful it can be, and (at least in my opinion) it is an intuitive setting for understanding graph theory, which can feel very abstract otherwise. In the future, I’d like to cover CSE/GVN, loop optimizations, and, if I’m feeling brave, getting out of SSA into a finite-register machine (backends are not my strong suit!).

[1] Specifically the Swift frontend, before lowering into LLVM IR. ↩
[2] Microsoft Visual C++, a non-conforming C++ compiler sold by Microsoft. ↩
[3] HotSpot is the JVM implementation provided by OpenJDK; C2 is the “second compiler”, which has the best performance among HotSpot’s Java execution engines. ↩
[4] V8 is Chromium’s JavaScript runtime. ↩
[5] SpiderMonkey is Firefox’s JavaScript runtime. ↩
[6] The Android Runtime (ART) is the “JVM” (scare quotes) on the Android platform. ↩
[7] The Glasgow Haskell Compiler (GHC) does not use SSA; it (like some other pure-functional languages) uses a continuation-oriented IR (compare Scheme’s continuations). ↩
[8] Every compiler person firmly believes that P ≠ NP, because program optimization is full of NP-hard problems and we would have definitely found polynomial ideal register allocation by now if it existed. ↩
[9] Some more recent IRs use a different version of SSA called “structured control flow”, or SCF. Wasm is a notable example of an SCF IR. SSA-SCF is equivalent to SSA-CFG, and polynomial-time algorithms exist for losslessly converting between them (LLVM compiling Wasm, for example, converts its CFG into SCF using a “relooping” algorithm). In SCF, operations like switch statements and loops are represented as macro operations that contain basic blocks. For example, a switch operation might take a value as input, select a basic block to execute based on it, and return the value that basic block evaluates to as its output. RVSDG is a notable innovation in this space, because it allows circuit analysis of entire imperative programs. I am covering SSA-CFG instead of SSA-SCF simply because it’s more common, and because it’s what LLVM IR is. See also this MLIR presentation on converting between the two. ↩
[10] Tail calling is when a function call is the last operation in a function; this allows the caller to jump directly to the callee, recycling its own stack frame for it instead of requiring the callee to allocate its own. ↩
[11] Given any path from the root to a block, we can make it acyclic by replacing each subpath from a block back to itself with a single node. ↩
[12] When moving from a basic block to a pred, a register in that block which is defined as a block parameter corresponds to some register (or immediate) in each predecessor. That is the “equivalent” of the register. One possible option for the “equivalent” is an immediate: for example, null or the address of a global.
In the case of a global, assuming no data races, we would instead need alias information to tell whether stores to this global within the current function (a) exist and (b) are liftable at all. If the equivalent is null, we can proceed in one of two ways, depending on optimization level. If we want loads of null to trap (as in Go), we need to mark this load as not liftable, because it may trap. If we want loads of null to be UB, we simply ignore that pred, because we can assume (for our analysis) that if the pointer is null, it is never loaded from. ↩
[13] Returned stack pointers do not escape: stack slots’ lifetimes end at function exit, so we would be returning a dangling pointer, which we assume is never loaded. So stores to that pointer before returning it can be discarded. ↩
↩ Microsoft Visual C++, a non-conforming C++ compiler sold by Microsoft. ↩

HotSpot is the JVM implementation provided by OpenJDK; C2 is the “second compiler”, which has the best performance among HotSpot’s Java execution engines. ↩

V8 is Chromium’s JavaScript runtime. ↩

SpiderMonkey is Firefox’s JavaScript runtime. ↩

The Android Runtime (ART) is the “JVM” (scare quotes) on the Android platform. ↩

mcyoung 3 months ago

Default Methods in Go

Go’s interfaces are very funny. Rather than being explicitly implemented, like in Java or Rust, they are simply a collection of methods (a “method set”) that the concrete type must happen to have. This is called structural typing, which is the opposite of nominal typing. Go interfaces are very cute, but this conceptual simplicity leads to a lot of implementation problems (a theme with Go, honestly). It removes a lot of intentionality from implementing interfaces, and there is no canonical way to document that satisfies 1 , nor can you avoid conforming to interfaces, especially if one forces a particular method on you. It also has very quirky results for the language runtime. To cast an interface value to another interface type (via the type assertion syntax ), the runtime essentially has to use reflection to go through the method set of the concrete type of . I go into detail on how this is implemented here . Because of their structural nature, this also means that you can’t add new methods to an interface without breaking existing code, because there is no way to attach default implementations to interface methods. This results in very silly APIs because someone screwed up an interface. For example, in the standard library’s package , the interface represents a value which can be parsed as a CLI flag. It looks like this: also has an optional method, which is only specified in the documentation. If the concrete type happens to provide , it will be queried to determine if the flag should have bool-like behavior. Essentially, this means that something like this exists in the flag library: The package already uses reflection, but you can see how it might be a problem if this interface-to-interface cast happens regularly, even taking into account Go’s caching of cast results. There is also , which exists because they messed up and didn’t provide a way for a to unwrap into the value it contains.
For example, if a flag is defined with , and then that flag is looked up with , there’s no straightforward way to get the int out of the returned . Instead, you have to side-cast to : As a result, needs to do a lot more work than if had just added , with a default return value of . It turns out that there is a rather elegant workaround for this. Go has this quirky feature called embedding , where a field in a struct is declared without a name: The -typed embedded field behaves as if we had declared the field , but selectors on will search in if they do not match something on the level. For example, if has a method , and does not, will resolve to . However, if has a method , resolves to , not , because has a field . Importantly, any methods from which does not already have will be added to ’s method set. So this works: Now, suppose that we were trying to add to . Let’s suppose that we had also defined , a type that all satisfiers of must embed. Then, we can write the following: Then, no code change is required for all clients to pick up the new implementation of . Now, this only works if we had required in the first place that anyone satisfying embeds . How can we force that? A little-known Go feature is that interfaces can have unexported methods. The way these work, for the purposes of interface conformance, is that exported methods are matched just by their name, but unexported methods must match both name and package. So, if we have an interface like , then will only match methods defined in the same package that this interface expression appears. This is useful for preventing satisfaction of interfaces. However, there is a loophole: embedding inherits the entire method set, including unexported methods. Therefore, we can enhance to account for this: Now, it’s impossible for any type defined outside of this package to satisfy , without embedding (either directly or through another embedded ).
Now, another problem is that you can’t control the name of embedded fields. If the embedded type is , the field’s name is . Except, it’s not based on the name of the type itself; it will pick up the name of a type alias. So, if you want to unexport the defaults struct, you can simply write: This also has the side-effect of hiding all of ’ methods from ’s documentation, despite the fact that exported fields and methods are still selectable and callable by other packages (including via interfaces). As far as I can tell, this is simply a bug in , since this behavior is not documented. There is still a failure mode: if a user type satisfying happened to define a method with a different interface. In this case, that takes precedence, and changes to will break users. There are two workarounds:

1. Tell people not to define methods on their satisfying type, and if they do, they’re screwed. Because satisfying is now explicit, this is not too difficult to ask for.
2. Pick a name for new methods that is unlikely to collide with anything.

Unfortunately, this runs into a big issue with structural typing, which is that it is very difficult to avoid making mistakes when making changes, due to the lack of intent involved. A similar problem occurs with C++ templates, where the interfaces defined by concepts are implicit, and can result in violating contract expectations. Go has historically been relatively cavalier about this kind of issue, so I think that breaking people based on this is fine. And of course, you cannot retrofit a default struct into an interface; you have to define it from day one. Now that we have defaults, we can also enhance with bool flag detection: Now is more than just a random throw-away comment on a type. We can also use defaults to speed up side casts. Many functions around the package will cast an into an or to perform more efficient I/O.
In a hypothetical world where we had defaults structs for all the interfaces, we can enhance with a default method that by default returns an error. We can do something similar for , but because it’s a rather general interface, it’s better to keep as-is. So, we can add a conversion method: Here, converts to an , returning if that’s not possible. How is this faster than ? Well, consider what this would look like in user code: Calling , if is an containing a , lowers to the following machine code: The main cost of this conversion is the indirect jump, compared to, at minimum, hitting a hashmap lookup loop for the cache for . Does this performance matter? Not for I/O interfaces, probably, but it can matter for some uses! Yes, it should, but here we are. Although making it a language feature has a few rather unfortunate quirks that we need to keep in mind. Suppose we can define defaults on interface methods somehow, like this: Then, any type which provides automatically satisfies . Suppose satisfies , but does not provide . Then we have a problem: Now, we might consider looking past that, but it becomes a big problem with reflection. If we passed into , the resulting conversion would discard the defaulted method, meaning that it would not be findable by . Oops. So we need to somehow add to ’s method set. Maybe we say that if is ever converted into , it gets the method. But this doesn’t work, because the compiler might not be able to see through something like . This means that must be applied unconditionally. But, now we have the problem that if we have another interface , would need to receive incompatible signatures for . Again, we’re screwed by the non-intentionality of structural typing. Ok, let’s forget about default method implementations, that doesn’t seem to be workable. What if we make some methods optional, like earlier? Let’s invent some syntax for it. Then, suppose that provides but not (or with the wrong signature). 
Then, the entry in the itab for would contain a nil function pointer, such that panics! To determine if is safe to call, we would use the following idiom: The compiler is already smart enough to elide construction of funcvals for cases like this, although it does mean that in general, for an interface value , requires an extra or similar to make sure that is nil when it’s a missing method. All of the use cases described above would work Just Fine using this construction, though! However, we run into the same issue that appears to have a larger method set than . It is not clear if should conform to , where is required. My intuition would be no: is a strictly weaker interface. Perhaps it might be necessary to avoid the method access syntax for optional methods, but that’s a question of aesthetics. This idea of having nulls in place of function pointers in a vtable is not new, but to my knowledge is not used especially widely. It would be very useful in C++, for example, to be able to determine if no implementation was provided for a non-pure virtual function. However, the nominal nature of C++’s virtual functions does not make this as big of a need. Another alternative is to store related interfaces’ itabs in an itab. For example, suppose that we invent the syntax within an to indicate that that interface will likely get cast to . For example: Satisfying does not require satisfying . However, the must be part of public API, because a cannot be used in place of an . Within ’s itab, after all of the methods, there is a pointer to an itab for , if the concrete type for this itab also happens to satisfy . Then, a cast from to is just loading a pointer from the itab. If the cast would fail, the loaded pointer will be . I had always assumed that Go did an optimization like this for embedding interfaces, but no! Any inter-interface conversion, including upcasts, goes through the whole type assertion machinery!
Of course, Go cannot hope to generate an itab for every possible subset of the method set of an interface (exponential blow-up), but it’s surprising that they don’t do this for embedded interfaces, which are Go’s equivalent of superinterfaces (present in basically every language with interfaces). Using this feature, we can update to look like this: Unfortunately, because changes the ABI of an interface, it does not seem possible to actually add this to existing interfaces, because the following code is valid: Even though this fix seems really clean, it doesn’t work! The only way it could work is if PGO determines that a particular interface conversion to happens a lot, and updates the ABI of all interfaces with the method set of , program-globally, to contain a pointer to a itab if available. Go’s interfaces are pretty bad; in my opinion, a feature that looks good on a slide, but which results in a lot of mess due to its granular and intention-less nature. We can sort of patch over it with embeds, but there are still problems. Due to how method sets work in Go, it’s very hard to “add” methods through an interface, and honestly at this point, any interface mechanism that makes it impossible (or expensive) to add new functions is going to be a huge problem. Missing methods seems like the best way out of this problem, but for now, we can stick to the janky embedded structs. Go uses the term “implements” to say that a type satisfies an interface. I am instead intentionally using the term “satisfies”, because it makes the structural, passive nature of implementing an interface clearer. This is also more in-line with interfaces’ use as generic constraints. Swift uses the term “conform” instead, which I am avoiding for this reason. ↩
1. Load the function pointer for out of ’s itab.
2. Perform an indirect jump on that function pointer.
3. Inside of , load a pointer to the itab symbol and into the return slots.

mcyoung 4 months ago

Parsing Protobuf Like Never Before

Historically I have worked on many projects related to high-performance Protobuf, be that on the C++ runtime, on the Rust runtime, or on integrating UPB , the fastest Protobuf runtime, written by my colleague Josh Haberman . I generally don’t post directly about my current job, but my most recent toy-turned-product is something I’m very excited to write about: . Here’s how we measure up against other Go Protobuf parsers. This is a subset of my benchmarks, since the benchmark suite contains many dozens of specimens. This was recorded on an AMD Zen 4 machine. Throughput for various configurations of (colored bars) vs. competing parsers (grey bars). Each successive includes all previous optimizations, corresponding to zerocopy mode , arena reuse , and profile-guided optimization . Bigger is better. Traditionally, Protobuf backends would generate parsers by generating source code specialized to each type. Naively, this would give the best performance, because everything would be “right-sized” to a particular message type. Unfortunately, now that we know better, there are a bunch of drawbacks: These effects are not directly visible in normal workloads, but other side-effects are visible: for example, giant switches on field numbers can turn into chains of branch instructions, meaning that higher-numbered fields will be quite slow. Even binary-searching on field numbers isn’t exactly ideal. However, we know that every Protobuf codec ever emits fields in index order (i.e., declaration order in the file), which is a data conditioning fact we don’t take advantage of with a switch. UPB solves this problem. It is a small C kernel for parsing Protobuf messages, which is completely dynamic: a UPB “parser” is actually a collection of data tables that are evaluated by a table-driven parser . In other words, a UPB parser is actually configuration for an interpreter VM, which executes Protobuf messages as its bytecode. 
UPB also contains many arena optimizations to improve allocation throughput when parsing complex messages. is a brand new library, written in the most cursed Go imaginable, which brings many of the optimizations of UPB to Go, and many new ones, while being tuned to Go’s own weird needs. The result leaves the competition in the dust in virtually every benchmark, while being completely runtime-dynamic. This means it’s faster than Protobuf Go’s own generated code, and (a popular but non-conforming 1 parser generator for Go). This post is about some of the internals of . I have also prepared a more sales-ey writeup, which you can read on the Buf blog . UPB is awesome. It can slot easily into any language that has C FFI, which is basically every language ever. Unfortunately, Go’s C FFI is really, really bad. It’s hard to overstate how bad cgo is. There isn’t a good way to cooperate with C on memory allocation (C can’t really handle Go memory without a lot of problems, due to the GC). Having C memory get cleaned up by the GC requires finalizers, which are very slow. Calling into C is very slow, because Go pessimistically assumes that C requires a large stack, and also calling into C does nasty things to the scheduler. All of these things can be worked around, of course. For a while I considered compiling UPB to assembly, and rewriting that assembly into Go’s awful assembly syntax 2 , and then having Go assemble UPB out of that. This presents a few issues though, particularly because Go’s assembly calling convention is still in the stone age 3 (arguments are passed on the stack), and because we would still need to do a lot of work to get UPB to match the API. Go also has a few… unique qualities that make writing a Protobuf interpreter an interesting challenge with exciting optimization opportunities. 
First, of course, is the register ABI, which on gives us a whopping nine argument and return registers, meaning that we can simply pass the entire parser state in registers all the time. Second is that Go does not have much UB to speak of, so we can get away with a lot of very evil pointer crimes that we could not in C++ or Rust. Third is that Protobuf Go has a robust reflection system that we can target if we design specifically for it. Also, the Go ecosystem seems much more tolerant of less-than-ideal startup times (because the language loves life-before-main due to functions), so unlike UPB, we can require that the interpreter’s program be generated at runtime, meaning that we can design for online PGO. In other words, we have the perfect storm to create the first-ever Protobuf JIT compiler (which we also refer to as “online PGO” or “real-time PGO”). Right now, ’s API is very simple. There are functions that accept some representation of a message descriptor, and return a , which implements the type APIs. This can be used to allocate a new , which you can shove into and do reflection on the result. However, you can’t mutate s currently, because the main use-cases I am optimizing for are read-only. All mutations panic instead. The hero use-case, using Buf’s library, uses reflection to execute validation predicates. It looks like this: We tell users to make sure to cache the compilation step because compilation can be arbitrarily slow: it’s an optimizing compiler! This is not unlike the same warning on , which makes it easy to teach users how to use this API correctly. In addition to the main API, there’s a bunch of performance tuning knobs for the compiler, for unmarshaling, and for recording profiles. Types can be recompiled using a recorded profile to be more optimized for the kinds of messages that actually come on the wire. PGO 4 affects a number of things that we’ll get into as I dive into the implementation details. 
Most of the core implementation lives under . The main components are as follows: This article won’t be a deep-dive into everything in the parser, and even this excludes large portions of . For example, the package is already described in a different blogpost of mine. I recommend taking a look at that to learn about how we implement a GC-friendly arena for . Instead, I will give a brief overview of how the object code is organized and how the parser interprets it. I will also go over a few of the more interesting optimizations we have. Every that is reachable from the root message (either as a field or as an extension) becomes a . This contains the dynamic size of the corresponding message type, a pointer to the type’s default parser (there can be more than one parser for a type) and a variable number of values. These specify the offset of each field and provide accessor thunks, for actually extracting the value of the field. A is what the parser VM interprets alongside encoded Protobuf data. It contains all of the information needed for decoding a message in compact form, including s for each of its fields (and extensions), as well as a hashtable for looking up a field by tag, which is used by the VM as a fallback. The s each contain: Each actually corresponds to a possible tag on a record for this message. Some fields have multiple different tags: for example, a can have a -type tag for the repeated representation, and a -type tag for the packed representation. Each field specifies which fields to try next. This allows the compiler to perform field scheduling , by carefully deciding which order to try fields in based both on their declaration order and a rough estimation of their “hotness”, much like branch scheduling happens in a program compiler. This avoids almost all of the work of looking up the next field in the common case, because we have already pre-loaded the correct guess. 
I haven’t managed to nail down a good algorithm for this yet, but I am working on a system for implementing a type of “branch prediction” for PGO, that tries to provide better predictions for the next fields to try based on what has been seen before. The offset information for a field is more than just a memory offset. A includes a bit offset, for fields which request allocation of individual bits in the message’s bitfields. These are used to implement the hasbits of fields (and the values of fields). It also includes a byte offset for larger storage. However, this byte offset can be negative, in which case it’s actually an offset into the cold region . In many messages, most fields won’t be set, particularly extensions. But we would like to avoid having to allocate memory for the very rare (i.e., “cold”) fields. For this, a special “cold region” exists in a separate allocation from the main message, which is referenced via a compressed pointer. If a message happens to need a cold field set, it takes a slow path to allocate a cold region only if needed. Whether a field is cold is a dynamic property that can be affected by PGO. The parser is designed to make maximal use of Go’s generous ABI without spilling anything to the stack that isn’t absolutely necessary. The parser state consists of eight 64-bit integers, split across two types: and . Unfortunately, these can’t be merged due to a compiler bug, as documented in . Every parser function takes these two structs as its first two arguments, and returns them as its first two results. This ensures that register allocation tries its darnedest to keep those eight integers in the first eight argument registers, even across calls. This leads to a common idiom of reassigning the returned state to the same variables. Overwriting the parser state like this ensures that future uses of p1 and p2 use the values that each call places in registers for us.
I spent a lot of time and a lot of profiling catching all of the places where Go would incorrectly spill parser state to the stack, which would result in stalls. I found quite a few codegen bugs in the process. Particularly notable (and shocking!) is #73589 . Go has somehow made it a decade without a very basic pointer-to-SSA lifting pass (for comparison, this is a heavy-lifting cleanup pass ( ) in LLVM). The core loop of the VM goes something like this: Naturally, this is implemented as a single function whose control flow consists exclusively of s and s, because getting Go to generate good control flow otherwise proved too hard. Now, you might be wondering why the hot loop for the parser includes calling a virtual function . Conventional wisdom holds that virtual calls are slow. After all, the actual virtual call instruction is quite slow, because it’s an indirect branch, meaning that it can easily stall the CPU. However, it’s actually much faster than the alternatives in this case, due to a few quirks of our workload and how modern CPUs are designed: In fact, the “optimized” form of a large switch is a jump table, which is essentially an array of function pointers. Rather than doing a large number of comparisons and direct branches, a jump table turns a switch into a load and an indirect branch. This is great news for us, because it means we can make use of a powerful assumption about most messages: most messages only feature a handful of field archetypes. How often is it that you see a message which has more in it than , , , and submessages? In effect, this allows us to have a very large “instruction set”, consisting of all of the different field archetypes, but a particular message only pays for what it uses. The fewer archetypes it uses at runtime , the better the CPU can predict this indirect jump. On the other hand, we can just keep adding archetypes over time to specialize for common parse workloads, which PGO can select for. 
Adding new archetypes that are not used by most messages does not incur a performance penalty. We’ve already discussed the hot/cold split, and briefly touched on the message bitfields used for s and hasbits. I’d like to mention a few other cool optimizations that help cover all our bases, as far as high-performance parsing does. The fastest implementation is the one you don’t call. For this reason, we try to, whenever possible, avoid copying anything out of the input buffer. s and are represented as s , which are a packed pair of offset+length in a . Protobuf is not able to handle lengths greater than 2GB properly, so we can assume that this covers all the data we could ever care about. This means that a field is 8 bytes, rather than 24, in our representation. Zerocopy is also used for packed fields. For example, a will typically be encoded as a record. The number of s in this record is equal to its length divided by 8, and the s are already encoded in IEEE754 format for us. So we can just retain the whole repeated fields as a . Of course, we need to be able to handle cases where there are multiple disjoint records, so the backing can also function as a 24-byte arena slice. Being able to switch between these modes gracefully is a delicate and carefully-tested part of the repeated field thunks. Surprisingly, we also use zerocopy for varint fields, such as . Varints are variable-length, so we can’t just index directly into the packed buffer to get the th element… unless all of the elements happen to be the same size. In the case that every varint is one byte (so, between 0 and 127), we can zerocopy the packed field. This is a relatively common scenario, too, so it results in big savings 6 . We already count the number of varints in the packed field in order to preallocate space for it, so this doesn’t add extra cost. This counting is very efficient because I have manually vectorized the loop. 
PGO records the median size of each repeated/map field, and that is used to calculate a “preload” for each repeated field. Whenever the field is first allocated, it is pre-allocated using the preload to try to right-size the field with minimal waste. Using the median ensures that large outliers don’t result in huge memory waste; instead, this guarantees that at least 50% of repeated fields will only need to allocate from the arena once. Packed fields don’t use the preload, since in the common case only one record appears for packed fields. This mostly benefits string- and message-typed repeated fields, which can’t be packed. We don’t use Go’s built-in map, because it has significant overhead in some cases: in particular, it has to support Go’s mutation-during-iteration semantics, as well as deletion. Although both are Swisstables 7 under the hood, my implementation can afford to take a few shortcuts. It also allows our implementation to use arena-managed memory. s are used both for the backing store of fields, and for maps inside of s. Currently, the hash used is the variant of fxhash used by the Rust compiler . This greatly out-performs Go’s maphash for integers, but maphash is better for larger strings. I hope to maybe switch to maphash at some point for large strings, but it hasn’t been a priority. Hitting the Go allocator is always going to be a little slow, because it’s a general-case allocator. Ideally, we should learn the estimated memory requirements for a particular workload, and then allocate a single block of that size for the arena to portion out. The best way to do this is via arena reuse. In the context of a service, each request has a bounded lifetime on the message that it parses. Once that lifetime is over (the request is complete), the message is discarded. This gives the programmer an opportunity to reset the backing arena, so that it keeps its largest memory block for re-allocation.
You can show that over time, this will cause the arena to never hit the Go allocator. If the largest block is too small for a message, a block twice as large will wind up getting allocated. Messages that use the same amount of memory will keep doubling the largest block, until the largest block is large enough to fit the whole message. Memory usage will be at worst 2x the size of this message. Note that, thanks to extensive use of zero-copy optimizations, we can often avoid allocating memory for large portions of the message. Of course, arena re-use poses a memory safety danger, if the previously allocated message is kept around after the arena is reset. For this reason, it’s not the default behavior. Using arena resets is a double-digit percentage improvement, however. Go does not properly support unions, because the GC does not keep the necessary book-keeping to distinguish a memory location that may be an integer or a pointer at runtime. Instead, this gets worked around using interfaces, which are always a pointer to some memory. Go’s GC can handle untyped pointers just fine, so this just works. The generated API for Protobuf Go uses interface values for s. This API is… pretty messy to use, unfortunately, and triggers unnecessary allocations (much like fields do in the open API). However, my arena design (read about it here) makes it possible to store arena pointers on the arena as if they are integers, since the GC does not need to scan through arena memory. Thus, our s are true unions, like in C++. is really exciting because its growing JIT capabilities offer an improvement in the state of the art over UPB. It’s also been a really fun challenge working around Go compiler bugs to get the best assembly possible. The code is already so well-optimized that re-building the benchmarks with the Go compiler’s own PGO mode (based on a profile collected from the benchmarks) didn’t really seem to move the needle! I’m always working on making better (I get paid for it!)
and I’m always excited to try new optimizations. If you think of something, file an issue! I have meticulously commented most things within , so it should be pretty easy to get an idea of where things are if you want to contribute. I would like to write more posts diving into some of the weeds of the implementation. I can’t promise anything, but there’s lots to talk about. For now… have fun source-diving! There’s a lot of other things we could be doing: for example, we could be using SIMD to parse varints, we could have smarter parser scheduling, we could be allocating small submessages inline to improve locality… there’s still so much we can do! And most importantly, I hope you’ve learned something new about performance optimization! vtprotobuf gets a lot of things wrong that make it beat us in like two benchmarks, because it’s so sloppy. For example, vtprotobuf believes that it’s ok to not validate UTF-8 strings. This is non-conforming behavior. It also believes that map entries’ fields are always in order and always populated, meaning that valid Protobuf messages containing maps can be parsed incorrectly . This sloppiness is unacceptable, which is why goes to great lengths to implement all of Protobuf correctly.  ↩ Never gonna let Rob live that one down. Of all of Rob’s careless design decisions, the assembler is definitely one of the least forgivable ones.  ↩ There are only really two pieces of code in that could benefit from hand-written assembly: varint decoding and UTF-8 validation. Both of these vectorize well, however, is so inefficient that no hand-written implementation will be faster. If I do wind up doing this, it will require a build tag like , along with something like to treat the assembly implementations as part of the Go runtime, allowing the use of . But even then, this only speeds up parsing of large (>2 byte) varints.  ↩ This is PGO performed by itself; this is unrelated to ’s own PGO mode, which seems to not actually make faster.  
↩ Yes, the parser manages its own stack separate from the goroutine stack. This ensures that nothing in the parser has to be reentrant. The only time the stack is pushed to is when we “recurse” into a submessage.  ↩ Large packed repeated fields are where the biggest wins are for us. Being able to zero-copy large packed fields full of small values allows us to eliminate all of the overhead that the other runtimes are paying for; we also choose different parsing strategies depending on the byte-to-varint ratio of the record. Throughput for various repeated field benchmarks. This excludes the benchmarks, since those achieve such high throughputs (~20 Gbps) that they make the chart unreadable. These optimizations account for the performance difference between and in the first benchmark chart. The latter is a containing , Protobuf’s janky debuginfo format. It is dominated by fields. NB: This chart is currently missing the Y-axis, I need to have it re-made.  ↩ Map parsing performance has been a bit of a puzzle. cheats by rejecting some valid map entry encodings, such as (in Protoscope) (value is implied to be ), while mis-parsing others, such as (fields can go in any order), since they don’t actually validate the field numbers like does. Here’s where the benchmarks currently stand for maps: Throughput for various map parsing benchmarks. Maps, I’m told, are not very popular in Protobuf, so they’re not something I have tried to optimize as hard as packed repeated fields.  ↩ Every type you care about must be compiled ahead-of-time. Tricky for when you want to build something generic over schemas your users provide you. Every type contributes to a cost on the instruction cache, meaning that if your program parses a lot of different types, it will essentially flush your instruction cache any time you enter a parser. Worse still, if a parse involves enough types, the parser itself will hit instruction decoding throughput issues. 
, which defines the “object code format” for the interpreter. This includes definitions for describing types and fields to the parser. , which contains all of the code for converting a into a , which contains all of the types relevant to a particular parsing operation. defines what dynamic message types look like. The compiler does a bunch of layout work that gets stored in values, which a interprets to find the offsets of fields within itself. contains the core interpreter implementation, including the VM state that is passed in registers everywhere. It also includes hand-optimized routines for parsing varints and validating UTF-8. defines archetypes , which are classes of fields that all use the same layout and parsers. This corresponds roughly to a pair, but not exactly. There are around 200 different archetypes. Each field record holds: the same offset information as a ; the field’s tag, in a special format; a function pointer that gets called to parse the field; and the next field(s) to try parsing after this one is parsed. The main parse loop then goes:
1. Are we out of bytes to parse? If so, pop a parser stack frame 5 . If we popped the last stack frame, parsing is done; return success.
2. Parse a tag. This does not fully decode the tag, because s contain a carefully-formatted, partially-decoded tag to reduce decoding work.
3. Check if the next field we would parse matches the tag. (a) If yes, call the function pointer ; update the current field to ; goto 1. (b) If no, update the current field to ; goto 3. If this misses “enough times”, fall through.
4. Slow path: hit to find the matching field for that tag. If matched, go to 3a. If not, this is an unknown field; put it into the unknown field set; parse a tag and goto 4.
Modern CPUs are not great at traversing complex “branch mazes”. This means that selecting one of ~100 alternatives using branches, even if they are well-predicted and you use unrolled binary search, is still likely to result in frequent mispredictions, and is an obstacle to other JIT optimizations in the processor’s backend. 
Predicting a single indirect branch with dozens of popular targets is something modern CPUs are pretty good at. Chips and Cheese have a great writeup on the indirect prediction characteristics of Zen 4 chips. 

mcyoung 4 months ago

The Best C++ Library

It’s no secret that my taste in programming languages is very weird for a professional programming language enthusiast. Several of my last few posts are about Go, broadly regarded as the programming language equivalent of eating plain oatmeal for breakfast. To make up for that, I’m going to write about the programming language equivalent of diluting your morning coffee with Everclear . I am, of course, talking about C++. If you’ve ever had the misfortune of doing C++ professionally, you’ll know that the C++ standard library is really bad. Where to begin? Well, the associative containers are terrible. Due to bone-headed API decisions, MUST be a closed-addressing, array-of-linked-lists map, not a Swisstable, despite closed-addressing being an outdated technology. , which is not what you usually want, must be a red-black tree. It can’t be a b-tree, like every sensible language provides for the ordered map. is a massive pain in the ass to use, and is full of footguns, like . is also really annoying to use. is full of sharp edges. And where are the APIs for signals? Everything is extremely wordy. could have been called . could have used . The standard algorithms are super wordy, because they deal with iterator pairs. Oh my god, iterator pairs. They added , which do not measure up to Rust’s at all! I’m so mad about all this! The people in charge of C++ clearly, actively hate their users! 1 They want C++ to be as hard and unpleasant as possible to use. Many brilliant people that I am lucky to consider friends and colleagues, including Titus Winters, JeanHeyd Meneide, Matt Fowles-Kulukundis, and Andy Soffer, have tried and mostly failed 2 to improve the language. This is all to say that I believe C++ in its current form is unfixable. But that’s only due to the small-mindedness of a small cabal based out of Redmond. What if we could do whatever we wanted? What if we used C++’s incredible library-building language features to build a brand-new language? 
For the last year-or-so I’ve been playing with a wild idea: what would C++ look like if we did it over again? Starting from an empty C++20 file with no access to the standard library, what can we build in its place? Titus started Abseil while at Google, whose namespace, , is sometimes said to stand for “a better standard library” 3 . To me, Abseil is important because it was an attempt to work with the existing standard library and make it better, while retaining a high level of implementation quality that a C++ shop’s home-grown utility library won’t have, and a uniformity of vision that Boost is too all-over-the-place to achieve. Rather than trying to coexist with the standard library, I want to surpass it. As a form of performance art, I want to discover what the standard library would look like if we designed it today , in 2025. In this sense, I want to build something that isn’t just better . It should be the C++ standard library from the best possible world. It is the best possible library. This is why my library’s namespace is . In general, I am trying not to directly copy either what C++, or Abseil, or Rust, or Go did. However, each of them has really interesting ideas, and the best library probably lies in some middle-ground somewhere. The rest of this post will be about what I have achieved with so far, and where I want to take it. You can look at the code here . We’re throwing out everything, and that includes . This is a header which shows its age: alias templates weren’t added until C++14, and variable templates were added in C++17. As a result, many things that really ought to be concepts have names like . All of these now have concept equivalents in . I have opted to try to classify type traits into separate headers to make them easier to find. They all live under , and they form the leaves of the dependency graph. 
For example, contains all of the array traits, such as , (to remove an array extent), and , which applies an extent to a type T, such that is not an error. contains very low-level metaprogramming helpers, such as: provides , which removes the qualifiers from an abominable function type . provides , necessary for determining if a type is “more const” than another. provides a standard empty type that interoperates cleanly with . On top of the type traits is the metaprogramming library , which includes generalized constructibility traits in (e.g., to check that you can, in fact, initialize a from a , for example). provides a very general type-level heterogeneous list abstraction; a parameter pack as-a-type. The other part of “the foundation” is , which mostly provides access to intrinsics, portability helpers, macros, and “tag types” such as our versions of . For example, provides , provides , and provides . provides our version of the Rust operator, which is not an expression because statement expressions are broken in Clang. Finally, within we find , a special type for turning any C++ type into an object (i.e., a type that you can form a reference to). This is useful for manipulating any type generically, without tripping over the assign-through semantics of references. For example, is essentially a pointer. On top of this foundation we build the basic algebraic data types of : and , which replace and . is a heterogeneous collection of values, stored inside of s. This means that has natural rebinding, rather than assign-through, semantics. Accessing elements is done with : returns a reference to the first element. Getting the first element is so common that you can also use . Using will return a reference to a instead, which can be used for rebinding references. For example: There is also and , for the other two most common elements to access. is named so in reference to database rows: it provides many operations for slicing and dicing that does not. 
For example, in addition to extracting single elements, it’s also possible to access contiguous subsequences, using : ! There are also a plethora of mutation operations: Meanwhile, is a method now: calls with ’s elements as its arguments. is similar, but instead expands to unary calls of , one with each element. And of course , supports structured bindings. Meanwhile, contains precisely one value from various types. There is an underlying type that implements a variadic untagged union that works around many of C++’s bugs relating to unions with members of non-trivial type. The most common way to operate on a choice is to on it: Which case gets called here is chosen by overload resolution, allowing us to write a default case as . Which variant is currently selected can be checked with , while specific variants can be accessed with , just like a , except that it returns a . is what all of the other sum types, like and , are built out of. All of the clever layout optimizations live here. Speaking of , that’s our option type. It’s close in spirit to what is in Rust. has a generic niche mechanism that user types can opt into, allowing to be the same size as a pointer, using for the variant. provides the usual transformation operations: , , . Emptiness can be checked with or . You can even pass a predicate to to check the value with, if it’s present: . The value can be accessed using and , like ; however, this operation is checked, instead of causing UB if the option is empty. can be used to unwrap with a default; the default can be any number of arguments, which are used to construct the default, or even a callback. For example: also Just Works (in fact, is a internally), allowing for truly generic manipulation of optional results. is, unsurprisingly, the analogue of Rust’s . Because it’s a internally, works as you might expect, and is a common return value for I/O operations. It’s very similar to , including offering for accessing the “ok” variant. 
This enables succinct idioms: and return s containing references to the ok and error variants, depending on which is actually present; meanwhile, a can be converted into a using or , just like in Rust. s are constructed using and . For example: These internally use , a wrapper over that represents a “delayed initialization” that can be stored in a value. It will implicitly convert into any type that can be constructed from its elements. For example: Also, every one of the above types is a structural type, meaning it can be used for non-type template parameters! Of course, all of these ADTs need to be built on top of pointer operations, which is where comes in. is a generalized pointer type that provides many of the same operations as Rust’s raw pointers, including offsetting, copying, and indexing. Like Rust pointers, can be a fat pointer, i.e., it can carry additional metadata on top of the pointer. For example, remembers the size of the array. Providing metadata for a is done through a member alias called . This alias should be private, which is given access to by befriending . Types with custom metadata will usually not be directly constructible (because they are of variable size), and must be manipulated exclusively through types like . Specifying custom metadata allows specifying what the pointer dereferences to. For example, dereferences to a , meaning that all the span operations are accessible through : for example, . Most of this may seem a bit over-complicated, since ordinary C++ raw pointers and references are fine for most uses. However, is the foundation upon which is built on. is a replacement for that fixes its const correctness and adds Rust -like helpers. also works, but unlike , it remembers its size, just like . is parameterized by its allocator, which must satisfy , a much less insane API than what offers. is a singleton allocator representing the system allocator. , mentioned before, is the contiguous memory abstraction, replacing . 
Like , is a fixed-length span of elements. Unlike , the second parameter is a , not a that uses as a sentinel. tries to approximate the API of Rust slices , providing indexing, slicing, splicing, search, sort, and more. Naturally, it’s also iterable, both forwards and backwards, and provides splitting iterators, just like Rust. Slicing and indexing is always bounds-checked. Indexing can be done with values, while slicing uses a : is a generic mechanism for specifying slicing bounds, similar to Rust’s range types . You can specify the start and end (exclusive), like in Rust. You can also specify an inclusive end using , equivalent to Rust’s . And you can specify a count, like C++’s slicing operations prefer: . itself provides all of the necessary helpers for performing bounds checks and crashing with a nice error message. is also iterable, as we’ll see shortly. is a copy of Rust’s type, providing similar helpers for performing C++-specific size and address calculations. C++ iterator pairs suck. C++ ranges suck. provides a new paradigm for iteration that is essentially just Rust s hammered into a C++ shape. This library lives in . To define an iterator, you define an iterator implementation type , which must define a member function named that returns a : This type is an implementation detail; the actual iterator type is . provides all kinds of helpers, just like , for adapting the iterator or consuming items out of it. Iterators can override the behavior of some of these adaptors to be more efficient, such as for making constant-time rather than linear. Iterators can also offer extra methods if they define the member alias ; for example, the iterators for have a method for returning the part of the slice that has not been yielded by yet. One of the most important extension points is , analogous to , for right-sizing containers that the iterator is converted to, such as a . 
And of course, provides so that it can be used in a C++ range-for loop, just like C++20 ranges do. 4 , which is an instantiation of, is also an iterator, and can be used much like Rust ranges would: will carefully handle all of the awkward corner cases around overflow, such as . Iterators bring us to the most complex container type that’s checked in right now, . Not only can you customize its allocator type, but you can customize its small vector optimization type. In , s of at most 23 bytes are stored inline , meaning that the string’s own storage, rather than heap storage, is used to hold them. generalizes this, by allowing any trivially copyable type to be inlined. Thus, a will hold at most five s inline, on 64-bit targets. mostly copies the APIs of and Rust’s . Indexing and slicing works the same as with , and all of the operations can be accessed through , allowing for things like . I have an active (failing) PR which adds , a general hash table implementation that can be used as either a map or a set. Internally it’s backed by a Swisstable 5 implementation. Its API resembles neither , , nor Rust’s . Instead, everything is done through a general entry API, similar to that of Rust, but optimized for clarity and minimizing hash lookups. I want to get it merged soonish. Beyond , I plan to add at least the following containers: Possible other ideas: Russ’s sparse array , splay trees, something like Java’s , bitset types, and so on. ’s string handling is intended to resemble Rust’s as much as possible; it lives within . is the Unicode scalar type, which is such that it is always within the valid range for a Unicode scalar, but including the unpaired surrogates. It offers a number of relatively simple character operations, but I plan to extend it to all kinds of character classes in the future. 
is our replacement for , close to Rust’s : a sequence of valid UTF-8 bytes, with all kinds of string manipulation operations, such as rune search, splitting, indexing, and so on. and use compiler extensions to ensure that when constructed from literals, they’re constructed from valid literals. This means that the following won’t compile! is a under the hood, which can be accessed and manipulated the same way as the underlying to is. is our equivalent. There isn’t very much to say about it, because it works just like you’d expect, and provides a Rust -like API. Where this library really shines is that everything is parametrized over encodings. is actually a ; is then . You can write your own text encodings, too, so long as they are relatively tame and you provide rune encode/decode for them. is the concept is always validly encoded; however, sometimes, that’s not possible. For this reason we have , which is “presumed validly encoded”; its operations can fail or produce replacement characters if invalid code units are found. There is no ; instead, you would generally use something like a instead. Unlike C++, the fact that a is a under the hood is part of the public interface, allowing for cheap conversions and, of course, we get ’s small vector optimization for free. provides the following encodings out of the box: , , , , , and . provides a Rust -style text formatting library. It’s as easy as: Through the power of compiler extensions and , the format is actually checked at compile time! The available formats are the same as Rust’s, including the vs distinction. But it’s actually way more flexible. You can use any ASCII letter, and types can provide multiple custom formatting schemes using letters. By convention, , , , and all mean numeric bases. will quote strings, runes, and other text objects; will print pointer addresses. The special format “forwards from above”; when used in a formatting implementation, it uses the format specifier the caller used. 
This is useful for causing formats to be “passed through”, such as when printing lists or . Any type can be made formattable by providing a friend template ADL extension (FTADLE) called . This is analogous to implementing a trait like in Rust; however, all formatting operations use the same function; this is similar to in Go. The type, which gets passed into , is similar to Rust’s . Beyond being a sink, it also exposes information on the specifier for the formatting operation via , and helpers for printing indented lists and blocks. is a related FTADLE that is called to determine what the valid format specifiers for this type are. This allows the format validator to reject formats that a type does not support, such as formatting a with . returns (or appends to) a ; and can be used to write to stdout and stderr. Within the metaprogramming library, offers a basic form of reflection. It’s not C++26 reflection, because that’s wholly overkill. Instead, it provides a method for introspecting the members of structs and enums. For example, suppose that we want to have a default way of formatting arbitrary aggregate structs. The code for doing this is actually devilishly simple: provides access to the fields (or enum variants) of a user-defined type that opts itself in by providing the FTADLE, which tells the reflection framework what the fields are. The simplest version of this FTADLE looks like this: is essentially a “reflection builder” that offers fine-grained control over what reflection actually shows of a struct. This allows for hiding fields, or attaching tags to specific fields, which generic functions can then introspect using . The functions on allow iterating over and searching for specific fields (or enum variants); these s provide metadata about a field (such as its name) and allow accessing it, with the same syntax as a pointer-to-member: . Explaining the full breadth (and implementation tricks) of would be a post of its own, so I’ll leave it at that. 
provides a unit testing framework under , like any good standard library should. To define a test, you define a special kind of global variable: This is very similar to a Go unit test, which defines a function that starts with and takes a as its argument. The value offers test assertions and test failures. Through the power of looking at debuginfo, we can extract the name from the binary, and use that as the name of the test directly. That’s right, this is a C++ test framework with no macros at all ! Meanwhile, at we can find a robust CLI parsing library, in the spirit of and other similar Rust libraries. The way it works is you first define a reflectable struct, whose fields correspond to CLI flags. A very basic example of this can be found in , since test binaries define their own flags: Using , we can apply tags to the individual fields that describe how they should be parsed and displayed as CLI flags. A more complicated, full-featured example can be found at , which exercises most of the CLI parser’s features. can be used to parse a particular flag struct from program inputs, independent of the actual of the program. A contains the actual parser metadata, but this is not generally user-accessible; it is constructed automatically using reflection. Streamlining top-level app execution can be done using , which fully replaces the function. Defining an app is very similar to defining a test: This will automatically record the program inputs, run the flag parser for (printing and exiting, when requested), and then call the body of the lambda. The lambda can either return , an (as an exit code) or even a , like Rust. is also where the of the program can be requested by other parts of the program. There’s still a lot of stuff I want to add to . There are no synchronization primitives, neither atomics nor locks nor channels. There’s no I/O; I have a work-in-progress PR to add and . I’d like to write my own math library, (reference-counting), and portable SIMD. 
There are also some other OS APIs I want to build, such as signals and subprocesses. I want to add a robust PRNG, time APIs, networking, and stack symbolization. Building the best C++ library is a lot of work, not the least because C++ is a very tricky language and writing exhaustive tests is tedious. But it manages to make C++ fun for me again! I would love to see contributions some day. I don’t expect anyone to actually use this, but to me, it proves C++ could be so much better. They are also terrible people .  ↩ I will grant that JeanHeyd has made significant progress where many people believed it was impossible. He appears to have the indomitable willpower of a shōnen protagonist.  ↩ I have heard an apocryphal story that the namespace was going to be or , because it was “Alphabet’s library”. This name was ultimately shot down by the office of the CEO, or so the legend goes.  ↩ This may get renamed to or even . We’ll see!  ↩ The fourth time I’ve written one in my career, lmao. I also wrote a C implementation at one point. My friend Matt has an excellent introduction to the Swisstable data structure.  ↩ and , the identity traits for type- and value-kinded traits. , which returns whether an entire pack of types is all equal. , our version of . , a “symbol compression” mechanism for shortening the names of otherwise huge symbols. concatenates tuples, copying or moving as appropriate ( will move out of the elements of , for example). returns a copy of with appended, while does the same at an arbitrary index. replaces the th element with , potentially of a different type. deletes the th element, while deletes a contiguous range. splices a row into another row, offering a general replace/delete operation that all of the above operations are implemented in terms of. and are even more general, allowing for non-contiguous indexing. , a btree map/set with a similar API. , a simple min-heap implementation. 
, a with a linked list running through it for in-order iteration and oldest-member eviction. , a ring buffer like . , a port of my crate . 

mcyoung 4 months ago

What’s //go:nosplit for?

Most people don’t know that Go has special syntax for directives. Unfortunately, it’s not real syntax; it’s just a comment. For example, causes the next function declaration to never get inlined, which is useful for changing the inlining cost of functions that call it. There are four kinds of directives:
1. The ones documented in ’s doc comment . This includes and .
2. The ones documented elsewhere, such as and .
3. The ones documented in runtime/HACKING.md , which can only be used if the flag is passed to . This includes .
4. The ones not documented at all, whose existence can be discovered by searching the compiler’s tests. These include , , and .
We are most interested in a directive of the first kind, . According to the documentation: The directive must be followed by a function declaration. It specifies that the function must omit its usual stack overflow check. This is most commonly used by low-level runtime code invoked at times when it is unsafe for the calling goroutine to be preempted. What does this even mean? Normal program code can use this annotation, but its behavior is poorly specified. Let’s dig in. Go allocates very small stacks for new goroutines, which grow their stack dynamically. This allows a program to spawn a large number of short-lived goroutines without spending a lot of memory on their stacks. This means that it’s very easy to overflow the stack. Every function knows how large its stack frame is, and , the goroutine struct, contains the end position of the stack; if the stack pointer is less than it (the stack grows down) control passes to , which effectively preempts the goroutine while its stack is resized. In effect, every Go function has the following code around it: Note that holds a pointer to the current , and the stack limit is the third word-sized field ( ) in that struct, hence the offset of 16. 
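In Go's amd64 assembly syntax, that check looks roughly like this. This is a sketch based on the description above, not disassembly: the exact registers, labels, and offsets vary by architecture and Go release, though on amd64 R14 does hold the current g and the limit sits at offset 16:

```asm
fn:
    CMPQ SP, 16(R14)    // compare SP against the stack limit, the third word of g
    JLS  grow           // at or past the limit: go grow the stack
body:
    // ... the function body proper ...
    RET
grow:
    // spill argument registers into the caller's frame, then trap into the runtime
    CALL runtime·morestack_noctxt(SB)
    JMP  fn             // restart the function on the new, bigger stack
```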
If the stack is about to be exhausted, it jumps to a special block at the end of the function that spills all of the argument registers, traps into the runtime, and, once that’s done, unspills the arguments and restarts the function. Note that arguments are spilled before adjusting , which means that the arguments are written to the caller’s stack frame. This is part of Go’s ABI; callers must allocate space at the top of their stack frames for any function that they call to spill all of its registers for preemption 1 . Preemption is not reentrant, which means that functions that are running in the context of a preempted G, or with no G at all, must not be preempted by this check. The directive marks a function as “nosplit”, or a “non-splitting function”. “Splitting” has nothing to do with what this directive does. aside Segmented Stacks In the bad old days, Go’s stacks were split up into segments , where each segment ended with a pointer to the next, effectively replacing the stack’s single array with a linked list of such arrays. Segmented stacks were terrible. Instead of triggering a resize, these prologues were responsible for updating to the next (or previous) block by following this pointer, whenever the current segment bottomed out. This meant that if a function call happened to land on a segment boundary, it would be extremely slow in comparison to other function calls, due to the significant work required to update correctly. Unlucky sizing of stack frames thus led to sudden performance cliffs. Fun! Go has since figured out that segmented stacks are a terrible idea. In the process of implementing a correct GC stack scanning algorithm (which it did not have for many stable releases), it also gained the ability to copy the contents of a stack from one location to another, updating pointers in such a way that user code wouldn’t notice. This stack splitting code is where the name “nosplit” comes from.
A nosplit function does not load and branch on , and simply assumes it has enough stack. This means that nosplit functions will not preempt themselves, and, as a result, are noticeably faster to call in a hot loop. Don’t believe me? If we profile this and pull up the timings for each function, here’s what we get: The time spent at each instruction (for the whole benchmark, where I made sure equal time was spent on each test case with ) is comparable for all of the instructions these functions share, but an additional ~2% cost is incurred for the stack check. This is a very artificial setup, because the struct is always in L1 in the benchmark, due to the fact that no other memory operations occur in the loop. However, for very hot code that needs to saturate the cache, this can have an outsized effect due to cache misses. We can enhance this benchmark by adding an assembly function that executes , which causes the struct to be ejected from all caches. If we add a call to this function to both benchmark loops, we see the staggering cost of a cold fetch from RAM show up in every function call: 120.1 nanoseconds for , versus 332.1 nanoseconds for . The 200 nanosecond difference is a fetch from main memory. An L1 miss is about 15 times less expensive, so if the struct manages to get kicked out of L1, you’re paying about 15 or so nanoseconds, or about two map lookups! Despite the language resisting adding an inlining annotation, which programmers would place everywhere without knowing what it does, it did provide something worse that makes code noticeably faster: nosplit. Consider the following program 2 : Naturally, this will instantly overflow the stack. Instead, we get a really scary linker error: The Go linker contains a check to verify that any chain of nosplit functions which call nosplit functions does not overflow a small window of extra stack, which is where the stack frames of nosplit functions live if they go past .
Every stack frame contributes some stack use (for the return address, at minimum), so the number of functions you can call before you get this error is limited. And because every function needs to allocate space for all of its callees to spill their arguments if necessary, you can hit this limit very fast if every one of these functions uses every available argument register (ask me how I know). Also, turning on fuzzing instruments the code by inserting nosplit calls into the fuzzer runtime around branches, meaning that turning on fuzzing can cause previously fine code to no longer link . Stack usage also varies slightly by architecture, meaning that code which builds on one architecture fails to link on others (most visible when going from 32-bit to 64-bit). There is no easy way to control directives using build tags (two poorly-designed features collide), so you cannot just “turn off” performance-sensitive nosplits for debugging, either. For this reason, you must be very, very careful about using nosplit for performance. Excitingly, nosplit functions whose addresses are taken do not have special codegen, allowing us to defeat the linker stack check by using virtual function calls. Consider the following program: This will quickly exhaust the main G’s tiny stack and segfault in the most violent way imaginable, preventing the runtime from printing a debug trace. All this program outputs is . This is probably a bug . It turns out that nosplit has various other fun side-effects that are not documented anywhere. The main thing it does is contribute to whether a function is considered “unsafe” by the runtime. Consider the following program: This program will make sure that every P becomes bound to a G that loops forever, meaning they will never trap into the runtime. Thus, this program will hang forever, never printing its result and exiting. But that’s not what happens.
Thanks to asynchronous preemption, the scheduler will detect Gs that have been running for too long, and preempt their Ms by sending them a signal (due to happenstance, this is of all things). However, asynchronous preemption is only possible when the M stops due to the signal at a safe point, as determined by . It includes the following block of code: If we chase down where this value is set, we’ll find that it is set explicitly for write barrier sequences, for any function that is “part of the runtime” (as defined by being built with the flag), and for any nosplit function . With a small modification of hoisting the body into a nosplit function, the following program will run forever: it will never wake up from . Even though there is work to do, every P is bound to a G that will never reach a safe point, so there will never be a P available to run the main goroutine. This represents another potential danger of using nosplit functions: those that do not call preemptable functions must terminate promptly, or risk livelocking the whole runtime.
I use nosplit a lot , because I write high-performance, low-latency Go. This is a very insane thing to do, which has caused me to slowly generate bug reports whenever I hit strange corner cases. For example, there are many cases where spill regions are allocated for functions that never use them: functions which only call nosplit functions allocate space for them to spill their arguments, which they don’t do. 3
This is a documented Go language feature which: Isn’t very well-documented (the async preemption behavior certainly isn’t)! Has very scary optimization-dependent build failures. Can cause livelock and mysterious segfaults. Can be used in user programs that don’t ! And it makes code faster! I’m surprised such a massive footgun exists at all, buuuut it’s a measurable benchmark improvement for me, so it’s impossible to tell if it’s bad or not.
The astute reader will observe that because preemption is not reentrant, only one of these spill regions will be in use at a time in a G. This is a known bug in the ABI, and is essentially a bodge to enable easy adoption of passing arguments by register, without needing all of the parts of the runtime that expect arguments to be spilled to the stack, as was the case in the slow old days when Go’s ABI on every platform was “ but worse”, i.e., arguments went on the stack and made the CPU’s store queue sad. I recently filed a bug about this that boils down to “add a field to to use a spill space”, which seems to me to be simpler than the alternatives described in the spec.  ↩
Basically every bug report I write starts with these four words and it means you’re about to see the worst program ever written.  ↩
The spill area is also used for spilling arguments across calls, but in this case, it is not necessary for the caller to allocate it for a nosplit function.  ↩

mcyoung 6 months ago

Protobuf Tip #7: Scoping It Out

You’d need a very specialized electron microscope to get down to the level to actually see a single strand of DNA. – Craig Venter TL;DR: is a powerful tool for examining wire format dumps, by converting them to JSON and using existing JSON analysis tooling. can be used for lower-level analysis, such as debugging messages that have been corrupted. I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think it fits with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog . JSON’s human-readable syntax is a big reason why it’s so popular, possibly second only to built-in support in browsers and many languages. It’s easy to examine any JSON document using tools like online prettifiers and the inimitable . But Protobuf is a binary format! This means that you can’t easily use -like tools with it…or can you? The Buf CLI offers a utility for transcoding messages between the three Protobuf encoding formats: the wire format, JSON, and textproto; it also supports YAML. This is , and it’s very powerful. To perform a conversion, we need four inputs: a Protobuf source to get types out of (this can be a local file, an encoded , or a remote BSR module; if not provided, but run in a directory that is within a local Buf module, that module will be used as the Protobuf type source); the name of the top-level type for the message we want to transcode, via the flag; the input message, via the flag; and a location to output to, via the flag. supports input and output redirection, making it usable as part of a shell pipeline. For example, consider the following Protobuf code in our local Buf module: Then, let’s say we’ve dumped a message of type from a service to debug it. And let’s say…well—you can’t just it. However, we can use to turn it into some nice JSON. We can then pipe it into to format it. Now you have the full expressivity of at your disposal. For example, we could pull out the user ID for the cart: Or we can extract all of the SKUs that appear in the cart: Or we could try calculating how many items are in the cart, total: Wait. That’s wrong. The answer should be . This illustrates one pitfall to keep in mind when using with Protobuf.
Protobuf will sometimes serialize numbers as quoted strings (the C++ reference implementation only does this when they’re integers outside of the IEEE754 representable range, but Go is somewhat lazier, and does it for all 64-bit values). You can test if an is in the representable float range with this very simple check: . See https://go.dev/play/p/T81SbbFg3br . The equivalent version in C++ is much more complicated. This means we need to use the conversion function: ’s whole deal is JSON, so it brings with it all of JSON’s pitfalls. This is notable for Protobuf when trying to do arithmetic on 64-bit values. As we saw above, Protobuf serializes integers outside of the 64-bit float representable range (and in some runtimes, some integers inside it). For example, if you have a that you want to sum over, it may produce incorrect answers due to floating-point rounding. For notes on conversions in , see https://jqlang.org/manual/#identity . is a tool provided by the Protobuf team (which I originally wrote!) for decoding arbitrary data as if it were encoded in the Protobuf wire format. This process is called disassembly . It’s designed to work without a schema available, although it doesn’t produce especially clean output. The field names are gone; only field numbers are shown. This example also reveals an especially glaring limitation of , which is that it can’t tell the difference between string and message fields, so it guesses according to some heuristics. For the first and third elements it was able to grok them as strings, but for , it incorrectly guessed it was a message and produced garbage. The tradeoff is that not only does not need a schema, it also tolerates almost any error, making it possible to analyze messages that have been partly corrupted. If we flip a random bit somewhere in , disassembling the message still succeeds: Although did give up on disassembling the corrupted submessage, it still made it through the rest of the dump.
Like , we can give a to make its heuristic a little smarter. Not only is the second order decoded correctly now, but shows the name of each field (via ). In this mode, still decodes partially-valid messages. also provides a number of other flags for customizing its heuristic in the absence of a . This enables it to be used as a forensic tool for debugging messy data corruption bugs.

mcyoung 6 months ago

Protobuf Tip #6: The Subtle Dangers of Enum Aliases

I’ve been very fortunate to dodge a nickname throughout my entire career. I’ve never had one. – Jimmie Johnson TL;DR: Enum values can have aliases. This feature is poorly designed and shouldn’t be used. The Buf lint rule prevents you from using them by default. I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think it fits with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog . Protobuf permits multiple enum values to have the same number. Such enum values are said to be aliases of each other. Protobuf used to allow this by default, but now you have to set a special option, , for the compiler to not reject it. This can be used to effectively rename values without breaking existing code: This works perfectly fine, and is fully wire-compatible! And unlike renaming a field (see TotW #1 ), it won’t result in source code breakages. But if you use either reflection or JSON, or a runtime like Java that doesn’t cleanly allow enums with multiple names, you’ll be in for a nasty surprise. For example, if you request an enum value from an enum using reflection, such as with , the value you’ll get is the one that appears first in the file, lexically. In fact, both and return the same value, leading to potential confusion, as the old “bad” value will be used in printed output like logs. You might think, “oh, I’ll switch the order of the aliases”. But that would be an actual wire format break. Not for the binary format, but for JSON. That’s because JSON preferentially stringifies enum values by using their declared name (if the value is in range). So, reordering the values means that what once serialized as now serializes as . If an old binary that hasn’t had the new enum value added sees this JSON document, it won’t parse correctly, and you’ll be in for a bad time.
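The renaming trick described above (whose code sample was stripped from this dump) might look like this — a sketch with made-up names:

```protobuf
syntax = "proto3";
package example.v1;

enum Status {
  // Required, or the compiler rejects the duplicate number.
  option allow_alias = true;

  STATUS_UNSPECIFIED = 0;
  STATUS_BAD_OLD_NAME = 1;  // kept so existing code still compiles
  STATUS_NEW_NAME = 1;      // alias: same number, better name
}
```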
You can argue that this is a language bug, and it kind of is. Protobuf should include an equivalent of for enum values, or mandate that JSON should serialize enum values with multiple names as a number, rather than an arbitrarily chosen enum name. The feature is intended to allow renaming of enum values, but unfortunately Protobuf hobbled it enough that it’s pretty dangerous. Instead, if you really need to rename an enum value for usability or compliance reasons (ideally, not just aesthetics) you’re better off making a new enum type in a new version of your API. As long as the enum value numbers are the same, it’ll be binary-compatible, but it will somewhat reduce the risk of the above JSON confusion. Buf provides a lint rule against this feature, , and Protobuf requires that you specify a magic option to enable this behavior, so in practice you don’t need to worry about this. But remember, the consequences of enum aliases go much further than JSON—they affect anything that uses reflection. So even if you don’t use JSON, you can still get burned.

mcyoung 6 months ago

Protobuf Tip #5: Avoid import public/weak

My dad had a guitar but it was acoustic, so I smashed a mirror and glued broken glass to it to make it look more metal. It looked ridiculous! –Max Cavalera TL;DR: Avoid and . The Buf lint rules and enforce this for you by default. I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think it fits with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog . Protobuf s allow you to specify two special modes: and . The Buf CLI lints against these by default, but you might be tempted to try using them anyway, especially because some GCP APIs use . What are these modes, and why do they exist? Protobuf imports are by file path, a fact that is very strongly baked into the language and its reflection model. Importing a file dumps all of its symbols into the current file. For the purposes of name resolution, it’s as if all of the declarations in that file have been pasted into the current file. However, this isn’t transitive: if imports … and imports … and defines …, then must import to refer to , even though imports it. This is similar to how importing a package as works in Go. When you write , it dumps all of the declarations from the package into the current file, but not those of any files that imports. Now, what’s nice about Go is that packages can be broken up into files in a way that is transparent to users; users of a package import the package , not the files of that package. Unfortunately, Protobuf is not like that, so the file structure of a package leaks to its callers. was intended as a mechanism for allowing API writers to break up files that were getting out of control. You can define a new file for some of the definitions in , move them to the new file, and then add to . Existing imports of won’t be broken, hooray! Except this feature was designed for C++. In C++, each file maps to a header, which you in your application code.
In C++, behaves like , so marking an import as only changes name resolution in Protobuf—the C++ backend doesn’t have to do anything to maintain source compatibility when an import is changed to . But other backends, like Go, do not work this way: in Go doesn’t pull in symbols transitively, so Go would need to explicitly add aliases for all of the symbols that come in through a public import. That is, if you had: Then the Go backend has to generate a in to emulate this behavior (in fact, I was surprised to learn Go Protobuf implements this at all). Go happens to implement public imports correctly, but not all backends are as careful, because this feature is obscure. The example of an isn’t even used for breaking up an existing file; instead, it’s used to not make a huge file bigger and avoid making callers have to add an additional import. This is a bad use of a bad feature! Using to effectively “hide” imports makes it harder to understand what a file is pulling in. If Protobuf imports were at the package/symbol level, like Go or Java, this feature would not need to exist. Unfortunately, Protobuf is closely tailored for C++, and this is one of the consequences. Instead of using to break up a file, simply plan to break up the file in the next version of the API. The Buf lint rule enforces that no one uses this feature by default. It’s tempting, but the footguns aren’t worth it. Public imports have a good, if flawed, reason to exist. Their implementation details are the main thing that kneecaps them. Weak imports, however, simply should not exist. They were added to the language to make it easier for some of Google’s enormous binaries to avoid running out of linker memory, by making it so that message types could be dropped if they weren’t accessed. This means that weak imports are “optional”—if the corresponding descriptors are missing at runtime, the C++ runtime can handle it gracefully.
This leads to all kinds of implementation complexity and subtle behavior differences across runtimes. Most runtimes implement (or implemented, in the case of those that removed support) in a buggy or inconsistent way. It’s unlikely the feature will ever be truly removed, even though Google has tried. Don’t use . It should be treated as completely non-functional. The Buf lint rule takes care of this for you.
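The file-splitting move described above might look like this, with hypothetical file names:

```protobuf
// example/v1/api.proto — the file everyone already imports.
syntax = "proto3";
package example.v1;

// Re-exports every symbol that moved to details.proto, so existing
// `import "example/v1/api.proto";` statements keep resolving.
import public "example/v1/details.proto";

message StillHere {}
```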

mcyoung 7 months ago

Protobuf Tip #4: Accepting Mistakes We Can’t Fix

Bad humor is an evasion of reality; good humor is an acceptance of it. –Malcolm Muggeridge TL;DR: Protobuf’s distributed nature introduces evolution risks that make it hard to fix some types of mistakes. Sometimes the best thing to do is to just let it be. I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think it fits with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog . Often, you’ll design and implement a feature for the software you work on, and despite your best efforts to test it, something terrible happens in production. We have a playbook for this, though: fix the bug in your program and ship or deploy the new, fixed version to your users. It might mean working late for big emergencies, but turnaround for most organizations is a day to a week. Most bugs aren’t emergencies, though. Sometimes a function has a confusing name, or an integer type is just a bit too small for real-world data, or an API conflates “zero” and “null”. You fix the API, refactor all usages in one commit, merge, and the fix rolls out gradually. Unless, of course, it’s a bug in a communication API, like a serialization format: your Protobuf types, or your JSON schema, or the not-too-pretty code that parses fields out of a dict built from a YAML file. Here, you can’t just atomically fix the world. Fixing bugs in your APIs (from here on, “APIs” means “Protobuf definitions”) requires a different mindset than fixing bugs in ordinary code. Protobuf’s wire format is designed so that you can safely add new fields to a type, or values to an enum, without needing to perform an atomic upgrade. But other changes, like renaming fields or changing their type, are very dangerous.
This is because Protobuf types exist on a temporal axis: different versions of the same type exist simultaneously among programs in the field that are actively talking to each other. This means that writers from the future (that is, new serialization code) must be careful to not confuse the many readers from the past (old versions of the deserialization code). Conversely, future readers must tolerate anything past writers produce. In a modern distributed deployment, the number of versions that exist at once can be quite large. This is true even in self-hosted clusters, but becomes much more fraught whenever user-upgradable software is involved. This can include mobile applications that talk to your servers, or appliance software managed by a third-party administrator, or even just browser-service communication. The most important principle: you can’t easily control when old versions of a type or service are no longer relevant. As soon as a type escapes out of the scope of even a single team, upgrading types becomes a departmental effort. There are many places where Protobuf could have made schema evolution easier, but didn’t. For example, changing to is a breakage, even though at the wire format level, it is possible for a parser to distinguish and accept both forms of correctly. There are too many other examples to list, but it’s important to understand that the language is not always working in our favor. For example, if we notice a value is too small, and should have been 64-bit, you can’t upgrade it without readers from the past potentially truncating it. But we really have to upgrade it! What are our options? We could issue a new version of the message and all of its dependencies; this is the main reason why sticking a version number in the package name, as enforced by Buf’s lint rule, is so important. Or we could do the upgrade anyway and hope nothing breaks; this can work for certain kinds of upgrades, if the underlying format is compatible, but it can have disastrous consequences if you don’t know what you’re doing, especially if it’s a type that’s not completely internal to a team’s project. Buf breaking change detection helps you avoid changes with potential for breakage. Of course, there is a third option, which is to accept that some things aren’t worth fixing. When the cost of a fix is so high, fixes just aren’t worth it, especially when the language is working against us. This means that even in Buf’s own APIs, we sometimes do things in a way that isn’t quite ideal, or is inconsistent with our own best practices.
Sometimes, the ecosystem changes in a way that changes best practice, but we can’t upgrade to it without breaking our users. In the same way, you shouldn’t rush to use new, better language features if they would cause protocol breaks: sometimes, the right thing is to do nothing, because not breaking your users is more important.
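For the too-small-integer example above, the "new version" option looks like this in miniature (names are hypothetical):

```protobuf
// example/v2/order.proto — a new major version of the package, so
// the widened type can never confuse readers of example.v1.Order.
syntax = "proto3";
package example.v2;

message Order {
  int64 quantity = 1;  // was int32 in example.v1.Order
}
```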

mcyoung 7 months ago

Protobuf Tip #3: Enum Names Need Prefixes

Smart people learn from their mistakes. But the real sharp ones learn from the mistakes of others. –Brandon Mull TL;DR: s inherit some unfortunate behaviors from C++. Use the Buf lint rules and  to avoid this problem (they’re part of the category). I’m editing a series of best practice pieces on Protobuf, a language that I work on which has lots of evil corner-cases. These are shorter than what I typically post here, but I think it fits with what you, dear reader, come to this blog for. These tips are also posted on the buf.build blog . Protobuf’s s define data types that represent a small set of valid values. For example, lists status codes used by various RPC frameworks, such as GRPC. Under the hood, every is just an  on the wire, although codegen backends will generate custom types and constants for the enum to make it easier to use. Unfortunately, s were originally designed to match C++ enums exactly, and they inadvertently replicate many of those behaviors. If you look at the source for , and compare it to, say, , you will notice a subtle difference: has values starting with , but ‘s values don’t have a prefix.  This is because the fully-qualified name (FQN) of an enum value doesn’t include the name of the enum. That is, actually refers to . Thus, is not , but . This is because it matches the behavior of unscoped C++ enums. C++ is the “reference” implementation, so the language often bends for the sake of the C++ backend. When generating code, ’s C++ backend emits the above as follows: And in C++, s don’t scope their enumerators: you write , NOT . If you know C++, you might be thinking, “why didn’t they use ?!” Enums were added in , which was developed around 2007-2008, but Google didn’t start using C++11, which introduced , until much, much later. Now, if you’re a Go or Java programmer, you’re probably wondering why you even care about C++. Both Go and Java do scope enum values to the enum type (although Go does it in a somewhat grody way: ).
Unfortunately, this affects name collision detection in Protobuf. You can’t write the following code: Because the enum name is not part of the FQN for an enum value, both s here have the FQN , so Protobuf complains about duplicate symbols. Thus, the convention we see in : Buf provides a lint rule to enforce this convention: . Even though you might think that an enum name will be unique, because top-level enums bleed their names into the containing package, the problem spreads across packages! relies heavily on the concept of “zero values” – all non-message fields that are neither nor are implicitly zero if they are not present. Thus, requires that enums specify a value equal to zero. By convention, this value shouldn’t be a specific value of the enum, but rather a value representing that no value is specified. enforces this, with a default of . Of course, there are situations where this might not make sense for you, and a suffix like or might make more sense. It may be tempting to have a specific “good default” value for the zero value. Beware though, because that choice is forever. Picking a generic “unknown” as the default reduces the chance you’ll burn yourself. Name prefixes and zero values also teach us an important lesson: because Protobuf names are forever, it’s really hard to fix style mistakes, especially as we collectively get better at using Protobuf. is intended to be source-compatible with very old existing C++ code, so it throws caution to the wind. doesn’t have a zero value because in , which doesn’t have zero value footguns in its wire format, you don’t need to worry about that. The lesson isn’t just to use Buf’s linter to try to avoid some of the known pitfalls, but also to remember that even APIs designed by the authors of the language make unfixable mistakes, so unlike other programming languages, imitating “existing practice” isn’t always the best strategy.
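The rejected program mentioned above ("you can't write the following code") is presumably something like this reconstructed sketch:

```protobuf
syntax = "proto3";
package example.v1;

enum Foo {
  UNSPECIFIED = 0;  // FQN is example.v1.UNSPECIFIED, not example.v1.Foo.UNSPECIFIED
}

enum Bar {
  UNSPECIFIED = 0;  // error: "UNSPECIFIED" is already defined in example.v1
}
```

With the conventional FOO_ and BAR_ prefixes, the two values no longer collide.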
