mcyoung 1 month ago

Why SSA?

If you’ve read anything about compilers in the last two decades or so, you have almost certainly heard of SSA compilers, a popular architecture featured in many optimizing compilers, including ahead-of-time compilers such as LLVM, GCC, Go, CUDA (and various shader compilers), Swift 1, and MSVC 2, and just-in-time compilers such as HotSpot C2 3, V8 4, SpiderMonkey 5, LuaJIT, and the Android Runtime 6. SSA is hugely popular, to the point that most compiler projects no longer bother with other IRs for optimization 7. This is because SSA is incredibly nimble at the types of program analysis and transformation that compiler optimizations want to do on your code. But why? Many of my friends who don’t do compilers often say that compilers seem like opaque magical black boxes, and SSA, as it often appears in the literature, is impenetrably complex. But it’s not! SSA is actually very simple once you forget everything you think your programs are actually doing. We will develop the concept of SSA form, a simple SSA IR, prove facts about it, and design some optimizations on it. I have previously written about the granddaddy of all modern SSA compilers, LLVM. This article is about SSA in general, and won’t really have anything to do with LLVM. However, it may be helpful to read that article to make some of the things in this article feel more concrete. SSA is a property of intermediate representations (IRs), primarily used by compilers for optimizing imperative code that targets a register machine. Register machines are computers that feature a fixed set of registers that can be used as the operands for instructions: this includes virtually all physical processors, including CPUs, GPUs, and weird things like DSPs.
SSA is most frequently found in compiler middle-ends, the optimizing component between the frontend (which deals with the surface language programmers write, and lowers it into the middle-end’s IR), and the backend (which takes the optimized IR and lowers it into the target platform’s assembly). SSA IRs, however, often have little resemblance to the surface language they lower out of, or the assembly language they target. This is because neither of these representations makes it easy for a compiler to intuit optimization opportunities. Imperative code consists of a sequence of operations that mutate the executing machine’s state to produce a desired result. For example, consider the following C program: This program returns no matter what its input is, so we can optimize it down to this: But, how would you write a general algorithm to detect that all of the operations cancel out? You’re forced to keep in mind program order to perform the necessary dataflow analysis, following mutations of and through the program. But this isn’t very general, and traversing all of those paths makes the search space for large functions very big. Instead, you would like to rewrite the program such that and gradually get replaced with the expression that calculates the most recent value, like this: Then we can replace each occurrence of a variable with its right-hand side recursively… Then fold the constants together… And finally, we see that we’re returning , and can replace it with . All the other variables are now unused, so we can delete them. The reason this works so well is because we took a function with mutation, and converted it into a combinatorial circuit, a type of digital logic circuit that has no state, and which is very easy to analyze. The dependencies between nodes in the circuit (corresponding to primitive operations such as addition or multiplication) are obvious from its structure.
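The rename-then-fold procedure above can be sketched concretely. This is a hedged sketch with an invented statement encoding (tuples of destination, opcode, operands), not any real IR: every assignment gets a fresh versioned register, after which constant folding is a single forward pass.

```python
# Hedged sketch (encodings invented): turn a straight-line program with
# mutation into single-assignment form by versioning names, then fold
# every register whose operands are all known constants.
OPS = {"const": lambda a: a, "add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def to_ssa(stmts):
    """stmts: list of (var, op, operands); operands are var names or ints.
    Each assignment gets a fresh register, e.g. the second write to x is x.2."""
    version = {}
    out = []
    for var, op, args in stmts:
        # Rewrite operands to the *current* version of each variable.
        args = [f"{a}.{version[a]}" if isinstance(a, str) else a for a in args]
        version[var] = version.get(var, 0) + 1
        out.append((f"{var}.{version[var]}", op, args))
    return out

def fold(ssa):
    """Evaluate every register whose operands are all known constants."""
    known = {}
    for reg, op, args in ssa:
        vals = [known.get(a, a) for a in args]
        if all(isinstance(v, int) for v in vals):
            known[reg] = OPS[op](*vals)
    return known

# x = 1; x = x + 2; y = x * 3
prog = to_ssa([("x", "const", [1]), ("x", "add", ["x", 2]), ("y", "mul", ["x", 3])])
```

Because every register now has exactly one definition, substituting a register by its defining expression can never change the program's meaning — the "accidentally duplicated squaring" problem described below cannot arise.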
For example, consider the following circuit diagram for a one-bit multiplier: This graph representation of a program has two huge benefits: The powerful tools of graph theory can be used to algorithmically analyze the program and discover useful properties, such as operations that are independent of each other or whose results are never used. The operations are not ordered with respect to each other except when there is a dependency; this is useful for reordering operations, something compilers really like to do. The reason combinatorial circuits are the best circuits is because they are directed acyclic graphs (DAGs), which admit really nice algorithms. For example, longest path in a graph is NP-hard (and because P ≠ NP 8, has complexity O(2^n)). However, if the graph is a DAG, it admits an O(n) solution! To understand this benefit, consider another program: Suppose we wanted to replace each variable with its definition like we did before. We can’t just replace each constant variable with the expression that defines it though, because we would wind up with a different program! Now, we pick up an extra term because the squaring operation is no longer unused! We can put this into circuit form, but it requires inserting new variables for every mutation. But we can’t do this when complex control flow is involved! So all of our algorithms need to carefully account for mutations and program order, meaning that we don’t get to use the nice graph algorithms without careful modification. SSA stands for “static single assignment”, and was developed in the 80s as a way to enhance the existing three-address code (where every statement is in the form ) so that every program was circuit-like, using a very similar procedure to the one described above. The SSA invariant states that every variable in the program is assigned to by precisely one operation.
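As an aside, the longest-path claim above is easy to make concrete. This is a hedged sketch (graph encoding invented): on a DAG, visiting nodes in topological order yields longest paths in linear time, something believed impossible for general graphs.

```python
# Hedged sketch: longest path (in edges) of a DAG in linear time by visiting
# nodes in topological order, so every predecessor is finished first.
from graphlib import TopologicalSorter

def longest_path_length(dag):
    """dag: {node: set of successor nodes}; must be acyclic."""
    # TopologicalSorter takes a predecessor mapping, so invert the edges.
    preds = {n: set() for n in dag}
    for n, succs in dag.items():
        for s in succs:
            preds[s].add(n)
    dist = {}
    for n in TopologicalSorter(preds).static_order():
        # Longest path ending at n is one more than the best over its preds.
        dist[n] = max((dist[p] + 1 for p in preds[n]), default=0)
    return max(dist.values(), default=0)
```

A diamond `a → {b, c} → d → e` has longest path 3, found in a single pass over the nodes and edges.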
If every operation in the program is visited once, they form a combinatorial circuit. Transformations are required to respect this invariant. In circuit form, a program is a graph where operations are nodes, and “registers” (which is what variables are usually called in SSA) are edges (specifically, each output of an operation corresponds to a register). But, again, control flow. We can’t hope to circuitize a loop, right? The key observation of SSA is that most parts of a program are circuit-like. A basic block is a maximal circuital component of a program. Simply put, it is a sequence of non-control flow operations, and a final terminator operation that transfers control to another basic block. The basic blocks themselves form a graph, the control flow graph , or CFG. This formulation of SSA is sometimes called SSA-CFG 9 . This graph is not a DAG in general; however, separating the program into basic blocks conveniently factors out the “non-DAG” parts of the program, allowing for simpler analysis within basic blocks. There are two equivalent formalisms for SSA-CFG. The traditional one uses special “phi” operations (often called phi nodes , which is what I will call them here) to link registers across basic blocks. This is the formalism LLVM uses. A more modern approach, used by MLIR, is block arguments : each basic block specifies parameters, like a function, and blocks transferring control flow to it must pass arguments of those types to it. Let’s look at some code. First, consider the following C function which calculates Fibonacci numbers using a loop. How might we express this in an SSA-CFG IR? Let’s start inventing our SSA IR! It will look a little bit like LLVM IR, since that’s what I’m used to looking at. Every block ends in a , which transfers control to one of several possible blocks. In the process, it calls that block with the given arguments. One can think of a basic block as a tiny function which tails 10 into other basic blocks in the same function. 
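The "basic block as a tiny function that tails into its successors" intuition can be sketched directly. This is a hedged illustration, not the article's IR: block names (`loop`, `body`, `exit`) and the encoding are invented, loosely mirroring the Fibonacci example, with each block consuming its block arguments and "tail-calling" a successor by handing over a name and an argument tuple.

```python
# Hedged sketch: each basic block reads its block arguments and transfers
# control by naming a successor block and the arguments to pass it.
def fib(n):
    # entry: jump to the loop header with initial block arguments a=0, b=1, i=0.
    block, args = "loop", (0, 1, 0)
    while True:
        if block == "loop":
            a, b, i = args
            # Conditional terminator: loop again or exit with the result.
            block, args = ("body", (a, b, i)) if i < n else ("exit", (a,))
        elif block == "body":
            a, b, i = args
            # goto loop(b, a + b, i + 1): the next iteration's block arguments.
            block, args = "loop", (b, a + b, i + 1)
        elif block == "exit":
            return args[0]
```

Notice that no variable is ever mutated across iterations; the loop-carried state travels entirely through the block arguments, which is exactly what the SSA form of the loop will do.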
aside Phi Nodes LLVM IR is… older, so it uses the older formalism of phi nodes. “Phi” comes from “phony”, because it is an operation that doesn’t do anything; it just links registers from predecessors. A phi operation is essentially a switch-case on the predecessors, each case selecting a register from that predecessor (or an immediate). For example, has two predecessors, the implicit entry block , and . In a phi node IR, instead of taking a block argument for , it would specify . The value of the operation is the value from whichever block jumped to this one. This can be awkward to type out by hand and read, but it is a more convenient representation for describing algorithms (just “add a phi node” instead of “add a parameter and a corresponding argument”) and for the in-memory representation; it is otherwise completely equivalent. It’s a bit easier to understand the transformation from C to our IR if we first rewrite the C to use goto instead of a for loop: However, we still have mutation in the picture, so this isn’t SSA. To get into SSA, we need to replace every assignment with a new register, and somehow insert block arguments… The above IR code is already partially optimized; the named variables in the C program have been lifted out of memory and into registers. If we represent each named variable in our C program with a pointer, we can avoid needing to put the program into SSA form immediately. This technique is used by frontends that lower into LLVM, like Clang. We’ll enhance our IR by adding a declaration for functions, which defines scratch space on the stack for the function to use. Each stack slot produces a pointer that we can load from and store to. Our Fibonacci function would now look like so: Any time we reference a named variable, we load from its stack slot, and any time we assign it, we store to that slot. This is very easy to get into from C, but the code sucks because it’s doing lots of unnecessary pointer operations.
How do we get from this to the register-only function I showed earlier? aside Program Order We want program order to not matter for the purposes of reordering, but as we’ve written code here, program order does matter: loads depend on prior stores, but stores don’t produce a value that can be used to link the two operations. We can restore independence from program order by introducing operands representing an “address space”; loads and stores take an address space as an argument, and stores return a new address space. An address space, or , represents the state of some region of memory. Loads and stores are independent when they are not connected by a argument. This type of enhancement is used by Go’s SSA IR, for example. However, it adds a layer of complexity to the examples, so instead I will hand-wave this away. Now we need to prove some properties about CFGs that are important for the definition and correctness of our optimization passes. First, some definitions. The predecessors (or “preds”) of a basic block are the set of blocks with an outgoing edge to that block. A block may be its own predecessor. Some literature calls the above “direct” or immediate predecessors. For example, the preds of in our example are (the special name for the function entry-point) . The successors (no, not “succs”) of a basic block are the set of blocks with an outgoing edge from that block. A block may be its own successor. The successors of are and . The successors are listed in the loop’s . If a block is a transitive pred of a block , we say that weakly dominates , or that it is a weak dominator of . For example, , and both weakly dominate . However, this is not usually an especially useful relationship. Instead, we want to speak of dominators: A block is a dominator (or dominates ) if every pred of is dominated by , or if is itself. Equivalently, the dominator set of is the intersection of the dominator sets of its preds, plus .
The dominance relation has some nice order properties that are necessary for defining the core graph algorithms of SSA. We only consider CFGs which are flowgraphs, that is, CFGs in which all blocks are reachable from the root block , which has no preds. This is necessary to eliminate some pathological graphs from our proofs. Importantly, we can always ask for an acyclic path 11 from to any block . An equivalent way to state the dominance relationship is that every path from to contains all of ’s dominators. proposition dominates iff every path from to contains . First, assume every to path contains . If is , we’re done. Otherwise we need to prove each predecessor of is dominated by ; we do this by induction on the length of acyclic paths from to . Consider preds of that are not , and consider all acyclic paths p from to ; by appending to them, we have an acyclic path p′ from to , which must contain . Because both the last and second-to-last elements of this are not , it must be within the path p, which is shorter than p′. Thus, by induction, dominates , and therefore . Going the other way, suppose dominates , and consider a path p from to . The second-to-last element of p is a pred of ; if it is we are done. Otherwise, we can consider the path p′ made by deleting at the end. is dominated by , and p′ is shorter than p, so we can proceed by induction as above. Onto those nice properties. Dominance allows us to take an arbitrarily complicated CFG and extract from it a DAG, composed of blocks ordered by dominance. The dominance relation is a partial order. Dominance is reflexive and transitive by definition, so we only need to show distinct blocks can’t dominate each other. Suppose distinct and dominate each other. Pick an acyclic path p from to . Because dominates , there is a prefix p′ of this path ending in . But because dominates , some prefix p′′ of p′ ends in .
But now p must contain twice, contradicting that it is acyclic. This allows us to write when dominates . There is an even more refined graph structure that we can build out of dominators, which follows immediately from the partial order theorem. The dominators of a basic block are totally ordered by the dominance relation. Suppose and , but neither dominates the other. Then, there must exist acyclic paths from to which contain both, but in different orders. Take the subpaths of those paths which follow , and , neither of which contains . Concatenating these paths yields a path from to that does not contain , a contradiction. This tells us that the DAG we get from the dominance relation is actually a tree, rooted at . The parent of a node in this tree is called its immediate dominator . Computing dominators can be done iteratively: the dominator set of a block is the intersection of the dominator sets of its preds, plus . This algorithm runs in quadratic time. A better algorithm is the Lengauer-Tarjan algorithm. It is relatively simple, but explaining how to implement it is a bit out of scope for this article. I found a nice treatment of it here. What’s important is we can compute the dominator tree without breaking the bank, and given any node, we can ask for its immediate dominator. Using immediate dominators, we can introduce the final, important property of dominators. The dominance frontier of a block is the set of all blocks not dominated by with at least one pred which dominates. These are points where control flow merges from distinct paths: one containing and one not. The dominance frontier of is , whose preds are and . There are many ways to calculate dominance frontiers, but with a dominance tree in hand, we can do it like this: algorithm Dominance Frontiers. For each block with more than one pred, for each of its preds, let be that pred. Add to the dominance frontier of and all of its dominators, stopping when encountering ’s immediate dominator.
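The iterative dominator computation and the frontier algorithm just described can be sketched as follows. This is a hedged, deliberately quadratic sketch; the CFG shape (`entry → loop → {body, exit}`, `body → loop`) mirrors the Fibonacci example, but the block names and encodings are invented.

```python
# Hedged sketch of the quadratic iterative algorithms described above.
def dominators(preds, root):
    """dom(b) = intersection of dom over b's preds, plus b, run to a fixpoint.
    Assumes a flowgraph: every block is reachable from root."""
    blocks = set(preds)
    dom = {b: ({root} if b == root else set(blocks)) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks - {root}:
            new = set.intersection(*(dom[p] for p in preds[b])) | {b}
            if new != dom[b]:
                dom[b], changed = new, True
    return dom

def immediate_dominators(dom):
    """The dominators of b are totally ordered, so b's immediate dominator is
    the strict dominator with the largest dominator set (the closest one)."""
    return {
        b: max(ds - {b}, key=lambda d: len(dom[d]))
        for b, ds in dom.items() if ds - {b}
    }

def dominance_frontiers(preds, idom):
    """For each join point b, walk up the dominator tree from each pred,
    adding b to each frontier, stopping at b's immediate dominator."""
    df = {b: set() for b in preds}
    for b, ps in preds.items():
        if len(ps) < 2:
            continue
        for runner in ps:
            while runner != idom[b]:
                df[runner].add(b)
                runner = idom[runner]
    return df

preds = {"entry": [], "loop": ["entry", "body"], "body": ["loop"], "exit": ["loop"]}
dom = dominators(preds, "entry")
idom = immediate_dominators(dom)
df = dominance_frontiers(preds, idom)
```

On this CFG the loop header is its own dominance frontier (the back-edge from `body` merges with the path from `entry`), which is exactly where loop-carried block arguments will be needed.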
We need to prove that every block examined by the algorithm winds up in the correct frontiers. First, we check that every examined block is added to the correct frontier. If , where is a pred of , and a is ’s immediate dominator, then if , is not in its frontier, because must dominate . Otherwise, must be in ’s frontier, because dominates a pred but it cannot dominate , because then it would be dominated by , a contradiction. Second, we check that every frontier is complete. Consider a block . If an examined block is in its frontier, then must be among the dominators of some pred , and it must be dominated by ’s immediate dominator; otherwise, would dominate (and thus would not be in its frontier). Thus, gets added to ’s frontier. You might notice that all of these algorithms are quadratic. This is actually a very good time complexity for a compilers-related graph algorithm. Cubic and quartic algorithms are not especially uncommon, and yes, your optimizing compiler’s time complexity is probably cubic or quartic in the size of the program! Ok. Let’s construct an optimization. We want to figure out if we can replace a load from a pointer with the most recent store to that pointer. This will allow us to fully lift values out of memory by cancelling out store/load pairs. This will make use of yet another implicit graph data structure. The dataflow graph is the directed graph made up of the internal circuit graphs of each basic block, connected along block arguments. To follow a use-def chain is to walk this graph forward from an operation to discover operations that potentially depend on it, or backwards to find operations it potentially depends on. It’s important to remember that the dataflow graph, like the CFG, does not have a well-defined “up” direction. Navigating it and the CFG requires the dominator tree. One other important thing to remember here is that every instruction in a basic block always executes if the block executes.
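Walking the use-def chain forward is itself a tiny graph traversal. This is a hedged sketch of one direction of it, with all encodings and register names invented: we map each register to the operations that read it, and flood outward from a def.

```python
# Hedged sketch: one direction of the dataflow graph, mapping each register
# to the (opcode, result) pairs that read it. Walking it forward finds every
# register that could transitively depend on a given def.
def transitive_uses(users, start):
    """users: register -> list of (opcode, result register) pairs reading it."""
    seen, stack = set(), [start]
    while stack:
        for _opcode, result in users.get(stack.pop(), ()):
            if result not in seen:
                seen.add(result)
                stack.append(result)
    return seen

# v1 feeds an add and a mul; the add's result feeds an address computation.
users = {"v1": [("add", "v2"), ("mul", "v3")], "v2": [("gep", "v4")]}
```

Walking backwards (use to def) is the mirror image, following each operand to its unique defining operation, which the SSA invariant guarantees exists.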
In much of this analysis, we need to appeal to “program order” to select the last load in a block, but we are always able to do so. This is an important property of basic blocks that makes them essential for constructing optimizations. For a given , we want to identify all loads that depend on it. We can follow the use-def chain of to find which blocks contain loads that potentially depend on the store (call it ). First, we can eliminate loads within the same basic block (call it ). Replace all instructions after (but before any other s, in program order) with ’s def. If is not the last store in this block, we’re done. Otherwise, follow the use-def chain of to successors which use , i.e., successors whose case has as at least one argument. Recurse into those successors, and now replacing the pointer of interest with the parameters of the successor which were set to (more than one argument may be ). If successor loads from one of the registers holding , replace all such loads before a store to . We also now need to send into somehow. This is where we run into something of a wrinkle. If has exactly one predecessor, we need to add a new block argument to pass whichever register is holding (which exists by induction). If is already passed into by another argument, we can use that one. However, if has multiple predecessors, we need to make sure that every path from to sends , and canonicalizing those will be tricky. Worse still, if is in ’s domination frontier, a different store could be contributing to that load! For this reason, dataflow from stores to loads is not a great strategy. Instead, we’ll look at dataflow from loads backwards to stores (in general, dataflow from uses to defs tends to be more useful), which we can use to augment the above forward dataflow analysis to remove the complex issues around domination frontiers. Let’s analyze loads instead. For each in , we want to determine all stores that could potentially contribute to its value. 
We can find those stores as follows: We want to be able to determine which register in a given block corresponds to the value of , and then find its last store in that block. To do this, we’ll flood-fill the CFG backwards in BFS order. This means that we’ll follow preds (through the use-def chain) recursively, visiting each pred before visiting their preds, and never revisiting a basic block (except we may need to come back to at the end). Determining the “equivalent” 12 of in (we’ll call it ) can be done recursively: while examining , follow the def of . If is a block parameter, for each pred , set to the corresponding argument in the case in ’s . Using this information, we can collect all stores that the load potentially depends on. If a predecessor stores to , we add the last such store in (in program order) to our set of stores, and do not recurse to ’s preds (because this store overwrites all past stores). Note that we may revisit in this process, and collect a store to from it if one occurs in the block. This is necessary in the case of loops. The result is a set of pairs. In the process, we also collected a set of all blocks visited, , which are dominators of which we need to plumb a through. This process is called memory dependency analysis, and is a key component of many optimizations. Not all contributing operations are stores. Some may be references to globals (which we’re disregarding), or function arguments or the results of a function call (which means we probably can’t lift this load). For example, if gets traced all the way back to a function argument, there is a code path which loads from a pointer whose stores we can’t see. It may also trace back to a stack slot that is potentially not stored to. This means there is a code path that can potentially load uninitialized memory. Like LLVM, we can assume this is not observable behavior, so we can discount such dependencies.
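The backward flood-fill can be sketched in a heavily simplified form. This is a hedged sketch that ignores the per-block register "equivalents" and simply records, per block, the last store to the slot of interest; all names and encodings are invented.

```python
# Hedged, simplified sketch of the backward BFS for memory dependencies:
# walk preds, stopping at any block that stores to the slot (that store
# overwrites all earlier ones on that path).
from collections import deque

def contributing_stores(preds, last_store, load_block):
    """preds: block -> list of preds; last_store: block -> its last store to
    the slot, if any. Returns the (block, store) pairs the load may observe."""
    found, visited = set(), set()
    queue = deque([load_block])
    while queue:
        b = queue.popleft()
        for p in preds[b]:
            if p in last_store:
                found.add((p, last_store[p]))   # store kills older stores
            elif p not in visited:
                visited.add(p)
                queue.append(p)
    return found

# Fibonacci-shaped CFG: a load in `loop` can see the initializing store in
# `entry` (first iteration) or the one in `body` (later iterations).
preds = {"entry": [], "loop": ["entry", "body"], "body": ["loop"], "exit": ["loop"]}
deps = contributing_stores(preds, {"entry": "store.init", "body": "store.next"}, "loop")
```

Note that the loop's own body is allowed to be re-reached through the back-edge, which is how loop-carried stores end up in the dependency set.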
If all of the dependencies are uninitialized loads, we can potentially delete not just the load, but operations which depend on it (reverse dataflow analysis is the origin of so-called “time-traveling” UB). Now that we have the full set of dependency information, we can start lifting loads. Loads can be safely lifted when all of their dependencies are stores in the current function, or dependencies we can disregard thanks to UB in the surface language (such as loads or uninitialized loads). There is a lot of fuss in this algorithm about plumbing values through block arguments. A lot of IRs make a simplifying change, where every block implicitly receives the registers from its dominators as block arguments. I am keeping the fuss because it makes it clearer what’s going on, but in practice, most of this plumbing, except at dominance frontiers, would be happening in the background. Suppose we can safely lift some load. Now we need to plumb the stored values down to the load. For each block in (all other blocks will now be in unless stated otherwise), we will be building two mappings: one , which is the register equivalent to in that block, and a map , which is the value that must have in that block. Prepare a work queue, with each in it initially. Pop a block from the queue. For each successor (in ): If isn’t already defined, add it as a block argument. Have pass to that argument. If hasn’t been visited yet, and isn’t the block containing the load we’re deleting, add it to the queue. Once we’re done, if is the block that contains the load, we can now replace all loads to before any stores to with . There are cases where this whole process can be skipped, by applying a “peephole” optimization. For example, stores followed by loads within the same basic block can be optimized away locally, leaving the heavy-weight analysis for cross-block store/load pairs. Here’s the result of doing dependency analysis on our Fibonacci function.
Each load is annotated with the blocks and stores in . Let’s look at . Its contributing stores are in and . So we add a new parameter : in , we call that parameter with (since that’s stored to it in ), while in , we pass . What about L4? The contributing stores are also in and , but one of those isn’t a pred of . is also in the subgraph for this load, though. So, starting from , we add a new parameter to and feed (the stored value, an immediate this time) through it. Now looking at , we see there is already a parameter for this load ( ), so we just pass as that argument. Now we process , which pushed onto the queue. gets a new parameter , which is fed ’s own . We do not re-process , even though it also appears in ’s gotos, because we already visited it. After doing this for the other two loads, we get this: After lifting, if we know that a stack slot’s pointer does not escape (i.e., none of its uses wind up going into a function call 13 or a write to a global, or into a pointer that escapes), we can delete every store to that pointer. If we delete every store to a stack slot, we can delete the stack slot altogether (there should be no loads left for that stack slot at this point). This analysis is simple, because it assumes pointers do not alias in general. Alias analysis is necessary for more accurate dependency analysis. This is necessary, for example, for lifting loads of fields of structs through subobject pointers, and dealing with pointer arithmetic in general. However, our dependency analysis is robust to passing different pointers as arguments to the same block from different predecessors. This is the case that is specifically handled by all of the fussing about with dominance frontiers. This robustness ultimately comes from SSA’s circuital nature. Similarly, this analysis needs to be tweaked to deal with something like (a ternary, essentially).
s of pointers need to be replaced with s of the loaded values, which means we need to do the lifting transformation “all at once”: lifting some liftable loads will leave the IR in an inconsistent state, until all of them have been lifted. Many optimizations will make a mess of the CFG, so it’s useful to have simple passes that “clean up” the mess left by transformations. Here are some easy examples. If an operation’s result has zero uses, and the operation has no side-effects, it can be deleted. This allows us to then delete operations that it depended on that now have no uses. Doing this is very simple, due to the circuital nature of SSA: collect all instructions whose outputs have zero uses, and delete them. Then, examine the defs of their operands; if those operations now have no uses, delete them, and recurse. This bubbles up all the way to block arguments. Deleting block arguments is a bit trickier, but we can use a work queue to do it. Put all of the blocks into a work queue. Pop a block from the queue. Run unused result elimination on its operations. If it now has parameters with no uses, remove those parameters. For each pred, delete the corresponding arguments to this block. Then, place those preds into the work queue (since some of their operations may have lost their last use). If there is still work left, go to 1. There are many CFG configurations that are redundant and can be simplified to reduce the number of basic blocks. For example, unreachable code can help delete blocks. Other optimizations may cause the at the end of a function to be empty (because all of its successors were optimized away). We treat an empty as being unreachable (since it has no cases!), so we can delete every operation in the block up to the last non-pure operation. If we delete every instruction in the block, we can delete the block entirely, and delete it from its preds’ s.
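The unused-result elimination described above can be sketched like this. This is a hedged sketch within a single block: the operation encoding is invented, block arguments are omitted, and "side-effecting" is approximated by an opcode list.

```python
# Hedged sketch: delete operations whose results are unused, then revisit
# the defs of their operands, which may have just lost their last use.
def eliminate_unused(ops, side_effecting=("store", "call", "return")):
    """ops: result register -> (opcode, operand registers), in SSA form."""
    uses = {r: 0 for r in ops}
    for _opcode, args in ops.values():
        for a in args:
            if a in uses:
                uses[a] += 1
    # Seed the worklist with pure operations that have zero uses.
    queue = [r for r, (op, _) in ops.items()
             if uses[r] == 0 and op not in side_effecting]
    while queue:
        r = queue.pop()
        if r not in ops:
            continue
        _opcode, args = ops.pop(r)
        for a in args:
            if a in uses:
                uses[a] -= 1
                # The operand lost its last use; if it is pure, delete it too.
                if uses[a] == 0 and a in ops and ops[a][0] not in side_effecting:
                    queue.append(a)
    return ops

# b's result is unused, so b dies; that removes the last uses of a, so a
# dies too. c is kept alive by the return, which is side-effecting.
program = {
    "a": ("const", []),
    "b": ("add", ["a", "a"]),
    "c": ("const", []),
    "r": ("return", ["c"]),
}
remaining = eliminate_unused(program)
```

The same worklist shape extends across blocks once unused block parameters (and the matching arguments in each pred) count as uses, as in the queue-based pass above.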
This is a form of dead code elimination, or DCE, which combines with the previous optimization to aggressively delete redundant code. Some jumps are redundant. For example, if a block has exactly one pred and one successor, the pred’s case for that block can be wired directly to the successor. Similarly, if two blocks are each other’s unique predecessor/successor, they can be fused, creating a single block by connecting the input blocks’ circuits directly, instead of through a . If we have a ternary operation, we can do more sophisticated fusion. If a block has two successors, both of which have the same unique successor, and those successors consist only of gotos, we can fuse all four blocks, replacing the CFG diamond with a . In terms of C, this is the following transformation: LLVM’s CFG simplification pass is very sophisticated and can eliminate complex forms of control flow. I am hoping to write more about SSA optimization passes. This is a very rich subject, and viewing optimizations in isolation is a great way to understand how a sophisticated optimization pipeline is built out of simple, dumb components. It’s also a practical application of graph theory that shows just how powerful it can be, and (at least in my opinion), is an intuitive setting for understanding graph theory, which can feel very abstract otherwise. In the future, I’d like to cover CSE/GVN, loop optimizations, and, if I’m feeling brave, getting out of SSA into a finite-register machine (backends are not my strong suit!). Specifically the Swift frontend before lowering into LLVM IR.  ↩ Microsoft Visual C++, a non-conforming C++ compiler sold by Microsoft.  ↩ HotSpot is the JVM implementation provided by OpenJDK; C2 is the “second compiler”, which has the best performance among HotSpot’s Java execution engines.  ↩ V8 is Chromium’s JavaScript runtime.  ↩ SpiderMonkey is Firefox’s JavaScript runtime.  ↩ The Android Runtime (ART) is the “JVM” (scare quotes) on the Android platform. 
↩ The Glasgow Haskell Compiler (GHC) does not use SSA; it (like some other pure-functional languages) uses a continuation-oriented IR (compare to Scheme’s ).  ↩ Every compiler person firmly believes that P ≠ NP, because program optimization is full of NP-hard problems and we would have definitely found polynomial ideal register allocation by now if it existed.  ↩ Some more recent IRs use a different version of SSA called “structured control flow”, or SCF. Wasm is a notable example of an SCF IR. SSA-SCF is equivalent to SSA-CFG, and polynomial time algorithms exist for losslessly converting between them (LLVM compiling Wasm, for example, converts its CFG into SCF using a “relooping algorithm”). In SCF, operations like switch statements and loops are represented as macro operations that contain basic blocks. For example, a operation might take a value as input, select a basic block to execute based on that, and return the value that basic block evaluates to as its output. RVSDG is a notable innovation in this space, because it allows circuit analysis of entire imperative programs. I am covering SSA-CFG instead of SSA-SCF simply because it’s more common, and because it’s what LLVM IR is. See also this MLIR presentation for converting between the two.  ↩ Tail calling is when a function call is the last operation in a function; this allows the caller to jump directly to the callee, recycling its own stack frame for it instead of requiring it to allocate its own.  ↩ Given any path from to , we can make it acyclic by replacing each subpath from to with a single node.  ↩ When moving from a basic block to a pred, a register in that block which is defined as a block parameter corresponds to some register (or immediate) in each predecessor. That is the “equivalent” of . One possible option for the “equivalent” is an immediate: for example, or the address of a global.
In the case of a global , assuming no data races, we would instead need alias information to tell if stores to this global within the current function (a) exist and (b) are liftable at all. If the equivalent is , we can proceed in one of two ways depending on optimization level. If we want loads of to trap (as in Go), we need to mark this load as not being liftable, because it may trap. If we want loads of to be UB, we simply ignore that pred, because we can assume (for our analysis) that if the pointer is , it is never loaded from.  ↩ Returned stack pointers do not escape: stack slots’ lifetimes end at function exit, so we return a dangling pointer, which we assume is never loaded. So stores to that pointer before returning it can be discarded.  ↩
↩ Microsoft Visual C++, a non-conforming C++ compiler sold by Microsoft.  ↩ HotSpot is the JVM implementation provided by OpenJDK; C2 is the “second compiler”, which has the best performance among HotSpot’s Java execution engines.  ↩ V8 is Chromium’s JavaScript runtime.  ↩ SpiderMonkey is Firefox’s JavaScript runtime.  ↩ The Android Runtime (ART) is the “JVM” (scare quotes) on the Android platform.  

Abhinav Sarkar 1 month ago

A Fast Bytecode VM for Arithmetic: The Virtual Machine

In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. We explore the following topics: In this final post, we write the virtual machine that executes our bytecode, and benchmark it. This post was originally published on abhinavsarkar.net. This post is part of the series: A Fast Bytecode VM for Arithmetic. Bytecode Virtual Machines (VMs) are known to be faster than AST-walking interpreters. That’s why many real-world programming languages these days are implemented with bytecode VMs, for example, Java, Python, PHP, and Raku. The reason is partly the flat and compact nature of bytecode itself. But VMs also have a few other tricks up their sleeves that make them highly performant. In this post, we write a VM for our arithmetic expression language, and explore some of these performance tricks. But first, we need to finish a pending task. We wrote some unit tests for our compiler in the last post, but unit tests cover only the cases we can think of. A compiler has to deal with any input, and with just unit tests we cannot be sure of its correctness. To test our compiler and other components for correctness, we use the QuickCheck library. QuickCheck is a property-based testing framework. The key idea of property-based testing is to write properties of our code that hold true for any input, and then to automatically generate a large number of arbitrary inputs and make sure that the properties are indeed true for them 1 2 . Since we are writing an arithmetic expression parser/compiler/VM, we generate arbitrary expression ASTs, and use them to assert certain invariants of our program. With QuickCheck, we write generators for the inputs for our tests. These generators are composable just like parser combinators are. We use the library-provided generators to write small generators that we combine to create larger ones. 
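To make the size-halving idea concrete without reproducing the post's actual code, here is a dependency-free sketch: a hypothetical `Expr` type and a hand-rolled generator in which a tiny linear congruential generator stands in for QuickCheck's random source. Every name below is made up for illustration; the post's real generators are built from QuickCheck's `Gen` combinators.

```haskell
-- A hypothetical expression AST for the sketch (not the post's type).
data Expr
  = Num Int
  | Var String
  | Bin Char Expr Expr          -- one of '+', '-', '*', '/'
  | Let String Expr Expr        -- let x = e1 in e2
  deriving Show

type Seed = Int

-- Tiny LCG, standing in for QuickCheck's random source.
next :: Seed -> (Int, Seed)
next s = let s' = (1103515245 * s + 12345) `mod` 2147483648 in (s', s')

-- Pick a number in [0, n).
choose :: Int -> Seed -> (Int, Seed)
choose n s = let (r, s') = next s in (r `mod` n, s')

-- The size-halving trick: composite cases recurse at size `div` 2,
-- so generated trees are always finite.
genExpr :: [String] -> Int -> Seed -> (Expr, Seed)
genExpr vars size s0
  | size <= 1 = base s0
  | otherwise =
      let (k, s1) = choose 3 s0
      in case k of
           0 -> base s1
           1 ->
             let (op, s2) = choose 4 s1
                 (l, s3) = genExpr vars (size `div` 2) s2
                 (r, s4) = genExpr vars (size `div` 2) s3
             in (Bin ("+-*/" !! op) l r, s4)
           _ ->
             let x = "x" ++ show (length vars)  -- fresh identifier
                 -- assignment uses the old scope; body sees the new name
                 (e1, s2) = genExpr vars (size `div` 2) s1
                 (e2, s3) = genExpr (x : vars) (size `div` 2) s2
             in (Let x e1 e2, s3)
  where
    base s =
      let (k, s1) = choose (1 + length vars) s
      in if k == 0
           then let (n, s2) = next s1 in (Num (n `mod` 100), s2)
           else (Var (vars !! (k - 1)), s1)

-- Node count, to observe that sizes stay bounded.
sizeOf :: Expr -> Int
sizeOf (Num _) = 1
sizeOf (Var _) = 1
sizeOf (Bin _ a b) = 1 + sizeOf a + sizeOf b
sizeOf (Let _ a b) = 1 + sizeOf a + sizeOf b

main :: IO ()
main = print (sizeOf (fst (genExpr [] 64 42)))
```

Because the size parameter halves at each composite node, a generator started at size 64 can never produce more than 127 nodes, which is the termination argument the post relies on.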
Let’s start: First come the basic generators: Moving on to composite generators: generates binary expressions with arbitrary binary operations. It recursively calls to generate the operands. The parameter controls the complexity of the generated expressions, and we halve the size of operands (and so on recursively) so that we don’t end up with infinitely large expressions. generates expressions by generating an identifier, and then generating the assignment and body expressions recursively. We do the same trick of halving sizes here as well. Notice that the assignment is generated with the passed variable names in scope, whereas the body is generated with the new identifier added to the scope. uses the above generators to generate all kinds of expressions. At smaller sizes, it prefers to generate base expressions, while at larger sizes, it prefers composite ones. Due to the careful recursive halving of size in composite generators, we end up with expressions of finite sizes. Finally, we have some instances of QuickCheck’s type class to tie everything together: We can apply them in GHCi: Notice that the generated samples increase in complexity. With the generators in place, we define our properties next. Let’s test our parser first: This property is a simple round-trip test for the parser and printer: we parse the string representation of a generated expression, and assert that it gives back the same expression. The second property is a more involved round-trip test for the compiler and decompiler: This asserts that compiling an expression, then disassembling and decompiling it, and finally compiling it again should result in the original bytecode 3 . This requires a helper function to get the size of an expression: We run these tests in a later section. This ends our short detour. Now for the main event: the virtual machine. Our VM is a stack-based machine that operates on a stack of values and executes the compiled bytecode. Our goal is to be as fast as possible. 
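The parser/printer round-trip property described above can be sketched end-to-end on a toy fully-parenthesized language. The `Expr` type, `pretty`, and `parse` below are assumptions for this sketch, not the post's Attoparsec-based code:

```haskell
-- Toy language: numbers and fully-parenthesized addition.
data Expr = Num Int | Add Expr Expr deriving (Eq, Show)

pretty :: Expr -> String
pretty (Num n) = show n
pretty (Add a b) = "(" ++ pretty a ++ "+" ++ pretty b ++ ")"

-- Minimal recursive-descent parser returning (result, unconsumed rest).
parse :: String -> Maybe (Expr, String)
parse ('(' : s) = do
  (a, s1) <- parse s
  case s1 of
    '+' : s2 -> do
      (b, s3) <- parse s2
      case s3 of
        ')' : s4 -> Just (Add a b, s4)
        _        -> Nothing
    _ -> Nothing
parse s = case span (`elem` "0123456789") s of
  ("", _)     -> Nothing
  (ds, rest)  -> Just (Num (read ds), rest)

-- The round-trip property: parsing the printed form gives back the
-- original expression, with no input left over.
roundTrips :: Expr -> Bool
roundTrips e = parse (pretty e) == Just (e, "")

main :: IO ()
main = print (roundTrips (Add (Num 1) (Add (Num 2) (Num 3))))  -- prints True
```

`roundTrips` is exactly the shape of predicate one hands to QuickCheck: a pure `Expr -> Bool` that the framework then bombards with generated inputs.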
For a quick reminder, these are our s: And now, the heart of the VM: The function is where the action happens. It is way more complex than , but the complexity has a reason, namely performance. runs inside the monad wrapped with the monad transformer. monad lets us use mutable data structures locally while ensuring the function remains externally pure. monad transformer adds support for throwing and propagating errors in a pure manner. We use for our stack, which is a mutable array of unboxed primitive types, in our case an array of values. Using a mutable unboxed array is much faster than using an immutable and/or boxed one like or due to reduced allocation and/or pointer chasing. The core of the VM is the function, a tight, tail-recursive loop that GHC compiles into an efficient machine loop, as we see later. It takes the stack pointer ( ), instruction pointer ( ) 4 , and the stack as arguments. At the top of each loop, a block of guard clauses checks for stack overflow, underflow, and other error conditions before branching on the current opcode. Placing these checks at the top instead of inside the opcode cases is a deliberate choice. This may make the code slightly harder to understand, but it significantly improves the performance of the loop by moving all branching to the beginning of the loop, resulting in code that is more friendly to the CPU’s Branch Predictor . Also notice how we reduce the number of checks by working with a range of opcodes at once in the guard. The checks are also sorted so as to be most performant, guided by profiling and benchmarking 5 . The handling of each opcode is actually pretty straightforward. We use different specific operations to read and write to the stack, while taking care of doing the required bounds and arithmetic checks. We also use the functions that we wrote earlier . After carrying out each operation, we reenter the loop by calling it tail-recursively with the right stack and instruction pointers. 
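The shape of such a loop can be sketched with a mutable unboxed stack from the `array` package. This toy VM uses a made-up three-opcode bytecode (0 = push the next byte as a literal, 1 = add, 2 = multiply) rather than the post's instruction set, but it shows the tail-recursive loop over the stack and instruction pointers, the up-front guard checks, and the `STUArray` stack:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, (!), listArray)
import Data.Word (Word8)

-- Run a toy bytecode program; externally pure despite the mutation.
exec :: [Word8] -> Either String Int
exec bytes = runST (go 0 0 =<< newArray (0, stackCap - 1) 0)
  where
    stackCap = 64 :: Int
    code :: UArray Int Word8
    code = listArray (0, length bytes - 1) bytes
    end = length bytes

    -- Tail-recursive loop over stack pointer (sp) and instruction
    -- pointer (ip), with error checks guarded up front.
    go :: Int -> Int -> STUArray s Int Int -> ST s (Either String Int)
    go !sp !ip stack
      | ip >= end =
          if sp == 1
            then Right <$> readArray stack 0
            else pure (Left "stack does not hold exactly one result")
      | otherwise = case code ! ip of
          0 | sp >= stackCap -> pure (Left "stack overflow")
            | ip + 1 >= end  -> pure (Left "truncated literal")
            | otherwise -> do
                writeArray stack sp (fromIntegral (code ! (ip + 1)))
                go (sp + 1) (ip + 2) stack
          op | op == 1 || op == 2 ->
                 if sp < 2
                   then pure (Left "stack underflow")
                   else do
                     b <- readArray stack (sp - 1)
                     a <- readArray stack (sp - 2)
                     writeArray stack (sp - 2) (if op == 1 then a + b else a * b)
                     go (sp - 1) (ip + 1) stack
          op -> pure (Left ("bad opcode: " ++ show op))

main :: IO ()
main = print (exec [0, 2, 0, 3, 1, 0, 4, 2])  -- (2 + 3) * 4, prints Right 20
```

The real VM's loop is richer (more opcodes, variable slots, `ExceptT` for errors), but the skeleton is the same: `go` takes only unboxable arguments and ends every branch in a tail call, which is what lets GHC compile it to a jump-based machine loop.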
Finally, we make sure that the execution terminated correctly by checking the state of the stack, and return its first element. We see later that the VM is quite fast, but how does GHC achieve this performance? To see the magic, we can look at GHC’s intermediate language: Core. Core is a simpler functional language than Haskell, to which GHC compiles Haskell. The simpler nature of Core makes it easier for GHC to optimize it, and compile it further. We can get the Core code for a program by compiling with the GHC option . The actual Core code for our VM is too verbose to show here, but here is a simplified C-like pseudo-code version of our loop: A few key optimizations are worth pointing out: The loop: The tail-recursive function is compiled into a proper loop. The instruction is effectively a , which means there’s no function call overhead for each iteration of the VM loop. Unboxing: The Core code is full of primitive, unboxed types like , , and , and operations on them. These are raw machine integers and memory addresses, not boxed Haskell objects. This means operations on them are as fast as they would be in C. The stack operations are not function calls on a instance, but primitive memory reads and writes on a raw memory address. Inlining: The helper function is completely inlined into the main loop. For , the code for reading two values, adding them, and writing the result is laid out inline, and works on unboxed values and the array address. In short, GHC has turned our high-level, declarative Haskell code into a low-level loop that looks remarkably like one we would write in C. We get the safety and expressiveness of Haskell, while GHC does the heavy lifting to produce highly optimized code. It’s the best of both worlds! We must test the VM to make sure it works correctly 6 . 
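The key correctness property for what follows — the AST interpreter and the bytecode VM must agree on every expression — can be sketched in miniature. The two-constructor `Expr`, the toy push/add/mul bytecode, and all names below are assumptions for the sketch, not the post's types:

```haskell
import Data.Word (Word8)

data Expr = Num Int | Add Expr Expr | Mul Expr Expr deriving Show

-- The AST-walking interpreter: the "definitional" side of the property.
eval :: Expr -> Int
eval (Num n)   = n
eval (Add a b) = eval a + eval b
eval (Mul a b) = eval a * eval b

-- Toy bytecode: 0 = push next byte, 1 = add, 2 = mul
-- (literals limited to 0..255 for the sketch).
compile :: Expr -> [Word8]
compile (Num n)   = [0, fromIntegral n]
compile (Add a b) = compile a ++ compile b ++ [1]
compile (Mul a b) = compile a ++ compile b ++ [2]

-- A list-based stack suffices here; speed is not the point.
exec :: [Word8] -> Maybe Int
exec = go []
  where
    go [v] []               = Just v
    go _   []               = Nothing
    go st  (0 : n : rest)   = go (fromIntegral n : st) rest
    go (b : a : st) (1 : rest) = go (a + b : st) rest
    go (b : a : st) (2 : rest) = go (a * b : st) rest
    go _ _                  = Nothing

-- The property: both execution strategies compute the same value.
agrees :: Expr -> Bool
agrees e = exec (compile e) == Just (eval e)

main :: IO ()
main = print (all agrees
  [Num 7, Add (Num 2) (Num 3), Mul (Add (Num 2) (Num 3)) (Num 4)])
```

Handing `agrees` to QuickCheck together with an expression generator is precisely the property-based test the post adds for its VM.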
We reuse the success and failure tests for the AST interpreter, as the bytecode interpreter should yield the same result: We also add a property-based test this time: for any given expression, interpreting the AST should produce the same result as compiling it to bytecode and executing it in the VM 7 . Our test suite is complete now: And finally, we run all tests together: Happily, all tests pass. Now for the fun part: benchmarking. We use the criterion library to benchmark the code. We have a benchmark suite to measure the performance of each pass, the two interpreters (AST and bytecode), and the full end-to-end runs 8 . We compile with the following GHC options: Here are the results in a more digestible format: Here are the times in a chart (smaller is better): Let’s break down these numbers: I can already see readers thinking, “Sure that’s fast, but is it faster than C/Rust/Zig/my favourite language?” Let’s find out. To get a better sense of our VM’s performance, I rewrote it in C. The C implementation is a classic manual approach: a hand-written tokenizer and recursive-descent parser, s with pointers for the AST, and manual memory management and error propagation. The VM is a simple loop with a statement for dispatching opcodes 9 . To compare our Haskell code against the C code, we need to write the last Haskell module, the CLI app that we demonstrated in the first post : We compile with the following GHC options 10 : And for the C version, we compile using GCC: Now, let’s see how they stack up against each other. We use hyperfine to run the two executables. Here’s a summary of the results: I have subtracted the times of previous passes to get the times for individual passes. Here’s the same in a chart (smaller is better): As expected, the C implementation is faster across the board, between 1.5x and 2.6x. The biggest difference is in parsing, where the hand-written C parser is more than twice as fast as our combinator-based one. 
On the other hand, the Haskell VM is only 50% slower than the C VM. In my opinion, the Haskell code’s performance is quite respectable, especially given the safety, expressiveness and conciseness benefits, as illustrated by the code sizes 11 : The Haskell implementation is almost half the size of the C code. I don’t know about you, but I’m perfectly happy with the half-as-small, half-as-fast trade-off. The benchmark results for the VMs become less surprising when I compare the C function with the GHC Core code for 12 . This structure is almost a 1-to-1 match with the GHC Core code we saw earlier. The C loop corresponds to the optimized function that GHC generates, the statement is almost identical to the analysis on the raw opcode byte, and the C stack array is equivalent to the GHC uses. GHC effectively compiles our high-level Haskell into low-level code that is structurally identical to what we wrote by hand in C 13 . This explains why the performance is in the same ballpark. The remaining performance gap is probably due to the thin layer of abstraction that the Haskell runtime still maintains, but it’s remarkable how close we can get to C-like performance. While our Haskell program is fast, we can improve certain things: Parser optimizations : As the benchmarks showed, parsing is our slowest pass. For better performance, we could replace our Attoparsec-based combinator parser with a parser generator like Alex and Happy , or even write a recursive-descent parser by hand. Superinstructions : We could analyze the bytecode for common instruction sequences (like followed by ) and combine them into single superinstructions. This would reduce the instruction dispatch overhead, but may make compilation slower. Register-based VM : A register-based VM, which uses a small array of virtual registers instead of a memory-based stack, could significantly reduce memory traffic and improve performance. 
This would require a more complex compiler capable of register allocation. Just-in-Time (JIT) compilation : The ultimate performance boost could come from a JIT compiler . Instead of interpreting bytecode, we could compile it to native machine code at runtime, eliminating the interpreter entirely. Maybe we could use LLVM to build a JIT compiler in Haskell. And that’s a wrap! We successfully built a bytecode compiler and virtual machine in Haskell. We covered parsing, AST interpretation, compilation, and bytecode execution, as well as debugging and testing functionality. Let’s update our checklist: The journey from a simple AST interpreter to a bytecode VM has been a rewarding one. We saw a significant performance improvement, learned about how compilers and VMs work, and how to write performant code in Haskell. While our Haskell implementation isn’t as fast as the hand-written C version, it’s far more concise and, I would argue, easier to reason about. It’s a great demonstration of Haskell’s power for writing high-performance—yet safe and elegant—code. See the full code at: If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! Actually, QuickCheck does not generate entirely arbitrary inputs. It generates arbitrary inputs with increasing complexity—where the complexity is defined by the user—and asserts the properties on these inputs. When a test fails for a particular input, QuickCheck also tries to simplify the culprit and find the simplest input for which the test fails. This process is called Shrinking in QuickCheck parlance. QuickCheck then shows this simplest input to the user for them to use to debug their code. ↩︎ Read this good introduction to QuickCheck if you are unfamiliar. ↩︎ Notice that we discard the expressions that do not compile successfully. ↩︎ and are not actual pointers, but indices into the stack and bytecode arrays respectively. 
↩︎ Guided by the GHC profiler, I tweaked the code in many different ways and ran benchmarks for every change. Then I chose the code that was most performant. ↩︎ It is extremely important to write good tests before getting your hands dirty with performance optimizations. In my case, the tests saved me many times from breaking the VM while moving code around for performance. ↩︎ We are using our AST interpreter as a definitional interpreter, assuming it to be correctly implemented because of its simpler nature. ↩︎ I ran all benchmarks on an Apple M4 Pro 24GB machine against a 142MB file generated using the expression generator we wrote earlier. ↩︎ I don’t claim to be a great or even a good C programmer. In fact, this C VM is the first substantial C code I have written in decades. I’m sure the code is not the most optimized. It may even be riddled with memory management bugs. If you find something wrong, please let me know in the comments. ↩︎ I tried various RTS options to tweak GHC garbage collection, but the defaults proved to be fastest. ↩︎ The lines of code are for only the overlapping functionalities between C and Haskell versions. ↩︎ I did try using Direct Threading and Subroutine Threading in the C code, but they resulted in slower code than the switch-case variant. GCC may be smart enough in case of this simple VM to optimize the switch-case to be faster than threaded code. ↩︎ You may have noticed that the C function is not laid out in the exact same manner as the Haskell function. In the case of C, moving the checks to the front did not yield a performance improvement. I suspect this may be because GCC is smart enough to do that optimization by itself. The nested were also no detriment to the performance of the C code. ↩︎ If you liked this post, please leave a comment . Parsing arithmetic expressions to Abstract Syntax Trees (ASTs). Unit testing for our parser. Interpreting ASTs. Compiling ASTs to bytecode. Disassembling and decompiling bytecode. 
Unit testing for our compiler. Property-based testing for our compiler. Efficiently executing bytecode in a virtual machine (VM). Unit testing and property-based testing for our VM. Benchmarking our code to see how the different passes perform. All the while keeping an eye on performance. The Compiler The Virtual Machine (you are here) Introduction Testing the Compiler The Virtual Machine Testing the VM Benchmarking the VM Benchmarking Against C Future Directions generates number expressions by using QuickCheck’s built-in function. generates variable expressions by choosing from the set of passed valid variable names. generates valid identifiers from combinations of letters a—z and A—Z, and discarding ones that are reserved keywords. 
Parsing and decompiling are slow : At ~573ms and ~506ms, these are by far the slowest passes. This isn’t surprising. Parsing with parser combinators has a known trade-off of expressiveness for performance. Decompiling is a shift-reduce parser that reconstructs an AST from a linear stream of opcodes, and we didn’t spend any time optimizing it. Compilation is fast : At ~51ms, compilation is an order of magnitude faster than parsing. This is thanks to pre-calculating the bytecode size during the parsing phase, which allows us to pre-allocate a single and fill it in with low-level pointer operations. Bytecode interpretation is blazingly fast : At just ~16ms, our VM’s interpreter is over 3 times faster than the AST interpreter (~50ms), which proves our belief that bytecode interpreters are faster. End-to-end runs : Interestingly, the total time to run via bytecode (~638ms) is slightly slower than the run via AST (~617ms). This is because the cost of parsing, compiling, and then interpreting is higher than just parsing and interpreting. The real win for a bytecode VM comes when you compile once and run many times , amortizing the initial compilation cost. 
ArithVMLib.hs ArithVMSpec.hs ArithVMBench.hs ArithVMApp.hs 


Arrows to Arrows, Categories to Queries

I’ve had a little time off of work as of late, and been spending it in characteristically unwise ways. In particular, I’ve written a little programming language that compiles to SQL . I call it catlang . That’s not to say that I’ve written a new query language. It’s a programming language, whose compiler spits out one giant statement. When you run that query in postgres, you get the output of your program. Why have I done this? Because I needed a funny compilation target to test out the actual features of the language, which is that its intermediary language is a bunch of abstract category theory nonsense. Which I’ll get to. But I’m sure you first want to see this bad boy in action. Behold, the function that returns 100 regardless of what input you give it. But it does it with the equivalent of a while loop: If you’re familiar with arrow notation , you’ll notice the above looks kinda like one big block. This is not a coincidence (because nothing is a coincidence). I figured if I were to go through all of this work, we might as well get a working arrow desugarer out of the mix. But I digress; that’s a story for another time. Anyway, what’s going on here is we have an arrow , which takes a single argument . We then loop, starting from the value of . Inside the loop, we now have a new variable , which we do some voodoo on to compute —the current value of the loop variable. Then we subtract 100 from , and take the absolute value. The function here is a bit odd; it returns if the input was negative, and otherwise. Then we branch on the output of , where and have been renamed and respectively. If was less than zero, we find ourselves in the case, where we add 1 to and wrap the whole thing in —which the loop interprets as “loop again with this new value.” Otherwise, was non-negative, and so we can return directly. Is it roundabout? You bet! 
The obtuseness here is not directly a feature; I was just looking for conceptually simple things I could do which would be easy to desugar into category-theoretical stuff. Which brings us to the intermediary language. After desugaring the source syntax for above, we’re left with this IL representation: We’ll discuss all of this momentarily, but for now, just let your eyes glaze over the pretty unicode. The underlying idea here is that each of these remaining symbols has very simple and specific algebraic semantics. For example, means “do and pipe the result into .” By giving a transformation from this categorical IL into other domains, it becomes trivial to compile catlang to all sorts of weird compilation targets. Like SQL. You’re probably wondering what the generated SQL looks like. Take a peek if you dare. It’s not pretty. Rather amazingly, running the above query in postgres 17 will in fact return a single row with a single column whose value is 100. And you’d better believe it does it by actually looping its way up to 100. If you don’t believe me, make the following change: which will instead return a row for each step of the iteration. There are some obvious optimizations I could make to the generated SQL, but it didn’t seem worth my time, since that’s not the interesting part of the project. Let’s take some time to discuss the underlying category theory here. I am by no means an expert, but what I have learned after a decade of bashing my head against this stuff is that a little goes a long way. For our intents and purposes, we have types, and arrows (functions) between types. We always have the identity “do nothing arrow” : and we can compose arrows by lining up one end to another: 1 Unlike Haskell (or really any programming language, for that matter), we DO NOT have the notion of function application. That is, there is no arrow: You can only compose arrows, you can’t apply them. 
That’s why we call these things “arrows” rather than “functions.” There are a bundle of arrows for working with product types. The two projection functions correspond to and , taking individual components out of pairs: How do we get things into pairs in the first place? We can use the “fork” operation, which takes two arrows computing and , and generates a new arrow which generates a pair of : If you’re coming from a Haskell background, it’s tempting to think of this operation merely as the pair constructor. But you’ll notice from the type of the computation that there can be no data dependency between and , thus we are free to parallelize each side of the pair. In category theory, the distinction between left and right sides of an arrow is rather arbitrary. This gives rise to a notion called duality where we can flip the arrows around, and get cool new behavior. If we dualize all of our product machinery, we get the coproduct machinery, where a coproduct of and is “either or , but definitely not both nor neither.” Swapping the arrow direction of and , and replacing with gives us the following injections: and the following “join” operation for eliminating coproducts: Again, coming from Haskell this is just the standard function. It corresponds to a branch between one of two cases. As you can see, with just these eight operations, we already have a tremendous amount of expressivity. We can express data dependencies via and branching via . With we automatically encode opportunities for parallelism, and gain the ability to build complicated data structures, with and allowing us to get the information back out of the data structures. You’ll notice in the IL that there are no variable names anywhere to be found. The desugaring of the source language builds a stack (via the pattern), and replaces subsequent variable lookups with a series of projections on the stack to find the value again. 
On one hand, this makes the categorical IL rather hard to read, but it makes it very easy to re-target! Many domains do have a notion of grouping, but don’t have a native notion of naming. For example, in an electronic circuit, I can have a ribbon of 32 wires which represents a 32-bit integer. If I have another ribbon of 32 wires, I can trivially route both ribbons into a 64-wire ribbon corresponding to a pair of 32-bit integers. By eliminating names before we get to the IL, no compiler backend ever needs to deal with names. They can just work on a stack representation, and are free to special-case optimize series of projections if they are able to.

Of particular interest to this discussion is how we desugar loops in catlang. The underlying primitive magically turns an arrow between Eithers into an arrow without the Eithers. We obviously must run that arrow on Eithers. If that function returns a Left, then we’re happy and we can just output that. But if the function returns a Right, we have no choice but to pass it back in to the Eithered arrow. In Haskell, Cochoice is implemented as a loop, which as you can see will run until the arrow finally returns a Left. What’s neat about this formulation of a loop is that we can statically differentiate between our first and subsequent passes through the loop body. The first time through is a Left, while for all other times it is a Right. We don’t take advantage of it in the original program, but how many times have you written loop code that needs to initialize something its first time through?

So that’s the underlying theory behind the IL. How can we compile this to SQL now? As alluded to before, we simply need to give SQL implementations for each of the operations in the intermediary language. As a simple example, the identity arrow compiles to a query that simply selects the input of the arrow. The hardest part here was working out a data representation. It seems obvious to encode each element of a product as a new column, but what do we do about coproducts?
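Before turning to the data representation, the loop primitive just described can be sketched in the same toy model. This mirrors `unleft` from Haskell’s `Cochoice` class (profunctors), with plain functions standing in for arrows:

```typescript
type Either<A, B> = { tag: "left"; value: A } | { tag: "right"; value: B };

// unleft: turn an arrow between Eithers into an arrow without them,
// by feeding every Right back into the loop until a Left comes out.
const unleft = <A, D, B>(f: (e: Either<A, D>) => Either<B, D>) =>
  (a: A): B => {
    // first pass: statically distinguishable as a Left
    let out = f({ tag: "left", value: a });
    // all subsequent passes: a Right fed back into the arrow
    while (out.tag === "right") out = f({ tag: "right", value: out.value });
    return out.value;
  };
```

A usage example in the spirit of the post’s count-to-100 program: a step function that returns `Right (n + 1)` until it reaches 100, then returns `Left n`; `unleft` of that step, applied to 0, loops its way up to 100.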
After much thought, I decided to flatten out the coproducts. So, for example, a coproduct type would be represented as three columns, with the constraint that exactly one of the two payload columns would be non-null at any given point in time. With this hammered out, almost everything else is pretty trivial. Composition corresponds to a nested query. Forks are JOINs which concatenate the columns of each sub-query. Joins are UNIONs, where we add a WHERE clause to enforce we’re looking at the correct coproduct constructor.

Cochoice is the only really tricky thing, but it corresponds to a recursive CTE. Generating a recursive CTE table for the computation isn’t too hard, but getting the final value out of it was surprisingly tricky. The semantics of SQL tables is that they are multisets, with no inherent ordering. Which is to say, you need an ORDER BY over a column structured in a relevant way in order to query the final result. Due to some quirks in what postgres accepts, and in how I structured my queries, it was prohibitively hard to insert a “how many times have I looped” column and order by that. So instead I cheated and added a column which looks at the processor clock, and ordered by that. This is clearly a hack, and presumably will cause problems if I ever add some primitives which generate more than one row, but again, this is just for fun and who cares. Send me a pull request if you’re offended by my chicanery!

I’ve run out of vacation time to work on this project, so I’m probably not going to get around to the meta-circular stupidity I was planning. The compiler still needs a few string-crunching primitives (which are easy to add), but then it would be simple to write a little brainfuck interpreter in catlang. Which I could then compile to SQL. Now we’ve got a brainfuck interpreter running in postgres. Of course, this has been done by hand before, but to my knowledge, never via compilation. There exist C to brainfuck compilers. And postgres is written in C.
So in a move that would make Xzibit proud, we could run postgres in postgres. And of course, it would be fun to run brainfuck in brainfuck. That’d be a cool catlang backend if someone wanted to contribute such a thing.

I am not the first person to do anything like this. The source language of catlang is heavily inspired by Haskell’s arrow syntax , which in turn is essentially a desugaring algorithm for Arrows . Arrows are slightly the wrong abstraction because they require an operation arr—which requires you to be able to embed Haskell functions in your category, something which is almost never possible. Unfortunately, arrow syntax in Haskell desugars down to arr for almost everything it does, which in turn makes arrow notation effectively useless. In an ideal world, everything I described in this blog post would be a tiny little Haskell library, with arrow notation doing the heavy lifting. But that is just not the world we live in.

Nor am I the first person to notice that there are categorical semantics behind programming languages. I don’t actually know whom to cite on this one, but it is well-established folklore that the lambda calculus corresponds to cartesian-closed categories . The “closed” part of “cartesian-closed” means we have an internal notion of function application, but everyone and their dog has implemented the lambda calculus, so I thought it would be fun to see how far we can get without it. This is not a limitation on catlang’s Turing completeness (since Cochoice gives us everything we need.)

I’ve been thinking about writing a category-first programming language for the better part of a decade, ever since I read Compiling to Categories . That paper takes Haskell and desugars it back down to categories. I stole many of the tricks here from that paper. Anyway. All of the code is available on github if you’re interested in taking a look. The repo isn’t up to my usual coding standards, for which you have my apologies.
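The earlier point about `arr` can be made concrete. A compilable IL has to be inspectable data that a backend can walk; `arr` would force the IL to contain opaque host functions, which a SQL or circuit backend cannot translate. A sketch (constructor names are mine, a much-reduced stand-in for the real IL):

```typescript
// A deep embedding: the IL is plain data, so any backend can fold over it.
type IL =
  | { tag: "id" }
  | { tag: "compose"; after: IL; before: IL }
  | { tag: "fst" }
  | { tag: "snd" };
// Crucially there is no constructor like { tag: "arr"; fn: (x: any) => any }:
// an opaque host function has no structure for a backend to inspect.

// Any backend is just a recursive walk over the data. Here, a pretty-printer:
function render(il: IL): string {
  switch (il.tag) {
    case "id": return "id";
    case "fst": return "fst";
    case "snd": return "snd";
    case "compose": return `(${render(il.after)} . ${render(il.before)})`;
  }
}
```

A SQL backend would have the same shape as `render`, just emitting queries instead of strings of names.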
Of note is the template-haskell backend which can spit out Haskell code; meaning it wouldn’t be very hard to make a quasiquoter to compile catlang into what Haskell’s arrow desugaring ought to be. If there’s enough clamor for such a thing, I’ll see about turning this part into a library. When looking at the types of arrows in this essay, we make the distinction that some are arrows we can write in catlang, while others exist only in the metatheory. ↩︎

Ahmad Alfy 1 months ago

How Functional Programming Shaped (and Twisted) Frontend Development

A friend called me last week. Someone who’d built web applications for a long time before moving exclusively to backend and infra work. He’d just opened a modern React codebase for the first time in over a decade. “What the hell is this?” he asked. “What are all these generated class names? Did we just… cancel the cascade? Who made the web work this way?”

I laughed, but his confusion cut deeper than he realized. He remembered a web where CSS cascaded naturally, where the DOM was something you worked with , where the browser handled routing, forms, and events without twenty abstractions in between. To him, our modern frontend stack looked like we’d declared war on the platform itself. He asked me to explain how we got here. That conversation became this essay.

A disclaimer before we begin : This is one perspective, shaped by having lived through the first browser war. I applied hacks to make 24-bit PNGs work in IE6. I debugged hasLayout bugs at 2 AM. I wrote JavaScript when you couldn’t trust basic APIs to work the same way across browsers. I watched jQuery become necessary, then indispensable, then legacy. I might be wrong about some of this. My perspective is biased for sure, but it also comes with the memory that the web didn’t need constant reinvention to be useful.

There’s a strange irony at the heart of modern web development. The web was born from documents, hyperlinks, and a cascading stylesheet language. It was always messy, mutable, and gloriously side-effectful. Yet over the past decade, our most influential frontend tools have been shaped by engineers chasing functional programming purity: immutability, determinism, and the elimination of side effects. This pursuit gave us powerful abstractions. React taught us to think in components. Redux made state changes traceable. TypeScript brought compile-time safety to a dynamic language. But it also led us down a strange path, one where we fought against the platform instead of embracing it.
We rebuilt the browser’s native capabilities in JavaScript, added layers of indirection to “protect” ourselves from the DOM, and convinced ourselves that the web’s inherent messiness was a problem to solve rather than a feature to understand. The question isn’t whether functional programming principles have value. They do. The question is whether applying them dogmatically to the web (a platform designed around mutability, global scope, and user-driven chaos) made our work better, or just more complex. To understand why functional programming ideals clash with web development, we need to acknowledge what the web actually is. The web is fundamentally side-effectful. CSS cascades globally by design. Styles defined in one place affect elements everywhere, creating emergent patterns through specificity and inheritance. The DOM is a giant mutable tree that browsers optimize obsessively; changing it directly is fast and predictable. User interactions arrive asynchronously and unpredictably: clicks, scrolls, form submissions, network requests, resize events. There’s no pure function that captures “user intent.” This messiness is not accidental. It’s how the web scales across billions of devices, remains backwards-compatible across decades, and allows disparate systems to interoperate. The browser is an open platform with escape hatches everywhere. You can style anything, hook into any event, manipulate any node. That flexibility and that refusal to enforce rigid abstractions is the web’s superpower. When we approach the web with functional programming instincts, we see this flexibility as chaos. We see globals as dangerous. We see mutation as unpredictable. We see side effects as bugs waiting to happen. And so we build walls. Functional programming revolves around a few core principles: functions should be pure (same inputs → same outputs, no side effects), data should be immutable, and state changes should be explicit and traceable. 
These ideas produce code that’s easier to reason about, test, and parallelize, in the right context of course. These principles had been creeping into JavaScript long before React. Underscore.js (2009) brought map, reduce, and filter to the masses. Lodash and Ramda followed with deeper FP toolkits including currying, composition, and immutability helpers. The ideas were in the air: avoid mutation, compose small functions, treat data transformations as pipelines.

React itself started with class components and setState, hardly pure FP. But the conceptual foundation was there: treat UI as a function of state, make rendering deterministic, isolate side effects. Then came Elm, a purely functional language created by Evan Czaplicki that codified the “Model-View-Update” architecture. When Dan Abramov created Redux, he explicitly cited Elm as inspiration. Redux’s reducers are directly modeled on Elm’s update functions. Redux formalized what had been emerging patterns. Combined with React Hooks (which replaced stateful classes with functional composition), the ecosystem shifted decisively toward FP. Immutability became non-negotiable. Pure components became the ideal. Side effects were corralled into useEffect. Through this convergence (library patterns, Elm’s rigor, and React’s evolution) Haskell-derived ideas about purity became mainstream JavaScript practice.

In the early 2010s, as JavaScript applications grew more complex, developers looked to FP for salvation. jQuery spaghetti had become unmaintainable. Backbone’s two-way binding caused cascading updates (ironically, Backbone’s documentation explicitly advised against two-way binding, saying “it doesn’t tend to be terribly useful in your real-world app”, yet many developers implemented it through plugins). The community wanted discipline, and FP offered it: treat your UI as a pure function of state. Make data flow in one direction. Eliminate shared mutable state. React’s arrival in 2013 crystallized these ideals.
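The Elm-style reducer pattern mentioned above is easy to sketch: a pure function from the current state and an action to a new state, never mutating the old one. The shapes below are illustrative, not from any particular codebase:

```typescript
type State = { count: number };
type Action = { type: "increment" } | { type: "reset" };

// A pure reducer: same inputs always produce the same output, and the
// old state object is never mutated -- a fresh object is returned.
function reducer(state: State, action: Action): State {
  switch (action.type) {
    case "increment": return { ...state, count: state.count + 1 };
    case "reset": return { ...state, count: 0 };
    default: return state;
  }
}
```

Because the old state survives untouched, every historical state can be kept around, which is what makes Redux-style time-travel debugging and change tracing possible.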
It promised a world where UI is a function of state: give it data, get back a component tree, re-render when data changes. No manual DOM manipulation. No implicit side effects. Just pure, predictable transformations. This was seductive. And in many ways, it worked. But it also set us on a path toward rebuilding the web in JavaScript’s image, rather than JavaScript in the web’s image.

CSS was designed to be global. Styles cascade, inherit, and compose across boundaries. This enables tiny stylesheets to control huge documents, and lets teams share design systems across applications. But to functional programmers, global scope is dangerous. It creates implicit dependencies and unpredictable outcomes. Enter CSS-in-JS: styled-components, Emotion, JSS. The promise was component isolation. Styles scoped to components, no cascading surprises, no naming collisions. Styles become data , passed through JavaScript, predictably bound to elements.

But this came at a cost. CSS-in-JS libraries generate styles at runtime, injecting them into style tags as components mount. This adds JavaScript execution to the critical rendering path. Server-side rendering becomes complicated: you need to extract styles during the render, serialize them, and rehydrate them on the client. Debugging involves runtime-generated class names. And you lose the cascade; the very feature that made CSS powerful in the first place. Worse, you’ve moved a browser-optimized declarative language into JavaScript, a single-threaded runtime. The browser can parse and apply CSS in parallel, off the main thread. Your styled-components bundle? That’s main-thread work, blocking interactivity. The web had a solution. It’s called a stylesheet. But it wasn’t pure enough.

The industry eventually recognized these problems and pivoted to Tailwind CSS. Instead of runtime CSS generation, use utility classes. Instead of styled-components, compose classes in JSX. This was better, at least it’s compile-time, not runtime.
No more blocking the main thread to inject styles. No more hydration complexity. But Tailwind still fights the cascade. Instead of writing a rule once and letting it cascade to all buttons, you repeat the same utility classes on every single button element. You’ve traded runtime overhead for a different set of problems: class soup in your markup, massive HTML payloads, and losing the cascade’s ability to make sweeping design changes in one place.

And here’s where it gets truly revealing: when Tailwind added support for nested selectors (a feature that would let developers write more cascade-like styles), parts of the community revolted. David Khourshid (creator of XState) shared examples of using nested selectors in Tailwind, and the backlash was immediate. Developers argued this defeated the purpose of Tailwind, that it brought back the “problems” of traditional CSS, that it violated the utility-first philosophy. Think about what this means. The platform has the cascade. CSS-in-JS tried to eliminate it and failed. Tailwind tried to work around it with utilities. And when Tailwind cautiously reintroduced a cascade-like feature, developers who were trained by years of anti-cascade ideology rejected it. We’ve spent so long teaching people that the cascade is dangerous that even when their own tools try to reintroduce platform capabilities, they don’t want them. We’re not just ignorant of the platform anymore. We’re ideologically opposed to it.

React introduced synthetic events to normalize browser inconsistencies and integrate events into its rendering lifecycle. Instead of attaching listeners directly to DOM nodes, React uses event delegation. It listens at the root, then routes events to handlers through its own system. This feels elegant from a functional perspective. Events become data flowing through your component tree. You don’t touch the DOM directly. Everything stays inside React’s controlled universe. But native browser events already work.
They bubble, they capture, they’re well-specified. The browser has spent decades optimizing event dispatch. By wrapping them in a synthetic layer, React adds indirection: memory overhead for event objects, translation logic for every interaction, and debugging friction when something behaves differently than the native API. Worse, it trains developers to avoid the platform. Developers learn React’s event system, not the web’s. When they need to work with third-party libraries or custom elements, they hit impedance mismatches. addEventListener becomes a foreign API in their own codebase. Again: the web had this. The browser’s event system is fast, flexible, and well-understood. But it wasn’t controlled enough for the FP ideal of a closed system.

The logical extreme of “UI as a pure function of state” is client-side rendering: the server sends an empty HTML shell, JavaScript boots up, and the app renders entirely in the browser. From a functional perspective, this is clean. Your app is a deterministic function that takes initial state and produces a DOM tree. From a web perspective, it’s a disaster. The browser sits idle while JavaScript parses, executes, and manually constructs the DOM. Users see blank screens. Screen readers get empty documents. Search engines see nothing. Progressive rendering, one of the browser’s most powerful features, goes unused.

The industry noticed. Server-side rendering came back. But because the mental model was still “JavaScript owns the DOM,” we got hydration : the server renders HTML, the client renders the same tree in JavaScript, then React walks both and attaches event handlers. During hydration, the page is visible but inert. Clicks do nothing, forms don’t submit. This is architecturally absurd. The browser already rendered the page. It already knows how to handle clicks.
But because the framework wants to own all interactions through its synthetic event system, it must re-create the entire component tree in JavaScript before anything works. The absurdity extends beyond the client. Infrastructure teams watch in confusion as every user makes double the number of requests : the server renders the page and fetches data, then the client boots up and fetches the exact same data again to reconstruct the component tree for hydration. Why? Because the framework can’t trust the HTML it just generated. It needs to rebuild its internal representation of the UI in JavaScript to attach event handlers and manage state. This isn’t just wasteful, it’s expensive. Database queries run twice. API calls run twice. Cache layers get hit twice. CDN costs double. And for what? So the framework can maintain its pure functional model where all state lives in JavaScript. The browser had the data. The HTML had the data. But that data wasn’t in the right shape . It wasn’t a JavaScript object tree, so we throw it away and fetch it again. Hydration is what happens when you treat the web like a blank canvas instead of a platform with capabilities. The web gave us streaming HTML, progressive enhancement, and instant interactivity. We replaced it with JSON, JavaScript bundles, duplicate network requests, and “please wait while we reconstruct reality.”

Consider the humble modal dialog. The web has <dialog>, a native element with built-in functionality: it manages focus trapping, handles Escape key dismissal, provides a backdrop, controls scroll-locking on the body, and integrates with the accessibility tree. It exists in the DOM but remains hidden until opened. No JavaScript mounting required. It’s fast, accessible, and battle-tested by browser vendors. Now observe what gets taught in tutorials, bootcamps, and popular React courses: build a modal with div elements. Conditionally render it when some piece of state is true. Manually attach a click-outside handler.
Write an effect to listen for the Escape key. Add another effect for focus trapping. Implement your own scroll-lock logic. Remember to add ARIA attributes. Oh, and make sure to clean up those event listeners, or you’ll have memory leaks. You’ve just written 100+ lines of JavaScript to poorly recreate what the browser gives you for free. Worse, you’ve trained developers to not even look for native solutions. The platform becomes invisible. When someone asks “how do I build a modal?”, the answer is “install a library” or “here’s my custom hook,” never “use <dialog>.”

The teaching is the problem. When influential tutorial authors and bootcamp curricula skip native APIs in favor of React patterns, they’re not just showing an alternative approach. They’re actively teaching malpractice. A generation of developers learns to build inaccessible div soup because that’s what fits the framework’s reactivity model, never knowing the platform already solved these problems. And it’s not just bootcamps. Even the most popular component libraries make the same choice: shadcn/ui builds its Dialog component on Radix UI primitives, which use divs instead of the native <dialog> element. There are open GitHub issues requesting native support, but the implicit message is clear: it’s easier to reimplement the browser than to work with it.

The problem runs deeper than ignorance or inertia. The frameworks themselves increasingly struggle to work with the platform’s evolution. Not because the platform features are bad, but because the framework’s architectural assumptions can’t accommodate them. Consider why component libraries like Radix UI choose divs over <dialog>. The native element manages its own state: it knows when it’s open, it handles its own visibility, it controls focus internally. But React’s reactivity model expects all state to live in JavaScript, flowing unidirectionally into the DOM. When a native element manages its own state, React’s mental model breaks down.
Keeping a flag in your React state synchronized with the element’s actual open/closed state becomes a nightmare of refs, effects, and imperative calls. Precisely what React was supposed to eliminate. Rather than adapt their patterns to work with stateful native elements, library authors reimplement the entire behavior in a way that fits the framework. It’s architecturally easier to build a fake dialog in JavaScript than to integrate with the platform’s real one.

But the conflict extends beyond architectural preferences. Even when the platform adds features that developers desperately want, frameworks can’t always use them. Accordions? The web has <details> and <summary>. Tooltips? There’s the title attribute and the emerging Popover API. Date pickers? <input type="date">. Custom dropdowns? The web now supports styling <select> elements with new pseudo-elements. You can even put <option> elements with images inside <select> elements now. It eliminates the need for the countless JavaScript select libraries that exist solely because designers wanted custom styling. Frameworks encourage conditional rendering and component state, so these elements don’t get rendered until JavaScript decides they should exist. The mental model is “UI appears when state changes,” not “UI exists, state controls visibility.” Even when the platform adds the exact features developers have been rebuilding in JavaScript for years, the ecosystem momentum means most developers never learn these features exist.

And here’s the truly absurd part: even when developers do know about these new platform features, the frameworks themselves can’t handle them. MDN’s documentation for customizable <select> elements includes this warning: “ Some JavaScript frameworks block these features; in others, they cause hydration failures when Server-Side Rendering (SSR) is enabled. ” The platform evolved. The HTML parser now allows richer content inside <select> elements. But React’s JSX parser and hydration system weren’t designed for this. They expect <option> to only contain text.
Updating the framework to accommodate the platform’s evolution takes time, coordination, and breaking changes that teams are reluctant to make. The web platform added features that eliminate entire categories of JavaScript libraries, but the dominant frameworks can’t use those features without causing hydration errors. The stack that was supposed to make development easier now lags behind the platform it’s built on.

The browser has native routing: <a> tags, the History API, forward/back buttons. It has native forms: <form> elements, validation attributes, submit events. These work without JavaScript. They’re accessible by default. They’re fast. Modern frameworks threw them out. React Router, Next.js’s router, Vue Router; they intercept link clicks, prevent browser navigation, and handle routing in JavaScript. Why? Because client-side routing feels like a pure state transition: URL changes, state updates, component re-renders. No page reload. No “lost” JavaScript state. But you’ve now made navigation depend on JavaScript. Ctrl+click to open in a new tab? Broken, unless you carefully re-implement it. Right-click to copy link? The URL might not match what’s rendered. Accessibility tools that rely on standard navigation patterns? Confused.

Forms got the same treatment. Instead of letting the browser handle submission, validation, and accessibility, frameworks encourage JavaScript-controlled forms. Formik, React Hook Form, uncontrolled vs. controlled inputs; entire libraries exist to manage what <form> already does. The browser can validate instantly, with no JavaScript. But that’s not reactive enough, so we rebuild validation in JavaScript, ship it to the client, and hope we got the logic right. The web had these primitives. We rejected them because they didn’t fit our FP-inspired mental model of “state flows through JavaScript.”

Progressive enhancement used to be a best practice: start with working HTML, layer on CSS for style, add JavaScript for interactivity.
The page works at every level. Now, we start with JavaScript and work backwards, trying to squeeze HTML out of our component trees and hoping hydration doesn’t break.

We lost built-in accessibility. Native HTML elements have roles, labels, and keyboard support by default. Custom JavaScript widgets require ARIA attributes, focus management, and keyboard handlers. All easy to forget or misconfigure. We lost performance. The browser’s streaming parser can render HTML as it arrives. Modern frameworks send JavaScript, parse JavaScript, execute JavaScript, then finally render. That’s slower. The browser can cache CSS and HTML aggressively. JavaScript bundles invalidate on every deploy. We lost simplicity. <a href> is eight characters. A client-side router is a dependency, a config file, and a mental model. A plain form is self-documenting. A controlled form with validation is dozens of lines of state management. And we lost alignment with the platform. The browser vendors spend millions optimizing HTML parsing, CSS rendering, and event dispatch. We spend thousands of developer-hours rebuilding those features in JavaScript, slower.

This isn’t a story of incompetence. Smart people built these tools for real reasons. By the early 2010s, JavaScript applications had become unmaintainable. jQuery spaghetti sprawled across codebases. Two-way data binding caused cascading updates that were impossible to debug. Teams needed discipline, and functional programming offered it: pure components, immutable state, unidirectional data flow. For complex, stateful applications (like dashboards with hundreds of interactive components, real-time collaboration tools, data visualization platforms) React’s model was genuinely better than manually wiring up event handlers and tracking mutations. The FP purists weren’t wrong that unpredictable mutation causes bugs. They were wrong that the solution was avoiding the platform’s mutation-friendly APIs instead of learning to use them well.
But in the chaos of 2013, that distinction didn’t matter. React worked. It scaled. And Facebook was using it in production. Then came the hype cycle. React dominated the conversation. Every conference had React talks. Every tutorial assumed React as the starting point. CSS-in-JS became “modern.” Client-side rendering became the default. When big companies like Facebook, Airbnb, Netflix and others adopted these patterns, they became industry standards. Bootcamps taught React exclusively. Job postings required React experience. The narrative solidified: this is how you build for the web now.

The ecosystem became self-reinforcing through its own momentum. Once React dominated hiring pipelines and Stack Overflow answers, alternatives faced an uphill battle. Teams that had already invested in React (training developers, building component libraries, establishing patterns) faced enormous switching costs. New developers learned React because that’s what jobs required. Jobs required React because that’s what developers knew. The cycle fed itself, independent of whether React was the best tool for any particular job.

This is where we lost the plot. Somewhere in the transition from “React solves complex application problems” to “React is how you build websites,” we stopped asking whether the problems we were solving actually needed these solutions. I’ve watched developers build personal blogs with Next.js. Sites that are 95% static content with maybe a contact form, because that’s what they learned in bootcamp. I’ve seen companies choose React for marketing sites with zero interactivity, not because it’s appropriate, but because they can’t hire developers who know anything else. The tool designed for complex, stateful applications became the default for everything, including problems the web solved in 1995 with HTML and CSS. A generation of developers never learned that most websites don’t need a framework at all.
The question stopped being “does this problem need React?” and became “which React pattern should I use?” The platform’s native capabilities, like progressive rendering, semantic HTML, the cascade, and instant navigation, are now considered “old-fashioned.” Reinventing them in JavaScript became “best practice.” We chased functional purity on a platform that was never designed for it. And we built complexity to paper over the mismatch.

The good news: we’re learning. The industry is rediscovering the platform. HTMX embraces HTML as the medium of exchange: the server sends HTML, the browser renders it, no hydration needed. Qwik’s resumable architecture avoids hydration entirely, serializing only what’s needed. Astro defaults to server-rendered HTML with minimal JavaScript. Remix and SvelteKit lean into web standards: forms that work without JS, progressive enhancement, leveraging the browser’s cache. These tools acknowledge what the web is: a document-based platform with powerful native capabilities. Instead of fighting it, they work with it.

This doesn’t mean abandoning components or reactivity. It means recognizing that “UI as a function of state” is a useful model inside your framework, not a justification to rebuild the entire browser stack. It means using CSS for styling, native events for interactions, and HTML for structure, and then reaching for JavaScript when you need interactivity beyond what the platform provides. The best frameworks of the next decade will be the ones that feel like the web, not the ones that fight it.

In chasing functional purity, we built a frontend stack that is more complex, more fragile, and less aligned with the platform it runs on. We recreated CSS in JavaScript, events in synthetic wrappers, rendering in hydration layers, and routing in client-side state machines. We did this because we wanted predictability, control, and clean abstractions. But the web was never meant to be pure.
It’s a sprawling, messy, miraculous platform built on decades of emergent behavior, pragmatic compromises, and radical openness. Its mutability isn’t a bug. It’s the reason a document written in 1995 still renders in 2025. Its global scope isn’t dangerous. It’s what lets billions of pages share a design language. Maybe the web didn’t need to be purified. Maybe it just needed to be understood. I want to thank my friend Ihab Khattab for reviewing this piece and providing invaluable feedback.

Neil Madden 2 months ago

Rating 26 years of Java changes

I first started programming Java at IBM back in 1999 as a Pre-University Employee. If I remember correctly, we had Java 1.1.8 installed at that time, but were moving to Java 1.2 (“Java 2”), which was a massive release—I remember engineers at the time grumbling that the ever-present “Java in a Nutshell” book had grown to over 600 pages. I thought I’d take a look back at 26 years of Java releases and rate some of the language and core library changes (Java SE only) that have occurred over this time. It’s a very different language to what I started out with! I can’t possibly cover every feature of those releases, as there are just way too many. So I’m just going to cherry-pick some that seemed significant at the time, or have been in retrospect. I’m not going to cover UI- or graphics-related stuff (Swing, Java2D etc), or VM/GC improvements. Just language changes and core libraries. And obviously this is highly subjective. Feel free to put your own opinions in the comments! The descriptions are brief and not intended as an introduction to the features in question: see the links from the Wikipedia page for more background. NB: later features are listed from when they were first introduced as a preview. The Collections Framework: before the collections framework, there were just raw arrays, Vector, and Hashtable. It gets the job done, but I don’t think anyone thinks the Java collections framework is particularly well designed. One of the biggest issues was a failure to distinguish between mutable and immutable collections, strange inconsistencies like why Iterator has a remove() method (but not, say, update or insert), and so on. Various improvements have been made over the years, and I do still use it in preference to pulling in a better alternative library, so it has stood the test of time in that respect. 4/10 The assert keyword: I remember being somewhat outraged at the time that they could introduce a new keyword! 
I’m personally quite fond of asserts as an easy way to check invariants without having to do complex refactoring to make things unit-testable, but that is not a popular approach. I can’t remember the last time I saw an assert in any production Java code. 3/10 Regular expressions: Did I really have to wait 3 years to use regex in Java? I don’t remember ever having any issues with the implementation they finally went for. The Matcher class is perhaps a little clunky, but gets the job done. Good, solid, essential functionality. 9/10 “New” I/O (NIO): Provided non-blocking I/O for the first time, but really just a horrible API (still inexplicably using 32-bit signed integers for file sizes, limiting files to 2GB, confusing interface). I still basically never use these interfaces except when I really need to. I learnt Tcl/Tk at the same time that I learnt Java, and Java’s I/O always just seemed extraordinarily baroque for no good reason. Has barely improved in 2 and a half decades. 0/10 Also notable in this release were the new crypto APIs: the Java Cryptography Extensions (JCE) added encryption and MAC support to the existing signatures and hashes, and we got JSSE for SSL. Useful functionality, dreadful error-prone APIs. 1/10 Absolutely loads of changes in this release. This feels like the start of modern Java to me. Generics: as Go discovered on its attempt to speed-run Java’s mistakes all over again, if you don’t add generics from the start then you’ll have to retrofit them later, badly. I wouldn’t want to live without them, and the rapid and universal adoption of them shows what a success they’ve been. They certainly have complicated the language, and there are plenty of rough edges (type erasure, reflection, etc), but God I wouldn’t want to live without them. 8/10. Annotations: sometimes useful, sometimes overused. I know I’ve been guilty of abusing them in the past. 
At the time it felt like they were ushering in a new age of custom static analysis, but that doesn’t really seem to be used much. Mostly just used to mark things as deprecated or when overriding a method. Meh. 5/10 Autoboxing: there was a time when, if you wanted to store an integer in a collection, you had to manually convert to and from the primitive int type and the Integer “boxed” class. Such conversion code was everywhere. Java 5 got rid of that, by getting the compiler to insert those conversions for you. Brevity, but no less inefficient. 7/10 Enums: I’d learned Haskell by this point, so I couldn’t see the point of introducing enums without going the whole hog and doing algebraic datatypes and pattern-matching. (Especially as Scala launched about this time). Decent feature, and a good implementation, but underwhelming. 6/10 Vararg methods: these have done quite a lot to reduce verbosity across the standard library. A nice small improvement that’s been a good quality-of-life enhancement. I still never really know when to put @SafeVarargs annotations on things though. 8/10 The for-each loop: cracking, use it all the time. Still not a patch on Tcl’s foreach (which can loop over multiple collections at once), but still very good. Could be improved and has been somewhat replaced by Streams. 8/10 Static imports: Again, a good simple change. I probably would have avoided adding * imports for statics, but it’s quite nice for DSLs. 8/10 Doug Lea’s java.util.concurrent etc.: these felt really well designed. So well designed that everyone started using them in preference to the core collection classes, and they ended up back-porting a lot of the methods. 10/10 After the big bang of Java 5, Java 6 was mostly performance and VM improvements, I believe, so we had to wait until 2011 for more new language features. Strings in switch: seems like a code smell to me. Never use this, and never see it used. 1/10 Try-with-resources: made a huge difference in exception safety. 
Combined with the improvements in exception chaining (so root cause exceptions are not lost), this was a massive win. Still use it everywhere. 10/10 Diamond operator for type parameter inference: a good minor syntactic improvement to cut down the visual noise. 6/10 Binary literals and underscores in literals: again, minor syntactic sugar. Nice to have, rarely something I care about much. 4/10 Path and Filesystem APIs: I tend to use these over the older File APIs, but just because it feels like I should. I couldn’t really tell you if they are better or not. Still overly verbose. Still insanely hard to set file permissions in a cross-platform way. 3/10 Lambdas: somewhat controversial at the time. I was very in favour of them, but only use them sparingly these days, due to ugly stack traces and other drawbacks. Named method references provide most of the benefit without being anonymous. Deciding to exclude checked exceptions from the various standard functional interfaces was understandable, but also regularly a royal PITA. 4/10 Streams: Ah, streams. So much potential, but so frustrating in practice. I was hoping that Java would just do the obvious thing and put filter/map/reduce methods onto Collection and Map, but they went with this instead. The benefits of functional programming weren’t enough to carry the feature, I think, so they had to justify it by promising easy parallel computing. This scope creep enormously over-complicated the feature and makes it hard to debug issues, and yet I almost never see parallel streams being used. What I do still see quite regularly is resource leaks from people not realising that the stream returned from Files.lines() has to be close()d when you’re done—but doing so makes the code a lot uglier. 
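The Files.lines() leak mentioned above looks roughly like this; wrapping the stream in try-with-resources fixes it, at the cost of the extra nesting the author complains about (the helper names here are my own illustration, not from the post):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class LinesDemo {
    // Leaky: the returned stream holds an open file handle that is never closed.
    static long countLongLinesLeaky(Path p) throws IOException {
        return Files.lines(p).filter(line -> line.length() > 80).count();
    }

    // Correct: Stream<String> is AutoCloseable, so try-with-resources
    // releases the underlying file handle when the block exits.
    static long countLongLines(Path p) throws IOException {
        try (Stream<String> lines = Files.lines(p)) {
            return lines.filter(line -> line.length() > 80).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, List.of("short", "x".repeat(100)));
        System.out.println(countLongLines(tmp)); // prints 1
        Files.delete(tmp);
    }
}
```

The leak is invisible at the call site because most streams (over in-memory collections) never need closing, which is exactly why it keeps catching people out.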
Combine that with ugly hacks around callbacks that throw checked exceptions, the non-discoverable API (where are the static helper functions I need for this method again?), and the large impact on lots of very common code, and I have to say I think this was one of the largest blunders in modern Java. I blogged what I thought was a better approach 2 years earlier, and I still think it would have been better. There was plenty of good research that different approaches were better, since at least Oleg Kiselyov’s work in the early noughties. 1/10 Java Time: Much better than what came before, but I have barely had to use much of this API at all, so I’m not in a position to really judge how good this is. Despite knowing how complex time and dates are, I do have a nagging suspicion that surely it doesn’t all need to be this complex? 8/10 Modules: I still don’t really know what the point of all this was. Enormous upheaval for minimal concrete benefit that I can discern. The general advice seems to be that modules are (should be) an internal detail of the JRE and best ignored in application code (apart from when they spuriously break things). Awful. -10/10 (that’s minus 10!) jshell: cute! A REPL! Use it sometimes. Took them long enough. 6/10 The start of time-based releases, and a distinct ramp-up of features from here on, trying to keep up with the kids. Local type inference (“var”): Some love this, some hate it. I’m definitely in the former camp. 9/10 New HTTP Client: replaced the old URL.openStream() approach by creating something more like Apache HttpClient. It works for most purposes, but I do find the interface overly verbose. 6/10 This release also added TLS 1.3 support, along with djb-suite crypto algorithms. Yay. 9/10 Switch expressions: another nice mild quality-of-life improvement. Not world changing, but occasionally nice to have. 6/10 Text blocks: on the face of it, what’s not to like about multi-line strings? 
Well, apparently there’s a good reason that injection attacks remain high on the OWASP Top 10, as the JEP introducing this feature seemed intent on getting everyone writing SQL, HTML and JavaScript using string concatenation again. Nearly gave me a heart attack at the time, and still seems like a pointless feature. Text templates (later) are trying to fix this, but seem to be currently in limbo. 3/10 Pattern matching in instanceof: a little bit of syntactic sugar to avoid an explicit cast. But didn’t we all agree that using instanceof was a bad idea decades ago? I’m really not sure who was doing the cost/benefit analysis on these kinds of features. 4/10 Records: about bloody time! Love ‘em. 10/10 Better error messages for NullPointerExceptions: lovely. 8/10 Sealed classes: in principle I like these a lot. We’re slowly getting towards a weird implementation of algebraic datatypes. I haven’t used them very much so far. 8/10 EdDSA signatures: again, a nice little improvement in the built-in cryptography. Came with a rather serious bug though… 8/10 Vector (SIMD) API: this will be great when it is finally done, but still baking several years later. ?/10 Pattern matching switch: another piece of the algebraic datatype puzzle. Seems somehow more acceptable than instanceof, despite being largely the same idea in a better form. 7/10 UTF-8 by default: Fixed a thousand encoding errors in one fell swoop. 10/10 Record patterns: an obvious extension, and I think we’re now pretty much there with ADTs? 9/10 Virtual threads: being someone who never really got on with async/callback/promise/reactive stream-based programming in Java, I was really happy to see this feature. I haven’t really had much reason to use them in anger yet, so I don’t know how well they’ve been done. But I’m hopeful! ?/10 String templates: these are exactly what I asked for in A few programming language features I’d like to see , based on E’s quasi-literal syntax, and they fix the issues I had with text blocks. 
Unfortunately, the first design had some issues, and so they’ve gone back to the drawing board. Hopefully not for too long. I really wish they’d not released text blocks without this feature. 10/10 (if they ever arrive). Sequenced collections: a simple addition that adds a common super-type to all collections that have a defined “encounter order”: lists, deques, sorted sets, etc. It defines convenient getFirst() and getLast() methods and a way to iterate items in the defined order or in reverse order. This is a nice unification, and plugs what seems like an obvious gap in the collections types, if perhaps not the most pressing issue? 6/10 Wildcards in patterns: adds the familiar syntax from Haskell and Prolog etc of using _ as a non-capturing wildcard variable in patterns when you don’t care about the value of that part. 6/10 Simplified console applications: Java finally makes simple programs simple for beginners, about a decade after universities stopped teaching Java to beginners… Snark aside, this is a welcome simplification. 8/10 This release also adds support for KEMs, although in the simplest possible form only. Meh. 4/10 The only significant change in this release is the ability to have statements before a call to super() in a constructor. Fine. 5/10 Primitive types in patterns: plugs a gap in pattern matching. 7/10 Markdown javadoc comments: Does anyone really care about this? 1/10 The main feature here from my point of view as a crypto geek is the addition of post-quantum cryptography in the form of the newly standardised ML-KEM and ML-DSA algorithms, and support in TLS. Stable values: this is essentially support for lazily-initialised final variables. Lazy initialisation is often trickier than it should be in Java, so this is a welcome addition. Remembering Alice ML, I wonder if there is some overlap between the proposed StableValue and a Future? 7/10 
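The ADT-style pieces reviewed above (sealed classes, records, record patterns, and pattern-matching switch) fit together like this on Java 21+; the Shape hierarchy is my own toy example, not from the post:

```java
public class AdtDemo {
    // A sealed interface plus records is roughly an algebraic datatype:
    // the compiler knows these are the only two cases.
    sealed interface Shape permits Circle, Rect {}
    record Circle(double radius) implements Shape {}
    record Rect(double w, double h) implements Shape {}

    // Pattern-matching switch with record patterns; because Shape is sealed,
    // the switch is checked for exhaustiveness and needs no default branch.
    static double area(Shape s) {
        return switch (s) {
            case Circle(double r) -> Math.PI * r * r;
            case Rect(double w, double h) -> w * h;
        };
    }

    public static void main(String[] args) {
        System.out.println(area(new Rect(3, 4))); // prints 12.0
    }
}
```

Adding a third record to the sealed hierarchy turns every non-exhaustive switch into a compile error, which is the "ADT" payoff the post is pointing at.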
PEM encoding of cryptographic objects is welcome from my point of view, but someone will need to tell me why this is not just ? Decoding support is useful though, as that’s a frequent reason I have to grab Bouncy Castle still. 7/10 Well, that brings us pretty much up to date. What do you think? Agree, disagree? Are you a passionate defender of streams or Java modules? Have at it in the comments.

Blargh 2 months ago

Ideal programming language

My last post about Go got some attention. In fact, two of my posts got attention that day, which broke my nginx since I was running livecount behind nginx, making me run out of file descriptors when thousands of people had the page opened. It’s a shame that I had to turn off livecount, since it’d be cool to see the stats. But I was out of the country, with unreliable access to both Internet and even electricity in hotels, so I couldn’t implement the real fix until I got back, when it had already mostly died down. I knew this was a problem with livecount, of course, and I even allude to it in its blog post. Anyway, back to programming languages. The reactions to my post can be summarized as: I respect the first two. The last one has to be from people who are too emotionally invested with their tools, and take articles like this as a personal attack of some sort. They go out of their way to be offended, and then start screaming “but I don’t fucking want guitar lessons!”. They want to counter-attack against another programming language, thinking I would take it personally too. Maybe this heretic is a Java programmer, and that’s why he’s stupid? (bad guess). It also reminded me of PHP programmers back in the PHP 3.x days who would die on the hill of defending PHP as an awesome language, while admitting that they knew literally no other language. What should they know of England who only England know? I’m not offended. Those replies are not offensive; they’re boring. There’s nothing to learn from the comments, and probably not from the people making such comments in general, either. Also it seems that somebody managed to get my whole blog comment database deleted from Disqus. Either Disqus itself was hacked, or just my account with them. Or someone tricked Disqus into deleting it. They managed to restore it, though. 
Keeping this third type of commenter in mind, I got an email a few days later asking what programming languages are “closer to ideal”, and if I maybe have a blog post in me about that. I don’t know who’s asking, so I replied the long version, while still taking the question at face value. Ideal… Well, this is getting into the space of “what is the best programming language”, which doesn’t have a perfect answer. To do what? To make an Android app (something I’m not an expert in. I’ve just made one simple one), I think Kotlin seems nice. But I don’t know it very well. For web development, something I also don’t do much, it’s probably Typescript. For maximum portability for systems programming, C or C++ (depending on whether all your target platforms (e.g. embedded stuff) support C++) is probably best. But these are practical answers. Some people like Lisp. Others like Haskell. Rust strikes a good position between practical, low level control, safe, and a high level type system. If the Rust compiler supports your platform, then it’s pretty much as portable as C/C++. I’ve written about the deficiencies of Java and Go because the ways they are deficient are interesting. I don’t find the ways C++ is deficient to be interesting. C++ is what it is. I happen to enjoy coding C++. I also have thoughts about the trap of accidentally writing too much in bash. I have a draft of things wrong with Rust. But so far I think they are all fixable. (e.g. no placement syntax has been defined yet) But I have no interest in writing a blog post about the lack of memory safety of C++. It is what it is. Java’s deficiencies are interesting because they are the best guesses of the future that 1990 had to offer. And those guesses were almost all wrong. Go’s deficiencies are interesting/frustrating because it (almost entirely) is the best that the 1980s-early 1990s had to offer. And yet it launched in 2009. I don’t enjoy coding Javascript. 
So I’m experimenting with writing frontend stuff in Rust and compiling it to WASM. So far it works for me, but is not something I’d recommend for anyone who wants to get anything done. But no, I won’t be writing up which programming language is “ideal”, because it’s one of those “it depends”.

The three types of reaction, paraphrased:
“Oh yes, these things are definite flaws in the language.”
“What you’re saying is true, but it’s not a problem. Your post is pointless.”
“You’re dumb. You don’t understand Go. Here let me explain your own blog post to you […]”

qouteall notes 2 months ago

Higher-Level Design Patterns

Higher-level software design patterns: These patterns and ideas are often deeply connected and used together. It involves two different aspects: The benefit of turning computation (logic and action) into data: The benefit of turning execution state into explicit data: The distinction between computation and execution state is blurry. A closure can capture data. An execution state can be seen as a continuation, which is also a computation. Algebraic effect: An effect handler executes some code in a scope. Some code is executed under an effect handler. When it performs an effect, the control flow jumps to the effect handler, and the execution state ( delimited continuation ) up to the effect handler's scope is also saved. The effect handler can then resume using the execution state. A simple introduction to Algebraic effects Delimited continuation is the execution state turned into data. It's delimited because the execution state only includes the stack frames within the effect-handling scope. The continuation (without "delimited") contains the whole execution state of the whole program (assuming the program is single-threaded). Delimited continuation is "local". Continuation is "global". The "local" one is more fine-grained and useful. Continuation passing style (CPS) is a way of representing programs. In CPS, each function accepts a continuation. Returning becomes calling the continuation. Calling the continuation continues execution. The output of the continuation is the "final output of the whole program" (if IO or mutable state is involved, the "final output of the whole program" can be empty). See also: Configuration complexity clock When you try to handle many different business requests, one solution is to create a flexible rules engine. Configuring the rules engine can handle all of the requests. However, a tradeoff then becomes salient: DSLs are useful when they're high in abstraction level and new requirements mostly follow the abstraction. System calls are expensive. 
Replacing system calls with data can improve performance: Mutation can be represented as data. Data can be interpreted as mutation. The benefits: About rollback: Sometimes, to improve user experience, we need to replay conflicted changes instead of rolling them back. It's more complex. In some places, we specify a new state and need to compute the mutation (diff). Examples: Mutate-by-recreate: Keep data immutable. Change it by recreating the whole data. In multi-threading, for read-heavy data, it's often beneficial to make the data structure immutable, but keep one mutable atomic reference to it. Updating recreates the whole data structure and atomically changes the reference. This is called read-copy-update (RCU) or copy-on-write (COW). In pure functional languages (e.g. Haskell), there is no direct way of mutating things. Mutation can only be simulated by recreating. Bitemporal modelling: Store two pieces of records. One records the data and the time it was written to the database. The other records the data and the time it reflects in reality. (Sometimes the reality changes but the database isn't edited immediately. Sometimes the database contains wrong information that's corrected later.) Conflict-free replicated data type (CRDT): The mutations can be combined, and the result doesn't depend on the order of combining. It allows a distributed system to get eventual consistency without immediate communication. In CRDT, the operator ∗ for combining mutations: Examples of CRDT: For example, in a multiplayer game, there is a door. The door's state can be open or closed (a boolean). Each operation is a tuple of (timestamp, doorState). Combination is max-by-timestamp (of two operations, pick the higher-timestamp one). Considering that multiple players can operate at exactly the same timestamp, we add player ID as a tie-breaker. The operation now becomes (timestamp, playerId, doorState), and combination takes the max by (timestamp, playerId); if equal, the larger wins. Note that a typical multiplayer game implementation doesn't use CRDT. The server holds the source-of-truth game state. 
Clients send actions to servers. The server validates actions, changes the game state, and broadcasts to all clients. Drawing solid triangles to a framebuffer can also be seen as a CRDT. The whole framebuffer can be seen as an operation. Each pixel in the framebuffer has a depth value. Combining two framebuffers takes the lowest-depth pixel at each position. (Two framebuffers may have the same depth on the same pixel with different colors. We can use a unique triangle ID as a tie-breaker.) Note that actual rasterization in a GPU works by having one centralized framebuffer, not using a CRDT. In a collaborative text editing system, each character has an ID. It supports two kinds of operations: There is a "root character" at the beginning of the document. It's invisible and cannot be deleted. For two insertions after the same character, the tie-breaker is (timestamp, userId). Higher-timestamp ones appear first. For the same timestamp, higher-user-id ones appear first. It forms a tree. Each character is a node, containing a visibility boolean flag. Each operation is an edge pointing to the new character. The document is formed by traversing the tree in depth-first order (edges ordered by tie-breaker) while hiding invisible characters. 3 4 Only compute some parts of the data, and keep the information of the remaining computation for future use. Deferred (async) computation vs immediate computation: A computation, an optimization, or a safety check can be done in: Most computations that are done at compile time can be done at runtime (with extra performance cost). But if you want to avoid the performance cost by doing it at compile time, it becomes harder. Rust and C++ have Statics-Dynamics Biformity (see also): most runtime computation methods cannot be easily used at compile time. Using compile-time mechanisms often requires data to be encoded in types, which then requires type gymnastics. 
The ways that solve (or partially solve) the biformity between compile-time and runtime computation: Related: Symbolic execution, Program search using superposition. Views in SQL databases are "fake" tables that represent the result of a query. The view's data is derived from other tables (or views). The generalized concept of a view: a view takes one information model and presents it as another information model, and allows operating on it as that other information model. The generalized view can be understood as: Examples of the generalized view concept: More generally: Dynamically-typed languages also have "types". The "type" here is the mapping between in-memory data and information. Even in dynamic languages, the data still has "shape" at runtime. The program only works with specific "shapes" of data. For example, in Python, if a function accepts a list of strings but you pass it one string, it treats each character as a string, which is wrong. Mainstream languages often have relatively simpler and less expressive type systems. Some "shapes" of data are complex and cannot be easily expressed in mainstream languages' type systems (without type erasure). Dynamic languages' benefits: But statically-typed languages and IDEs are improving. More expressive type systems reduce the friction of typing. Types can help catch mistakes, help with understanding code, and help IDE functionality. Type inference and IDE completion save the time of typing out types. That's why mainstream dynamic languages (JS, Python) are embracing type annotations. A view can be backed by either storage or computation (or a combination of the two). Modern highly-parallel computation is often bottlenecked by IO and synchronization. Adding new computation hardware units is easy. Making information flow efficiently between these hardware units is hard. When memory IO becomes the bottleneck, re-computing rather than storing can be beneficial. 
Most algorithms use the idea of producing an invariant, growing it, and maintaining it: About the transitive rule: if X and Y both follow an invariant, then the result of "merging" X and Y also follows the invariant. "Transitive" is why the invariant can grow without re-checking the whole data. When the invariant is enforced at the language level, it can be "contagious". For example, invariants in business logic: Invariants in data: The timing of maintaining an invariant: The responsibility of maintaining an invariant: In the second case (application code maintains the invariant), to make it less error-prone, we can encapsulate the data together with the invariant-maintaining code, ensuring that any usage of the encapsulated API won't violate the invariant. If some usages of the API can break the invariant and a developer can only know that by reading the implementation, it's a leaky abstraction. For example, one common invariant to maintain is consistency between derived data and base data (the source of truth). There are many solutions: In real-world legacy code, invariants are often not documented. They are implicit in the code. A developer not knowing an invariant can easily break it. Type systems also help maintain invariants. But a simple type system can only maintain simple invariants. Complex invariants require complex types to maintain. If it becomes too complex, the types may be longer than the execution code, and type errors become harder to resolve. It's a tradeoff. Concentration and fat-tail distributions (80/20) are common in the software world: Many optimizations are based on assuming the high-probability case happens: GoF design patterns Other GoF design patterns briefly explained: It's possible to use separate threads for suspendable computation. However, OS threads are expensive and context switches are expensive. A manually-implemented state machine is faster. ↩ It's possible to use polling to fully avoid system calls after initial setup, but with costs. ↩ There are optimizations. 
To avoid storing a unique ID for each character, it can store many immutable text blocks and use positions within blocks as character IDs. Consecutive insertions and deletions can be merged. The tree keeps growing and needs to be merged. The exact implementation is complex. ↩ Another way of doing collaborative text editing is Operational Transformation. It uses an index as the cursor. The server can transform a cursor to the latest version of the document: if there are insertions before it, the index increases accordingly. If there are deletions, it decreases accordingly. This is also called index rebasing. ↩ Also: In a hard disk, the magnetic field is viewed as bits. In a CD, the pits and lands are viewed as bits. In an SSD, the electrons' positions in floating gates are viewed as bits. In fiber optics, light pulses are viewed as bits. In a quantum computer, the quantum state (like the spin of an electron) can be viewed as bits. ...... ↩ Specifically, bytes are viewed as code units, code units are viewed as code points, and code points are viewed as strings. Code points can also be viewed as grapheme clusters. ↩ For beginners, a common misconception is that "if the software shows things on screen, then it's 90% done". In reality, a proof-of-concept is often just 20% done. There are so many corner cases in real usage. Not handling one corner case is a bug. Most code is used for handling corner cases, not common cases. Although each specific corner case has a small probability of triggering, triggering any of the many corner cases is high-probability. ↩ Computation-data duality. Mutation-data duality. Partial computation and multi-stage computation. Generalized view. Invariant production, growth, and maintenance. Turn computation (logic and action) into data. Turn execution state into explicit data. Closure (lambda expression, function value): a function along with captured data. It allows reusing a piece of code along with captured data. 
It can help abstraction: separate the generation of computation (creating function values) from the execution of computation (calling the functions). (It's related to partial computation and multi-stage computation) Composition: computation that's turned into data can be more easily composed. Functional programming encourages having simple building blocks and composing them into complex logic. Flexibility: computation that's turned into data can be changed and rebuilt dynamically. Inspection: explicit execution state is easier to inspect and display (machine code can be optimized and is platform-dependent, so machine-code execution positions and runtime stacks are harder to inspect and manipulate than explicit data). Serialization: explicit execution state can be serialized and deserialized, and thus stored in a database or sent across the network. (Example: Restate) Suspension: explicit execution state allows temporarily suspending execution and resuming it later. Suspending a thread is harder and less efficient 1 . Modification: explicit execution state can be modified. It makes cancellation and rollback easier. (Modifying the execution stack and execution state directly is harder, and it's not supported by many mainstream languages.) Forking: allows forking control flow, which can be useful in some kinds of simulations. If the rules engine is high in abstraction level, doing a lot of predefined things under the hood, then it will be inflexible when a new requirement clashes with predefined behavior. Simple interface = hardcoded defaults = less customizability. If the rules engine is low in abstraction level, then doing things will require more configuration. It's not more convenient than just coding. It becomes a new DSL. The new DSL is often worse than mainstream languages because: The new DSL often has poor debug support. The new DSL often has no IDE support. No existing library ecosystem. Need to reinvent wheels. The new DSL is often less battle-tested and more buggy. 
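The "separate generation from execution" point about closures can be sketched like this (my own toy example): a pipeline is built as data (a list of function values), composed, and only later run:

```java
import java.util.List;
import java.util.function.Function;

public class PipelineDemo {
    public static void main(String[] args) {
        // Generation: build the computation as data; nothing runs yet.
        List<Function<Integer, Integer>> steps = List.of(
            x -> x + 1,   // increment
            x -> x * 2    // double
        );

        // Composition: fold the steps into a single function value.
        Function<Integer, Integer> pipeline =
            steps.stream().reduce(Function.identity(), Function::andThen);

        // Execution happens separately, possibly much later or on many inputs.
        System.out.println(pipeline.apply(3)); // prints 8: (3 + 1) * 2
    }
}
```

Because the pipeline is plain data until applied, it can be inspected, reordered, or rebuilt at runtime, which is exactly the flexibility the notes describe.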
io_uring: allows submitting many IO tasks by writing into memory, then using one system call to submit them. 2 Graphics APIs: old OpenGL uses system calls to change state and dispatch draw calls. New graphics APIs like Vulkan, Metal and WebGPU all use command buffers. Operations are turned into data in a command buffer, then one system call submits many commands. Instead of just doing in-place mutation, we can enqueue a command (or event) to do the mutation later. The command is then processed to do the actual mutation. (It's also moving computation between stages) Layered filesystem (in Docker): mutating or adding a file creates a new layer. The unchanged previous layers can be cached and reused. Event sourcing: derive the latest state from events (log, mutations). Express the latest state as a view of the old state + mutations. The idea is adopted by database WAL, data replication, the Lambda architecture, etc. Command Query Responsibility Segregation: the system has two facades: the query facade doesn't allow mutation, and the command facade only accepts commands and doesn't return data. Easier to inspect, audit and debug mutations, because mutations are explicit data, not implicit execution history. Easier to audit and replay. Can replay mutations and roll back easily. Can replicate (sync) data changes without sending the full data. Transactional databases allow rolling back an uncommitted transaction. (MySQL InnoDB does in-place mutation on disk but writes an undo log and a redo log. PostgreSQL MVCC writes are append-only on disk.) Editing software often needs to support undo. It's often implemented by storing the previous step's data, while sharing unchanged substructure to optimize. Multiplayer game clients that do server-state prediction (to reduce visible latency) need to roll back when the prediction is invalidated by the server's message. CPUs do branch prediction and speculative execution. If branch prediction fails or there is another failure, the CPU internally rolls back. 
(The Spectre and Meltdown vulnerabilities are caused by rollback not cancelling side effects in the cache, which can be measured via access timing.) Git: computes diffs based on file snapshots. The diff can then be manipulated (e.g. merging, rebasing, cherry-picking). React: computes a diff from a virtual data structure and applies it to the actual DOM, syncing changes from the virtual data structure to the actual DOM. Kubernetes: you configure what pods/volumes/... should exist. Kubernetes observes the diff between reality and configuration, then performs actions (e.g. launching a new pod, destroying a pod) to close the diff. It must be commutative: a * b = b * a. The order of combining doesn't matter. It must be associative: a * (b * c) = (a * b) * c. The grouping of combining doesn't matter. It must be idempotent: a * a = a. Duplicating a mutation won't affect the result. (Idempotence is not needed if you ensure exactly-once delivery.) Insertion inserts a new character after the character with the given id. The id is unique globally. Deletion only marks the character's invisible flag (keeping the tombstone). In lazy evaluation, unobserved data is not computed. Deferred mutation relates to mutation-data duality. Replacing immediately executed code with data (expression tree, DSL, etc.) that will be executed (interpreted) later relates to computation-data duality. In multi-stage programming, some data is fixed while some data is unknown. The fixed data can be used for optimization; it can be seen as runtime constant value folding. JIT can be seen as treating bytecode as a runtime constant and folding it into the interpreter code. Replacing a value with a function or expression tree helps handle currently-unknown data. A future (promise) object represents a pending computation. In Idris, having a hole and inspecting the hole's type can help with proving. Freeing memory immediately is immediate computation.
GC is deferred computation. Stream processing is immediate computation. Batch processing is deferred computation. Most of PyTorch's matrix operations are async; the GPU computes in the background. The tensor object's content may be as yet unknown (and the CPU will wait for the GPU when you try to read its content). PostgreSQL and SQLite require a deferred "vacuum" that rearranges storage space. Mobile GPUs often do tiled rendering. After the vertex shader runs, triangles are not immediately rasterized, but dispatched to tiles (one triangle can go to multiple tiles). Each tile then rasterizes and runs the pixel shader separately. This can reduce memory bandwidth requirements and power consumption. Pre-compile stage (code generation, IDE linting, etc.). Compile stage (compile-time computation, macros, dependent-type theorem proving, etc.). Runtime stage (runtime checks, JIT compilation, etc.). After first run (offline profile-guided optimization, etc.). Zig compile-time computation and reflection. See also dependently-typed languages (e.g. Idris, Lean). Scala multi-stage programming. See also . It's at runtime, not compile-time, but its purpose is similar to macros and code generation: dynamically compose code at runtime which then gets JITed. Encapsulating information: hiding the true underlying information from you and only exposing derived information. "Faking" information. Bits are views of voltage in circuits. 5 Integers are views of bits. Characters are views of integers. Strings are views of characters. 6 Other complex data structures are views of binary data (a pointer can be seen as a view of the pointed-to data). A map (dictionary) can be viewed as a function. Lookup acceleration structures (e.g. database indexes) are also views of the underlying data. A cache is a view of underlying data/computation. Lazy evaluation provides a view of the computation result. Virtual memory is a view of physical memory. A file system is a view of data on disk.
Data that is not on disk can also be viewed as files (the Unix everything-is-a-file philosophy). A symbolic link is a view of another point in the file system. Databases provide generalized views of on-disk/in-memory data. Linux namespaces, hypervisors, sandboxes, etc. provide views of aspects of the system. Proxies, NAT, firewalls, virtualized networking, etc. provide manipulated views of the network. Transaction isolation in databases provides views of data (e.g. snapshot isolation). Replicated data and redundant data are views of the original data. Multi-tier storage systems, from small-and-fast to large-and-slow: registers, cache, memory, disk, cloud storage. The previously mentioned computation-data duality and mutation-data duality can also be seen as viewing. Transposing in PyTorch (by default) doesn't change the underlying matrix data; it only changes how the data is viewed. The mapping between binary data and information is a view. Information is bits + context. The context is how the bits are mapped to information. A type encodes a view from binary data to information. Abstraction involves viewing different things as the same thing. Avoids the shackles of an unexpressive type system. Avoids syntax inconvenience related to type erasure (type erasure in typed languages requires inconvenient things like type conversion). Lets you quickly iterate by changing one part of the program before the changes work with the other parts (in statically typed languages, you need to resolve all compile errors, even in code you aren't using now). This is a double-edged sword: broken code that was not tested tends to get missed. Saves some time typing types and type definitions. A reference in GC languages may be implemented with a pointer, a colored pointer, or a handle (object ID). The pointer may be changed by a moving GC, but the in-language semantics don't change after moving. All kinds of IDs (string IDs, integer IDs, UUIDs, etc.) can be seen as references to an object.
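The ID-as-reference idea can be sketched with a table from IDs to objects (a hypothetical sketch; the names are made up). Note how the ID keeps existing even after the object it refers to is gone:

```python
# A hypothetical sketch: IDs as references. An object table maps IDs to
# objects. Unlike a language-level pointer, an ID can outlive the object
# it refers to, at which point dereferencing it fails ("dangling ID").
import uuid

objects = {}

def create(obj):
    oid = str(uuid.uuid4())      # globally unique ID acts as a reference
    objects[oid] = obj
    return oid

def deref(oid):
    if oid not in objects:
        raise KeyError(f"dangling ID: {oid}")
    return objects[oid]

oid = create({"name": "alice"})
assert deref(oid)["name"] == "alice"

del objects[oid]                 # the object is removed...
try:
    deref(oid)                   # ...but the ID itself still exists
    dangling = False
except KeyError:
    dangling = True
assert dangling
```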
The ID may still exist after the referenced object is removed; the ID then becomes a "dangling ID". A special kind of ID is a path. For example, a file path points to a file, a URL points to a web resource, a permission path points to a permission, etc. They are the "pointers" to nodes in a hierarchical (tree-like) structure. Content-addressable ID: using the hash of an object as its ID. This is used in Git, blockchains, IPFS and languages like Unison. Iterator: an iterator can be seen as a pointer to an element in a container. Zipper: a zipper contains two things: 1. a container with a hole, and 2. the element at the position of the hole. Unlike an iterator, a zipper contains the information of the whole container. It's often used in pure functional languages. Produce invariant: create an invariant at the smallest scale, in the simplest case. Grow invariant: combine or expand small invariants to make them larger. This often utilizes the transitive rule. Do it until the invariant becomes big enough to finish the task. Maintain invariant: every mutation to a data structure needs to maintain its invariant. Merge sort: create sorted sub-sequences at the smallest scale (e.g. two elements). Then merge two sorted sub-sequences into a bigger one, and continue. The invariant of sorted-ness grows up to the whole sequence. Quick sort: select a pivot. Then partition the sequence into a part that's smaller than the pivot and a part that's larger than the pivot (and a part that equals the pivot). By partitioning, it creates the invariant LeftPartElements < Pivot < RightPartElements. By recursively creating such invariants down to the smallest scale (individual elements), the whole sequence is sorted. Binary search tree.
It creates the invariant LeftSubtreeElements ≤ ParentNode ≤ RightSubtreeElements. When there is only one node, the invariant is produced at the smallest scale. Every insertion then follows, grows and maintains that invariant. Dijkstra's algorithm: the visited nodes are the nodes whose shortest paths from the source node are known. By using the nodes whose shortest paths are known, it "expands" over the graph, learning new nodes' shortest paths from the source. The algorithm iteratively adds new nodes into the invariant, until it expands to the destination node. Dynamic programming: the problem is separated into sub-problems. There are no cyclic dependencies between sub-problems. One problem's result can be quickly calculated from its sub-problems' results (e.g. max, min). Querying a hash map can skip data because hash(a) ≠ hash(b) implies a ≠ b. Querying an ordered search tree can skip data because (a < b) ∧ (b < c) implies a < c. Parallelization often utilizes associativity: a * (b * c) = (a * b) * c. For example, a * (b * (c * d)) = (a * b) * (c * d), where a * b and c * d don't depend on each other and can be computed in parallel. Examples: sum, product, max, min, max-by, min-by, list concat, set union, function composition, logical and, logical or. (Associativity with an identity element is a monoid.) User names must be unique. A bank account balance should never be negative: no over-spend. A product inventory count should never be negative: no over-sell. One hotel room cannot be booked twice with overlapping times.
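The parallelization-by-associativity point can be sketched concretely (a hypothetical sketch; the helper name is made up): since (a * b) * (c * d) equals a * (b * (c * d)) for an associative operation, the two halves of a sequence can be reduced independently and combined afterwards.

```python
# A sketch of parallelization via associativity: reduce each half of the
# input on its own worker, then combine the two partial results. This is
# only correct because the operation is associative.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def parallel_reduce(op, xs):
    mid = len(xs) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(reduce, op, xs[:mid])    # reduces xs[:mid]
        right = pool.submit(reduce, op, xs[mid:])   # reduces xs[mid:]
        return op(left.result(), right.result())    # combine partials

xs = list(range(1, 101))
assert parallel_reduce(operator.add, xs) == sum(xs)   # sum is associative
assert parallel_reduce(max, xs) == 100                # max is associative
```

The same split-and-combine shape underlies MapReduce-style systems; a non-associative operation (e.g. subtraction) would give a wrong answer here.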
The product should be shipped after the order is paid. No lost or duplicated notifications. The user can't view or change information that's outside their permission. A user cannot use a functionality if their subscription has ended. The redundant data, derived data and acceleration data structures (indexes, caches) should stay consistent with the base data (the source of truth). The client-side data should be consistent with the server-side data. Memory safety invariants: pointers should point to valid data; no use-after-free; free only once; unused memory should be freed; etc. Thread safety invariants: many operations involve 3 stages: read-compute-write. They malfunction when another thread mutates the data between the read and the write, because the earlier read result is no longer valid. Some operations involve more stages (many reads and writes); if the partially-modified state is exposed, the invariant is also violated. Some non-thread-safe data structures should not be shared between threads. Some non-thread-safe data structures must be accessed under a lock. The modification of some action should be cancelled after a subsequent action fails (ad-hoc transaction, application-managed rollback). Immediate invariant maintenance. Delayed invariant maintenance (tolerates stale data: caches, batched processing). The database/framework/OS/language etc. is responsible for maintaining the invariant. For example, a database maintains the validity of indexes and materialized views. If they don't have bugs, the invariant won't be violated. The application code is responsible for maintaining the invariant. This is more error-prone. Make the derived data always compute-on-demand. There is no longer a need to manually maintain the invariant, but it may cost performance. Caching immutable computation results can make it faster, while still maintaining compute-on-demand semantics. Store the derived data as mutable state and manually keep it consistent with the source of truth. This is the most error-prone solution.
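The compute-on-demand-with-caching option can be sketched as follows (a hypothetical sketch; all names are invented): the derived value is never stored as independently mutable state, so it cannot drift from the source of truth, while a version-keyed cache keeps it fast.

```python
# A sketch of derived data as compute-on-demand: the derived value always
# comes from the source of truth, so the consistency invariant cannot be
# violated. A cache keyed by an explicit version number keeps it fast
# while preserving compute-on-demand semantics.

class Accounts:
    def __init__(self):
        self.balances = {}       # source of truth
        self._version = 0        # bumped on every mutation
        self._cache = None       # (version, derived value)

    def deposit(self, user, amount):
        self.balances[user] = self.balances.get(user, 0) + amount
        self._version += 1       # any mutation makes the cache stale

    def total(self):
        """Derived data: recomputed from the source only when stale."""
        if self._cache is None or self._cache[0] != self._version:
            self._cache = (self._version, sum(self.balances.values()))
        return self._cache[1]

a = Accounts()
a.deposit("alice", 100)
a.deposit("bob", 50)
assert a.total() == 150
a.deposit("alice", -30)
assert a.total() == 120          # stale cache detected, recomputed
```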
All modifications to the base data should "notify" the derived data to update accordingly. Sometimes notifying means calling a function; sometimes it involves networking. A more complex case: the derived data needs to be modified in a way that reflects back to the base data. This violates single source of truth and is even more error-prone. Even more complex: the client side needs to reduce visible latency by predicting server-side data, and wrong predictions need to be corrected by server-side data. This not only violates single source of truth but also often requires rollback and replay mechanisms. Relying on other tools (database/framework/OS/language etc.) to maintain the invariant, as previously mentioned. Most users use few common features of a software. Most complexity (and bugs) comes from very few feature requirements. Most CPU time is spent executing a small amount of hot code. Most data accesses target a small amount of hot data (caching requires this to be effective). Most branches are biased toward one side during execution (branch prediction requires this to be effective). Most developers use few languages, libraries and frameworks. (The Matthew effect of ecosystems.) Most code and development effort is spent fixing edge cases; little code and development effort is spent on main-case handling. 7 Most bugs that users see are caused by a few easy-to-trigger bugs. Only a small portion of the transistors in hardware is used most of the time; many transistors are rarely used. (Many transistors are for rarely-used instructions. Hardware defects related to them have a higher probability of evading tests. See also: Silent Data Corruptions at Scale , Cores that don’t count ) Branch prediction assumes that it will execute the high-probability branch. If it predicts wrongly, speculative execution rolls back. A cache assumes that accesses will hit hot data; if an access falls outside the hot data, the cache is not hit. Optimistic concurrency control assumes there will be no concurrency conflict. If there is a conflict, it rolls back and retries.
It requires less waiting and communication than pessimistic concurrency control (locking), unless there is heavy contention. Computation-data duality: Factory pattern. Turn object creation code into a factory object. Prototype pattern. Turn object creation code into copying a prototype. Chain of Responsibility pattern. Multiple processor objects process command objects in a pipeline. Command pattern. Turn a command (action) into an object. Interpreter pattern. Interpret data as computation. Iterator pattern / generator. Turn iteration code into a state machine. Strategy pattern. Turn a strategy into an object. Observer pattern. Turn event-handling code into an observer. Mutation-data duality: Command pattern. It also involves computation-data duality; the command can represent a mutation, an action and a computation at once. Partial computation and multi-stage computation: Builder pattern. Turn the process of creating an object into multiple stages; each stage builds a part. Chain of Responsibility pattern. The processing is separated into a pipeline. It's also in computation-data duality. Generalized view: Adapter pattern. View one interface as another interface. Bridge pattern. View different underlying implementations as one interface. Composite pattern. View multiple objects as one object. Decorator pattern. Wrap an object, changing its behavior, while viewing it as the same object. Facade pattern. View multiple complex interfaces as one simple interface. Proxy pattern. A proxy object provides a view into other things. Template method pattern. View different implementations as the same interface methods. Flyweight pattern. Save memory by sharing common data; view shared data as owned data. Invariant production, growth and maintenance: there is no GoF pattern that tightly corresponds to it. Visitor pattern. Can be replaced by pattern matching. State pattern. Make state a polymorphic object. Memento pattern. Back up the state to allow rollback.
It exists mainly because in OOP data is tied to behavior. The separate memento is just data, decoupled from behavior. The memento pattern doesn't involve turning mutation into data, so it's not mutation-data duality. Singleton pattern. It's similar to a global variable, but can be late-initialized, can be polymorphic, etc. Mediator pattern. One abstraction to centrally manage other abstractions. It's possible to use separate threads for suspendable computation. However, OS threads are expensive and context switches are expensive. A manually-implemented state machine is faster. ↩ It's possible to use polling to fully avoid system calls after the initial setup, but with costs. ↩ There are optimizations. To avoid storing a unique ID for each character, it can store many immutable text blocks, and use as character ID. Consecutive insertions and deletions can be merged. The tree keeps growing, and needs to be merged. The exact implementation is complex. ↩ Another way of doing collaborative text editing is Operational Transformation . It uses as cursor. The server can transform a cursor to the latest version of the document: if there are insertions before , the increments accordingly. If there are deletions, reduces accordingly. This is also called index rebasing. ↩ Also: in hard disks, the magnetic field is viewed as bits. In CDs, the pits and lands are viewed as bits. In SSDs, the electron's position in a floating gate is viewed as bits. In fiber optics, light pulses are viewed as bits. In a quantum computer, the quantum state (like the spin of an electron) can be viewed as bits. ...... ↩ Specifically, bytes are viewed as code units, code units are viewed as code points, code points are viewed as strings. Code points can also be viewed as grapheme clusters. ↩ For beginners, a common misconception is that "if the software shows things on screen, then it's 90% done". In reality, a proof-of-concept is often just 20% done. There are so many corner cases in real usage. Not handling one corner case is a bug.
Most code handles corner cases, not common cases. Although each specific corner case has only a small probability of being triggered, triggering any one of the many corner cases is highly probable. ↩
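Circling back to the event-sourcing idea from the body of the post, a minimal sketch (in Python; event shapes and names are invented): state is never mutated directly; events are appended to a log, and the latest state is derived by replaying the log. "Rollback" is just replaying a prefix.

```python
# A hypothetical event-sourcing sketch: the log of mutations is the source
# of truth; the current state is a view derived by folding over the log.

def apply_event(state, event):
    kind, key, value = event
    new = dict(state)            # derive a new state; never mutate in place
    if kind == "set":
        new[key] = value
    elif kind == "delete":
        new.pop(key, None)
    return new

def replay(log):
    state = {}
    for event in log:
        state = apply_event(state, event)
    return state

log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]
assert replay(log) == {"y": 2}
assert replay(log[:2]) == {"x": 1, "y": 2}   # rollback = replay a prefix
```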

Abhinav Sarkar 3 months ago

A Fast Bytecode VM for Arithmetic: The Compiler

In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. We explore the following topics: In this post, we write the compiler for our AST to bytecode, and a decompiler for the bytecode. This post was originally published on abhinavsarkar.net . This is the second post in a series of posts: AST interpreters are well known to be slow because of how AST nodes are represented in the computer’s memory. The AST nodes contain pointers to other nodes, which may be anywhere in the memory. So while interpreting an AST, the interpreter jumps all over the memory, causing a slowdown. One solution to this is to convert the AST into a more compact and optimized representation known as Bytecode. Bytecode is a flattened and compact representation of a program, usually manifested as a byte array. Bytecode is essentially an Instruction Set (IS), but custom-made to be executed by a Virtual Machine (VM), instead of a physical machine. Each bytecode instruction is one byte in size (that’s where it gets its name from). A bytecode and its VM are created in synergy so that the execution is as efficient as possible 1 . Compiling source code to bytecode and executing it in a VM also allows the program to be run on all platforms that the VM supports, without the developer caring much about portability concerns. The most popular combo of bytecode and VM is probably the Java bytecode and the Java virtual machine. VMs can be stack-based or register-based. In a stack-based VM, all values created during the execution of a program are stored only in a stack data structure residing in memory. In a register-based VM, by contrast, there is also an additional fixed number of registers that are used to store values in preference to the stack 2 . Register-based VMs are usually faster, but stack-based VMs are usually simpler to implement. For our purposes, we choose to implement a stack-based VM.
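The stack discipline just described can be sketched in a few lines. This is a hypothetical mini-VM in Python (not the post's Haskell implementation), with invented opcodes: every instruction pops its operands from the stack and pushes its result back.

```python
# A minimal sketch of a stack-based VM, with made-up opcodes: all
# intermediate values live on a single stack; instructions pop their
# operands and push their result.
PUSH, ADD, MUL = 0, 1, 2

def run(code):
    stack, ip = [], 0
    while ip < len(code):
        op = code[ip]
        if op == PUSH:                    # next byte is the literal operand
            stack.append(code[ip + 1])
            ip += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            ip += 1
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
            ip += 1
    return stack[-1]                      # one value remains: the result

# (1 + 2) * 3
assert run([PUSH, 1, PUSH, 2, ADD, PUSH, 3, MUL]) == 9
```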
We are going to write a compiler that compiles our expression AST to bytecode. But first, let’s design the bytecode for our stack-based VM. Here is our expression AST as a reminder: Let’s figure out the right bytecode for each case. First, we create Opcodes for each bytecode, which are sort of mnemonics for actual bytecode. Think of them as what assembly is to machine code. For a number literal, we need to put it directly in the bytecode so that we can use it later during the execution. We also need an opcode to push it on the stack. Let’s call it with an parameter. Binary operations recursively use for their operands. To evaluate a binary operation, we need its operands to be evaluated beforehand, so we compile them first to bytecode. After that, all we need is an opcode per operator. Let’s call them , , , and for , , , and operators respectively. Variables and expressions are more complex 3 . In the AST interpreter we chucked the variables in a map, but we cannot do that in a VM. There is no environment map in a VM, and all values must reside on the stack. How do we have variables at all then? Let’s think for a bit. Each expression, after being evaluated in the VM, must push exactly one value on the stack: its result. expressions are a trivial case. When a binary operation is evaluated, first its left operand is evaluated. That pushes one value on the stack. Then its right operand is evaluated, and that pushes another value on the stack. Finally, the operation pops the two values from the top of the stack, does its thing, and pushes the resultant value back on the stack—again one value for the entire expression. A expression binds a variable’s value to its name, and then the variable can be referred to from the body of the expression. But how can we refer to a variable when the stack contains only values, not names? Let’s imagine that we are in the middle of evaluating a large expression, wherein we encounter a expression.
First we evaluate its assignment expression, and that pushes a value on the top of the stack. Let’s say that the stack has values at this point. After this we get to evaluate the body expression. At all times when we are doing that, the value from assignment stays at the same point in the stack because evaluating sub-expressions, no matter how complicated, only adds new values to the stack, without popping an existing value from before. Therefore, we can use the stack index of the assignment value ( ) to refer to it from within the body expression. So, we encode as an opcode and an integer index into the stack. We choose to use a to index the stack, limiting us to a stack depth of 256. We encode the variable references with an opcode , which when executed gets the value from the stack at the given index and pushes it on the stack. For a expression, after we compile its assignment and body expressions, we need to make sure that the exactly-one-value invariant holds. Evaluating the assignment and body pushes two values on the stack, but we can have only one! So we overwrite the assignment value with the body value, and pop the stack to remove the body value. We invent a new opcode to do this, called so because its effect is equivalent to swapping the topmost two values on the stack, and then popping the new top value 4 . Putting all the opcodes together, we have the ADT: Notice that we also assigned bytecodes—that is, a unique byte value—to each above, which are just their ordinals. Now we are ready to write the compiler. The compiler takes an expression with the bytecode size, and compiles it to a strict of that size. Recall that in the previous post , we wrote our parser such that the bytecode size for each AST node was calculated while parsing it. This allows us to pre-allocate a bytestring of required size before compiling the AST . We compile to actual bytes here, and don’t use the opcodes. 
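The stack-indexing scheme for let-bindings described above can be sketched as follows. This is a hypothetical Python illustration (not the post's Haskell compiler), with invented opcode names: the compile-time environment maps each variable name to the stack index where its value will live at run time.

```python
# A sketch of compiling let-bindings to stack indices: the environment maps
# names to stack slots at compile time, so no names exist at run time.
# Opcode names and numbering here are invented.
PUSH, GET, SWAP_POP, ADD = 0, 1, 2, 3

def compile_expr(expr, env, depth):
    """env: name -> stack index; depth: stack depth before this expression."""
    kind = expr[0]
    if kind == "num":                     # ("num", 7)
        return [PUSH, expr[1]]
    if kind == "var":                     # ("var", "x"): load by stack index
        return [GET, env[expr[1]]]
    if kind == "add":                     # ("add", lhs, rhs)
        left = compile_expr(expr[1], env, depth)
        right = compile_expr(expr[2], env, depth + 1)  # lhs value is on stack
        return left + right + [ADD]
    if kind == "let":                     # ("let", name, assign, body)
        assign = compile_expr(expr[2], env, depth)
        # the assignment's value sits at index `depth` while the body runs
        body = compile_expr(expr[3], {**env, expr[1]: depth}, depth + 1)
        return assign + body + [SWAP_POP] # keep body value, drop the binding
    raise ValueError(kind)

# let x = 2 in x + 3
code = compile_expr(("let", "x", ("num", 2),
                     ("add", ("var", "x"), ("num", 3))), {}, 0)
assert code == [PUSH, 2, GET, 0, PUSH, 3, ADD, SWAP_POP]
```

The key point matches the post's argument: evaluating sub-expressions only pushes values above the binding, so the binding's index stays valid throughout the body.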
We use the function from the module that allocates enough memory for the provided bytecode size, and gives us a pointer to the allocated memory. We call this pointer the frame pointer. Then we traverse the AST recursively, writing bytes for opcodes and arguments for each case. We use pointer arithmetic and the function to write the bytes. numbers are encoded as two bytes in little-endian fashion. In the recursive traversal function, we pass and return the current stack pointer and instruction pointer. We update these correctly for each case 5 . We also take care of checking that the pointers stay within the right bounds, failing which we throw appropriate errors. We also pass an parameter that is similar to the variable-names-to-values environment we use in the AST interpreter, but this one tracks variable names to the stack indices at which they reside. We update this information before compiling the body of a expression to capture the stack index of its assignment value. When compiling a expression, we use the map to look up the variable’s stack index, and encode it in the bytecode. At the end of compilation, we check that the entire bytestring is filled with bytes till the very end, failing which we throw an error. This check is required because otherwise the bytestring may have garbage bytes, and may fail inexplicably during execution. All the errors are thrown in the monad using the function, and are caught after compilation using the function. The final result or error is returned wrapped into . Let’s see it in action: You can verify that the resultant bytes are indeed correct. I assume that it is difficult for you to read raw bytes. We’ll fix this in a minute. Meanwhile, let’s ponder upon some performance characteristics of our compiler. You may be wondering why I chose to write the compiler in this somewhat convoluted way of pre-allocating a bytestring and using pointers. The answer is: performance. I didn’t actually start with pointers.
I iterated through many different data and control structures to find the fastest one. The table below shows the compilation times for a benchmark expression file when using different data structures to implement the function: I started with the bread-and-butter data structure of Haskellers, the humble and known to be slow , which was indeed quite slow. Next, I moved on to and thereafter , which are known to be faster at concatenation/consing. Then I abandoned the use of intermediate data structures completely, choosing to use a bytestring to create the bytestring. Finally, I had the epiphany that the bytestring size was known at compile time, and rewrote the function to pre-allocate the bytestring, thereby reaching the fastest solution. I also tried using , which has more-or-less the same performance as bytestring, but it is inconvenient to use because there are no functions to do IO with bytearrays. So I’d anyway need to use bytestrings for reading from STDIN or writing to STDOUT, and converting to-and-fro between bytearray and bytestring is a performance killer. Thus, I decided to stick to bytestrings. The pre-allocated bytestring approach is 80 times faster than using lists, and almost 10 times faster than using . For such gain, I’m okay with the complications it brings to the code. Here are the numbers in a chart (smaller is better): The other important data structure used here is the map (or dictionary) in which we add the mappings from identifiers to their stack indices. This data structure needs to be performant because we do a lookup for each variable we encounter while compiling. I benchmarked compilation for some data structures: Strict hashmap turns out to be the fastest one, but interestingly, linked list is a close second. Mutable hashtable is the slowest even though I expected it to be the fastest. Here are the times in a chart (smaller is better): Another choice I had to make was how to write the function.
I ended up passing and returning pointers and the environment map, and throwing errors in , but a number of solutions are possible. I tried out some of them, and noted the compilation times for the benchmark expression file: I tried putting the pointer in s and state instead of passing them back-and-forth. I tried putting the environment in a config. I tried using for throwing errors instead of using IO errors. Then I tried various combinations of these monad transformers. Finally, I also tried converting the function to be tail-recursive by using Continuation-passing style (CPS), and then defunctionalizing the continuations, as well as using the monad transformer. All of these approaches resulted in slower code. The times are interesting to compare (smaller is better): There is no reason to use s here because they result in slower and uglier code. Using one monad transformer at a time results in slight slowdowns, which may be worth the improvement in the code. But using more than one of them degrades performance by a lot. Also, there is no improvement caused by CPS conversion, because GHC is smart enough to optimize the non-tail-recursive code to be faster than the handwritten tail-recursive one that allocates a lot of closures (or objects in the case of defunctionalization). Moving on … It is a hassle to read raw bytes in the compiler output. Let’s write a decompiler to aid us in debugging and testing the compiler. First, a disassembler that converts bytes to opcodes: A disassembled program is a sequence of opcodes. We simply go over each byte of the bytecode, and append the right opcode for it to the program, along with any parameters it may have. Note that we do not verify that the disassembled program is correct. Here are the helpers that read instruction bytes and their arguments from a bytestring: Next, we decompile the opcodes to an expression: Decompilation is the opposite of compilation.
While compiling there is an implicit stack of expressions that are yet to be compiled. We make that stack explicit here, capturing expressions as they are decompiled from opcodes. For compound expressions, we inspect the stack and use the already decompiled expressions as the operands of the expression being decompiled. This way we build up larger expressions from smaller ones, culminating in the single top-level expression at the end 7 . Finally, we check the stack to make sure that there is only one expression left in it. Note that like the disassembler, we do not verify that the decompiled expression is correct. There is one tricky thing in decompilation: we lose the names of the variables when compiling, and are left with only stack indices. So while decompiling, we generate variable names from their stack indices by indexing a list of unique names. Let’s see it in action: That’s all for compilation and decompilation. Now, we use them together to make sure that everything works. We write some unit tests for the compiler, targeting both success and failure cases: In each test, we parse and compile an expression, and then disassemble the compiled bytes, which we match with the expected list of opcodes, or an error message. Let’s put these tests with the parser tests, and run them: Awesome, it works! That’s it for this post. Let’s update our checklist: In the next part, we write a virtual machine that runs our compiled bytecode, and do some benchmarking. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! There are VMs that execute hardware ISs instead of bytecode. Such VMs are also called emulators because they emulate actual CPU hardware. Some examples are QEMU and video game console emulators . ↩︎ VMs use virtual registers instead of actual CPU registers, which are often represented as a fixed-size array of 1-, 2-, 4- or 8-byte elements.
↩︎ I call them variables here but they do not actually vary. A better name is let bindings. ↩︎ We could have used two separate opcodes here: and . That would result in the same final result when evaluating an expression, but we’d have to execute two instructions instead of one for expressions. Using a single instruction speeds up execution, not only because we reduce the number of instructions, but also because we don’t need to do a full swap; only a half swap is enough because we pop the stack anyway after the swap. This also shows how we can improve the performance of our VMs by inventing specific opcodes for particular operations. ↩︎ Notice the use of strict s here, for performance reasons. ↩︎ Used as an association list. ↩︎ The decompiler is a bottom-up shift-reduce parser from the opcodes to the expression tree. ↩︎ If you liked this post, please leave a comment.
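As an appendix to the disassembler discussion above: the byte-walking loop it describes can be sketched in Python (not the post's Haskell code; the opcode table and numbering here are invented), emitting one mnemonic per instruction and consuming inline argument bytes where the opcode has them.

```python
# A sketch of a disassembler: walk the bytecode byte by byte, emitting an
# opcode mnemonic for each instruction and reading its inline arguments.
# Like the post's disassembler, it does not verify that the program is valid.
OPCODES = {0: ("PUSH", 1), 1: ("ADD", 0), 2: ("MUL", 0), 3: ("GET", 1)}
#            name, number of argument bytes following the opcode byte

def disassemble(code):
    program, ip = [], 0
    while ip < len(code):
        name, argc = OPCODES[code[ip]]
        args = list(code[ip + 1 : ip + 1 + argc])
        program.append((name, *args))
        ip += 1 + argc                    # skip opcode plus its arguments
    return program

assert disassemble([0, 7, 0, 5, 1]) == [("PUSH", 7), ("PUSH", 5), ("ADD",)]
```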

Abhinav Sarkar 4 months ago

A Fast Bytecode VM for Arithmetic: The Parser

In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. We explore the following topics: In this post, we write a parser that turns our expression language into an AST, and an AST interpreter. This post was originally published on abhinavsarkar.net . This is the first post in a series of posts: The language that we are going to work with is that of basic arithmetic expressions, with integer values, and addition, subtraction, multiplication and integer division operations. However, our expression language has a small twist: it is possible to introduce a variable using a let binding and use the variable in the expressions in the body of the let 1 . Furthermore, we use the same syntax for let as Haskell does. Here are some examples of valid expressions in our language: The only gotcha here is that the body of a let expression extends as far as possible while accounting for nested lets. It becomes clear when we look at parsed expressions later. The eventual product is a command-line tool that can run different commands. Let’s start with a demo of the tool: We can parse an expression, or compile it to bytecode. We can also disassemble bytecode to opcodes, or decompile it back to an expression. We can interpret an expression either as an AST or as bytecode. We can also run a bytecode file directly. Finally, we have a handy command to generate random expressions for testing/benchmarking purposes 2 . Let’s start. Since this is Haskell, we start by listing many language extensions and imports: We use an extension here that enables a lot of useful GHC extensions by default. We are using the bytestring and attoparsec libraries for parsing, strict , containers and unordered-containers for compilation, deepseq , mtl and primitive for interpreting, and QuickCheck for testing. The first step is to parse an expression into an Abstract Syntax Tree (AST).
We represent the AST as Haskell Algebraic Data Types (ADTs): We add instances for the ADTs so that we can pretty-print the parsed AST 3 . Now, we can start parsing. The EBNF grammar for expressions is as follows: The , , , and productions take care of having the right precedence of arithmetic operations. The and productions are trivial. Our language is fairly oblivious of whitespace; we allow zero-or-more spaces at most places. The expression grammar is pretty standard, except we require one-or-more spaces after the let and in keywords to make them unambiguous. We use the parser combinator library attoparsec for creating the parser. attoparsec works directly with bytestrings, so we don’t incur the cost of decoding Unicode characters 4 5 . We write the parser in a top-down recursive-descent fashion, same as the grammar, starting with the parser: One small complication: our parsers not only return the parsed expressions, but also the number of bytes they occupy when compiled to bytecode. We gather this information while building the AST in parts, and propagate it upward in the tree. We use the bytecode size later in the compilation pass 6 . Both and chain the higher-precedence parsers with the right operators between them 7 using the combinator. uses lookahead to dispatch between one of the primary parsers, which is faster than using backtracking . simply skips the parentheses, and recursively calls . uses the and parsers from the attoparsec library to parse an optionally signed integer. We restrict the numbers to 2-byte integers (-32768–32767 inclusive) 8 . The helper from attoparsec names parsers so that the error messages shown in case of failures point to the right parser. and are straightforward. We restrict identifiers to upper-and-lowercase ASCII alphabetic letters. We also check that our reserved keywords ( let and in ) are not used as identifiers.
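As a rough illustration of the AST and the parsers described above, here is a self-contained attoparsec sketch. The constructor and parser names are made up for this example; the post's actual definitions also track bytecode sizes, enforce one-or-more spaces after keywords, and reject keywords as identifiers, all of which this sketch elides (attoparsec's automatic backtracking papers over the keyword ambiguity here).

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hypothetical AST and recursive-descent parsers for the expression
-- language; names and shapes are illustrative, not the post's own.
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as BS

data Expr
  = Num Int
  | Var String
  | BinOp Char Expr Expr   -- '+', '-', '*', '/'
  | Let String Expr Expr   -- let x = e1 in e2
  deriving (Show, Eq)

exprP :: Parser Expr
exprP = skipSpace *> (letP <|> addP)

-- Left-associative chains of +/- over higher-precedence terms.
addP :: Parser Expr
addP = chain (inClass "+-") mulP

-- Left-associative chains of */ over atoms.
mulP :: Parser Expr
mulP = chain (inClass "*/") atomP

chain :: (Char -> Bool) -> Parser Expr -> Parser Expr
chain isOp sub = do
  first <- sub
  rest  <- many' ((,) <$> (skipSpace *> satisfy isOp <* skipSpace) <*> sub)
  pure (foldl (\l (op, r) -> BinOp op l r) first rest)

atomP :: Parser Expr
atomP = (Num <$> signed decimal)
    <|> (Var . BS.unpack <$> takeWhile1 isAlpha_ascii)
    <|> (char '(' *> exprP <* skipSpace <* char ')')

letP :: Parser Expr
letP = do
  _     <- string "let" <* skipSpace
  name  <- BS.unpack <$> takeWhile1 isAlpha_ascii
  _     <- skipSpace *> char '=' *> skipSpace
  bound <- addP
  _     <- skipSpace *> string "in" <* skipSpace
  Let name bound <$> exprP
```

Note how the body of a `let` extends as far as possible: `exprP` is called recursively for the body, so nested `let`s nest to the right.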
Finally, we write the parser for let expressions: In we use to parse the variable name, and recursively call to parse the assignment and body expressions, while making sure to correctly parse the spaces. The helper parser is used to parse known string tokens ( , and ), and provide good error messages in case of failures. Speaking of error messages, let’s figure out an error handling strategy. We use an type wrapped in to propagate the errors in our program: The type also captures the in which the error is thrown. is a type alias that represents either an error or a result. Finally, we put all the parsers together to write the function. Our function uses the function from attoparsec to run the over an input. The function deals with the intricacies of how attoparsec returns the parsing result. Basically, we inspect the returned result and throw appropriate errors with useful error messages. We use from the typeclass, which works with all its instances, of which is one. Finally, we throw away the bytecode size from the result of in the function. The parser is done. But as good programmers, we must make sure that it works correctly. Let’s write some unit tests. We use the hspec library to write unit tests for our program. Each test is written as a spec 9 . We have a bunch of tests for the parser, testing both success and failure cases. Notice how spaces are treated in the expressions. Also notice how the let expressions are parsed. We’ll add property-based tests for the parser in the next post. There is not much we can do with the parsed ASTs at this point. Let’s write an interpreter to evaluate our ASTs. The AST interpreter is a standard and short recursive interpreter with an environment mapping variables to their values: This interpreter serves both as a performance baseline for the bytecode VM we write later, and as a definitional interpreter for testing the VM 10 .
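A condensed sketch of such a recursive AST interpreter, under the assumption of a simple `Expr` type and a `Map`-based environment (the post's actual types differ):

```haskell
-- Hypothetical recursive AST interpreter with an environment mapping
-- variables to their values.  Errors are reported via Either.
import qualified Data.Map.Strict as Map

data Expr
  = Num Int
  | Var String
  | Add Expr Expr
  | Div Expr Expr
  | Let String Expr Expr
  deriving (Show, Eq)

type Env = Map.Map String Int

interpret :: Env -> Expr -> Either String Int
interpret _   (Num n) = Right n
interpret env (Var x) =
  maybe (Left ("unbound variable: " ++ x)) Right (Map.lookup x env)
interpret env (Add l r) = (+) <$> interpret env l <*> interpret env r
interpret env (Div l r) = do
  l' <- interpret env l
  r' <- interpret env r
  if r' == 0 then Left "division by zero" else Right (l' `div` r')
interpret env (Let x bound body) = do
  v <- interpret env bound
  -- Inserting under the same name shadows outer bindings, giving the
  -- lexical scoping described in the post.
  interpret (Map.insert x v env) body
```

Evaluating `let x = 5 in x + 1` with an empty environment yields `Right 6`, while any division by zero yields a `Left` error.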
We are careful in detecting division-by-zero and arithmetic overflow errors, but we ignore possible integer overflow/underflow errors that may be caused by the arithmetic operations. We write some unit tests for the interpreter following the same pattern as the parser: Now, we can run the parser and interpreter tests to make sure that everything works correctly. Awesome, it works! That’s it for this post. Let’s update our checklist: In the next part , we write a bytecode compiler for our expression AST . If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! Variables are scoped to the body of the let expressions they are introduced in, that is, our language has lexical scoping . Also, variables with the same name in inner lets shadow the variables in outer lets. ↩︎ If you are wondering why do this at all, when we can directly run the expressions while parsing: I think this is a great little project to learn how to write performant bytecode compilers and VMs in Haskell. ↩︎ Bangs ( ! ) that enforce strictness are placed in the ADT (and also in the later code) at the right positions that provide performance benefits. The right positions were found by profiling the program. A bang placed at a wrong position (for example in front of inside ) may ruin the compiler-provided optimizations and make the overall program slower. ↩︎ attoparsec is very fast, but there are faster parsing libraries in Haskell. On the other hand, attoparsec does not provide great error messages. If the user experience were a higher priority, I’d use the megaparsec library. I find attoparsec to have the right balance of performance, developer experience and user experience. Handwritten parsers from scratch could be faster, but they’d be harder to maintain and use. ↩︎ I wrote the first version of the parser using the library that comes with the Haskell standard library.
I rewrote it to use attoparsec and found that the rewritten parser was more than 10x faster. ↩︎ You don’t need to think about the bytecode size of expressions right now. It’ll become clear when we go over compilation in the next post. ↩︎ Certain functions such as are inlined using the INLINE pragma to improve the program performance. The functions to inline were chosen by profiling. ↩︎ Since the numbers need to be encoded into bytes when we compile to bytecode, we need to choose some encoding for them. For simpler code, we choose 2-byte integers. ↩︎ Testing your parsers is crucial because that’s your programming language’s interface to the users. Also because writing (fast) parsers is difficult and error-prone. Most of the bugs I found in this program were in the parser. ↩︎ Again, notice the carefully placed bangs to enforce strictness. Try to figure out why they are placed at some places and not at others. ↩︎ If you liked this post, please leave a comment .

(think) 4 months ago

Learning OCaml: Having Fun with the Fun Module

When I started to play with OCaml I was kind of surprised that there was no id (identity) function available out of the box (in the Stdlib module, which is auto-opened). A quick search led me to the Fun module, which is part of the standard library and is nested under Stdlib . It was introduced in OCaml 4.08, alongside other modules such as , and . 1 The module provides a few basic combinators for working with functions. Let’s go over them briefly: id — the identity function: returns its single argument unchanged. const — returns a function that always returns the first argument, ignoring its second argument. compose — composes two functions, applying the second function to the result of the first. Haskell and F# have special syntax for function composition, but that’s not the case in OCaml (although you can easily map this to some operator if you wish to do so). Also, compose was introduced a bit later than the other functions in the module, namely in OCaml 5.2. flip — reverses the order of arguments to a two-argument function. negate — negates a boolean-returning function, returning the opposite boolean value. Useful when you want to provide a pair of inverse predicates (e.g. and ). I believe that those functions are pretty self-explanatory, but still below we’ll go over a few examples of using them: Admittedly the examples are not great, but I hope they managed to convey how to use the various combinators. Those are definitely not the type of functions that you would use every day, but they can be useful in certain situations. Obviously I needed at some point to discover the module in the first place, and all of the functions there can be considered “classic” combinators in functional programming. In practice most often I need and , and infrequently and . Right now I’m struggling to come up with good use-cases for , but I’m sure those exist. Perhaps you’ll share some examples in the comments? How often do you use the various combinators? Which ones do you find most useful?
I find myself wondering if such fundamental functions shouldn’t have been part of the Stdlib module directly, but overall I really like the modular standard library approach that OCaml’s team has been working towards in the past several years. 2 The important thing at the end of the day is to know that these functions exist and that you can make use of them. Writing this short article will definitely help me to remember this. That’s all I have for you today. Keep hacking! It was part of some broader efforts to slim down and move in the direction of a more modular standard library.  ↩ And obviously you can open the module if you wish to at whatever level you desire.  ↩
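For comparison with the Haskell world the post alludes to, each of these combinators has a direct counterpart in Haskell's Prelude (this is a sketch of the correspondence, not OCaml code; the exact argument order of OCaml's composition function may differ from `(.)`):

```haskell
-- Rough correspondence between OCaml's Fun module and Haskell's Prelude:
--
--   Fun.id     ~ id
--   Fun.const  ~ const
--   composition ~ (.)    where (g . f) x = g (f x)
--   Fun.flip   ~ flip
--   Fun.negate ~ (not .)  -- negate a boolean-returning function
examples :: [Bool]
examples =
  [ id 42 == (42 :: Int)
  , const 1 2 == (1 :: Int)          -- second argument ignored
  , (negate . (+ 1)) 4 == (-5 :: Int) -- composition, right to left
  , flip (-) 1 10 == (9 :: Int)       -- arguments reversed: 10 - 1
  , (not . even) 3                    -- inverse predicate of even
  ]
```

In Haskell these live directly in the Prelude rather than a submodule, which is essentially the design question the post raises about Stdlib.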


Crafting a dependent typechecker, part 1

This series is intended to give an interested layperson some idea of what goes on behind-the-scenes in a dependent typechecker, and how they might implement one of their own. I personally hold the belief that dependent languages (or similar) will eventually be the future of programming, so knowing this information seems worthwhile to me. One may also simply be interested for the sake of it, or may be interested in implementing their own dependent language. This series does assume a small base amount of knowledge around dependent programming languages. There are many tutorials out there far better than I could provide, so I will refer the reader to the wider internet if they are curious. (Return here afterwards, though!). Additionally, the code snippets will be in OCaml, although a Haskell or Rust version may be made available at a later date. The technique we will implement in this series can most succinctly be described as "Bidirectional typechecking, utilizing Normalization by Evaluation". I do not expect the reader to understand that sentence for at least a little while ;). All code snippets will be collated at the end of each page, so scroll all the way down if you're only interested in that. The following are resources I highly recommend for exploring this topic on your own. If you have implemented a typechecker before, you may have come to the realization that deciding when two things are the same is a very large component of typechecking. Whether it be simply checking that two arguments to are both integers, or larger problems with polymorphism, the arguable core of typechecking is deciding equality. In a dependent language, our types can (and often do) contain expressions that must be evaluated. Therefore, in order to explore how we check equality in the presence of evaluation, we will first start by considering how we do this for a very simple language - integer arithmetic. Take two expressions: How might we decide whether these are equal? 
The obvious solution is to evaluate them. Indeed (sparing the reader the derivation), we can see they both evaluate to , and so they are equal. Let us now introduce a small extension: variables with a known value, using "let". We will refer to the variables themselves as "bound variables", as they are bound by the "let". In this case, they are additionally bound to a value. Terms containing only bound variables are referred to as "closed" terms - this alludes to, for example, closed and open systems in the real world. Getting back to evaluation, we could simply substitute the computed value: And then evaluate. (In this case, our result is 117.) However, what if the variable occurs multiple times? The following is a little bad, but not awful: But what if was the following? Duplicating that would clearly not be ideal! Perhaps we could compute in advance, and then substitute that result in. This is much better: (In fact, there is yet another optimization: what if we don't need to compute it at all, say because it's not used? We will explore this possibility later.) A simple OCaml interpretation of the terms of this language may be the following. (sidetrack: A "term" is some element of a language. Above, we have the language of (restricted) integer arithmetic, and our terms are expressions of it. Hence, we call our OCaml interpretation of this language .) We then must make an environment that associates, or "binds", our names to terms. is often called a "binder" because it functions as the source of this binding. We will make our context a list of pairs , and will reimplement the lookup function for example's sake, although a suitable function is available as List.assoc . Note: In this series, we will not be considering the possibility that a name is used without being previously defined in some way. This is an error, and can be handled via normal means, e.g. erroring. We now note something curious: Our lookup function need not return a fully computed value!
It could return , which we would have to further evaluate before we could do anything with. It would be possible of course to manually insert invariants ensuring that this is not the case, but we could still accidentally violate them. Here we hit one of the most important distinctions in our exploration of dependent typechecking. A "term", as previously seen, is any element of a chosen language. Above, this was (some subset of) integer arithmetic. We could also consider the language of OCaml programs, of which the above examples would be terms. However, two terms are not always easy to compare. We may need to evaluate one, or both, and we don't know whether that will be needed until we start comparing them. Instead, what if we defined another type that was guaranteed to be reduced to some suitable level? For this simple case, we could restrict this to contain only fully evaluated terms (i.e., integers). We prefix these constructors with to ensure they're distinct. Then, we could modify our environment and lookup function to only deal with values: and now we know that will always return something we can handle immediately, with no further processing. In the general case, we call this "suitably-reduced" constraint a "normal form". We can say that contains only elements of in this chosen normal form. Now we can finish our little evaluator. We do a slight hack to perform arithmetic inside values; if our value type contained more than just the constructor , we would have to consider how to handle that case. This is something we will cover soon. In the let case, we compute the definition first, then extend our environment with the new name and value before computing the rest of the expression. We then, by returning a , have a guarantee that whatever comes out of is as simple as we want it to be. Comparing for equality is then very simple: We use to evaluate our terms into values; then, because our values are in a nice normal form, comparing them is easy and guaranteed to be correct.
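The post's snippets are in OCaml; as a compact (and hypothetical) Haskell rendering of the term/value split and evaluator described above:

```haskell
-- Terms are arbitrary elements of the language; Values are guaranteed
-- to be in normal form (here, just integers).  The constructor names
-- are illustrative, not the post's own.
data Term
  = Lit Int
  | Add Term Term
  | Let String Term Term   -- let x = t1 in t2
  | Name String
  deriving (Show)

newtype Value = VInt Int   -- the only normal form in this tiny language
  deriving (Show, Eq)

-- The environment maps names to Values, so lookup always hands back
-- something fully computed.
type Env = [(String, Value)]

eval :: Env -> Term -> Value
eval _   (Lit n)   = VInt n
eval env (Add a b) =
  let VInt x = eval env a
      VInt y = eval env b
  in VInt (x + y)            -- the "arithmetic inside values" hack
eval env (Let x t body) =
  eval ((x, eval env t) : env) body   -- bind the value, not the term
eval env (Name x) = case lookup x env of
  Just v  -> v
  Nothing -> error ("unbound name: " ++ x)  -- the series defers error handling

-- Equality of closed terms: evaluate both, compare normal forms.
equalTerms :: Term -> Term -> Bool
equalTerms a b = eval [] a == eval [] b
```

Because `eval` always returns a `Value`, the comparison in `equalTerms` is a plain structural equality on normal forms, exactly the guarantee the text argues for.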
So far we have covered: languages and terms; simple evaluation strategies for let-bound variables and closed terms; and values and normal forms. In the next part, we will explore how to take these concepts and apply them to something much closer to a "real" programming language, with expressions and application, which will require the introduction of a "closure". From there, we can explore how to deal with open terms, containing free variables, and onwards to dependent types. András Kovács' "Elaboration Zoo", found here . It was a primary source of inspiration for this series. "A tutorial implementation of a dependently typed lambda calculus", by Andres Löh, Conor McBride, and Wouter Swierstra; found here . "Checking Dependent Types with Normalization by Evaluation: A Tutorial", by David Christiansen, found here , for the more lisp-minded.

seated.ro 4 months ago

Devlog 001

This is a general devlog covering what I’ve been up to over the past few months! Hakyll is a Haskell library for generating static websites (yes, you do not, in fact, need Next.js for your personal site). The previous iteration of my website was written with Go + templ, and it was not static. I wrote everything myself, and it was, quite frankly, a horror to maintain. The templ LSP actually hinders more than it helps, and since I did not statically generate the content, it was kind of slow for a personal website. I did not like it. I added some very minor CSS fixes and a theme toggle button for my friend’s new website ludwigabap.com, and I saw the light that is Hakyll. :kneel: So I nerd-sniped myself into rewriting my personal website in Hakyll so that it generates beautiful HTML at build time (it uses Tailwind, too). All the dynamic bits of the website, like my silly stats and coding time, are served from a single Rust Axum server binary on my Hetzner VM that serves other stuff too (Memegrep’s server, among other things). This segues into my next nerd-snipe, or rather, my favorite new pill: the Nix pill. My entire website is built with a single Nix flake, including the Hakyll build step, the Cargo build step, and generating the resulting Docker image. Although, to be honest, the learning curve is very steep and I simply do not have it in me to master this language anytime soon, I like the principles, and it seems to me the least bad build system out there. I do most of my work on a MacBook running a NixOS VM as well. (seatedro/dotnix on GitHub.) It’s so much nicer to have a real dev workflow versus whatever the fuck macOS is. Sorry, but Homebrew is not good software. The reason I started programming was that I was playing a game called Midtown Madness as a wee lad and decided that one day I would build a cool game like that. So, after 15 years, I’ve decided to embark on a small adventure to build a cool physics simulation (and eventually a voxel) engine. 
It’s written entirely in Zig because I enjoy writing Zig and did not want to touch C++ (although there were times I thought about committing this sin). When Ember gets far enough, I might write separate devlogs, but with what’s written so far, there isn’t enough to justify one. I was knee-deep in the mines because I wanted extremely specific features like multi-viewport docking with ImGui, and the off-the-shelf Zig libraries did not want to ship that for some reason. So there were many adventures in getting this shit to compile neatly. Ember now uses SDL3 for windowing and has an abstract rendering API (rudimentary, but the backends are in place) with SDLRenderer3, OpenGL, and WGPU-native backends that can be neatly switched at runtime. I’m using a bunch of cool resources to learn more about how to do this because I’ve never really written any game-dev related code before. I just want to build some cool simulations, and I will get there no matter what. github / seatedro / ember After 5 years, I have finally built a new personal computer. I will be running NixOS as my main operating system with Hyprland and Wayland. I will also have Windows installed on a separate drive mainly for video games, though I expect to use Linux more often. It’s so nice not having to worry about my 256 GB SSD getting filled up on my MacBook Air. :> Also, I joined Exa last month to work on the back-end team, and it has been so fun! I am soaking up as much knowledge from my significantly smarter peers at a good pace. As per custom, I revisited @ludwigABAP ’s post: On becoming competitive when joining a new company . (PS: it’s on a new website.) Back to the code cave I go—going to extract the polymorphic.
