A Fast Bytecode VM for Arithmetic: The Virtual Machine
In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. We explore the following topics:

- Parsing arithmetic expressions to Abstract Syntax Trees (ASTs).
- Unit testing for our parser.
- Interpreting ASTs.
- Compiling ASTs to bytecode.
- Disassembling and decompiling bytecode.
- Unit testing for our compiler.
- Property-based testing for our compiler.
- Efficiently executing bytecode in a virtual machine (VM).
- Unit testing and property-based testing for our VM.
- Benchmarking our code to see how the different passes perform.
- All the while keeping an eye on performance.

In this final post, we write the virtual machine that executes our bytecode, and benchmark it. This post was originally published on abhinavsarkar.net. This post is part of the series: A Fast Bytecode VM for Arithmetic.

Bytecode virtual machines (VMs) are known to be faster than AST-walking interpreters. That’s why many real-world programming languages these days are implemented with bytecode VMs, for example, Java, Python, PHP, and Raku. The reason is partly the flat and compact nature of bytecode itself, but VMs also have a few other tricks up their sleeves that make them highly performant. In this post, we write a VM for our arithmetic expression language, and explore some of these performance tricks.

But first, we need to finish a pending task. We wrote some unit tests for our compiler in the last post, but unit tests cover only the cases we can think of. A compiler has to deal with arbitrary input, and with unit tests alone we cannot be sure of its correctness. To test our compiler and other components for correctness, we use the QuickCheck library.

QuickCheck is a property-based testing framework. The key idea of property-based testing is to write properties of our code that hold true for any input, and then to automatically generate a large number of arbitrary inputs and check that the properties indeed hold for them 1 2 . Since we are writing an arithmetic expression parser/compiler/VM, we generate arbitrary expression ASTs, and use them to assert certain invariants of our program. With QuickCheck, we write generators for the inputs of our tests. These generators are composable, just like parser combinators. We use the library-provided generators to write small generators that we combine to create larger ones.
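As a primer, here is a tiny, self-contained illustration of how QuickCheck generators compose: small generators are combined into bigger ones, much like parser combinators. The names here (`genLetter`, `genIdent`) are hypothetical, not the post’s actual code:

```haskell
import Test.QuickCheck

-- Generate a single lowercase letter.
genLetter :: Gen Char
genLetter = elements ['a' .. 'z']

-- Combine it into a generator for short identifiers:
-- pick a length, then generate that many letters.
genIdent :: Gen String
genIdent = do
  len <- choose (1, 5)
  vectorOf len genLetter

main :: IO ()
main = do
  name <- generate genIdent
  -- every generated identifier respects the generator's constraints
  print (not (null name) && length name <= 5 && all (`elem` ['a' .. 'z']) name)
```

Running `sample genIdent` in GHCi would print a handful of such identifiers, which is a quick way to eyeball what a generator produces.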
Let’s start. First come the basic generators: one generates number expressions using QuickCheck’s built-in functions; one generates variable expressions by choosing from the set of valid variable names passed to it; and one generates valid identifiers from combinations of the letters a–z and A–Z, discarding ones that are reserved keywords.

Moving on to composite generators: the binary expression generator generates binary expressions with arbitrary binary operations, recursively calling the expression generator for the operands. A size parameter controls the complexity of the generated expressions, and we halve the size for the operands (and so on recursively) so that we don’t end up with infinitely large expressions. The binding generator generates expressions by generating an identifier, and then generating the assignment and body expressions recursively. We use the same size-halving trick here as well. Notice that the assignment is generated with the passed variable names in scope, whereas the body is generated with the new identifier added to the scope.

The top-level expression generator uses the above generators to generate all kinds of expressions. At smaller sizes, it prefers to generate base expressions; at larger sizes, it prefers composite ones. Thanks to the careful recursive halving of the size in the composite generators, we end up with expressions of finite size.

Finally, we have some instances of QuickCheck’s Arbitrary type class to tie everything together: We can sample them in GHCi: Notice that the generated samples increase in complexity.

With the generators in place, we define our properties next. Let’s test our parser first: This property is a simple round-trip test for the parser and printer: we parse the string representation of a generated expression, and assert that it gives back the same expression. The second property is a more involved round-trip test for the compiler and decompiler: It asserts that compiling an expression, then disassembling and decompiling it, and finally compiling it again results in the original bytecode 3 . This requires a helper function to get the size of an expression: We run these tests in a later section.

This ends our short detour. Now for the main event: the virtual machine. Our VM is a stack-based machine that operates on a stack of values and executes the compiled bytecode. Our goal is to be as fast as possible.
For a quick reminder, these are our opcodes: And now, the heart of the VM: The execution function is where the action happens. It is far more complex than the AST interpreter, but the complexity has a reason, namely performance. The function runs inside the ST monad wrapped with the ExceptT monad transformer. The ST monad lets us use mutable data structures locally while ensuring the function remains externally pure. The ExceptT monad transformer adds support for throwing and propagating errors in a pure manner. For our stack, we use a mutable array of unboxed primitive values. Using a mutable unboxed array is much faster than using an immutable and/or boxed one, due to reduced allocation and pointer chasing. The core of the VM is a tight, tail-recursive loop function that GHC compiles into an efficient machine loop, as we see later. It takes the stack pointer, the instruction pointer 4 , and the stack as arguments. At the top of each iteration, a block of guard clauses checks for stack overflow, underflow, and other error conditions before branching on the current opcode. Placing these checks at the top instead of inside the opcode cases is a deliberate choice. It may make the code slightly harder to understand, but it significantly improves the performance of the loop: all branching moves to the beginning of the loop, resulting in code that is friendlier to the CPU’s branch predictor. Also notice how we reduce the number of checks by handling a whole range of opcodes at once in a guard. The checks are also ordered for performance, guided by profiling and benchmarking 5 . The handling of each opcode is actually pretty straightforward. We use specialized operations to read from and write to the stack, taking care of the required bounds and arithmetic checks. We also use the functions that we wrote earlier. After carrying out each operation, we re-enter the loop by calling it tail-recursively with the updated stack and instruction pointers.
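To make the shape of such a loop concrete, here is a heavily simplified, self-contained sketch. The toy opcode set and all names here are assumptions for illustration, not the post’s real code; the real VM also layers error handling over ST and performs its bounds checks in guard clauses at the top of the loop:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST, runST)
import Data.Array (Array, bounds, listArray, (!))
import Data.Array.ST (STUArray, newArray, readArray, writeArray)

-- A toy opcode set, just for illustration.
data Op = Push Int | Add | Mul

run :: [Op] -> Int
run ops = runST $ do
  -- The stack is a mutable unboxed array, avoiding boxing
  -- and allocation on every push/pop.
  stack <- newArray (0, 255) 0 :: ST s (STUArray s Int Int)
  let code = listArray (0, length ops - 1) ops :: Array Int Op
      -- go is the tight tail-recursive loop; sp and ip are the
      -- stack and instruction "pointers" (really indices).
      go !sp !ip
        | ip > snd (bounds code) = readArray stack (sp - 1)  -- done: result on top
        | otherwise = case code ! ip of
            Push n -> writeArray stack sp n >> go (sp + 1) (ip + 1)
            Add    -> binOp (+) sp ip
            Mul    -> binOp (*) sp ip
      -- Pop two operands, apply the operation, push the result.
      binOp f !sp !ip = do
        b <- readArray stack (sp - 1)
        a <- readArray stack (sp - 2)
        writeArray stack (sp - 2) (f a b)
        go (sp - 1) (ip + 1)
  go 0 0

main :: IO ()
main = print (run [Push 2, Push 3, Add, Push 4, Mul])  -- (2 + 3) * 4 = 20
```

The bang patterns on `sp` and `ip` encourage GHC to keep them as unboxed machine integers in the compiled loop, which matters for a function called once per instruction.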
Finally, we make sure that the execution terminated correctly by checking the state of the stack, and return its first element. We see later that the VM is quite fast, but how does GHC achieve this performance? To see the magic, we can look at GHC’s intermediate language: Core. Core is a simpler functional language than Haskell, to which GHC compiles Haskell. The simpler nature of Core makes it easier for GHC to optimize it and compile it further. We can get the Core code for a program by compiling with the GHC option -ddump-simpl. The actual Core code for our VM is too verbose to show here, but here is a simplified C-like pseudo-code version of our loop: A few key optimizations are worth pointing out:

The loop: The tail-recursive loop function is compiled into a proper loop. The tail call is effectively a jump, which means there’s no function call overhead for each iteration of the VM loop.

Unboxing: The Core code is full of primitive, unboxed types and operations on them. These are raw machine integers and memory addresses, not boxed Haskell objects. This means operations on them are as fast as they would be in C. The stack operations are not polymorphic function calls on a typeclass instance, but primitive memory reads and writes on a raw memory address.

Inlining: The arithmetic helper function is completely inlined into the main loop. For addition, for example, the code for reading two values, adding them, and writing the result is laid out inline, and works on unboxed values and a raw array address.

In short, GHC has turned our high-level, declarative Haskell code into a low-level loop that looks remarkably like one we would write in C. We get the safety and expressiveness of Haskell, while GHC does the heavy lifting to produce highly optimized code. It’s the best of both worlds! We must test the VM to make sure it works correctly 6 .
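Before moving on to the actual tests, here is the testing idea in miniature: a toy, self-contained version of the differential property, with hypothetical names and a much smaller language than the post’s, checking that an AST interpreter and a stack machine always agree:

```haskell
import Test.QuickCheck

-- A tiny expression language standing in for the post's real one.
data Expr = Lit Int | Add Expr Expr
  deriving Show

data Op = Push Int | AddOp

-- The "definitional" AST interpreter.
interpret :: Expr -> Int
interpret (Lit n)   = n
interpret (Add a b) = interpret a + interpret b

-- A trivial compiler to stack-machine bytecode.
compile :: Expr -> [Op]
compile (Lit n)   = [Push n]
compile (Add a b) = compile a ++ compile b ++ [AddOp]

-- A list-based stack machine (the real VM uses a mutable array).
execute :: [Op] -> Int
execute = head . foldl step []
  where
    step st        (Push n) = n : st
    step (b:a:st)  AddOp    = (a + b) : st
    step _         AddOp    = error "stack underflow"

-- A sized generator that halves the size for subexpressions.
genExpr :: Int -> Gen Expr
genExpr n
  | n <= 1    = Lit <$> choose (-100, 100)
  | otherwise = oneof
      [ Lit <$> choose (-100, 100)
      , Add <$> genExpr (n `div` 2) <*> genExpr (n `div` 2)
      ]

-- The differential property: both interpreters must agree.
prop_agree :: Property
prop_agree = forAll (sized genExpr) $ \e ->
  interpret e === execute (compile e)

main :: IO ()
main = quickCheck prop_agree
```

If the compiler or the machine had a bug, QuickCheck would shrink the failing expression down to a minimal counterexample, which is exactly what makes this style of test so useful for a VM.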
We reuse the success and failure tests from the AST interpreter, as the bytecode interpreter should yield the same results: We also add a property-based test this time: for any given expression, interpreting the AST should produce the same result as compiling it to bytecode and executing it in the VM 7 . Our test suite is complete now: And finally, we run all tests together: Happily, all tests pass.

Now for the fun part: benchmarking. We use the criterion library to benchmark the code. We have a benchmark suite that measures the performance of each pass, the two interpreters (AST and bytecode), and the full end-to-end runs 8 . We compile with the following GHC options: Here are the results in a more digestible format: Here are the times in a chart (smaller is better): Let’s break down these numbers:

- Parsing and decompiling are slow: At ~573ms and ~506ms, these are by far the slowest passes. This isn’t surprising. Parsing with parser combinators has a known trade-off of expressiveness for performance. Decompiling is a shift-reduce parser that reconstructs an AST from a linear stream of opcodes, and we didn’t spend any time optimizing it.
- Compilation is fast: At ~51ms, compilation is an order of magnitude faster than parsing. This is thanks to pre-calculating the bytecode size during the parsing phase, which allows us to pre-allocate a single buffer and fill it in with low-level pointer operations.
- Bytecode interpretation is blazingly fast: At just ~16ms, our VM’s interpreter is over 3 times faster than the AST interpreter (~50ms), which confirms our belief that bytecode interpreters are faster.
- End-to-end runs: Interestingly, the total time to run via bytecode (~638ms) is slightly slower than the run via the AST (~617ms). This is because the cost of parsing, compiling, and then interpreting is higher than that of just parsing and interpreting. The real win for a bytecode VM comes when you compile once and run many times, amortizing the initial compilation cost.

I can already see readers thinking, “Sure, that’s fast, but is it faster than C/Rust/Zig/my favourite language?” Let’s find out. To get a better sense of our VM’s performance, I rewrote it in C. The C implementation is a classic manual approach: a hand-written tokenizer and recursive-descent parser, structs with pointers for the AST, and manual memory management and error propagation. The VM is a simple loop with a switch statement for dispatching opcodes 9 . To compare our Haskell code against the C code, we need to write the last Haskell module, the CLI app that we demonstrated in the first post: We compile with the following GHC options 10 : And for the C version, we compile using GCC: Now, let’s see how they stack up against each other. We use hyperfine to run the two executables. Here’s a summary of the results: I have subtracted the times of previous passes to get the times for individual passes. Here’s the same in a chart (smaller is better): As expected, the C implementation is faster across the board, between 1.5x and 2.6x. The biggest difference is in parsing, where the hand-written C parser is more than twice as fast as our combinator-based one.
On the other hand, the Haskell VM is only 50% slower than the C VM. In my opinion, the Haskell code’s performance is quite respectable, especially given the safety, expressiveness, and conciseness benefits, as illustrated by the code sizes 11 : The Haskell implementation is almost half the size of the C code. I don’t know about you, but I’m perfectly happy with the half-as-small, half-as-fast trade-off.

The benchmark results for the VMs become less surprising when I compare the C interpreter function with the GHC Core code for our VM loop 12 . Their structures are almost a 1-to-1 match: the C loop corresponds to the optimized loop function that GHC generates, the switch statement is almost identical to the case analysis on the raw opcode byte, and the C stack array is equivalent to the mutable unboxed array GHC uses. GHC effectively compiles our high-level Haskell into low-level code that is structurally identical to what we wrote by hand in C 13 . This explains why the performance is in the same ballpark. The remaining performance gap is probably due to the thin layer of abstraction that the Haskell runtime still maintains, but it’s remarkable how close we can get to C-like performance.

While our Haskell program is fast, we can still improve certain things:

Parser optimizations: As the benchmarks showed, parsing is our slowest pass. For better performance, we could replace our Attoparsec-based combinator parser with a parser generator like Alex and Happy, or even write a recursive-descent parser by hand.

Superinstructions: We could analyze the bytecode for common instruction sequences (such as a push followed by an arithmetic operation) and combine them into single superinstructions. This would reduce the instruction dispatch overhead, but may make compilation slower.

Register-based VM: A register-based VM, which uses a small array of virtual registers instead of a memory-based stack, could significantly reduce memory traffic and improve performance.
This would require a more complex compiler capable of register allocation.

Just-in-Time (JIT) compilation: The ultimate performance boost could come from a JIT compiler. Instead of interpreting bytecode, we could compile it to native machine code at runtime, eliminating the interpreter entirely. Maybe we could use LLVM to build a JIT compiler in Haskell.

And that’s a wrap! We successfully built a bytecode compiler and virtual machine in Haskell. We covered parsing, AST interpretation, compilation, and bytecode execution, as well as debugging and testing functionality. Let’s update our checklist:

- Parsing arithmetic expressions to Abstract Syntax Trees (ASTs). ✅
- Unit testing for our parser. ✅
- Interpreting ASTs. ✅
- Compiling ASTs to bytecode. ✅
- Disassembling and decompiling bytecode. ✅
- Unit testing for our compiler. ✅
- Property-based testing for our compiler. ✅
- Efficiently executing bytecode in a virtual machine (VM). ✅
- Unit testing and property-based testing for our VM. ✅
- Benchmarking our code to see how the different passes perform. ✅
- All the while keeping an eye on performance. ✅

The journey from a simple AST interpreter to a bytecode VM has been a rewarding one. We saw a significant performance improvement, learned how compilers and VMs work, and learned how to write performant code in Haskell. While our Haskell implementation isn’t as fast as the hand-written C version, it’s far more concise and, I would argue, easier to reason about. It’s a great demonstration of Haskell’s power for writing high-performance yet safe and elegant code.

See the full code at: ArithVMLib.hs, ArithVMSpec.hs, ArithVMBench.hs, ArithVMApp.hs.

If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

1. Actually, QuickCheck does not generate entirely arbitrary inputs. It generates arbitrary inputs of increasing complexity (where the complexity measure is defined by the user) and asserts the properties on these inputs. When a test fails for a particular input, QuickCheck also tries to simplify the culprit, to find the simplest input for which the test fails. This process is called shrinking in QuickCheck parlance. QuickCheck then shows this simplest input to the user, for them to use in debugging their code. ↩︎
2. Read this good introduction to QuickCheck if you are unfamiliar. ↩︎
3. Notice that we discard the expressions that do not compile successfully. ↩︎
4. The stack pointer and the instruction pointer are not actual pointers, but indices into the stack and bytecode arrays respectively.
↩︎
5. Guided by the GHC profiler, I tweaked the code in many different ways and ran benchmarks for every change. Then I chose the variant that was most performant. ↩︎
6. It is extremely important to write good tests before getting your hands dirty with performance optimizations. In my case, the tests saved me many times from breaking the VM while moving code around for performance. ↩︎
7. We use our AST interpreter as a definitional interpreter, assuming it to be correctly implemented because of its simpler nature. ↩︎
8. I ran all benchmarks on an Apple M4 Pro 24GB machine against a 142MB file generated using the expression generator we wrote earlier. ↩︎
9. I don’t claim to be a great or even a good C programmer. In fact, this C VM is the first substantial C code I have written in decades. I’m sure the code is not the most optimized. It may even be riddled with memory management bugs. If you find something wrong, please let me know in the comments. ↩︎
10. I tried various RTS options to tweak GHC’s garbage collection, but the defaults proved to be the fastest. ↩︎
11. The lines of code are counted only for the functionality that overlaps between the C and Haskell versions. ↩︎
12. I did try using direct threading and subroutine threading in the C code, but they resulted in slower code than the switch-case variant. GCC may be smart enough, in the case of this simple VM, to optimize the switch-case to be faster than threaded code. ↩︎
13. You may have noticed that the C loop function is not laid out in exactly the same manner as the Haskell one. In the case of C, moving the checks to the front did not yield a performance improvement. I suspect this is because GCC is smart enough to do that optimization by itself. The nested branches were also no detriment to the performance of the C code. ↩︎