A Fast Bytecode VM for Arithmetic: The Compiler
In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. This is the second post in the series: in this post, we write the compiler from our AST to bytecode, and a decompiler for the bytecode. This post was originally published on abhinavsarkar.net.

AST interpreters are well known to be slow because of how AST nodes are represented in the computer's memory. The AST nodes contain pointers to other nodes, which may be anywhere in the memory. So while interpreting an AST, the interpreter jumps all over the memory, causing a slowdown. One solution to this is to convert the AST into a more compact and optimized representation known as bytecode.

Bytecode is a flattened and compact representation of a program, usually manifested as a byte array. Bytecode is essentially an Instruction Set (IS), but custom-made to be executed by a Virtual Machine (VM) instead of a physical machine. Each bytecode instruction is one byte in size (that's where it gets its name from). A bytecode and its VM are created in synergy so that the execution is as efficient as possible [1]. Compiling source code to bytecode and executing it in a VM also allows the program to be run on all platforms that the VM supports, without the developer caring much about portability concerns. The most popular combo of bytecode and VM is probably the Java bytecode and the Java virtual machine.

VMs can be stack-based or register-based. In a stack-based VM, all values created during the execution of a program are stored only in a stack data structure residing in memory. In a register-based VM, there is additionally a fixed number of registers that are used to store values in preference to the stack [2]. Register-based VMs are usually faster, but stack-based VMs are usually simpler to implement. For our purpose, we choose to implement a stack-based VM.

We are going to write a compiler that compiles our expression AST to bytecode. But first, let's design the bytecode for our stack-based VM. Here is our expression AST as a reminder:
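Below is a minimal sketch of the AST, assuming the constructor names Num, Var, BinOp, and Let (the definitions in the previous post may differ in details); the later sketches in this post build on these definitions:

```haskell
import Data.Int (Int16)

-- Identifiers are plain strings.
type Ident = String

-- The four arithmetic operators.
data Op = Plus | Minus | Multiply | Divide
  deriving (Show, Eq)

-- The expression AST: number literals, variable references,
-- binary operations, and let bindings.
data Expr
  = Num Int16            -- a number literal
  | Var Ident            -- a variable reference
  | BinOp Op Expr Expr   -- a binary operation on two sub-expressions
  | Let Ident Expr Expr  -- let <name> = <assignment> in <body>
  deriving (Show, Eq)
```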
Let's figure out the right bytecode for each case. First, we create opcodes, one for each bytecode instruction; they are mnemonics for the actual bytecode. Think of them as what Assembly is to Machine Code.

For a number literal, we need to put it directly in the bytecode so that we can use it later during execution. We also need an opcode to push it on the stack. Let's call it Push, with the number as its parameter.

Binary operations recursively contain expressions as their operands. To evaluate a binary operation, we need its operands to be evaluated before it, so we compile them to bytecode first. After that, all we need is an opcode per operator. Let's call them Add, Sub, Mul, and Div for the +, -, * and / operators respectively.

Variables and let expressions are more complex [3]. In the AST interpreter we chucked the variables in a map, but we cannot do that in a VM. There is no environment map in a VM, and all values must reside in the stack. How do we have variables at all then? Let's think for a bit.

Each expression, after being evaluated in the VM, must push exactly one value on the stack: its result. Number expressions are a trivial case. When a binary operation is evaluated, first its left operand is evaluated. That pushes one value on the stack. Then its right operand is evaluated, and that pushes another value on the stack. Finally, the operation pops the two values from the top of the stack, does its thing, and pushes the resultant value back on the stack: again, one value for the entire expression.

A let expression binds a variable's value to its name, and then the variable can be referred to from the body of the expression. But how can we refer to a variable when the stack contains only values, not names? Let's imagine that we are in the middle of evaluating a large expression, wherein we encounter a let expression. First we evaluate its assignment expression, and that pushes a value on the top of the stack. Say the stack has n values at this point, so the assignment value sits at index n − 1. After this we get to evaluate the body expression. At all times while we are doing that, the value from the assignment stays at the same index in the stack, because evaluating sub-expressions, no matter how complicated, only adds new values to the stack without popping existing values from before. Therefore, we can use the stack index of the assignment value to refer to it from within the body expression.

So, we encode a variable reference as an opcode with an integer index into the stack. We choose a one-byte index, limiting us to a stack depth of 256. Let's call this opcode Get; when executed, it reads the value from the stack at the given index and pushes it on top of the stack.

For a let expression, after we compile its assignment and body expressions, we need to make sure that the exactly-one-value invariant holds. Evaluating the assignment and body pushes two values on the stack, but we can have only one! So we overwrite the assignment value with the body value, and pop the stack to remove the body value. We invent a new opcode to do this, called SwapPop because its effect is equivalent to swapping the topmost two values on the stack, and then popping the new top value [4].

Putting all the opcodes together, we have the following ADT:
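Here is a sketch of the opcode ADT in terms of the names used above; the names and exact byte values are assumptions for illustration:

```haskell
import Data.Int (Int16)
import Data.Word (Word8)

-- Opcodes are mnemonics for the actual bytecode.
data Opcode
  = Push Int16  -- push a number literal on the stack
  | Add         -- pop two values, add them, push the result
  | Sub         -- likewise for subtraction
  | Mul         -- likewise for multiplication
  | Div         -- likewise for division
  | Get Word8   -- push a copy of the value at the given stack index
  | SwapPop     -- overwrite the second value from the top with the top value, then pop
  deriving (Show, Eq)

-- The byte value of an opcode is its ordinal in the declaration above.
opcodeByte :: Opcode -> Word8
opcodeByte (Push _) = 0
opcodeByte Add      = 1
opcodeByte Sub      = 2
opcodeByte Mul      = 3
opcodeByte Div      = 4
opcodeByte (Get _)  = 5
opcodeByte SwapPop  = 6
```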
Notice that we also assigned a bytecode, that is, a unique byte value, to each opcode above: it is simply the opcode's ordinal in the declaration.

Now we are ready to write the compiler. The compiler takes an expression along with its bytecode size, and compiles it to a strict bytestring of that size. Recall that in the previous post, we wrote our parser such that the bytecode size for each AST node was calculated while parsing it. This allows us to pre-allocate a bytestring of the required size before compiling the AST. We compile to actual bytes here, and don't use the opcodes.

We use the bytestring creation function from the bytestring internals module, which allocates enough memory for the provided bytecode size, and gives us a pointer to the allocated memory. We call this pointer the frame pointer. Then we traverse the AST recursively, writing the bytes for the opcodes and their arguments for each case. We use pointer arithmetic and byte-poking to write the bytes. Int16 numbers are encoded as two bytes in little-endian fashion.

In the recursive traversal function, we pass and return the current stack pointer and instruction pointer, and update them correctly for each case [5]. We also take care of checking that the pointers stay within the right bounds, failing which we throw appropriate errors. We also pass an environment parameter that is similar to the variable-names-to-values environment we use in the AST interpreter, but this one tracks variable names to the stack indices at which they reside. We update this information before compiling the body of a let expression, to capture the stack index of its assignment value. When compiling a variable expression, we use this map to look up the variable's stack index, and encode it in the bytecode.

At the end of compilation, we check that the entire bytestring is filled with bytes till the very end, failing which we throw an error. This check is required because otherwise the bytestring may contain garbage bytes, and may fail inexplicably during execution. All the errors are thrown in the IO monad, and are caught after compilation; the final result or error is returned wrapped in an Either.
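To show how all of this fits together, here is a condensed sketch of the compiler in terms of the Expr and Opcode sketches above; it elides the pointer and stack-depth bounds checks described earlier, and its byte values match the opcodeByte function from before:

```haskell
import Control.Exception (ErrorCall (..), throwIO, try)
import Control.Monad (when)
import Data.Bits (shiftR)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BSI
import qualified Data.HashMap.Strict as HM
import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeByteOff)

-- Compile an expression to a strict bytestring of the given size.
compile :: Int -> Expr -> IO (Either ErrorCall BS.ByteString)
compile size expr = try $ BSI.create size $ \fp -> do
  (_sp, ip) <- go fp HM.empty 0 0 expr
  -- the bytestring must be filled till the very end
  when (ip /= size) $ throwIO (ErrorCall "bytecode size mismatch")
 where
  -- Walk the AST, returning the updated stack depth and instruction offset.
  go :: Ptr Word8 -> HM.HashMap Ident Int -> Int -> Int -> Expr -> IO (Int, Int)
  go fp env sp ip expr' = case expr' of
    Num n -> do
      pokeByteOff fp ip (0 :: Word8)                                 -- Push
      pokeByteOff fp (ip + 1) (fromIntegral n :: Word8)              -- low byte
      pokeByteOff fp (ip + 2) (fromIntegral (n `shiftR` 8) :: Word8) -- high byte
      pure (sp + 1, ip + 3)
    Var name -> case HM.lookup name env of
      Nothing -> throwIO (ErrorCall ("unknown variable: " <> name))
      Just idx -> do
        pokeByteOff fp ip (5 :: Word8)                      -- Get
        pokeByteOff fp (ip + 1) (fromIntegral idx :: Word8) -- stack index
        pure (sp + 1, ip + 2)
    BinOp op l r -> do
      (sp1, ip1) <- go fp env sp ip l    -- compile the left operand first,
      (sp2, ip2) <- go fp env sp1 ip1 r  -- then the right operand,
      pokeByteOff fp ip2 (opByte op)     -- then the operator opcode
      pure (sp2 - 1, ip2 + 1)            -- two values popped, one pushed
    Let name assign body -> do
      (sp1, ip1) <- go fp env sp ip assign
      -- the assignment's value now lives at stack index sp
      (sp2, ip2) <- go fp (HM.insert name sp env) sp1 ip1 body
      pokeByteOff fp ip2 (6 :: Word8)    -- SwapPop
      pure (sp2 - 1, ip2 + 1)

  opByte :: Op -> Word8
  opByte Plus     = 1
  opByte Minus    = 2
  opByte Multiply = 3
  opByte Divide   = 4
```

Note how each case maintains the exactly-one-value invariant: the stack depth returned by every case is exactly one more than the depth it started with.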
Let's see it in action. Compiling an expression gives us back raw bytes, and you can verify that the resultant bytes are indeed correct. I assume, though, that it is difficult for you to read raw bytes. We'll fix this in a minute. Meanwhile, let's ponder some performance characteristics of our compiler.

You may be wondering why I chose to write the compiler in this somewhat convoluted way of pre-allocating a bytestring and using pointers. The answer is: performance. I didn't actually start with pointers. I iterated through many different data and control structures to find the fastest one, measuring the compilation times for a benchmark expression file. I started with the bread-and-butter data structure of Haskellers, the humble list, which was indeed quite slow. Next, I moved on to data structures that are known to be faster at concatenation and consing. Then I abandoned the use of intermediate data structures completely, choosing to use a bytestring builder to create the bytestring. Finally, I had the epiphany that the bytestring size was known at compile time, and rewrote the function to pre-allocate the bytestring, thereby reaching the fastest solution.

I also tried using bytearrays, which have more-or-less the same performance as bytestrings, but they are inconvenient to use because there are no functions to do IO with bytearrays. So I'd anyway need to use bytestrings for reading from STDIN or writing to STDOUT, and converting to-and-fro between bytearray and bytestring is a performance killer. Thus, I decided to stick to bytestrings. The pre-allocated bytestring approach is 80 times faster than using lists, and almost 10 times faster than using a builder. For such a gain, I'm okay with the complications it brings to the code.

The other important data structure used here is the map (or dictionary) in which we add the mappings from identifiers to their stack indices. This data structure needs to be performant because we do a lookup for each variable we encounter while compiling. I benchmarked compilation with a few map data structures. The strict hashmap turns out to be the fastest one but, interestingly, a linked list [6] is a close second. A mutable hashtable is the slowest, even though I expected it to be the fastest.

Another choice I had to make was how to write the compilation function. I ended up passing and returning the pointers and the environment map, and throwing errors in IO, but a number of solutions are possible. I tried out some of them, and noted the compilation times for the benchmark expression file. I tried putting the pointers in mutable references and in state instead of passing them back and forth. I tried putting the environment in a reader config. I tried using an error monad transformer for throwing errors instead of using IO errors. Then I tried various combinations of these monad transformers. Finally, I also tried converting the function to be tail-recursive by using Continuation-passing style (CPS), and then defunctionalizing the continuations, as well as using the continuation monad transformer. All of these approaches resulted in slower code.

There is no reason to use mutable references here because they result in slower and uglier code. Using one monad transformer at a time results in slight slowdowns, which may be worth it for the improvement in the code. But using more than one of them degrades performance by a lot. Also, there is no improvement from the CPS conversion, because GHC is smart enough to optimize the non-tail-recursive code to be faster than handwritten tail-recursive code that allocates a lot of closures (or objects, in the case of defunctionalization). Moving on…

It is a hassle to read raw bytes in the compiler output. Let's write a decompiler to aid us in debugging and testing the compiler. First, a disassembler that converts bytes to opcodes, as sketched below. A disassembled program is a sequence of opcodes. We simply go over each byte of the bytecode, and append the right opcode for it to the program, along with any parameters it may have. Note that we do not verify that the disassembled program is correct. Small helpers read the instruction bytes and their arguments from the bytestring.

Next, we decompile the opcodes to an expression. Decompilation is the opposite of compilation. While compiling, there is an implicit stack of expressions that are yet to be compiled. We make that stack explicit here, capturing expressions as they are decompiled from opcodes. For compound expressions, we inspect the stack and use the already-decompiled expressions as the operands of the expression being decompiled. This way we build up larger expressions from smaller ones, culminating in the single top-level expression at the end [7]. Finally, we check the stack to make sure that there is only one expression left in it. Note that, like the disassembler, we do not verify that the decompiled expression is correct.

There is one tricky thing in decompilation: we lose the names of the variables when compiling, and are left with only stack indices. So while decompiling, we generate variable names from their stack indices by indexing a list of unique names.
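Here is a matching sketch of the disassembler and decompiler, again in terms of the sketches above; the helper names readInt16, readWord8, and varNames are illustrative, not necessarily the original ones:

```haskell
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as BS
import Data.Int (Int16)
import Data.Word (Word8)

-- Convert bytes back to opcodes, reading arguments where needed.
disassemble :: BS.ByteString -> [Opcode]
disassemble bs = case BS.uncons bs of
  Nothing -> []
  Just (b, rest) -> case b of
    0 -> let (n, rest') = readInt16 rest in Push n : disassemble rest'
    1 -> Add : disassemble rest
    2 -> Sub : disassemble rest
    3 -> Mul : disassemble rest
    4 -> Div : disassemble rest
    5 -> let (i, rest') = readWord8 rest in Get i : disassemble rest'
    6 -> SwapPop : disassemble rest
    _ -> error ("disassemble: unknown bytecode: " <> show b)

-- Read a two-byte little-endian Int16 argument.
readInt16 :: BS.ByteString -> (Int16, BS.ByteString)
readInt16 bs =
  let lo = fromIntegral (BS.index bs 0)
      hi = fromIntegral (BS.index bs 1)
   in (lo .|. (hi `shiftL` 8), BS.drop 2 bs)

-- Read a one-byte stack-index argument.
readWord8 :: BS.ByteString -> (Word8, BS.ByteString)
readWord8 bs = (BS.index bs 0, BS.drop 1 bs)

-- Rebuild an expression from opcodes using an explicit expression stack.
decompile :: [Opcode] -> Expr
decompile = go []
 where
  go [e] [] = e
  go _   [] = error "decompile: no single result expression"
  go stack (op : ops) = case (op, stack) of
    (Push n, _)         -> go (Num n : stack) ops
    (Get i, _)          -> go (Var (varNames !! fromIntegral i) : stack) ops
    (Add, r : l : rest) -> go (BinOp Plus l r : rest) ops
    (Sub, r : l : rest) -> go (BinOp Minus l r : rest) ops
    (Mul, r : l : rest) -> go (BinOp Multiply l r : rest) ops
    (Div, r : l : rest) -> go (BinOp Divide l r : rest) ops
    (SwapPop, body : assign : rest) ->
      -- the assignment value's stack index equals the depth below it
      go (Let (varNames !! length rest) assign body : rest) ops
    _ -> error "decompile: stack underflow"

-- An infinite list of unique variable names: a, b, ..., z, a1, b1, ...
varNames :: [String]
varNames = [c : suffix | suffix <- "" : map show [1 :: Int ..], c <- ['a' .. 'z']]
```

With these, decompile (disassemble bytes) round-trips the compiled bytes back to an expression, with variable names generated from stack indices.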
Let's see it in action by disassembling and decompiling the bytes we compiled earlier; the round-trip gives us back an expression equivalent to the original one.

That's all for compilation and decompilation. Now, we use them together to make sure that everything works. We write some unit tests for the compiler, targeting both success and failure cases. In each test, we parse and compile an expression, and then disassemble the compiled bytes, which we match against the expected list of opcodes, or an error message. We put these tests with the parser tests, and run them. Awesome, it works!

That's it for this post. Let's update our checklist: the parser and the compiler are done. In the next part, we write a virtual machine that runs our compiled bytecode, and do some benchmarking.

If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

[1] There are VMs that execute hardware ISs instead of bytecode. Such VMs are also called Emulators because they emulate actual CPU hardware. Some examples are QEMU and video game console emulators. ↩︎

[2] VMs use virtual registers instead of actual CPU registers, which are often represented as a fixed-size array of 1, 2, 4 or 8 byte elements. ↩︎

[3] I call them variables here but they do not actually vary. A better name is let bindings. ↩︎

[4] We could have used two separate opcodes here: Swap and Pop. That would produce the same final result when evaluating an expression, but we'd have to execute two instructions instead of one for let expressions. Using a single instruction speeds up execution, not only because we reduce the number of instructions, but also because we don't need to do a full swap: only a half swap is enough, because we pop the stack anyway after the swap. This also shows how we can improve the performance of our VMs by inventing specific opcodes for particular operations. ↩︎

[5] Note the use of strict types here, for performance reasons. ↩︎

[6] Used as an Association List. ↩︎

[7] The decompiler is a bottom-up shift-reduce parser from the opcodes to the expression tree. ↩︎