A Fast Bytecode VM for Arithmetic: The Virtual Machine
In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. We explore the following topics:

- Parsing arithmetic expressions to Abstract Syntax Trees (ASTs).
- Unit testing for our parser.
- Interpreting ASTs.
- Compiling ASTs to bytecode.
- Disassembling and decompiling bytecode.
- Unit testing for our compiler.
- Property-based testing for our compiler.
- Efficiently executing bytecode in a virtual machine (VM).
- Unit testing and property-based testing for our VM.
- Benchmarking our code to see how the different passes perform.
- All the while keeping an eye on performance.

In this final post, we write the virtual machine that executes our bytecode, and benchmark it. This post was originally published on abhinavsarkar.net. This post is part of the series: A Fast Bytecode VM for Arithmetic.

Bytecode virtual machines (VMs) are known to be faster than AST-walking interpreters. That’s why many real-world programming languages these days are implemented with bytecode VMs, for example, Java, Python, PHP, and Raku. The reason is partly the flat and compact nature of bytecode itself, but VMs also have a few other tricks up their sleeves that make them highly performant. In this post, we write a VM for our arithmetic expression language, and explore some of these performance tricks.

But first, we need to finish a pending task. We wrote some unit tests for our compiler in the last post, but unit tests cover only the cases we can think of. A compiler has to deal with arbitrary input, and with unit tests alone we cannot be sure of its correctness. To test our compiler and other components for correctness, we use the QuickCheck library.

QuickCheck is a property-based testing framework. The key idea of property-based testing is to write properties of our code that hold true for any input, and then to automatically generate a large number of arbitrary inputs and check that the properties indeed hold for them 1 2 . Since we are writing an arithmetic expression parser/compiler/VM, we generate arbitrary expression ASTs, and use them to assert certain invariants of our program. With QuickCheck, we write generators for the inputs of our tests. These generators are composable, just like parser combinators. We use the library-provided generators to write small generators that we combine to create larger ones.
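As a primer, here is a tiny, self-contained illustration of how QuickCheck generators compose: small generators are combined into bigger ones, much like parser combinators. The names here (`genLetter`, `genIdent`) are hypothetical, not the post’s actual code:

```haskell
import Test.QuickCheck

-- Generate a single lowercase letter.
genLetter :: Gen Char
genLetter = elements ['a' .. 'z']

-- Combine it into a generator for short identifiers:
-- pick a length, then generate that many letters.
genIdent :: Gen String
genIdent = do
  len <- choose (1, 5)
  vectorOf len genLetter

main :: IO ()
main = do
  name <- generate genIdent
  -- every generated identifier respects the generator's constraints
  print (not (null name) && length name <= 5 && all (`elem` ['a' .. 'z']) name)
```

Running `sample genIdent` in GHCi would print a handful of such identifiers, which is a quick way to eyeball what a generator produces.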
Let’s start. First come the basic generators: one generates number expressions using QuickCheck’s built-in functions; one generates variable expressions by choosing from the set of valid variable names passed to it; and one generates valid identifiers from combinations of the letters a–z and A–Z, discarding ones that are reserved keywords.

Moving on to composite generators: the binary expression generator generates binary expressions with arbitrary binary operations, recursively calling the expression generator for the operands. A size parameter controls the complexity of the generated expressions, and we halve the size for the operands (and so on recursively) so that we don’t end up with infinitely large expressions. The binding generator generates expressions by generating an identifier, and then generating the assignment and body expressions recursively. We use the same size-halving trick here as well. Notice that the assignment is generated with the passed variable names in scope, whereas the body is generated with the new identifier added to the scope.

The top-level expression generator uses the above generators to generate all kinds of expressions. At smaller sizes, it prefers to generate base expressions; at larger sizes, it prefers composite ones. Thanks to the careful recursive halving of the size in the composite generators, we end up with expressions of finite size.

Finally, we have some instances of QuickCheck’s Arbitrary type class to tie everything together: We can sample them in GHCi: Notice that the generated samples increase in complexity.

With the generators in place, we define our properties next. Let’s test our parser first: This property is a simple round-trip test for the parser and printer: we parse the string representation of a generated expression, and assert that it gives back the same expression. The second property is a more involved round-trip test for the compiler and decompiler: It asserts that compiling an expression, then disassembling and decompiling it, and finally compiling it again results in the original bytecode 3 . This requires a helper function to get the size of an expression: We run these tests in a later section.

This ends our short detour. Now for the main event: the virtual machine. Our VM is a stack-based machine that operates on a stack of values and executes the compiled bytecode. Our goal is to be as fast as possible.
For a quick reminder, these are our opcodes: And now, the heart of the VM: The execution function is where the action happens. It is far more complex than the AST interpreter, but the complexity has a reason, namely performance. The function runs inside the ST monad wrapped with the ExceptT monad transformer. The ST monad lets us use mutable data structures locally while ensuring the function remains externally pure. The ExceptT monad transformer adds support for throwing and propagating errors in a pure manner. For our stack, we use a mutable array of unboxed primitive values. Using a mutable unboxed array is much faster than using an immutable and/or boxed one, due to reduced allocation and pointer chasing. The core of the VM is a tight, tail-recursive loop function that GHC compiles into an efficient machine loop, as we see later. It takes the stack pointer, the instruction pointer 4 , and the stack as arguments. At the top of each iteration, a block of guard clauses checks for stack overflow, underflow, and other error conditions before branching on the current opcode. Placing these checks at the top instead of inside the opcode cases is a deliberate choice. It may make the code slightly harder to understand, but it significantly improves the performance of the loop: all branching moves to the beginning of the loop, resulting in code that is friendlier to the CPU’s branch predictor. Also notice how we reduce the number of checks by handling a whole range of opcodes at once in a guard. The checks are also ordered for performance, guided by profiling and benchmarking 5 . The handling of each opcode is actually pretty straightforward. We use specialized operations to read from and write to the stack, taking care of the required bounds and arithmetic checks. We also use the functions that we wrote earlier. After carrying out each operation, we re-enter the loop by calling it tail-recursively with the updated stack and instruction pointers.
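To make the shape of such a loop concrete, here is a heavily simplified, self-contained sketch. The toy opcode set and all names here are assumptions for illustration, not the post’s real code; the real VM also layers error handling over ST and performs its bounds checks in guard clauses at the top of the loop:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST, runST)
import Data.Array (Array, bounds, listArray, (!))
import Data.Array.ST (STUArray, newArray, readArray, writeArray)

-- A toy opcode set, just for illustration.
data Op = Push Int | Add | Mul

run :: [Op] -> Int
run ops = runST $ do
  -- The stack is a mutable unboxed array, avoiding boxing
  -- and allocation on every push/pop.
  stack <- newArray (0, 255) 0 :: ST s (STUArray s Int Int)
  let code = listArray (0, length ops - 1) ops :: Array Int Op
      -- go is the tight tail-recursive loop; sp and ip are the
      -- stack and instruction "pointers" (really indices).
      go !sp !ip
        | ip > snd (bounds code) = readArray stack (sp - 1)  -- done: result on top
        | otherwise = case code ! ip of
            Push n -> writeArray stack sp n >> go (sp + 1) (ip + 1)
            Add    -> binOp (+) sp ip
            Mul    -> binOp (*) sp ip
      -- Pop two operands, apply the operation, push the result.
      binOp f !sp !ip = do
        b <- readArray stack (sp - 1)
        a <- readArray stack (sp - 2)
        writeArray stack (sp - 2) (f a b)
        go (sp - 1) (ip + 1)
  go 0 0

main :: IO ()
main = print (run [Push 2, Push 3, Add, Push 4, Mul])  -- (2 + 3) * 4 = 20
```

The bang patterns on `sp` and `ip` encourage GHC to keep them as unboxed machine integers in the compiled loop, which matters for a function called once per instruction.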
Finally, we make sure that the execution terminated correctly by checking the state of the stack, and return its first element. We see later that the VM is quite fast, but how does GHC achieve this performance? To see the magic, we can look at GHC’s intermediate language: Core. Core is a simpler functional language than Haskell, to which GHC compiles Haskell. The simpler nature of Core makes it easier for GHC to optimize it and compile it further. We can get the Core code for a program by compiling with the GHC option -ddump-simpl. The actual Core code for our VM is too verbose to show here, but here is a simplified C-like pseudo-code version of our loop: A few key optimizations are worth pointing out:

The loop: The tail-recursive loop function is compiled into a proper loop. The tail call is effectively a jump, which means there’s no function call overhead for each iteration of the VM loop.

Unboxing: The Core code is full of primitive, unboxed types and operations on them. These are raw machine integers and memory addresses, not boxed Haskell objects. This means operations on them are as fast as they would be in C. The stack operations are not polymorphic function calls on a typeclass instance, but primitive memory reads and writes on a raw memory address.

Inlining: The arithmetic helper function is completely inlined into the main loop. For addition, for example, the code for reading two values, adding them, and writing the result is laid out inline, and works on unboxed values and a raw array address.

In short, GHC has turned our high-level, declarative Haskell code into a low-level loop that looks remarkably like one we would write in C. We get the safety and expressiveness of Haskell, while GHC does the heavy lifting to produce highly optimized code. It’s the best of both worlds! We must test the VM to make sure it works correctly 6 .
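Before moving on to the actual tests, here is the testing idea in miniature: a toy, self-contained version of the differential property, with hypothetical names and a much smaller language than the post’s, checking that an AST interpreter and a stack machine always agree:

```haskell
import Test.QuickCheck

-- A tiny expression language standing in for the post's real one.
data Expr = Lit Int | Add Expr Expr
  deriving Show

data Op = Push Int | AddOp

-- The "definitional" AST interpreter.
interpret :: Expr -> Int
interpret (Lit n)   = n
interpret (Add a b) = interpret a + interpret b

-- A trivial compiler to stack-machine bytecode.
compile :: Expr -> [Op]
compile (Lit n)   = [Push n]
compile (Add a b) = compile a ++ compile b ++ [AddOp]

-- A list-based stack machine (the real VM uses a mutable array).
execute :: [Op] -> Int
execute = head . foldl step []
  where
    step st        (Push n) = n : st
    step (b:a:st)  AddOp    = (a + b) : st
    step _         AddOp    = error "stack underflow"

-- A sized generator that halves the size for subexpressions.
genExpr :: Int -> Gen Expr
genExpr n
  | n <= 1    = Lit <$> choose (-100, 100)
  | otherwise = oneof
      [ Lit <$> choose (-100, 100)
      , Add <$> genExpr (n `div` 2) <*> genExpr (n `div` 2)
      ]

-- The differential property: both interpreters must agree.
prop_agree :: Property
prop_agree = forAll (sized genExpr) $ \e ->
  interpret e === execute (compile e)

main :: IO ()
main = quickCheck prop_agree
```

If the compiler or the machine had a bug, QuickCheck would shrink the failing expression down to a minimal counterexample, which is exactly what makes this style of test so useful for a VM.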
We reuse the success and failure tests from the AST interpreter, as the bytecode interpreter should yield the same results: We also add a property-based test this time: for any given expression, interpreting the AST should produce the same result as compiling it to bytecode and executing it in the VM 7 . Our test suite is complete now: And finally, we run all tests together: Happily, all tests pass.

Now for the fun part: benchmarking. We use the criterion library to benchmark the code. We have a benchmark suite that measures the performance of each pass, the two interpreters (AST and bytecode), and the full end-to-end runs 8 . We compile with the following GHC options: Here are the results in a more digestible format: Here are the times in a chart (smaller is better): Let’s break down these numbers:

- Parsing and decompiling are slow: At ~573ms and ~506ms, these are by far the slowest passes. This isn’t surprising. Parsing with parser combinators has a known trade-off of expressiveness for performance. Decompiling is a shift-reduce parser that reconstructs an AST from a linear stream of opcodes, and we didn’t spend any time optimizing it.
- Compilation is fast: At ~51ms, compilation is an order of magnitude faster than parsing. This is thanks to pre-calculating the bytecode size during the parsing phase, which allows us to pre-allocate a single buffer and fill it in with low-level pointer operations.
- Bytecode interpretation is blazingly fast: At just ~16ms, our VM’s interpreter is over 3 times faster than the AST interpreter (~50ms), which confirms our belief that bytecode interpreters are faster.
- End-to-end runs: Interestingly, the total time to run via bytecode (~638ms) is slightly slower than the run via the AST (~617ms). This is because the cost of parsing, compiling, and then interpreting is higher than that of just parsing and interpreting. The real win for a bytecode VM comes when you compile once and run many times, amortizing the initial compilation cost.

I can already see readers thinking, “Sure, that’s fast, but is it faster than C/Rust/Zig/my favourite language?” Let’s find out. To get a better sense of our VM’s performance, I rewrote it in C. The C implementation is a classic manual approach: a hand-written tokenizer and recursive-descent parser, structs with pointers for the AST, and manual memory management and error propagation. The VM is a simple loop with a switch statement for dispatching opcodes 9 . To compare our Haskell code against the C code, we need to write the last Haskell module, the CLI app that we demonstrated in the first post: We compile with the following GHC options 10 : And for the C version, we compile using GCC: Now, let’s see how they stack up against each other. We use hyperfine to run the two executables. Here’s a summary of the results: I have subtracted the times of previous passes to get the times for individual passes. Here’s the same in a chart (smaller is better): As expected, the C implementation is faster across the board, between 1.5x and 2.6x. The biggest difference is in parsing, where the hand-written C parser is more than twice as fast as our combinator-based one.
On the other hand, the Haskell VM is only 50% slower than the C VM. In my opinion, the Haskell code’s performance is quite respectable, especially given the safety, expressiveness, and conciseness benefits, as illustrated by the code sizes 11 : The Haskell implementation is almost half the size of the C code. I don’t know about you, but I’m perfectly happy with the half-as-small, half-as-fast trade-off.

The benchmark results for the VMs become less surprising when I compare the C interpreter function with the GHC Core code for our VM loop 12 . Their structures are almost a 1-to-1 match: the C loop corresponds to the optimized loop function that GHC generates, the switch statement is almost identical to the case analysis on the raw opcode byte, and the C stack array is equivalent to the mutable unboxed array GHC uses. GHC effectively compiles our high-level Haskell into low-level code that is structurally identical to what we wrote by hand in C 13 . This explains why the performance is in the same ballpark. The remaining performance gap is probably due to the thin layer of abstraction that the Haskell runtime still maintains, but it’s remarkable how close we can get to C-like performance.

While our Haskell program is fast, we can still improve certain things:

Parser optimizations: As the benchmarks showed, parsing is our slowest pass. For better performance, we could replace our Attoparsec-based combinator parser with a parser generator like Alex and Happy, or even write a recursive-descent parser by hand.

Superinstructions: We could analyze the bytecode for common instruction sequences (such as a push followed by an arithmetic operation) and combine them into single superinstructions. This would reduce the instruction dispatch overhead, but may make compilation slower.

Register-based VM: A register-based VM, which uses a small array of virtual registers instead of a memory-based stack, could significantly reduce memory traffic and improve performance.
This would require a more complex compiler capable of register allocation.

Just-in-Time (JIT) compilation: The ultimate performance boost could come from a JIT compiler. Instead of interpreting bytecode, we could compile it to native machine code at runtime, eliminating the interpreter entirely. Maybe we could use LLVM to build a JIT compiler in Haskell.

And that’s a wrap! We successfully built a bytecode compiler and virtual machine in Haskell. We covered parsing, AST interpretation, compilation, and bytecode execution, as well as debugging and testing functionality. Let’s update our checklist:

- Parsing arithmetic expressions to Abstract Syntax Trees (ASTs). ✅
- Unit testing for our parser. ✅
- Interpreting ASTs. ✅
- Compiling ASTs to bytecode. ✅
- Disassembling and decompiling bytecode. ✅
- Unit testing for our compiler. ✅
- Property-based testing for our compiler. ✅
- Efficiently executing bytecode in a virtual machine (VM). ✅
- Unit testing and property-based testing for our VM. ✅
- Benchmarking our code to see how the different passes perform. ✅
- All the while keeping an eye on performance. ✅

The journey from a simple AST interpreter to a bytecode VM has been a rewarding one. We saw a significant performance improvement, learned how compilers and VMs work, and learned how to write performant code in Haskell. While our Haskell implementation isn’t as fast as the hand-written C version, it’s far more concise and, I would argue, easier to reason about. It’s a great demonstration of Haskell’s power for writing high-performance yet safe and elegant code.

See the full code at: ArithVMLib.hs, ArithVMSpec.hs, ArithVMBench.hs, ArithVMApp.hs.

If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

1. Actually, QuickCheck does not generate entirely arbitrary inputs. It generates arbitrary inputs of increasing complexity (where the complexity measure is defined by the user) and asserts the properties on these inputs. When a test fails for a particular input, QuickCheck also tries to simplify the culprit, to find the simplest input for which the test fails. This process is called shrinking in QuickCheck parlance. QuickCheck then shows this simplest input to the user, for them to use in debugging their code. ↩︎
2. Read this good introduction to QuickCheck if you are unfamiliar. ↩︎
3. Notice that we discard the expressions that do not compile successfully. ↩︎
4. The stack pointer and the instruction pointer are not actual pointers, but indices into the stack and bytecode arrays respectively.
↩︎
5. Guided by the GHC profiler, I tweaked the code in many different ways and ran benchmarks for every change. Then I chose the variant that was most performant. ↩︎
6. It is extremely important to write good tests before getting your hands dirty with performance optimizations. In my case, the tests saved me many times from breaking the VM while moving code around for performance. ↩︎
7. We use our AST interpreter as a definitional interpreter, assuming it to be correctly implemented because of its simpler nature. ↩︎
8. I ran all benchmarks on an Apple M4 Pro 24GB machine against a 142MB file generated using the expression generator we wrote earlier. ↩︎
9. I don’t claim to be a great or even a good C programmer. In fact, this C VM is the first substantial C code I have written in decades. I’m sure the code is not the most optimized. It may even be riddled with memory management bugs. If you find something wrong, please let me know in the comments. ↩︎
10. I tried various RTS options to tweak GHC’s garbage collection, but the defaults proved to be the fastest. ↩︎
11. The lines of code are counted only for the functionality that overlaps between the C and Haskell versions. ↩︎
12. I did try using direct threading and subroutine threading in the C code, but they resulted in slower code than the switch-case variant. GCC may be smart enough, in the case of this simple VM, to optimize the switch-case to be faster than threaded code. ↩︎
13. You may have noticed that the C loop function is not laid out in exactly the same manner as the Haskell one. In the case of C, moving the checks to the front did not yield a performance improvement. I suspect this is because GCC is smart enough to do that optimization by itself. The nested branches were also no detriment to the performance of the C code. ↩︎