Posts in C (20 found)

Notes on the WASM Basic C ABI

The WebAssembly/tool-conventions repository contains "Conventions supporting interoperability between tools working with WebAssembly". Of special interest, it contains the Basic C ABI - an ABI for representing C programs in WASM. This ABI is followed by compilers like Clang with the wasm32 target. Rust is also switching to this ABI for extern "C" code. This post contains some notes on this ABI, with annotated code samples and diagrams to help visualize what the emitted WASM code is doing. Hereafter, "the ABI" refers to this Basic C ABI.

In these notes, annotated WASM snippets often contain descriptions of the state of the WASM value stack at a given point in time. Unless otherwise specified, "TOS" refers to "Top Of value Stack", and the notation [ x  y ] means the stack has y on top, with x right under it (and possibly some other stuff that's not relevant to the discussion under x); in this notation, the stack grows "to the right". The WASM value stack has no linear memory representation and cannot be addressed, so it's meaningless to discuss whether the stack grows towards lower or higher addresses. The value stack is simply an abstract stack, where values can be pushed onto or popped off its "top". Whenever addressing is required, the ABI specifies explicitly managing a separate stack in linear memory. This stack is very similar to how stacks are managed in hardware assembly languages (except that in the ABI the stack pointer is held in a global variable, and is not a special register), and it's called the "linear stack".

By "scalar" I mean basic C types like int, double or char. For these, using the WASM value stack is sufficient, since WASM functions can accept an arbitrary number of scalar parameters. A C function like add_three in the sketch below is compiled into a WASM function taking three i32 parameters, and can be called by pushing three values onto the stack and invoking call $add_three.

The ABI specifies that all integral types 32-bit and smaller will be passed as i32, with the smaller types appropriately sign or zero extended. For example, consider add_three_chars in the same sketch: it's compiled to almost the same code as add_three, except for the final i32.extend8_s, which takes the lowest 8 bits of the value on TOS and sign-extends them to the full i32 (effectively ignoring all the higher bits). Similarly, when $add_three_chars is called, each of its arguments goes through i32.extend8_s. There are additional oddities that we won't get deep into, like passing __int128 values via two i64 parameters.

C pointers are just scalars, but it's still educational to review how they are handled in the ABI. Pointers to any type are passed in i32 values; the compiler knows they are pointers, though, and emits the appropriate instructions (the sketch below includes an example that stores through a pointer). Recall that in WASM, there's no difference between an i32 representing an address in linear memory and an i32 representing just a number. i32.store expects [ addr  value ] on TOS, and does *addr = value. Note that the x parameter isn't needed any longer after the sum is computed, so its local is reused later on to hold the return value. WASM parameters are treated just like other locals (as in C).
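Here is a minimal sketch of the scalar and pointer examples discussed above. The names add_three and add_three_chars follow the text; the pointer function and its name store_sum are illustrative, and the WAT is hand-written to match the description, so real compiler output will differ in detail:

```c
int add_three(int a, int b, int c) { return a + b + c; }

char add_three_chars(char a, char b, char c) { return a + b + c; }

/* Illustrative pointer example: store the sum through p, return it. */
int store_sum(int *p, int x, int y) {
  int sum = x + y;
  *p = sum;
  return sum;
}
```

```wat
(func $add_three (param i32 i32 i32) (result i32)
  local.get 0
  local.get 1
  i32.add
  local.get 2
  i32.add)

(func $add_three_chars (param i32 i32 i32) (result i32)
  local.get 0
  local.get 1
  i32.add
  local.get 2
  i32.add
  i32.extend8_s)          ;; sign-extend the low 8 bits to a full i32

(func $store_sum (param $p i32) (param $x i32) (param $y i32) (result i32)
  local.get $p            ;; push the address (just an i32)
  local.get $x
  local.get $y
  i32.add                 ;; TOS is now [ addr  sum ]
  local.tee $x            ;; $x is dead after the sum, so its local is reused
  i32.store               ;; *addr = sum
  local.get $x)           ;; the reused local holds the return value
```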
According to the ABI, while scalars and single-element structs or unions are passed to a callee via WASM function parameters (as shown above), for larger aggregates the compiler utilizes linear memory. Specifically, each function gets a "frame" in a region of linear memory allocated for the linear stack. This region grows downwards from high to low addresses [1], and the global $__stack_pointer points at the bottom of the frame.

Consider a do_work function that initializes a two-int struct Pair named pp and passes it by value to pair_calculate. When do_work is compiled to WASM, prior to calling pair_calculate it copies pp into a location in linear memory, and passes the address of this location to pair_calculate. This location is on the linear stack, which is maintained using the $__stack_pointer global. (A sketch of the compiled WASM for do_work appears at the end of this post; I also gave its local variable a meaningful name, for readability.) Some notes about this code:

- There are two instances of the pair pp in linear memory prior to the call to pair_calculate: the original one from the initialization statement (at offset 8), and a copy created for passing into pair_calculate (at offset 0). Theoretically, since pp is unused after the call, the compiler could do better here and keep only a single copy.
- The stack pointer is decremented by 16, and restored at the end of the function. The first few instructions, where the stack pointer is adjusted, are usually called the prologue of the function. In the same vein, the last few instructions, where the stack pointer is reset back to where it was at entry, are called the epilogue.

Following the ABI, the code emitted for pair_calculate takes a Pair* (by reference, instead of by value as in the original C code). Each function that needs linear stack space is responsible for adjusting the stack pointer and restoring it to its original place at the end. This naturally enables nested function calls; suppose we have some function a calling function b which, in turn, calls function c, and let's assume all of these need to allocate space on the linear stack. After c's prologue, each function's frame simply sits below its caller's on the linear stack. Since each function knows how much stack space it has allocated, it's able to properly restore $__stack_pointer to the bottom of its caller's frame before returning.

What about returning values of aggregate types? According to the ABI, these are also handled indirectly; a pointer parameter is prepended to the parameter list of the function, and the function writes its return value into this address. A function returning a Pair, for instance, is compiled into a function that takes an extra i32 address and stores its result there, while the caller reserves room for the result in its own frame. Note that such a caller may only use 8 bytes of its stack frame but allocate 16; this is because the ABI dictates 16-byte alignment for the stack pointer.

There are some advanced topics mentioned in the ABI that these notes don't cover (at least for now), but I'll mention them here for completeness:

- "Red zone": leaf functions have access to 128 bytes of red zone below the stack pointer. I found this difficult to observe in practice [2]; since we don't issue system calls directly in WASM, it's tricky to conjure a realistic leaf function that requires the linear stack (instead of just using WASM locals).
- A separate frame pointer (global value) to be used for functions that require dynamic stack allocation (such as using C's VLAs).
- A separate base pointer to be used for functions that require alignment > 16 bytes on the stack.

[1] This is similar to x86. For the WASM C ABI, a good reason is provided for the direction: WASM load and store instructions have an unsigned constant called offset that can be used to add a positive offset to the address parameter without extra instructions. Since $__stack_pointer points to the lowest address in the frame, these offsets can be used to efficiently access any value on the stack.
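As promised, here is a minimal sketch of the Pair example. The C declarations and the hand-written WAT are reconstructions matching the offsets described above (pp at offset 8, the copy at offset 0, a 16-byte frame); real compiler output differs in detail:

```c
typedef struct { int x; int y; } Pair;

int pair_calculate(Pair p);  /* takes Pair by value, in C */

int do_work(void) {
  Pair pp = {1, 2};
  return pair_calculate(pp);
}
```

```wat
(func $do_work (result i32)
  (local $sp i32)
  ;; prologue: carve a 16-byte frame out of the linear stack
  global.get $__stack_pointer
  i32.const 16
  i32.sub
  local.tee $sp
  global.set $__stack_pointer
  ;; initialize pp at offset 8 in the frame
  local.get $sp
  i32.const 1
  i32.store offset=8          ;; pp.x = 1
  local.get $sp
  i32.const 2
  i32.store offset=12         ;; pp.y = 2
  ;; copy pp to offset 0, then pass the copy's address
  local.get $sp
  local.get $sp
  i64.load offset=8
  i64.store offset=0          ;; copy both fields at once
  local.get $sp               ;; in WASM, pair_calculate takes a Pair* (an i32)
  call $pair_calculate
  ;; epilogue: restore the stack pointer for the caller
  local.get $sp
  i32.const 16
  i32.add
  global.set $__stack_pointer)
```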

Max Bernstein 2 weeks ago

A catalog of side effects

Optimizing compilers like to keep track of each IR instruction's effects. An instruction's effects vary wildly, from having no effects at all, to writing a specific variable, to completely unknown (writing all state). This post can be thought of as a continuation of What I talk about when I talk about IRs, specifically the section talking about asking the right questions. When we talk about effects, we should ask the right questions: not what opcode is this? but instead what effects does this opcode have? Different compilers represent and track these effects differently. I've been thinking about how to represent these effects all year, so I have been doing some reading. In this post I will give some summaries of the landscape of approaches. Please feel free to suggest more.

Internal IR effect tracking is similar to the programming language notion of algebraic effects in type systems, but internally, compilers keep track of finer-grained effects. Effects such as "writes to a local variable", "writes to a list", or "reads from the stack" indicate what instructions can be re-ordered, duplicated, or removed entirely. For example, consider two instructions in pseudocode for some made-up language that stands in for a snippet of compiler IR: a read of a field from one object, followed by a write to a field of another. The goal of effects is to communicate to the compiler if, for example, these two IR instructions can be re-ordered. The second instruction might write to a location that the first one reads. But it also might not! This is about knowing whether the two objects alias: whether they are different names that refer to the same object. We can sometimes answer that question directly, but often it's cheaper to compute an approximate answer: could they even alias? It's possible that the two objects have different types, meaning that (as long as you have strict aliasing) the load and store operations that implement these reads and writes by definition touch different locations. And if they look at disjoint locations, there need not be any explicit order enforced.

Different compilers keep track of this information differently. The null effect analysis gives up and says "every instruction is maximally effectful" and therefore "we can't re-order or delete any instructions". That's probably fine for a first stab at a compiler, where you will get a big speed up purely based on strength reductions. Over-approximations of effects should always be valid. But at some point you start wanting to do dead code elimination (DCE), or common subexpression elimination (CSE), or load/store elimination, or move instructions around, and you start wondering how to represent effects. That's where I am right now. So here's a catalog of different compilers I have looked at recently. There are two main ways I have seen to represent effects: bitsets and heap range lists. We'll look at one example compiler for each, talk a bit about tradeoffs, then give a bunch of references to other major compilers.

We'll start with Cinder, a Python JIT, because that's what I used to work on. Cinder tracks heap effects for its high-level IR (HIR) in instr_effects.h. Pretty much everything happens in one central function, which is expected to know everything about what effects the given instruction might have. The data representation is a bitset representation of a lattice called an AliasClass, defined in alias_class.h. Each bit in the bitset represents a distinct location in the heap: reads from and writes to each of these locations are guaranteed not to affect any of the other locations.
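The alias classes are defined with an X-macro. Here is a condensed, illustrative sketch of the idea, with made-up location names; the real list in alias_class.h is longer and differs in detail:

```cpp
#include <cstdint>

// Illustrative sketch, not Cinder's actual list: each basic location kind
// gets one bit, so distinct kinds are disjoint by construction.
#define BASIC_LOCATIONS(X) \
  X(ListItem)   /* any index of any list  */ \
  X(TupleItem)  /* any index of any tuple */ \
  X(DictItem)   /* any value in any dict  */ \
  X(Global)     /* any entry in a global variable table */

enum BasicLocationBits {
#define X(name) k##name##Bit,
  BASIC_LOCATIONS(X)
#undef X
  kNumBasicLocations,
};

using AliasClass = uint64_t;

#define X(name) constexpr AliasClass k##name = AliasClass{1} << k##name##Bit;
BASIC_LOCATIONS(X)
#undef X

constexpr AliasClass kEmpty = 0;                                       // bottom
constexpr AliasClass kAny = (AliasClass{1} << kNumBasicLocations) - 1; // top

// Union is bitwise or; the overlap query is bitwise and.
constexpr AliasClass unionWith(AliasClass a, AliasClass b) { return a | b; }
constexpr bool mayOverlap(AliasClass a, AliasClass b) { return (a & b) != 0; }
```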
Note that each bit implicitly represents a set: the bit for list items does not refer to a specific list index, but the infinite set of all possible list indices. It's any list index. Still, every list index is completely disjoint from, say, every entry in a global variable table. (And, to be clear, an object in a list might be the same as an object in a global variable table. The objects themselves can alias. But the thing being written to or read from, the thing being side effected, is the container.) Like other bitset lattices, it's possible to union the sets by or-ing the bits. It's possible to query for overlap by and-ing the bits. If this sounds familiar, it's because (as the repo notes) it's a similar idea to Cinder's type lattice representation. Like other lattices, there is both a bottom element (no effects) and a top element (all possible effects): union operations naturally hit a fixpoint at the top element, and intersection operations naturally hit a fixpoint at the bottom element. All of this together lets the optimizer ask and answer questions such as:

- where might this instruction write?
- where does this instruction borrow its input from? (because CPython is reference counted and incref implies ownership)
- do these two instructions' write destinations overlap?

Let's take a look at an (imaginary) IR version of the code snippet in the intro and see what analyzing it might look like in the optimizer: say the fake IR loads an element from a tuple and then stores an element into a list. You can imagine that the load declares that it reads from the tuple-items heap and the store declares that it writes to the list-items heap. Because tuple and list pointers cannot be casted into one another and therefore cannot alias, these are disjoint heaps in our bitset. The intersection of their effects is empty, therefore these memory operations can never interfere! They can (for example) be re-ordered arbitrarily.

In Cinder, these memory effects could in the future be used for instruction re-ordering, but they are today mostly used in two places: the refcount insertion pass and DCE. DCE involves first finding the set of instructions that need to be kept around because they are useful/important/have effects. There are some other checks in Cinder's DCE, but the effects query is right there at the core of it!

Now that we have seen the bitset representation of effects and an implementation in Cinder, let's take a look at a different representation and an implementation in JavaScriptCore. I keep coming back to How I implement SSA form by Fil Pizlo, one of the significant contributors to JavaScriptCore (JSC). In particular, I keep coming back to the Uniform Effect Representation section. This notion of "abstract heaps" felt very… well, abstract. Somehow more abstract than the bitset representation. The pre-order and post-order integer pair as a way to represent nested heap effects just did not click. It didn't make any sense until I actually went spelunking in JavaScriptCore and found one of several implementations, because, you know, JSC is six compilers in a trenchcoat [ citation needed ]. DFG, B3, DOMJIT, and probably others all have their own abstract heap implementations. We'll look at DOMJIT mostly because it's a smaller example and also illustrates something else that's interesting: builtins. We'll come back to builtins in a minute.

Let's take a look at how DOMJIT structures its abstract heaps: a YAML file. It's a hierarchy: each named heap is a subheap of its parent, and so on up the tree, so a write to any subheap is also a write to every heap above it. Sibling heaps are unrelated: two siblings are, by construction, disjoint. To get a feel for this, I wired up a simplified version of ZJIT's bitset generator (for types!) to read a YAML document and generate a bitset. The generator is not a fancy X-macro, but a short and flexible Ruby script.
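It generated Rust code along these lines; this is an illustrative reconstruction with made-up heap names, not the actual generated file:

```rust
// Sketch of the generator's output: each leaf heap gets its own bit, and an
// interior heap is the union of the bits of everything below it, so a write
// to a subheap automatically overlaps with a read of any of its ancestors.
pub type AbstractHeap = u64;

pub const NODE_FIRST_CHILD: AbstractHeap = 1 << 0;
pub const NODE_LAST_CHILD: AbstractHeap = 1 << 1;
pub const NODE: AbstractHeap = NODE_FIRST_CHILD | NODE_LAST_CHILD;
pub const DOM: AbstractHeap = NODE; // | ...the rest of DOM's children

/// Two heaps may interfere exactly when their bitsets intersect.
pub fn overlaps(a: AbstractHeap, b: AbstractHeap) -> bool {
    a & b != 0
}
```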
Then I took the DOMJIT abstract heap generator (also, funnily enough, a short Ruby script), modified the output format slightly, and had it generate its int pairs: each heap gets a (begin, end) pair, where a parent heap's range spans the ranges of all of its children. The generator's output even comes with a little diagram, which is super helpful for readability. Empty ranges represent empty heap effects: if the start and end are the same number, there are no effects. There is no single designated empty value, but any empty range could be normalized to one. Maybe this was obvious to you, dear reader, but this pre-order/post-order thing is about nested ranges! Seeing the output of the generator laid out clearly like this made it make a lot more sense for me. What about checking overlap? The implementation in JSC is the classic half-open interval test: two ranges overlap exactly when each one begins before the other one ends. (See also How to check for overlapping intervals and Range overlap in two compares for more fun.)

While bitsets are a dense representation (you have to hold every bit), they are very compact and they are very precise. You can hold any combination of 64 or 128 bits in a register or two. The union and intersection operations are very cheap. With int ranges, it's a little more complicated. An imprecise union of two ranges is the smallest range that covers both, which may also cover unrelated heaps between them. To get a more precise union, you have to keep track of both ranges. In the worst case, if you want efficient arbitrary queries, you need to store your int ranges in an interval tree. So what gives? I asked Fil: if both bitsets and int ranges answer the same question, why use int ranges? He said that it's more flexible long-term: bitsets get expensive as soon as you need over 128 bits (you might need to heap allocate them!) whereas ranges have no such ceiling. But doesn't holding sequences of ranges require heap allocation? Well, despite Fil writing this in his SSA post: "The purpose of the effect representation baked into the IR is to provide a precise always-available baseline for alias information that is super easy to work with. […] you can have instructions report that they read/write multiple heaps […] you can have a utility function that produces such lists on demand. It's important to note that this doesn't actually involve any allocation of lists." JSC does this very clever thing where they have "functors" that they pass in as arguments that compress/summarize what they want out of an instruction's effects.

Let's take a look at how the DFG (for example) uses these heap ranges in analysis. The DFG is structured in such a way that it can make use of the DOMJIT heap ranges directly, which is neat: the DOMJIT heap range is a thin wrapper over the DFG compiler's own equivalent. The function that calls these functors (a read functor and a write functor, in this case) for each effect that the given IR instruction declares is called clobberize. I've pulled some relevant snippets of clobberize, which is quite long, that I think are interesting. First, some instructions (constants, here) have no effects. There's some utility in the def call but I didn't understand it fully. Then there are some instructions that conditionally have effects depending on the use types of their operands. [1] Taking the absolute value of an Int32 or a Double is effect-free, but otherwise the operation looks like it can run arbitrary code. Some run-time IR guards that might cause side exits are annotated as such: they write to the heap. Local variable instructions read specific heaps indexed by what looks like the local index, but I'm not sure. This means accessing two different locals won't alias! Instructions that allocate can't be re-ordered, it looks like; they both read and write the heap that tracks object allocation.
This probably limits the amount of allocation sinking that can be done. Then there's the DOM call node, which is the builtins stuff I was talking about. We'll come back to that in a moment. (Remember that these DFG heaps are very similar to DOMJIT's, with a couple more details, and in some cases they even contain DOMJIT heap ranges!) This node is the way for the DOM APIs in the browser (a significant chunk of the builtins, which are written in C++) to communicate what they do to the optimizing compiler. Without any annotations, the JIT has to assume that a call into C++ could do anything to the JIT state. Bummer! But because, for example, a DOM getter annotates what memory it reads from and what it doesn't write to, the JIT can optimize around it better, or even remove the access completely. It means the JIT can reason about calls to known builtins the same way that it reasons about normal JIT opcodes. (Incidentally, it looks like it doesn't even make a C call, but instead is inlined as a little memory read snippet using a JIT builder API. Neat.)

Last, we'll look at Simple, which has a slightly different take on all of this. Simple is Cliff Click's pet Sea of Nodes (SoN) project to try and showcase the idea to the world, outside of a HotSpot C2 context. This one is a little harder for me to understand, but it looks like each translation unit has a node that doles out different classes of memory nodes for each alias class. Each IR node then takes data dependencies on whatever effect nodes it might use. Alias classes are split up based on the paper Type-Based Alias Analysis (PDF): "Our approach is a form of TBAA similar to the 'FieldTypeDecl' algorithm described in the paper." The Simple project is structured into sequential implementation stages, and alias classes come into the picture in Chapter 10.

Because I spent a while spelunking through other implementations to see how other projects did this, here is a list of the projects I looked at. Mostly, they use bitsets.

HHVM, a JIT for the Hack language, also uses a bitset for its memory effects. See for example: alias-class.h and memory-effects.h. HHVM has a couple places that use this information, such as a definition-sinking pass, alias analysis, DCE, store elimination, refcount opts, and more. If you are wondering why the HHVM representation looks similar to the Cinder representation, it's because some former HHVM engineers such as Brett Simmers also worked on Cinder!

Android's ART Java runtime also uses a bitset for its effect representation. It's a very compact class called SideEffects in nodes.h (note that I am linking an ART fork on GitHub as a reference, but the upstream code is hosted on googlesource). The side effects are used in loop-invariant code motion, global value numbering, write barrier elimination, scheduling, and more.

CoreCLR mostly uses a bitset for its side-effect class. This one is interesting though because it also splits out effects specifically to include sets of local variables.

V8 is also about six completely different compilers in a trenchcoat. Turboshaft uses a struct in operations.h called OpEffects, which is two bitsets for reads/writes of effects. This is used in value numbering as well as a bunch of other small optimization passes they call "reducers". Maglev also has a bitset-like effects representation on its IR nodes that is used in its various reducers, with effect query methods for reads and writes. Until recently, V8 also used Sea of Nodes as its IR representation, which tracks side effects more explicitly in the structure of the IR itself.
Guile Scheme looks like it has a custom tagging scheme type thing. Both bitsets and int ranges are perfectly cromulent ways of representing heap effects for your IR. The Sea of Nodes approach is also probably okay, since it powers HotSpot C2 and (for a time) V8. Remember to ask the right questions of your IR when doing analysis. Thank you to Fil Pizlo for writing his initial GitHub Gist and sending me on this journey, and thank you to Chris Gregory, Brett Simmers, and Ufuk Kayserilioglu for feedback on making some of the explanations more helpful.

[1] This is because the DFG compiler does this interesting thing where they track and guard the input types on use vs having types attached to the input's own def. It might be a clean way to handle shapes inside the type system while also allowing the type+shape of an object to change over time (which it can do in many dynamic language runtimes). ↩

Pat Shaughnessy 2 weeks ago

YARV’s Internal Stack and Your Ruby Stack

I've started working on a new edition of Ruby Under a Microscope that covers Ruby 3.x. I'm working on this in my spare time, so it will take a while. Leave a comment or drop me a line and I'll email you when it's finished. The content of Chapter 3, about the YARV virtual machine, hasn't changed much since 2014. However, I did update all of the diagrams to account for some new values YARV now saves inside of each stack frame, and some of the common YARV instructions were renamed as well. I also moved some content that was previously part of Chapter 4 here into Chapter 3. Right now I'm rewriting Chapter 4 from scratch, describing Ruby's new JIT compilers.

As we'll see in a moment, YARV uses a stack internally to track intermediate values, arguments, and return values. YARV is a stack-oriented virtual machine. In addition to its own internal stack, YARV keeps track of your Ruby program's call stack, recording which methods call which other methods, functions, blocks, lambdas, and so on. In fact, YARV is not just a stack machine - it's a double-stack machine! It has to track the arguments and return values not only for its own internal instructions but also for your Ruby program. Figure 3-1 shows YARV's basic registers and internal stack. YARV's internal stack is on the left. The SP label is the stack pointer, or the location of the top of the stack. On the right are the instructions that YARV is executing. PC is the program counter, or the location of the current instruction. You can see the YARV instructions that Ruby compiled from the puts 2+2 example on the right side of Figure 3-1.

YARV stores both the SP and PC registers in a C structure called rb_control_frame_t, along with the current value of Ruby's self variable and some other values not shown here. At the same time, YARV maintains another stack of these rb_control_frame_t structures, as shown in Figure 3-2. This second stack of rb_control_frame_t structures represents the path that YARV has taken through your Ruby program, and YARV's current location. In other words, this is your Ruby call stack - what you would see if you ran puts caller. The CFP pointer indicates the current frame. Each stack frame in your Ruby program stack contains, in turn, a different value for the self, PC, and SP registers, as shown in Figure 3-1. Ruby also keeps track of the type of code running at each level in your Ruby call stack, indicated by the "[BLOCK]" and "[METHOD]" notation in Figure 3-2.

In order to help you understand this a bit better, here are a couple of examples. I'll begin with the simple 2+2 example from Chapters 1 and 2, shown again in Listing 3-1. This one-line Ruby script doesn't have a Ruby call stack to speak of, so I'll focus on the internal YARV stack for now. Figure 3-3 shows how YARV will execute this script, beginning with the first instruction, putself. As you can see in Figure 3-3, YARV starts the program counter (PC) at the first instruction, and initially the stack is empty. Now YARV executes the putself instruction, and pushes the current value of self onto the stack, as shown in Figure 3-4. Because this simple script contains no Ruby objects or classes, the self pointer is set to the default top self object. This is an instance of the Object class that Ruby automatically creates when YARV starts. It serves as the receiver for method calls and the container for instance variables in the top-level scope.
The top self object contains a single, predefined to_s method, which returns the string "main." You can call this method yourself by running ruby -e 'puts self' in a console: it prints main. YARV will use this self value on the stack when it executes the opt_send_without_block instruction: self is the receiver of the puts method because I didn't specify a receiver for this method call. Next, YARV executes putobject 2. It pushes the numeric value 2 onto the stack and increments the PC again, as shown in Figure 3-5. This is the first step of the receiver (arguments) operation pattern described in "How Ruby Compiles a Simple Script" on page 34. First, Ruby pushes the receiver onto the internal YARV stack. In this example, the Integer object 2 is the receiver of the message/method +, which takes a single argument, also a 2. Next, Ruby pushes the argument 2, as shown in Figure 3-6.

Finally, Ruby executes the + operation. In this case, opt_plus is an optimized instruction that will add two values: the receiver and the argument, as shown in Figure 3-7. As you can see in Figure 3-7, the opt_plus instruction leaves the result, 4, at the top of the stack. Now Ruby is perfectly positioned to execute the puts function call: the receiver self is first on the stack, and the single argument, 4, is at the top of the stack. (I'll describe how method lookup works in Chapter 6.) Next, Figure 3-8 shows what happens when Ruby executes the puts method call. As you can see, the opt_send_without_block instruction leaves the return value, nil, at the top of the stack. Finally, Ruby executes the last instruction, leave, which finishes the execution of our simple, one-line Ruby program. Of course, when Ruby executes the puts call, the C code implementing the puts function will actually display the value 4 in the console output.
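You can watch these instructions yourself using the RubyVM API that ships with CRuby; a quick sketch (the output shown in comments is abbreviated and varies between Ruby versions):

```ruby
# Compile the one-line script and print its YARV instruction sequence.
puts RubyVM::InstructionSequence.compile("puts 2+2").disasm

# On Ruby 3.x this prints something like:
#   0000 putself
#   0001 putobject  2
#   0003 putobject  2
#   0005 opt_plus   <calldata!mid:+, argc:1, ARGS_SIMPLE>
#   0007 opt_send_without_block <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
#   0009 leave
```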

Simon Willison 3 weeks ago

Code research projects with async coding agents like Claude Code and Codex

I've been experimenting with a pattern for LLM usage recently that's working out really well: asynchronous code research tasks. Pick a research question, spin up an asynchronous coding agent and let it go and run some experiments and report back when it's done. Software development benefits enormously from something I call code research. The great thing about questions about code is that they can often be definitively answered by writing and executing code. I often see questions on forums which hint at a lack of understanding of this skill. "Could Redis work for powering the notifications feed for my app?" is a great example. The answer is always "it depends", but a better answer is that a good programmer already has everything they need to answer that question for themselves. Build a proof-of-concept, simulate the patterns you expect to see in production, then run experiments to see if it's going to work. I've been a keen practitioner of code research for a long time. Many of my most interesting projects started out as a few dozen lines of experimental code to prove to myself that something was possible. It turns out coding agents like Claude Code and Codex are a fantastic fit for this kind of work as well. Give them the right goal and a useful environment and they'll churn through a basic research project without any further supervision.

LLMs hallucinate and make mistakes. This is far less important for code research tasks because the code itself doesn't lie: if they write code and execute it and it does the right things then they've demonstrated to both themselves and to you that something really does work. They can't prove something is impossible - just because the coding agent couldn't find a way to do something doesn't mean it can't be done - but they can often demonstrate that something is possible in just a few minutes of crunching.

I've used interactive coding agents like Claude Code and Codex CLI for a bunch of these, but today I'm increasingly turning to their asynchronous coding agent family members instead. An asynchronous coding agent is a coding agent that operates on a fire-and-forget basis. You pose it a task, it churns away on a server somewhere and when it's done it files a pull request against your chosen GitHub repository. OpenAI's Codex Cloud, Anthropic's Claude Code for web, Google Gemini's Jules, and GitHub's Copilot coding agent are four prominent examples of this pattern. These are fantastic tools for code research projects. Come up with a clear goal, turn it into a few paragraphs of prompt, set them loose and check back ten minutes later to see what they've come up with. I'm firing off 2-3 code research projects a day right now. My own time commitment is minimal and they frequently come back with useful or interesting results.

You can run a code research task against an existing GitHub repository, but I find it's much more liberating to have a separate, dedicated repository for your coding agents to run their projects in. This frees you from being limited to research against just code you've already written, and also means you can be much less cautious about what you let the agents do. I have two repositories that I use for this - one public, one private. I use the public one for research tasks that have no need to be private, and the private one for anything that I'm not yet ready to share with the world. The biggest benefit of a dedicated repository is that you don't need to be cautious about what the agents operating in that repository can do.
Both Codex Cloud and Claude Code for web default to running agents in a locked-down environment, with strict restrictions on how they can access the network. This makes total sense if they are running against sensitive repositories - a prompt injection attack of the lethal trifecta variety could easily be used to steal sensitive code or environment variables. If you're running in a fresh, non-sensitive repository you don't need to worry about this at all! I've configured my research repositories for full network access, which means my coding agents can install any dependencies they need, fetch data from the web and generally do anything I'd be able to do on my own computer.

Let's dive into some examples. My public research repository is at simonw/research on GitHub. It currently contains 13 folders, each of which is a separate research project. I only created it two weeks ago so I'm already averaging nearly one a day! It also includes a GitHub Workflow which uses GitHub Models to automatically update the README file with a summary of every new project, using Cog, LLM, llm-github-models and this snippet of Python. Here are some example research projects from the repo.

node-pyodide shows an example of a Node.js script that runs the Pyodide WebAssembly distribution of Python inside it - yet another of my ongoing attempts to find a great way of running Python in a WebAssembly sandbox on a server.

python-markdown-comparison (transcript) provides a detailed performance benchmark of seven different Python Markdown libraries. I fired this one off because I stumbled across cmarkgfm, a Python binding around GitHub's Markdown implementation in C, and wanted to see how it compared to the other options. This one produced some charts! cmarkgfm came out on top by a significant margin. Here's the entire prompt I used for that project: Create a performance benchmark and feature comparison report on PyPI cmarkgfm compared to other popular Python markdown libraries - check all of them out from github and read the source to get an idea for features, then design and run a benchmark including generating some charts, then create a report in a new python-markdown-comparison folder (do not create a _summary.md file or edit anywhere outside of that folder). Make sure the performance chart images are directly displayed in the README.md in the folder. Note that I didn't specify any Markdown libraries other than cmarkgfm - Claude Code ran a search and found the other six by itself.

cmarkgfm-in-pyodide is a lot more fun. A neat thing about having all of my research projects in the same repository is that new projects can build on previous ones. Here I decided to see how hard it would be to get cmarkgfm - which has a C extension - working inside Pyodide inside Node.js. Claude successfully compiled an 88.4KB file with the necessary C extension and proved it could be loaded into Pyodide in WebAssembly inside of Node.js. I ran this one using Claude Code on my laptop after an initial attempt failed. The starting prompt was: Figure out how to get the cmarkgfm markdown lover [typo in prompt, this should have been "library" but it figured it out anyway] for Python working in pyodide. This will be hard because it uses C so you will need to compile it to pyodide compatible webassembly somehow. Write a report on your results plus code to a new cmarkgfm-in-pyodide directory.
Test it using pytest to exercise a node.js test script that calls pyodide as seen in the existing node.js and pyodide directory. There is an existing branch that was an initial attempt at this research, but which failed because it did not have Internet access. You do have Internet access. Use that existing branch to accelerate your work, but do not commit any code unless you are certain that you have successfully executed tests that prove that the pyodide module you created works correctly.

This one gave up half way through, complaining that emscripten would take too long. I told it: Complete this project, actually run emscripten, I do not care how long it takes, update the report if it works. It churned away for a bit longer and complained that the existing Python library used CFFI which isn't available in Pyodide. I asked it: Can you figure out how to rewrite cmarkgfm to not use FFI and to use a pyodide-friendly way of integrating that C code instead? ... and it did. You can see the full transcript here.

blog-tags-scikit-learn. Taking a short break from WebAssembly, I thought it would be fun to put scikit-learn through its paces on a text classification task against my blog: Work in a new folder called blog-tags-scikit-learn. Download - a SQLite database. Take a look at the blog_entry table and the associated tags - a lot of the earlier entries do not have tags associated with them, where the later entries do. Design, implement and execute models to suggests tags for those earlier entries based on textual analysis against later ones. Use Python scikit learn and try several different strategies. Produce JSON of the results for each one, plus scripts for running them and a detailed markdown description. Also include an HTML page with a nice visualization of the results that works by loading those JSON files.

This resulted in seven files, four results files and a detailed report. (It ignored the bit about an HTML page with a nice visualization for some reason.) Not bad for a few moments of idle curiosity typed into my phone! That's just three of the thirteen projects in the repository so far. The commit history for each one usually links to the prompt and sometimes the transcript if you want to see how they unfolded. More recently I added a short file to the repo with a few extra tips for my research agents. You can read that here.

My preferred definition of AI slop is AI-generated content that is published without human review. I've not been reviewing these reports in great detail myself, and I wouldn't usually publish them online without some serious editing and verification. I want to share the pattern I'm using though, so I decided to keep them quarantined in this one public repository. A tiny feature request for GitHub: I'd love to be able to mark a repository as "exclude from search indexes" such that it gets labelled with noindex tags. I still like to keep AI-generated content out of search, to avoid contributing more to the dead internet.

It's pretty easy to get started trying out this coding agent research pattern. Create a free GitHub repository (public or private) and let some agents loose on it and see what happens. You can run agents locally but I find the asynchronous agents to be more convenient - especially as I can run them (or trigger them from my phone) without any fear of them damaging my own machine or leaking any of my private data. Claude Code for web offers a free $250 of credits for their $20/month users for a limited time (until November 18, 2025).
Gemini Jules has a free tier. There are plenty of other coding agents you can try out as well. Let me know if your research agents come back with anything interesting!

Abhinav Sarkar 3 weeks ago

A Short Survey of Compiler Targets

As an amateur compiler developer, one of the decisions I struggle with is choosing the right compiler target. Unlike the '80s, when people had to target various machine architectures directly, now there are many mature options available. This is a short and very incomplete survey of some of the popular and interesting options.

A compiler can always directly output machine code or assembly targeted for one or more architectures. A well-known example is the Tiny C Compiler. It's known for its speed and small size, and it can compile and run C code on the fly. Another such example is Turbo Pascal. You could do this with your compiler too, but you'll have to figure out the intricacies of the instruction set architecture (ISA) of each machine you want to target, as well as concepts like register allocation.

Most modern compilers actually don't emit machine code or assembly directly. They lower the source code down to a language-agnostic intermediate representation (IR) first, and then generate machine code for major architectures (x86-64, ARM64, etc.) from it. The most prominent tool in this space is LLVM. It's a large, open-source compiler-as-a-library. Compilers for many languages such as Rust, Swift, C/C++ (via Clang), and Julia use LLVM as an IR to emit machine code. An alternative is the GNU C compiler (GCC), via its GIMPLE IR, though no compilers outside the GCC project seem to use it directly. GCC can be used as a library to compile code, much like LLVM, via libgccjit, which is used in Emacs to just-in-time (JIT) compile Elisp. Cranelift is another new option in this space, though it supports only a few ISAs. For those who find LLVM or GCC too large or slow to compile, minimalist alternatives exist. QBE is a small backend focused on simplicity, targeting "70% of the performance in 10% of the code". It's used by the language Hare, which prioritizes fast compile times. Another option is libFIRM, which uses a graph-based SSA representation instead of a linear IR.

Sometimes you are okay with letting other compilers/runtimes take care of the heavy lifting. You can transpile your code to another established high-level language and leverage that language's existing compiler/runtime and toolchain. A common target in such cases is C. Since C compilers exist for nearly all platforms, generating C code makes your language highly portable. This is the strategy used by Chicken Scheme and Vala. Or you could compile to C++ instead, like Jank, if that's your thing. There is also C--, a C-like language designed as a portable compilation target; variants of it are used internally by GHC and OCaml. Another ubiquitous target is JavaScript (JS), which is one of the two options (the other being WebAssembly) for running code natively in a web browser or one of the JS runtimes (Node, Deno, Bun). Multiple languages such as TypeScript, PureScript, Reason, ClojureScript, Dart and Elm transpile to JS. Nim, interestingly, can transpile to C, C++ or JS. Another target similar to JS is Lua, a lightweight and embeddable scripting language, which languages such as MoonScript and Fennel transpile to. A more niche approach is to target a Lisp dialect. Compiling to Chez Scheme, for example, allows you to leverage its macro system, runtime, and compiler. Idris 2 and Racket use Chez Scheme as their primary backend targets.

Compiling to a portable bytecode for a virtual machine (VM) is a common choice for application languages. VMs generally come with features like garbage collection, JIT compilation, and security sandboxing. The Java Virtual Machine (JVM) is probably the most popular one.
It’s the target for many languages including Java , Kotlin , Scala , Groovy , and Clojure . Its main competitor is the Common Language Runtime , originally developed by Microsoft , which is targeted by languages such as C# , F# , and Visual Basic.NET . Another notable VM is the BEAM , originally built for Erlang . The BEAM VM isn’t built for raw computation speed but for high concurrency, fault tolerance, and reliability. Recently, new languages such as Elixir and Gleam have been created to target it. Finally, this category also includes MoarVM —the spiritual successor to the Parrot VM —built for the Raku (formerly Perl 6) language. WebAssembly (Wasm) is a relatively new target. It’s a portable binary instruction format focused on security and efficiency. Wasm is supported by all major browsers, but not limited to them. The WebAssembly System Interface (WASI) standard provides APIs for running Wasm in non-browser and non-JS environments. Wasm is now targeted by many languages such as Rust , C/C++ , Go , Kotlin , Scala , Zig , and Haskell . Meta-tracing frameworks are a more complex category. These are not the targets for your compiler backend, instead, you use them to build a custom JIT compiler for your language by specifying an interpreter for it. The most well-known example is PyPy , an implementation of Python , created using the RPython framework. Another such framework is GraalVM/Truffle , a polyglot VM and meta-tracing framework from Oracle . Its main feature is zero-cost interoperability: code from GraalJS , TruffleRuby , and GraalPy can all run on the same VM, and can call each other directly. Move past the mainstream, and you’ll discover a world of unconventional and esoteric compiler targets. Developers pick them for academic curiosity, artistic expression, or to test the boundaries of viable compilation targets. Brainfuck: An esoteric language with only eight commands, Brainfuck is Turing-complete and has been a target for compilers as a challenge. People have written compilers for C , Haskell and Lambda calculus . Lambda calculus: Lambda calculus is a minimal programming languages that expresses computation solely as functions and their applications. It is often used as the target of educational compilers because of its simplicity, and its link to the fundamental nature of computation. Hell , a subset of Haskell, compiles to Simply typed lambda calculus . SKI combinators: The SKI combinator calculus is even more minimal than lambda calculus. All programs in SKI calculus can be composed of only three combinators: S, K and I. MicroHs compiles a subset of Haskell to SKI calculus. JSFuck: Did you know that you can write all possible JavaScript programs using only six characters ? Well, now you know . Postscript: Postscript is also a Turing-complete programming language. Your next compiler could target it! Regular Expressions ? Lego ? Cellular automata ? I’m going to write a compiler from C++ to JSFuck. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! This post was originally published on abhinavsarkar.net . If you liked this post, please leave a comment . Machine Code / Assembly Intermediate Representations Other High-level Languages Virtual Machines / Bytecode WebAssembly Meta-tracing Frameworks Unconventional Targets Brainfuck: An esoteric language with only eight commands, Brainfuck is Turing-complete and has been a target for compilers as a challenge. 

Pat Shaughnessy 1 months ago

Parsing: How Ruby Understands Your Code

I've started working on a new edition of Ruby Under a Microscope that covers Ruby 3.x. I'm working on this in my spare time, so it will take a while. Leave a comment or drop me a line and I'll email you when it's finished. Update: I've made a lot of progress so far this year. I had time to completely rewrite Chapters 1 and 2, which cover Ruby's new Prism parser and the Ruby compiler which now handles the Prism AST. I also updated Chapter 3 about YARV, and right now I'm working on rewriting Chapter 4, which will cover YJIT and possibly other Ruby JIT compilers. Here's an excerpt from the new version of Chapter 1. Many thanks to Kevin Newton, who reviewed the content about Prism and had a number of corrections and great suggestions. Also thanks to Douglas Eichelberger, who had some great feedback as well. I'll post more excerpts from Chapters 2, 3 and 4 in the coming weeks. Thanks for everyone's interest in Ruby Under a Microscope!

Once Ruby converts your code into a series of tokens, what does it do next? How does it actually understand and run your program? Does Ruby simply step through the tokens and execute each one in order? No. Your code still has a long way to go before Ruby can run it. The next step on its journey through Ruby is called parsing, where words or tokens are grouped into sentences or phrases that make sense to Ruby. When parsing, Ruby takes into account the order of operations, methods, blocks, and other larger code structures. Ruby's parsing engine defines Ruby's syntax rules. Reading in tokens, Ruby matches the token types and the order the tokens appear with a large series of patterns. These patterns, indeed, are the heart and soul of the Ruby language. How we write a function call, how we define a method using the def keyword, how we write classes and modules - the patterns Ruby looks for define the language. Ruby's parse algorithm has three high level steps:

- Identify: First, Ruby identifies what the next token represents. Ruby does this by comparing the token's type - and possibly the types of the following tokens - with a large series of patterns. If one pattern matches, Ruby understands what your code means. If not, Ruby emits a syntax error.
- Recurse: Second, Ruby calls itself. Each value in one of the syntax patterns can itself be a subexpression - a smaller program that Ruby needs to parse. To do this, Ruby calls itself recursively.
- Compare: Third, Ruby compares the current token with the next token to determine which has a higher precedence. This comparison leads Ruby down a specific path, processing the tokens in a certain order.

Let's break down these ideas further, by following Ruby through the "Hello World" program. Afterwards, we'll look at a second, slightly more complicated example. As we saw in the previous section, Ruby first converts the text in this code file into tokens. For Hello World, Ruby's tokenizer produces five tokens. To make the following diagrams simpler, let's redraw these tokens in a more compact format: using a single gray line of text, Figure 1-15 shows the five tokens from Figure 1-14. First, PM_TOKEN_IDENTIFIER represents the word "puts" from the beginning of the program. Next, three tokens make up the string literal value: PM_TOKEN_STRING_BEGIN for the first double quote, followed by PM_TOKEN_STRING_VALUE for the words Hello and World, and PM_TOKEN_STRING_END represents the second quote. Finally, the program ends with PM_TOKEN_EOF to mark the end of the source code file.

Now let's follow Ruby as it processes the Hello World example using the three steps: identify, recurse and compare. First, identify. How does Ruby understand what the first token, PM_TOKEN_IDENTIFIER, means? Figure 1-16 represents the state of Ruby's parser when it starts to parse this code. At this moment, Ruby is just getting started by inspecting the puts identifier. One of the patterns Ruby looks for matches the identifier; but what does this identifier mean? Ruby knows puts could be a local variable, or it could be the name of a function to call. Since there are no local variables defined in this program, Ruby determines that the puts identifier represents a function the program is calling.
(It’s also possible that the program is about to create a new local variable like this: puts = "Hello World" . If that were the case, Ruby would see the assignment operator next and parse things differently.) What happens next? After matching the token to the function call pattern, Ruby records this match in a data structure called an abstract syntax tree (AST). Ruby and most other programming languages use ASTs to record the results of parsing tokens like this. As we’ll see, the AST’s tree structure is well suited for holding the nested, recursive structure of computer programs. Figure 1-17 shows the first node Ruby saves in the AST tree. In a moment, Ruby will begin to add more nodes to the AST. Before proceeding to the next token, let’s imagine the syntax pattern for a function call: Although in Ruby the parentheses are optional, so this pattern also applies: NOTE The original version of the Ruby parser used patterns or grammar rules like this directly with a tool called a parser generator. However, starting with Ruby 3.3, Ruby uses a new parser called Prism, which detects these patterns directly using hand written C code. After parsing the first token, Ruby inspects the second token. According to the function call pattern, Ruby knows the second token might represent the first argument to the function call. But, how many arguments are there? And what is each argument? The program in Listing 1-11 is very simple, but it could have instead printed a very complex expression - the arguments to puts could have run on for many lines and used hundreds of tokens. Second, recurse . To parse each of the arguments to puts, Ruby has to call itself. Figure 1-18 shows two levels of the Ruby parser’s call stack; the top line shows Ruby parsing the puts identifier token, and matching the function call pattern. The second line shows how Ruby called itself to parse the second token, PM_TOKEN_STRING_BEGIN , the leading quote of the string literal. Think of these lines as the backtrace of the Ruby parser. Figure 1-18 also shows a value 14 on the right side. While calling itself recursively, Ruby passes in a numeric value called the binding power . We’ll return to this later. Now that Ruby has called itself, Ruby starts the 3-step process all over again: identify, recurse and compare. This time, Ruby has to identify what the PM_TOKEN_STRING_BEGIN token means. This token always indicates the start of a string value. In this example PM_TOKEN_STRING_BEGIN represents the double quote that appears after puts . But the same token might represent a single quote or one of the other ways you can write a string in Ruby, for example using %Q or %q . Ruby’s new parser, Prism, next parses the string contents directly by processing the following two tokens: In this example, Ruby’s parser is done after finding the PM_TOKEN_STRING_END token and can continue to the next step. More complex strings - strings that contain interpolated values using #{} for example - might have required Ruby to call itself yet again to process more nested expressions. But for the simple "Hello World" string Ruby is done. To record the string value, Ruby creates a new AST node called PM_STRING_NODE . Figure 1-20 shows two AST nodes Ruby has created so far: the call node created earlier, and now a new string node. Ruby’s parser is a recursive descent parser . 
This Computer Science term describes parsers that resemble the grammar or syntax rules of the programs they parse, and call themselves recursively in a top-down manner as they process nested structures. Many modern programming languages today use this general approach.
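If you'd like to poke at Prism's tokens and AST yourself, the prism gem exposes both; here's a quick sketch (output elided, and details vary by Ruby version):

```ruby
require "prism"

# Lex the one-line program: Prism.lex returns [token, lex state] pairs.
Prism.lex('puts "Hello World"').value.each do |token, _state|
  puts token.type   # :IDENTIFIER, :STRING_BEGIN, ...
end

# Parse it: the result's value is the root node of Prism's AST.
pp Prism.parse('puts "Hello World"').value
```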

Blargh 1 months ago

The strange webserver hot potato — sending file descriptors

I've previously mentioned my io-uring webserver tarweb. I've now added another interesting aspect to it. As you may or may not be aware, on Linux it's possible to send a file descriptor from one process to another over a unix domain socket. That's actually pretty magic if you think about it. You can also send unix credentials and SELinux security contexts, but that's a story for another day.

I want to run some domains using my webserver "tarweb". But not all. And I want to host them on a single IP address, on the normal HTTPS port 443. Simple, right? Just use nginx's SNI-based stream proxying? Ah, but I don't want nginx to stay in the path. After SNI (read: "browser saying which domain it wants") has been identified, I want the TCP connection to go directly from the browser to the correct backend. I'm sure somewhere on the internet there's already an SNI router that does this, but all the ones I found stay in line with the request path, adding a hop. A few reasons I want nginx out of the path:

- I want the backend to know the real client IP address, via getpeername() or similar, on the socket itself.
- I don't want restarting nginx to cut existing connections to backends.
- I'd like to use TLS keys that the nginx user doesn't have access to.
- I used nginx for livecount, and last time I got blog posts on HackerNews nginx ran out of file descriptors, and started serving 500s for the blog, which serves just plain old static files on disk. For now I've moved livecount to a different port, but in the long run I want it back on port 443, and yet isolated from nginx so that the latter keeps working even if livecount is overloaded. Livecount has an open websocket to every open browser tab in the world reading a given page, so they add up. (No, it doesn't log. It just keeps count.)

I built a proof of concept SNI router. It is a frontline server receiving TCP connections, on which it then snoops the SNI from the TLS ClientHello, and routes the connection according to its given rules. Anything it reads from the socket is sent along to the real backend, along with the file descriptor. So the backend (in my use that's tarweb) needs to have code cooperating to receive the new connection. It's not the cleanest code, but it works. I got ChatGPT to write the boring "parse the TLS record / ClientHello" parts. Rust is a memory safe language, so "how bad could it be?". :-) It seems to work for all the currently used TLS versions.

As I said, it requires the backend to be ready to receive "hey, here's a file descriptor, and here's the first few hundred bytes you should treat as if you've read them from the client". File descriptors don't have an operation to "unread". If they did then this would be easier. Then it would "just" be a matter of giving a backend webserver a file descriptor. For some use cases that could mean starting a new webserver process that reads and writes from stdin/stdout. Not super efficient to go back to the fork-exec-per-connection model from the previous century, but it would achieve the direct connection. But the details are academic. We do need to pass along the snooped bytes somehow, or the TLS handshake won't succeed. Which means it does need cooperation from the backend.

Because the SNI router never writes to the client, and therefore doesn't perform a TLS handshake, it doesn't need any private keys or certificates. The SNI router has no secrets, and sees no secrets. I also added a mode that proxies the TCP connection, if some SNI should be routed to a different server. But of course then it's not possible to pass the file descriptor. So encrypted bytes will bounce on the SNI router for that kind of flow. But still the SNI router is not able to decrypt anything. A downside is of course that bouncing the connection around the world will slow it down, add latency, and waste resources. So pass the file descriptor where possible.

So now my setup has the SNI router accept the connection, and then throw the very file descriptor over to tarweb, saying "you deal with this TCP connection".
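The mechanics of "throwing" the descriptor are sendmsg with SCM_RIGHTS ancillary data over the unix domain socket. Here's a minimal sketch of the sending side (illustrative C, not the actual router code; the receiver mirrors it with recvmsg):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send `fd_to_pass` plus the already-snooped ClientHello bytes to the
 * backend over a unix domain socket.  The kernel installs a duplicate of
 * the descriptor in the receiving process.  Error handling trimmed. */
ssize_t send_fd(int unix_sock, int fd_to_pass, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    union {                       /* aligned buffer for one cmsg holding an int */
        struct cmsghdr align;
        char bytes[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.bytes,
        .msg_controllen = sizeof u.bytes,
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;      /* "the payload is file descriptors" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(unix_sock, &msg, 0);
}
```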
Tarweb does the TLS handshake, and then throws the TLS session keys over to the kernel, saying “I can’t be bothered doing encryption, you do it”, and then actually handles the HTTP requests. Well actually, there’s another strange indirection. When tarweb receives a file descriptor, it uses io-uring “registered files” to turn it into a “fixed file handle”, and closes the original file descriptor. On the kernel side there’s still a file descriptor of course, but there’s nothing in /proc/<pid>/fd. This improves performance a bit on the linux kernel side. The SNI router does not use io-uring. At least not yet. The SNI router’s job is much smaller (it doesn’t even do a TLS handshake), much more brief (it almost immediately passes the file descriptor to tarweb), and has much less concurrency (because the connections are so short lived as far as it’s concerned), so it may not be worth it. In normal use the SNI router only needs these syscalls per connection:

- an accept for the new connection,
- a read of a few hundred bytes of ClientHello,
- a sendmsg of same size to pass it on,
- a close to forget the file descriptor.

At the risk of going off on an unrelated tangent, HTTP/3 (QUIC-based) has an interesting way of telling a client to “go over there”. A built in load balancer inside the protocol, you could say, sparing the load balancer needing to proxy everything. This opens up opportunities to steer not just on SNI, and is much more flexible than DNS, all without needing the “proxy” to be inline. E.g. say a browser is in Sweden, and you have servers in Norway and Italy. And say you have measured, and find that it would be best if the browser connected to your Norway server. But due to peering agreements and other fun stuff, Italy will be preferred on any BGP anycasted address. You then have a few possible options, and I do mean they’re all possible:

- Have the browser connect to a Norway-specific hostname, with Norway-specific IP addresses. Not great. People will start bookmarking these URLs, and what happens when you move your Norway servers to Denmark? The “Norway” hostname now goes to servers in Denmark?
- Use DNS based load balancing, giving Swedish browsers the Norway unicast IPs. Yes… but this is WAY more work than you probably think. And WAY less reliable at giving the best experience for the long tail. And sometimes your most important customer is in that long tail.
- Try to traffic engineer the whole Internet with BGP announcement tweaks. Good luck with that, for the medium to long tail.
- Install servers in Sweden, and any other place you may have users. Then you can anycast your addresses from there, and have full control of how you proxy (or packet by packet traffic engineer over tunnels) them. Expensive if you have many locations you need to do this in. Some traffic will still go to the wrong anycast entry point, but pretty feasible though expensive.

The two DNS-based ones also have the valid concern that screwing up DNS can have bad consequences. If you can leave DNS alone that’s better. Back to HTTP/3. If you’ve set up HTTP/3 it may be because you care about latency. It’s then easier to act on information you have about every single connection. On an individual connection basis you can tell the browser in Sweden that it should now talk to the servers in Norway. All without DNS or anycast. Which is nice, because running a webserver is hard enough. Also running a dynamic DNS service or anycast has even more opportunities to blow up fantastically. I should add that HTTP/3 doesn’t have the “running out of file descriptors” problem. Being based on UDP you can run your entire service with just a single file descriptor. Connections are identified by IDs, not 5-tuples. So why didn’t I just use HTTP/3?

- HTTP/3 is complex. You can build a weird io-uring kTLS based webserver on a weekend, and control everything (except TLS handshakes). Implementing HTTP/3 from scratch, and controlling everything, is a different beast.
- HTTP/1 needs to still work. Not all clients support HTTP/3, and HTTP/1 or 2 is even used to bootstrap HTTP/3 via its Alt-Svc header.
- Preferred address in HTTP/3 is just a suggestion. Browsers don’t have to actually move.

And what about ESNI and ECH? No support for that (yet). From some skimming ESNI should “just work”, with just a minor decryption operation in the SNI router. ECH seems harder. It should still be doable, but the SNI router will need to do the full handshake, or close to it. And after taking its routing decision it needs to transfer the encryption state to the backend, along with the file descriptor. This is not impossible, of course. It’s similar to how tarweb passes the TLS session keys to the kernel. But it likely does mean that the SNI router needs to have access to both the TLS session keys and maybe even the domain TLS private keys. But that’s a problem for another day. One more downside of the proxy mode: having all bytes bounce on the SNI router triples the number of total file descriptors for the connection (one on the backend, then one each on the router for upstream and downstream). There are limits per process and system wide, and the more you have the more you need to juggle them in code. It also wastes CPU and RAM.
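Circling back to tarweb "throwing the TLS session keys over to the kernel": that is Linux kernel TLS (kTLS). Here is a minimal sketch of the transmit-side handoff, assuming TLS 1.2 with AES-128-GCM for brevity; the key material would come from the real handshake, and this is not tarweb's actual code:

```c
#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Older userspace headers may lack these; values are from the kernel. */
#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* After finishing the handshake in userspace, hand the negotiated keys
 * to the kernel so that plain write()/send() output is encrypted. */
int enable_ktls_tx(int sock, const unsigned char *key,
                   const unsigned char *iv, const unsigned char *salt,
                   const unsigned char *seq) {
    /* Attach the "tls" upper layer protocol to the TCP socket. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;

    struct tls12_crypto_info_aes_gcm_128 ci;
    memset(&ci, 0, sizeof(ci));
    ci.info.version = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
    memcpy(ci.rec_seq, seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

    /* From here on, outgoing data on `sock` is encrypted by the kernel. */
    return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}
```

An analogous setsockopt with TLS_RX hands over the receive direction as well.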
Related links:

- tarweb was first written in C++.
- livecount keeps long lived connections.
- tarweb was rewritten in Rust, and uses io-uring.
- You can redirect with Javascript, but this still has the bookmarking problem.
- I passed file descriptors between processes in injcode, but it was only ever a proof of concept that only worked on 32bit x86, and the code doesn’t look like it actually does it? Anyway I can’t expect to remember code from 17 years ago.

0 views
Simon Willison 1 month ago

Living dangerously with Claude

I gave a talk last night at Claude Code Anonymous in San Francisco, the unofficial meetup for coding agent enthusiasts. I decided to talk about a dichotomy I've been struggling with recently. On the one hand I'm getting enormous value from running coding agents with as few restrictions as possible. On the other hand I'm deeply concerned by the risks that accompany that freedom. Below is a copy of my slides, plus additional notes and links as an annotated presentation . I'm going to be talking about two things this evening... Why you should always use --dangerously-skip-permissions. (This got a cheer from the room full of Claude Code enthusiasts.) And why you should never use --dangerously-skip-permissions. (This did not get a cheer.) --dangerously-skip-permissions is a bit of a mouthful, so I'm going to use its better name, "YOLO mode", for the rest of this presentation. Claude Code running in this mode genuinely feels like a completely different product from regular, default Claude Code. The default mode requires you to pay constant attention to it, tracking everything it does and actively approving changes and actions every few steps. In YOLO mode you can leave Claude alone to solve all manner of hairy problems while you go and do something else entirely. I have a suspicion that many people who don't appreciate the value of coding agents have never experienced YOLO mode in all of its glory. I'll show you three projects I completed with YOLO mode in just the past 48 hours. I wrote about this one at length in Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code . I wanted to try the newly released DeepSeek-OCR model on an NVIDIA Spark, but doing so requires figuring out how to run a model using PyTorch and CUDA, which is never easy and is a whole lot harder on an ARM64 device. I SSHd into the Spark, started a fresh Docker container and told Claude Code to figure it out. It took 40 minutes and three additional prompts but it solved the problem , and I got to have breakfast and tinker with some other projects while it was working. This project started out in Claude Code for the web . I'm eternally interested in options for running server-side Python code inside a WebAssembly sandbox, for all kinds of reasons, and I wanted to see how hard it was to do that using Pyodide running directly in Node.js. I decided to see if the Claude iPhone app could launch a task to figure it out. Claude Code got it working and built and tested this demo script showing how to do it. I started a new simonw/research repository to store the results of these experiments, each one in a separate folder. It's up to 5 completed research projects already and I created it less than 2 days ago. Here's my favorite, a project from just this morning. I decided I wanted to try out SLOCCount , a 2001-era Perl tool for counting lines of code and estimating the cost to develop them using 2001 USA developer salaries... but I didn't want to run Perl, so I decided to have Claude Code (for web, and later on my laptop) try and figure out how to run Perl scripts in WebAssembly. TLDR: it got there in the end ! It turned out some of the supporting scripts in SLOCCount were written in C, so it had to compile those to WebAssembly as well. And now tools.simonwillison.net/sloccount is a browser-based app which runs 25-year-old Perl+C in WebAssembly against pasted code, GitHub repository references and even zip files full of code.
The wild thing is that all three of these projects weren't even a priority for me - they were side quests, representing pure curiosity that I could outsource to Claude Code and solve in the background while I was occupied with something else. I got a lot of useful work done in parallel to these three flights of fancy. But there's a reason --dangerously-skip-permissions has that scary name. It's dangerous to use Claude Code (and other coding agents) in this way! The reason for this is prompt injection , a term I coined three years ago to describe a class of attacks against LLMs that take advantage of the way untrusted content is concatenated together with trusted instructions. (It's named after SQL injection which shares a similar shape.) This remains an incredibly common vulnerability. Here's a great example of a prompt injection attack against a coding agent, described by Johann Rehberger as part of his Month of AI Bugs , sharing a new prompt injection report every day for the month of August. If a coding agent - in this case OpenHands - reads this file it can be tricked into grepping the available environment variables for ghp_ (matching GitHub Personal Access Tokens) and sending that to the attacker's external server for "help debugging these variables". I coined another term to try and describe a common subset of prompt injection attacks: the lethal trifecta . Any time an LLM system combines access to private data with exposure to untrusted content and the ability to externally communicate , there's an opportunity for attackers to trick the system into leaking that private data back to them. These attacks are incredibly common . If you're running YOLO coding agents with access to private source code or secrets (like API keys in environment variables) you need to be concerned about the potential of these attacks. This is the fundamental rule of prompt injection: anyone who can get their tokens into your context should be considered to have full control over what your agent does next, including the tools that it calls. Some people will try to convince you that prompt injection attacks can be solved using more AI to detect the attacks. This does not work 100% reliably, which means it's not a useful security defense at all . The only solution that's credible is to run coding agents in a sandbox . The best sandboxes are the ones that run on someone else's computer! That way the worst that can happen is someone else's computer getting owned. You still need to worry about your source code getting leaked. Most of my stuff is open source anyway, and a lot of the code I have agents working on is research code with no proprietary secrets. If your code really is sensitive you need to consider network restrictions more carefully, as discussed in a few slides. There are lots of great sandboxes that run on other people's computers. OpenAI Codex Cloud, Claude Code for the web, Gemini Jules are all excellent solutions for this. I also really like the code interpreter features baked into the ChatGPT and Claude consumer apps. There are two problems to consider with sandboxing. The first is easy: you need to control what files can be read and written on the filesystem. The second is much harder: controlling the network connections that can be made by code running inside the agent. The reason network access is so important is that it represents the data exfiltration leg of the lethal trifecta.
If you can prevent external communication back to an attacker they can't steal your private information, even if they manage to sneak in their own malicious instructions. Claude Code CLI grew a new sandboxing feature just yesterday, and Anthropic released a new open source library showing how it works. The key to the implementation - at least on macOS - is Apple's little known but powerful sandbox-exec command. This provides a way to run any command in a sandbox configured by a policy document. Those policies can control which files are visible but can also allow-list network connections. Anthropic run an HTTP proxy and allow the Claude Code environment to talk to that, then use the proxy to control which domains it can communicate with. (I used Claude itself to synthesize this example from Anthropic's codebase.) ... the bad news is that sandbox-exec has been marked as deprecated in Apple's documentation since at least 2017! It's used by Codex CLI too, and is still the most convenient way to run a sandbox on a Mac. I'm hoping Apple will reconsider. So go forth and live dangerously! (But do it in a sandbox.)

0 views
Abhinav Sarkar 1 month ago

A Fast Bytecode VM for Arithmetic: The Virtual Machine

In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. In this final post, we write the virtual machine that executes our bytecode, and benchmark it. This post was originally published on abhinavsarkar.net . This post is part of the series: A Fast Bytecode VM for Arithmetic . Bytecode Virtual Machines (VMs) are known to be faster than AST-walking interpreters. That’s why many real-world programming languages these days are implemented with bytecode VMs, for example, Java , Python , PHP , and Raku . The reason is partially the flat and compact nature of bytecode itself. But VMs also have a few other tricks up their sleeves that make them highly performant. In this post, we write a VM for our arithmetic expression language, and explore some of these performance tricks. But first, we need to finish a pending task. We wrote some unit tests for our compiler in the last post, but unit tests cover only the cases we can think of. A compiler has to deal with any input, and with just unit tests we cannot be sure of its correctness. To test our compiler and other components for correctness, we use the QuickCheck library. QuickCheck is a Property-based Testing framework. The key idea of property-based testing is to write properties of our code that hold true for any input, and then to automatically generate a large number of arbitrary inputs and make sure that the properties are indeed true for them 1 2 . Since we are writing an arithmetic expression parser/compiler/VM, we generate arbitrary expression ASTs, and use them to assert certain invariants of our program. With QuickCheck, we write generators for the inputs for our tests. These generators are composable just like parser combinators are. We use the library-provided generators to write small generators that we combine to create larger ones. Let’s start. First come the basic generators: one generates number expressions using QuickCheck’s built-in function, one generates variable expressions by choosing from the set of passed valid variable names, and one generates valid identifiers from combinations of letters a–z and A–Z, discarding ones that are reserved keywords. Moving on to composite generators: one generates binary expressions with arbitrary binary operations. It recursively calls the expression generator to produce the operands. The size parameter controls the complexity of the generated expressions, and we halve the size of operands (and so on recursively) so that we don’t end up with infinitely large expressions. Another generates let expressions by generating an identifier, and then generating the assignment and body expressions recursively. We do the same trick of halving sizes here as well. Notice that the assignment is generated with the passed variable names in scope, whereas the body is generated with the new identifier added to the scope. A top-level generator uses the above generators to generate all kinds of expressions. At smaller sizes, it prefers to generate base expressions, while at larger sizes, it prefers composite ones. Due to the careful recursive halving of size in composite generators, we end up with expressions of finite sizes. Finally, we have some instances of QuickCheck’s Arbitrary type class to tie everything together. We can apply them in GHCi: notice that the generated samples increase in complexity. With the generators in place, we define our properties next. Let’s test our parser first. This property is a simple round-trip test for the parser and printer: we parse the string representation of a generated expression, and assert that it gives back the same expression.
The second property is a more involved round-trip test for the compiler and decompiler: it asserts that compiling an expression, then disassembling and decompiling it, and finally compiling it again should result in the original bytecode 3 . This requires a helper function to get the size of an expression. We run these tests in a later section. This ends our short detour. Now for the main event: the virtual machine. Our VM is a stack-based machine that operates on a stack of values and executes the compiled bytecode. Our goal is to be as fast as possible. And now, the heart of the VM. The interpreter function is where the action happens. It is way more complex than the AST interpreter, but the complexity has a reason, namely performance. It runs inside the ST monad wrapped with the ExceptT monad transformer. The ST monad lets us use mutable data structures locally while ensuring the function remains externally pure. The ExceptT monad transformer adds support for throwing and propagating errors in a pure manner. We use a mutable array of unboxed primitive types for our stack. Using a mutable unboxed array is much faster than using an immutable and/or boxed one, due to reduced allocation and/or pointer chasing. The core of the VM is a tight, tail-recursive loop function that GHC compiles into an efficient machine loop, as we see later. It takes the stack pointer (sp), instruction pointer (ip) 4 , and the stack as arguments. At the top of each loop, a block of guard clauses checks for stack overflow, underflow, and other error conditions before branching on the current opcode. Placing these checks at the top instead of inside the opcode cases is a deliberate choice. This may make the code slightly harder to understand, but it significantly improves the performance of the loop by moving all branching to the beginning of the loop, resulting in code that is more friendly to the CPU’s Branch Predictor . Also notice how we reduce the number of checks by working with a range of opcodes at once in the guard. The checks are also sorted so as to be most performant, guided by profiling and benchmarking 5 . The handling of each opcode is actually pretty straightforward. We use specific operations to read and write to the stack, while taking care of doing the required bound and arithmetic checks. We also use the helper functions that we wrote earlier. After carrying out each operation, we reenter the loop by calling it tail-recursively with the right stack and instruction pointers. Finally, we make sure that the execution terminated correctly by checking the state of the stack, and return its first element. We see later that the VM is quite fast, but how does GHC achieve this performance? To see the magic, we can look at GHC’s intermediate language: Core . Core is a simpler functional language than Haskell, to which GHC compiles Haskell. The simpler nature of Core makes it easier for GHC to optimize it, and compile it further. We can get the Core code for a program by compiling with the GHC option -ddump-simpl. The actual Core code for our VM is too verbose to show here, but a few key optimizations are worth pointing out. The loop: The tail-recursive loop function is compiled into a proper loop. The recursive call is effectively a jump, which means there’s no function call overhead for each iteration of the VM loop. Unboxing: The Core code is full of primitive, unboxed types and operations on them.
These are raw machine integers and memory addresses, not boxed Haskell objects. This means operations on them are as fast as they would be in C. The stack operations are not function calls through an array typeclass instance, but primitive memory reads and writes on a raw memory address. Inlining: The helper functions are completely inlined into the main loop. For an add instruction, the code for reading two values, adding them, and writing the result is laid out inline, and works on unboxed values and the raw array address. In short, GHC has turned our high-level, declarative Haskell code into a low-level loop that looks remarkably like one we would write in C. We get the safety and expressiveness of Haskell, while GHC does the heavy lifting to produce highly optimized code. It’s the best of both worlds! We must test the VM to make sure it works correctly 6 . We reuse the success and failure tests for the AST interpreter, as the bytecode interpreter should yield the same result. We also add a property-based test this time: for any given expression, interpreting the AST should produce the same result as compiling it to bytecode and executing it in the VM 7 . Our test suite is complete now, and we run all tests together. Happily, all tests pass. Now for the fun part: benchmarking. We use the criterion library to benchmark the code. We have a benchmark suite to measure the performance of each pass, the two interpreters (AST and bytecode), and the full end-to-end runs 8 . Here are the results in a more digestible format, and the times in a chart (smaller is better). Let’s break down these numbers:

- Parsing and decompiling are slow: At ~573ms and ~506ms, these are by far the slowest passes. This isn’t surprising. Parsing with parser combinators has a known trade-off of expressiveness for performance. Decompiling is a shift-reduce parser that reconstructs an AST from a linear stream of opcodes, and we didn’t spend any time optimizing it.
- Compilation is fast: At ~51ms, compilation is an order of magnitude faster than parsing. This is thanks to pre-calculating the bytecode size during the parsing phase, which allows us to pre-allocate a single buffer and fill it in with low-level pointer operations.
- Bytecode interpretation is blazingly fast: At just ~16ms, our VM’s interpreter is over 3 times faster than the AST interpreter (~50ms), which proves our belief that bytecode interpreters are faster.
- End-to-end runs: Interestingly, the total time to run via bytecode (~638ms) is slightly slower than the run via AST (~617ms). This is because the cost of parsing, compiling, and then interpreting is higher than just parsing and interpreting. The real win for a bytecode VM comes when you compile once and run many times, amortizing the initial compilation cost.

I can already see readers thinking, “Sure that’s fast, but is it faster than C/Rust/Zig/my favourite language?” Let’s find out. To get a better sense of our VM’s performance, I rewrote it in C. The C implementation is a classic manual approach: a hand-written tokenizer and recursive-descent parser, structs with pointers for the AST, and manual memory management and error propagation. The VM is a simple loop with a switch statement for dispatching opcodes 9 . To compare our Haskell code against the C code, we need to write the last Haskell module, the CLI app that we demonstrated in the first post. We compile the Haskell with aggressive GHC options 10 , and the C version using GCC. Now, let’s see how they stack up against each other. We use hyperfine to run the two executables. Here’s a summary of the results (I have subtracted the times of previous passes to get the times for individual passes). As expected, the C implementation is faster across the board, between 1.5x and 2.6x. The biggest difference is in parsing, where the hand-written C parser is more than twice as fast as our combinator-based one. On the other hand, the Haskell VM is only 50% slower than the C VM. In my opinion, the Haskell code’s performance is quite respectable, especially given the safety, expressiveness and conciseness benefits, as illustrated by the code sizes 11 : the Haskell implementation is almost half the size of the C code. I don’t know about you but I’m perfectly happy with the half-as-small, half-as-fast trade-off. The benchmark results for the VMs become less surprising when I compare the C run loop with the GHC Core code 12 . This structure is almost a 1-to-1 match with the GHC Core code we saw earlier.
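To make that comparison concrete, here is a minimal C sketch of a switch-dispatch stack VM. It is a toy with a made-up instruction set, not the post's actual C implementation, but it has the same shape: a flat bytecode array, a stack, and one tight loop:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy opcodes, invented for this sketch. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* A tight switch-dispatch loop over flat bytecode: the same structure
 * GHC compiles the Haskell VM's tail-recursive loop into. */
long run(const uint8_t *code) {
    long stack[256];
    int sp = 0;                      /* stack pointer (an index, not an address) */
    for (int ip = 0; ; ) {           /* instruction pointer */
        switch (code[ip++]) {
        case OP_PUSH:
            stack[sp++] = (int8_t)code[ip++];   /* immediate operand */
            break;
        case OP_ADD:
            sp--; stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--; stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return stack[sp - 1];
        }
    }
}

int main(void) {
    /* Bytecode for (1 + 2) * 3 */
    const uint8_t prog[] = { OP_PUSH, 1, OP_PUSH, 2, OP_ADD,
                             OP_PUSH, 3, OP_MUL, OP_HALT };
    printf("%ld\n", run(prog));      /* prints 9 */
    return 0;
}
```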
The C loop corresponds to the optimized loop function that GHC generates, the switch statement is almost identical to the case analysis on the raw opcode byte, and the C stack array is equivalent to the mutable array GHC uses. GHC effectively compiles our high-level Haskell into low-level code that is structurally identical to what we wrote by hand in C 13 . This explains why the performance is in the same ballpark. The remaining performance gap is probably due to the thin layer of abstraction that the Haskell runtime still maintains, but it’s remarkable how close we can get to C-like performance. While our Haskell program is fast, we can improve certain things:

- Parser optimizations: As the benchmarks showed, parsing is our slowest pass. For better performance, we could replace our Attoparsec-based combinator parser with a parser generator like Alex and Happy, or even write a recursive-descent parser by hand.
- Superinstructions: We could analyze the bytecode for common instruction sequences (like a push followed by an add) and combine them into single superinstructions. This would reduce the instruction dispatch overhead, but may make compilation slower.
- Register-based VM: A register-based VM, which uses a small array of virtual registers instead of a memory-based stack, could significantly reduce memory traffic and improve performance. This would require a more complex compiler capable of register allocation.
- Just-in-Time (JIT) compilation: The ultimate performance boost could come from a JIT compiler. Instead of interpreting bytecode, we could compile it to native machine code at runtime, eliminating the interpreter entirely. Maybe we could use LLVM to build a JIT compiler in Haskell.

And that’s a wrap! We successfully built a bytecode compiler and virtual machine in Haskell. We covered parsing, AST interpretation, compilation, and bytecode execution, as well as debugging and testing functionality. Let’s update our checklist:

- Parsing arithmetic expressions to Abstract Syntax Trees (ASTs).
- Unit testing for our parser.
- Interpreting ASTs.
- Compiling ASTs to bytecode.
- Disassembling and decompiling bytecode.
- Unit testing for our compiler.
- Property-based testing for our compiler.
- Efficiently executing bytecode in a virtual machine (VM).
- Unit testing and property-based testing for our VM.
- Benchmarking our code to see how the different passes perform.
- All the while keeping an eye on performance.

The journey from a simple AST interpreter to a bytecode VM has been a rewarding one. We saw a significant performance improvement, learned about how compilers and VMs work, and how to write performant code in Haskell. While our Haskell implementation isn’t as fast as the hand-written C version, it’s far more concise and, I would argue, easier to reason about. It’s a great demonstration of Haskell’s power for writing high-performance, yet safe and elegant, code. See the full code at: ArithVMLib.hs, ArithVMSpec.hs, ArithVMBench.hs, and ArithVMApp.hs. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

1. Actually, QuickCheck does not generate entirely arbitrary inputs. It generates arbitrary inputs with increasing complexity—where the complexity is defined by the user—and asserts the properties on these inputs. When a test fails for a particular input, QuickCheck also tries to simplify the culprit and tries to find the simplest input for which the test fails. This process is called Shrinking in QuickCheck parlance. QuickCheck then shows this simplest input to the user for them to use it to debug their code.
2. Read this good introduction to QuickCheck if you are unfamiliar.
3. Notice that we discard the expressions that do not compile successfully.
4. sp and ip are not actual pointers, but indices into the stack and bytecode arrays respectively.
5. Guided by the GHC profiler, I tweaked the code in many different ways and ran benchmarks for every change. Then I chose the code that was most performant.
6. It is extremely important to write good tests before getting your hands dirty with performance optimizations. In my case, the tests saved me many times from breaking the VM while moving code around for performance.
7. We are using our AST interpreter as a definitional interpreter, assuming it to be correctly implemented because of its simpler nature.
8. I ran all benchmarks on an Apple M4 Pro 24GB machine against a 142MB file generated using the expression generator we wrote earlier.
9. I don’t claim to be a great or even a good C programmer. In fact, this C VM is the first substantial C code I have written in decades. I’m sure the code is not the most optimized. It may even be riddled with memory management bugs. If you find something wrong, please let me know in the comments.
10. I tried various RTS options to tweak GHC garbage collection, but the defaults proved to be fastest.
11. The lines of code are for only the overlapping functionalities between the C and Haskell versions.
12. I did try using Direct Threading and Subroutine Threading in the C code, but they resulted in slower code than the switch-case variant. GCC may be smart enough in case of this simple VM to optimize the switch-case to be faster than threaded code.
13. You may have noticed that the C run function is not laid out in the exact same manner as the Haskell one. In case of C, moving the checks to the front did not yield a performance improvement. I suspect this may be because GCC is smart enough to do that optimization by itself. The nested checks were also no detriment to the performance of the C code.

0 views
Dangling Pointers 1 month ago

Automata All the Way Down

Fabs and EDA companies collaborate to provide the abstraction of synchronous digital logic to hardware designers. A hardware design comprises:

- A set of state elements (e.g., registers and on-chip memories), which retain values from one clock cycle to another
- A transfer function, which maps the values of all state elements at clock cycle N to new values of all state elements at clock cycle N+1

The transfer function cannot be too fancy. It can be large but cannot be defined with unbounded loops/recursion. The pragmatic reason for this restriction is that the function is implemented with physical gates on a chip, and each gate can only do one useful thing per clock cycle. You cannot loop the output of a circuit element back to itself without delaying the value by at least one clock cycle (via a state element). It feels to me like there is a deeper reason why this restriction must exist. Many people dabbled with synchronous digital logic in college. If you did, you probably designed a processor, which provides the stored program computer abstraction to software developers. And here comes the inception: you can think of a computer program as a transfer function. In this twisted mindset, the stored program computer abstraction enables software engineers to define transfer functions. For example, a short pseudo-assembly program can be thought of as a transfer function: a switch on the program counter, with one case per instruction (a C sketch of this view appears below). In the stored program computer abstraction, state elements are the architectural registers plus the contents of memory. As with synchronous digital logic, there are limits on what the transfer function can do. The switch statement can have many cases, but the body of each case block is defined by one instruction. Alternatively, you can define the transfer function at the basic block level (one case per basic block, many instructions inside of each case). Programming in assembly is a pain, so higher level languages were developed to make us less crazy. And here we go again: someone could write an interpreter for C. A user of this interpreter works at the C level of abstraction. Following along with our previous pattern, a C program comprises:

- A set of state elements (variables, both global and local)
- A transfer function

For example, a C function can likewise be thought of as a transfer function over its variables. Think of the call and return operations as intrinsics used to implement function calls. The key building blocks of the transfer function are state-ments. It is easy to just store the term "statement" into your brain without thinking of where the term comes from. A state-ment is a thing which can alter state. This transformation of an imperative program into a transfer function seems strange, but some PL folks do it all the time. In particular, the transfer function view is how small step operational semantics are defined. And of course this can keep going. One could write a Python interpreter in C, which allows development at a higher level of abstraction. But even at that level of abstraction, programs are defined in terms of state elements (variables) and a transfer function (statements). The term Turing Tax was originally meant to describe the performance loss associated with working at the stored-program computer level of abstraction instead of the synchronous digital logic level of abstraction. This idea can be generalized.
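Before generalizing: the post's original assembly and C listings did not survive extraction, so here is a stand-in, a hedged C sketch of the transfer-function view with a made-up mini instruction set. The state struct holds the state elements; step is the transfer function, applied once per "clock":

```c
#include <stdio.h>

/* The machine's state elements: a program counter, two registers,
 * and a tiny memory. (Invented architecture, for illustration only.) */
struct state {
    int pc;
    long r0, r1;
    long mem[16];
};

/* One evaluation of the transfer function: map the state at step N to
 * the state at step N+1. One case per instruction, and no unbounded
 * loops inside -- looping happens by re-applying the function. */
struct state step(struct state s) {
    switch (s.pc) {
    case 0: s.r0 = 0;        s.pc = 1; break;              /* r0 <- 0       */
    case 1: s.r1 = s.mem[0]; s.pc = 2; break;              /* r1 <- mem[0]  */
    case 2: s.r0 += s.r1;    s.pc = 3; break;              /* r0 <- r0 + r1 */
    case 3: s.r1 -= 1;       s.pc = s.r1 > 0 ? 2 : 4; break; /* loop/exit   */
    case 4: s.mem[1] = s.r0; s.pc = 5; break;              /* mem[1] <- r0  */
    }
    return s;
}

int main(void) {
    struct state s = { .pc = 0, .mem = { 5 } };
    while (s.pc != 5)            /* the "clock": one step per application */
        s = step(s);
    printf("%ld\n", s.mem[1]);   /* 5+4+3+2+1: prints 15 */
    return 0;
}
```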
Code vs Data

At a particular level of abstraction, code defines the transfer function while data is held in the state elements. A particular set of bits can simultaneously be described as code at one level of abstraction, while defined as data at a lower level. This code/data duality is intimately related to the Turing Tax. The Turing Tax collector is constantly looking for bags of bits which can be interpreted as either code or data, and he collects his tax each time he finds such a situation. An analogous circumstance arises in hardware design. Some signals can be viewed as either part of the data path or the control path, depending on what level of abstraction one is viewing the hardware from.

Compilers vs Interpreters

A compiler is one trick to avoid the Turing Tax by translating code (i.e., a transfer function) from a higher level of abstraction to a lower level. We all felt awkward when I wrote "interpreter for C" earlier, and now we can feel better about it. JIT compilers for Python are one way to avoid the Turing Tax. Another example is an HLS compiler, which avoids the Turing Tax between the stored-program computer abstraction layer and the synchronous digital logic layer.

Step Counting (Multiple Taxation)

No, this section isn't about your Fitbit. Let's call each evaluation of a transfer function a step. These steps occur at each level of abstraction. Let's define the ultimate performance goal that we care about to be the number of steps required to execute a computation at the synchronous digital logic level of abstraction. The trouble with these layers of abstraction is that typically a step at a higher layer of abstraction requires multiple steps at a lower layer. For example, the multi-cycle processor implementation you learned about in a Patterson and Hennessy textbook could require 5 clock cycles to execute each instruction (instruction fetch, register fetch, execute, memory, register write back). Interpreters have the same behavior: one Python statement may be implemented with many C statements. Now imagine the following house of cards:

- A Python interpreter which requires an average of 4 C statements to implement 1 Python statement
- A C compiler which requires an average of 3 machine instructions to implement 1 C statement
- A processor which requires an average of 5 clock cycles to execute 1 machine instruction

When the Turing property tax assessor sees this house, they tax each level of the house. In this system, an average Python statement requires 4 × 3 × 5 = 60 clock cycles! Much engineering work goes into avoiding this problem (pipelined and superscalar processors, multi-threading, JIT compilation, SIMD). Partial evaluation is another way to avoid the Turing Tax. Partial evaluation transforms data into code. There must be some other method of creating abstractions which is more efficient. Self-modifying code is rarely used in the real world (outside of JIT compilers). Self-modifying code seems crazy to reason about but potentially could offer large performance gains. Partial evaluation is also rarely used but has a large potential.

0 views
Anton Zhiyanov 1 month ago

High-precision date/time in C

I've created a C library called vaqt that offers data types and functions for handling time and duration, with nanosecond precision. It works with C99 (C11 is recommended on Windows for higher precision). vaqt is a partial port of Go's time package. It works with two types of values: times and durations. Time is a pair (seconds, nanoseconds), where seconds is the 64-bit number of seconds since zero time (0001-01-01 00:00:00 UTC) and nanoseconds is the number of nanoseconds within the current second (0-999999999). Time can represent dates billions of years in the past or future with nanosecond precision. Time is always handled in UTC, but you can convert it from/to a specific timezone. Duration is a 64-bit number of nanoseconds. It can represent values up to about 290 years. vaqt provides functions for common date and time operations: creating time values, extracting time fields, calendar time, time comparison, time arithmetic, formatting, and marshaling. Check the API reference for more details. A basic sketch of working with such (seconds, nanoseconds) values follows below. If you work with date and time in C, you might find vaqt useful. See the nalgeon/vaqt repo for all the details.
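The post's code samples were lost in extraction. As a stand-in, here is a small C99 sketch of the representation described above, a (seconds, nanoseconds) pair; the names are hypothetical and this is not vaqt's actual API (see the repo for that):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the representation the post describes:
 * 64-bit seconds since 0001-01-01 00:00:00 UTC, plus 0..999999999 nanos.
 * Not vaqt's actual API -- see nalgeon/vaqt for the real one. */
typedef struct {
    int64_t sec;
    int32_t nsec;
} my_time;

typedef int64_t my_duration;   /* nanoseconds; caps out around 290 years */

/* Add a duration to a time, renormalizing the nanosecond field. */
my_time time_add(my_time t, my_duration d) {
    t.sec  += d / 1000000000;
    t.nsec += (int32_t)(d % 1000000000);
    if (t.nsec >= 1000000000) { t.nsec -= 1000000000; t.sec++; }
    if (t.nsec < 0)           { t.nsec += 1000000000; t.sec--; }
    return t;
}

/* Compare two times: -1, 0, or 1 (like Go's Time.Compare). */
int time_compare(my_time a, my_time b) {
    if (a.sec != b.sec)   return a.sec < b.sec ? -1 : 1;
    if (a.nsec != b.nsec) return a.nsec < b.nsec ? -1 : 1;
    return 0;
}

int main(void) {
    my_time t = { .sec = 63831283200, .nsec = 0 };   /* some 2020s instant */
    my_time u = time_add(t, 1500000000);             /* +1.5 seconds */
    printf("%" PRId64 ".%09" PRId32 "\n", u.sec, u.nsec);
    return 0;
}
```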

0 views

Arrows to Arrows, Categories to Queries

I’ve had a little time off of work as of late, and been spending it in characteristically unwise ways. In particular, I’ve written a little programming language that compiles to SQL . I call it catlang . That’s not to say that I’ve written a new query language. It’s a programming language, whose compiler spits out one giant statement. When you run that query in postgres, you get the output of your program. Why have I done this? Because I needed a funny compilation target to test out the actual features of the language, which is that its intermediary language is a bunch of abstract category theory nonsense. Which I’ll get to. But I’m sure you first want to see this bad boy in action. Behold, the function that returns 100 regardless of what input you give it. But it does it with the equivalent of a while loop. If you’re familiar with arrow notation , you’ll notice the above looks kinda like one big proc block. This is not a coincidence (because nothing is a coincidence). I figured if I were to go through all of this work, we might as well get a working arrow desugarer out of the mix. But I digress; that’s a story for another time. Anyway, what’s going on here is we have an arrow which takes a single argument. We then loop, starting from the value of that argument. Inside the loop, we now have a new variable, which we do some voodoo on to compute the current value of the loop variable. Then we subtract 100 from it, and take the absolute value. The helper function here is a bit odd; it returns one tag if the input was negative, and the other otherwise. Then we branch on that output, with the two cases bound to new names. If the value was less than zero, we find ourselves in the first case, where we add 1 to it and wrap the whole thing in the "loop again" constructor—which the loop interprets as "loop again with this new value." Otherwise, the value was non-negative, and so we can return it directly. Is it roundabout? You bet! The obtuseness here is not directly a feature; I was just looking for conceptually simple things I could do which would be easy to desugar into category-theoretical stuff. Which brings us to the intermediary language. After desugaring the source syntax for the program above, we’re left with an IL representation. We’ll discuss all of this momentarily, but for now, just let your eyes glaze over the pretty unicode. The underlying idea here is that each of these remaining symbols has very simple and specific algebraic semantics. For example, one of them means “do the first arrow and pipe the result into the second.” By giving a transformation from this categorical IL into other domains, it becomes trivial to compile catlang to all sorts of weird compilation targets. Like SQL. You’re probably wondering what the generated SQL looks like. Take a peek if you dare. It’s not pretty. Rather amazingly, running the above query in postgres 17 will in fact return a single row with a single column whose value is 100. And you’d better believe it does it by actually looping its way up to 100. If you don’t believe me, make the following change, which will instead return a row for each step of the iteration. There are some obvious optimizations I could make to the generated SQL, but it didn’t seem worth my time, since that’s not the interesting part of the project. Let’s take some time to discuss the underlying category theory here. I am by no means an expert, but what I have learned after a decade of bashing my head against this stuff is that a little goes a long way. For our intents and purposes, we have types, and arrows (functions) between types.
We always have the identity "do nothing" arrow, and we can compose arrows by lining up one end to another 1 . Unlike Haskell (or really any programming language, for that matter), we DO NOT have the notion of function application. That is, there is no application arrow. You can only compose arrows, you can’t apply them. That’s why we call these things “arrows” rather than “functions.” There are a bundle of arrows for working with product types. The two projection functions correspond to fst and snd, taking individual components out of pairs. How do we get things into pairs in the first place? We can use the “fork” operation, which takes two arrows and generates a new arrow producing a pair of their results. If you’re coming from a Haskell background, it’s tempting to think of this operation merely as the pair constructor. But you’ll notice from the type of the computation that there can be no data dependency between the two sides, thus we are free to parallelize each side of the pair. In category theory, the distinction between left and right sides of an arrow is rather arbitrary. This gives rise to a notion called duality where we can flip the arrows around, and get cool new behavior. If we dualize all of our product machinery, we get the coproduct machinery, where a coproduct of two types is “either one or the other, but definitely not both nor neither.” Swapping the arrow directions and replacing products with coproducts gives us the injections, and the “join” operation for eliminating coproducts. Again, coming from Haskell this is just the standard either function. It corresponds to a branch between one of two cases. As you can see, with just these eight operations, we already have a tremendous amount of expressivity. We can express data dependencies via composition and branching via join. With fork we automatically encode opportunities for parallelism, and gain the ability to build complicated data structures, with the projections allowing us to get the information back out of the data structures. You’ll notice in the IL that there are no variable names anywhere to be found. The desugaring of the source language builds a stack (via the fork operation), and replaces subsequent variable lookups with a series of projections on the stack to find the value again. On one hand, this makes the categorical IL rather hard to read, but it makes it very easy to re-target! Many domains do have a notion of grouping, but don’t have a native notion of naming. For example, in an electronic circuit, I can have a ribbon of 32 wires which represents a 32-bit integer. If I have another ribbon of 32 wires, I can trivially route both ribbons into a 64-wire ribbon corresponding to a pair of 32-bit integers. By eliminating names before we get to the IL, it means no compiler backend ever needs to deal with names. They can just work on a stack representation, and are free to special-case optimize series of projections if they are able to. Of particular interest to this discussion is how we desugar loops in catlang. The underlying primitive is cochoice, which magically turns an arrow on Eithers into an arrow without the Eithers. We obviously must run that arrow on Eithers. If that arrow returns the “done” case, then we’re happy and we can just output that. But if it returns the “loop” case, we have no choice but to pass it back in to the eithered arrow. In Haskell, cochoice is implemented as a little driver that keeps re-applying the arrow until it finally returns the “done” case. What’s neat about this formulation of a loop is that we can statically differentiate between our first and subsequent passes through the loop body: the first time through, the input is wrapped in one constructor, while on all later passes it is wrapped in the other.
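The Haskell implementation of cochoice did not survive extraction; here is a C rendering of the same looping semantics, using a tagged union in place of Either. Which tag means "done" and which means "go again" is my own convention for the sketch, and it simplifies away the first-pass/later-pass distinction that the real primitive preserves:

```c
#include <stdio.h>

/* A tagged union standing in for Either: LEFT means loop again with
 * this value, RIGHT means done, return this value. (My convention.) */
enum tag { LEFT, RIGHT };
struct either { enum tag tag; long value; };

/* The "eithered arrow": one pass through the loop body. It mirrors the
 * post's example program: count up until we reach 100, then return. */
static struct either body(struct either e) {
    long x = e.value;
    if (x < 100)
        return (struct either){ LEFT, x + 1 };   /* loop again */
    return (struct either){ RIGHT, x };          /* done */
}

/* cochoice: turn an arrow on Eithers into an arrow without them, by
 * feeding LEFT results back in until the body yields RIGHT. */
static long cochoice(struct either (*f)(struct either), long input) {
    struct either e = f((struct either){ LEFT, input });
    while (e.tag == LEFT)
        e = f(e);                                /* subsequent passes */
    return e.value;
}

int main(void) {
    printf("%ld\n", cochoice(body, 0));          /* prints 100 */
    return 0;
}
```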
We don’t take advantage of that first-pass distinction in the original program, but how many times have you written loop code that needs to initialize something its first time through? So that’s the underlying theory behind the IL. How can we compile this to SQL now? As alluded to before, we simply need to give SQL implementations for each of the operations in the intermediary language. As a simple example, the identity arrow compiles to a query that just selects the arrow’s input unchanged. The hardest part here was working out a data representation. It seems obvious to encode each element of a product as a new column, but what do we do about coproducts? After much thought, I decided to flatten out the coproducts. So, for example, an Either-containing type would be represented as three columns, with the constraint that exactly one of the two coproduct columns is populated at any given point in time. With this hammered out, almost everything else is pretty trivial. Composition corresponds to a nested query. Forks are SQL JOINs which concatenate the columns of each sub-query. Joins (the coproduct eliminators) are UNIONs, where we add a WHERE clause to enforce we’re looking at the correct coproduct constructor. Cochoice is the only really tricky thing, but it corresponds to a recursive CTE . Generating a recursive CTE table for the computation isn’t too hard, but getting the final value out of it was surprisingly tricky. The semantics of SQL tables is that they are multisets with no inherent order. Which is to say, you need an ordering column structured in a relevant way in order to query the final result. Due to some quirks in what postgres accepts, and in how I structured my queries, it was prohibitively hard to insert a “how many times have I looped” column and order by that. So instead I cheated and added a column which looks at the processor clock and ordered by that. This is clearly a hack, and presumably will cause problems if I ever add some primitives which generate more than one row, but again, this is just for fun and who cares. Send me a pull request if you’re offended by my chicanery! I’ve run out of vacation time to work on this project, so I’m probably not going to get around to the meta-circular stupidity I was planning. The compiler still needs a few string-crunching primitives (which are easy to add), but then it would be simple to write a little brainfuck interpreter in catlang. Which I could then compile to SQL. Now we’ve got a brainfuck interpreter running in postgres. Of course, this has been done by hand before, but to my knowledge, never via compilation. There exist C to brainfuck compilers. And postgres is written in C. So in a move that would make Xzibit proud, we could run postgres in postgres. And of course, it would be fun to run brainfuck in brainfuck. That’d be a cool catlang backend if someone wanted to contribute such a thing. I am not the first person to do anything like this. The source language of catlang is heavily inspired by Haskell’s arrow syntax , which in turn is essentially a desugaring algorithm for Arrows . Arrows are slightly the wrong abstraction because they require an arr operation—which requires you to be able to embed Haskell functions in your category, something which is almost never possible. Unfortunately, arrow syntax in Haskell desugars down to arr for almost everything it does, which in turn makes arrow notation effectively useless. In an ideal world, everything I described in this blog post would be a tiny little Haskell library, with arrow notation doing the heavy lifting.
But that is just not the world we live in. Nor am I the first person to notice that there are categorical semantics behind programming languages. I don’t actually know whom to cite on this one, but it is well-established folklore that the lambda calculus corresponds to cartesian-closed categories. The “closed” part of “cartesian-closed” means we have exponential objects – arrows that can themselves be passed around as values – but everyone and their dog has implemented the lambda calculus, so I thought it would be fun to see how far we can get without it. This is not a limitation on catlang’s turing completeness (since cochoice gives us everything we need). I’ve been thinking about writing a category-first programming language for the better part of a decade, ever since I read Compiling to Categories. That paper takes Haskell and desugars it back down to categories. I stole many of the tricks here from that paper. Anyway. All of the code is available on github if you’re interested in taking a look. The repo isn’t up to my usual coding standards, for which you have my apologies. Of note is the template-haskell backend, which can spit out Haskell code; meaning it wouldn’t be very hard to make a quasiquoter to compile catlang into what Haskell’s arrow desugaring ought to be. If there’s enough clamor for such a thing, I’ll see about turning this part into a library.

[1] When looking at the types of arrows in this essay, we make the distinction that ~> arrows are arrows that we can write in catlang, while ordinary -> functions exist in the metatheory. ↩︎

allvpv’s space 1 month ago

Environment variables are a legacy mess: Let's dive deep into them

Programming languages have rapidly evolved in recent years. But in software development, the new often meets the old, and the scaffolding that the OS gives you for running new processes hasn’t changed much since early Unix. If you need to parametrize your application at runtime by passing a few ad-hoc variables (without special files or a custom solution involving IPC or networking), you’re doomed to a pretty awkward, outdated interface: environment variables. There are no namespaces for them, no types. Just a flat, embarrassingly global dictionary of strings.

But what exactly are these envvars? Is it some kind of special dictionary inside the OS? If not, who owns them and how do they propagate? In a nutshell: they’re passed from parent to child. On Linux, a program must use the execve syscall to execute another program. Whether you type a command in Bash, spawn a subprocess in Python, or launch a code editor, it ultimately comes down to execve, preceded by a fork / clone. The exec* family of C functions also relies on execve. This system call takes three arguments: pathname, argv, and envp. The pathname is the executable path; argv is the array of command line arguments – the implicit first (“zero”) argument is usually the executable name; envp is the array of envvars (typically much longer).

By default, all envvars are passed from the parent to the child. However, nothing prevents a parent process from passing a completely different or even empty environment when calling execve! In practice, most tooling passes the environment down: Bash, Python’s subprocess module, the C library’s exec* wrappers, and so on. And this is what you expect – variables are inherited by child processes. That’s the point – to track the environment. Which tools do not pass the parent’s environment? For example, the login executable, used when signing into a system, sets up a fresh environment for its children.

After launching the new program, the kernel dumps the variables on the stack as a sequence of null-terminated strings which contain the envvar definitions. In a hex view of a fresh process’s stack, you’d see exactly that: NAME=value strings separated by NUL bytes. This static layout can’t easily be modified or extended; the program must copy those variables into its own data structure. Let’s look at how Bash, C, and Python store envvars internally. I analyzed their source code and here is a summary.

Bash stores the variables in a hashmap. Or, more precisely, in a stack of hashmaps. When you spawn a new process using Bash, it traverses the stack of hashmaps to find variables marked as exported and copies them into the environment array passed to the child. Side note: why is traversing the stack needed? Each function invocation in Bash creates a new local scope – a new entry on the stack. If you declare your variable with local, it ends up in this locally-scoped hashmap. What’s interesting is that you can export a local variable too! I wouldn’t have learned this without diving into the Bash source. My intuitive (wrong) assumption was that exporting automatically makes the variable global. Super interesting stuff.

The C library exposes the environment as a dynamic array – the global environ – managed via the getenv, setenv, putenv and unsetenv library functions. It uses an array, so the time complexity of getenv and setenv is linear in the number of envvars. Remember – envvars are not a high-performance dictionary and you should not abuse them.

Python couples its environment to the C library, which can cause surprising inconsistencies. If you’ve programmed some Python, you’ve probably used the os.environ dictionary. On startup, os.environ is built from the C library’s environ array. But those dictionary values are NOT the “ground truth” for child processes. Rather, each change to os.environ invokes the native putenv function, which in turn calls the C library’s putenv. Note that the propagation is one-directional: modifying os.environ will call putenv, but not the other way around. Call os.putenv directly, and os.environ won’t be updated.
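To make the execve mechanics above concrete, here is a minimal C sketch (my illustration, not from the original post) that launches /usr/bin/env with a hand-crafted environment – the child inherits nothing unless we put it into envp ourselves:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char *argv[] = { "env", NULL };     /* argv[0]: conventional program name */
    char *envp[] = {                    /* the child's entire environment */
        "GREETING=hello",
        "app_mode=debug",               /* lowercase is fine, per POSIX */
        NULL
    };

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: replace ourselves with `env`, passing our own envp. */
        execve("/usr/bin/env", argv, envp);
        perror("execve");               /* only reached if execve failed */
        return 1;
    }
    waitpid(pid, NULL, 0);              /* Parent: wait for the child. */
    return 0;                           /* `env` printed exactly our two vars */
}
```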
The Linux kernel is very liberal about the format of environment variables, and so is execve. For example, your C program can manipulate the environment – the global environ array – such that several variables share the same name but have different values. And when you execute a child process, it will inherit this “broken” setup. You don’t even need an equals sign separating name from value! The usual entry has the form NAME=value, but nothing prevents you from adding a bare name with no value to the array. The kernel happily accepts any null-terminated string as an “environment variable” definition. It just imposes a size limitation:

Single variable: 128 KiB on a typical x64 Intel CPU. This is for the whole definition – name + equals sign + value. It’s computed as MAX_ARG_STRLEN = 32 * PAGE_SIZE, i.e. 32 pages of 4 KiB = 128 KiB. No modern hardware uses pages smaller than 4 KiB, so you can treat it as a lower bound, unless you need to deal with some legacy embedded systems.

Total: 2 MiB on a typical machine. This limit is shared by envvars and the command line arguments. The calculation is a bit more complicated (see the execve man page): on a typical system, the limiting factor is the stack size limit, RLIMIT_STACK. Remember, initially the envvars are dumped on the stack! To prevent unpredictable crashes, the system allows only 1/4 of the stack for the envvars – with the common 8 MiB default stack, that’s 2 MiB.

But the fact that you can do something does not mean that you should. For example, if you start Bash with the “broken” environment – duplicated names and entries without an equals sign – it deduplicates the variables and drops the nonsense. One interesting edge case is a space inside the variable name. My beloved shell – Nushell – has no problem assigning a variable whose name contains a space, and Python is fine with it, too. Bash, on the other hand, can’t reference it, because whitespace isn’t allowed in variable names. Fortunately, the variable isn’t lost – Bash keeps such entries in a dedicated internal hashmap and still passes them to child processes.
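As a sketch of how liberal the kernel really is (again my own example, not the post’s), we can hand a child duplicated names and an entry with no ‘=’ at all:

```c
#include <unistd.h>

int main(void) {
    char *argv[] = { "env", NULL };
    char *envp[] = {
        "FOO=first",
        "FOO=second",   /* duplicate name: POSIX calls the consequences undefined */
        "JUSTANAME",    /* no '=' separator: the kernel accepts it anyway */
        NULL
    };
    /* `env` will print all three entries verbatim; a shell like Bash,
     * started with this environment, would deduplicate FOO and drop
     * the '='-less entry. */
    execve("/usr/bin/env", argv, envp);
    return 1;           /* only reached if execve failed */
}
```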
So what name and value can you safely use for your envvar? A popular misconception, repeated on StackOverflow and by ChatGPT, is that POSIX permits only uppercase envvars, and everything else is undefined behavior. But this is seriously NOT what the standard says:

These strings have the form name=value; names shall not contain the character ‘=’. For values to be portable across systems conforming to POSIX.1-2017, the value shall be composed of characters from the portable character set (except NUL and as indicated below). There is no meaning associated with the order of strings in the environment. If more than one string in an environment of a process has the same name, the consequences are undefined. Environment variable names used by the utilities in the Shell and Utilities volume of POSIX.1-2017 consist solely of uppercase letters, digits, and the <underscore> ( ‘_’ ) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation; applications shall tolerate the presence of such names. Uppercase and lowercase letters shall retain their unique identities and shall not be folded together. The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities.

Yes, POSIX-specified utilities use uppercase envvars, but that’s not prescriptive for your programs. Quite the contrary: you’re encouraged to use lowercase for your envvars so they don’t collide with the standard tools. The only strict rule is that a variable name cannot contain an equals sign. POSIX requires compliant applications to preserve all variables that conform to this rule. But in reality, not many applications use lowercase. The pragmatic etiquette in software development is to use conventional uppercase names with underscores, and UTF-8 for values. You shouldn’t hit problems on Linux. If you want to be super safe: instead of UTF-8, use the POSIX-mandated Portable Character Set (PCS) – essentially ASCII without control characters. That’s all I have on envvars – I hope it wasn’t a boring read.

Pinaraf's website 1 month ago

JIT: so you want to be faster than an interpreter on modern CPUs…

Since my previous blog entry about my JIT compiler for PostgreSQL, sadly not much happened due to a lack of time, but still some things were done (the biggest improvement was the port to ARM64, a few optimizations, implementing more opcodes…). But I am often asking myself how to really beat the interpreter… And on “modern” CPUs, with a well-written interpreter, that’s far more complicated than many would imagine. So in order to explain all this and show how I am planning to improve performance (possibly of the interpreter itself too, thus making this endeavor self-defeating), let’s first talk about modern CPUs. If you already know about superscalar execution, out-of-order execution and branch prediction, feel free to jump to the next section. Note that the following section is over-simplified to make the concepts more accessible.

I am writing this blog post on a Zen 2+ CPU. If I upgraded to a Zen 3 CPU, same motherboard, same memory, I would get an advertised 25% performance jump in single-thread benchmarks, while the CPU frequency would be only 2% higher. Why such a discrepancy? Since the 90s and the Pentium-class CPUs, x86 has followed RISC CPUs into the super-scalar era. Instead of running one instruction per cycle, when conditions are right, several instructions can be executed at the same time. Let’s consider pseudo-code along the lines of X = A + B; Y = C + D; Z = X + Y. Here X and Y can be calculated at the same time. The CPU can execute these on two integer units, fetch the results and store them. The only issue is the computation of Z: everything must be done before this step, making it impossible for the CPU to go further without waiting for the previous results. But now, what if the code was written as a chain where almost every line consumes the result of the previous one? Every step would require waiting for the previous one, slowing down the CPU terribly. Hence the most important technique used to implement superscalar CPUs: out-of-order execution. The CPU will fetch the instructions, dispatch them in several instruction queues, and resolve dependencies, for example to compute an independent Y before an intermediate Z1 that appears earlier in program order, so that Y is ready sooner. The CPU is spending less time idling, thus the whole thing is faster. But, alas, what would happen with a function that computes X and Y and then branches, producing Z from one or the other depending on a condition? Should the CPU wait for X and Y before deciding which Z to compute? Here is the biggest trick: it will try its luck and compute something anyway. This way, if its bet was right, a lot of time will be saved; otherwise the mistaken result will be dropped and the proper computation will be done instead. This is called branch prediction. It has been the source of many fun security issues (hello Meltdown), but the performance benefits are so huge that one would never consider disabling it.

Most interpreters will operate on an intermediate representation, using opcodes instead of directly executing from an AST or similar. So you could use a main loop built around a big switch on the current opcode, as sketched just below. This is how many, many interpreters were written. But this has a terrible drawback, at least when compiled that way: it has branches all over the place from a single starting point (most if not all optimizing compilers will generate a jump table to optimize the dispatch, but this will still jump from the same point). The CPU will have a hard time predicting the right jump, and is thus losing a lot of performance. If this was the only way an interpreter could be written, generating a function by stitching the code together would save a lot of time, likely giving a more than 10% performance improvement. If one looks at Python: removing this switch made the interpreter 15 to 20% faster.
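For reference, here is a minimal sketch of both dispatch styles – the switch-based main loop described above, and the computed-goto variant discussed next. The opcodes are hypothetical, and computed gotos are a GCC/Clang extension:

```c
#include <stdio.h>
#include <stddef.h>

typedef enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT } OpCode;
typedef struct { OpCode op; int arg; } Instr;

/* Variant 1: classic switch dispatch. Every opcode funnels back through
 * one central indirect jump, which is hard for the CPU to predict. */
static void run_switch(const Instr *prog) {
    int stack[64], sp = 0;
    for (size_t pc = 0; ; pc++) {
        switch (prog[pc].op) {
        case OP_PUSH:  stack[sp++] = prog[pc].arg;        break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
        case OP_PRINT: printf("%d\n", stack[sp - 1]);     break;
        case OP_HALT:  return;
        }
    }
}

/* Variant 2: computed gotos. Each opcode ends with its own indirect jump,
 * so the branch predictor gets a separate history per opcode. */
static void run_goto(const Instr *prog) {
    static void *labels[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
    int stack[64], sp = 0;
    size_t pc = 0;
#define DISPATCH() goto *labels[prog[pc].op]
    DISPATCH();
op_push:  stack[sp++] = prog[pc].arg;        pc++; DISPATCH();
op_add:   sp--; stack[sp - 1] += stack[sp];  pc++; DISPATCH();
op_print: printf("%d\n", stack[sp - 1]);     pc++; DISPATCH();
op_halt:  return;
#undef DISPATCH
}

int main(void) {
    const Instr prog[] = {
        { OP_PUSH, 2 }, { OP_PUSH, 40 }, { OP_ADD, 0 },
        { OP_PRINT, 0 }, { OP_HALT, 0 },
    };
    run_switch(prog);   /* prints 42 */
    run_goto(prog);     /* prints 42 */
    return 0;
}
```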
Many projects, including PostgreSQL, use this same technique, called “computed gotos”. After a first pass to fill in the “label” targets for each step, execution jumps directly, at the end of each opcode’s code, to the label of the next opcode, instead of going back through a single dispatch point (see the run_goto variant in the sketch above). When running a short sequence of operations in a loop, the jumps will be far more predictable, making the branch predictor’s job easier, and thus improving the speed of the interpreter.

Now that we have a very basic understanding of modern CPUs and the insane level of optimization they reach, let’s talk about fighting the PostgreSQL interpreter on performance. I will not discuss optimizing the tuple-deforming part (aka going from the on-disk structure to the “C” structure used by the code); this will be a topic for a future blog post when I implement it in my compiler. As you may know, PostgreSQL has a very complete type system with operator overloading. Even a simple comparison like a = 42 ends up being a call to int4eq, a strict function that will perform the comparison. Since it is a strict function, PostgreSQL must check that the arguments are not null; otherwise the function is not called and the result will be null. If you execute a very basic query like SELECT * FROM demo WHERE a = 42, PostgreSQL will run through a handful of opcodes, and the EEOP_FUNCEXPR_STRICT_2 opcode will perform the null checks and then call the function. If we unroll all the opcodes into real C code, we can already spot one optimization: why do we check the two arguments, including our constant, against null? The constant will never change for the entire run of this query, and thus each comparison is going to use an ALU and branch depending on that comparison. But of course the CPU will notice the corresponding branch pattern, and will thus be able to remain active and feed its other units. What is the real cost of such a pointless comparison? For this purpose, I’ve broken a PostgreSQL instance and replaced all FUNCEXPR_STRICT opcodes with a version checking one argument only, and another with no STRICT check at all (do not try this at home!). Doing 10 runs of a simple SELECT * FROM demo WHERE a = 42 on a 100-million-row table, with no index, the two perf results showed a difference of around 5%. So, even if this is not the optimization of the century, it’s not that expensive to make, so… why not do it? (Patch coming to pgsql-hackers soon.)

But a better optimization is to go all-in on inlining. Indeed, instead of jumping through a pointer to the int4eq code (again, something that the CPU will optimize a lot), one could have a special opcode for this quite common operation. With this change alone (but keeping the two null checks, so there are still optimizations possible), the perf results improve much more markedly. Let’s sum up these results. The biggest change comes, quite obviously, from inlining the int4eq call. Why is it that much better? Because it reduces by quite a lot the number of instructions to run, and it removes a call to an address stored in memory. And this is again an optimization I could do in my JIT compiler that can also be done in the interpreter, with the same benefits. The biggest issue here is that you must keep the number of opcodes within (unspecified) limits: too many opcodes could make the compiler’s job far worse. Well. At first, I thought the elimination of null checks could not be implemented easily in the interpreter. The first draft in my compiler was certainly invalid, but gave me interesting numbers (around 5%, as seen above) and made me want to go ahead.
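To recap what these experiments poke at, here is a rough sketch of the per-row work of a strict two-argument call – my simplified illustration, not PostgreSQL’s actual executor code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's value/null representation. */
typedef struct { int64_t value; bool isnull; } Slot;

/* What a strict call like int4eq conceptually costs for `a = 42`: both
 * arguments are null-checked, even though arg1 is the constant 42 and
 * can never be null for the whole run of the query. */
static Slot strict_int4eq(Slot arg0, Slot arg1) {
    Slot result = { 0, true };          /* strict: any NULL in -> NULL out */
    if (arg0.isnull || arg1.isnull)     /* the arg1 check is wasted work */
        return result;
    result.value  = ((int32_t)arg0.value == (int32_t)arg1.value); /* inlined int4eq */
    result.isnull = false;
    return result;
}
```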
And I realized that implementing it cleanly in the interpreter was far easier than implementing it in my JIT compiler… Then I went with optimizing another common case, the call to int4eq, and, well… one could also add an opcode for that in the interpreter, and thus the performance gains of the JIT compiler are going to be minimal compared to the interpreter. Modern CPUs don’t make my job easy here. Most of the cost of an interpreter is taken away by the branch predictor and the other optimizations implemented in silicon. So is all hope lost? Am I to declare the interpreter the winner against the limitations of the copy-patch method I have available for my JIT? Of course not – see you in the next post to discuss the biggest interpreter bottleneck! PS: help welcome. Last year I managed to spend some time working on this during my work time. Since then I’ve changed jobs, and can hardly get any time for this. I also tried to get some sponsoring to work on this and present it at future PostgreSQL conferences, with no luck :/ If you can help in any way with this project, feel free to reach out (code contributions, sponsoring, missions, job offers, nudge nudge wink wink). Since I’ve been alone on this, a lot of things are scribbles on scratch paper; I benchmark code and stuff in my head when life gives me some boring time, but testing it for real is of course far better. I have some travel planned soon, so I hope for the next part to be released before next year, with interesting results, since my experiments have been as successful as anticipated.

Corrode 1 month ago

Prime Video

Are you one of over 240 million subscribers of Amazon’s Prime Video service? If so, you might be surprised to learn that much of the infrastructure behind Prime Video is built using Rust. They use a single codebase for media players, game consoles, and tablets. In this episode, we sit down with Alexandru Ene, a Principal Engineer at Amazon, to discuss how Rust is used at Prime Video, the challenges they face in building a global streaming service, and the benefits of using Rust for their systems. CodeCrafters helps you become proficient in Rust by building real-world, production-grade projects. Learn hands-on by creating your own shell, HTTP server, Redis, Kafka, Git, SQLite, or DNS service from scratch. Start for free today and enjoy 40% off any paid plan by using this link . Prime Video is a streaming service offered by Amazon that provides a wide range of movies, TV shows, and original content to its subscribers. With over 240 million subscribers worldwide, Prime Video is one of the largest streaming platforms in the world. In addition to its vast content library, Prime Video also offers features such as offline viewing, 4K streaming, and support for multiple devices. On the backend, Prime Video relies on a variety of technologies to deliver its content, including Rust, which is used for building high-performance and reliable systems that can handle the demands of a global audience. Alexandru worked on the transition of Prime Video’s user interface from JavaScript to Rust. He has been with Amazon for over 8 years and previously worked at companies like Ubisoft and EA. He has a background in computer science and is an active open source maintainer. Alexandru lives in London. Ferris Makes Emulators Ep.001 - The Journey Begins - First episode of a famous series where Jake Taylor wrote a Nintendo 64 emulator in Rust from scratch CMake - Very common build system used in C++ applications Conan - C++ Package Manager community project C++ Smart Pointers - Still a footgun Herb Sutter: The Free Lunch Is Over - The seminal 2005 paper that highlights the importance of concurrency, well past C++’s mainstream adoption Rust in Production: cURL - Baseline library used everywhere, written in C, but performant and safe Prime Video Platforms - One app runs on all of these WebAssembly (WASM) - Enabling Rust code with good performance that you can still download and run like JavaScript, avoiding the need for firmware updates on some devices Entity Component System - Used in the UI Rust code for pages in the app Bevy - Game engine written in Rust Leptos - UI framework that makes reactive programming in Rust easier tokio - The de facto standard async runtime for Rust SIMD - A nice feature set some CPUs support WebAssembly Micro Runtime - A tiny WASM runtime well suited for IoT platforms WebAssembly Working Group Amazon Prime Video Rust & WASM for UI: Faster Prime Video on ANY Device - Alexandru Ene, QCon San Francisco 2024 Alexandru Ene on LinkedIn Alexandru’s Blog Alexandru Ene on GitHub

NorikiTech 2 months ago

Motivation for learning Rust

Part of the “Rustober” series. Rust is my next first-choice language for making things. It now possesses a combination of traits (get it?) that make it an obvious choice. I’m writing Swift full-time at work (and have been since 2014) and I would have happily picked Swift if it ever became what it was supposed to: Swift never became truly cross-platform — I was really hoping for it. Rust did. Apple butchered the language after 4.0 to make SwiftUI look nice. A lot of complexity was added even though the community wasn’t happy. Swift compile times get worse and worse, and they are never mentioned as a priority. Rust compile times constantly improve. Et cetera… my list of grievances with Swift is very long. I simply cannot use Swift, my strongest language, for any of the projects I do outside of work.

Frustrated, I have tried other languages in their strong domains. While I love Common Lisp, it’s simply too niche — I would need to implement most of what I need myself, and I have limited time. I couldn’t find in myself the will to commit to it without second-guessing. C++ is a solid choice for lower-level tasks, but nobody uses it to write web services, and half of my ideas are web services. I liked using C to write all of Typingvania. I understand its appeal and draw, but in C, most things felt tedious, and far from everything could be abstracted enough to be ergonomic. (Handling memory didn’t bother me; I learned most of my lessons through objects’ lifetimes in C++.) Finally, I wrote multiple things in Go, and it probably came closest to what I was looking for (which was the reason why I committed to it in the first place). However, I got progressively more disillusioned because of its C interop, its lack of constraints, the poor generics (too little, too late), and general inelegance (as a result of its nonexpressiveness). Yes, I could achieve a lot quickly and pick the project back up later, as long as I used Go exclusively, and not as part of something else.

For me, a major constraint is that I work on multiple things and sometimes leave them for a year or more, and then come back and work on them some more. All of the languages above were used for different projects. Consequently, I need to remember all of them at once if I need to pick a project up again. However, looking back, all of them could be done in Rust without major changes. Why haven’t I learned Rust earlier, then? I made an attempt a few years ago but found it too difficult for my taste. Considering the constraint above, it made sense: if it were one of many, it wouldn’t have worked. To work in big languages like C++ and Rust (or, let’s be honest, in any language except Go), you need to commit, otherwise after a year of not using it, it goes completely out of your head. Besides reviving your half-dead project, you’ll need to relearn the language as well. I also thought it was changing too fast (all of the above were changing much slower) and there was too much drama in the community, so its future was uncertain.

Yet in the last few years Rust has proven that it is stable, mature and developer-friendly. The language, tooling and the amount of available libraries are improving (in a good way). I believe it’s a solid career choice as well on the horizon of a few years. Hell, it even plays well with AI because of the strong type system and borrow checker. I’m starting to use it in earnest to build the first project in it; let’s see how it goes. Remarkably, it seems much closer to Swift than I expected, with common use of optionals, traits and enums with associated types.

P.S. C# and Java are missing from the above list. While I did use both for a time, I was never comfortable fully committing to either, because it’s a completely new set of libraries and ways of doing things.

JSLegendDev 2 months ago

You Can Now Make PS2 Games in JavaScript

I recently discovered that you could make PS2 games in JavaScript. I’m not even kidding, it’s actually possible. I was working on a project and had my phone near my desk when I received a notification. Upon further inspection, it came from itch.io, which was a platform where I usually published most of my web games. Under my relatively popular Sonic infinite runner game, which was made in JavaScript and developed a year ago, I received a comment from someone with the username Dev Will, who claimed they had made a PS2 version of my game and provided the GitHub repo of the source code. At first, I thought that it was cool that someone took the time to remake my game for an old console that had a reputation for being hard to develop for and probably required them to write a lot of C or C++. Out of curiosity, I opened up the GitHub repo and was astonished to see that the project was not using even a bit of C++ or C but was entirely in JavaScript! If making PS2 games was easier than I thought, since I could use a higher-level language like JavaScript, I could probably try making one in a reasonable amount of time and play it on a retro handheld or an actual PS2. How cool would that be? This is where I knew I had to drop everything I was doing to investigate how this was possible. Since the dev behind the project was Portuguese-speaking (I assume they were either from Brazil or Portugal), they wrote the readme of the repo in Portuguese, which was a language I did not understand. Fortunately, I was still able to decipher most of what was written because I had done 3 years of Spanish in school and spoke French natively. Since Portuguese is a Romance language like Spanish and French, I was fortunately not totally lost. Anyway, the readme said that the engine used to make the PS2 version of my game was called AthenaEnv, with a conveniently placed link to it so I could learn more. As with the Sonic Infinite Runner PS2 project, this engine was also open source and its repo had a very detailed readme written in English. To summarize, Athena was not what we commonly refer to as a game engine but an environment that also offered a JavaScript API for making games and apps for the PS2. It embedded a slightly modified version of QuickJS, which was a small and embeddable JavaScript engine. This explained how Athena was able to run JavaScript code on the PS2. Therefore, Athena was the PS2-native program written in C that took your JavaScript code, passed it through the QuickJS engine to interpret it and, finally, ran the relevant logic on the system. What made it compelling was not that it just ran JS on the PS2 but that it offered an API suitable for game development. It covered:
Rendering: allowing you to display sprites, text, shapes, etc. on the screen and animate them using a game loop.
Asset loading: allowing you to load images, sounds, fonts, etc.
Input handling: allowing you to receive player input from a controller, multiple ones, or even from a mouse and keyboard, since the PS2 supported these input methods.
File handling: allowing you to write save files, among other things.
Sound playback: for playing sound.
And the list goes on. I noticed, however, that the level of abstraction offered by the API was similar to something like p5.js, the HTML canvas API or Raylib. That meant that you’d still need to implement collision detection, scene management, etc. yourself. Now that I was familiar with Athena, I wanted to try to run the Sonic infinite runner “port” on an emulator.
According to the project’s readme, I needed to install PCSX2, which is the most popular emulator for the PS2; then go into the settings and, under the emulation tab, check the box “Enable host filesystem”. Once this was done, I would need to open an athena.elf file and the game would start. After installing and configuring the emulator, I was ready to run the game. However, there was a problem. I could not find the athena.elf file in the repo. It was nowhere to be found. This is where I remembered to look at the “releases” section of the repo, because a lot of open source projects put executables there, especially if it’s a mobile or desktop app project. As expected, the zip attached in that section contained the athena.elf file, but not only that. It also contained an assets folder, a main.js file, an athena.ini file and a src folder containing the rest of the game’s code. The athena.ini file allowed you to configure the entry point of the project. Here, the entry point was set to main.js, which explained how Athena would know what JavaScript to run. You could also configure whether you wanted to show Athena’s logo before your game started by setting the boot_logo property to true. It now became evident why we needed to check the “Enable host filesystem” checkbox earlier: this was so that the emulator could allow Athena to access the assets folder and the source code that were essential for our game. Anyway, I opened the athena.elf file in PCSX2 and, surprisingly, the game actually ran with no issues. It was amazing to see that a game I wrote for the web was ported to the PS2, and I was there able to play it with a controller. Now, the game looked a bit blurry, which was expected, since this was supposed to emulate a PS2, which had a small resolution. Fortunately, I was able to make things more comfortable by upping the resolution in the graphics settings of the emulator. The dev process also seemed quite straightforward. You would only need to open the folder containing all the relevant files (athena.elf, main.js, etc.) in a code editor like VSCode and open athena.elf in the emulator. Now, you could make changes to your JS code, and once you were ready to test, you would go under the PCSX2 system tab and click on reset. This would restart the emulator and you could see the latest changes. While not as seamless as in web development with hot reloading, it still was a relatively fast iteration cycle. It’s at that moment that I knew I had to make a post about it and share this awesome project with you. However, I still felt uneasy about one thing. Nowadays, people download PS2 games as .iso files. For most games, you only need one .iso file that you then open in your emulator. Less technical people can therefore more easily enjoy these older titles. However, to run the Sonic infinite runner game “port”, I needed to not only check a box in the settings but also needed the entire project’s folder containing the Athena executable and the source code. I wondered if, instead, there was a way to distribute the game as a single .iso file. This is where I simply went back to the itch.io comment section and asked if it was possible. After a thorough back and forth that continued on Discord, the process to convert my files into a single, distributable iso was now clear. To make an iso you needed the following files:
athena.elf: the Athena executable.
athena.ini: for configuring the project’s entry point.
A JS file acting as the entry point of the codebase.
The rest of your source code, if your code is more than one file; oftentimes it’s in a folder called src.
Two files, one named ATHA_000.01 and the other SYSTEM.CNF, needed to make the iso bootable.
As an aside, in case you want to also get into JavaScript PS2 game development, you can check this template I made containing all of the files needed. Once you had all the files, you had to make a zip archive containing them all. One issue I had was that if I created a zip out of the folder containing the files, the resulting .iso would not work. However, if I selected the files one by one and then created the zip, I would experience no issues. This is something to keep in mind. Now, the only step left was to convert the zip into an iso. As I was using a Mac, the only reliable way I found was to use the website mconverter.eu and let them do the conversion. However, the issue with this website is that you’re limited in the number of conversions you can do per day before they ask you to pay. Additionally, if your zip archive is above a certain size, you’ll also have to watch an ad before you can do the conversion. If you end up finding a better way, using either a CLI tool, a downloadable app or some other website, feel free to share it in the comment section. Once you had the iso, you could open it up in the emulator like you would do with other PS2 games. You also didn’t need to check the “Enable host filesystem” option anymore, since all the relevant files were included in the iso. If the game booted correctly, then you now had a single file you could distribute, which was very convenient. It was now time to get my feet wet. Before attempting anything too complicated, my goal was to create a simple “Hello World” example where I would:
Load some assets (in my case, a font and an image).
Set up a game loop that would run every frame.
Animate a sprite using that game loop.
Render text.
Handle player input so I could move a sprite around.
Before I could achieve any of these sub-goals, in main.js, I first defined a few constants that I would end up needing. This is where I learned that you could get the screen’s width and height by first using the Screen module, available globally like all Athena-provided modules (meaning that no import statements were needed), and then calling the getMode method. Then, to have a stable frame rate and accurate FPS counting, I needed to call the methods setVSync() and setFrameCounter(). With the setup completed, I wanted to load the font I used in my Sonic game and a spritesheet of Sonic so that I could later animate it. I could achieve this by creating an instance of the Font and Image classes offered by Athena. While I planned on handling player input later, I still needed a way to get the player’s controller so that my code could know when a given button was pressed. This was made possible by using Athena’s Pads module. Before I could create a game loop, I needed to first write the setup code required to animate my spritesheet. Since all the frames were contained within a single image, I had to find a way to tell Athena what part of the image was to be rendered. To achieve this, I first spent some time getting familiar with the shape of the sprite object created earlier. It turned out that we could set the width and the height of the sprite by modifying the properties of the object with the same names. It also turned out that you could tell Athena what portion of the image to draw by setting the startx, endx, starty, endy properties.
For example, if you had the following values: startx = 0, endx = 32, starty = 0 and endy = 44, you would get the first frame rendered. This is because, in the spritesheet, every frame has a width of 32 and a height of 44. Also, the origin (0,0) corresponds to the top-left corner of the spritesheet. Now that I knew how to display a single frame within a wider image, I used the following logic to set up Sonic’s run animation. I first created an object called spritePos to set the position of the sprite on the screen. This was needed to be able to move it around when the player pressed directional buttons on the D-pad. More on that later. Then I would set the sprite’s width and height to correspond to the width and height of a single frame, which was 32x44 pixels. Since I wanted the sprite to appear big enough, I multiplied the width and height by a value defined by the SCALE constant we set earlier in our code. The next step consisted in creating an array called runAnimFrames, which would describe each frame of Sonic’s run animation using an object with the startx, endx, starty and endy properties. We then had a frameIndex variable, which would determine the current frame to display. The frameDuration constant would be used to set how long, in milliseconds, to display each frame. The lower the number, the higher the frame rate of the animation, because we would flip through all the frames faster. Finally, I initialized a timer coming from a custom Timer class that I added in my src folder and imported here. The full code is available in the template mentioned earlier. The timer would end up being crucial to know when it was time to move on to displaying another frame. Now that we had our animation logic set up, it was time to render the animation. For this purpose, I needed a game loop that runs every frame. In Athena, we could achieve this by calling the display method available under the Screen module. In an if statement, we would check if the timer had exceeded the time allocated to displaying the current frame. If it was the case, we would move on to the next frame by incrementing the frameIndex as long as it was within the bounds of the runAnimFrames array; otherwise, we would set it back to 0 to display the first frame. This was to achieve a looping animation. Then, on every iteration of the game loop, we would set the sprite’s startx, endx, starty, endy properties to correspond to the ones of the current frame. Finally, to render the sprite, we needed to call its draw method and pass to it the coordinates where we wanted to display it on the screen. Now that I had a game loop, I could finally handle user input by making sure that the sprite would move in different directions depending on which button was pressed. This could be easily achieved with a few if statements. You might be wondering: where is deltaTime? For those unfamiliar, deltaTime is a value representing the time elapsed between the current frame and the previous frame in a game. It’s often used to make the movement of objects frame rate independent, meaning that if your game runs at a lower or higher frame rate, an object, like a character, will still move at the same rate. To achieve frame rate independence, you would usually multiply your movement code by deltaTime. The reason it was absent here is that, when creating a game loop using the display method, this matter is taken care of under the hood.
Now that I could move Sonic around, I still needed him to face the correct direction, because at this point he would look right even if I moved him to the left. To implement this, I decided to go with a common technique in pixel-art-based games, which consisted in mirroring (or flipping) the sprite. To achieve this in Athena, you simply needed to provide a negative width or height to the sprite, depending on which axis you wanted the mirroring to take effect on. For flipping a sprite horizontally, providing a negative width was enough. However, an issue arose! If you flipped the sprite, it would not flip in place, since it would flip according to the sprite’s origin, which was its top-left corner. This meant that it would move the sprite to the left after mirroring. To fix this issue, you only needed to subtract from the x coordinate of the flipped sprite an offset that corresponded to its width. Now that the issue was solved, I created a variable called spriteIsFlippedX to know when to flip or unflip the sprite. Now, when you moved Sonic to the left, he would face left, and he would face right when moved to the right. There was still one thing I wanted to try out before wrapping up my Hello World example, and that was text rendering. The first thing I wanted to render onto the screen was an FPS counter. It turned out that the FPS counter in the PCSX2 emulator is not accurate; however, Athena provides the getFPS() method, available on the Screen module, to accurately determine the frame rate. To display some text, you needed to first create a font object using the Font constructor. It would take either a path to a font, which can be in a .ttf format, or the string “default” if you wanted to use the default font available on the system. Once created, the font object had a print method that you could use within the game loop to tell the PS2 what to render and where on the screen. Finally, my Hello World example was finished. Now that you’ve been introduced to Athena, you might be tempted to try it out for yourself. In that case, I really recommend looking at the Sonic infinite runner Athena port’s code, as you’ll learn a lot about concepts that I did not have time to cover here. Link to the repo here: https://github.com/DevWill-hub/Sonic-Infinite-Runner-PS2 Link to my Athena template: https://github.com/JSLegendDev/Athena-PS2-Template Link to the Athena project: https://github.com/DanielSant0s/AthenaEnv Additionally, I recommend joining the official Athena Discord, where you’ll be more likely to receive help when stuck. You can join here: https://discord.gg/cZUH5U93US Before wrapping up this post, you might have found it strange that nothing was mentioned about 3D, considering that the PS2 was mostly known for its 3D games. This is for 2 reasons. First, I’m a novice in terms of 3D game development; I have never done it before. Second, to my understanding, Athena has both 2D and 3D capabilities, but version 4, which has more of a 3D focus, is currently in development. I thought it would be preferable to wait until v4 was stable before diving into PS2 3D gamedev in JavaScript. However, there are a few 3D demos you can check out if you’re interested. Links down below. https://github.com/ps2devnoob/3D-Robot-DEMO-PS2 https://github.com/ps2devnoob/BurnTrack-PS2-DEMO https://github.com/ps2devnoob/AthenaENV-PS2-Samples To conclude, Athena is a cool project allowing you to make real PS2 games in JavaScript.

Matthias Endler 2 months ago

On Choosing Rust

Since my professional writing on Rust has moved to the corrode blog, I can be a bit more casual on here and share some of my personal thoughts on the recent debate around using Rust in established software. The two projects in question are git (kernel thread, Hacker News discussion) and the recently rewritten coreutils in Rust, which will ship with Ubuntu 25.10 Questing Quokka. What prompted me to write this post is a discussion on Twitter and a blog post titled “Are We Chasing Language Hype Over Solving Real Problems?”. In both cases, the authors speculate about the motivations behind choosing Rust, and as someone who helps teams use Rust in production, I find those takes… hilarious. Back when I started corrode, people always mentioned that Rust wasn’t used for anything serious. I knew about the production use cases from client work, but there was very little public information out there. As a consequence, we started the ‘Rust in Production’ podcast to show that companies indeed choose Rust for real-world applications. However, people don’t like to be proven wrong, so that conspiracy theory has now morphed into “Big Rust” trying to take over the world. 😆 Let’s look at some of the claims made in the blog post and Twitter thread and see how these could be debunked pretty easily. “GNU Core Utils has basically never had any major security vulnerabilities in its entire existence” If only that were true. A quick CVE search shows multiple security issues over the decades, including buffer overflows and path traversal vulnerabilities. Just a few months ago, a heap buffer under-read was found in one of the utilities, which would cause a leak of sensitive data if an attacker sends a specially crafted input stream. The GNU coreutils are one of the most widely used software packages worldwide, with billions of installations and hundreds (thousands?) of developers looking at the code. Yes, vulnerabilities still happen. No, it is not easy to write correct, secure C code. No, not even if you’re extra careful and disciplined. The source of ls is five thousand lines long (check out the source code). That’s a lot of code for printing file names and metadata, and a big attack surface! “Rust can only ever match C performance at best and is usually slower” Work by Trifecta shows that it is possible to write Rust code that is faster than C in some cases, especially in concurrent workloads, and with memory safety guarantees. If writing safe C code is too hard, try writing safe concurrent C code! That’s where Rust shines. You can achieve ridiculous levels of parallelization without worrying about security issues. And no, you don’t need to litter your code with unsafe blocks. Check out Steve Klabnik’s recent talk about Oxide, where he shows that their bootloader and their preemptive multitasking OS, hubris – both pretty core systems code – only contain 5% unsafe code each. You can write large codebases in Rust with no unsafe code at all. As a trivial example, I sat down to rewrite one of the coreutils in Rust one day. The result was 3x faster than GNU on my machine. You can read the post here. All I did was use a different system call to copy data, which saves one memory copy. Performance is not only dependent on the language but on the algorithms and system calls you use. If you play into Rust’s strengths, you can match C’s performance. At least there is no technical limitation that would prevent this. And I personally feel more willing to aggressively optimize my code in Rust, because I don’t have to worry about introducing memory safety bugs. It feels like I’m not alone.
“We reward novelty over necessity in the industry”

This ignores that most successful companies (Google, Meta, etc.) primarily use battle-tested tech stacks, not bleeding-edge languages. These companies have massive codebases and cannot afford to rewrite everything in the latest trendy language. But they see the value of using Rust for new components and gradually rewriting existing ones. That’s because 70% of security vulnerabilities are memory safety issues, and these issues are extremely costly to fix. If these companies could avoid switching to a new language to do so, they would. Besides, Rust is not exactly new anymore. Rust 1.0 was released 10+ years ago! The industry is moving slowly, but not that slowly. You’d be surprised to find out how many established companies use Rust without even announcing it or thinking of it as “novelty”.

“100% orchestrated”

Multiple people in the Twitter thread were convinced this is some coordinated master plan rather than developers choosing better tools, while the very maintainers of git and coreutils openly discussed their motivations in public forums for everyone to see.

“They’re trying to replace/erase C. It’s not going to happen”

They are right. C is not going away anytime soon. There is just so much C/C++ code out there in the wild, and rewriting everything in Rust is not feasible. The good news is that you can incrementally rewrite C/C++ code in Rust, one component at a time. That’s what the git maintainers are planning, by using Rust for new components.
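To illustrate what “one component at a time” means in practice, here is a hypothetical sketch of such a boundary from the C side (all names are invented for illustration; this is not git’s or coreutils’ actual code). The caller compiles against a plain C declaration, so whether the symbol behind it comes from the legacy C file or from a Rust staticlib exporting it via `#[no_mangle] pub extern "C"` is purely a link-time decision:

```c
/* Hypothetical component boundary -- all names made up. In a real
 * tree the declaration would live in a header such as checksum.h,
 * and the definition below in checksum.c. */
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* The stable C interface every caller compiles against. */
uint32_t buf_checksum(const uint8_t *buf, size_t len);

/* The legacy C implementation -- the piece that could later be
 * replaced by a Rust staticlib exporting the same unmangled symbol.
 * (FNV-1a here, purely as a stand-in.) */
uint32_t buf_checksum(const uint8_t *buf, size_t len) {
    uint32_t sum = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        sum ^= buf[i];
        sum *= 16777619u;
    }
    return sum;
}

int main(void) {
    const uint8_t data[] = "incremental rewrite";
    /* sizeof data - 1 excludes the trailing NUL */
    printf("checksum: %08" PRIx32 "\n",
           buf_checksum(data, sizeof data - 1));
    return 0;
}
```

Swapping `checksum.o` for a `librs_checksum.a` on the link line replaces the implementation without touching any call site; that is the rough shape of the migration path being discussed.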
“They’re rewriting software with a GNU license into software with an MIT license”

Even if you use Rust, you can still license your code under the GPL or any other license you want. Git itself remains GPL, and many Rust projects use various licenses, not only MIT. The license fear is often brought up by people who don’t understand how open source licensing works, or it might just be FUD. MIT code is still compatible with GPL code, and you can use both of them in the same project without issues. It’s just that the end product (the thing you deliver to your users, i.e. binary executables) is now covered by the GPL because of its virality.

“It’s just developers being bored and wanting to work with shiny new languages”

The aging maintainers of C projects are retiring, and there are fewer new developers willing to pick up C just to maintain legacy code in their free time. C developers are essentially going extinct. New developers want to work with modern languages, and who can blame them? Or would you want to maintain a 40-year-old COBOL codebase or an old Perl script? We have to move on.

“Why not build something completely new instead of rewriting existing tools?”

It’s not that easy. The code is only part of the story. The other part is the ecosystem, the tooling, the integrations, the documentation, and the user base. All of that takes years to build. Users don’t want to change their workflows, so they want drop-in replacements. Proven interfaces and APIs, no matter how crude and old-fashioned, have a lot of value. But yes, new tools are being built in Rust as well.

“They don’t know how to actually solve problems, just chase trends”

Talk about dismissing the technical expertise of maintainers who’ve been working on these projects for years or decades and understand the pain points better than anyone. If they were just chasing trends, they wouldn’t be maintaining these projects in the first place! These people are some of the most experienced developers in the world, and yet people want to tell them how to do their jobs.

“It’s part of the woke mind virus infecting software”

Imagine thinking memory safety is a political conspiracy. Apparently preventing buffer overflows is now an ideological stance. The closest thing to politics here is the White House’s technical report, which recommends memory-safe languages for government software; mandating memory safety for software receiving federal funding is a pretty reasonable take.

Conclusion

I could go on, but I think you get my point. People who give Rust an honest chance know that it offers advantages in terms of memory safety, concurrency, and maintainability. It’s not about chasing hype but about long-term investment in software quality. As more companies successfully adopt Rust every day, it increasingly becomes the default choice for many new projects. If you’re interested in learning more about using Rust in production, check out my other blog or listen to the Rust in Production podcast. Oh, and if you know someone who posts such takes, stop arguing and send them a link to this post.

Dayvster 2 months ago

Are We Chasing Language Hype Over Solving Real Problems?

## Intro

As you may have heard or seen, there is a bit of controversy around Ubuntu adopting a rewritten version of GNU Core Utils in Rust. This has sparked a lot of debate in the tech community. This decision by Canonical got me thinking about this whole trend or push of rewriting existing software in Rust, which seems to be happening a lot lately. To put it bluntly, I was confused by the need to replace GNU Core Utils with a new implementation: GNU Core Utils has been around since arguably the 90s (and more realistically the 2000s), and it has been battle tested and proven to be reliable, efficient, effective and, most importantly, secure, as it has basically never had any major security vulnerabilities in its entire existence. So why then would we deem it necessary to replace it with a new implementation in Rust? Why would anyone go through the trouble of rewriting something that already works perfectly fine and has been doing so for decades, when the end result at best is going to be a tool that does the same thing as the original and, in the very best case scenario, offers the same performance?

What bothers me even more is the bigger pattern this points to. Are we as developers more interested in chasing new languages and frameworks than actually solving real problems? I strongly subscribe to the idea that software development is 60% problem solving and 40% creative exploration and innovation. But lately it feels like the balance is shifting more and more towards the latter. We seem to be more interested in trying out the latest and greatest languages and frameworks than actually solving real problems.

## The Hype of New Languages and Shiny Object Syndrome

We've all been there in the past, haven't we? Getting excited about a new programming language or framework that promises to solve all our problems and make our lives easier. It's easy to get caught up in the hype and want to try out the latest and greatest technology, but it's important to remember that just because something is new doesn't mean it's better. We need to be careful not to fall into the trap of "shiny object syndrome", where we chase after the latest trends without considering whether they actually solve our problems or improve our workflows. It's important to evaluate new technologies based on their merits and how they fit into our existing systems and workflows, rather than simply jumping on the bandwagon because everyone else is doing it.

Now for the important question: **Do I think the developers of coreutils-rs are doing this just because Rust is the new hotness?** Short and simple: No, no I do not. I believe they have good intentions and are likely trying to improve upon the existing implementation in some way. However, I do not agree with them that there is a need for a rewritten version of GNU Core Utils in Rust. I also do not agree that GNU Core Utils is inherently insecure or unsafe.

### Why do we get Excited About New Languages?

It's also important to briefly touch upon the psychological aspect of why we get excited about new languages. New languages often come with new features, syntax, and paradigms that can be appealing to developers. They may also promise to solve problems that existing languages struggle with, such as performance, concurrency, or memory safety. Additionally, new languages can offer a fresh perspective on programming and can inspire creativity and innovation. Not to mention the community aspect: new usually means a changing of the guard, new people, new ideas, new ways of thinking about problems.
All of these factors can contribute to the excitement and enthusiasm that developers feel when a new language is introduced. This enthusiasm can sometimes lead to an almost zealous approach of wanting everything and anything to be written only in the new language by this new and fresh community of developers. This can lead to a situation where existing and well-established software is rewritten in the new language, even if there is no real need for it. This can be seen as a form of "language evangelism", where developers are trying to promote their favorite language by rewriting existing software in it.

## The Case of GNU Core Utils

As I've briefly touched upon earlier, GNU Core Utils is a collection of basic file, shell and text manipulation utilities that are fundamental to the operation of Unix-like operating systems. These utilities include commands like `ls`, `cp`, `mv`, `rm`, `cat`, `echo`, and many others. They are essential for performing everyday tasks in the command line interface (CLI) and are used by system administrators, developers, and users alike. Some of these can run hundreds of times per second, so performance is absolutely crucial. Even a small reduction in the performance of a utility that is run by some OS-critical daemon can have a significant impact on the overall performance of the system. GNU Core Utils has been optimized for this for about 30+ years at this point; is it really worth just tossing all of those lessons and optimizations out the window just to rewrite it in a new language?

I've also briefly touched upon the claim that, in the absolute **best case scenario**, a rewritten version of GNU Core Utils in Rust would be able to match the performance of the original implementation. As we know, GNU Core Utils is mostly written in C, with some C++ mixed in sparingly. So far, benchmarks have shown time and time again that at best, with a lot of optimizations and tweaks, Rust can only ever match the performance of C, and in most cases it is actually slower. So the best case outcome of this rewrite is that we get a tool that does the same thing as the original and at best offers the same performance. So what is the actual benefit of this rewrite? Where is the value? What is the actual problem that is being solved here?

## When Hype Overshadows Real Problems

This is the crux of the issue: it's very, very easy to get swept up in the excitement of a new language and want to use it for everything and anything under the sun. As developers we love novelty and communities with enthusiasm and fresh ideas. It's stimulating, it's fun, it feels like progress; it feels like we are finally building something again instead of just rehashing and maintaining. We all know from personal experience that creating a new project is more fun and enjoyable than maintaining an existing one, and this is a natural human tendency. Now, do I think this is one of the reasons the developers of coreutils-rs are doing this? Yes, I do. But in the end they are solving a problem that does not exist.

### It's not just about Core Utils

Now, with how often I've mentioned this specific example of GNU Core Utils, you might think I want to single them out or have some sort of grudge or specific issue with this particular project. No, not really... I think this project is indicative of the larger issue we face in the tech community. It's very easy to get caught up in the excitement of new languages and frameworks and want to use them for everything and anything.
This can lead to a situation where we are rewriting existing software in new languages without considering whether it actually solves any real problems or improves our workflows.

## Problem Solving Should Be Our North Star

At the end of the day, software development is about solving real problems, not about chasing novelty, shiny new languages, or personal curiosity. Every line of code we write, every framework we adopt, every library we integrate should be justified by the problems it helps solve, not by the hype around it. Yet increasingly, it feels like the industry is losing sight of this. We celebrate engineers for building in the "new hot language", for rewriting tools, or for adopting the latest framework, even when the original solution worked perfectly fine. We reward novelty over necessity, excitement over impact.

This is not a phenomenon isolated to just core utils or systems programming; it happens in web development, mobile development, data science, and pretty much every other area of software development. We too often abandon tried and true solutions and ideas in favor of new and exciting shiny ones, without considering whether they actually solve any real problems or improve our workflows. For example, web development went full circle with React Server Components, where we went from separation of concerns straight back to PHP-style mixing of HTML and logic, server rendering and delivering interactive components to the client. Or the whole GraphQL craze, where traditional REST APIs were abandoned en masse for a new and exciting way of doing things that promised to solve the dreaded problems of "over-fetching" and "under-fetching" of data, yet in reality introduced a whole new set of problems and complexities that were not present in traditional REST APIs. Or perhaps the whole microservices and microfrontend craze, where a lot of projects were abandoned or rewritten to be split into smaller and smaller pieces.

**Was it all bad? Should we always just stick to what works and only ever maintain legacy projects and systems? Heck no!** There is definitely a place for innovation and new ideas in software development. New languages, frameworks, and tools can bring fresh perspectives and approaches to problem-solving. However, it's important to evaluate these new technologies based on their merits and how they fit into our existing systems and workflows, rather than simply jumping on the bandwagon because everyone else is doing it. We need to be more critical and thoughtful about the technologies we adopt and the projects we undertake. We need to ask ourselves whether a new language or framework actually solves a real problem or improves our workflows, or if we're just chasing novelty for its own sake.

## A Final Thought

At the end of the day, it’s not about Rust, React, GraphQL, or the latest microservices fad. It’s about solving real problems. Every line of code, every framework, every rewrite should have a purpose beyond curiosity or hype. We live in a culture that rewards novelty, celebrates “cool” tech, and often mistakes excitement for progress. But progress isn’t measured in how many new languages you touch, or how many shiny rewrites you ship; it’s measured in impact, in the problems you actually solve for users, teams, and systems. So next time you feel the pull of the newest hot language, framework, or tool, pause.
Ask yourself: “Am I solving a real problem here, or just chasing excitement?” Because at the end of the day, engineering isn’t about what’s trendy; it’s about what works, what matters, and what actually makes a difference. And that, my friends, is the craft we should all be sharpening.

Dayvster 2 months ago

Why I Still Reach for C for Certain Projects

## The Project That Made Me Choose C Again

So a while back I got tired of X server and decided to give Wayland a try. I use Arch (BTW) with a tiling window manager (dwm), and I wanted to try something new, since I had a couple of annoyances with X server. I'd heard some good things about Wayland, so I thought, you know what, why not, let's give it a shot. After 30-45 minutes my Arch and Hyprland setup was done and ready to go. I was pretty happy with it, but I was missing some features I'd previously had, such as notifications for when somebody posts new content to their RSS feed. I quite like RSS feeds; I use them to keep up to date with blogs, news, streams, releases, etc. So I thought to myself, why not write a small program that checks my RSS feeds and sends me a notification when there's something new.

Now, the way I was going to go about this was pretty simple. I would write a simple C daemon that would run in the background, check my RSS feeds at a random interval between 5-15 minutes, and if there was something new it would send out a notification using `notify-send`. Then comes the tricky part: I wanted `swaync` to let me offer two buttons on the notification, one to open the link in my default browser and one to ignore the notification and mark it as ignored in a flat file on my system, so that I could see if I have to remove certain feeds that I tend to ignore a lot. Now here's the problem: you can't really do that with `swaync`. I mean, it does support buttons, but it doesn't really let you handle the button clicks in a way that would allow you to do what I wanted. So I had to come up with a workaround. The workaround was to have another C program run as a daemon that would listen on the D-Bus for a very specific notification that would contain an action with a string of `openUrl("url")` or `ignoreUrl("url")` and then handle the action accordingly. This way I could have `swaync` send out a notification with the buttons, and when the user (me) clicks on one of the buttons, it would send out a D-Bus signal that my second daemon would pick up and handle.

## Why C Was the Right Choice Here

Now you might be wondering why I chose C for this project and not something like Python, which would have allowed me to write this specific program much faster and enjoy the rest of my weekend. Well, my answer to you is simple: I don't like Python, and I didn't feel like wasting multiple hundreds of MB of RAM on something this simple and small. In fact, I DID write it in Python and Go first with their respective D-Bus libraries; both were super easy to work with, and I had an initial working prototype within less than an hour. But as I ran a simple `htop`, I saw that the Python version was using around 150 MB of RAM and the Go version was using around 50 MB of RAM. Now don't get me wrong, I'm not on an old ThinkPad; I have RAM to spare, 32 GB to be exact. But why waste it on something this small and simple? Plus, C is just so much more fun and exciting to work with. So I set up my C project, which was just a simple `Makefile` and a couple of `.c` and `.h` files. I imported `dbus/dbus.h` and got to work. Now, I'd be lying if I said it took me no time at all; in fact, it took me roughly 3-4 hours, which is a lot longer than Python or Go took. But in the end I had a working prototype that was using around 1-2 MB of RAM and was super fast and responsive. I also got to brush up on my D-Bus skills, which were a bit rusty since I hadn't worked with D-Bus in a while.
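For a sense of what that second daemon boils down to, here is a minimal sketch of the listener using libdbus (the `dbus/dbus.h` API mentioned above; link with `-ldbus-1`). It assumes the notification server broadcasts the standard `ActionInvoked` signal from the desktop notification spec, and `handle_action()` is a hypothetical stand-in that just prints instead of spawning a browser or writing the ignore file:

```c
/* Minimal sketch of a session-bus listener for notification actions.
 * Assumes the notification server (swaync here) emits the standard
 * org.freedesktop.Notifications ActionInvoked signal, whose payload
 * is (UINT32 notification id, STRING action key). The openUrl()/
 * ignoreUrl() convention is from the post; handle_action() is a
 * made-up placeholder. Build: cc listener.c -ldbus-1 */
#include <dbus/dbus.h>
#include <stdio.h>
#include <string.h>

static void handle_action(const char *action) {
    if (strncmp(action, "openUrl(", 8) == 0)
        printf("would open browser for: %s\n", action);
    else if (strncmp(action, "ignoreUrl(", 10) == 0)
        printf("would append to ignore file: %s\n", action);
}

int main(void) {
    DBusError err;
    dbus_error_init(&err);

    DBusConnection *conn = dbus_bus_get(DBUS_BUS_SESSION, &err);
    if (!conn) { fprintf(stderr, "dbus: %s\n", err.message); return 1; }

    /* Ask the bus to route the notification server's signals to us. */
    dbus_bus_add_match(conn,
        "type='signal',interface='org.freedesktop.Notifications',"
        "member='ActionInvoked'", &err);
    dbus_connection_flush(conn);

    for (;;) {
        dbus_connection_read_write(conn, -1);   /* block for traffic */
        DBusMessage *msg;
        while ((msg = dbus_connection_pop_message(conn)) != NULL) {
            if (dbus_message_is_signal(msg,
                    "org.freedesktop.Notifications", "ActionInvoked")) {
                dbus_uint32_t id;
                const char *action = NULL;
                if (dbus_message_get_args(msg, &err,
                        DBUS_TYPE_UINT32, &id,
                        DBUS_TYPE_STRING, &action,
                        DBUS_TYPE_INVALID))
                    handle_action(action);
                else
                    dbus_error_free(&err);
            }
            dbus_message_unref(msg);
        }
    }
}
```

The blocking `dbus_connection_read_write` loop is what keeps the resident footprint tiny: the process sleeps in the kernel until the bus has something for it.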
Now, getting a simple program like that from 150 to 50 to 1-2 MB of RAM usage is a huge improvement, and it really showcases the power and strength of C as a programming language. Look at it this way: I may have spent multiple hours longer writing this program in C, but it will just continue to run in the background, using a fraction of the resources that Python or Go would have used, for a long time to come. Additionally, this probably won't be the only modification I make to my system. Now imagine if I were this careless with 10-20 small programs that I run in the background, and let's make a wild assumption that each of those programs would use about 50-200 MB of RAM. Now we're looking at 500-4000 MB of RAM usage, or precisely one Chrome tab! No, I don't think that's an acceptable tradeoff for a couple of hours of my time. So in the end, I think C was the right choice for this project, and I'm pretty happy with the result.

## Why Modern Languages Aren't Always the Best Fit

That long rant above is a cute story/anecdote, but I feel like it also highlights a bigger issue that developers face today. A lot of times we just want to complete a certain task as quickly as possible and have it running before we've had our second coffee. I get it, we're all busy people, and when resources are as cheap as they are today, saving a few MB of RAM or a few ms of CPU time isn't really a big deal. But there is something so nice and satisfying about writing a small program in C that does exactly what you want it to do and nothing more. It's like a breath of fresh air in a world where everything is becoming more and more bloated and resource hungry. It feels a bit zen, taking everything down to the bare minimum and just focusing on the task at hand. I'd liken developing in C to woodworking: you have to be precise and careful with your cuts, but in the end you get a beautiful product that you can be uniquely proud of. Sure, you could go pick up a flat pack from your local IKEA and have a nice-looking table within an hour. But it will be soulless and generic and not really made well; not something you can proudly say to your visitors, "Hey, I built this myself". I mean, you could... but they probably won't be as impressed with your flat-pack assembly as they would be with a hand-crafted table you made yourself.

## C Excels in Low-Level System Programming and Efficiency

So maybe the example project that I described at the beginning of the article is too rickety or not really highbrow enough; that's fair, I get that. But even if you think my project was stupid or silly or could have been done better or easier, there is no denying that C simply is the gold standard for low-level systems programming and efficiency. C gives you unparalleled control over system resources, memory management, and performance optimization. This makes it the go-to choice for developing operating systems, embedded systems, and performance-critical applications where every byte of memory and every CPU cycle counts. For example, the Linux kernel, which powers a vast majority of servers, desktops, and mobile devices worldwide, is primarily written in C. This is because C allows developers to write code that can directly interact with hardware and manage system resources efficiently.
If you drive any modern vehicle built in the past decade or so, chances are you have multiple microcontrollers in your car running C code to manage everything from engine performance to emergency braking systems (AEB, ABS) and distance control systems (ACC). For these systems you are very limited in the amount of RAM you can afford, and most importantly you cannot afford much latency or delay in processing. Now, how often have you heard about these systems failing and leading to fatal crashes? Not very often, I'd wager. Of course safety standards, regulations and certifications play a big role in this, but the choice of programming language is also a big factor. C is a mature and well-established language with a long history of use in safety-critical systems, making it a reliable choice for such applications. It does its job exceptionally well, it does it with a very small footprint, and most importantly you can run it in perpetuity on very limited hardware without worrying about failures or issues due to resource exhaustion.

## Real World Scenarios Where C Shines

Here are a few real-world scenarios where C is the preferred choice:

- **Operating Systems**: As mentioned earlier, the Linux kernel is written in C. Other operating systems like Windows and macOS also have significant portions written in C.
- **Embedded Systems**: C is widely used in embedded systems development for devices like microcontrollers, IoT devices, and automotive systems due to its efficiency and low-level hardware access.
- **Game Development**: Many game engines and performance-critical game components are written in C or C++ to achieve high performance and low latency.
- **Database Systems**: Many database management systems, such as MySQL and PostgreSQL, are implemented in C to ensure efficient data handling and query processing.
- **Network Programming**: C is often used for developing network protocols and applications due to its ability to handle low-level socket programming and efficient data transmission.
- **Compilers and Interpreters**: Many programming language compilers and interpreters are written in C to leverage its performance and low-level capabilities.
- **High-Performance Computing**: C is used in scientific computing and simulations where performance is critical, such as in numerical libraries and parallel computing frameworks.
- **System Utilities**: Many system utilities and command-line tools in Unix-like operating systems are written in C for efficiency and direct system access.
- **Device Drivers**: C is the language of choice for writing device drivers that allow the operating system to communicate with hardware devices.
- **Cryptography**: Many cryptographic libraries and algorithms are implemented in C to ensure high performance and security.
- **Audio/Video Processing**: Libraries like FFmpeg are written in C to handle multimedia processing efficiently.
- **Web Servers**: Popular web servers like Apache and Nginx are written in C to handle high loads and provide fast response times.

You get my point: C is everywhere, and it excels in scenarios where performance, efficiency, and low-level system access are paramount.

## When C Isn't the Right Choice

Now, as you can see, I'm a pretty big proponent of C, and you'll often hear people who exclusively develop in C (sadly not me) say: "Anything and everything under the sun can be written in C."
Which is true; if you were to remove all programming languages from existence save one, my choice would always be C, as we could potentially rebuild everything from the ground up. However, even I have to admit that C may not always be the best choice; everything has its ups and downs. So C may not be your best choice when it comes to:

- **Rapid Prototyping**: If you need to quickly prototype an idea or concept, higher-level languages like Python or JavaScript may be more suitable due to their ease of use and extensive libraries.
- **Web Development**: For building web applications, languages like JavaScript (with frameworks like React, Angular, or Vue) or Python (with frameworks like Django or Flask) are often preferred due to their ease of use and extensive web development libraries.
- **Data Science and Machine Learning**: Languages like Python and R are commonly used in data science and machine learning due to their extensive libraries and frameworks (e.g., TensorFlow, PyTorch, Pandas).
- **Mobile App Development**: For mobile app development, languages like Swift (for iOS) and Kotlin (for Android) are often preferred due to their platform-specific features and ease of use.
- **Memory Safety**: If memory safety is a primary concern, languages like Rust or Go may be more suitable due to their built-in memory safety features and garbage collection (Go).
- **Ease of Learning**: If you're new to programming, languages like Python or JavaScript may be easier to learn due to their simpler syntax and extensive learning resources.
- **Community and Ecosystem**: If you need access to a large community and ecosystem of libraries and frameworks, languages like JavaScript, Python, or Java may be more suitable due to their extensive ecosystems.

## Conclusion

So there you have it: why I still reach for C for certain projects. It's not always the fastest or easiest choice, but for small, efficient programs that need to run lean and mean, it often makes sense. C isn't flashy or trendy, but it gets the job done, and sometimes that's all that matters. Curious to hear how others tackle similar projects: C, Rust, Go, or something else?
