(think) 2 days ago

Building Emacs Major Modes with TreeSitter: Lessons Learned

Over the past year I’ve been spending a lot of time building TreeSitter-powered major modes for Emacs – clojure-ts-mode (as co-maintainer), neocaml (from scratch), and asciidoc-mode (also from scratch). Between the three projects I’ve accumulated enough battle scars to write about the experience. This post distills the key lessons for anyone thinking about writing a TreeSitter-based major mode, or curious about what it’s actually like.

Before TreeSitter, Emacs font-locking was done with regular expressions and indentation was handled by ad-hoc engines (SMIE, custom indent functions, or pure regex heuristics). This works, but it has well-known problems:

* Regex-based font-locking is fragile. Regexes can’t parse nested structures, so they either under-match (missing valid code) or over-match (highlighting inside strings and comments). Every edge case is another regex, and the patterns become increasingly unreadable over time.
* Indentation engines are complex. SMIE (the generic indentation engine for non-TreeSitter modes) requires defining operator precedence grammars for the language, which is hard to get right. Custom indentation functions tend to grow into large, brittle state machines. Tuareg’s indentation code, for example, is thousands of lines long.

TreeSitter changes the game because you get a full, incremental, error-tolerant syntax tree for free. Font-locking becomes “match this AST pattern, apply this face”, and indentation becomes “if the parent node is X, indent by Y”. The rules are declarative, composable, and much easier to reason about than regex chains. In practice, neocaml’s entire font-lock and indentation logic fits in about 350 lines of Elisp. The equivalent in Tuareg is spread across thousands of lines. That’s the real selling point: simpler, more maintainable code that handles more edge cases correctly.

That said, TreeSitter in Emacs is not a silver bullet. Here’s what I ran into.
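To make the declarative style concrete, here is roughly what such rules look like with Emacs’s built-in treesit API. The language, node, and face names are illustrative – this is a sketch of the shape, not code taken from any of the modes above:

```elisp
;; Font-locking: match an AST pattern, apply a face.
;; `mylang' and the node names are hypothetical.
(treesit-font-lock-rules
 :language 'mylang
 :feature 'keyword
 '(["if" "else" "while"] @font-lock-keyword-face)

 :language 'mylang
 :feature 'string
 '((string_literal) @font-lock-string-face))

;; Indentation: if the parent node is X, indent by Y.
(setq-local treesit-simple-indent-rules
            '((mylang
               ((node-is "}") parent-bol 0)
               ((parent-is "block") parent-bol 2)
               (catch-all parent-bol 0))))
```

Each indentation rule is a (matcher anchor offset) triple; `parent-bol`, `node-is`, `parent-is`, and `catch-all` are presets provided by treesit.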
TreeSitter grammars are written by different authors with different philosophies. The tree-sitter-ocaml grammar provides a rich, detailed AST with named fields. The tree-sitter-clojure grammar, by contrast, deliberately keeps things minimal – it only models syntax, not semantics, because Clojure’s macro system makes static semantic analysis unreliable. [1] This means that in OCaml you can fontify a function definition by directly matching nodes with named fields, while in Clojure the grammar only gives you lists of symbols, so font-locking forms requires predicate matching on symbol text. You can’t learn “how to write TreeSitter queries” generically – you need to learn each grammar individually. The best tools for this are treesit-explore-mode (to visualize the full parse tree) and treesit-inspect-mode (to see the node at point). Use them constantly.

You’re dependent on someone else providing the grammar, and quality is all over the map. The OCaml grammar is mature and well-maintained – it’s hosted under the official tree-sitter GitHub org. The Clojure grammar is small and stable by design. But not every language is so lucky. asciidoc-mode uses a third-party AsciiDoc grammar that employs a dual-parser architecture – one parser for block-level structure (headings, lists, code blocks) and another for inline formatting (bold, italic, links). This approach also appears in Emacs itself, and it makes sense for markup languages where block and inline syntax are largely independent. The problem is that the two parsers run independently on the same text, and they can disagree. The inline parser misinterprets certain block markers, such as list markers, as emphasis delimiters, creating spurious bold spans that swallow subsequent inline content.
The workaround is to use :override on all block-level font-lock rules so they win over the incorrect inline faces. This doesn’t fix inline elements consumed by the spurious emphasis – that requires an upstream grammar fix. When you hit grammar-level issues like this, you either fix them yourself (which means diving into the grammar’s JavaScript source and C toolchain) or you live with workarounds. Either way, it’s a reminder that your mode is only as good as the grammar underneath it. Getting the font-locking right in asciidoc-mode was probably the most challenging part of all three projects, precisely because of these grammar quirks.

I also ran into a subtle behavior: by default, the font-lock engine skips an entire captured range if any position within it already has a face. So if you capture a parent node and a child was already fontified, the whole thing gets skipped silently. The fix is to capture specific child nodes instead. These issues took a lot of trial and error to diagnose. The lesson: budget extra time for font-locking when working with less mature grammars.

Grammars evolve, and breaking changes happen. clojure-ts-mode switched from the stable grammar to the experimental branch because the stable version had metadata nodes as children of other nodes, which caused navigation commands to behave incorrectly. The experimental grammar makes metadata standalone nodes, fixing the navigation issues but requiring all queries to be updated. neocaml pins to v0.24.0 of the OCaml grammar. If you don’t pin versions, a grammar update can silently break your font-locking or indentation. The takeaway: always pin your grammar version, and include a mechanism to detect outdated grammars – for example, testing at startup a query that changed between versions, to catch incompatible grammars early. Users shouldn’t have to manually clone repos and compile C code to use your mode. Both neocaml and clojure-ts-mode include grammar recipes: on first use, the mode checks for the grammar and offers to install missing ones via treesit-install-language-grammar.
This works, but requires a C compiler and Git on the user’s machine, which is not ideal. [2]

The TreeSitter support in Emacs has been improving steadily, but each version has its quirks:

* Emacs 29 introduced TreeSitter support but lacked several APIs. Some functions used for structured navigation simply don’t exist there, so you need a fallback.
* Emacs 30 added new APIs, sentence navigation, and better indentation support. But it also had a bug in parser offsets (#77848) that broke embedded parsers, and another that forced disabling a TreeSitter-aware command.
* Emacs 31 has an off-by-one bug that leaves ` *)` behind when uncommenting multi-line OCaml comments. I had to skip the affected test with a version check.

The lesson: test your mode against multiple Emacs versions, and be prepared to write version-specific workarounds. CI that runs against Emacs 29, 30, and snapshot is essential.

Most TreeSitter grammars ship with query files for syntax highlighting (highlights.scm) and indentation (indents.scm). Editors like Neovim and Helix use these directly. Emacs doesn’t – you have to manually translate the patterns into treesit-font-lock-rules calls and indentation rules in Elisp. This is tedious and error-prone. Comparing a rule from the OCaml grammar’s highlights.scm with its Elisp equivalent, the query syntax is nearly identical, but you have to wrap everything in Elisp calls, map upstream capture names (like @keyword) to Emacs face names (like font-lock-keyword-face), assign features, and manage override behavior. You end up maintaining a parallel set of queries that can drift from upstream. Emacs 31 will make it possible to use the grammar’s own query files for font-locking, which should help significantly. But for now, you’re hand-coding everything.

When a face isn’t being applied where you expect:

* Verify that the node type at point actually matches your query.
* Check which rules are actually firing.
* Check the font-lock feature level – your rule might be in level 4 while the user has the default level 3.

TreeSitter modes define four levels of font-locking, with features assigned to levels via treesit-font-lock-feature-list, and the default level in Emacs is 3. Remember that rule order matters: without :override, an earlier rule that already fontified a region will prevent later rules from applying. This can be intentional or a source of bugs. It’s tempting to pile everything into levels 1–3 so users see maximum highlighting out of the box, but resist the urge.
When every token on the screen has a different color, code starts looking like a Christmas tree and the important things – keywords, definitions, types – stop standing out. Less is more here. Both neocaml and clojure-ts-mode distribute features across levels following the same philosophy: essentials first, progressively more detail at higher levels. This way the default experience (level 3) is clean and readable, and users who want the full rainbow can bump to 4. Better yet, they can use treesit-font-lock-recompute-features to cherry-pick individual features regardless of level. This gives users fine-grained control without requiring mode authors to anticipate every preference.

Indentation issues are harder to diagnose because they depend on tree structure, rule ordering, and anchor resolution. Enable verbose indentation logging to see which rule matched for each line, what anchor was computed, and the final column, and use treesit-explore-mode to understand the parent chain. The key question is always: “what is the parent node, and which rule matches it?” Remember that rule order matters for indentation too – the first matching rule wins. A typical set of rules reads top to bottom from most specific to most general. Watch out for the empty-line problem: when the cursor is on a blank line, TreeSitter has no node at point. The indentation engine falls back to the root node as the parent, which typically matches the top-level rule and gives column 0. In neocaml I solved this with a rule that looks at the previous line’s last token to decide indentation.

Write tests early – this is the single most important piece of advice. Font-lock and indentation are easy to break accidentally, and manual testing doesn’t scale. Both projects use Buttercup (a BDD testing framework for Emacs) with custom test macros. Font-lock tests insert code into a buffer, fontify it, and assert that specific character ranges have the expected face. Indentation tests insert code, re-indent it, and assert the result matches the expected indentation. Integration tests load real source files and verify that both font-locking and indentation survive on the full file. This catches interactions between rules that unit tests miss. neocaml has 200+ automated tests and clojure-ts-mode has even more.
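A font-lock test in that style looks roughly like this. Buttercup’s describe / it / expect API is real, but `my-ts-mode` is a hypothetical major mode standing in for the actual test macros used in these projects:

```elisp
(require 'buttercup)

(describe "font-locking"
  (it "fontifies `let' as a keyword"
    (with-temp-buffer
      (insert "let x = 1")
      (my-ts-mode)                    ; hypothetical TreeSitter mode
      (font-lock-ensure)
      ;; position 1 is the `l' of `let'
      (expect (get-text-property 1 'face)
              :to-equal 'font-lock-keyword-face))))
```

Indentation tests follow the same pattern: insert the code, run indent-region over the buffer, and compare the buffer text against the expected string.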
Investing in test infrastructure early pays off enormously – I can refactor indentation rules with confidence because the suite catches regressions immediately. When I became the maintainer of clojure-mode many years ago, I really struggled with making changes. There were no font-lock or indentation tests, so every change was a leap of faith – you’d fix one thing and break three others without knowing until someone filed a bug report. I spent years working on a testing approach I was happy with, alongside many great contributors, and the return on investment was massive. The same approach – almost the same test macros – carried over directly to clojure-ts-mode when we built the TreeSitter version. And later I reused the pattern again in neocaml and asciidoc-mode. One investment in testing infrastructure, four projects benefiting from it.

I know that automated tests, for whatever reason, never gained much traction in the Emacs community. Many popular packages have no tests at all. I hope stories like this convince you that investing in tests is really important and pays off – not just for the project where you write them, but for every project you build after.

One performance lesson that applies broadly: compiling TreeSitter queries at runtime is expensive. If you’re building queries dynamically at mode init time, consider pre-compiling them once instead. This made a noticeable difference in startup time.

The Emacs community has settled on the -ts-mode suffix convention for TreeSitter-based modes: python-ts-mode, ruby-ts-mode, clojure-ts-mode, and so on. This makes sense when both a legacy mode and a TreeSitter mode coexist in Emacs core – users need to choose between them. But I think the convention is being applied too broadly, and I’m afraid the resulting name fragmentation will haunt the community for years. For new packages that don’t have a legacy counterpart, the suffix is unnecessary. I named my packages neocaml (not ocaml-ts-mode) and asciidoc-mode (not asciidoc-ts-mode) because there was no prior mode to disambiguate from.
The -ts- infix is an implementation detail that shouldn’t leak into the user-facing name. Will we rename everything again when TreeSitter becomes the default and the non-TS variants are removed? Be bolder with naming. If you’re building something new, give it a name that makes sense on its own merits, not one that encodes the parsing technology in the package name.

I think the full transition to TreeSitter in the Emacs community will take 3–5 years, optimistically. There are hundreds of major modes out there, many maintained by a single person in their spare time. Converting a mode from regex to TreeSitter isn’t just a mechanical translation – you need to understand the grammar, rewrite font-lock and indentation rules, handle version compatibility, and build a new test suite. That’s a lot of work.

Interestingly, this might be one area where agentic coding tools can genuinely help. The structure of TreeSitter-based major modes is fairly uniform: grammar recipes, font-lock rules, indentation rules, navigation settings, imenu. If you give an AI agent a grammar and a reference to a high-quality existing mode, it could probably scaffold a reasonable new mode fairly quickly. The hard parts – debugging grammar quirks, handling edge cases, getting indentation just right – would still need human attention, but the boilerplate could be automated.

Still, knowing the Emacs community, I wouldn’t be surprised if a full migration never actually completes. Many old-school modes work perfectly fine, their maintainers have no interest in TreeSitter, and “if it ain’t broke, don’t fix it” is a powerful force. And that’s okay – diversity of approaches is part of what makes Emacs Emacs.

TreeSitter is genuinely great for building Emacs major modes. The code is simpler, the results are more accurate, and incremental parsing means everything stays fast even on large files. I wouldn’t go back to regex-based font-locking willingly. But it’s not magical.
Grammars are inconsistent across languages, the Emacs APIs are still maturing, you can’t reuse the grammar’s query files (yet), and you’ll hit version-specific bugs that require tedious workarounds. The testing story is better than with regex modes – tree structures are more predictable than regex matches – but you still need a solid test suite to avoid regressions.

If you’re thinking about writing a TreeSitter-based major mode, do it. The ecosystem needs more of them, and the experience of working with syntax trees instead of regexes is genuinely enjoyable. Just go in with realistic expectations, pin your grammar versions, test against multiple Emacs releases, and build your test suite early.

Anyway, I wish an article like this one had existed when I was starting out, so there you have it. I hope that the lessons I’ve learned along the way will help you build better modes with TreeSitter down the road. That’s all I have for you today. Keep hacking!

[1] See the excellent scope discussion in the tree-sitter-clojure repo for the rationale. ↩︎
[2] There’s ongoing discussion in the Emacs community about distributing pre-compiled grammar binaries, but nothing concrete yet. ↩︎

<antirez> 5 days ago

Implementing a clean room Z80 / ZX Spectrum emulator with Claude Code

Anthropic recently released a blog post describing an experiment in which the latest version of Opus, 4.6, was instructed to write a C compiler in Rust, in a “clean room” setup. The experiment methodology left me dubious about the kind of point they wanted to make. Why not provide the agent with the ISA documentation? Why Rust? Writing a C compiler is exactly a giant graph manipulation exercise: the kind of program that is harder to write in Rust. Also, in a clean room experiment, the agent should have access to all the information about well-established computer science progress related to optimizing compilers: there are a number of papers that could be easily synthesized into a number of markdown files. SSA, register allocation, instruction selection and scheduling. Those things needed to be researched *first*, as a prerequisite, and the implementation would still be “clean room”. Not allowing the agent to access the Internet, nor any other compiler source code, was certainly the right call. Less understandable is the almost-zero steering principle, but this is coherent with a certain kind of experiment, if the goal was showcasing the completely autonomous writing of a large project. Yet, we all know this is not how coding agents are used in practice, most of the time. Anyone who uses coding agents extensively knows very well how, even without ever touching the code, a few hints here and there completely change the quality of the result.

# The Z80 experiment

I thought it was time to try a similar experiment myself, one that would take one or two hours at most, and that was compatible with my Claude Code Max plan: I decided to write a Z80 emulator, and then a ZX Spectrum emulator (and even more, a CP/M emulator, see later) under conditions that I believe make more sense as a “clean room” setup. The result can be found here: https://github.com/antirez/ZOT.

# The process I used

1. I wrote a markdown file with the specification of what I wanted to do.
Just English, high-level ideas about the scope of the Z80 emulator to implement. I said things like: it should execute a whole instruction at a time, not a single clock step, since this emulator must be runnable on things like an RP2350 or similarly limited hardware. The emulator should correctly track the clock cycles elapsed (and I specified we could use this feature later in order to implement the ZX Spectrum contention with the ULA during memory accesses), provide memory access callbacks, and should emulate all the known official and unofficial instructions of the Z80. For the Spectrum implementation, performed as a successive step, I provided much more information in the markdown file: the kind of rendering I wanted in the RGB buffer, and how it needed to be optional so that embedded devices could render the scanlines directly as they transferred them to the ST77xx display (or similar), how it should be possible to interact with the I/O port to set the EAR bit to simulate cassette loading in a very authentic way, and many other desiderata I had about the emulator. This file also included the rules that the agent needed to follow:

* Accessing the internet is prohibited, but you can use the specification and test vectors files I added inside ./z80-specs.
* Code should be simple and clean, never over-complicate things.
* Each piece of solid progress should be committed in the git repository.
* Before committing, you should test that what you produced is high quality and that it works.
* Write a detailed test suite as you add more features. The tests must be re-executed at every major change.
* Code should be very well commented: things must be explained in terms that even people not well versed in certain Z80 or Spectrum internals should understand.
* Never stop for prompting, the user is away from the keyboard.
* At the end of this file, create a work-in-progress log, where you note what you already did and what is missing. Always update this log.
* Read this file again after each context compaction.

2. Then, I started a Claude Code session and asked it to fetch all the useful documentation on the internet about the Z80 (later I did this for the Spectrum as well), and to extract only the useful factual information into markdown files. I also provided the binary files for the most ambitious test vectors for the Z80, the ZX Spectrum ROM, and a few other binaries that could be used to test whether the emulator actually executed the code correctly. Once all this information was collected (it is part of the repository, so you can inspect what was produced), I completely removed the Claude Code session in order to make sure that no contamination with source code seen during the search was possible.

3. I started a new session and asked it to check the specification markdown file and all the documentation available, and to start implementing the Z80 emulator. The rules were to never access the Internet for any reason (I supervised the agent while it was implementing the code, to make sure this didn’t happen), and to never search the disk for similar source code, as this was a “clean room” implementation.

4. For the Z80 implementation, I did zero steering. For the Spectrum implementation I used extensive steering for implementing the TAP loading. More about my feedback to the agent later in this post.

5. As a final step, I copied the repository to /tmp, removed the “.git” repository files completely, started a new Claude Code (and Codex) session and claimed that the implementation was likely stolen or too strongly inspired by somebody else's work. The task was to check against all the major Z80 implementations whether there was evidence of theft. The agents (both Codex and Claude Code), after extensive search, were not able to find any evidence of copyright issues.
The only similar parts were well-established emulation patterns and things that are Z80-specific and can’t be done differently; the implementation looked distinct from all the other implementations in a significant way.

# Results

Claude Code worked for 20 or 30 minutes in total, and produced a Z80 emulator that was able to pass ZEXDOC and ZEXALL, in 1200 lines of very readable and well-commented C code (1800 lines with comments and blank lines). The agent was prompted zero times during the implementation; it acted absolutely alone. It never accessed the internet, and the process it used to implement the emulator was one of continuous testing, interacting with the CP/M binaries implementing ZEXDOC and ZEXALL, writing just the CP/M syscalls needed to produce the output on the screen. Multiple times it also used the Spectrum ROM and other binaries that were available, or binaries it created from scratch, to see if the emulator was working correctly. In short: the implementation was performed in a way very similar to how a human programmer would do it, not by outputting a complete implementation from scratch, “uncompressed” from the weights. Instead, different classes of instructions were implemented incrementally, and there were bugs that were fixed via integration tests, debugging sessions, dumps, printf calls, and so forth.

# Next step: the ZX Spectrum

I repeated the process again. I instructed the documentation-gathering session very accurately about the kind of details I wanted it to search for on the internet, especially the ULA interactions with RAM access, the keyboard mapping, the I/O port, how the cassette tape worked and the kind of PWM encoding used, and how it was encoded into TAP or TZX files.
As I said, this time the design notes were extensive, since I wanted this emulator to be specifically designed for embedded systems: only 48k emulation, optional framebuffer rendering, very little additional memory used (no big lookup tables for ULA/Z80 access contention), the ROM not copied into RAM to avoid using an additional 16k of memory, but just referenced during initialization (so we have just the copy in the executable), and so forth. The agent was able to create very detailed documentation about the ZX Spectrum internals. I provided a few .z80 images of games, so that it could test the emulator in a real setup with real software. Again, I removed the session and started fresh. The agent started working and finished 10 minutes later, following a process that really fascinates me, and that you probably know very well: you see the agent working with a number of diverse skills. It is expert in everything programming related, so as it was implementing the emulator, it could immediately write detailed instrumentation code to “look” at what the Z80 was doing step by step, and how this changed the Spectrum emulation state. In this respect, I believe automatic programming to be already super-human, not in the sense that it is currently capable of producing code that humans can’t produce, but in the concurrent usage of different programming languages, system programming techniques, DSP stuff, operating system tricks, math, and everything needed to reach the result in the most immediate way. When it was done, I asked it to write a simple SDL-based integration example. The emulator was immediately able to run the Jetpac game without issues, with working sound, and very little CPU usage even on my slow Dell Linux machine (8% usage of a single core, including SDL rendering). Once the basic stuff was working, I wanted to load TAP files directly, simulating cassette loading.
This was the first time the agent missed a few things, specifically about the timing the Spectrum loading routines expected, and here we are in the territory where LLMs start to perform less efficiently: they can’t easily run the SDL emulator and see the border changing as data is received, and so forth. I asked Claude Code to do a refactoring so that zx_tick() could be called directly and was not part of zx_frame(), and to make zx_frame() a trivial wrapper. This way it was much simpler to sync the EAR bit with what the loading routine expected, without callbacks or the wrong abstractions that it had implemented. After this change, a few minutes later the emulator could load a TAP file, emulating the cassette, without problems. This is how it works now:

    do {
        zx_set_ear(zx, tzx_update(&tape, zx->cpu.clocks));
    } while (!zx_tick(zx, 0));

I continued prompting Claude Code to make the key bindings more useful, plus a few other things.

# CP/M

One thing that I found really interesting was the ability of the LLM to inspect the COM files for the ZEXDOC / ZEXALL tests for the Z80, easily spot the CP/M syscalls that were used (a total of three), and implement them for the extended Z80 test (executed by make fulltest). So, at this point, why not implement a full CP/M environment? Same process again, same good result in a matter of minutes. This time I interacted with it a bit more for the VT100 / ADM3 terminal escape conversions, and reported things not working in WordStar initially; in a few minutes everything I tested was working well enough (but there are fixes to do, like simulating a 2 MHz clock – right now it runs at full speed, making CP/M games impossible to use).

# What is the lesson here?

The obvious lesson is: always provide your agents with design hints and extensive documentation about what they are going to do. Such documentation can be obtained by the agent itself.
And, also, make sure the agent has a markdown file with the rules for how to perform the coding tasks, and a trace of what it is doing that is updated and re-read quite often. But those tricks, I believe, are quite clear to everybody who has worked extensively with automatic programming in the latest months. To think in terms of “what a human would need” is often the best bet, plus a few LLM-specific things, like the forgetting issue after context compaction, the continuous ability to verify it is on the right track, and so forth.

Returning to the Anthropic compiler attempt: one of the steps where the agent failed was the one most strongly related to the idea of memorization of what is in the pretraining set: the assembler. With extensive documentation, I can’t see any way Claude Code (and, even more, GPT5.3-codex, which is in my experience more capable for complex stuff) could fail at producing a working assembler, since it is quite a mechanical process. This is, I think, in contradiction with the idea that LLMs memorize the whole training set and uncompress what they have seen. LLMs can memorize certain over-represented documents and code, but while they can extract such verbatim parts of the code if prompted to do so, they don’t have a copy of everything they saw during training, nor do they spontaneously emit copies of already-seen code in their normal operation. We mostly ask LLMs to create work that requires assembling different knowledge they possess, and the result is normally something that uses known techniques and patterns, but that is new code, not a copy of some pre-existing code.
It is worth noting, too, that humans often follow a less rigorous process than the clean room rules detailed in this blog post: humans often download the code of different implementations related to what they are trying to accomplish, read it carefully, then try to avoid copying stuff verbatim, but oftentimes they take strong inspiration. This is a process that I find perfectly acceptable, but it is important to keep in mind what happens in the reality of code written by humans. After all, information technology evolved so fast thanks in part to this massive cross-pollination effect. For all the above reasons, when I implement code using automatic programming, I don’t have problems releasing it MIT licensed, like I did with this Z80 project. In turn, this code base will constitute quality input for the next LLMs' training, including open-weights ones.

# Next steps

To make my experiment more compelling, one should try to implement a Z80 and ZX Spectrum emulator without providing any documentation to the agent, and then compare the results. I didn’t find the time to do it, but it could be quite informative.

Anton Zhiyanov 2 weeks ago

Allocators from C to Zig

An allocator is a tool that reserves memory (typically on the heap) so a program can store its data structures there. Many C programs use the standard libc allocator, or at best let you switch it out for another one like jemalloc or mimalloc. Unlike C, modern systems languages usually treat allocators as first-class citizens. Let's look at how they handle allocation and then create a C allocator following their approach.

Contents: Rust • Zig • Odin • C3 • Hare • C • Final thoughts

Rust is one of the older languages we'll be looking at, and it handles memory allocation in a more traditional way. Right now, it uses a global allocator, but there's an experimental Allocator API implemented behind a feature flag (issue #32838). We'll set the experimental API aside and focus on the stable one. The documentation begins with a clear statement:

In a given program, the standard library has one "global" memory allocator that is used for example by Box and Vec.

Followed by a vague one:

Currently the default global allocator is unspecified.

It doesn't mean that a Rust program will abort an allocation, of course. In practice, Rust uses the system allocator as the global default (but the Rust developers don't want to commit to this, hence the "unspecified" note). The global allocator interface is defined by the GlobalAlloc trait in the std::alloc module. It requires the implementor to provide two essential methods – alloc and dealloc – and provides two more based on them – alloc_zeroed and realloc. The Layout struct describes a piece of memory we want to allocate – its size in bytes and alignment.

Memory alignment. Alignment restricts where a piece of data can start in memory. The memory address for the data has to be a multiple of a certain number, which is always a power of 2. Alignment depends on the type of data: CPUs are designed to read "aligned" memory efficiently.
For example, if you read a 4-byte integer starting at address 0x03 (which is unaligned), the CPU has to do two memory reads — one for the first byte and another for the other three bytes — and then combine them. But if the integer starts at address 0x04 (which is aligned), the CPU can read all four bytes at once. Aligned memory is also needed for vectorized CPU operations (SIMD), where one processor instruction handles a group of values at once instead of just one.

The compiler knows the size and alignment for each type, so we can use the constructor or helper functions to create a valid layout:

Don't be surprised that a takes up 32 bytes. In Rust, the type can grow, so it stores a data pointer, a length, and a capacity (3 × 8 = 24 bytes). There's also 1 byte for the boolean and 7 bytes of padding (because of 8-byte alignment), making a total of 32 bytes.

is the default memory allocator provided by the operating system. The exact implementation depends on the platform. It implements the trait and is used as the global allocator by default, but the documentation does not guarantee this (remember the "unspecified" note?). If you want to explicitly set as the global allocator, you can use the attribute:

You can also set a custom allocator as global, like in this example:

To use the global allocator directly, call the and functions:

In practice, people rarely use or directly. Instead, they work with types like , or that handle allocation for them:

The allocator doesn't abort if it can't allocate memory; instead, it returns (which is exactly what recommends):

The documentation recommends using the function to signal out-of-memory errors. It immediately aborts the process, or panics if the binary isn't linked to the standard library. Unlike the low-level function, types like or call if allocation fails, so the program usually aborts if it runs out of memory:

Allocator API • Memory allocation APIs

Memory management in Zig is explicit.
There is no default global allocator, and any function that needs to allocate memory accepts an allocator as a separate parameter. This makes the code a bit more verbose, but it matches Zig's goal of giving programmers as much control and transparency as possible.

An allocator in Zig is a struct with an opaque self-pointer and a method table with four methods:

Unlike Rust's allocator methods, which take a raw pointer and a size as arguments, Zig's allocator methods take a slice of bytes ( ) — a type that combines both a pointer and a length. Another interesting difference is the optional parameter, which is the first return address in the allocation call stack. Some allocators, like the , use it to keep track of which function requested memory. This helps with debugging issues related to memory allocation.

Just like in Rust, allocator methods don't return errors. Instead, and return if they fail. Zig also provides type-safe wrappers that you can use instead of calling the allocator methods directly:

Unlike the allocator methods, these allocation functions return an error if they fail. If a function or method allocates memory, it expects the developer to provide an allocator instance:

Zig's standard library includes several built-in allocators in the namespace. asks the operating system for entire pages of memory; each allocation is a syscall: allocates memory into a fixed buffer and doesn't make any heap allocations: wraps a child allocator and allows you to allocate many times and only free once: The call frees all memory. Individual calls are no-ops. (aka ) is a safe allocator that can prevent double-free, use-after-free and can detect leaks: is a general-purpose thread-safe allocator designed for maximum performance on multithreaded machines: is a wrapper around the libc allocator:

Zig doesn't panic or abort when it can't allocate memory.
An allocation failure is just a regular error that you're expected to handle:

Allocators • std.mem.Allocator • std.heap

Odin supports explicit allocators, but, unlike Zig, it's not the only option. In Odin, every scope has an implicit variable that provides a default allocator: If you don't pass an allocator to a function, it uses the one currently set in the context.

An allocator in Odin is a struct with an opaque self-pointer and a single function pointer: Unlike other languages, Odin's allocator uses a single procedure for all allocation tasks. The specific action — like allocating, resizing, or freeing memory — is decided by the parameter. The allocation procedure returns the allocated memory (for and operations) and an error ( on success).

Odin provides low-level wrapper functions in the package that call the allocator procedure using a specific mode: There are also type-safe builtins like / (for a single object) and / (for multiple objects) that you can use instead of the low-level interface: By default, all builtins use the context allocator, but you can pass a custom allocator as an optional parameter: To use a different allocator for a specific block of code, you can reassign it in the context:

Odin's provides two different allocators: When using the temp allocator, you only need a single call to clear all the allocated memory.

Odin's standard library includes several allocators, found in the and packages. The procedure returns a general-purpose allocator: uses a single backing buffer for allocations, allowing you to allocate many times and only free once: detects leaks and invalid memory access, similar to in Zig: There are also others, such as or .

Like Zig, Odin doesn't panic or abort when it can't allocate memory. Instead, it returns an error code as the second return value:

Allocators • base:runtime • core:mem

Like Zig and Odin, C3 supports explicit allocators. Like Odin, C3 provides two default allocators: heap and temp.
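Stepping back for a moment: the design that Zig and Odin share (an opaque state pointer plus one or more function pointers) maps directly onto C. Here is a minimal sketch of that shape; the names (`Allocator`, `libc_alloc`, `libc_release`) are mine, not taken from any of these languages' standard libraries:

```c
#include <assert.h>
#include <stdlib.h>

/* A Zig/Odin-style allocator interface in C: opaque state plus
   a table of function pointers. Names are illustrative. */
typedef struct Allocator {
    void *state; /* opaque self-pointer, like Zig's */
    void *(*alloc)(void *state, size_t size, size_t align);
    void (*release)(void *state, void *ptr, size_t size);
} Allocator;

/* Trivial libc-backed implementation: it needs no state and
   ignores the alignment hint (malloc is suitably aligned for
   ordinary object types). */
static void *libc_alloc(void *state, size_t size, size_t align) {
    (void)state;
    (void)align;
    return malloc(size);
}

static void libc_release(void *state, void *ptr, size_t size) {
    (void)state;
    (void)size;
    free(ptr);
}

static const Allocator libc_allocator = {NULL, libc_alloc, libc_release};
```

Any concrete allocator just fills in the struct; callers program against the function pointers and never call malloc or free directly.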
An allocator in C3 is a interface with an additional option of zeroing or not zeroing the allocated memory: Unlike Zig and Odin, the and methods don't take the (old) size as a parameter — neither directly like Odin nor through a slice like Zig. This makes it a bit harder to create custom allocators, because the allocator has to keep track of the size along with the allocated memory. On the other hand, this approach makes C interop easier (if you use the default C3 allocator): data allocated in C can be freed in C3 without needing to pass the size parameter from the C code. Like in Odin, allocator methods return an error if they fail.

C3 provides low-level wrapper macros in the module that call allocator methods: These either return an error (the -suffix macros) or abort if they fail. There are also functions and macros with similar names in the module that use the global allocator instance: If a function or method allocates memory, it often expects the developer to provide an allocator instance:

C3 provides two thread-local allocator instances: There are functions and macros in the module that use the temporary allocator: The macro releases all temporary allocations when leaving the scope: Some types, like or , use the temp allocator by default if they are not initialized:

C3's standard library includes several built-in allocators, found in the module. is a wrapper around libc's malloc/free: uses a single backing buffer for allocations, allowing you to allocate many times and only free once: detects leaks and invalid memory access: There are also others, such as or .

Like Zig and Odin, C3 can return an error in case of allocation failure: C3 can also abort in case of allocation failure: Since the functions and macros in the module use instead of , it looks like aborting on failure is the preferred approach.

Memory Handling • core::mem::allocator • core::mem

Unlike other languages, Hare doesn't support explicit allocators.
The standard library has multiple allocator implementations, but only one of them is used at runtime. Hare's compiler expects the runtime to provide and implementations: The programmer isn't supposed to access them directly (although it's possible by importing and calling or ). Instead, Hare uses them to provide higher-level allocation helpers.

Hare offers two high-level allocation helpers that use the global allocator internally: and . can allocate individual objects. It takes a value, not a type: can also allocate slices if you provide a second parameter (the number of items): works correctly with both pointers to single objects (like ) and slices (like ).

Hare's standard library has three built-in memory allocators: The allocator that's actually used is selected at compile time.

Like other languages, Hare returns an error in case of allocation failure: You can abort on error with : Or propagate the error with :

Dynamic memory allocation • malloc.ha

Many C programs use the standard libc allocator, or at most, let you swap it out for another one using macros: Or using a simple setter: While this might work for switching the libc allocator to jemalloc or mimalloc, it's not very flexible. For example, trying to implement an arena allocator with this kind of API is almost impossible.

Now that we've seen the modern allocator design in Zig, Odin, and C3 — let's try building something similar in C. There are a lot of small choices to make, and I'm going with what I personally prefer. I'm not saying this is the only way to design an allocator — it's just one way out of many.

Our allocator should return an error instead of if it fails, so we'll need an error enum: The allocation function needs to return either a tagged union (value | error) or a tuple (value, error). Since C doesn't have these built in, let's use a custom tuple type:

The next step is the allocator interface.
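Before getting to the interface, the error enum and the (value, error) tuple just mentioned might look like the following. This is a sketch with my own names (`AllocError`, `AllocResult`); the post's actual definitions are in its elided code blocks:

```c
#include <assert.h>
#include <stddef.h>

/* An error enum and a (value, error) tuple, since C has neither
   tagged unions nor multiple return values. Names are illustrative. */
typedef enum {
    ALLOC_OK = 0,
    ALLOC_ERR_NO_MEMORY
} AllocError;

typedef struct {
    void *ptr;      /* valid only when err == ALLOC_OK */
    AllocError err;
} AllocResult;

/* Helper constructors keep call sites tidy. */
static AllocResult alloc_ok(void *ptr) {
    AllocResult r = {ptr, ALLOC_OK};
    return r;
}

static AllocResult alloc_err(AllocError err) {
    AllocResult r = {NULL, err};
    return r;
}
```

Callers then check `result.err` before touching `result.ptr`, mirroring the explicit error handling of Zig and Odin.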
I think Odin's approach of using a single function makes the implementation more complicated than it needs to be, so let's create separate methods like Zig does: This approach to interface design is explained in detail in a separate post: Interfaces in C.

Zig uses byte slices ( ) instead of raw memory pointers. We could make our own byte slice type, but I don't see any real advantage to doing that in C — it would just mean more type casting. So let's keep it simple and stick with like our ancestors did.

Now let's create generic and wrappers: I'm taking for granted here to keep things simple. A more robust implementation should properly check if it is available or pass the type to directly. We can even create a separate pair of helpers for collections: We could use some macro tricks to make and work for both a single object and a collection. But let's not do that — I prefer to avoid heavy-magic macros in this post.

As for the custom allocators, let's start with a libc wrapper. It's not particularly interesting, since it ignores most of the parameters, but still: Usage example: Now let's use that field to implement an arena allocator backed by a fixed-size buffer: Usage example:

As shown in the examples above, the allocation method returns an error if something goes wrong. While checking for errors might not be as convenient as it is in Zig or Odin, it's still pretty straightforward:

Here's an informal table comparing allocation APIs in the languages we've discussed: In Zig, you always have to specify the allocator. In Odin, passing an allocator is optional. In C3, some functions require you to pass an allocator, while others just use the global one. In Hare, there's a single global allocator.

As we've seen, there's nothing magical about the allocators used in modern languages. While they're definitely more ergonomic and safe than C, there's nothing stopping us from using the same techniques in plain C.

- on Unix platforms; on Windows;
- : alignment = 1. Can start at any address (0, 1, 2, 3...).
- : alignment = 4. Must start at addresses divisible by 4 (0, 4, 8, 12...).
- : alignment = 8. Must start at addresses divisible by 8 (0, 8, 16...).
- is for general-purpose allocations. It uses the operating system's heap allocator.
- is for short-lived allocations. It uses a scratch allocator (a kind of growing arena).
- is for general-purpose allocations. It uses the operating system's heap allocator (typically a libc wrapper).
- is for short-lived allocations. It uses an arena allocator.
- The default allocator is based on the algorithm from the Verified sequential malloc/free paper.
- The libc allocator uses the operating system's malloc and free functions from libc.
- The debug allocator uses a simple mmap-based method for memory allocation.
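Putting the power-of-two alignment rules and the arena idea together, here is a minimal fixed-buffer arena in C. It is a sketch with my own names (`Arena`, `arena_alloc`), not the post's elided code: allocation rounds the offset up to the requested alignment and bumps it, an exhausted buffer is reported as an error value rather than an abort, and a single reset releases everything:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A fixed-buffer arena allocator. Names are illustrative. */
typedef struct {
    unsigned char *buf;
    size_t cap;
    size_t off; /* next free byte */
} Arena;

/* Round n up to a multiple of align (align must be a power of 2). */
static size_t align_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Returns NULL when the buffer is exhausted: an error for the
   caller to handle, not an abort. Assumes buf itself starts at a
   suitably aligned address. */
static void *arena_alloc(Arena *a, size_t size, size_t align) {
    size_t start = align_up(a->off, align);
    if (start > a->cap || size > a->cap - start) return NULL;
    a->off = start + size;
    return a->buf + start;
}

/* Free everything at once; there are no individual frees. */
static void arena_reset(Arena *a) { a->off = 0; }
```

For small alignments a static array of `unsigned char` is usually aligned well enough; to be strict, the backing buffer can be declared with `max_align_t`.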


Rewriting pycparser with the help of an LLM

pycparser is my most widely used open source project (with ~20M daily downloads from PyPI [1] ). It's a pure-Python parser for the C programming language, producing ASTs inspired by Python's own . Until very recently, it's been using PLY: Python Lex-Yacc for the core parsing. In this post, I'll describe how I collaborated with an LLM coding agent (Codex) to help me rewrite pycparser to use a hand-written recursive-descent parser and remove the dependency on PLY. This has been an interesting experience; the post contains lots of information and is therefore quite long. If you're just interested in the final result, check out the latest code of pycparser - the main branch already has the new implementation.

While pycparser has been working well overall, there were a number of nagging issues that persisted over the years. I began working on pycparser in 2008, and back then using a YACC-based approach for parsing a whole language like C seemed like a no-brainer to me. Isn't this what everyone does when writing a serious parser? Besides, the K&R2 book famously carries the entire grammar of the C99 language in an appendix - so it seemed like a simple matter of translating that to PLY-yacc syntax. And indeed, it wasn't too hard, though there definitely were some complications in building the ASTs for declarations (C's gnarliest part ).

Shortly after completing pycparser, I got more and more interested in compilation and started learning about the different kinds of parsers more seriously. Over time, I grew convinced that recursive descent is the way to go - producing parsers that are easier to understand and maintain (and are often faster!). It all ties into the benefits of dependencies in software projects as a function of effort . Using parser generators is a heavy conceptual dependency: it's really nice when you have to churn out many parsers for small languages.
But when you have to maintain a single, very complex parser as part of a large project, the benefits quickly dissipate and you're left with a substantial dependency that you constantly grapple with. And then there are the usual problems with dependencies: dependencies get abandoned, and they may also develop security issues. Sometimes, both of these become true.

Many years ago, pycparser forked and started vendoring its own version of PLY. This was part of transitioning pycparser to a dual Python 2/3 code base when PLY was slower to adapt. I believe this was the right decision, since PLY "just worked" and I didn't have to deal with active (and very tedious in the Python ecosystem, where packaging tools are replaced faster than dirty socks) dependency management.

A couple of weeks ago this issue was opened for pycparser. It turns out that some old PLY code triggers security checks used by some Linux distributions; while this code was fixed in a later commit of PLY, PLY itself was apparently abandoned and archived in late 2025. And guess what? That happened in the middle of a large rewrite of the package, so re-vendoring the pre-archiving commit seemed like a risky proposition. On the issue it was suggested that "hopefully the dependent packages move on to a non-abandoned parser or implement their own"; I originally laughed this idea off, but then it got me thinking... which is what this post is all about.

The original K&R2 grammar for C99 had - famously - a single shift-reduce conflict, having to do with the dangling else belonging to the most recent if statement. And indeed, other than the famous lexer hack used to deal with C's type name / ID ambiguity , pycparser only had this single shift-reduce conflict. But things got more complicated. Over the years, features were added that weren't strictly in the standard but were supported by all the industrial compilers.
The more advanced C11 and C23 standards weren't beholden to the promises of conflict-free YACC parsing (since almost no industrial-strength compilers use YACC at this point), so all caution went out of the window. The latest (PLY-based) release of pycparser has many reduce-reduce conflicts [2] ; these are a severe maintenance hazard, because it means the parsing rules essentially have to be tie-broken by order of appearance in the code. This is very brittle; pycparser has only managed to maintain its stability and quality through its comprehensive test suite. Over time, it became harder and harder to extend, because YACC parsing rules have all kinds of spooky-action-at-a-distance effects. The straw that broke the camel's back was this PR, which again proposed to increase the number of reduce-reduce conflicts [3] . This - again - prompted me to think "what if I just dump YACC and switch to a hand-written recursive descent parser", and here we are.

None of the challenges described above are new; I've been pondering them for many years now, and yet biting the bullet and rewriting the parser didn't feel like something I'd like to get into. By my private estimates, it'd take at least a week of deep heads-down work to port the gritty 2000 lines of YACC grammar rules to a recursive descent parser [4] . Moreover, it wouldn't be a particularly fun project either - I didn't feel like I'd learn much new, and my interests have shifted away from this project. In short, the Potential well was just too deep.

I've definitely noticed the improvement in capabilities of LLM coding agents in the past few months, and many reputable people online rave about using them for increasingly larger projects. That said, would an LLM agent really be able to accomplish such a complex project on its own? This isn't just a toy; it's thousands of lines of dense parsing code. What gave me hope is the concept of conformance suites mentioned by Simon Willison .
Agents seem to do well when there's a very clear and rigid goal function - such as a large, high-coverage conformance test suite. And pycparser has a very extensive one : over 2500 lines of test code parsing various C snippets to ASTs with expected results, grown over a decade and a half of real issues and bugs reported by users. I figured the LLM could either succeed or fail and throw its hands up in despair, but it was quite unlikely to produce a wrong port that would still pass all the tests. So I set it to run.

I fired up Codex in pycparser's repository, and wrote this prompt just to make sure it understands me and can run the tests: Codex figured it out (I gave it the exact command, after all!); my next prompt was the real thing [5] : Here Codex went to work and churned for over an hour . Having never observed an agent work for nearly this long, I kind of assumed it had gone off the rails and would fail sooner or later. So I was rather surprised and skeptical when it eventually came back with:

It took me a while to poke around the code and run it until I was convinced - it had actually done it! It wrote a new recursive descent parser with only ancillary dependencies on PLY, and that parser passed the test suite. After a few more prompts, we'd removed the ancillary dependencies and made the structure clearer. I hadn't looked too deeply into code quality at this point, but at least on the functional level - it succeeded. This was very impressive!

A change like the one described above is impossible to code-review as one PR in any meaningful way, so I used a different strategy. Before embarking on this path, I created a new branch, and once Codex finished the initial rewrite, I committed this change, knowing that I would review it in detail, piece by piece, later on. Even though coding agents have their own notion of history and can "revert" certain changes, I felt much safer relying on Git.
In the worst case, if all of this went south, I could nuke the branch and it would be as if nothing ever happened. I was determined to only merge this branch onto main once I was fully satisfied with the code. In what follows, I had to git reset several times when I didn't like the direction in which Codex was going. In hindsight, doing this work in a branch was absolutely the right choice.

Once I'd sufficiently convinced myself that the new parser was actually working, I used Codex to similarly rewrite the lexer and get rid of the PLY dependency entirely, deleting it from the repository. Then, I started looking more deeply into code quality - reading the code created by Codex and trying to wrap my head around it. And - oh my - this was quite the journey.

Much has been written about the code produced by agents, and much of it seems to be true. Maybe it's a setting I'm missing (I'm not using my own custom AGENTS.md yet, for instance), but Codex seems to be that eager programmer that wants to get from A to B whatever the cost. Readability, minimalism and code clarity are very much secondary goals. Using raise...except for control flow? Yep. Abusing Python's weak typing (like having None , False and other values all mean different things for a given variable)? For sure. Spreading the logic of a complex function all over the place instead of putting all the key parts in a single switch statement? You bet.

Moreover, the agent is hilariously lazy . More than once I had to convince it to do something it initially said was impossible, and it even insisted again in follow-up messages. The anthropomorphization here is mildly concerning, to be honest. I could never imagine I would be writing something like the following to a computer, and yet - here we are: "Remember how we moved X to Y before? You can do it again for Z, definitely. Just try".

My process was to see how I could instruct Codex to fix things, and intervene myself (by rewriting code) as little as possible.
I've mostly succeeded in this, and did maybe 20% of the work myself. My branch grew dozens of commits, falling into roughly these categories: Interestingly, after doing (3), the agent was often more effective in giving the code a "fresh look" and succeeding in either (1) or (2).

Eventually, after many hours spent in this process, I was reasonably pleased with the code. It's far from perfect, of course, but taking the essential complexities into account, it's something I could see myself maintaining (with or without the help of an agent). I'm sure I'll find more ways to improve it in the future, but I have a reasonable degree of confidence that this will be doable. It passes all the tests, so I've been able to release a new version (3.00) without major issues so far. The only issue I've discovered is that some of CFFI's tests are overly precise about the phrasing of errors reported by pycparser; this was an easy fix .

The new parser is also faster, by about 30% based on my benchmarks! This is typical of recursive descent when compared with YACC-generated parsers, in my experience. After reviewing the initial rewrite of the lexer, I spent a while instructing Codex on how to make it faster, and it worked reasonably well.

While working on this, it became quite obvious that static typing would make the process easier. LLM coding agents really benefit from closed loops with strict guardrails (e.g. a test suite to pass), and type annotations act as such. For example, had pycparser already been type-annotated, Codex would probably not have overloaded values to multiple types (like None vs. False vs. others). In a followup, I asked Codex to type-annotate pycparser (running checks using ty ), and this was also a back-and-forth process, because it exposed some issues that needed to be refactored. Time will tell, but hopefully it will make further changes in the project simpler for the agent.
Based on this experience, I'd bet that coding agents will be somewhat more effective in strongly typed languages like Go, TypeScript and especially Rust.

Overall, this project has been a really good experience, and I'm impressed with what modern LLM coding agents can do! While there's no reason to expect that progress in this domain will stop, even if it does - these are already very useful tools that can significantly improve programmer productivity. Could I have done this myself, without an agent's help? Sure. But it would have taken me much longer, assuming that I could even muster the will and concentration to engage in this project. I estimate it would have taken me at least a week of full-time work (so 30-40 hours) spread over who knows how long. With Codex, I put in an order of magnitude less work (around 4-5 hours, I'd estimate) and I'm happy with the result.

It was also fun . At least in one sense, my professional life can be described as the pursuit of focus, deep work and flow . It's not easy for me to get into this state, but when I do, I'm highly productive and find it very enjoyable. Agents really help me here. When I know I need to write some code and it's hard to get started, asking an agent to write a prototype is a great catalyst for my motivation. Hence the meme at the beginning of the post.

One can't avoid a nagging question - does the quality of the code produced by agents even matter? Clearly, the agents themselves can understand it (if not today's agents, then at least next year's). Why worry about future maintainability if the agent can maintain it? In other words, does it make sense to just go full vibe-coding? This is a fair question, and one I don't have an answer to. Right now, for projects I maintain and stand behind , it seems obvious to me that the code should be fully understandable and accepted by me, and the agent is just a tool helping me get to that state more efficiently.
It's hard to say what the future holds here; it's going to be interesting, for sure.

There was also the lexer to consider, but this seemed like a much simpler job. My impression is that in the early days of computing, lex gained prominence because of strong regexp support, which wasn't very common yet. These days, with excellent regexp libraries existing for pretty much every language, the added value of lex over a custom regexp-based lexer isn't very high. That said, it wouldn't make much sense to embark on a journey to rewrite just the lexer; the dependency on PLY would still remain, and besides, PLY's lexer and parser are designed to work well together. So it wouldn't help me much without tackling the parser beast.

- The code in X is too complex; why can't we do Y instead?
- The use of X is needlessly convoluted; change Y to Z, and T to V in all instances.
- The code in X is unclear; please add a detailed comment - with examples - to explain what it does.
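To make the post's central contrast concrete, here is roughly what "one function per grammar rule" looks like, on a toy single-digit arithmetic grammar. This is an illustrative sketch in C (pycparser itself is Python and vastly larger); the names are mine:

```c
#include <assert.h>

/* A toy recursive-descent parser/evaluator for the grammar:
 *   expr   = term ('+' term)*
 *   term   = factor ('*' factor)*
 *   factor = digit
 * Each grammar rule becomes one ordinary function, which is why
 * such parsers are easy to read, step through, and extend. */
static const char *cur; /* cursor into the input string */

static int factor(void) {
    return *cur++ - '0'; /* consume a single decimal digit */
}

static int term(void) {
    int v = factor();
    while (*cur == '*') {
        cur++;
        v *= factor();
    }
    return v;
}

static int expr(void) {
    int v = term();
    while (*cur == '+') {
        cur++;
        v += term();
    }
    return v;
}

static int parse(const char *s) {
    cur = s;
    return expr();
}
```

Operator precedence falls out of the call structure (term binds tighter than expr, so '*' groups before '+'): no conflict tables, and no tie-breaking by rule order.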

Langur Monkey 1 month ago

Game Boy emulator tech stack update

In my previous post , I shared the journey of building Play Kid , my Game Boy emulator. At the time, I was using SDL2 to handle the “heavy lifting” of graphics, audio, and input. This was released as v0.1.0. It worked, and it worked well, but it always felt a bit like a “guest” in the Rust ecosystem. SDL2 is a C library at heart, and while the Rust wrappers are good, they bring along some baggage like shared library dependencies and difficult integration with Rust-native UI frameworks. So I decided to perform a heart transplant on Play Kid. For version v0.2.0 I’ve moved away from SDL2 entirely, replacing it with a stack of modern, native Rust libraries: , , , , , and :

The most visible change is the new Debug Panel . The new integrated debugger features a real-time disassembly view and breakpoint management. One of the coolest additions is the Code disassembly panel. It decodes the ROM instructions in real time, highlighting the current and allowing me to toggle breakpoints just by clicking on a line. The breakpoints themselves are now managed in a dedicated list, shown in red at the bottom. The rest of the debug panel shows what we already had: the state of the CPU, the PPU, and the joypad.

Of course, no modern Rust migration is complete without a descent into dependency hell . This new stack comes with a major catch: is a bit of a picky gatekeeper. Its latest version is 0.15 (January 2025). It is pinned to an older version of (0.19 vs the current 28.0), and it essentially freezes the rest of the project in a time capsule. To keep the types compatible, I’m forced to stay on 0.26 (current is 0.33) and 0.29 (current is 0.30), even though the rest of the ecosystem has moved on to much newer, shinier versions. It’s kind of frustrating: you get the convenience of the buffer, but you pay for it by being locked out of the latest API improvements and features. Navigating these version constraints felt like solving a hostage negotiation between crate maintainers.
Not very fun. Despite the dependency issues, I think the project is now in a much better place. The code is cleaner, the debugger is much better, and it’s easier to ship binaries for Linux, Windows, and macOS via GitHub Actions. If you’re interested in seeing the new architecture or trying out the new debugger, the code is updated on Codeberg and GitHub . Next, I’ll probably think about adding Game Boy Color support, but not before taking some time off from this project.

- & : These handle the windowing and the actual Game Boy frame buffer. allows me to treat the 160x144 LCD as a simple pixel buffer while handles the hardware-accelerated scaling and aspect ratio correction behind the scenes.
- : This was a big step up. Instead of my minimal homegrown UI library from the SDL2 version, I now have access to a full-featured, immediate-mode GUI. This allowed me to build the debugger I had in mind from the beginning.
- & : These replaced SDL2’s audio and controller handling with pure-Rust alternatives that feel much more ergonomic to use alongside the rest of the machine.
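For readers unfamiliar with the "simple pixel buffer" idea: the emulator core just writes into a flat, row-major array of 160 × 144 pixels each frame and hands the whole array to the scaling layer. Sketched in C (Play Kid itself is Rust; these names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* The Game Boy LCD as a flat, row-major pixel buffer:
   one 32-bit RGBA value per pixel, 160 columns by 144 rows. */
enum { GB_W = 160, GB_H = 144 };

static uint32_t framebuffer[GB_W * GB_H];

/* The PPU writes pixels by (x, y); the windowing layer consumes
   the whole array once per frame and scales it to the window. */
static void set_pixel(int x, int y, uint32_t rgba) {
    framebuffer[y * GB_W + x] = rgba;
}
```

Scaling, aspect-ratio correction, and GPU upload all happen outside the emulator core, which is exactly why the core can stay library-agnostic.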

Uros Popovic 1 month ago

Writing your first compiler

Build your first compiler with minimal, high-level, modern code. With only a few files of Go and C code, we can set up a workflow that dynamically fetches everything needed, including the LLVM library itself, and builds a portable compiler. It's a modern, reproducible stack, and you don't need to read dozens of pages to get started.

Michael Lynch 1 month ago

Refactoring English: Month 13

Hi, I’m Michael. I’m a software developer and founder of small, indie tech businesses. I’m currently working on a book called Refactoring English: Effective Writing for Software Developers . Every month, I publish a retrospective like this one to share how things are going with my book and my professional life overall. At the start of each month, I declare what I’d like to accomplish. Here’s how I did against those goals:

The blog post was a risky bet because it could only reach new readers if it hit the front page of Hacker News, and its only chance of that was the first couple of weeks of 2026. Fortunately, the post reached #1 on Hacker News and remained on the front page for almost 22 hours. It continues my strategy of highlighting other successful tech writers , a strategy I like because it feels like a win-win for me, readers, and the writers I showcase.

I still have the Hacker News prediction game at about 80% complete. I’m not sure what to do with it: it’s almost done, but I feel like it’s not fun, so I’m never motivated to complete it. But I want to get it over the finish line to see what people think.

Ironically, the chapters I’m working on are about motivation and focus, but I keep letting my experiments with MeshCore interfere with my writing. I’ve been better at maintaining focus in the new year, and the distractions have actually been helpful, because they give me fresh experience with regaining focus to write about.

Again, I got distracted by MeshCore experiments in December and didn’t make as much progress as I wanted. I love design docs and find them helpful, but they’re also incredibly boring to write, so it was always tempting to shelve the design doc for something with more instant gratification.

Pre-sales are down because I didn’t have any new posts to attract new readers (I didn’t publish the Hacker News post until January). Still, it’s a positive sign that my “passive sales” continue to grow. In December, I had almost $500 in pre-sales.
If I compare that to months with similar website visitors, May had only $241 in pre-sales, and August had $361, so the numbers are trending up. I hope that as the book grows more complete and more readers recommend it, passive sales will continue to rise without relying on me finding a successful marketing push each month.

When I ran my Black Friday promotion in November, a reader emailed to say that even at 30% off (US$20), the book is still unaffordably priced for Argentina. He asked if I’d consider regional pricing. He mentioned that Steam games are typically priced 50% lower in Argentina than in the US, so I figured that was a good anchor.

I collect payments through Stripe, and I couldn’t find any option for regional pricing in my Stripe dashboard. I found an article in Stripe’s knowledge base called “Geographic pricing in practice: Why it matters and how to implement it.” I was delighted until I read the entire article and discovered they’d forgotten to write the “and how to implement it” part. So, Stripe advocates for regional pricing, but they don’t actually offer it as an option. It was a helpful reminder that Stripe is the worst payment processor except for all those other payment processors.

So, for my Argentinian customer, I used a one-off process where I manually created a custom payment link for him at a discounted price. When I went through the process, I realized I could set the price in Argentine pesos so he wouldn’t have to pay a currency conversion fee. I set the price to 22,000 ARS (about US$15), and he seemed happy with the price and the checkout experience.

The reader suggested I publicly offer regional pricing, at least for countries like Brazil and India, which have high numbers of developers but relatively low purchasing power. Even without native Stripe support for regional pricing, it seemed like it wouldn’t be that hard to automate the thing I did manually.
I read about Sebastien Castiel implementing regional pricing for his course, which led me to Wes Bos’ post about the same thing. Sebastien shared a lot of technical details, but his solution was heavy on React, whereas my site is vanilla HTML and JavaScript. He also relied on discount codes, which I don’t like because it means most customers see that there’s a special deal they’re not getting.

I spent a few hours implementing a solution using a cloud function that determines the right price on the fly and dynamically creates a Stripe checkout link. Then, I realized I could precompute everything and eliminate the need for server-side logic, so I deleted my cloud function. In my implementation, the user just picks their country, which activates the Stripe purchase link for that country, and they pay in their own currency.

I’m going by the honor system, so I don’t bother with IP geolocation or VPN prevention. I do hide the discount for each country to discourage people from picking the cheapest option. And part of the benefit of pricing in each country’s local currency is that if someone cheats and picks a region that isn’t really home, they lose some money in conversion fees.

The numbers don’t feel quite correct. According to strict PPP, the equivalent of $30 in the US is $4 in Egypt, but I suspect you can’t really buy non-bootleg programming books in Egypt for $4. When Wes Bos did this, he just asked his readers to tell him fair prices, so I’ll try that too. Leave a comment or email me the normal price range for developer-oriented books in your country.

In December, I published “My First Impressions of MeshCore Off-Grid Messaging.” I was excited about the technology but disappointed to discover that the clients are all closed-source.
At that point, I decided to pause my exploration of MeshCore, but Frieder Schrempf, a MeshCore contributor, replied to my post with this interesting perspective:

“I share a lot of your thoughts on this topic. Personally I see the value of MeshCore in the protocol and not so much in the software implementations of the firmware, apps, etc. […] If MeshCore as a protocol succeeds and gets widely used (currently it looks like it does) then properly maintained open-source implementations will follow (at least I hope).”

I agreed with Frieder and thought, “Maybe I should just write a proof-of-concept open-source MeshCore app?” Actually, there already was a proof-of-concept MeshCore app. Liam Cottle, the developer of the official MeshCore app, previously wrote a web app for MeshCore as a prototype for the official version. He deprecated it when he made the official (proprietary) MeshCore app, but the source code for his prototype was still available, and the prototype had most of the features I needed.

I wondered how difficult it would be to port the prototype to mobile. MeshCore is too hard to use as a web app, as it requires Bluetooth access and offline mode. I’ve heard somewhat positive things about Flutter, Google’s solution for cross-platform mobile development. I suspected that an LLM could successfully port the code from the web prototype to Flutter without much intervention from me.

My plan was to have an LLM create a Flutter port of the prototype in three stages. That worked, but every step was clunkier than I anticipated. I thought it would be a quick weekend project I could whip together in a few hours. 30 hours and $200 in LLM credits later, I finally got it working.

(Photo: Running my MeshCore Flutter app on a real Android device)

But the day I got my Flutter implementation to feature parity with the prototype, I went to share it on Reddit and saw someone had just shared meshcore-open, a MeshCore client implementation in Flutter.
It was the same idea I had but with far better execution. I was disappointed that someone beat me to the punch, but I was also relieved. From my brief experience working with Flutter, I was eager to get away from it as quickly as possible. I only wanted to make a proof of concept in the hope that someone else would pick it up, so I’m happy that there’s now an open-source, feature-rich MeshCore client implementation.

While working on my MeshCore Flutter app, I had to implement low-level logic to parse MeshCore device-to-client messages. There’s a public spec that defines MeshCore’s peer-to-peer protocol, and even that’s fairly loose. But there’s another, undocumented protocol for how a device running MeshCore firmware communicates with a companion client (e.g., an Android app) over Bluetooth or USB. The de facto reference implementation is the MeshCore firmware, but it intermingles peer-to-peer protocol logic with device-to-client protocol logic and UI logic, and it spreads the implementation across disparate places in the codebase.

For example, a MeshCore client can fetch a list of contacts from a MeshCore device over Bluetooth, but it has to deserialize the raw bytes back into contacts. There’s no library for decoding the message, so each MeshCore client and library rolls its own separate implementation. I noticed some common problems across those implementations, which I’ll list below.

My first thought was to rewrite the logic using a protocol library like protobuf or Cap’n Proto, but I don’t see a backwards-compatible way of integrating a third-party library at this point. So, what if I wrote a core implementation of the MeshCore device-to-client protocol in C? I could add language-specific bindings so that we don’t need whole separate implementations for Dart, Python, JavaScript, and any other language you’d want to write in. So, I started my own MeshCore client library. The library is not ready to demo as a proof of concept, but it’s close.
It’s entirely possible the MeshCore maintainers won’t like this idea, and it’s basically dead in the water without their buy-in. But I did it anyway because I’d never tried writing a cross-language library, and that was an interesting experience. The last time I tried to call C code from Python was 20 years ago, and I had to use SWIG. Back then, it felt painful and hacky, and it seems to have gotten 80% better. I desperately wanted the core implementation to be Zig rather than C, but I saw too many blockers.

Is $30 (USD) for a developer-oriented book expensive where you live? If so, let me know what you’d expect to pay for a programming book like Designing Data-Intensive Applications in your country (in local currency).

Highlights this month:

- I added regional pricing for my book based on purchasing power parity.
- I created my first Flutter app.
- I’m writing my first cross-language library.

- Result: Published “The Most Popular Blogs of Hacker News in 2025” instead
- Result: Made progress on two chapters but didn’t complete them
- Result: Got 80% through a design doc draft

My process for generating the regional prices:

1. Manually get a list of all countries / currencies that Stripe supports.
2. Write a script that pulls data from the World Bank to calculate the purchasing power parity (PPP) for each country in the list.
3. Calculate each country’s discount based on its purchasing power relative to the US (e.g., the PPP of Brazil is 54% lower than the US, so it gets a 54% discount).
4. Filter out countries where the PPP is within 15% of the US (too small a discount to bother).
5. Filter out countries where the discount would be negative. Otherwise, customers in Luxembourg would have to pay double.
6. Limit the discount to a maximum of 75%. Otherwise the price in Egypt would be US$4, meaning I’d get like $3.50 after conversion fees.
7. Automatically generate country-specific Stripe price objects and Stripe payment links for each country remaining in the list.
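The discount and filtering rules described above can be sketched as a small Python helper. This is my own illustration, not mtlynch’s actual script: `ppp_ratio` is an assumed input meaning the country’s price level relative to the US (1.0 = identical purchasing power, 0.46 = 54% lower).

```python
from typing import Optional

def regional_discount(ppp_ratio: float) -> Optional[float]:
    """Sketch of the filtering rules described above (assumed logic,
    not the author's actual script). Returns the discount to offer,
    or None if the country should be left out of the dropdown."""
    discount = 1.0 - ppp_ratio    # e.g., Brazil: 1.0 - 0.46 = 0.54
    if discount < 0.0:
        return None               # richer than the US (Luxembourg): no markup
    if discount < 0.15:
        return None               # within 15% of the US: too small to bother
    return min(discount, 0.75)    # cap so Egypt isn't priced at US$4

print(round(regional_discount(0.46), 2))  # Brazil -> 0.54
print(regional_discount(0.10))            # Egypt-like: capped at 0.75
print(regional_discount(1.30))            # Luxembourg-like: None
```

The remaining countries would then each get a precomputed Stripe price and payment link, so no server-side logic runs at checkout time.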
Finally, I put all the countries in an HTML dropdown on my site.

The three stages of the Flutter port:

1. Write end-to-end tests for the prototype web app using Playwright.
2. Port the prototype implementation to a Flutter web app, keeping the end-to-end tests constant to ensure feature parity.
3. Add an Android build to the Flutter project.

Before I could write end-to-end tests for the prototype, I had to convert it to use semantic HTML and ARIA attributes, because a lot of the input labels were just bare elements. I couldn’t keep the Playwright tests constant because Flutter doesn’t actually emit semantic HTML for web apps. It creates its own Flutter-specific HTML dialect and draws everything on an HTML canvas. Most Playwright element locators still work somehow, but I had to make a lot of Flutter-specific changes to the tests.

It took a long time, even with an LLM, to figure out how to build an Android package with Flutter. Gradle, Android’s build system, is buggy on NixOS. I kept running into situations where it failed with mysterious errors that eventually turned out to be stale data it had cached in my home directory.

Flutter makes it surprisingly difficult to communicate over Bluetooth. On the web (at least on Chrome), you essentially get it for free by calling the Web Bluetooth API, but with Flutter, you have to use a proprietary third-party library and roll your own device picker UI.

The existing client implementations:

- meshcore.js (JavaScript)
- meshcore-open (Dart)
- meshcore_py (Python)

What I notice about those implementations:

- They have to use magic numbers rather than referring to constants defined in some authoritative location.
- None of them have automated tests for their parsers.
- They’re dragging unnecessary low-level work into high-level languages. For example, everyone is storing separate size variables alongside their arrays. That’s an artifact of the C implementation, where arrays don’t know their size. You don’t have to manually track an array’s size in languages like JavaScript, Python, or Dart.
- They don’t check data carefully, so they’ll happily pass on garbage data like a negative path length or GPS coordinates that are outside of Earth’s bounds.
- They all ignore the flags field even though the flags are supposed to indicate which fields are populated. Or at least they’re supposed to in the peer-to-peer messages; for device-to-client messages, they seem to be meaningless.

My library lives at https://codeberg.org/mtlynch/libmeshcore-client

The blockers that kept me from writing the core implementation in Zig:

- Zig does not yet compile to the xtensa architecture, which most of the MeshCore devices use.
- PlatformIO, which most of the MeshCore firmware projects use, does not support Zig.
- Dart’s ffigen would maybe work with Zig since Zig supports C’s ABI, but it was hard even getting it to work with C. Ditto for Python’s cffi.

What I accomplished this month:

- I got most of the way through writing two new chapters of Refactoring English.
- I got most of the way through writing the design doc for my photo sharing app idea.
- I published “The Most Popular Blogs of Hacker News in 2025.”
- I created my first Flutter app.
- I created my first cross-language library.
- I made some contributions to MeshCore and meshcore.js, most of which the maintainers are ignoring.

Lesson learned: minimize in-flight projects. AI makes it easier than ever to start new projects, but I’m still the bottleneck on turning them into something production-ready. The result is that I have a lot of projects that are in-flight and waiting for me to review them before I publish them. There’s mental overhead in so much context-switching and task tracking.

Goals for next month:

- Publish three chapters of Refactoring English.
- Publish my 2025 annual review (year 8).

<antirez> 1 month ago

Don't fall into the anti-AI hype

I love writing software, line by line. You could say my career has been a continuous effort to create well-written, minimal software, where the human touch was the fundamental feature. I also hope for a society where the least fortunate are not forgotten. Moreover, I have no desire for AI to succeed economically, and I don't care if the current economic system is subverted (I could be very happy, honestly, if it went in the direction of a massive redistribution of wealth). But I would not respect myself and my intelligence if I let my ideas about software and society impair my vision: facts are facts, and AI is going to change programming forever.

In 2020 I left my job in order to write a novel about AI, universal basic income, and a society adapting to the automation of work while facing many challenges. At the very end of 2024 I opened a YouTube channel focused on AI, its use in coding tasks, and its potential social and economic effects. But while I recognized what was going to happen very early, I thought we had more time before programming would be completely reshaped - at least a few years. I no longer believe this is the case.

Recently, state-of-the-art LLMs have become able to complete large subtasks, or medium-size projects, alone and almost unassisted, given a good set of hints about what the end result should be. The degree of success you'll get depends on the kind of programming you do (the more isolated, and the more textually representable, the better: system programming is particularly apt), and on your ability to build a mental representation of the problem to communicate to the LLM. But, in general, it is now clear that for most projects, writing the code yourself is no longer sensible, except to have fun. In the past week, just by prompting and inspecting the code to provide guidance from time to time, I completed the following four tasks in hours instead of weeks:
1. I modified my linenoise library to support UTF-8, and created a framework for testing line editing that uses an emulated terminal able to report what is displayed in each character cell. It's something I had always wanted to do, but it was hard to justify the work needed just to test a side project of mine. If you can just describe your idea and it materializes in the code, things are very different.

2. I fixed transient failures in the Redis tests. This is very annoying work: timing-related issues, TCP deadlock conditions, and so forth. Claude Code iterated for as long as needed to reproduce each failure, inspected the state of the processes to understand what was happening, and fixed the bugs.

3. Yesterday I wanted a pure C library able to run inference for BERT-like embedding models. Claude Code created it in 5 minutes: the same output and nearly the same speed (15% slower) as PyTorch, in 700 lines of code, plus a Python tool to convert the GTE-small model.

4. In the past weeks I made changes to the Redis Streams internals. I had a design document for the work I did. I gave it to Claude Code and it reproduced my work in 20 minutes or less (mostly because I'm slow at checking and authorizing the commands it needs to run).

It is simply impossible not to see the reality of what is happening. Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it (and, about this second part, LLMs are great partners too). It does not matter if AI companies fail to get their money back and the stock market crashes. All that is irrelevant in the long run. It does not matter if this or that CEO of some unicorn tells you something off-putting or absurd. Programming has changed forever, anyway. How do I feel about all the code I wrote that was ingested by LLMs?
I feel great about being part of that, because I see this as a continuation of what I have tried to do all my life: democratizing code, systems, knowledge. LLMs are going to help us write better software, faster, and will give small teams a chance to compete with bigger companies - the same thing open source software did in the '90s.

However, this technology is far too important to be in the hands of a few companies. For now, one lab may do pre-training better, or run reinforcement learning much more effectively than the others, but the open models, especially the ones produced in China, continue to compete (even if they are behind) with the frontier models of the closed labs. So far there is a sufficient democratization of AI, even if imperfect. But it is absolutely not obvious that it will stay that way forever. I'm scared of centralization. At the same time, I believe neural networks, at scale, are simply able to do incredible things, and that there is not enough "magic" inside current frontier AI for the other labs and teams not to catch up (otherwise it would be very hard to explain, for instance, why OpenAI, Anthropic and Google have stayed so close in their results for years now).

As a programmer, I want to write more open source than ever now. I want to improve certain repositories of mine abandoned for lack of time. I want to apply AI to my Redis workflow: improve the Vector Sets implementation and then other data structures, like I'm doing with Streams now. But I'm worried for the folks who will get fired. It is not clear what the dynamic at play will be: will companies try to have more people, and build more? Or will they try to cut salary costs, keeping fewer programmers who are better at prompting? And there are other sectors where humans will become completely replaceable, I fear. What is the social solution, then? Innovation can't be taken back, after all.
I believe we should vote for governments that recognize what is happening and are willing to support those who will be left jobless. And the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection. But I also look forward to the good AI could bring: new progress in science that could help reduce the suffering of the human condition, which is not always a happy one.

Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe the Right Thing should be, you can't control it by refusing to see what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools with care, over weeks of work, not in a five-minute trial where you just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months. Yes, maybe you think you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you when you coded into the night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched.

Simon Willison 2 months ago

2025: The year in LLMs

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years see Stuff we figured out about AI in 2023 and Things we learned about LLMs in 2024. It's been a year filled with a lot of different trends.

OpenAI kicked off the "reasoning" aka inference-scaling aka Reinforcement Learning from Verifiable Rewards (RLVR) revolution in September 2024 with o1 and o1-mini. They doubled down on that with o3, o3-mini and o4-mini in the opening months of 2025, and reasoning has since become a signature feature of models from nearly every other major AI lab. My favourite explanation of the significance of this trick comes from Andrej Karpathy:

By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples). [...] Running RLVR turned out to offer high capability/$, which gobbled up the compute that was originally intended for pretraining. Therefore, most of the capability progress of 2025 was defined by the LLM labs chewing through the overhang of this new stage and overall we saw ~similar sized LLMs but a lot longer RL runs.

Every notable AI lab released at least one reasoning model in 2025. Some labs released hybrids that could be run in reasoning or non-reasoning modes. Many API models now include dials for increasing or decreasing the amount of reasoning applied to a given prompt.

It took me a while to understand what reasoning was useful for. Initial demos showed it solving mathematical logic puzzles and counting the Rs in strawberry - two things I didn't find myself needing in my day-to-day model usage.
It turned out that the real unlock of reasoning was in driving tools. Reasoning models with access to tools can plan out multi-step tasks, execute on them, and continue to reason about the results such that they can update their plans to better achieve the desired goal. A notable result is that AI-assisted search actually works now. Hooking up search engines to LLMs had questionable results before, but now I find even my more complex research questions can often be answered by GPT-5 Thinking in ChatGPT.

Reasoning models are also exceptional at producing and debugging code. The reasoning trick means they can start with an error and step through many different layers of the codebase to find the root cause. I've found even the gnarliest of bugs can be diagnosed by a good reasoner with the ability to read and execute code against even large and complex codebases. Combine reasoning with tool use and you get...

I started the year making a prediction that agents were not going to happen. Throughout 2024 everyone was talking about agents, but there were few to no examples of them working, further confused by the fact that everyone using the term "agent" appeared to be working from a slightly different definition from everyone else. By September I'd got fed up with avoiding the term myself due to the lack of a clear definition and decided to treat agents as LLMs that run tools in a loop to achieve a goal. This unblocked me for productive conversations about them, always my goal for any piece of terminology like that.

I didn't think agents would happen because I didn't think the gullibility problem could be solved, and I thought the idea of replacing human staff members with LLMs was still laughable science fiction. I was half right in my prediction: the science fiction version of a magic computer assistant that does anything you ask of it (Her) didn't materialize...
But if you define agents as LLM systems that can perform useful work via tool calls over multiple steps, then agents are here and they are proving to be extraordinarily useful. The two breakout categories for agents have been coding and search.

The Deep Research pattern - where you challenge an LLM to gather information and it churns away for 15+ minutes building you a detailed report - was popular in the first half of the year but has fallen out of fashion now that GPT-5 Thinking (and Google's "AI mode", a significantly better product than their terrible "AI overviews") can produce comparable results in a fraction of the time. I consider this to be an agent pattern, and one that works really well.

The "coding agents" pattern is a much bigger deal. The most impactful event of 2025 happened in February, with the quiet release of Claude Code. I say quiet because it didn't even get its own blog post! Anthropic bundled the Claude Code release in as the second item in their post announcing Claude 3.7 Sonnet. (Why did Anthropic jump from Claude 3.5 Sonnet to 3.7? Because they released a major bump to Claude 3.5 in October 2024 but kept the name exactly the same, causing the developer community to start referring to the un-named 3.5 Sonnet v2 as 3.6. Anthropic burned a whole version number by failing to properly name their new model!)

Claude Code is the most prominent example of what I call coding agents - LLM systems that can write code, execute that code, inspect the results and then iterate further. The major labs all put out their own CLI coding agents in 2025. Vendor-independent options include GitHub Copilot CLI, Amp, OpenCode, OpenHands CLI, and Pi. IDEs such as Zed, VS Code and Cursor invested a lot of effort in coding agent integration as well.

My first exposure to the coding agent pattern was OpenAI's ChatGPT Code Interpreter in early 2023 - a system baked into ChatGPT that allowed it to run Python code in a Kubernetes sandbox.
I was delighted this year when Anthropic finally released their equivalent in September, albeit under the baffling initial name of "Create and edit files with Claude". In October they repurposed that container sandbox infrastructure to launch Claude Code for web, which I've been using on an almost daily basis ever since. Claude Code for web is what I call an asynchronous coding agent - a system you can prompt and forget, and it will work away on the problem and file a Pull Request once it's done. OpenAI's "Codex cloud" (renamed to "Codex web" in the last week) launched earlier, in May 2025. Gemini's entry in this category is called Jules, also launched in May.

I love the asynchronous coding agent category. They're a great answer to the security challenges of running arbitrary code execution on a personal laptop, and it's really fun being able to fire off multiple tasks at once - often from my phone - and get decent results a few minutes later. I wrote more about how I'm using these in Code research projects with async coding agents like Claude Code and Codex and Embracing the parallel coding agent lifestyle.

In 2024 I spent a lot of time hacking on my LLM command-line tool for accessing LLMs from the terminal, all the time thinking that it was weird that so few people were taking CLI access to models seriously - they felt like such a natural fit for Unix mechanisms like pipes. Maybe the terminal was just too weird and niche to ever become a mainstream tool for accessing LLMs? Claude Code and friends have conclusively demonstrated that developers will embrace LLMs on the command line, given powerful enough models and the right harness. It helps that terminal commands with obscure syntax are no longer a barrier to entry when an LLM can spit out the right command for you. As of December 2nd, Anthropic credits Claude Code with $1bn in run-rate revenue! I did not expect a CLI tool to reach anything close to those numbers.
With hindsight, maybe I should have promoted LLM from a side-project to a key focus!

The default setting for most coding agents is to ask the user for confirmation for almost every action they take. In a world where an agent mistake could wipe your home folder or a malicious prompt injection attack could steal your credentials, this default makes total sense. Anyone who's tried running their agent with automatic confirmation (aka YOLO mode, for which Codex CLI even provides an alias) has experienced the trade-off: using an agent without the safety wheels feels like a completely different product. A big benefit of asynchronous coding agents like Claude Code for web and Codex Cloud is that they can run in YOLO mode by default, since there's no personal computer to damage. I run in YOLO mode all the time, despite being deeply aware of the risks involved. It hasn't burned me yet...

... and that's the problem. One of my favourite pieces on LLM security this year is The Normalization of Deviance in AI by security researcher Johann Rehberger. Johann describes the "Normalization of Deviance" phenomenon, where repeated exposure to risky behaviour without negative consequences leads people and organizations to accept that risky behaviour as normal. This was originally described by sociologist Diane Vaughan as part of her work to understand the 1986 Space Shuttle Challenger disaster, caused by a faulty O-ring that engineers had known about for years. Plenty of successful launches led NASA culture to stop taking that risk seriously. Johann argues that the longer we get away with running these systems in fundamentally insecure ways, the closer we get to a Challenger disaster of our own.

ChatGPT Plus's original $20/month price turned out to be a snap decision by Nick Turley based on a Google Form poll on Discord. That price point has stuck firmly ever since. This year a new pricing precedent has emerged: the Claude Pro Max 20x plan, at $200/month.
OpenAI have a similar $200 plan called ChatGPT Pro. Gemini have Google AI Ultra at $249/month with a $124.99/month three-month starting discount. These plans appear to be driving some serious revenue, though none of the labs have shared figures that break down their subscribers by tier. I've personally paid $100/month for Claude in the past and will upgrade to the $200/month plan once my current batch of free allowance (from previewing one of their models - thanks, Anthropic) runs out. I've heard from plenty of other people who are happy to pay these prices too. You have to use models a lot in order to spend $200 of API credits, so you would think it would make economic sense for most people to pay by the token instead. It turns out tools like Claude Code and Codex CLI can burn through enormous amounts of tokens once you start setting them more challenging tasks, to the point that $200/month offers a substantial discount.

2024 saw some early signs of life from the Chinese AI labs, mainly in the form of Qwen 2.5 and early DeepSeek. They were neat models but didn't feel world-beating. This changed dramatically in 2025. My ai-in-china tag has 67 posts from 2025 alone, and I missed a bunch of key releases towards the end of the year (GLM-4.7 and MiniMax-M2.1 in particular). Here's the Artificial Analysis ranking for open weight models as of 30th December 2025: GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, and MiniMax-M2.1 are all Chinese open weight models. The highest non-Chinese model in that chart is OpenAI's gpt-oss-120B (high), which comes in sixth place.

The Chinese model revolution really kicked off on Christmas day 2024 with the release of DeepSeek V3, supposedly trained for around $5.5m. DeepSeek followed that on 20th January with DeepSeek R1, which promptly triggered a major AI/semiconductor selloff: NVIDIA lost ~$593bn in market cap as investors panicked that AI maybe wasn't an American monopoly after all.
The panic didn't last - NVIDIA quickly recovered and today are up significantly from their pre-DeepSeek R1 levels. It was still a remarkable moment. Who knew an open weight model release could have that kind of impact? DeepSeek were quickly joined by an impressive roster of Chinese AI labs, several of which I've been paying particular attention to. Most of these models aren't just open weight, they are fully open source under OSI-approved licenses: Qwen use Apache 2.0 for most of their models, DeepSeek and Z.ai use MIT. Some of them are competitive with Claude 4 Sonnet and GPT-5! Sadly none of the Chinese labs have released their full training data or the code they used to train their models, but they have been putting out detailed research papers that have helped push forward the state of the art, especially when it comes to efficient training and inference.

One of the most interesting recent charts about LLMs is "Time horizon of software engineering tasks different LLMs can complete 50% of the time" from METR. The chart shows tasks that take humans up to 5 hours, and plots the evolution of models that can achieve the same goals working independently. As you can see, 2025 saw some enormous leaps forward here, with GPT-5, GPT-5.1 Codex Max and Claude Opus 4.5 able to perform tasks that take humans multiple hours - 2024's best models tapped out at under 30 minutes. METR conclude that "the length of tasks AI can do is doubling every 7 months". I'm not convinced that pattern will continue to hold, but it's an eye-catching way of illustrating current trends in agent capabilities.

The most successful consumer product launch of all time happened in March, and the product didn't even have a name. One of the signature features of GPT-4o in May 2024 was meant to be its multimodal output - the "o" stood for "omni" - and OpenAI's launch announcement included numerous "coming soon" features where the model output images in addition to text. Then... nothing.
The image output feature failed to materialize. In March we finally got to see what this could do - albeit in a shape that felt more like the existing DALL-E. OpenAI made this new image generation available in ChatGPT with the key feature that you could upload your own images and use prompts to tell it how to modify them. This new feature was responsible for 100 million ChatGPT signups in a week. At peak they saw 1 million account creations in a single hour! Tricks like "ghiblification" - modifying a photo to look like a frame from a Studio Ghibli movie - went viral time and time again. OpenAI released an API version of the model called "gpt-image-1", later joined by a cheaper gpt-image-1-mini in October and a much improved gpt-image-1.5 on December 16th . The most notable open weight competitor to this came from Qwen with their Qwen-Image generation model on August 4th followed by Qwen-Image-Edit on August 19th . This one can run on (well equipped) consumer hardware! They followed with Qwen-Image-Edit-2511 in November and Qwen-Image-2512 on 30th December, neither of which I've tried yet. The even bigger news in image generation came from Google with their Nano Banana models, available via Gemini. Google previewed an early version of this in March under the name "Gemini 2.0 Flash native image generation". The really good one landed on August 26th , where they started cautiously embracing the codename "Nano Banana" in public (the API model was called " Gemini 2.5 Flash Image "). Nano Banana caught people's attention because it could generate useful text ! It was also clearly the best model at following image editing instructions. In November Google fully embraced the "Nano Banana" name with the release of Nano Banana Pro . This one doesn't just generate text, it can output genuinely useful detailed infographics and other text and information-heavy images. It's now a professional-grade tool. 
Max Woolf published the most comprehensive guide to Nano Banana prompting, and followed that up with an essential guide to Nano Banana Pro in December. I've mainly been using it to add kākāpō parrots to my photos. Given how incredibly popular these image tools are it's a little surprising that Anthropic haven't released or integrated anything similar into Claude. I see this as further evidence that they're focused on AI tools for professional work, but Nano Banana Pro is rapidly proving itself to be of value to anyone whose work involves creating presentations or other visual materials. In July reasoning models from both OpenAI and Google Gemini achieved gold medal performance in the International Math Olympiad, a prestigious mathematical competition held annually (bar 1980) since 1959. This was notable because the IMO poses challenges that are designed specifically for that competition. There's no chance any of these were already in the training data! It's also notable because neither of the models had access to tools - their solutions were generated purely from their internal knowledge and token-based reasoning capabilities. Turns out sufficiently advanced LLMs can do math after all! In September OpenAI and Gemini pulled off a similar feat for the International Collegiate Programming Contest (ICPC) - again notable for having novel, previously unpublished problems. This time the models had access to a code execution environment but otherwise no internet access. I don't believe the exact models used for these competitions have been released publicly, but Gemini's Deep Think and OpenAI's GPT-5 Pro should provide close approximations. With hindsight, 2024 was the year of Llama. Meta's Llama models were by far the most popular open weight models - the original Llama kicked off the open weight revolution back in 2023 and the Llama 3 series, in particular the 3.1 and 3.2 dot-releases, were huge leaps forward in open weight capability.
Llama 4 had high expectations, and when it landed in April it was... kind of disappointing. There was a minor scandal where the model tested on LMArena turned out not to be the model that was released, but my main complaint was that the models were too big . The neatest thing about previous Llama releases was that they often included sizes you could run on a laptop. The Llama 4 Scout and Maverick models were 109B and 400B, so big that even quantization wouldn't get them running on my 64GB Mac. They were trained using the 2T Llama 4 Behemoth which seems to have been forgotten now - it certainly wasn't released. It says a lot that none of the most popular models listed by LM Studio are from Meta, and the most popular on Ollama is still Llama 3.1, which is low on the charts there too. Meta's AI news this year mainly involved internal politics and vast amounts of money spent hiring talent for their new Superintelligence Labs . It's not clear if there are any future Llama releases in the pipeline or if they've moved away from open weight model releases to focus on other things. Last year OpenAI remained the undisputed leader in LLMs, especially given o1 and the preview of their o3 reasoning models. This year the rest of the industry caught up. OpenAI still have top tier models, but they're being challenged across the board. In image models they're still being beaten by Nano Banana Pro. For code a lot of developers rate Opus 4.5 very slightly ahead of GPT-5.2 Codex. In open weight models their gpt-oss models, while great, are falling behind the Chinese AI labs. Their lead in audio is under threat from the Gemini Live API . Where OpenAI are winning is in consumer mindshare. Nobody knows what an "LLM" is but almost everyone has heard of ChatGPT. Their consumer apps still dwarf Gemini and Claude in terms of user numbers. Their biggest risk here is Gemini. 
In December OpenAI declared a Code Red in response to Gemini 3, delaying work on new initiatives to focus on the competition with their key products. Google Gemini had a really good year . They posted their own victorious 2025 recap here . 2025 saw Gemini 2.0, Gemini 2.5 and then Gemini 3.0 - each model family supporting audio/video/image/text input of 1,000,000+ tokens, priced competitively and proving more capable than the last. They also shipped Gemini CLI (their open source command-line coding agent, since forked by Qwen for Qwen Code ), Jules (their asynchronous coding agent), constant improvements to AI Studio, the Nano Banana image models, Veo 3 for video generation, the promising Gemma 3 family of open weight models and a stream of smaller features. Google's biggest advantage lies under the hood. Almost every other AI lab trains with NVIDIA GPUs, which are sold at a margin that props up NVIDIA's multi-trillion dollar valuation. Google use their own in-house hardware, TPUs, which they've demonstrated this year work exceptionally well for both training and inference of their models. When your number one expense is time spent on GPUs, having a competitor with their own, optimized and presumably much cheaper hardware stack is a daunting prospect. It continues to tickle me that Google Gemini is the ultimate example of a product name that reflects the company's internal org-chart - it's called Gemini because it came out of the bringing together (as twins) of Google's DeepMind and Google Brain teams. I first asked an LLM to generate an SVG of a pelican riding a bicycle in October 2024 , but 2025 is when I really leaned into it. It's ended up a meme in its own right. I originally intended it as a dumb joke. Bicycles are hard to draw, as are pelicans, and pelicans are the wrong shape to ride a bicycle. 
I was pretty sure there wouldn't be anything relevant in the training data, so asking a text-output model to generate an SVG illustration of one felt like a somewhat absurdly difficult challenge. To my surprise, there appears to be a correlation between how good the model is at drawing pelicans on bicycles and how good it is overall. I don't really have an explanation for this. The pattern only became clear to me when I was putting together a last-minute keynote (they had a speaker drop out) for the AI Engineer World's Fair in July. You can read (or watch) the talk I gave here: The last six months in LLMs, illustrated by pelicans on bicycles. My full collection of illustrations can be found on my pelican-riding-a-bicycle tag - 89 posts and counting. There is plenty of evidence that the AI labs are aware of the benchmark. It showed up (for a split second) in the Google I/O keynote in May, got a mention in an Anthropic interpretability research paper in October and I got to talk about it in a GPT-5 launch video filmed at OpenAI HQ in August. Are they training specifically for the benchmark? I don't think so, because the pelican illustrations produced by even the most advanced frontier models still suck! In What happens if AI labs train for pelicans riding bicycles? I confessed to my devious objective: Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one. My favourite is still this one that I got from GPT-5: I started my tools.simonwillison.net site last year as a single location for my growing collection of vibe-coded / AI-assisted HTML+JavaScript tools. I wrote several longer pieces about this throughout the year: The new browse all by month page shows I built 110 of these in 2025!
I really enjoy building in this way, and I think it's a fantastic way to practice and explore the capabilities of these models. Almost every tool is accompanied by a commit history that links to the prompts and transcripts I used to build them. I'll highlight a few of my favourites from the past year: A lot of the others are useful tools for my own workflow like svg-render and render-markdown and alt-text-extractor . I built one that does privacy-friendly personal analytics against localStorage to keep track of which tools I use the most often. Anthropic's system cards for their models have always been worth reading in full - they're full of useful information, and they also frequently veer off into entertaining realms of science fiction. The Claude 4 system card in May had some particularly fun moments - highlights mine: Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts. This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users , given access to a command line, and told something in the system prompt like “ take initiative ,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. In other words, Claude 4 might snitch you out to the feds. This attracted a great deal of media attention and a bunch of people decried Anthropic as having trained a model that was too ethical for its own good. Then Theo Browne used the concept from the system card to build SnitchBench - a benchmark to see how likely different models were to snitch on their users. It turns out they almost all do the same thing ! Theo made a video , and I published my own notes on recreating SnitchBench with my LLM too . 
The key prompt that makes this work is: I recommend not putting that in your system prompt! Anthropic's original Claude 4 system card said the same thing: We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable. In a tweet in February Andrej Karpathy coined the term "vibe coding", with an unfortunately long definition (I miss the 140 character days) that many people failed to read all the way to the end: There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works. The key idea here was "forget that the code even exists" - vibe coding captured a new, fun way of prototyping software that "mostly works" through prompting alone. I don't know if I've ever seen a new term catch on - or get distorted - so quickly in my life. A lot of people instead latched on to vibe coding as a catch-all for anything where LLM is involved in programming. 
I think that's a waste of a great term, especially since it's becoming increasingly likely that most programming will involve some level of AI-assistance in the near future. Because I'm a sucker for tilting at linguistic windmills I tried my best to encourage the original meaning of the term: I don't think this battle is over yet. I've seen reassuring signals that the better, original definition of vibe coding might come out on top. I should really get a less confrontational linguistic hobby! Anthropic introduced their Model Context Protocol specification in November 2024 as an open standard for integrating tool calls with different LLMs. In early 2025 it exploded in popularity. There was a point in May where OpenAI, Anthropic, and Mistral all rolled out API-level support for MCP within eight days of each other! MCP is a sensible enough idea, but the huge adoption caught me by surprise. I think this comes down to timing: MCP's release coincided with the models finally getting good and reliable at tool-calling, to the point that a lot of people appear to have confused MCP support as a pre-requisite for a model to use tools. For a while it also felt like MCP was a convenient answer for companies that were under pressure to have "an AI strategy" but didn't really know how to do that. Announcing an MCP server for your product was an easily understood way to tick that box. The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents. It appears that the best possible tool for any situation is Bash - if your agent can run arbitrary shell commands, it can do anything that can be done by typing commands into a terminal. Since leaning heavily into Claude Code and friends myself I've hardly used MCP at all - I've found CLI tools and libraries like Playwright to be better alternatives to the GitHub and Playwright MCPs.
Anthropic themselves appeared to acknowledge this later in the year with their release of the brilliant Skills mechanism - see my October post Claude Skills are awesome, maybe a bigger deal than MCP . MCP involves web servers and complex JSON payloads. A Skill is a Markdown file in a folder, optionally accompanied by some executable scripts. Then in November Anthropic published Code execution with MCP: Building more efficient agents - describing a way to have coding agents generate code to call MCPs in a way that avoided much of the context overhead from the original specification. (I'm proud of the fact that I reverse-engineered Anthropic's skills a week before their announcement , and then did the same thing to OpenAI's quiet adoption of skills two months after that .) MCP was donated to the new Agentic AI Foundation at the start of December. Skills were promoted to an "open format" on December 18th . Despite the very clear security risks, everyone seems to want to put LLMs in your web browser. OpenAI launched ChatGPT Atlas in October, built by a team including long-time Google Chrome engineers Ben Goodger and Darin Fisher. Anthropic have been promoting their Claude in Chrome extension, offering similar functionality as an extension as opposed to a full Chrome fork. Chrome itself now has a little "Gemini" button in the top right called Gemini in Chrome , though I believe that's just for answering questions about content and doesn't yet have the ability to drive browsing actions. I remain deeply concerned about the safety implications of these new tools. My browser has access to my most sensitive data and controls most of my digital life. A prompt injection attack against a browsing agent that can exfiltrate or modify that data is a terrifying prospect. 
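To illustrate how lightweight the Skills shape is compared to MCP's web servers and JSON payloads: a skill is just a folder. The layout and frontmatter fields below are a sketch based on the "Markdown file in a folder, optionally accompanied by some executable scripts" description - the file contents shown are illustrative, not a normative spec:

```
pdf-tools/                  # one folder per skill
├── SKILL.md                # Markdown instructions the agent loads on demand
└── scripts/
    └── fill_form.py        # optional executable helper

# SKILL.md opens with a short frontmatter header, for example:
# ---
# name: pdf-tools
# description: Use when the user asks to read or fill in PDF forms
# ---
# ...followed by plain-Markdown instructions written for the model.
```

The agent only reads the full instructions when the skill looks relevant, which is part of what keeps the context overhead so much lower than streaming MCP tool definitions.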
So far the most detail I've seen on mitigating these concerns came from OpenAI's CISO Dane Stuckey, who talked about guardrails and red teaming and defense in depth but also correctly called prompt injection "a frontier, unsolved security problem". I've used these browser agents a few times now (example), under very close supervision. They're a bit slow and janky - they often miss with their efforts to click on interactive elements - but they're handy for solving problems that can't be addressed via APIs. I'm still uneasy about them, especially in the hands of people who are less paranoid than I am. I've been writing about prompt injection attacks for more than three years now. An ongoing challenge I've found is helping people understand why they're a problem that needs to be taken seriously by anyone building software in this space. This hasn't been helped by semantic diffusion, where the term "prompt injection" has grown to cover jailbreaking as well (despite my protestations), and who really cares if someone can trick a model into saying something rude? So I tried a new linguistic trick! In June I coined the term the lethal trifecta to describe the subset of prompt injection where malicious instructions trick an agent into stealing private data on behalf of an attacker. A trick I use here is that people will jump straight to the most obvious definition of any new term that they hear. "Prompt injection" sounds like it means "injecting prompts". "The lethal trifecta" is deliberately ambiguous: you have to go searching for my definition if you want to know what it means! It seems to have worked. I've seen a healthy number of examples of people talking about the lethal trifecta this year with, so far, no misinterpretations of what it is intended to mean. I wrote significantly more code on my phone this year than I did on my computer. Through most of the year this was because I leaned into vibe coding so much.
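For the record, the trifecta is the combination of three capabilities: access to private data, exposure to untrusted content, and the ability to communicate externally. As a sketch only (the class and function names below are invented for this example, not from any real framework):

```python
# Illustrative sketch: an agent deployment is exposed to the "lethal
# trifecta" when it combines all three capabilities below.
# The names here are invented for the example, not from a real tool.
from dataclasses import dataclass

@dataclass
class AgentDeployment:
    accesses_private_data: bool       # e.g. reads email, files, internal docs
    sees_untrusted_content: bool      # e.g. browses web pages, inbound mail
    can_communicate_externally: bool  # e.g. makes HTTP requests, sends email

def lethal_trifecta(agent: AgentDeployment) -> bool:
    """All three together let injected instructions exfiltrate private data."""
    return (agent.accesses_private_data
            and agent.sees_untrusted_content
            and agent.can_communicate_externally)

browser_agent = AgentDeployment(True, True, True)
print(lethal_trifecta(browser_agent))  # True - removing any one leg mitigates
```

The useful property of the framing is that it points directly at the mitigation: you don't have to solve prompt injection, you just have to make sure no single agent holds all three capabilities at once.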
My tools.simonwillison.net collection of HTML+JavaScript tools was mostly built this way: I would have an idea for a small project, prompt Claude Artifacts or ChatGPT or (more recently) Claude Code via their respective iPhone apps, then either copy the result and paste it into GitHub's web editor or wait for a PR to be created that I could then review and merge in Mobile Safari. Those HTML tools are often ~100-200 lines of code, full of uninteresting boilerplate and duplicated CSS and JavaScript patterns - but 110 of them adds up to a lot! Up until November I would have said that I wrote more code on my phone, but the code I wrote on my laptop was clearly more significant - fully reviewed, better tested and intended for production use. In the past month I've grown confident enough in Claude Opus 4.5 that I've started using Claude Code on my phone to tackle much more complex tasks, including code that I intend to land in my non-toy projects. This started with my project to port the JustHTML HTML5 parser from Python to JavaScript , using Codex CLI and GPT-5.2. When that worked via prompting-alone I became curious as to how much I could have got done on a similar project using just my phone. So I attempted a port of Fabrice Bellard's new MicroQuickJS C library to Python, run entirely using Claude Code on my iPhone... and it mostly worked ! Is it code that I'd use in production? Certainly not yet for untrusted code , but I'd trust it to execute JavaScript I'd written myself. The test suite I borrowed from MicroQuickJS gives me some confidence there. This turns out to be the big unlock: the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against. 
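To sketch what such an existing, language-agnostic test suite looks like in practice: cases live in a neutral data format, so any implementation in any language can check itself against the same expectations. The JSON format, the toy expression grammar, and the `evaluate()` stand-in below are all invented for illustration - real suites like the html5lib tests define their own, richer formats:

```python
# Sketch of a language-agnostic test suite: cases are plain data (JSON
# here), so a port in any language can run the same checks.
# The suite contents and the evaluate() stand-in are invented examples.
import json

SUITE = json.loads("""
[
  {"input": "1+2",     "expected": 3},
  {"input": "2*3+4",   "expected": 10},
  {"input": "(1+2)*3", "expected": 9}
]
""")

def evaluate(expr: str) -> int:
    # Stand-in for the implementation under test; a real project would
    # invoke its own parser/evaluator here.
    return eval(expr)  # acceptable for a trusted, hand-written suite

def run_suite(suite):
    failures = [c["input"] for c in suite if evaluate(c["input"]) != c["expected"]]
    return len(suite) - len(failures), failures

passed, failures = run_suite(SUITE)
print(f"{passed}/{len(SUITE)} passed; failures: {failures}")
```

A coding agent pointed at a suite like this gets an unambiguous, machine-checkable definition of done, which is what made the html5lib and MicroQuickJS ports tractable.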
I call these conformance suites and I've started deliberately looking out for them - so far I've had success with the html5lib tests , the MicroQuickJS test suite and a not-yet-released project against the comprehensive WebAssembly spec/test collection . If you're introducing a new protocol or even a new programming language to the world in 2026 I strongly recommend including a language-agnostic conformance suite as part of your project. I've seen plenty of hand-wringing that the need to be included in LLM training data means new technologies will struggle to gain adoption. My hope is that the conformance suite approach can help mitigate that problem and make it easier for new ideas of that shape to gain traction. Towards the end of 2024 I was losing interest in running local LLMs on my own machine. My interest was re-kindled by Llama 3.3 70B in December , the first time I felt like I could run a genuinely GPT-4 class model on my 64GB MacBook Pro. Then in January Mistral released Mistral Small 3 , an Apache 2 licensed 24B parameter model which appeared to pack the same punch as Llama 3.3 70B using around a third of the memory. Now I could run a ~GPT-4 class model and have memory left over to run other apps! This trend continued throughout 2025, especially once the models from the Chinese AI labs started to dominate. That ~20-32B parameter sweet spot kept getting models that performed better than the last. I got small amounts of real work done offline! My excitement for local LLMs was very much rekindled. The problem is that the big cloud models got better too - including those open weight models that, while freely available, were far too large (100B+) to run on my laptop. Coding agents changed everything for me. Systems like Claude Code need more than a great model - they need a reasoning model that can perform reliable tool calling invocations dozens if not hundreds of times over a constantly expanding context window. 
I have yet to try a local model that handles Bash tool calls reliably enough for me to trust that model to operate a coding agent on my device. My next laptop will have at least 128GB of RAM, so there's a chance that one of the 2026 open weight models might fit the bill. For now though I'm sticking with the best available frontier hosted models as my daily drivers. I played a tiny role helping to popularize the term "slop" in 2024, writing about it in May and landing quotes in the Guardian and the New York Times shortly afterwards. This year Merriam-Webster crowned it word of the year! slop (noun): digital content of low quality that is produced usually in quantity by means of artificial intelligence I like that it represents a widely understood feeling that poor quality AI-generated content is bad and should be avoided. I'm still holding out hope that slop won't end up as bad a problem as many people fear. The internet has always been flooded with low quality content. The challenge, as ever, is to find and amplify the good stuff. I don't see the increased volume of junk as changing that fundamental dynamic much. Curation matters more than ever. That said... I don't use Facebook, and I'm pretty careful at filtering or curating my other social media habits. Is Facebook still flooded with Shrimp Jesus or was that a 2024 thing? I heard fake videos of cute animals getting rescued are the latest trend. It's quite possible the slop problem is a growing tidal wave that I'm innocently unaware of. I nearly skipped writing about the environmental impact of AI for this year's post (here's what I wrote in 2024) because I wasn't sure if we had learned anything new this year - AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable. What's interesting in 2025 is that public opinion appears to be shifting quite dramatically against new data center construction.
Here's a Guardian headline from December 8th: More than 200 environmental groups demand halt to new US datacenters . Opposition at the local level appears to be rising sharply across the board too. I've been convinced by Andy Masley that the water usage issue is mostly overblown, which is a problem mainly because it acts as a distraction from the very real issues around energy consumption, carbon emissions and noise pollution. AI labs continue to find new efficiencies to help serve increased quality of models using less energy per token, but the impact of that is classic Jevons paradox - as tokens get cheaper we find more intense ways to use them, like spending $200/month on millions of tokens to run coding agents. As an obsessive collector of neologisms, here are my own favourites from 2025. You can see a longer list in my definitions tag . If you've made it this far, I hope you've found this useful! You can subscribe to my blog in a feed reader or via email , or follow me on Bluesky or Mastodon or Twitter . If you'd like a review like this on a monthly basis instead I also operate a $10/month sponsors only newsletter with a round-up of the key developments in the LLM space over the past 30 days. Here are preview editions for September , October , and November - I'll be sending December's out some time tomorrow. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . 
The year of "reasoning" The year of agents The year of coding agents and Claude Code The year of LLMs on the command-line The year of YOLO and the Normalization of Deviance The year of $200/month subscriptions The year of top-ranked Chinese open weight models The year of long tasks The year of prompt-driven image editing The year models won gold in academic competitions The year that Llama lost its way The year that OpenAI lost their lead The year of Gemini The year of pelicans riding bicycles The year I built 110 tools The year of the snitch! The year of vibe coding The (only?) year of MCP The year of alarmingly AI-enabled browsers The year of the lethal trifecta The year of programming on my phone The year of conformance suites The year local models got good, but cloud models got even better The year of slop The year that data centers got extremely unpopular My own words of the year That's a wrap for 2025 Claude Code Mistral Vibe Alibaba Qwen (Qwen3) Moonshot AI (Kimi K2) Z.ai (GLM-4.5/4.6/4.7) MiniMax (M2) MetaStone AI (XBai o4) Here’s how I use LLMs to help me write code Adding AI-generated descriptions to my tools collection Building a tool to copy-paste share terminal sessions using Claude Code for web Useful patterns for building HTML tools - my favourite post of the bunch. blackened-cauliflower-and-turkish-style-stew is ridiculous. It's a custom cooking timer app for anyone who needs to prepare Green Chef's Blackened Cauliflower and Turkish-style Spiced Chickpea Stew recipes at the same time. Here's more about that one . is-it-a-bird takes inspiration from xkcd 1425 , loads a 150MB CLIP model via Transformers.js and uses it to say if an image or webcam feed is a bird or not. bluesky-thread lets me view any thread on Bluesky with a "most recent first" option to make it easier to follow new posts as they arrive. 
Not all AI-assisted programming is vibe coding (but vibe coding rocks) in March Two publishers and three authors fail to understand what “vibe coding” means in May (one book subsequently changed its title to the much better "Beyond Vibe Coding"). Vibe engineering in October, where I tried to suggest an alternative term for what happens when professional engineers use AI assistance to build production-grade software. Your job is to deliver code you have proven to work in December, about how professional software development is about code that demonstrably works, no matter how you built it. Vibe coding, obviously. Vibe engineering - I'm still on the fence about whether I should try to make this happen! The lethal trifecta, my one attempted coinage of the year that seems to have taken root. Context rot, by Workaccount2 on Hacker News, for the thing where model output quality falls as the context grows longer during a session. Context engineering as an alternative to prompt engineering that helps emphasize how important it is to design the context you feed to your model. Slopsquatting by Seth Larson, where an LLM hallucinates an incorrect package name which is then maliciously registered to deliver malware. Vibe scraping - another of mine that didn't really go anywhere, for scraping projects implemented by coding agents driven by prompts. Asynchronous coding agent for Claude for web / Codex cloud / Google Jules Extractive contributions by Nadia Eghbal for open source contributions where "the marginal cost of reviewing and merging that contribution is greater than the marginal benefit to the project’s producers".

Abhinav Sarkar 2 months ago

Polls I Ran on Mastodon in 2025

In 2025, I ran ten polls on Mastodon exploring various topics, mostly to outsource my research to the hivemind. Here are the poll results organized by topic, with commentary. How do you pronounce JSON? January 15, 2025 I’m in the “Jay-Son, O as in Otter” camp, which is the majority response. It seems like most Americans prefer the “Jay-Son, O as in Utter” option. Thankfully, only one person in the whole world says “Jay-Ess-On”. If someone were to write a new compiler book today, what would you prefer the backend to emit? October 31, 2025 LLVM wins this poll hands down. It is interesting to see WASM beating other targets. Which is your favourite Haskell parsing library? November 3, 2025 I didn’t expect Attoparsec to go toe-to-toe with Megaparsec . I did some digging, and it seems like Megaparsec is the clear winner when it comes to parsing programming languages in Haskell. However, for parsing file formats and network protocols, Attoparsec is the most popular one. I think that’s wise, and I’m inclined to make the same choice. If you were to write a compiler in Haskell, would you use a lens library to transform the data structures? July 11, 2025 This one has mixed results. Personally, I’d like to use a minimal lens library if I’m writing a compiler in Haskell. What do you think is the right length of programming related blog posts (containing code) in terms of reading time? May 18, 2025 As a writer of programming related blog posts, this poll was very informative for me. 10 minute long posts seem to be the most popular option, but my own posts are a bit longer, usually between 15–20 minutes. Do you print blog posts or save them as PDFs for offline reading? March 8, 2025 Most people do not seem to care about saving or printing blog posts. But I went ahead and added (decent) printing support for my blog posts anyway. If you have a personal website and you do not work in academia, do you have your résumé or CV on your website? 
August 30, 2025 I don’t have a public résumé on my website either. I’d like to, but I don’t think anyone visiting my website would read it. Would people be interested in a series of blog posts where I implement the C compiler from “Writing a C Compiler” book by Nora Sandler in Haskell? November 11, 2025 Well, 84% of people voted “Yes”, so this is (most certainly) happening in 2026! If I were to release a service to run on servers, how would you prefer I package it? December 30, 2025 Well, people surely love their Docker images. Surprisingly, many are okay with just source code and build instructions. Statically linked executables are more popular now, probably because of the ease of deployment. Many also commented that they’d prefer OS-specific packages like deb or rpm. However, my personal preference is a Nix package and NixOS module. If you run services on Hetzner, do you keep a backup of your data entirely off Hetzner? August 9, 2025 It is definitely wise to have an offsite backup. I’m still figuring out the backup strategy for my VPS. That’s all for this year. Let’s see what polls I come up with in 2026. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! This post was originally published on abhinavsarkar.net. If you liked this post, please leave a comment. General Programming JSON Pronunciation Compilers Compiler Backend Targets Haskell Parsing Libraries Compiler in Haskell with Lenses Blogging & Web Blog Post Length Preferences Blog Post Print Support Résumés on Personal Website “Writing a C Compiler” Blog Series Self-hosting Service Packaging Preferences Hetzner Backup Strategy

Alex White's Blog 2 months ago

Constraints Breed Innovation

I've mentioned a few times on my blog about daily driving a Palm Pilot. I've been using either my Tungsten C or T3 for the past 2 months. These devices have taken the place of my smartphone in my pocket. They hold my agenda, tasks, blog post drafts, databases of my media collection and child's sleep schedule and lots more. Massive amounts of data, in kilobytes of size. Simply put, it's been a joy to use these machines, more so than my smartphone ever has been. I've been thinking about the why behind my love of Palm Pilots. Is it simply nostalgia for my childhood? Or maybe an overpowering disdain for modern tech? Yes to both of these, but it's also something more. I genuinely believe the software on Palm is BETTER than most of what you'll find on Android or iOS. The operating system itself, the database software ( HanDBase ) I use to track my child's bed times, the outline tool I plan projects with ( ShadowPlan ), the program I'm writing this post on ( CardTXT ) and the solitaire game I kill time with ( Acid FreeCell ), they all feel special. Each app does an absolutely excellent job, only takes up kilobytes of storage, opens instantly, doesn't require internet or a subscription fee (everything was pay once). But I think there's an additional, underpinning reason these pieces of software are so great: constraint. The device I'm using right now, the Palm Pilot Tungsten T3, has a 400MHz processor, 64MiB of RAM and a 480x320 pixel screen. That's all you have to work with! You can't count on network connectivity (this device doesn't have WiFi). You have to hyper optimize for file size and performance. Each pixel needs to serve a purpose (there's only 153,600 of them!). When your hands are tied behind your back, you get creative and focused. Constraint truly is the breeder of innovation, and something we've lost. A modern smartphone is immensely powerful, constantly online, capable of multitasking and has a high resolution screen.
Building a smartphone app means anything goes. Optimizations aren't as necessary, space isn't a concern, screen real estate is abundant. Now don't get me wrong, there's definitely a balance between too much performance and too little. There's a reason I'm not writing this on an Apple Newton (well, the cost of buying one). But on the other hand, look at the Panic Playdate. It has a 168MHz processor, 16 MiB RAM and a 400x240 1-bit black & white screen, yet there are some beautiful, innovative games hitting the console. Developers have to optimize every line of C code for performance, and keep an eye on file size, just like the Palm Pilot. I've experienced the power of constraint myself as a developer. My most successful projects have been ones where I limited myself from using libraries, and instead focused on plain PHP + MySQL. With a framework project and composer behind you, you can implement every feature that crosses your mind, heck it's just one "composer require" away! But when you have to dedicate real time to writing each feature, you tend to hyper focus on what adds value to your software. I think this is what powers great Palm software. You don't have the performance or memory to add bloat. You don't have the screen real estate to build some complicated, fancy UI. You don't have the network connectivity to rely on offloading to a server. You need to make a program that launches instantly, does its job well enough to sell licenses and works great even in black & white. That's a tall order, and a lot of developers knocked it out of the park. All this has got me thinking about what a modern, constrained PDA would look like. Something akin to the Playdate, but for the productivity side of the house. Imagine a Palm Pilot with a keyboard, USB C, the T3 screen size, maybe a color e-ink display, expandable storage, headphone jack, Bluetooth (for file transfer), infrared (I REALLY like IR) and a microphone (for voice memos).
Add an OS similar to Palm OS 5, or a slightly improved version of it. Keep the CPU, RAM, and storage all constrained (within reason). That would be a sweet device, and I'd love to see what people would do with it. I plan to start doing reviews on some of my favorite Palm Pilot software, especially the tools that help me plan and write this blog, so be on the lookout!

Ginger Bill 2 months ago

context—Odin's Most Misunderstood Feature

Even with the documentation on the topic, many people completely misunderstand what the context system is for, and what problem it actually solves. For those not familiar with Odin, in each scope there is an implicit value named context. This context variable is local to each scope and is implicitly passed by pointer to any procedure call in that scope (if the procedure has the Odin calling convention). The main purpose of the implicit context system is the ability to intercept third-party code and libraries and modify their functionality. One such case is modifying how a library allocates something or logs something. In C, this was usually achieved with the library defining macros which could be overridden so that the user could define what they wanted. However, not many libraries support this, in any language, by default, which means that intercepting third-party code to see what it does and to change how it does it is generally not possible. The context value has default values for its fields, which are decided in the package runtime. These defaults are compiler specific. To see what the implicit context value contains, please see the definition of the Context struct in package runtime . Fundamentally, the entire point of the context system is to intercept third-party code, and to change how it does things. By third-party, I just mean code not written by yourself, or code that you cannot easily modify (which could even be your own past self’s code). I expect most people to 100% ignore the context because its existence is not for whatever preconceived reason they think it is for, be that minimizing typing/passing things around, or dynamic scoping, etc. It’s just for interception of third-party code. Ironically, the context works because people misunderstand it, and thus generally leave it alone. That allows those who do understand it to work around less-than-ideal APIs.
I understand a lot of people may not understand why it exists when they might not currently need it, but it’s fundamentally a solution to a specific problem which cannot really be solved in another way. A common misunderstanding usually arises when it is necessary to interact with third-party code and write callbacks which do not use the Odin calling convention. There is a general misunderstanding that because some procedure may not appear to use the context directly (or at least not obviously do so), people will say that it should be marked as contextless or “c”; however, this misses the entire point. Because the default calling convention of Odin has this implicit context, you don’t actually know if the code needs it or not (which is by design). The first common example of this interaction complaint usually happens when using Odin’s printing procedures in such callbacks. For most people, they just write context = runtime.default_context() and then continue, but I have had a lot of people “complain” as to why that is even necessary, arguing: why would you need the context just to print? This complaint is due to not understanding what fmt.println et al actually do. fmt.println is a wrapper around a more general printing procedure which takes in a generic writer. Since other libraries can utilize these procedures, it might be necessary to intercept/override/track this behaviour, and from that, there is little choice but to require the context. A good API offers a way to specify the allocator to use. An API that doesn’t offer it can be worked around by overriding the context.allocator for that call or series of calls, in the knowledge that the other programmer didn’t hardcode the allocator. There are two allocators on the context: allocator and temp_allocator. I expect most people to never use custom allocators whatsoever (which is empirically true), but I do also want to encourage things like using the temp_allocator because it allows for many useful benefits, especially those that most people don’t even realize are a thing.
For many people, they usually just want to do nothing with the context (assuming they know about it), or set the allocator and be done; that’s pretty much it. You could argue that it is “better” to pass allocators 1 around explicitly, but from my own experience in C with this exact interface (made and used well before I even made Odin), I found that I got into a very, very lazy habit of not actually passing around allocators properly. This overly explicit, generalized interface led to more allocation bugs than if I had used specific allocators on a per-system basis. When explicit allocators are wanted, you rarely want the generic interface anyway, and usually want a specific allocator instead, e.g. an arena. As I have previously expressed in my Memory Allocation Strategies series, an allocator can be used to represent a specific set of lifetimes for a set of allocations—arenas being the most common kind, but other allocators such as pools, basic free lists, etc. may be useful. However, because most people will still default to a traditional malloc/free style of dynamic memory allocation, having a generic interface which can be overridden/intercepted/tracked is extremely useful, especially in third-party libraries/code. n.b. Odin’s context.allocator defaults to a heap-like allocator on most platforms, and context.temp_allocator defaults to a growing arena-like allocator. The assertion_failure_proc field exists because you might honestly want a different way of asserting, with more information like stack traces. You might even want to use it as a mechanism to do a rudimentary sort of exception handling (similar to Go’s panic & recover). Having this overridable is extremely useful, again with third-party code. I understand it does default to something when it is not set, but that’s still kind of the point. It does need to assert/panic, which means it cannot just do nothing. Logging is common throughout most applications and we wanted to provide a default approach.
I expect most people to default to this as they want a simple unified logging experience. Most people don’t want their logs to be handled by different libraries in numerous different ways BY DEFAULT . But because the logger is on the context, the default logging behaviour can now be overridden easily. If you need something more than this logger interface, then use what you want. The point, as I keep trying to reiterate, is: what is the default, and what will third-party libraries default to, so that you can then intercept it if necessary? The random_generator is the newest addition to the context; part of the reason for it being here is probably less than obvious. Sometimes a third-party library will do (pseudo-)random number generation, but controlling how it does that is very hard (if not impossible). Take C’s rand for example. If you know the library is using rand, you can at least set a seed with srand if you want a deterministic controlled output. However, I have used libraries in C which use a different random number generator (thank goodness, because rand is dreadful), but I had no way of overriding it without modifying the source code directly (which is not always possible if it’s a precompiled LIB/DLL). The counter is when you want a cryptographic-grade random number generator and you do not want any determinacy whatsoever. Having the random generator be on the context allows for all of this kind of interception. n.b. Odin’s default random generator is based on ChaCha8 and is heavily optimized with SIMD. If you have used C, you’ve probably experienced callbacks where there is no way to pass a custom user data pointer as part of them. The API designer has assumed that the callback is “pure”. However, in reality this is rarely the case, so how do you actually pass user data to a callback (which is immediately used, not delayed)? The most obvious example of this is qsort, and even in the “pure” case, it is common to sort a key based on an external table. There are two approaches that some people use in languages without closures (e.g.
C and Odin) to get around these issues, but neither of which is great: global variables and/or thread-local variables. Honestly, those are just dreadful solutions to the problem, and that is why the context has the user_ptr and user_index fields: to allow you to intercept this kind of poorly thought out API. Now you might be asking: why both a pointer and an index? Why not just a pointer? From my experience of programming over many years, it is very common that I just want to pass the same value to many callbacks but access a different “element” inside of the data passed. Instead of creating a wrapper struct which has both this pointer and the index, I wanted to solve it by just having both already. It’s an empirically derived solution, not anything from “first principles”. I do recommend that an API should be designed to minimize the need for callbacks in the first place, but when necessary, to at least have a user data parameter for callbacks. For when people do not design good APIs, the context is there to get around the crap. There is also a field that exists just for internal use within the core library, which no one should ever use for any reason . Most of the time, it exists just for temporary things which will be improved in the future, or for passing things down the stack in a bodgy way. Again, this is not for the programmer; it’s for the compiler/core library developers only. As I said in the user_ptr and user_index section, a lot of the impetus for making the context comes from the experience of using numerous C libraries. The GOOD C libraries that allow for a form of interception usually do it through a macro; at best, they only do malloc and free style things, and sometimes a bit more. However, those are not the norm, and they are usually only written by people in a similar “sphere of influence” to myself. Sadly, the average C library doesn’t allow for this. Even so, with the GOOD C libraries, this macro approach fails across a LIB/DLL boundary, which is part of the problem when interfacing with a foreign language (e.g. Odin). So even in the GOOD case for C, it’s not that good in practice.
Now some library writers are REALLY GOOD and they provide things like an allocation interface, but I probably know all of these library writers personally at this point, so I’d be preaching to the choir with my complaints. I’ve honestly had a few people effectively tell me that if it’s a bad API then the user should put up with it: “Its API is bad? Oh well, tough luck”. However, I’ve had a lot of the same people then ask “but why does the language need to solve that? Isn’t it a library problem?”. I’m sorry, but telling someone the API is at fault doesn’t help them in the slightest, and if an API/library cannot be easily modified, then how can that be fixed in code? It’s fundamentally only fixable at the language level. People rarely write things perfectly the first time—code evolves. That’s what engineering is all about. Requirements change. The people change. The problem changes entirely. Expecting to never need to intercept third-party code is pie-in-the-sky thinking. As I’ve said numerous times before, third-party just means “stuff not written by you”; that’s it. As I stress, it could even be your past self, which is not the same as your present self. Pointing out that shitty APIs exist is the entire point. Just saying “tough luck” doesn’t solve anything; you’re adding to the problem. This is why the context exists. One important aspect of the context is that its memory layout is not user-modifiable, and this is another big design choice. It allows for a consistent and well-understood ABI, which means you can—you guessed it—intercept third-party code even across LIB/DLL boundaries. If the user were allowed to add as many custom fields to the context as desired, it would not be ABI-consistent, and thus not stable for the use of its interception abilities across LIB/DLL boundaries. At best, allowing for custom fields would just let you minimize passing/typing parameters to procedures. Typing is rarely—if ever—the bottleneck in programming.
Another common question I’ve gotten a few times is why the context is passed as an implicit pointer argument to a procedure, and not something like a thread-local variable stack? The rationale being that there would then not need to be a calling convention difference for the context. Unfortunately, through a lot of experimentation and thought, there are a few reasons why it is implemented the way it is:

- Easier to manage across LIB/DLL boundaries than trying to use a single thread-local stack
- Easier management of recovery from crashes, where the context might be hard to figure out
- Using the existing stack makes stack management easier already; you don’t need a separate allocator for a context stack
- Some platforms do not support thread-local variables (e.g. freestanding targets)
- Works better with async/fiber based things, which would otherwise require a fiber-local stack instead of a thread-local one
- Prevents back-propagation, which would be trivial with a global/thread-local stack

Odin’s context also has copy-on-write semantics. This is done for two reasons: to keep things local, and to prevent back-propagation of “bad” data from a third-party library (be it malicious or just buggy). Not having an easily accessible stack of context values makes it harder for this back-propagation to happen. The main inspiration for the implicit context system does come from Jonathan Blow’s language; however, I believe the reasoning for its existence in Jonathan Blow’s language is very different. As I have never used Jon’s language, I am only going on what other people have told me and what I have seen from Jon’s initial streams. From what I can tell, Jon’s language’s context behaves quite differently to Odin’s, since it allows for the ability to add custom fields to it and to back-propagate. I am not sure what Jon’s initial rationale was for his form of context, but I do not believe Jon was thinking of third-party code interception when he designed/implemented his context. I hypothesize it was something closer to a form of “static dynamic-scoping”, but not exactly (I know that’s an oxymoron of a statement). All I know is that when I saw it, I saw a brilliant solution to the third-party code interception problem. I hope this clarifies a lot of the design rationale behind the implicit context system, and why it exists. If you have any more questions, or want me to clarify something further, please feel free to contact me!

If you want to disallow context-like defaults in Odin on a per-file basis, you can do so with a file tag.  ↩︎

Corrode 2 months ago

Rust for Linux

Bringing Rust into the Linux kernel is one of the most ambitious modernization efforts in open source history. The Linux kernel, with its decades of C code and deeply ingrained development practices, is now opening its doors to a memory-safe language. It’s the first time in over 30 years that a new programming language has been officially adopted for kernel development. But the journey is far from straightforward. In this episode, we speak with Danilo Krummrich, Linux kernel maintainer and Rust for Linux core team member, about the groundbreaking work of integrating Rust into the Linux kernel. Among other things, we talk about the Nova GPU driver, a Rust-based successor to Nouveau for NVIDIA graphics cards, and discuss the technical challenges and cultural shifts required for large-scale Rust adoption in the kernel, as well as the future of the Rust4Linux project. CodeCrafters helps you become proficient in Rust by building real-world, production-grade projects. Learn hands-on by creating your own shell, HTTP server, Redis, Kafka, Git, SQLite, or DNS service from scratch. Start for free today and enjoy 40% off any paid plan by using this link . Rust for Linux is a project aimed at bringing the Rust programming language into the Linux kernel. Started to improve memory safety and reduce vulnerabilities in kernel code, the project has been gradually building the infrastructure, abstractions, and tooling necessary for Rust to coexist with the kernel’s existing C codebase. Danilo Krummrich is a software engineer at Red Hat and a core contributor to the Rust for Linux project. In January 2025, he was officially added as a reviewer to the RUST entry in the kernel’s MAINTAINERS file, recognizing his expertise in developing Rust abstractions and APIs for kernel development. Danilo maintains two of the project’s kernel branches and is the primary developer of the Nova GPU driver, a fully Rust-based driver for modern NVIDIA GPUs.
He is also a maintainer of RUST [ALLOC] and several DRM-related kernel subsystems. AOSP - The Android Open Source Project Kernel Mailing Lists - Where the Linux development happens Miguel Ojeda - Rust4Linux maintainer Wedson Almeida Filho - Retired Rust4Linux maintainer nouveau driver - The old driver for NVIDIA GPUs Vulkan - A low level graphics API Mesa - Vulkan and OpenGL implementation for Linux vtable - Indirect function call, a source of headaches in nouveau DRM - Direct Rendering Manager, Linux subsystem for all things graphics Monolithic Kernel - Linux’ kernel architecture The Typestate Pattern in Rust - A very nice way to model state machines in Rust pinned-init - The userspace crate for pin-init rustfmt - Free up space in your brain by not thinking about formatting kunit - Unit testing framework for the kernel Rust core crate - The only part of the Rust Standard Library used in the Linux kernel Alexandre Courbot - NVIDIA employed co-maintainer of nova-core Greg Kroah-Hartman - Linux Foundation fellow and major Linux contributor Dave Airlie - Maintainer of the DRM tree vim - not even neovim mutt - classic terminal e-mail client aerc - a pretty good terminal e-mail client Rust4Linux Zulip - The best entry point for the Rust4Linux community Rust for Linux GitHub Danilo Krummrich on GitHub Danilo Krummrich on LinkedIn

Alex White's Blog 2 months ago

Writing a Blog Post on a Palm Pilot

When I was a kid, I was obsessed with Palm Pilot devices. Computers that you can carry in your pocket, I mean how cool is that! My parents got me an m105, and later on a Tungsten E2. I took those devices everywhere! On long car trips, you'd find me in the back of the car playing The Quest or Kyle's Quest. When I was done gaming, I'd fold out my IR keyboard and start programming in C on a fully on-device IDE. I remember writing a stock market "simulation" game on a trip to Virginia, and making my first couple of dollars selling software on PalmDB. Then smartphones came, and my Palm went into storage to be replaced by a first generation iPhone. The funny thing about the iPhone though, it never fully replaced the functionality of the Palm Pilot. Sure, it had a beautiful screen that you could touch with a finger (as opposed to a stylus), a camera, WiFi and calling, but it was (and still is) a lesser experience in some ways than the Palm. The built-in PIM apps on Palm (Contacts, Calendar, Tasks) are still some of my favorites for productivity. Simple, to the point, predictable in user interface. Similarly, the launcher ("Applications") is fast, easy to use and offers just enough customization. The real magic on Palm though comes from third-party applications. The Palm Pilot is from an era where boutique software developers crafted applications to provide value, rather than stick you with another subscription fee. Shareware on the Palm was a pay once ordeal, and for that you typically got exactly what you expected. No ads, tracking, internet requirement or bloat. Heck, the device I'm writing this on doesn't have WiFi and only comes with 64MB of onboard storage. Despite the limitations of these little devices, you had full word processors, spreadsheet and slideshow software, video/music players....and ENTIRE ONBOARD IDEs (integrated development environments).
I can't overstate how big having IDEs on-device is; you can go from 0 to compiled application entirely on a Palm. There are applications to write code, build user interfaces, create icons and compile into binary. iOS definitely can't do this, and I'm fairly certain Android can't either. I honestly don't know if I would have gotten so into programming if not for my Palm Pilot as a kid. I never got to write Palm apps on a computer as we were an Apple family and the Palm development suite was Windows only, but with the power of an infrared keyboard and long car trips, I cranked out games and software. All that backstory brings us to today. Here I am, sitting in a coffee shop listening to my iPod and typing on a keyboard with a little arm beaming keycodes into a Palm Tungsten T3. The keyboard is the same one that went on car trips with me 20 some years ago. One of the advantages of tech becoming retro is I can now afford the devices I drooled over in Staples all those years ago, hence the Tungsten T3 (and the Tungsten C on its way to me via USPS). My Tungsten E2 still works and sits at home, although the bottom 10 rows of pixels are dead. Here's the thing, this Palm feels great to use even today. It's fast and task oriented. There's no distractions and things just work. Sure I had to solder back on a battery terminal for the keyboard, but that's not bad for 20 years in storage. I plan to spend some time getting my dev environment set up again. When my new TC arrives, I'll be using that as a daily calendar, notes, expense tracking and tasks device (extremely excited about the builtin keyboard). The T3 will remain my gaming, development and writing device thanks to the amazing sliding screen. I read an article months ago that, until recently, IMAX still used Palm Pilots to run their projectors. You give these little devices a task, and they do it, even for decades. You can't say that of newer tech.
Updates will break compatibility, apps will be bought out by shady companies that sunset them or stuff them with ads, hardware fails and can't be repaired, etc. A Palm Pilot on the other hand is always one HotSync away from what you need it to do. Now that you've listened to me ramble about my love of these little devices, you might ask yourself "that's cool, but kinda silly, why should I care?". Well reader, here are a few quick hitters on why purchasing a $30 Palm Pilot could be a good idea in 2025(6):

- Focused writing device that doesn't cost $600+
- Great selection of classic games (one of the best solitaire games I've played on any device)
- Offline life organizer for tasks, contacts and calendars
- Focused document reader (again, one that doesn't cost hundreds of dollars)
- Control TVs or stereos with the IR blaster
- Offline music player (although I DO prefer an iPod with Rockbox for this)
- Voice memos (this is a seriously awesome feature, my T3 has a dedicated button that you hold down to instantly record a memo. I don't think there's a faster way to record thoughts these days. You could even HotSync them, then run some speech to text analysis.....maybe a future article?)

Palm Pilots are another entry in the "forgotten, but not useless" tech category. If you do find yourself going down the Palm rabbit hole, I highly recommend the articles from kelbot (you'll need a Gemini browser to view, see my post on Gemini for recommendations). Have a favorite Palm OS application or memory you want to share? Send me an email and let's geek out over Palm Pilots!

Pinaraf's website 2 months ago

JIT, episode III: warp speed ahead

In our first JIT episode, we discussed how we could, using copy-patch, easily create a JIT compiler for PostgreSQL, with a slight improvement in performance compared to the PostgreSQL interpreter. In our second episode, I talked about the performance wall and how hard it was to have a real leap in performance compared to the interpreter. But it ended with a positive outlook, a nice performance jump that I was preparing at that moment… The interpreter will run each opcode for every record it has to process. Everything it has to do for each record that could be done only once is better done once, obviously. And this is where a JIT can beat it. The JIT compiler can choose optimizations that would require checks at each opcode in the interpreter, and would thus be self-defeating there. For instance, I mentioned creating inlined opcodes for common function calls like int4eq: replacing the indirect call to int4eq with a comparison of the function pointer and then an inlined version would indeed be silly, since the comparison is going to waste a lot of time already. So, what can’t the interpreter do? It sure can’t easily remove indirect calls, but this is a 1% performance gain, 2% at most. You won’t get to the headlines with that, right? Well, when in doubt, look at the past… A decade ago, I worked at a small company where I heard the weirdest thing ever regarding system performance: “our application is slower when built in 64 bits mode because the bigger pointer size makes it slower”. I didn’t buy this, spent two days digging into the code, and found that it was the opposite: 64 bits brought such a performance improvement that the entire system collapsed on a mutex that held a core structure in the application… Removing the mutex made the application fly in both 32 and 64 bits, with 64 bits beating 32 bits obviously. But why is 64 bits faster? We are talking database here, so let’s have a look at a table, shall we?
http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0870F10_K17_Matisse7_InstLatX64.txt (I know uBlock doesn’t like this domain, but this text document there is good, I promise) On my CPU, loading a 64 bits value in a register requires twice the time it takes to load a 32 bits value. So sure, 64 bits must be slower than 32 bits! Except the switch from 32 to 64 bits also fixed one of the biggest issues with x86: the lack of registers. x86 never improved from its 16 bits roots and had 8 general purpose registers, little compared to PowerPC (32), Sparc (31), or ARM (15). When AMD introduced 64 bits in the x86 world, they doubled the number of registers, from a ridiculous 8 to an acceptable 16. And from this came a huge performance boost. Memory = slow. Registers = fast. Ok, more seriously, I will not start writing about this. Even if it is getting old and outdated, the “What Every Programmer Should Know About Memory” paper by Ulrich Drepper is still a great read if you’re interested in that topic. The only thing that matters for us is that, even with a lot of cache, writing to memory is slower than writing to a register. If I look at some measurements for my Zen2 CPU, a comparison between two registers takes less than a cycle (0.33c it seems), but if data has to be loaded from L1 cache you can add 4 cycles, 12 cycles from L2 and 38 from L3. Way, way slower: 12 to 115 times slower. Registers are used automatically by your compiler. When you write a function, it will automatically figure out which variable to move to a register, and when; and if you don’t have enough registers for your entire function, it will spill registers on the stack as needed. If you are interested in this, there are many fun register allocation algorithms and many Wikipedia pages covering this topic. Let’s look at one of the most basic opcodes, EEOP_SCAN_VAR, taking a value from a scan slot in order to use it later. This is indeed a memory write. Could the interpreter get rid of this?
Well, I think it could, but it would be a major undertaking. If we had a variable, stored in a register by the compiler, we could store there, sure, and a next step could fetch from that place, but what if the next step needs another value instead… Then we would have to spill the value back to memory, and checking for this at each step is going to kill performance. It may be possible to rewrite the entire interpreter to match a register-based VM, but I cannot be sure it would be worth it. And this is the path to beating the interpreter. We can check many things before running the opcodes, trace memory accesses and use registers as much as possible. The great benefit of copy-patch is that you (almost) don’t write assembly code. Porting it to arm64 required me to learn about ARM64 specific relocations and how to encode immediate values in some ARM opcodes, but nothing more. But a big downside is that you don’t write assembly code. And, well, if you don’t write the assembly code, you don’t control register allocation. But there is a simple way around this; let’s speak a bit about calling conventions. When function A is called, how do you pass parameters to it? If you learned some x86 assembly at school, you will answer “on the stack” and win a free ticket for an assembly refresher course. When AMD64 was introduced, the SysV calling convention took over and completely changed the way functions are called: the first six integer or pointer parameters are passed through general purpose registers, and floating point parameters are passed through FP registers. Each opcode is defined as a function with three parameters (matching the function signature expected by PostgreSQL). While respecting the SysV calling convention, this leaves us three registers that the compiler will keep across the opcode calls, and will spill automatically if needed.
An alternative would have been to use the preserve_none calling convention, but for the first version I did not need it (and I still have many calls to PostgreSQL functions that will use the SysV calling convention anyway).

But three registers means… two values only. Sadly we transitioned from 32 to 64 bits, not to 65 bits… 65 bits would have given us one bit to represent NULL/NOT NULL values: 0 would not have been NULL, and 1 + NULL would be NULL! But we will not rewrite history here; instead we are going to use one register as a set of null flags, one bit per value register (so we are wasting 62 bits there). Our opcode functions are thus going to have three new parameters: char nullFlags, intptr_t reg0, intptr_t reg1. Jumping to the next opcode will require passing these values around.

Great, we keep registers around; now what about using them? As a reminder, here are the opcodes we are dealing with for our previous “SELECT * FROM demo WHERE a = 42”. This code doesn’t use our registers. I rewrote every opcode implementation to use the registers instead of writing to memory. In this version, all memory accesses have been replaced with register accesses, hurray! But this will only work for a simple query like this one. Once we start having more variables to store, we will need a spilling mechanism, a way to swap registers…

Another issue appears when you call, for instance, a non-inlined function. The EEOP_FUNCEXPR opcode is defined as: Parameters are fed through the fcinfo_data structure. The other opcodes write directly into this structure during the usual interpreter execution. It means that we must check all memory accesses from the opcodes and make sure that any memory access an opcode implementation expects will not end up in a memory location we didn’t write to.
I started with a small experiment, a “variabilizer”, that would look at each opcode and figure out, through each memory access (read/write), all the variables used in a run, their lifetimes… It can even detect constants stored in memory (memory that is only read from, never written).

I then refactored a lot of the compiler code in the past weeks. I started by moving the specialized opcode definitions and dispatch into the stencil library only, removing any special cases I had in the compiler part. This required #defining a way for the C code in stencils.c to generate more C code in the built-stencils.h file through the stencil-builder.py script. Fun, but complicated and hairy stuff. After that, I started rewriting the stencil signatures and several opcodes to use registers instead, and wrote a “contract” for each opcode, defining what is expected in each register, what will be written in each register, and what is going to be read/written in memory. With all these changes, here is what the FUNCEXPR_STRICT opcode optimized for int4eq looks like. More metadata than actual code… But if that’s what it takes to get a good performance boost, then here we go.

After ingesting that, the compiler can fill the registers with the proper values when needed. Another big issue that I’m not covering here is that doing this requires some minimal control flow analysis. For my simple benchmark, this is not a problem, and the code is getting ready for a wider range of queries, but I did not want to cover this and preferred focusing on the registers work…

Well… This is the optimization I mentioned in the previous article. So, on our stupid benchmark, doing 10 runs of a simple SELECT * FROM demo WHERE a = 42 on a 10 million row table… As you can see, this is exactly what we expected: fewer instructions, sure, but that is not what gave us the performance boost here.
What changed is the number of cycles: the same instruction now uses a register instead of a memory access, and thus several cycles are saved per instruction. The LLVM JIT can achieve about the same run time here, but it takes some time to generate the bitcode (less than 1ms), then several ms to analyze it, optimize it, and finally translate it to machine code. And this makes the LLVM JIT slower here than copyjit, while copyjit still has some room for improvement (I’ve yet to look at tuple deforming).

See you in the next one; I think we already know what the topic will be… Well, after I finish porting every opcode to these new metadata, test more stuff, and likely figure out some more optimizations on the way…

PS: as said previously, help is welcome, the code is FOSS as usual, on GitHub, and I would gladly accept any sponsoring, mission, anything that could give me more time to work on this…

Michael Lynch 3 months ago

My First Impressions of MeshCore Off-Grid Messaging

When my wife saw me playing with my new encrypted radio, she asked what it was for. “Imagine,” I said, “if I could type a message on my phone and send it to you, and the message would appear on your phone. Instantly!” She wasn’t impressed. “It also works if phone lines are down due to a power outage… or societal collapse.” Still nothing. “If we’re not within radio range of each other, we can route our messages through a mesh network of our neighbors’ radios. But don’t worry! The radios encrypt our messages end-to-end, so nobody else can read what we’re saying.” By this point, she’d left the room. My wife has many wonderful qualities, but, if I’m being honest, “enthusiasm for encrypted off-grid messaging” has never been one of them.

The technology I was pitching to my wife was, of course, MeshCore. If you’d like to skip to the end, check out the summary.

MeshCore is software that runs on inexpensive long-range (LoRa) radios. LoRa radios transmit up to several miles, depending on how clear the path is. Unlike HAM radios, you don’t need a license to broadcast over LoRa frequencies in the US, so anyone can pick up a LoRa radio and start chatting. MeshCore is more than just sending messages over radio. The “mesh” in the name is because MeshCore users form a mesh network. If Alice wants to send a message to her friend Charlie, but Charlie’s out of range of her radio, she can route her message through Bob, another MeshCore user in her area, and Bob will forward the message to Charlie.

I’m not exactly a doomsday prepper, but I plan for realistic disaster scenarios like extended power outages, food shortages, and droughts. When I heard about MeshCore, I thought it would be neat to give some devices to friends nearby so we could communicate in an emergency.
And if it turned out that we were out of radio range of each other, maybe I could convince a few neighbors to get involved as well. We could form a messaging network that’s robust against power failures and phone outages.

MeshCore is a newer implementation of an idea that was popularized by a technology called Meshtastic. I first heard about Meshtastic from Tyler Cipriani’s 2022 blog post. I thought the idea sounded neat, but Tyler’s conclusion was that Meshtastic was too buggy and difficult for mainstream adoption at the time. I have no particular allegiance to MeshCore or Meshtastic, as I’d never tried either. Some people I follow on Mastodon have been excited about MeshCore, so I thought I’d check it out. Most MeshCore-compatible devices are also compatible with Meshtastic, so I can easily experiment with one and later try the other.

I have only a limited understanding of the differences between Meshtastic and MeshCore, but what I gather is that MeshCore’s key differentiator is preserving bandwidth. Apparently, Meshtastic hits scaling issues when many users are located close to each other. The Meshtastic protocol is chattier than MeshCore’s, so I’ve seen complaints that Meshtastic chatter floods the airwaves and interferes with message delivery. MeshCore attempts to solve that problem by minimizing network chatter.

I should say at this point that I’m not a radio guy. It seems like many people in the LoRa community are radio enthusiasts who have experience with HAM radios or other types of radio broadcasting. I’m a tech-savvy software developer, but I know nothing about radio communication. If I have an incorrect mental model of radio transmission, that’s why.

The MeshCore firmware runs on a couple dozen devices, but the official website recommends three devices in particular. The cheapest one is the Heltec v3. I bought two for $27/ea. At $27, the Heltec v3 is the cheapest MeshCore-compatible device I could find.
I connected the Heltec v3 to my computer via the USB-C port and used the MeshCore web flasher to flash the latest firmware. I selected “Heltec v3” as my device, “Companion Bluetooth” as the mode, and “v1.9.0” as the version. I clicked “Erase device” since this was a fresh install. Then, I used the MeshCore web app to pair the Heltec with my phone over Bluetooth. Okay, I’ve paired my phone with my MeshCore device, but… now what? The app doesn’t help me out much in terms of onboarding. I try clicking “Map” to see if there are any other MeshCore users nearby. Okay, that’s a map of New Zealand. I live in the US, so that’s a bit surprising. Even if I explore the map, I don’t see any MeshCore activity anywhere, so I don’t know what the map is supposed to do. The map of New Zealand reminded me that different countries use different radio frequencies for LoRa, and if the app defaults to New Zealand’s location, it’s probably defaulting to New Zealand broadcast frequencies as well. I went to settings and saw fields for “Radio Settings,” and I clicked them expecting a dropdown, but it expects me to enter a number. And then I noticed a subtle “Choose Preset” button, which listed presets for different countries that were “suggested by the community.” I had no idea what any of them meant, but who am I to argue with the community? I chose “USA/Canada (Recommended).” I also noticed that the settings let me change my device name, so that seemed useful: It seemed like there were no other MeshCore users within range of me, which I expected. That’s why I bought the second Heltec. I repeated the process with an old phone and my second Heltec v3, but they couldn’t see each other. I eventually realized that I’d forgotten to configure my second device for the US frequency. This is another reason I wish the MeshCore app took initial onboarding more seriously. Okay, they finally see each other! They can both publish messages to the public channel. 
My devices could finally talk to each other over a public channel. If I communicate with friends over MeshCore, I don’t want to broadcast our whole conversation over the public channel, so it was time to test out direct messaging. I expected some way to view a contact in the public channel and send them a direct message, but I couldn’t. Clicking their name did nothing. There’s a “Participants” view, but the only option is to block, not send a direct message. This seems like an odd design choice. If a MeshCore user posts to the public channel, why can’t I talk to them? I eventually figured out that I have to “Advert.” There are three options: “Zero Hop,” “Flood Routed,” and “To Clipboard.” I don’t know what any of these mean, but I figure “flood” sounds kind of rude, whereas “Zero Hop” sounds elegant, so I do a “Zero Hop.” Great! Device 2 now sees device 1. Let’s say hi to Device 1 from Device 2. Whoops, what’s wrong? Maybe I need to “Advert” from Device 2 as well? Okay, I do, and voila! Messages now work. This is a frustrating user experience. If I have to advert from both ends, why did MeshCore let me send a message on a half-completed handshake? I’m assuming “Advert” is me announcing my device’s public key, but I don’t understand why that’s an explicit step I have to do ahead of time. Why can’t MeshCore do that implicitly when I post to a public channel or attempt to send someone a direct message? Anyway, I can talk to myself in both public channels and DMs. Onward! The Heltec v3 boards were a good way to experiment with MeshCore, but they’re impractical for real-world scenarios. They require their own power source, and a phone to pair. I wanted to power it from my phone with a USB-C to USB-C cable, but the Heltec board wouldn’t power up from my phone. In a real emergency, that’s too many points of failure. The MeshCore website recommends two other MeshCore-compatible devices, so I ordered those: the Seeed SenseCAP T-1000e ($40) and the Lilygo T-Deck+ ($100). 
I bought the Seeed SenseCAP T-1000e (left) and the Lilygo T-Deck+ (right) to continue experimenting with MeshCore.

The T-1000e was a clear improvement over the Heltec v3. It’s self-contained and has its own battery and antenna, which feels simpler and more robust. It’s also nice and light. You could toss it into a backpack and not notice it’s there. The T-1000e feels like a more user-friendly product compared to the bare circuit board of the Heltec v3. Annoyingly, the T-1000e uses a custom USB cable, so I can’t charge it or flash it from my computer with one of my standard USB cables:

The Seeed T-1000e uses a custom USB cable for charging and flashing.

I used the web flasher for the Heltec, but I decided to try flashing the T-1000e directly from source: I use Nix, and the repo conveniently has a , so the dependencies installed automatically with . I then flashed the firmware for the T-1000e like this:

From there, I paired the T-1000e with my phone, and it was basically the same as using the Heltec. The only difference was that the T-1000e has no screen, so it defaults to the Bluetooth pairing password of . Does that mean anyone within Bluetooth range can trivially take over my T-1000e and read all my messages? It also seemed impossible to turn off the T-1000e, which is undesirable for a broadcasting device. The manufacturer advises users to just leave it unplugged for several days until the battery runs out. Update: MeshCore contributor Frieder Schrempf just fixed this in commit 07e7e2d, which is included in the v1.11.0 MeshCore firmware. You can now power off the device by holding down the button at the top of the T-1000e.

Now it was time to test the Lilygo T-Deck. This was the part of MeshCore I’d been most excited about since the very beginning. If I handed my non-techy friends a device like the T-1000e, there were too many things that could go wrong in an actual emergency. “Oh, you don’t have the MeshCore app?
Oh, you’re having trouble pairing it with your phone? Oh, your phone battery is dead?” The T-Deck looked like a 2000s era Blackberry. It seemed dead-simple to use because it was an all-in-one device: no phone pairing step or app to download. I wanted to buy a bunch, and hand them out to my friends. If society collapsed and our city fell into chaos, we’d still be able to chat on our doomsday hacker Blackberries like it was 2005. As soon as I turned on my T-Deck, my berry was burst. This was not a Blackberry at all. As a reminder, this is what a Blackberry looked like in 2003: A Blackberry smartphone in 2003 Before I even get to the T-Deck software experience, the hardware itself is so big and clunky. We can’t match the quality of a hardware product that we produced 22 years ago ? Right off the bat, the T-Deck was a pain to use. You navigate the UI by clicking a flimsy little thumbwheel in the center of the device, but it’s temperamental and ignores half of my scrolls. Good news: there’s a touchscreen. But the touchscreen misses half my taps: There are three ways to “click” a UI element. You can click the trackball, push the “Enter” key, or tap the screen. Which one does a particular UI element expect? You just have to try all three to find out! I had a hard time even finding instructions for how to reflash the T-Deck+. I found this long Jeff Geerling video where he expresses frustration with how long it took him to find reflashing instructions… and then he never explains how he did it! This is what worked for me: Confusingly, there’s no indication that the device is in DFU mode. I guess the fact that the screen doesn’t load is sort of an indication. On my system, I also see logs indicating a connection. Once I figured out how to navigate the T-Deck, I tried messaging, and the experience remained baffling. For example, guess what screen I’m on here: What does this screen do? 
If you guessed “chat on Public channel,” you’re a better guesser than I am, because the screen looks like nothing to me. Even when it displays chat messages, it only vaguely looks like a chat interface: Oh, it’s a chat UI. I encountered lots of other instances of confusing UX, but it’s too tedious to recount them all here. The tragic upshot for me is that this is not a device I’d rely on in an emergency. There are so many gotchas and dead-ends in the UX that would trip people up and prevent them from communicating with me. Even though the T-Deck broke my heart, I still hoped to use MeshCore with a different device. I needed to see how these devices worked in the real world rather than a few inches away from each other on my desk. First, I took my T-1000e to a friend’s house about a mile away and tried messaging the Heltec back in my home office. The transmission failed, as it seemed the two devices couldn’t see each other at all from that distance. Okay, fair enough. I’m in a suburban neighborhood, and there are lots of houses, trees, and cars between my house and my friend’s place. The next time I was riding in a car away from my house, I took along my T-1000e and tried messaging the Heltec v3 in my office. One block away: messages succeeded. Three blocks away: still working. Five blocks away: failure. And then I was never able to reach my home device until returning home later that day. Maybe the issue is the Heltec? I keep trying to leave the Heltec at home, but I read that the Heltec v3 has a particularly weak antenna. I tried again by leaving my T-1000e at home and taking the T-Deck out with me. I could successfully message my T-1000e from about five blocks away, but everything beyond that failed. The other part of the MeshCore ecosystem I haven’t mentioned yet is repeaters. The SenseCAP Solar P1-Pro , a solar-powered MeshCore repeater MeshCore repeaters are like WiFi extenders. They receive MeshCore messages and re-broadcast them to extend their reach. 
Repeaters are what create the “mesh” in MeshCore. The repeaters send messages to other repeaters and carry your MeshCore messages over longer distances. There are some technologically cool repeaters available. They’re solar powered with an internal battery, so they run independently and can survive a few days without sun. The problem was that I didn’t know how much difference a repeater makes. A repeater with a strong antenna would broadcast messages well, but does that solve my problem? If my T-Deck can’t send messages to my T-1000e from six blocks away, how is it going to reach the repeater? By this point, my enthusiasm for MeshCore had waned, and I didn’t want to spend another $100 and mount a broadcasting device to my house when I didn’t know how much it would improve my experience.

MeshCore’s firmware is open-source, so I took a look to see if there was anything I could do to improve the user experience on the T-Deck. The first surprise with the source code was that there were no automated tests. I wrote simple unit tests, but nobody from the MeshCore team has responded to my proposal, and it’s been about two months. From casually browsing, the codebase feels messy but not outrageously so. It’s written in C++, and most of the classes have a large surface area with 20+ non-private functions and fields, but that’s what I see in a lot of embedded software projects. Another code smell was that my unit test calls the function, which encodes raw bytes to a hex string. MeshCore’s implementation depends on headers for two crypto libraries, even though the function has nothing to do with cryptography. It’s the kind of needless coupling MeshCore would avoid if they wrote unit tests for each component. My other petty gripe was that the code doesn’t have consistent style conventions.
Someone proposed using the file that’s already in the repo, but a maintainer closed the issue with the guidance, “Just make sure your own IDE isn’t making unnecessary changes when you do a commit.” Why? Why in 2025 do I have to think about where to place my curly braces to match the local style? Just set up a formatter so I don’t have to think about mundane style issues anymore.

I originally started digging into the MeshCore source to understand the T-Deck UI, but I couldn’t find any code for it. I couldn’t find the source to the MeshCore Android or web apps either. And then I realized: it’s all closed-source. All of the official MeshCore client implementations are closed-source and proprietary. Reading the MeshCore FAQ confirmed that critical components are closed-source. What!?! They’d advertised this as open-source! How could they trick me? And then I went back to the MeshCore website and realized they never say “open-source” anywhere. I must have dreamed the part where they advertised MeshCore as open-source. It just seems like such an open-source thing that I assumed it was. But I was severely disappointed to discover that critical parts of MeshCore are proprietary.

Without open-source clients, MeshCore doesn’t work for me. I’m not an open-source zealot, and I think it’s fine for software to be proprietary, but the whole point of off-grid communication is decentralization and technology freedom, so I can’t get on board with a closed-source solution. Some parts of the MeshCore ecosystem are indeed open-source and liberally licensed: the firmware I flashed to my Heltec v3 and T-1000e, for example. But critically, the T-Deck firmware, the web app, and the mobile apps I used to operate the radios are all closed-source and proprietary. As far as I can see, there are no open-source MeshCore clients aside from the development CLI.
I still love the idea of MeshCore, but it doesn’t yet feel practical for communicating in an emergency. The software is too difficult to use, and I’ve been unable to send messages farther than five blocks (about 0.3 miles). I’m open to revisiting MeshCore, but I’m waiting on open-source clients and improvements in usability.

(For reference, the T-Deck reflashing steps that worked for me: disconnect the T-Deck from USB-C; power off the T-Deck; connect the T-Deck to your computer via the USB-C port; hold down the thumbwheel in the center; power on the device.)

What I liked: It is incredibly cool to send text messages without relying on a big company’s infrastructure. The concept delights the part of my brain that enjoys disaster prep. MeshCore runs on a wide variety of low-cost devices, many of which also work for Meshtastic. There’s an active, enthusiastic community around it.

What I didn’t: All of the official MeshCore clients are closed-source and proprietary. The user experience is too brittle for me to rely on in an emergency, especially if I’m trying to communicate with MeshCore beginners. Most of the hardware assumes you’ll pair it with your mobile phone over Bluetooth, which introduces many more points of failure and complexity. The only official standalone device is the T-Deck+, but I found it confusing and frustrating to use. There’s no written getting started guide. There’s a FAQ, but it’s a hodgepodge of details without much organization. There’s a good unofficial intro video, but I prefer text documentation.


Notes on the WASM Basic C ABI

The WebAssembly/tool-conventions repository contains "Conventions supporting interoperability between tools working with WebAssembly". Of special interest, it contains the Basic C ABI - an ABI for representing C programs in WASM. This ABI is followed by compilers like Clang with the wasm32 target. Rust is also switching to this ABI for extern "C" code. This post contains some notes on this ABI, with annotated code samples and diagrams to help visualize what the emitted WASM code is doing. Hereafter, "the ABI" refers to this Basic C ABI.

In these notes, annotated WASM snippets often contain descriptions of the state of the WASM value stack at a given point in time. Unless otherwise specified, "TOS" refers to "Top Of value Stack", and the notation [ x  y ] means the stack has y on top, with x right under it (and possibly some other stuff that's not relevant to the discussion under x ); in this notation, the stack grows "to the right". The WASM value stack has no linear memory representation and cannot be addressed, so it's meaningless to discuss whether the stack grows towards lower or higher addresses. The value stack is simply an abstract stack, where values can be pushed onto or popped off its "top". Whenever addressing is required, the ABI specifies explicitly managing a separate stack in linear memory. This stack is very similar to how stacks are managed in hardware assembly languages (except that in the ABI the stack pointer is held in a global variable, not a special register), and it's called the "linear stack".

By "scalar" I mean basic C types like int , double or char . For these, using the WASM value stack is sufficient, since WASM functions can accept an arbitrary number of scalar parameters. This C function: Will be compiled into something like: And can be called by pushing three values onto the stack and invoking call $add_three .
The ABI specifies that all integral types 32-bit and smaller will be passed as i32 , with the smaller types appropriately sign or zero extended. For example, consider this C function: It's compiled to almost the same code as add_three : Except for the last i32.extend8_s , which takes the lowest 8 bits of the value on TOS and sign-extends them to the full i32 (effectively ignoring all the higher bits). Similarly, when $add_three_chars is called, each of its parameters goes through i32.extend8_s . There are additional oddities that we won't get deep into, like passing __int128 values via two i64 parameters.

C pointers are just scalars, but it's still educational to review how they are handled in the ABI. Pointers to any type are passed as i32 values; the compiler knows they are pointers, though, and emits the appropriate instructions. For example: Is compiled to: Recall that in WASM, there's no difference between an i32 representing an address in linear memory and an i32 representing just a number. i32.store expects [ addr  value ] on TOS, and does *addr = value . Note that the x parameter isn't needed any longer after the sum is computed, so it's reused later on to hold the return value. WASM parameters are treated just like other locals (as in C).

According to the ABI, while scalars and single-element structs or unions are passed to a callee via WASM function parameters (as shown above), for larger aggregates the compiler utilizes linear memory. Specifically, each function gets a "frame" in a region of linear memory allocated for the linear stack. This region grows downwards from high to low addresses [1] , and the global $__stack_pointer points at the bottom of the frame: Consider this code: When do_work is compiled to WASM, prior to calling pair_calculate it copies pp into a location in linear memory, and passes the address of this location to pair_calculate . This location is on the linear stack, which is maintained using the $__stack_pointer global.
Here's the compiled WASM for do_work (I also gave its local variable a meaningful name, for readability): Some notes about this code: Before pair_calculate is called, the linear stack looks like this: Following the ABI, the code emitted for pair_calculate takes Pair* (by reference, instead of by value as in the original C code): Each function that needs linear stack space is responsible for adjusting the stack pointer and restoring it to its original place at the end. This naturally enables nested function calls; suppose we have some function a calling function b which, in turn, calls function c , and let's assume all of these need to allocate space on the linear stack. This is how the linear stack looks after c 's prologue: Since each function knows how much stack space it has allocated, it's able to properly restore $__stack_pointer to the bottom of its caller's frame before returning.

What about returning values of aggregate types? According to the ABI, these are also handled indirectly; a pointer parameter is prepended to the parameter list of the function. The function writes its return value into this address. The following function: Is compiled to: Here's a function that calls it: And the corresponding WASM: Note that this function only uses 8 bytes of its stack frame, but allocates 16; this is because the ABI dictates 16-byte alignment for the stack pointer.

There are some advanced topics mentioned in the ABI that these notes don't cover (at least for now), but I'll mention them here for completeness:

[1] This is similar to x86. For the WASM C ABI, a good reason is provided for the direction: WASM load and store instructions have an unsigned constant called offset that can be used to add a positive offset to the address parameter without extra instructions. Since $__stack_pointer points to the lowest address in the frame, these offsets can be used to efficiently access any value on the stack.
There are two instances of the pair pp in linear memory prior to the call to pair_calculate : the original one from the initialization statement (at offset 8), and a copy created for passing into pair_calculate (at offset 0). Theoretically, as pp is unused after the call, the compiler could do better here and keep only a single copy. The stack pointer is decremented by 16, and restored at the end of the function. The first few instructions - where the stack pointer is adjusted - are usually called the prologue of the function. In the same vein, the last few instructions, where the stack pointer is reset back to where it was on entry, are called the epilogue .

- "Red zone" - leaf functions have access to 128 bytes of red zone below the stack pointer. I found this difficult to observe in practice [2] . Since we don't issue system calls directly in WASM, it's tricky to conjure a realistic leaf function that requires the linear stack (instead of just using WASM locals).
- A separate frame pointer (global value) to be used for functions that require dynamic stack allocation (such as using C's VLAs ).
- A separate base pointer to be used for functions that require alignment > 16 bytes on the stack.

Max Bernstein 3 months ago

A catalog of side effects

Optimizing compilers like to keep track of each IR instruction’s effects . An instruction’s effects vary wildly, from having no effects at all, to writing a specific variable, to completely unknown (writing all state). This post can be thought of as a continuation of What I talk about when I talk about IRs , specifically the section about asking the right questions. When we talk about effects, we should ask the right questions: not what opcode is this? but instead what effects does this opcode have? Different compilers represent and track these effects differently. I’ve been thinking about how to represent these effects all year, so I have been doing some reading. In this post I will give some summaries of the landscape of approaches. Please feel free to suggest more.

Internal IR effect tracking is similar to the programming language notion of algebraic effects in type systems, but internally, compilers keep track of finer-grained effects. Effects such as “writes to a local variable”, “writes to a list”, or “reads from the stack” indicate which instructions can be re-ordered, duplicated, or removed entirely. For example, consider the following pseudocode for some made-up language that stands in for a snippet of compiler IR: The goal of effects is to communicate to the compiler whether, for example, these two IR instructions can be re-ordered. The second instruction might write to a location that the first one reads. But it also might not! This is about knowing whether the two instructions’ operands alias—if they are different names that refer to the same object. We can sometimes answer that question directly, but often it’s cheaper to compute an approximate answer: could they even alias? It’s possible that the two operands have different types, meaning that (as long as you have strict aliasing) the load and store operations that implement these reads and writes by definition touch different locations. And if they look at disjoint locations, there need not be any explicit order enforced.
Different compilers keep track of this information differently. The null effect analysis gives up and says "every instruction is maximally effectful" and therefore "we can't re-order or delete any instructions". That's probably fine for a first stab at a compiler, where you will get a big speed-up purely from strength reductions. Over-approximations of effects should always be valid. But at some point you start wanting to do dead code elimination (DCE), or common subexpression elimination (CSE), or load/store elimination, or to move instructions around, and you start wondering how to represent effects. That's where I am right now. So here's a catalog of different compilers I have looked at recently.

There are two main ways I have seen to represent effects: bitsets and heap range lists. We'll look at one example compiler for each, talk a bit about tradeoffs, then give a bunch of references to other major compilers.

We'll start with Cinder, a Python JIT, because that's what I used to work on. Cinder tracks heap effects for its high-level IR (HIR) in instr_effects.h. Pretty much everything happens in one central function, which is expected to know everything about what effects the given instruction might have. The data representation is a bitset over a lattice of alias classes, defined in alias_class.h. Each bit in the bitset represents a distinct location in the heap: reads from and writes to each of these locations are guaranteed not to affect any of the other locations. The classes are defined with an X-macro. Note that each bit implicitly represents a set: the list-item bit, say, does not refer to a specific list index, but to the infinite set of all possible list indices. It's any list index. Still, every list index is completely disjoint from, say, every entry in a global variable table. (And, to be clear, an object in a list might be the same as an object in a global variable table. The objects themselves can alias.
But the thing being written to or read from, the thing being side effected, is the container.) Like other bitset lattices, it's possible to union the sets by or-ing the bits, and to query for overlap by and-ing the bits. If this sounds familiar, it's because (as the repo notes) it's a similar idea to Cinder's type lattice representation. Like other lattices, there is both a bottom element (no effects) and a top element (all possible effects): union operations naturally hit a fixpoint at the top element, and intersection operations naturally hit a fixpoint at the bottom. All of this together lets the optimizer ask and answer precise questions about where an instruction might read and write, and whether two instructions' effects overlap.

Let's take a look at an (imaginary) IR version of the code snippet in the intro and see what analyzing it might look like in the optimizer. You can imagine that one fake IR instruction declares that it reads from the tuple heap and another declares that it writes to the list heap. Because tuple and list pointers cannot be cast into one another and therefore cannot alias, these are disjoint heaps in our bitset. The intersection of their effect bits is therefore empty, so these memory operations can never interfere! They can (for example) be re-ordered arbitrarily.

In Cinder, these memory effects could in the future be used for instruction re-ordering, but today they are mostly used in two places: the refcount insertion pass and DCE. DCE involves first finding the set of instructions that need to be kept around because they are useful/important/have effects. In the Cinder DCE there are some other checks in there, but the effects query is right there at the core of it!

Now that we have seen the bitset representation of effects and an implementation in Cinder, let's take a look at a different representation and an implementation in JavaScriptCore. I keep coming back to How I implement SSA form by Fil Pizlo, one of the significant contributors to JavaScriptCore (JSC). In particular, I keep coming back to the Uniform Effect Representation section.
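Before moving on to JSC's representation, the bitset machinery just described can be made concrete in a few lines of Python. The class name and bits here are my own invention rather than Cinder's exact lattice; the sketch shows union by or-ing, overlap by and-ing, and a re-ordering query built on top:

```python
from enum import IntFlag

class AliasClass(IntFlag):
    # Each bit is a disjoint family of heap locations (names illustrative).
    EMPTY      = 0       # bottom: no effects
    LIST_ITEM  = 1 << 0  # any list index
    TUPLE_ITEM = 1 << 1  # any tuple index
    GLOBAL     = 1 << 2  # any global-table entry

ANY = AliasClass.LIST_ITEM | AliasClass.TUPLE_ITEM | AliasClass.GLOBAL  # top

def union(a, b):
    return a | b  # or-ing the bits unions the sets

def overlaps(a, b):
    return (a & b) != AliasClass.EMPTY  # and-ing the bits queries overlap

def can_reorder(reads_a, writes_a, reads_b, writes_b):
    # Legal iff neither instruction writes anything the other touches.
    return not (overlaps(writes_a, reads_b | writes_b) or
                overlaps(writes_b, reads_a | writes_a))

# Union and intersection hit fixpoints at top and bottom:
assert union(ANY, AliasClass.LIST_ITEM) == ANY
assert (AliasClass.EMPTY & AliasClass.GLOBAL) == AliasClass.EMPTY

# The intro example: a tuple read and a list write touch disjoint heaps,
# so the two memory operations can be re-ordered arbitrarily.
assert can_reorder(reads_a=AliasClass.TUPLE_ITEM, writes_a=AliasClass.EMPTY,
                   reads_b=AliasClass.EMPTY, writes_b=AliasClass.LIST_ITEM)

# An opaque call that clobbers everything cannot move past anything:
assert not can_reorder(AliasClass.LIST_ITEM, AliasClass.EMPTY, ANY, ANY)
```

A DCE pass built on top of this keeps an instruction only if its write set overlaps something observable (or its result is used) — the same `overlaps` query sits at its core.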
This notion of "abstract heaps" felt very… well, abstract. Somehow more abstract than the bitset representation. The pre-order and post-order integer pair as a way to represent nested heap effects just did not click. It didn't make any sense until I actually went spelunking in JavaScriptCore and found one of several implementations—because, you know, JSC is six compilers in a trenchcoat [citation needed]. DFG, B3, DOMJIT, and probably others all have their own abstract heap implementations. We'll look at DOMJIT mostly because it's a smaller example and also because it illustrates something else that's interesting: builtins. We'll come back to builtins in a minute.

Let's take a look at how DOMJIT structures its abstract heaps: a YAML file. It's a hierarchy: each heap is a subheap of its parent, which is a subheap of its parent, and so on. A write to any heap is a write to every heap above it. Sibling heaps are unrelated: any two siblings are, by construction, disjoint. To get a feel for this, I wired up a simplified version of ZJIT's bitset generator (for types!) to read a YAML document and generate a bitset as Rust code. It's not a fancy X-macro, but it's a short and flexible Ruby script. Then I took the DOMJIT abstract heap generator—also, funnily enough, a short Ruby script—modified the output format slightly, and had it generate its int pairs. It already comes with a little diagram, which is super helpful for readability.

Empty ranges represent empty heap effects: if the start and end are the same number, there are no effects. There is no single canonical empty value, but any empty range could be normalized to one. Maybe this was obvious to you, dear reader, but this pre-order/post-order thing is about nested ranges! Seeing the output of the generator laid out clearly made it make a lot more sense to me. What about checking overlap? JSC implements it as a range-overlap check in two compares. (See also How to check for overlapping intervals and Range overlap in two compares for more fun.)
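The numbering trick is easy to reproduce: do a DFS over the hierarchy, record a counter on entry and again after the children, and the nesting falls out. A sketch with invented heap names (not DOMJIT's real hierarchy), including the two-compare overlap check:

```python
# Assign each abstract heap a (start, end) pair by DFS numbering:
# a child's range always nests inside its parent's, so "A is a
# subheap of B" becomes a simple range-containment check.

HIERARCHY = {
    "World": {
        "DOM": {"Node": {}, "Event": {}},
        "Heap": {},
    },
}

def number(tree, counter=None, out=None):
    if counter is None:
        counter, out = [0], {}
    for name, children in tree.items():
        start = counter[0]
        counter[0] += 1
        number(children, counter, out)       # children consume numbers next
        out[name] = (start, counter[0])      # (pre-order, post-order)
    return out

ranges = number(HIERARCHY)

def heaps_overlap(a, b):
    # Two half-open ranges overlap iff each starts before the other ends.
    (a0, a1), (b0, b1) = ranges[a], ranges[b]
    return a0 < b1 and b0 < a1

assert ranges["World"] == (0, 5)
assert heaps_overlap("World", "Node")      # a write to World clobbers Node
assert heaps_overlap("DOM", "Event")       # Event nests inside DOM
assert not heaps_overlap("Node", "Event")  # siblings are disjoint
assert not heaps_overlap("DOM", "Heap")
```

Because a child's range nests inside its parent's, "does a write to X affect Y?" collapses to interval overlap, with no fixed bit-width ceiling.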
While bitsets are a dense representation (you have to hold every bit), they are very compact and very precise. You can hold any combination of 64 or 128 bits in a single register, and the union and intersection operations are very cheap. With int ranges, it's a little more complicated. An imprecise union of two ranges can take the maximal range that covers both, which also covers everything in between. To get a more precise union, you have to keep track of both ranges. In the worst case, if you want efficient arbitrary queries, you need to store your int ranges in an interval tree.

So what gives? I asked Fil: if both bitsets and int ranges answer the same question, why use int ranges? He said that it's more flexible long-term: bitsets get expensive as soon as you need over 128 bits (you might need to heap-allocate them!) whereas ranges have no such ceiling. But doesn't holding sequences of ranges require heap allocation? Well, no, as Fil writes in his SSA post:

The purpose of the effect representation baked into the IR is to provide a precise always-available baseline for alias information that is super easy to work with. […] you can have instructions report that they read/write multiple heaps […] you can have a utility function that produces such lists on demand. It's important to note that this doesn't actually involve any allocation of lists.

JSC does this very clever thing where they have "functors" that they pass in as arguments that compress/summarize what they want out of an instruction's effects. Let's take a look at how the DFG (for example) uses these heap ranges in analysis. The DFG is structured in such a way that it can make use of the DOMJIT heap ranges directly, which is neat. Note that in the example below, the DOMJIT heap is a thin wrapper over the DFG compiler's own equivalent, and clobberize is the function that calls these functors (a read functor or a write functor, in this case) for each effect that the given IR instruction declares. I've pulled some relevant snippets of clobberize, which is quite long, that I think are interesting.
First, some instructions (constants, here) have no effects. There's some utility call in there that I didn't fully understand. Then there are some instructions that conditionally have effects depending on the use types of their operands. 1 Taking the absolute value of an Int32 or a Double is effect-free, but otherwise the operation looks like it can run arbitrary code. Some run-time IR guards that might cause side exits are annotated as such—they write to the heap. Local variable instructions read specific heaps indexed by what looks like the local index, but I'm not sure. This means accessing two different locals won't alias! Instructions that allocate can't be re-ordered, it looks like; they both read and write a shared allocation-related heap. This probably limits the amount of allocation sinking that can be done.

Then there's the DOM call node, which is the builtins stuff I was talking about. (Remember that these operations are very similar to DOMJIT's with a couple more details—and in some cases even contain DOMJIT heap ranges!) This node is the way for the DOM APIs in the browser—a significant chunk of the builtins, which are written in C++—to communicate what they do to the optimizing compiler. Without any annotations, the JIT has to assume that a call into C++ could do anything to the JIT state. Bummer! But because, for example, a DOM getter annotates what memory it reads from and what it doesn't write to, the JIT can optimize around it better—or even remove the access completely. It means the JIT can reason about calls to known builtins the same way that it reasons about normal JIT opcodes. (Incidentally, it looks like it doesn't even make a C call, but instead is inlined as a little memory-read snippet using a JIT builder API. Neat.)

Last, we'll look at Simple, which has a slightly different take on all of this. Simple is Cliff Click's pet Sea of Nodes (SoN) project to try and showcase the idea to the world—outside of a HotSpot C2 context.
This one is a little harder for me to understand, but it looks like each translation unit has a node that doles out different classes of memory nodes for each alias class. Each IR node then takes data dependencies on whatever effect nodes it might use. Alias classes are split up based on the paper Type-Based Alias Analysis (PDF): "Our approach is a form of TBAA similar to the 'FieldTypeDecl' algorithm described in the paper." The Simple project is structured into sequential implementation stages, and alias classes come into the picture in Chapter 10.

Because I spent a while spelunking through other implementations to see how other projects did this, here is a list of the projects I looked at. Mostly, they use bitsets.

HHVM, a JIT for the Hack language, also uses a bitset for its memory effects. See for example alias-class.h and memory-effects.h. HHVM has a couple of places that use this information, such as a definition-sinking pass, alias analysis, DCE, store elimination, refcount opts, and more. If you are wondering why the HHVM representation looks similar to the Cinder representation, it's because some former HHVM engineers such as Brett Simmers also worked on Cinder!

Android's ART Java runtime also uses a bitset for its effect representation: a very compact class in nodes.h. (Note that I am linking an ART fork on GitHub as a reference, but the upstream code is hosted on googlesource.) The side effects are used in loop-invariant code motion, global value numbering, write barrier elimination, scheduling, and more.

CoreCLR mostly uses a bitset for its effects class. This one is interesting, though, because it also splits out effects specifically to include sets of local variables.

V8 is also about six completely different compilers in a trenchcoat. Turboshaft uses a struct in operations.h that is two bitsets for reads/writes of effects. This is used in value numbering as well as in a bunch of other small optimization passes they call "reducers".
Maglev also has a similar bitset-looking thing in its IR nodes, with effect query methods on it, that is used in its various reducers. Until recently, V8 also used Sea of Nodes as its IR representation, which tracks side effects more explicitly in the structure of the IR itself. Guile Scheme looks like it has a custom tagging scheme type thing.

Both bitsets and int ranges are perfectly cromulent ways of representing heap effects for your IR. The Sea of Nodes approach is also probably okay, since it powers HotSpot C2 and (for a time) V8. Remember to ask the right questions of your IR when doing analysis. Questions such as:

where might this instruction write? (because CPython is reference counted and incref implies ownership)
where does this instruction borrow its input from?
do these two instructions' write destinations overlap?

Thank you to Fil Pizlo for writing his initial GitHub Gist and sending me on this journey, and thank you to Chris Gregory, Brett Simmers, and Ufuk Kayserilioglu for feedback on making some of the explanations more helpful.

This is because the DFG compiler does this interesting thing where they track and guard the input types on use vs having types attached to the input's own def. It might be a clean way to handle shapes inside the type system while also allowing the type+shape of an object to change over time (which it can do in many dynamic language runtimes). ↩

Pat Shaughnessy 3 months ago

YARV’s Internal Stack and Your Ruby Stack

I've started working on a new edition of Ruby Under a Microscope that covers Ruby 3.x. I'm working on this in my spare time, so it will take a while. Leave a comment or drop me a line and I'll email you when it's finished. The content of Chapter 3, about the YARV virtual machine, hasn't changed much since 2014. However, I did update all of the diagrams to account for some new values YARV now saves inside of each stack frame, and some of the common YARV instructions were renamed as well. I also moved some content that was previously part of Chapter 4 into Chapter 3. Right now I'm rewriting Chapter 4 from scratch, describing Ruby's new JIT compilers.

As we'll see in a moment, YARV uses a stack internally to track intermediate values, arguments, and return values: YARV is a stack-oriented virtual machine. In addition to its own internal stack, YARV keeps track of your Ruby program's call stack, recording which methods call which other methods, functions, blocks, lambdas, and so on. In fact, YARV is not just a stack machine—it's a double-stack machine! It has to track the arguments and return values not only for its own internal instructions but also for your Ruby program.

Figure 3-1 shows YARV's basic registers and internal stack. YARV's internal stack is on the left. The SP label is the stack pointer, or the location of the top of the stack. On the right are the instructions that YARV is executing. PC is the program counter, or the location of the current instruction. You can see the YARV instructions that Ruby compiled from the puts 2+2 example on the right side of Figure 3-1. YARV stores both the SP and PC registers in a C structure called rb_control_frame_t, along with the current value of Ruby's self variable and some other values not shown here. At the same time, YARV maintains another stack of these rb_control_frame_t structures, as shown in Figure 3-2.
This second stack of rb_control_frame_t structures represents the path that YARV has taken through your Ruby program, and YARV's current location. In other words, this is your Ruby call stack—what you would see if you ran puts caller. The CFP register points to the current frame. Each stack frame in your Ruby program stack contains, in turn, a different value for the self, PC, and SP registers, as shown in Figure 3-1. Ruby also keeps track of the type of code running at each level in your Ruby call stack, indicated by the "[BLOCK]" and "[METHOD]" notation in Figure 3-2.

To help you understand this a bit better, here are a couple of examples. I'll begin with the simple 2+2 example from Chapters 1 and 2, shown again in Listing 3-1. This one-line Ruby script doesn't have a Ruby call stack, so I'll focus on the internal YARV stack for now. Figure 3-3 shows how YARV will execute this script, beginning with the first instruction, putself. As you can see in Figure 3-3, YARV starts the program counter (PC) at the first instruction, and initially the stack is empty. Now YARV executes the putself instruction and pushes the current value of self onto the stack, as shown in Figure 3-4. Because this simple script contains no Ruby objects or classes, the self pointer is set to the default top self object. This is an instance of the Object class that Ruby automatically creates when YARV starts. It serves as the receiver for method calls and the container for instance variables in the top-level scope. The top self object contains a single, predefined to_s method, which returns the string "main." You can see this for yourself by running ruby -e 'puts self' in the console. YARV will use this self value on the stack when it executes the opt_send_without_block instruction: self is the receiver of the puts method because I didn't specify a receiver for this method call. Next, YARV executes putobject 2.
It pushes the numeric value 2 onto the stack and increments the PC again, as shown in Figure 3-5. This is the first step of the receiver(arguments) operation pattern described in "How Ruby Compiles a Simple Script" on page 34. First, Ruby pushes the receiver onto the internal YARV stack. In this example, the Integer object 2 is the receiver of the message/method +, which takes a single argument, also a 2. Next, Ruby pushes the argument 2, as shown in Figure 3-6. Finally, Ruby executes the + operation. In this case, opt_plus is an optimized instruction that adds two values, the receiver and the argument, as shown in Figure 3-7. As you can see in Figure 3-7, the opt_plus instruction leaves the result, 4, at the top of the stack.

Now Ruby is perfectly positioned to execute the puts function call: the receiver, self, is first on the stack, and the single argument, 4, is at the top of the stack. (I'll describe how method lookup works in Chapter 6.) Next, Figure 3-8 shows what happens when Ruby executes the puts method call. As you can see, the opt_send_without_block instruction leaves the return value, nil, at the top of the stack. Finally, Ruby executes the last instruction, leave, which finishes the execution of our simple, one-line Ruby program. Of course, when Ruby executes the puts call, the C code implementing the puts function will actually display the value 4 in the console output.
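The whole walkthrough can be compressed into a toy double-stack interpreter: one shared value stack plus a stack of tiny control frames, each with its own pc, sp, and self. This is my own Python sketch of the sequence Figures 3-3 through 3-8 step through, not YARV's C code:

```python
def run(insns, output):
    stack = []                       # YARV's internal value stack
    frames = [{"pc": 0, "sp": 0, "self": "main", "type": "[TOP]"}]
    cf = frames[-1]                  # CFP: the current control frame
    while cf["pc"] < len(insns):
        op, *args = insns[cf["pc"]]
        cf["pc"] += 1                # PC advances past each instruction
        if op == "putself":
            stack.append(cf["self"])
        elif op == "putobject":
            stack.append(args[0])
        elif op == "opt_plus":
            arg, recv = stack.pop(), stack.pop()  # argument above receiver
            stack.append(recv + arg)
        elif op == "opt_send_without_block":
            arg, recv = stack.pop(), stack.pop()
            output.append(arg)       # `puts`, crudely
            stack.append(None)       # puts returns nil
        elif op == "leave":
            return stack.pop()
        cf["sp"] = len(stack)        # SP tracks the top of the stack

# puts 2 + 2
out = []
result = run([
    ("putself",),
    ("putobject", 2),
    ("putobject", 2),
    ("opt_plus",),
    ("opt_send_without_block",),
    ("leave",),
], out)
assert out == [4]        # the real puts would print 4
assert result is None    # the whole program evaluates to nil
```

Note how the receiver(arguments) pattern shows up directly in the pop order: the argument sits above the receiver, so opt_plus and opt_send_without_block pop the argument first.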

Simon Willison 3 months ago

Code research projects with async coding agents like Claude Code and Codex

I've been experimenting with a pattern for LLM usage recently that's working out really well: asynchronous code research tasks . Pick a research question, spin up an asynchronous coding agent and let it go and run some experiments and report back when it's done. Software development benefits enormously from something I call code research . The great thing about questions about code is that they can often be definitively answered by writing and executing code. I often see questions on forums which hint at a lack of understanding of this skill. "Could Redis work for powering the notifications feed for my app?" is a great example. The answer is always "it depends", but a better answer is that a good programmer already has everything they need to answer that question for themselves. Build a proof-of-concept, simulate the patterns you expect to see in production, then run experiments to see if it's going to work. I've been a keen practitioner of code research for a long time. Many of my most interesting projects started out as a few dozen lines of experimental code to prove to myself that something was possible. It turns out coding agents like Claude Code and Codex are a fantastic fit for this kind of work as well. Give them the right goal and a useful environment and they'll churn through a basic research project without any further supervision. LLMs hallucinate and make mistakes. This is far less important for code research tasks because the code itself doesn't lie: if they write code and execute it and it does the right things then they've demonstrated to both themselves and to you that something really does work. They can't prove something is impossible - just because the coding agent couldn't find a way to do something doesn't mean it can't be done - but they can often demonstrate that something is possible in just a few minutes of crunching. 
I've used interactive coding agents like Claude Code and Codex CLI for a bunch of these, but today I'm increasingly turning to their asynchronous coding agent family members instead. An asynchronous coding agent is a coding agent that operates on a fire-and-forget basis. You pose it a task, it churns away on a server somewhere and when it's done it files a pull request against your chosen GitHub repository. OpenAI's Codex Cloud , Anthropic's Claude Code for web , Google Gemini's Jules , and GitHub's Copilot coding agent are four prominent examples of this pattern. These are fantastic tools for code research projects. Come up with a clear goal, turn it into a few paragraphs of prompt, set them loose and check back ten minutes later to see what they've come up with. I'm firing off 2-3 code research projects a day right now. My own time commitment is minimal and they frequently come back with useful or interesting results. You can run a code research task against an existing GitHub repository, but I find it's much more liberating to have a separate, dedicated repository for your coding agents to run their projects in. This frees you from being limited to research against just code you've already written, and also means you can be much less cautious about what you let the agents do. I have two repositories that I use for this - one public, one private. I use the public one for research tasks that have no need to be private, and the private one for anything that I'm not yet ready to share with the world. The biggest benefit of a dedicated repository is that you don't need to be cautious about what the agents operating in that repository can do. Both Codex Cloud and Claude Code for web default to running agents in a locked-down environment, with strict restrictions on how they can access the network. 
This makes total sense if they are running against sensitive repositories - a prompt injection attack of the lethal trifecta variety could easily be used to steal sensitive code or environment variables. If you're running in a fresh, non-sensitive repository you don't need to worry about this at all! I've configured my research repositories for full network access, which means my coding agents can install any dependencies they need, fetch data from the web and generally do anything I'd be able to do on my own computer.

Let's dive into some examples. My public research repository is at simonw/research on GitHub. It currently contains 13 folders, each of which is a separate research project. I only created it two weeks ago so I'm already averaging nearly one a day! It also includes a GitHub Workflow which uses GitHub Models to automatically update the README file with a summary of every new project, using Cog, LLM, llm-github-models and this snippet of Python. Here are some example research projects from the repo.

node-pyodide shows an example of a Node.js script that runs the Pyodide WebAssembly distribution of Python inside it - yet another of my ongoing attempts to find a great way of running Python in a WebAssembly sandbox on a server.

python-markdown-comparison (transcript) provides a detailed performance benchmark of seven different Python Markdown libraries. I fired this one off because I stumbled across cmarkgfm, a Python binding around GitHub's Markdown implementation in C, and wanted to see how it compared to the other options. This one produced some charts!
cmarkgfm came out on top by a significant margin. Here's the entire prompt I used for that project:

Create a performance benchmark and feature comparison report on PyPI cmarkgfm compared to other popular Python markdown libraries - check all of them out from github and read the source to get an idea for features, then design and run a benchmark including generating some charts, then create a report in a new python-markdown-comparison folder (do not create a _summary.md file or edit anywhere outside of that folder). Make sure the performance chart images are directly displayed in the README.md in the folder.

Note that I didn't specify any Markdown libraries other than cmarkgfm - Claude Code ran a search and found the other six by itself.

cmarkgfm-in-pyodide is a lot more fun. A neat thing about having all of my research projects in the same repository is that new projects can build on previous ones. Here I decided to see how hard it would be to get cmarkgfm - which has a C extension - working inside Pyodide inside Node.js. Claude successfully compiled an 88.4KB file with the necessary C extension and proved it could be loaded into Pyodide in WebAssembly inside of Node.js. I ran this one using Claude Code on my laptop after an initial attempt failed. The starting prompt was:

Figure out how to get the cmarkgfm markdown lover [typo in prompt, this should have been "library" but it figured it out anyway] for Python working in pyodide. This will be hard because it uses C so you will need to compile it to pyodide compatible webassembly somehow. Write a report on your results plus code to a new cmarkgfm-in-pyodide directory. Test it using pytest to exercise a node.js test script that calls pyodide as seen in the existing node.js and pyodide directory There is an existing branch that was an initial attempt at this research, but which failed because it did not have Internet access. You do have Internet access.
Use that existing branch to accelerate your work, but do not commit any code unless you are certain that you have successfully executed tests that prove that the pyodide module you created works correctly. This one gave up half way through, complaining that emscripten would take too long. I told it: Complete this project, actually run emscripten, I do not care how long it takes, update the report if it works It churned away for a bit longer and complained that the existing Python library used CFFI which isn't available in Pyodide. I asked it: Can you figure out how to rewrite cmarkgfm to not use FFI and to use a pyodide-friendly way of integrating that C code instead? ... and it did. You can see the full transcript here . blog-tags-scikit-learn . Taking a short break from WebAssembly, I thought it would be fun to put scikit-learn through its paces on a text classification task against my blog: Work in a new folder called blog-tags-scikit-learn Download - a SQLite database. Take a look at the blog_entry table and the associated tags - a lot of the earlier entries do not have tags associated with them, where the later entries do. Design, implement and execute models to suggests tags for those earlier entries based on textual analysis against later ones Use Python scikit learn and try several different strategies Produce JSON of the results for each one, plus scripts for running them and a detailed markdown description Also include an HTML page with a nice visualization of the results that works by loading those JSON files. This resulted in seven files, four results files and a detailed report . (It ignored the bit about an HTML page with a nice visualization for some reason.) Not bad for a few moments of idle curiosity typed into my phone! That's just three of the thirteen projects in the repository so far. The commit history for each one usually links to the prompt and sometimes the transcript if you want to see how they unfolded. 
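For flavor, here is a dependency-free sketch of the simplest strategy a tag-suggestion project like that might try (bag-of-words plus cosine nearest neighbour). The data and function names are mine for illustration; this is not the scikit-learn code the agent actually wrote:

```python
import math
from collections import Counter

def vectorize(text):
    # Crude bag-of-words: word counts, lowercased.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tagged "later" entries (toy data) used to suggest tags for untagged ones.
tagged = [
    ("sqlite is a lovely embedded database", {"sqlite", "databases"}),
    ("prompt injection attacks against llms", {"llms", "security"}),
]

def suggest_tags(text):
    # Borrow the tags of the most textually similar tagged entry.
    vec = vectorize(text)
    _, best_tags = max(tagged, key=lambda t: cosine(vec, vectorize(t[0])))
    return best_tags

assert suggest_tags("an embedded sqlite database trick") == {"sqlite", "databases"}
```

A real run would swap this for TF-IDF weighting and a proper multi-label classifier, which is exactly the kind of "try several different strategies" comparison the prompt asked the agent to produce.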
More recently I added a short file to the repo with a few extra tips for my research agents. You can read that here.

My preferred definition of AI slop is AI-generated content that is published without human review. I've not been reviewing these reports in great detail myself, and I wouldn't usually publish them online without some serious editing and verification. I want to share the pattern I'm using though, so I decided to keep them quarantined in this one public repository. A tiny feature request for GitHub: I'd love to be able to mark a repository as "exclude from search indexes" such that it gets labelled with noindex tags. I still like to keep AI-generated content out of search, to avoid contributing more to the dead internet.

It's pretty easy to get started trying out this coding agent research pattern. Create a free GitHub repository (public or private) and let some agents loose on it and see what happens. You can run agents locally, but I find the asynchronous agents to be more convenient - especially as I can run them (or trigger them from my phone) without any fear of them damaging my own machine or leaking any of my private data. Claude Code for web offers a free $250 of credits for their $20/month users for a limited time (until November 18, 2025). Gemini Jules has a free tier. There are plenty of other coding agents you can try out as well. Let me know if your research agents come back with anything interesting!
