Posts in C (20 found)
Simon Willison 1 week ago

Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me

Anthropic didn't release their latest model, Claude Mythos (system card PDF), today. They have instead made it available to a very restricted set of preview partners under their newly announced Project Glasswing. The model is a general-purpose model, similar to Claude Opus 4.6, but Anthropic claim that its cyber-security research abilities are strong enough that they need to give the software industry as a whole time to prepare:

Mythos Preview has already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems.

There's a great deal more technical detail in Assessing Claude Mythos Preview’s cybersecurity capabilities on the Anthropic Red Team blog:

In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex JIT heap spray that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets.

Plus this comparison with Claude 4.6 Opus:

Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development.
But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.

Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted. Just a few days ago (last Friday) I started a new ai-security-research tag on this blog to acknowledge an uptick in credible security professionals sounding the alarm about how good modern LLMs have got at vulnerability research.

Greg Kroah-Hartman of the Linux kernel:

Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us. Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real.

Daniel Stenberg of :

The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good. I'm spending hours per day on this now. It's intense.

And Thomas Ptacek published Vulnerability Research Is Cooked, a post inspired by his podcast conversation with Anthropic's Nicholas Carlini.

Anthropic have a 5 minute talking heads video describing the Glasswing project. Nicholas Carlini appears as one of those talking heads, where he said (highlights mine):

It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently.
But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...]

I've found more bugs in the last couple of weeks than I found in the rest of my life combined. We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it. On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches so that anyone who runs the software is no longer vulnerable to these attacks.

I found this on the OpenBSD 7.8 errata page:

025: RELIABILITY FIX: March 25, 2026  All architectures
TCP packets with invalid SACK options could crash the kernel.
A source code patch exists which remedies this problem.

I tracked that change down in the GitHub mirror of the OpenBSD CVS repo (apparently they still use CVS!) and found it using git blame: sure enough, the surrounding code is from 27 years ago.

I'm not sure which Linux vulnerability Nicholas was describing, but it may have been this NFS one recently covered by Michael Lynch.

There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially given that it's mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable at digging up these issues.
I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation.

It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon.

The bad news for those of us who are not trusted partners is this:

We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview.

I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off.

devansh 1 week ago

On LLMs and Vulnerability Research

I have been meaning to write this for six months. The landscape kept shifting. It has now shifted enough to say something definitive. I work in vulnerability triage, and I see, every day, how this landscape is changing. These views are personal and do not represent my employer. Take them with appropriate salt.

Two things happened in quick succession. Frontier models got dramatically better (Opus 4.6, GPT 5.4). Agentic toolkits (Claude Code, Codex, OpenCode) gave those models hands. The combination produces solid vulnerability research.

"LLMs are next-token predictors." This framing was always reductive. It is now actively misleading. The gap between what these models theoretically do (predict the next word) and what they actually do (reason about concurrent thread execution in kernel code to identify use-after-free conditions) has grown too wide for the old frame to hold. Three mechanisms explain why.

Implicit structural understanding. Tokenizers know nothing about code. Byte Pair Encoding treats code constructs as frequent byte sequences, not syntactic units. But the transformer layers above tell a different story. Through training on massive code corpora, attention heads specialise: some track variable identity and provenance, others develop a bias toward control flow tokens. The model converges on internal representations that capture semantic properties of code, something functionally equivalent to an abstract syntax tree, built implicitly, never formally.

Neural taint analysis. The most security-relevant emergent capability. The model learns associations between sources of untrusted input (user-controlled data, network input, file reads) and dangerous sinks (system calls, SQL queries, memory operations). When it identifies a path from source to sink without adequate sanitisation, it flags a vulnerability. This is not formal taint analysis. No dataflow graph is built. It is a statistical approximation.
But it works well for intra-procedural bugs where the source-to-sink path is short, and degrades as distance increases across functions, files, and abstraction layers.

Test-time reasoning. The most consequential advance. Standard inference is a single forward pass: reactive, fast, fundamentally limited. Reasoning models (o-series, extended thinking, DeepSeek R1) break this constraint by generating internal reasoning tokens, a scratchpad where the model works through a problem step by step before answering. The model traces execution paths, tracks variable values, evaluates branch conditions. Symbolic execution in natural language. Less precise than formal tools but capable of handling what they choke on: complex pointer arithmetic, dynamic dispatch, deeply nested callbacks. It self-verifies, generating a hypothesis ("the lock isn't held across this path"), then testing it ("wait, is there a lock acquisition I missed?"). It backtracks when reasoning hits dead ends. DeepSeek R1 showed these behaviours emerge from pure reinforcement learning with correctness-based rewards. Nobody taught the model to check its own work. It discovered that verification produces better answers. The model is not generating the most probable next token. It is spending variable compute to solve a specific problem.

Three advances compound on each other.

Mixture of Experts. Every frontier model now uses MoE. A model might contain 400 billion parameters but activate only 17 billion per token. Vastly more encoded knowledge about code patterns, API behaviours, and vulnerability classes without proportional inference cost.

Million-token context. In 2023, analysing a codebase required chunking code into a vector database, retrieving fragments via similarity search, and feeding them to the model. RAG is inherently lossy: code split at arbitrary boundaries, cross-file relationships destroyed, critical context discarded.
For vulnerability analysis, where understanding cross-module data flow is the entire point, this information loss is devastating. At one million tokens, you fit an entire mid-size codebase in a single prompt. The model traces user input from an HTTP handler through three middleware layers into a database query builder and spots a sanitisation gap on line 4,200 exploitable via the endpoint on line 890. No chunking. No retrieval. No information loss.

Reinforcement-learned reasoning. Earlier models trained purely on next-token prediction. Modern frontier models add an RL phase: generate reasoning chains, reward correctness of the final answer rather than plausibility of text. Over millions of iterations, this shapes reasoning to produce correct analyses rather than plausible-sounding ones. The strategies transfer across domains. A model that learned to verify mathematical reasoning applies the same verification to code.

A persistent belief: truly "novel" vulnerability classes exist, bugs so unprecedented that only human genius could discover them. Comforting. Also wrong. Decompose the bugs held up as examples.

HTTP request smuggling: the insight that a proxy and backend might disagree about where one request ends and another begins feels like a creative leap. But the actual bug is the intersection of known primitives: ambiguous protocol specification, inconsistent parsing between components, a security-critical assumption about message boundaries. None novel individually. The "novelty" was in combining them.

Prototype pollution RCEs in JavaScript frameworks. Exotic until you realise it is dynamic property assignment in a prototype-based language, unsanitised input reaching object modification, and a rendering pipeline evaluating modified objects in a privileged context. Injection, type confusion, privilege boundary crossing. Taxonomy staples for decades. The pattern holds universally.
"Novel" vulnerabilities decompose into compositions of known primitives: spec ambiguities, type confusions, missing boundary checks, TOCTOU gaps, trust boundary violations. The novelty is in the composition, not the components. This is precisely what frontier LLMs are increasingly good at. A model that understands protocol ambiguity, inconsistent component behaviour, and security boundary assumptions has all the ingredients to hypothesise a request-smuggling-class vulnerability when pointed at a reverse proxy codebase. It does not need to have seen that exact bug class. It needs to recognise that the conditions for parser disagreement exist and that parser disagreement at a trust boundary has security implications. Compositional reasoning over known primitives. Exactly what test-time reasoning enables.

LLMs will not discover the next Spectre tomorrow. Microarchitectural side channels in CPU pipelines are largely absent from code-level training data. But the space of "LLM-inaccessible" vulnerabilities is smaller than the security community assumes, and it shrinks with every model generation. Most of what we call novel vulnerability research is creative recombination within a known search space. That is what these models do best.

Effective AI vulnerability research = good scaffolding + adequate tokens. Scaffolding (harness design, prompt engineering, problem framing) is wildly underestimated. Claude Code and Codex are general-purpose coding environments, not optimised for vulnerability research. A purpose-built harness provides threat models, defines trust boundaries, highlights historical vulnerability patterns in the specific technology stack, and constrains search to security-relevant code paths. The operator designing that context determines whether the model spends its reasoning budget wisely or wastes it on dead ends. Two researchers, same model, same codebase, dramatically different results. Token quality beats token quantity.
A thousand reasoning tokens on the right code path with the right threat model outperform a million tokens sprayed across a repo with "find vulnerabilities." The search space is effectively infinite. You cannot brute-force it. You narrow it with human intelligence encoded as context, directing machine intelligence toward where bugs actually live.

"LLMs are non-deterministic, so you can't trust their findings." Sounds devastating. Almost entirely irrelevant. It confuses the properties of the tool with the properties of the target. The bugs are deterministic. They are in the code. A buffer overflow on line 847 is still there whether the model notices it on attempt one or attempt five. Non-determinism in the search process does not make the search less valid. It makes it more thorough under repetition. Each run samples a different trajectory through the hypothesis space. The union of multiple runs covers more search space than any single run. Conceptually identical to fuzzing. Nobody says "fuzzers are non-deterministic so we can't trust them." You run the fuzzer longer, cover more input space, find more bugs. Same principle. Non-determinism under repetition becomes coverage.

In 2023 and 2024, the state of the art was architecture. Multi-agent systems, RAG pipelines, tool integration with SMT solvers and fuzzers and static analysis engines. The best orchestration won. That era is ending. A frontier model ingests a million tokens of code in a single prompt. Your RAG pipeline is not an advantage when the model without RAG sees the whole codebase while your pipeline shows fragments selected by retrieval that does not know what is security-relevant. A reasoning model spends thousands of tokens tracing execution paths and verifying hypotheses. Your external solver integration is not a differentiator when the model approximates what the solver does with contextual understanding the solver lacks. Agentic toolkits handle orchestration better than your custom tooling.
The implication the security industry has not fully processed: vulnerability research is being democratised. When finding a memory safety bug in a C library required a Project Zero-calibre researcher with years of experience, the supply was measured in hundreds worldwide. When it requires a well-prompted API call, the supply is effectively unlimited.

What replaces architecture as the competitive advantage? Two things.

Domain expertise encoded as context. Not "find bugs in this code" but "this is a TLS implementation; here are three classes of timing side-channel that have affected similar implementations; analyse whether the constant-time guarantees hold across these specific code paths." The human provides the insight. The model does the grunt work.

Access to compute. Test-time reasoning scales with inference compute. More tokens means deeper analysis, more self-verification, more backtracking. Teams that let a model spend ten minutes on a complex code path will find bugs that teams limited to five-second responses will miss.

The end state: vulnerability discovery for known bug classes becomes a commodity, available to anyone with API access and a credit card. The researchers who thrive will focus where the model cannot: novel vulnerability classes, application-level logic flaws, architectural security review, adversarial creativity. This is not a prediction. It is already happening. The pace is set by model capability, which doubles on a timeline measured in months.

Anton Zhiyanov 1 week ago

Porting Go's strings package to C

Creating a subset of Go that translates to C was never my end goal. I liked writing C code with Go, but without the standard library it felt pretty limited. So, the next logical step was to port Go's stdlib to C. Of course, this isn't something I could do all at once. I started with the io package, which provides core abstractions like and , as well as general-purpose functions like . But isn't very interesting on its own, since it doesn't include specific reader or writer implementations. So my next choices were naturally and — the workhorses of almost every Go program. This post is about how the porting process went.

Bits and UTF-8 • Bytes • Allocators • Buffers and builders • Benchmarks • Optimizing search • Optimizing builder • Wrapping up

Before I could start porting , I had to deal with its dependencies first. Both of these packages are made up of pure functions, so they were pretty easy to port. The only minor challenge was the difference in operator precedence between Go and C — specifically, bit shifts ( , ). In Go, bit shifts have higher precedence than addition and subtraction. In C, they have lower precedence. The simplest solution was to just use parentheses everywhere shifts are involved.

With and done, I moved on to . The package provides functions for working with byte slices. Some of them were easy to port, like . Here's how it looks in Go: And here's the C version: Just like in Go, the ( → ) macro doesn't allocate memory; it just reinterprets the byte slice's underlying storage as a string. The function (which works like in Go) is easy to implement using from the libc API.

Another example is the function, which looks for a specific byte in a slice. Here's the pure-Go implementation: And here's the C version: I used a regular C loop to mimic Go's :

But and don't allocate memory. What should I do with , since it clearly does? I had a decision to make. The Go runtime handles memory allocation and deallocation automatically.
In C, I had a few options. An allocator is a tool that reserves memory (typically on the heap) so a program can store its data structures there. See Allocators from C to Zig if you want to learn more about them. For me, the winner was clear. Modern systems programming languages like Zig and Odin clearly showed the value of allocators.

An is an interface with three methods: , , and . In C, it translates to a struct with function pointers. As I mentioned in the post about porting the io package, this interface representation isn't as efficient as using a static method table, but it's simpler. If you're interested in other options, check out the post on interfaces.

By convention, if a function allocates memory, it takes an allocator as its first parameter. So Go's : Translates to this C code: If the caller doesn't care about using a specific allocator, they can just pass an empty allocator, and the implementation will use the system allocator — , , and from libc. Here's a simplified version of the system allocator (I removed safety checks to make it easier to read): The system allocator is stateless, so it's safe to have a global instance: Here's an example of how to call with an allocator: Way better than hidden allocations!

Besides pure functions, and also provide types like , , and . I ported them using the same approach as with functions. For types that allocate memory, like , the allocator becomes a struct field. The code is pretty wordy — most C developers would dislike using instead of something shorter like . My solution to this problem is to automatically translate Go code to C (which is actually what I do when porting Go's stdlib). If you're interested, check out the post about this approach — Solod: Go can be a better C.

Types that don't allocate, like , need no special treatment — they translate directly to C structs without an allocator field. The package is the twin of , so porting it was uneventful.
Here's a usage example in Go and C side by side: Again, the C code is just a more verbose version of Go's implementation, plus explicit memory allocation.

What's the point of writing C code if it's slow, right? I decided it was time to benchmark the ported C types and functions against their Go versions. To do that, I ported the benchmarking part of Go's package. Surprisingly, the simplified version was only 300 lines long and included everything I needed. Here's a sample benchmark for the type: Reads almost like Go's benchmarks.

To monitor memory usage, I created — a memory allocator that wraps another allocator and keeps track of allocations. The benchmark gets an allocator through the function and wraps it in a to keep track of allocations. There's no auto-discovery, but the manual setup is quite straightforward.

With the benchmarking setup ready, I ran benchmarks on the package. Some functions did well — about 1.5-2x faster than their Go equivalents. But (searching for a substring in a string) was a total disaster — it was nearly 20 times slower than in Go. The problem was caused by the function we looked at earlier. This "pure" Go implementation is just a fallback: on most platforms, Go uses a specialized version of written in assembly. For the C version, the easiest solution was to use , which is also optimized for most platforms. With this fix, the benchmark results changed drastically. Still not quite as fast as Go, but close. Honestly, I don't know why the -based implementation is still slower than Go's assembly here, but I decided not to pursue it any further.

After running the rest of the function benchmarks, the ported versions won all of them except for two. Benchmarking details

is a common way to compose strings from parts in Go, so I tested its performance too. The results were worse than I expected. Here, the C version performed about the same as Go, but I expected it to be faster.
Unlike , is written entirely in Go, so there's no reason the ported version should lose in this benchmark. The method looked almost identical in Go and C: Go's automatically grows the backing slice, while does it manually ( , on the contrary, doesn't grow the slice — it's merely a wrapper). So, there shouldn't be any difference. I had to investigate.

Looking at the compiled binary, I noticed a difference in how the functions returned results. Go returns multiple values in separate registers, so uses three registers: one for the 8-byte , two for the interface (implemented as two 8-byte pointers). But in C, was a single struct made up of two unions and a pointer. Of course, this 56-byte monster can't be returned in registers — the C calling convention passes it through memory instead. Since is on the hot path in the benchmark, I figured this had to be the issue. So I switched from a single monolithic type to signature-specific types for multi-return pairs. Now, the implementation in C looked like this: is only 16 bytes — small enough to be returned in two registers. Problem solved!

But it wasn't — the benchmark only showed a slight improvement. After looking into it more, I finally found the real issue: unlike Go, the C compiler wasn't inlining calls. Adding and moving to the header file made all the difference: 2-4x faster. That's what I was hoping for!

Porting and was a mix of easy parts and interesting challenges. The pure functions were straightforward — just translate the syntax and pay attention to operator precedence. The real design challenge was memory management. Using allocators turned out to be a good solution, making memory allocation clear and explicit without being too difficult to use. The benchmarks showed that the C versions outperformed Go in most cases, sometimes by 2-4x. The only exceptions were and , where Go relies on hand-written assembly.
The optimization was an interesting challenge: what seemed like a return-type issue was actually an inlining problem, and fixing it gave a nice speed boost. There's a lot more of Go's stdlib to port. In the next post, we'll cover — a very unique Go package. In the meantime, if you'd like to write Go that translates to C — with no runtime and manual memory management — I invite you to try Solod. The and packages are included, of course.

- implements bit counting and manipulation functions.
- implements functions for UTF-8 encoded text.

- Loop over the slice indexes with ( is a macro that returns , similar to Go's built-in).
- Access the i-th byte with (a bounds-checking macro that returns ).

- Use a reliable garbage collector like Boehm GC to closely match Go's behavior.
- Allocate memory with libc's and have the caller free it later with .
- Introduce allocators.

- It's obvious whether a function allocates memory or not: if it has an allocator as a parameter, it allocates.
- It's easy to use different allocation methods: you can use for one function, an arena for another, and a stack allocator for a third.
- It helps with testing and debugging: you can use a tracking allocator to find memory leaks, or a failing allocator to test error handling.

- Figuring out how many iterations to run.
- Running the benchmark function in a loop.
- Recording metrics (ns/op, MB/s, B/op, allocs/op).
- Reporting the results.

Maurycy 2 weeks ago

GopherTree

While gopher is usually seen as a proto-web, it's really closer to FTP. It has no markup format, no links and no URLs. Files are arranged hierarchically, and can be in any format. This rigid structure allows clients to get creative with how it's displayed ... which is why I'm extremely disappointed that everyone renders gopher menus like shitty websites: You see all that text mixed into the menu? Those are informational selectors: a non-standard feature that's often used to recreate hypertext. I know this "limited web" aesthetic appeals to certain circles, but it removes the things that make the protocol interesting.

It would be nice to display gopher menus like what they are, a directory tree. This makes it easy to browse collections of files, and helps avoid the Wikipedia problem: absentmindedly clicking links until you realize it's 3 AM and you have a thousand tabs open... and that you never finished what you wanted to read in the first place.

I've made the decision to hide informational selectors by default. These have two main uses: creating faux hypertext and adding ASCII art banners. ASCII art banners are simply annoying: having one in each menu looks cute in a web browser, but having 50 copies cluttering up the directory tree is... not great. Hypertext doesn't work well. In the strict sense, looking ugly is better than not working at all — but almost everyone who does this also hosts on the web, so it's not a huge loss.

The client also has a built-in text viewer, with pagination and proper word-wrap. It supports both UTF-8 and Latin-1 text encodings, but this has to be selected manually: gopher has no mechanism to indicate encoding. (But most text looks the same in both.)

Bookmarks work by writing items to a locally stored gopher menu, which also serves as a "homepage" of sorts. Because it's just a file, I didn't bother implementing any advanced editing features: any text editor works fine for that.
The bookmark code is UNIX/Linux specific, but porting should be possible. All this fits within a thousand lines of C code, the same as my ultra-minimal web browser. While arguably a browser, that one was practically unusable: lacking basic features like a back button or pagination. The gopher version of the same size is complete enough to replace Lynx as my preferred client. Usage instructions can be found at the top of the source file.

/projects/gopher/gophertree.c : Source and instructions
/projects/tinyweb/ : 1000 line web browser
https://datatracker.ietf.org/doc/html/rfc1436 : Gopher RFC

Maurycy 2 weeks ago

My ramblings are available over gopher

It has recently come to my attention that people need a thousand lines of C code to read my website. This is unacceptable. For simpler clients, my server supports gopher: The response is just a text file: it has no markup, no links and no embedded content.

For navigation, gopher uses specially formatted directory-style menus: The first character on a line indicates the type of the linked resource. The type is followed by a tab-separated list containing a display name, file path, hostname and port. Lines beginning with an "i" are purely informational and do not link to anything. (This is non-standard, but widely used.)

Storing metadata in links is weird to modern sensibilities, but it keeps the protocol simple. Menus are the only thing that the client has to understand: there are no URLs, no headers, no MIME types — the only thing sent to the server is the selector (file path), and the only thing received is the file. ... as a bonus, this one-liner can download files: That's quite clunky, but there are lots of programs that support it. If you have Lynx installed, you should be able to just point it at this URL: ... although you will want to put in because it's not 1991 anymore [Citation Needed]

I could use informational lines to replicate the web's navigation by making everything a menu — but that would be against the spirit of the thing: gopher is a document retrieval protocol, not a hypertext format. Instead, I converted all my blog posts to plain text and set up some directory-style navigation.

I've actually been moving away from using inline links anyway because they have two opposing design goals: while reading, links must be normal text; when you're done, links must be distinct clickable elements. I've never been able to find a good compromise: links are always either distracting to the reader, annoying to find/click, or both.

Also, to preempt all the emails: ... what about Gemini? (The protocol, not the autocomplete from google.)
Gemini is the popular option for non-web publishing... but honestly, it feels like someone took HTTP and slapped markdown on top of it. This is a Gemini request... ... and this is an HTTP request: For both protocols, the server responds with metadata followed by hypertext. It's true that HTTP is more verbose, but 16 extra bytes doesn't create a noticeable difference.

Unlike gopher, which has a unique navigation model and is of historical interest, Gemini is just the web but with limited features... so what's the point? I can already write websites that don't have ads or autoplaying videos, and you can already use browsers that don't support features you don't like. After stripping away all the fluff (CSS, JS, etc.) the web is quite simple: a functional browser can be put together in a weekend. ... and unlike Gemini, doing so won't throw out 35 years of compatibility: someone with Chrome can read a barebones website, and someone with Lynx can read normal sites.

Gemini is a technical solution to an emotional problem. Most people have a bad taste for HTTP from the experience of visiting commercial websites. Gemini is the obvious choice for someone looking for "the web but without VC types". It doesn't make any sense when I'm looking for an interesting (and humorously outdated) protocol.

/projects/tinyweb/ : A browser in 1000 lines of C ...
/about.html#links : ... and thoughts on links for navigation.
https://www.rfc-editor.org/rfc/rfc1436.html : Gopher RFC
https://lynx.invisible-island.net/ : Feature-complete text-based web browser

Max Bernstein 2 weeks ago

Using Perfetto in ZJIT

Originally published on Rails At Scale. Look! A trace of slow events in a benchmark! Hover over the image to see it get bigger. Now read on to see what the slow events are and how we got this pretty picture.

The first rule of just-in-time compilers is: you stay in JIT code. The second rule of JIT is: you STAY in JIT code! When control leaves the compiled code to run in the interpreter—what the ZJIT team calls either a “side-exit” or a “deopt”, depending on who you talk to—things slow down. In a well-tuned system, this should happen pretty rarely. Right now, because we’re still bringing up the compiler and runtime system, it happens more than we would like. We’re reducing the number of exits over time.

We can track our side-exit reduction progress with , which, on process exit, prints out a tidy summary of the counters for all of the bad stuff we track. It’s got side-exits. It’s got calls to C code. It’s got calls to slow-path runtime helpers. It’s got everything. Here is a chopped-up sample of stats output for the Lobsters benchmark, which is a large Rails app: (I’ve cut out significant chunks of the stats output and replaced them with because it’s overwhelming the first time you see it.)

The first thing you might note is that the thing I just described as terrible for performance is happening over twelve million times. The second thing you might notice is that despite this, we’re seemingly staying in JIT code a high percentage of the time. Or are we? Is 80% high? Is a 4.5% class guard miss ratio high? What about 11% for shapes? It’s hard to say.

The counters are great because they’re quick and they’re reasonably stable proxies for performance. There’s no substitute for painstaking measurements on a quiet machine, but if the counter for Bad Slow Thing goes down (and others do not go up), we’re probably doing a good job. But they’re not great for building intuition. For intuition, we want more tangible numbers. We want to see things.
The third thing is that you might ask yourself, “self, where are these exits coming from?” Unfortunately, counters cannot tell you that. For that, we want stack traces. This lets us know where in the guest (Ruby) code an exit is triggered. Ideally we would also want some notion of time: we would want to know not just where these events happen but also when. Are the exits happening early, at application boot? At warmup? Even during what should be steady-state application time? Hard to say. So we need more tools.

Thankfully, Perfetto exists. Perfetto is a system for visualizing and analyzing traces and profiles that your application generates. It has both a web UI and a command-line UI. We can emit traces for Perfetto and visualize them there. Take a look at this sample ZJIT Perfetto trace generated by running Ruby with 1 . What do you see?

I see a couple arrows on the left. Arrows indicate “instant” point-in-time events. Then I see a mess of purple to the right of that until the end of the trace. Hover over an arrow. Find out that each arrow is a side-exit. Scream silently. But it’s a friendly arrow. It tells you what the side-exit reason is. If you click it, it even tells you the stack trace in the pop-up panel on the bottom. If we click a couple of them, maybe we can learn more. We can also zoom by mousing over the track, holding Ctrl, and scrolling. That will let us look closer. But there are so many…

Fortunately, Perfetto also provides a SQL interface to the traces. We can write a query to aggregate all of the side-exit events from the table and line them up with the topmost method from the backtrace arguments in the table: This pulls up a query box at the bottom showing us that there are a couple big hotspots: It even has a helpful option to export the results as a Markdown table so I can paste (an edited version) into this blog post: Looks like we should figure out why we’re having shape misses so much; fixing that will clear up a lot of exits.
(Hint: it’s because once we make our first guess about what we think the object shape will be, we don’t re-assess… yet.) This has been a taste of Perfetto. There’s probably a lot more to explore. Please join the ZJIT Zulip and let us know if you have any cool tracing or exploring tricks.

Now I’ll explain how you too can use Perfetto from your system. Adding support to ZJIT was pretty straightforward. The first thing is that you’ll need some way to get trace data out of your system. We write to a file with a well-known location ( ), but you could do any number of things. Perhaps you can stream events over a socket to another process, or to a server that aggregates them, or store them internally and expose a webserver that serves them over the internet, or… anything, really.

Once you have that, you need a couple lines of code to emit the data. Perfetto accepts a number of formats. For example, in his excellent blog post, Tristan Hume opens with a simple snippet of code for logging Chromium Trace JSON-formatted events (lightly modified by me): This snippet is great. It shows, end-to-end, writing a stream of one event. It is a complete (X) event, as opposed to either: two discrete timestamped begin (B) and end (E) events that book-end something, or an instant (i) event that has no duration, or a couple other event types in the Chromium Trace Event Format doc. It was enough to get me started.

Since it’s JSON, and we have a lot of side exits, the trace quickly ballooned to 8GB for a several-second benchmark. Not great. Now, part of this is our fault—we should side-exit less—and part of it is just the verbosity of JSON. Thankfully, Perfetto ingests more compact binary formats, such as the Fuchsia trace format. In addition to being more compact, FXT even supports string interning. After modifying the tracer to emit FXT, we ended up closer to 100MB for the same benchmark. We can reduce further by sampling—not writing every exit to the trace, but instead every K exits (for some (probably prime) K). This is why we provide the option. Check out the trace writer implementation as of the time this article was written.

We could trace: when methods get compiled, how big the generated code is, how long each compile phase takes, when (and where) invalidation events happen, when (and where) allocations happen from JITed code, and garbage collection events.

Visualizations are awesome. Get your data in the right format so you can ask the right questions easily. Thanks, Perfetto! Also, it looks like visualizations are now available in Perfetto canary. Time to go make some fun histograms…

This is also sampled/strobed, so not every exit is in there. This is just 1/K of them for some K that I don’t remember.  ↩

Anton Zhiyanov 3 weeks ago

Porting Go's io package to C

Creating a subset of Go that translates to C was never my end goal. I liked writing C code with Go, but without the standard library it felt pretty limited. So, the next logical step was to port Go's stdlib to C. Of course, this isn't something I could do all at once. So I started with the standard library packages that had the fewest dependencies, and one of them was the package. This post is about how that went.

io package • Slices • Multiple returns • Errors • Interfaces • Type assertion • Specialized readers • Copy • Wrapping up

is one of the core Go packages. It introduces the concepts of readers and writers, which are also common in other programming languages. In Go, a reader is anything that can read some raw data (bytes) from a source into a slice: A writer is anything that can take some raw data from a slice and write it to a destination: The package defines many other interfaces, like and , as well as combinations like and . It also provides several functions, the most well-known being , which copies all data from a source (represented by a reader) to a destination (represented by a writer): C, of course, doesn't have interfaces. But before I get into that, I had to make several other design decisions.

In general, a slice is a linear container that holds N elements of type T. Typically, a slice is a view of some underlying data. In Go, a slice consists of a pointer to a block of allocated memory, a length (the number of elements in the slice), and a capacity (the total number of elements that can fit in the backing memory before the runtime needs to re-allocate): Interfaces in the package work with fixed-length slices (readers and writers should never append to a slice), and they only use byte slices. So, the simplest way to represent this in C could be: But since I needed a general-purpose slice type, I decided to do it the Go way instead: Plus a bounds-checking helper to access slice elements: Usage example: So far, so good.
Let's look at the method again: It returns two values: an and an . C functions can only return one value, so I needed to figure out how to handle this. The classic approach would be to pass output parameters by pointer, like or . But that doesn't compose well and looks nothing like Go. Instead, I went with a result struct: The union can store any primitive type, as well as strings, slices, and pointers. The type combines a value with an error. So, our method (let's assume it's just a regular function for now): Translates to: And the caller can access the result like this:

For the error type itself, I went with a simple pointer to an immutable string: Plus a constructor macro: I wanted to avoid heap allocations as much as possible, so I decided not to support dynamic errors. Only sentinel errors are used, and they're defined at the file level like this: Errors are compared by pointer identity ( ), not by string content — just like sentinel errors in Go. A error is a pointer. This keeps error handling cheap and straightforward.

This was the big one. In Go, an interface is a type that specifies a set of methods. Any concrete type that implements those methods satisfies the interface — no explicit declaration needed. In C, there's no such mechanism. For interfaces, I decided to use "fat" structs with function pointers. That way, Go's : Becomes an struct in C: The pointer holds the concrete value, and each method becomes a function pointer that takes as its first argument. This is less efficient than using a static method table, especially if the interface has a lot of methods, but it's simpler. So I decided it was good enough for the first version. Now functions can work with interfaces without knowing the specific implementation: Calling a method on the interface just goes through the function pointer: Go's interface is more than just a value wrapper with a method table.
It also stores type information about the value it holds: Since the runtime knows the exact type inside the interface, it can try to "upgrade" the interface (for example, a regular ) to another interface (like ) using a type assertion: The last thing I wanted to do was reinvent Go's dynamic type system in C, so dropping this feature was an easy decision.

There's another kind of type assertion, though — when we unwrap the interface to get the value of a specific type: And this kind of assertion is quite possible in C. All we have to do is compare function pointers: If two different types happened to share the same method implementation, this would break. In practice, each concrete type has its own methods, so the function pointer serves as a reliable type tag.

After I decided on the interface approach, porting the actual types was pretty easy. For example, wraps a reader and stops with EOF after reading N bytes: The logic is straightforward: if there are no bytes left, return EOF. Otherwise, if the buffer is bigger than the remaining size, shorten it. Then, call the underlying reader and decrease the remaining size. Here's what the ported C code looks like: A bit more verbose, but nothing special. The multiple return values, the interface call with , and the slice handling are all implemented as described in previous sections.

is where everything comes together. Here's the simplified Go version: In Go, allocates its buffer on the heap with . I could take a similar approach in C — make take an allocator and use it to create the buffer like this: But since this is just a temporary buffer that only exists during the function call, I decided stack allocation was a better choice: allocates memory on the stack with a bounds-checking macro that wraps C's . It moves the stack pointer and gives you a chunk of memory that's automatically freed when the function returns.
People often avoid using because it can cause a stack overflow, but using a bounds-checking wrapper fixes this issue. Another common concern with is that it's not block-scoped — the memory stays allocated until the function exits. However, since we only allocate once, this isn't a problem. Here's the simplified C version of : Here, you can see all the parts from this post working together: a function accepting interfaces, slices passed to interface methods, a result type wrapping multiple return values, error sentinels compared by identity, and a stack-allocated buffer used for the copy.

Porting Go's package to C meant solving a few problems: representing slices, handling multiple return values, modeling errors, and implementing interfaces using function pointers. None of this needed anything fancy — just structs, unions, functions, and some macros. The resulting C code is more verbose than Go, but it's structurally similar and easy enough to read, and this approach should work well for other Go packages too.

The package isn't very useful on its own — it mainly defines interfaces and doesn't provide concrete implementations. So, the next two packages to port were naturally and — I'll talk about those in the next post. In the meantime, if you'd like to write Go that translates to C — with no runtime and manual memory management — I invite you to try Solod. The package is included, of course.

Anton Zhiyanov 3 weeks ago

Solod: Go can be a better C

I'm working on a new programming language named Solod ( So ). It's a strict subset of Go that translates to C, without hidden memory allocations and with source-level interop. Highlights: So supports structs, methods, interfaces, slices, multiple returns, and defer. To keep things simple, there are no channels, goroutines, closures, or generics. So is for systems programming in C, but with Go's syntax, type safety, and tooling.

Hello world • Language tour • Compatibility • Design decisions • FAQ • Final thoughts

This Go code in a file : Translates to a header file : Plus an implementation file : In terms of features, So is an intersection between Go and C, making it one of the simplest C-like languages out there — on par with Hare. And since So is a strict subset of Go, you already know it if you know Go. It's pretty handy if you don't want to learn another syntax. Let's briefly go over the language features and see how they translate to C.

Variables • Strings • Arrays • Slices • Maps • If/else and for • Functions • Multiple returns • Structs • Methods • Interfaces • Enums • Errors • Defer • C interop • Packages

So supports basic Go types and variable declarations: is translated to ( ), to ( ), and to ( ). is not treated as an interface. Instead, it's translated to . This makes handling pointers much easier and removes the need for . is translated to (for pointer types).

Strings are represented as type in C: All standard string operations are supported, including indexing, slicing, and iterating with a for-range loop. Converting a string to a byte slice and back is a zero-copy operation: Converting a string to a rune slice and back allocates on the stack with : There's a stdlib package for heap-allocated strings and various string operations.

Arrays are represented as plain C arrays ( ): on arrays is emitted as a compile-time constant. Slicing an array produces a .
Slices are represented as type in C: All standard slice operations are supported, including indexing, slicing, and iterating with a for-range loop. As in Go, a slice is a value type. Unlike in Go, a nil slice and an empty slice are the same thing: allocates a fixed amount of memory on the stack ( ). only works up to the initial capacity and panics if it's exceeded. There's no automatic reallocation; use the stdlib package for heap allocation and dynamic arrays.

Maps are fixed-size and stack-allocated, backed by parallel key/value arrays with linear search. They are pointer-based reference types, represented as in C. No delete, no resize. Only use maps when you have a small, fixed number of key-value pairs. For anything else, use heap-allocated maps from the package (planned). Most of the standard map operations are supported, including getting/setting values and iterating with a for-range loop: As in Go, a map is a pointer type. A map emits as in C.

If-else and for come in all shapes and sizes, just like in Go. Standard if-else with chaining: Init statement (scoped to the if block): Traditional for loop: While-style loop: Range over an integer:

Regular functions translate to C naturally: Named function types become typedefs: Exported functions (capitalized) become public C symbols prefixed with the package name ( ). Unexported functions are . Variadic functions use the standard syntax and translate to passing a slice: Function literals (anonymous functions and closures) are not supported.

So supports two-value multiple returns in two patterns: and . Both cases translate to C type: Named return values are not supported.

Structs translate to C naturally: works with types and values:

Methods are defined on struct types with pointer or value receivers: Pointer receivers pass in C and cast to the struct pointer.
Value receivers pass the struct by value, so modifications operate on a copy: Calling methods on values and pointers emits pointers or values as necessary: Methods on named primitive types are also supported.

Interfaces in So are like Go interfaces, but they don't include runtime type information. Interface declarations list the required methods: In C, an interface is a struct with a pointer and function pointers for each method (less efficient than using a static method table, but simpler; this might change in the future): Just as in Go, a concrete type implements an interface by providing the necessary methods: Passing a concrete type to functions that accept interfaces: Type assertion works for concrete types ( ), but not for interfaces ( ). Type switch is not supported. Empty interfaces ( and ) are translated to .

So supports typed constant groups as enums: Each constant is emitted as a C : is supported for integer-typed constants: Iota values are evaluated at compile time and translated to integer literals:

Errors use the type (a pointer): So only supports sentinel errors, which are defined at the package level using (implemented as a compiler built-in): Errors are compared using . This is an O(1) operation (it compares pointers, not strings): Dynamic errors ( ), local error variables ( inside functions), and error wrapping are not supported.

schedules a function or method call to run at the end of the enclosing scope. The scope can be either a function (as in Go): Or a bare block (unlike Go): Deferred calls are emitted inline (before returns, panics, and scope end) in LIFO order: Defer is not supported inside other scopes like or . Include a C header file with : Declare an external C type (excluded from emission) with : Declare an external C function (no body or ): When calling extern functions, and arguments are automatically decayed to their C equivalents: string literals become raw C strings ( ), string values become , and slices become raw pointers.
This makes interop cleaner: The decay behavior can be turned off with the flag: The package includes helpers for converting C pointers back to So string and slice types. The package is also available and is implemented as compiler built-ins.

Each Go package is translated into a single + pair, regardless of how many files it contains. Multiple files in the same package are merged into one file, separated by comments. Exported symbols (capitalized names) are prefixed with the package name: Unexported symbols (lowercase names) keep their original names and are marked : Exported symbols are declared in the file (with for variables). Unexported symbols only appear in the file. Importing a So package translates to a C : Calling imported symbols uses the package prefix: That's it for the language tour!

So generates C11 code that relies on several GCC/Clang extensions: You can use GCC, Clang, or to compile the transpiled C code. MSVC is not supported. Supported operating systems: Linux, macOS, and Windows (partial support).

So is highly opinionated. Simplicity is key. Fewer features are always better. Every new feature is strongly discouraged by default and should be added only if there are very convincing real-world use cases to support it. This applies to the standard library too — So tries to export as little of Go's stdlib API as possible while still remaining highly useful for real-world use cases. No heap allocations are allowed in language built-ins (like maps, slices, new, or append). Heap allocations are allowed in the standard library, but they must clearly state when an allocation happens and who owns the allocated data.

Fast and easy C interop. Even though So uses Go syntax, it's basically C with its own standard library. Calling C from So, and So from C, should always be simple to write and run efficiently. The So standard library (translated to C) should be easy to add to any C project.

Readability.
There are several languages that claim they can transpile to readable C code. Unfortunately, the C code they generate is usually unreadable, or barely readable at best. So isn't perfect in this area either (though it's arguably better than others), but it aims to produce C code that's as readable as possible.

Go compatibility. So code is valid Go code. No exceptions.

Raw performance. You can definitely write C code by hand that runs faster than code produced by So. Also, some features in So, like interfaces, are currently implemented in a way that's not very efficient, mainly to keep things simple.

Hiding C entirely. So is a cleaner way to write C, not a replacement for it. You should know C to use So effectively.

Go feature parity. Less is more. Iterators aren't coming, and neither are generic methods.

I have heard these questions several times, so they're worth answering.

Why not Rust/Zig/Odin/another language? Because I like C and Go.

Why not TinyGo? TinyGo is lightweight, but it still has a garbage collector, a runtime, and aims to support all Go features. What I'm after is something even simpler, with no runtime at all, source-level C interop, and eventually, Go's standard library ported to plain C so it can be used in regular C projects.

How does So handle memory? Everything is stack-allocated by default. There's no garbage collector or reference counting. The standard library provides explicit heap allocation in the package when you need it.

Is it safe? So itself has few safeguards other than the default Go type checking. It will panic on out-of-bounds array access, but it won't stop you from returning a dangling pointer or forgetting to free allocated memory. Most memory-related problems can be caught with AddressSanitizer in modern compilers, so I recommend enabling it during development by adding to your .

Can I use So code from C (and vice versa)? Yes. So compiles to plain C, therefore calling So from C is just calling C from C.
Calling C from So is equally straightforward.

Can I compile existing Go packages with So? Not really. Go uses automatic memory management, while So uses manual memory management. So also supports far fewer features than Go. Neither Go's standard library nor third-party packages will work with So without changes.

How stable is this? Not ready for production at the moment.

Where's the standard library? There is a growing set of high-level packages ( , , , ...). There are also low-level packages that wrap the libc API ( , , , ...). Check the links below for more details.

Even though So isn't ready for production yet, I encourage you to try it out on a hobby project, or just keep an eye on it if you like the concept.

Go in, C out. You write regular Go code and get readable C11 as output.
Zero runtime. No garbage collection, no reference counting, no hidden allocations. Everything is stack-allocated by default. Heap is opt-in through the standard library.
Native C interop. Call C from So and So from C — no CGO, no overhead.
Go tooling works out of the box — syntax highlighting, LSP, linting and "go test".

The GCC/Clang extensions the generated code relies on:
Binary literals ( ) in generated code.
Statement expressions ( ) in macros.
for package-level initialization.
for local type inference in generated code.
for type inference in generic macros.
for and other dynamic stack allocations.

Further reading: Installation and usage • So by example • Language description • Stdlib description • Source code

daniel.haxx.se 1 months ago

Dependency tracking is hard

curl and libcurl are written in C. They are rather low-level components present in many software systems. They are typically not part of any ecosystem at all. They're just a tool and a library.

In lots of places on the web, when you mention an Open Source project you will also get the option to mention which ecosystem it belongs to: npm, go, rust, python etc. There are easily at least a dozen well-known and large ecosystems. curl is not part of any of those.

Recently there's been a push for PURLs ( Package URLs ), for example when describing your specific package in a CVE. A package URL only works when the component is part of an ecosystem. curl is not. We can't specify curl or libcurl using a PURL.

SBOM generators and related scanners use package managers to generate lists of used components and their dependencies. This makes these tools quite frequently just miss and ignore libcurl. It's not listed by the package managers. It's just in there, ready to be used. Like magic.

It is similarly hard for these tools to figure out that curl in turn also depends on and uses other libraries. At build-time you select which — but as we in the curl project primarily just ship tarballs with source code, we cannot tell anyone what dependencies their builds have. The additional libraries libcurl itself uses are all similarly outside of the standard ecosystems.

Part of the explanation for this is also that libcurl and curl are often shipped bundled with the operating system, or are sometimes perceived to be part of the OS. Most graphs, SBOM tools and dependency trackers therefore stop at the binding or system that uses curl or libcurl, without including curl or libcurl themselves. The layer above, so to speak. This makes it hard to figure out exactly how many components and how much software depend on libcurl.

A perfect way to illustrate the problem is to check GitHub and see how many among its vast collection of many millions of repositories depend on curl.
After all, curl is installed in some thirty billion installations, so clearly it is used a lot. (Most of them being libcurl, of course.) GitHub lists one dependency for curl. Repositories that depend on curl/curl: one. Screenshot taken on March 9, 2026. What makes this even more amusing is that it looks like this single dependent repository ( Pupibent/spire ) lists curl as a dependency by mistake.

(think) 1 months ago

Building Emacs Major Modes with TreeSitter: Lessons Learned

Over the past year I’ve been spending a lot of time building TreeSitter-powered major modes for Emacs – clojure-ts-mode (as co-maintainer), neocaml (from scratch), and asciidoc-mode (also from scratch). Between the three projects I’ve accumulated enough battle scars to write about the experience. This post distills the key lessons for anyone thinking about writing a TreeSitter-based major mode, or curious about what it’s actually like.

Before TreeSitter, Emacs font-locking was done with regular expressions and indentation was handled by ad-hoc engines (SMIE, custom indent functions, or pure regex heuristics). This works, but it has well-known problems:

Regex-based font-locking is fragile. Regexes can’t parse nested structures, so they either under-match (missing valid code) or over-match (highlighting inside strings and comments). Every edge case is another regex, and the patterns become increasingly unreadable over time.

Indentation engines are complex. SMIE (the generic indentation engine for non-TreeSitter modes) requires defining operator precedence grammars for the language, which is hard to get right. Custom indentation functions tend to grow into large, brittle state machines. Tuareg’s indentation code, for example, is thousands of lines long.

TreeSitter changes the game because you get a full, incremental, error-tolerant syntax tree for free. Font-locking becomes “match this AST pattern, apply this face”: And indentation becomes “if the parent node is X, indent by Y”: The rules are declarative, composable, and much easier to reason about than regex chains. In practice, ’s entire font-lock and indentation logic fits in about 350 lines of Elisp. The equivalent in tuareg is spread across thousands of lines. That’s the real selling point: simpler, more maintainable code that handles more edge cases correctly. That said, TreeSitter in Emacs is not a silver bullet. Here’s what I ran into.
TreeSitter grammars are written by different authors with different philosophies. The tree-sitter-ocaml grammar provides a rich, detailed AST with named fields. The tree-sitter-clojure grammar, by contrast, deliberately keeps things minimal – it only models syntax, not semantics, because Clojure’s macro system makes static semantic analysis unreliable. 1 This means font-locking forms in Clojure requires predicate matching on symbol text, while in OCaml you can directly match nodes with named fields.

To illustrate: here’s how you’d fontify a function definition in OCaml, where the grammar gives you rich named fields: And here’s the equivalent in Clojure, where the grammar only gives you lists of symbols and you need predicate matching: You can’t learn “how to write TreeSitter queries” generically – you need to learn each grammar individually. The best tools for this are (to visualize the full parse tree) and (to see the node at point). Use them constantly.

You’re dependent on someone else providing the grammar, and quality is all over the map. The OCaml grammar is mature and well-maintained – it’s hosted under the official tree-sitter GitHub org. The Clojure grammar is small and stable by design. But not every language is so lucky.

asciidoc-mode uses a third-party AsciiDoc grammar that employs a dual-parser architecture – one parser for block-level structure (headings, lists, code blocks) and another for inline formatting (bold, italic, links). This is the same approach used by Emacs’s built-in , and it makes sense for markup languages where block and inline syntax are largely independent. The problem is that the two parsers run independently on the same text, and they can disagree. The inline parser misinterprets and list markers as emphasis delimiters, creating spurious bold spans that swallow subsequent inline content.
The workaround is to mark all block-level font-lock rules as overriding so they win over the incorrect inline faces. This doesn’t fix inline elements consumed by the spurious emphasis – that requires an upstream grammar fix. When you hit grammar-level issues like this, you either fix them yourself (which means diving into the grammar’s JavaScript source and C toolchain) or you live with workarounds. Either way, it’s a reminder that your mode is only as good as the grammar underneath it. Getting the font-locking right in asciidoc-mode was probably the most challenging part of all three projects, precisely because of these grammar quirks.

I also ran into a subtle behavior: the default font-lock behavior skips an entire captured range if any position within it already has a face. So if you capture a parent node and a child was already fontified, the whole thing gets skipped silently. The fix is to capture specific child nodes instead. These issues took a lot of trial and error to diagnose. The lesson: budget extra time for font-locking when working with less mature grammars.

Grammars evolve, and breaking changes happen. clojure-ts-mode switched from the stable grammar to the experimental branch because the stable version had metadata nodes as children of other nodes, which caused structural navigation commands to behave incorrectly. The experimental grammar makes metadata standalone nodes, fixing the navigation issues but requiring all queries to be updated. neocaml pins to v0.24.0 of the OCaml grammar. If you don’t pin versions, a grammar update can silently break your font-locking or indentation. The takeaway: always pin your grammar version, and include a mechanism to detect outdated grammars. neocaml tests a query that changed between versions to detect incompatible grammars at startup.

Users shouldn’t have to manually clone repos and compile C code to use your mode. Both of my new modes include grammar recipes: on first use, the mode checks for the grammar and offers to install it if missing.
This works, but requires a C compiler and Git on the user’s machine, which is not ideal. 2

The TreeSitter support in Emacs has been improving steadily, but each version has its quirks:

Emacs 29 introduced TreeSitter support but lacked several APIs. For instance, some functions used for structured navigation don’t exist there, so you need a fallback.

Emacs 30 added sentence navigation and better indentation support. But it also shipped a bug in offsets (#77848) that broke embedded parsers, and another bug that required disabling a TreeSitter-aware command in favor of the classic implementation.

Emacs 31 has an off-by-one bug that causes uncommenting to leave ` *)` behind on multi-line OCaml comments. I had to skip the affected test with a version check.

The lesson: test your mode against multiple Emacs versions, and be prepared to write version-specific workarounds. CI that runs against Emacs 29, 30, and snapshot is essential.

Most TreeSitter grammars ship with query files for syntax highlighting and indentation. Editors like Neovim and Helix use these directly. Emacs doesn’t – you have to manually translate the patterns into Elisp font-lock and indentation rules. This is tedious and error-prone: a single rule from the OCaml grammar’s highlight queries has to be rewritten as an Elisp equivalent for Emacs. The query syntax is nearly identical, but you have to wrap everything in Elisp rule definitions, map upstream capture names to Emacs face names, assign features, and manage override behavior. You end up maintaining a parallel set of queries that can drift from upstream. Emacs 31 will make it possible to use the grammars’ own query files for font-locking, which should help significantly. But for now, you’re hand-coding everything.

When a face isn’t being applied where you expect, debugging can take some detective work. TreeSitter modes define four levels of font-locking, and the default level in Emacs is 3. It’s tempting to pile everything into levels 1–3 so users see maximum highlighting out of the box, but resist the urge.
When every token on the screen has a different color, code starts looking like a Christmas tree and the important things – keywords, definitions, types – stop standing out. Less is more here. Both of my modes distribute features across levels following the same philosophy: essentials first, progressively more detail at higher levels. This way the default experience (level 3) is clean and readable, and users who want the full rainbow can bump to 4. Better yet, they can cherry-pick individual features regardless of level. This gives users fine-grained control without requiring mode authors to anticipate every preference.

Indentation issues are harder to diagnose because they depend on tree structure, rule ordering, and anchor resolution. Remember that rule order matters for indentation too – the first matching rule wins. A typical set of rules reads top to bottom from most specific to most general.

Watch out for the empty-line problem: when the cursor is on a blank line, TreeSitter has no node at point. The indentation engine falls back to the root node as the parent, which typically matches the top-level rule and gives column 0. In neocaml I solved this with a rule that looks at the previous line’s last token to decide indentation.

This is the single most important piece of advice: write tests. Font-lock and indentation are easy to break accidentally, and manual testing doesn’t scale. Both projects use Buttercup (a BDD testing framework for Emacs) with custom test macros. Font-lock tests insert code into a buffer, fontify it, and assert that specific character ranges have the expected face. Indentation tests insert code, reindent it, and assert that the result matches the expected indentation. Integration tests load real source files and verify that both font-locking and indentation survive on the full file. This catches interactions between rules that unit tests miss. One project has 200+ automated tests and the other has even more.
Investing in test infrastructure early pays off enormously – I can refactor indentation rules with confidence because the suite catches regressions immediately. When I became the maintainer of clojure-mode many years ago, I really struggled with making changes. There were no font-lock or indentation tests, so every change was a leap of faith – you’d fix one thing and break three others without knowing until someone filed a bug report. I spent years working on a testing approach I was happy with, alongside many great contributors, and the return on investment was massive. The same approach – almost the same test macros – carried over directly to clojure-ts-mode when we built the TreeSitter version. And later I reused the pattern again in neocaml and asciidoc-mode. One investment in testing infrastructure, four projects benefiting from it.

I know that automated tests, for whatever reason, never gained much traction in the Emacs community. Many popular packages have no tests at all. I hope stories like this convince you that investing in tests is really important and pays off – not just for the project where you write them, but for every project you build after.

This next tip is implementation-specific but applies broadly: compiling TreeSitter queries at runtime is expensive. If you’re building queries dynamically at mode init time, consider pre-compiling them as constant values. This made a noticeable difference in startup time.

The Emacs community has settled on a -ts-mode suffix convention for TreeSitter-based modes: python-ts-mode, c-ts-mode, and so on. This makes sense when both a legacy mode and a TreeSitter mode coexist in Emacs core – users need to choose between them. But I think the convention is being applied too broadly, and I’m afraid the resulting name fragmentation will haunt the community for years. For new packages that don’t have a legacy counterpart, the suffix is unnecessary. I named my packages neocaml and asciidoc-mode, without the -ts-mode suffix, because there was no prior mode to disambiguate from.
The -ts- infix is an implementation detail that shouldn’t leak into the user-facing name. Will we rename everything again when TreeSitter becomes the default and the non-TS variants are removed? Be bolder with naming. If you’re building something new, give it a name that makes sense on its own merits, not one that encodes the parsing technology in the package name.

I think the full transition to TreeSitter in the Emacs community will take 3–5 years, optimistically. There are hundreds of major modes out there, many maintained by a single person in their spare time. Converting a mode from regex to TreeSitter isn’t just a mechanical translation – you need to understand the grammar, rewrite font-lock and indentation rules, handle version compatibility, and build a new test suite. That’s a lot of work.

Interestingly, this might be one area where agentic coding tools can genuinely help. The structure of TreeSitter-based major modes is fairly uniform: grammar recipes, font-lock rules, indentation rules, navigation settings, imenu. If you give an AI agent a grammar and a reference to an existing high-quality mode, it could probably scaffold a reasonable new mode fairly quickly. The hard parts – debugging grammar quirks, handling edge cases, getting indentation just right – would still need human attention, but the boilerplate could be automated.

Still, knowing the Emacs community, I wouldn’t be surprised if a full migration never actually completes. Many old-school modes work perfectly fine, their maintainers have no interest in TreeSitter, and “if it ain’t broke, don’t fix it” is a powerful force. And that’s okay – diversity of approaches is part of what makes Emacs Emacs.

TreeSitter is genuinely great for building Emacs major modes. The code is simpler, the results are more accurate, and incremental parsing means everything stays fast even on large files. I wouldn’t go back to regex-based font-locking willingly. But it’s not magical.
Grammars are inconsistent across languages, the Emacs APIs are still maturing, you can’t reuse the grammars’ own query files (yet), and you’ll hit version-specific bugs that require tedious workarounds. The testing story is better than with regex modes – tree structures are more predictable than regex matches – but you still need a solid test suite to avoid regressions.

If you’re thinking about writing a TreeSitter-based major mode, do it. The ecosystem needs more of them, and the experience of working with syntax trees instead of regexes is genuinely enjoyable. Just go in with realistic expectations, pin your grammar versions, test against multiple Emacs releases, and build your test suite early.

Anyways, I wish an article like this one had existed when I was starting out, so there you have it. I hope that the lessons I’ve learned along the way will help build better modes with TreeSitter down the road. That’s all I have for you today. Keep hacking!

See the excellent scope discussion in the tree-sitter-clojure repo for the rationale.  ↩︎

There’s ongoing discussion in the Emacs community about distributing pre-compiled grammar binaries, but nothing concrete yet.  ↩︎

A few debugging tips. When a face isn’t being applied where you expect:

- Verify that the node type at point matches your query.
- Enable font-lock debugging output to see which rules are firing.
- Check the font-lock feature level – your rule might be in level 4 while the user has the default level 3.
Features are assigned to levels in the mode’s font-lock settings. Remember that rule order matters: without an override, an earlier rule that already fontified a region will prevent later rules from applying. This can be intentional (e.g. builtin types at level 3 take precedence over generic types) or a source of bugs.

For indentation, enable verbose indentation logging – it shows which rule matched for each line, what anchor was computed, and the final column. Use the parse-tree explorer to understand the parent chain. The key question is always: “what is the parent node, and which rule matches it?”

<antirez> 1 month ago

Implementing a clean room Z80 / ZX Spectrum emulator with Claude Code

Anthropic recently released a blog post describing an experiment in which the latest version of Opus, 4.6, was instructed to write a C compiler in Rust, in a “clean room” setup. The experiment methodology left me dubious about the kind of point they wanted to make. Why not provide the agent with the ISA documentation? Why Rust? Writing a C compiler is exactly a giant graph manipulation exercise: the kind of program that is harder to write in Rust. Also, in a clean room experiment, the agent should have access to all the information about well-established computer science results related to optimizing compilers: there are a number of papers that could be easily synthesized into a number of markdown files. SSA, register allocation, instruction selection and scheduling. Those things needed to be researched *first*, as a prerequisite, and the implementation would still be “clean room”. Not allowing the agent to access the Internet, nor any other compiler source code, was certainly the right call. Less understandable is the almost-zero steering principle, but this is coherent with a certain kind of experiment, if the goal was showcasing the completely autonomous writing of a large project. Yet, we all know this is not how coding agents are used in practice, most of the time. Anyone who uses coding agents extensively knows very well how, even without ever touching the code, a few hints here and there completely change the quality of the result.

# The Z80 experiment

I thought it was time to try a similar experiment myself, one that would take one or two hours at most, and that was compatible with my Claude Code Max plan: I decided to write a Z80 emulator, and then a ZX Spectrum emulator (and even more, a CP/M emulator, see later), in a setup that I believe makes more sense as a “clean room”. The result can be found here: https://github.com/antirez/ZOT.

# The process I used

1. I wrote a markdown file with the specification of what I wanted to do.
Just English, high-level ideas about the scope of the Z80 emulator to implement. I said things like: it should execute a whole instruction at a time, not a single clock step, since this emulator must be runnable on things like an RP2350 or similarly limited hardware. The emulator should correctly track the clock cycles elapsed (and I specified we could use this feature later in order to implement the ZX Spectrum contention with the ULA during memory accesses), provide memory access callbacks, and should emulate all the known official and unofficial instructions of the Z80. For the Spectrum implementation, performed as a successive step, I provided much more information in the markdown file: the kind of rendering I wanted in the RGB buffer, and how it needed to be optional so that embedded devices could render the scanlines directly as they transferred them to the ST77xx display (or similar); how it should be possible to interact with the I/O port to set the EAR bit to simulate cassette loading in a very authentic way; and many other desiderata I had about the emulator. This file also included the rules that the agent needed to follow, like:

* Accessing the internet is prohibited, but you can use the specification and test vectors files I added inside ./z80-specs.
* Code should be simple and clean, never over-complicate things.
* Each solid progress should be committed in the git repository.
* Before committing, you should test that what you produced is high quality and that it works.
* Write a detailed test suite as you add more features. The tests must be re-executed at every major change.
* Code should be very well commented: things must be explained in terms that even people not well versed in certain Z80 or Spectrum internals should understand.
* Never stop for prompting, the user is away from the keyboard.
* At the end of this file, create a work-in-progress log, where you note what you already did and what is missing. Always update this log.
* Read this file again after each context compaction.

2. Then, I started a Claude Code session, and asked it to fetch all the useful documentation on the internet about the Z80 (later I did the same for the Spectrum), and to extract only the useful factual information into markdown files. I also provided the binary files for the most ambitious test vectors for the Z80, the ZX Spectrum ROM, and a few other binaries that could be used to test if the emulator actually executed the code correctly. Once all this information was collected (it is part of the repository, so you can inspect what was produced), I completely discarded the Claude Code session in order to make sure that no contamination with source code seen during the search was possible.

3. I started a new session, and asked it to check the specification markdown file and all the documentation available, and start implementing the Z80 emulator. The rules were to never access the Internet for any reason (I supervised the agent while it was implementing the code, to make sure this didn’t happen), and to never search the disk for similar source code, as this was a “clean room” implementation.

4. For the Z80 implementation, I did zero steering. For the Spectrum implementation I used extensive steering for implementing the TAP loading. More about my feedback to the agent later in this post.

5. As a final step, I copied the repository to /tmp, removed the “.git” repository files completely, started a new Claude Code (and Codex) session and claimed that the implementation was likely stolen or too strongly inspired by somebody else's work. The task was to check against all the major Z80 implementations whether there was evidence of theft. The agents (both Codex and Claude Code), after extensive search, were not able to find any evidence of copyright issues.
The only similar parts were about well-established emulation patterns and things that are Z80-specific and can’t be done differently; the implementation looked distinct from all the other implementations in a significant way.

# Results

Claude Code worked for 20 or 30 minutes in total, and produced a Z80 emulator that was able to pass ZEXDOC and ZEXALL, in 1200 lines of very readable and well-commented C code (1800 lines with comments and blank lines). The agent was prompted zero times during the implementation; it acted absolutely alone. It never accessed the internet, and the process it used to implement the emulator was one of continuous testing, interacting with the CP/M binaries implementing ZEXDOC and ZEXALL, writing just the CP/M syscalls needed to produce the output on the screen. Multiple times it also used the Spectrum ROM and other binaries that were available, or binaries it created from scratch, to see if the emulator was working correctly. In short: the implementation was performed in a way very similar to how a human programmer would do it, not by outputting a complete implementation from scratch, “uncompressing” it from the weights. Instead, different classes of instructions were implemented incrementally, and there were bugs that were fixed via integration tests, debugging sessions, dumps, printf calls, and so forth.

# Next step: the ZX Spectrum

I repeated the process again. I instructed the documentation-gathering session very accurately about the kind of details I wanted it to search for on the internet, especially the ULA interactions with RAM access, the keyboard mapping, the I/O port, how the cassette tape worked and the kind of PWM encoding used, and how it was encoded into TAP or TZX files.
As I said, this time the design notes were extensive, since I wanted this emulator to be specifically designed for embedded systems: only 48k emulation, optional framebuffer rendering, very little additional memory used (no big lookup tables for ULA/Z80 memory contention), ROM not copied into RAM to avoid using an additional 16k of memory, but just referenced during initialization (so we have just the copy in the executable), and so forth. The agent was able to create very detailed documentation about the ZX Spectrum internals. I provided a few .z80 images of games, so that it could test the emulator in a real setup with real software. Again, I discarded the session and started fresh. The agent started working and finished 10 minutes later, following a process that really fascinates me, and that you probably know very well: you see the agent working using a number of diverse skills. It is expert in everything programming-related, so as it was implementing the emulator, it could immediately write detailed instrumentation code to “look” at what the Z80 was doing step by step, and how this changed the Spectrum emulation state. In this respect, I believe automatic programming to be already super-human – not in the sense that it is currently capable of producing code that humans can’t produce, but in the concurrent usage of different programming languages, system programming techniques, DSP stuff, operating system tricks, math, and everything needed to reach the result in the most immediate way. When it was done, I asked it to write a simple SDL-based integration example. The emulator was immediately able to run the Jetpac game without issues, with working sound, and very little CPU usage even on my slow Dell Linux machine (8% usage of a single core, including SDL rendering). Once the basic stuff was working, I wanted to load TAP files directly, simulating cassette loading.
This was the first time the agent missed a few things, specifically about the timing the Spectrum loading routines expected, and here we are in the territory where LLMs start to perform less efficiently: they can’t easily run the SDL emulator and see the border changing as data is received, and so forth. I asked Claude Code to do a refactoring so that zx_tick() could be called directly and was not part of zx_frame(), and to make zx_frame() a trivial wrapper. This way it was much simpler to sync the EAR bit with what the loading routine expected, without callbacks or the wrong abstractions that it had implemented. After such a change, a few minutes later the emulator could load a TAP file emulating the cassette without problems. This is how it works now:

    do {
        zx_set_ear(zx, tzx_update(&tape, zx->cpu.clocks));
    } while (!zx_tick(zx, 0));

I continued prompting Claude Code in order to make the key bindings more useful, plus a few more things.

# CP/M

One thing that I found really interesting was the ability of the LLM to inspect the COM files for the ZEXDOC / ZEXALL tests for the Z80, easily spot the CP/M syscalls that were used (a total of three), and implement them for the extended Z80 test (executed by make fulltest). So, at this point, why not implement a full CP/M environment? Same process again, same good result in a matter of minutes. This time I interacted with it a bit more for the VT100 / ADM3 terminal escape conversions, reported things not working in WordStar initially, and in a few minutes everything I tested was working well enough (but there are fixes to do, like simulating a 2MHz clock; right now it runs at full speed, making CP/M games impossible to play).

# What is the lesson here?

The obvious lesson is: always provide your agents with design hints and extensive documentation about what they are going to do. Such documentation can be obtained by the agent itself.
And, also, make sure the agent has a markdown file with the rules for how to perform the coding tasks, and a trace of what it is doing, that is updated and re-read quite often. But those tricks, I believe, are quite clear to everybody who has worked extensively with automatic programming in recent months. To think in terms of “what a human would need” is often the best bet, plus a few LLM-specific things, like the forgetting issue after context compaction, the continuous ability to verify it is on the right track, and so forth.

Returning to the Anthropic compiler attempt: one of the steps the agent failed at was the one most strongly related to the idea of memorizing what is in the pretraining set: the assembler. With extensive documentation, I can’t see any way Claude Code (and, even more, GPT5.3-codex, which is in my experience more capable for complex stuff) could fail at producing a working assembler, since it is quite a mechanical process. This is, I think, in contradiction with the idea that LLMs are memorizing the whole training set and uncompressing what they have seen. LLMs can memorize certain over-represented documents and code, but while they can extract such verbatim parts of the code if prompted to do so, they don’t have a copy of everything they saw during training, nor do they spontaneously emit copies of already-seen code in their normal operation. We mostly ask LLMs to create work that requires assembling different knowledge they possess, and the result is normally something that uses known techniques and patterns, but that is new code, not a copy of some pre-existing code.
It is worth noting, too, that humans often follow a less rigorous process than the clean-room rules detailed in this blog post: humans often download the code of different implementations related to what they are trying to accomplish, read it carefully, then try to avoid copying stuff verbatim – but oftentimes they take strong inspiration. This is a process that I find perfectly acceptable, but it is important to keep in mind what actually happens in the reality of code written by humans. After all, information technology evolved so fast partly thanks to this massive cross-pollination effect. For all the above reasons, when I implement code using automatic programming, I don’t have problems releasing it MIT-licensed, like I did with this Z80 project. In turn, this code base will constitute quality input for the next LLMs' training, including open-weights ones.

# Next steps

To make my experiment more compelling, one should try to implement a Z80 and ZX Spectrum emulator without providing any documentation to the agent, and then compare the results of the two implementations. I didn’t find the time to do it, but it could be quite informative.

Anton Zhiyanov 2 months ago

Allocators from C to Zig

An allocator is a tool that reserves memory (typically on the heap) so a program can store its data structures there. Many C programs use the standard libc allocator, or at best let you switch it out for another one like jemalloc or mimalloc. Unlike C, modern systems languages usually treat allocators as first-class citizens. Let's look at how they handle allocation and then create a C allocator following their approach.

Rust • Zig • Odin • C3 • Hare • C • Final thoughts

Rust is one of the older languages we'll be looking at, and it handles memory allocation in a more traditional way. Right now, it uses a global allocator, but there's an experimental Allocator API implemented behind a feature flag (issue #32838). We'll set the experimental API aside and focus on the stable one. The documentation begins with a clear statement:

In a given program, the standard library has one "global" memory allocator that is used, for example, by Box and Vec.

Followed by a vague one:

Currently the default global allocator is unspecified.

It doesn't mean that a Rust program will abort an allocation, of course. In practice, Rust uses the system allocator as the global default (but the Rust developers don't want to commit to this, hence the "unspecified" note).

The global allocator interface is defined by the GlobalAlloc trait in the std::alloc module. It requires the implementor to provide two essential methods – alloc and dealloc – and provides two more based on them – alloc_zeroed and realloc. The Layout struct describes a piece of memory we want to allocate – its size in bytes and alignment.

Memory alignment. Alignment restricts where a piece of data can start in memory: the memory address for the data has to be a multiple of a certain number, which is always a power of 2. Alignment depends on the type of data, and CPUs are designed to read "aligned" memory efficiently.
For example, if you read a 4-byte integer starting at address 0x03 (which is unaligned), the CPU has to do two memory reads – one for the first byte and another for the other three bytes – and then combine them. But if the integer starts at address 0x04 (which is aligned), the CPU can read all four bytes at once. Aligned memory is also needed for vectorized CPU operations (SIMD), where one processor instruction handles a group of values at once instead of just one.

The compiler knows the size and alignment for each type, so we can use the Layout constructor or helper functions to create a valid layout. Don't be surprised that a struct holding a String and a bool takes up 32 bytes. In Rust, the String type can grow, so it stores a data pointer, a length, and a capacity (3 × 8 = 24 bytes). There's also 1 byte for the boolean and 7 bytes of padding (because of 8-byte alignment), making a total of 32 bytes.

System is the default memory allocator provided by the operating system. The exact implementation depends on the platform. It implements the GlobalAlloc trait and is used as the global allocator by default, but the documentation does not guarantee this (remember the "unspecified" note?). If you want to explicitly set System as the global allocator, you can use the #[global_allocator] attribute; you can set a custom allocator as global the same way. To use the global allocator directly, call the alloc and dealloc functions. In practice, people rarely use these directly. Instead, they work with types like Box, Vec, or String that handle allocation for them.

The allocator doesn't abort if it can't allocate memory; instead, it returns a null pointer (which is exactly what the trait documentation recommends). The documentation recommends using the handle_alloc_error function to signal out-of-memory errors: it immediately aborts the process, or panics if the binary isn't linked to the standard library. Unlike the low-level alloc function, types like Box or Vec call handle_alloc_error if allocation fails, so the program usually aborts if it runs out of memory.

Allocator API • Memory allocation APIs

Memory management in Zig is explicit.
There is no default global allocator, and any function that needs to allocate memory accepts an allocator as a separate parameter. This makes the code a bit more verbose, but it matches Zig's goal of giving programmers as much control and transparency as possible.

An allocator in Zig is a struct with an opaque self-pointer and a method table with four methods. Unlike Rust's allocator methods, which take a raw pointer and a size as arguments, Zig's allocator methods take a slice of bytes – a type that combines both a pointer and a length. Another interesting difference is the optional return-address parameter, which is the first return address in the allocation call stack. Some allocators use it to keep track of which function requested memory, which helps with debugging allocation-related issues. Just like in Rust, the raw allocator methods don't return errors; allocation simply signals failure through the return value. Zig also provides type-safe wrappers that you can use instead of calling the allocator methods directly; unlike the raw methods, these allocation functions return an error if they fail.

If a function or method allocates memory, it expects the developer to provide an allocator instance.

Zig's standard library includes several built-in allocators in the std.heap namespace. The page allocator asks the operating system for entire pages of memory; each allocation is a syscall. The fixed-buffer allocator allocates memory from a fixed buffer and doesn't make any heap allocations. The arena allocator wraps a child allocator and allows you to allocate many times and only free once: one call frees all memory, and individual free calls are no-ops. The debug allocator (formerly known as the general-purpose allocator) is a safe allocator that can prevent double-free and use-after-free, and can detect leaks. There is also a general-purpose thread-safe allocator designed for maximum performance on multithreaded machines, and a wrapper around the libc allocator.

Zig doesn't panic or abort when it can't allocate memory.
An allocation failure is just a regular error that you're expected to handle:

Allocators • std.mem.Allocator • std.heap

Odin supports explicit allocators, but, unlike Zig, it's not the only option. In Odin, every scope has an implicit variable that provides a default allocator: If you don't pass an allocator to a function, it uses the one currently set in the context.

An allocator in Odin is a struct with an opaque self-pointer and a single function pointer: Unlike other languages, Odin's allocator uses a single procedure for all allocation tasks. The specific action — like allocating, resizing, or freeing memory — is decided by the parameter. The allocation procedure returns the allocated memory (for and operations) and an error ( on success).

Odin provides low-level wrapper functions in the package that call the allocator procedure using a specific mode: There are also type-safe builtins like / (for a single object) and / (for multiple objects) that you can use instead of the low-level interface: By default, all builtins use the context allocator, but you can pass a custom allocator as an optional parameter: To use a different allocator for a specific block of code, you can reassign it in the context:

Odin's provides two different allocators: When using the temp allocator, you only need a single call to clear all the allocated memory.

Odin's standard library includes several allocators, found in the and packages:
- The procedure returns a general-purpose allocator.
- uses a single backing buffer for allocations, allowing you to allocate many times and only free once.
- detects leaks and invalid memory access, similar to in Zig.
- There are also others, such as or .

Like Zig, Odin doesn't panic or abort when it can't allocate memory. Instead, it returns an error code as the second return value:

Allocators • base:runtime • core:mem

Like Zig and Odin, C3 supports explicit allocators. Like Odin, C3 provides two default allocators: heap and temp.
An allocator in C3 is an interface with an additional option of zeroing or not zeroing the allocated memory: Unlike Zig and Odin, the and methods don't take the (old) size as a parameter — neither directly like Odin nor through a slice like Zig. This makes it a bit harder to create custom allocators, because the allocator has to keep track of the size along with the allocated memory. On the other hand, this approach makes C interop easier (if you use the default C3 allocator): data allocated in C can be freed in C3 without needing to pass the size parameter from the C code.

Like in Odin, allocator methods return an error if they fail. C3 provides low-level wrapper macros in the module that call allocator methods: These either return an error (the -suffix macros) or abort if they fail. There are also functions and macros with similar names in the module that use the global allocator instance: If a function or method allocates memory, it often expects the developer to provide an allocator instance:

C3 provides two thread-local allocator instances: There are functions and macros in the module that use the temporary allocator: The macro releases all temporary allocations when leaving the scope: Some types, like or , use the temp allocator by default if they are not initialized:

C3's standard library includes several built-in allocators, found in the module:
- is a wrapper around libc's malloc/free.
- uses a single backing buffer for allocations, allowing you to allocate many times and only free once.
- detects leaks and invalid memory access.
- There are also others, such as or .

Like Zig and Odin, C3 can return an error in case of allocation failure: C3 can also abort in case of allocation failure: Since the functions and macros in the module use instead of , it looks like aborting on failure is the preferred approach.

Memory Handling • core::mem::allocator • core::mem

Unlike other languages, Hare doesn't support explicit allocators.
The standard library has multiple allocator implementations, but only one of them is used at runtime. Hare's compiler expects the runtime to provide and implementations: The programmer isn't supposed to access them directly (although it's possible by importing and calling or ). Instead, Hare uses them to provide higher-level allocation helpers.

Hare offers two high-level allocation helpers that use the global allocator internally: and . can allocate individual objects. It takes a value, not a type: can also allocate slices if you provide a second parameter (the number of items): works correctly with both pointers to single objects (like ) and slices (like ).

Hare's standard library has three built-in memory allocators: The allocator that's actually used is selected at compile time.

Like other languages, Hare returns an error in case of allocation failure: You can abort on error with : Or propagate the error with :

Dynamic memory allocation • malloc.ha

Many C programs use the standard libc allocator, or at most, let you swap it out for another one using macros: Or using a simple setter: While this might work for switching the libc allocator to jemalloc or mimalloc, it's not very flexible. For example, trying to implement an arena allocator with this kind of API is almost impossible.

Now that we've seen the modern allocator design in Zig, Odin, and C3 — let's try building something similar in C. There are a lot of small choices to make, and I'm going with what I personally prefer. I'm not saying this is the only way to design an allocator — it's just one way out of many.

Our allocator should return an error instead of if it fails, so we'll need an error enum: The allocation function needs to return either a tagged union (value | error) or a tuple (value, error). Since C doesn't have these built in, let's use a custom tuple type: The next step is the allocator interface.
I think Odin's approach of using a single function makes the implementation more complicated than it needs to be, so let's create separate methods like Zig does: This approach to interface design is explained in detail in a separate post: Interfaces in C .

Zig uses byte slices ( ) instead of raw memory pointers. We could make our own byte slice type, but I don't see any real advantage to doing that in C — it would just mean more type casting. So let's keep it simple and stick with like our ancestors did.

Now let's create generic and wrappers: I'm taking for granted here to keep things simple. A more robust implementation should properly check if it is available or pass the type to directly. We can even create a separate pair of helpers for collections: We could use some macro tricks to make and work for both a single object and a collection. But let's not do that — I prefer to avoid heavy-magic macros in this post.

As for the custom allocators, let's start with a libc wrapper. It's not particularly interesting, since it ignores most of the parameters, but still: Usage example:

Now let's use that field to implement an arena allocator backed by a fixed-size buffer: Usage example:

As shown in the examples above, the allocation method returns an error if something goes wrong. While checking for errors might not be as convenient as it is in Zig or Odin, it's still pretty straightforward:

Here's an informal table comparing allocation APIs in the languages we've discussed: In Zig, you always have to specify the allocator. In Odin, passing an allocator is optional. In C3, some functions require you to pass an allocator, while others just use the global one. In Hare, there's a single global allocator.

As we've seen, there's nothing magical about the allocators used in modern languages. While they're definitely more ergonomic and safe than C, there's nothing stopping us from using the same techniques in plain C.

Notes:
- on Unix platforms; on Windows.
- : alignment = 1. Can start at any address (0, 1, 2, 3...).
- : alignment = 4. Must start at addresses divisible by 4 (0, 4, 8, 12...).
- : alignment = 8. Must start at addresses divisible by 8 (0, 8, 16...).
- is for general-purpose allocations. It uses the operating system's heap allocator.
- is for short-lived allocations. It uses a scratch allocator (a kind of growing arena).
- is for general-purpose allocations. It uses the operating system's heap allocator (typically a libc wrapper).
- is for short-lived allocations. It uses an arena allocator.
- The default allocator is based on the algorithm from the Verified sequential malloc/free paper.
- The libc allocator uses the operating system's malloc and free functions from libc.
- The debug allocator uses a simple mmap-based method for memory allocation.


Rewriting pycparser with the help of an LLM

pycparser is my most widely used open source project (with ~20M daily downloads from PyPI [1]). It's a pure-Python parser for the C programming language, producing ASTs inspired by Python's own . Until very recently, it's been using PLY: Python Lex-Yacc for the core parsing. In this post, I'll describe how I collaborated with an LLM coding agent (Codex) to help me rewrite pycparser to use a hand-written recursive-descent parser and remove the dependency on PLY. This has been an interesting experience; the post contains lots of information and is therefore quite long. If you're just interested in the final result, check out the latest code of pycparser - the main branch already has the new implementation.

While pycparser has been working well overall, there were a number of nagging issues that persisted over the years. I began working on pycparser in 2008, and back then using a YACC-based approach for parsing a whole language like C seemed like a no-brainer to me. Isn't this what everyone does when writing a serious parser? Besides, the K&R2 book famously carries the entire grammar of the C99 language in an appendix - so it seemed like a simple matter of translating that to PLY-yacc syntax. And indeed, it wasn't too hard, though there definitely were some complications in building the ASTs for declarations (C's gnarliest part ).

Shortly after completing pycparser, I got more and more interested in compilation and started learning about the different kinds of parsers more seriously. Over time, I grew convinced that recursive descent is the way to go - producing parsers that are easier to understand and maintain (and are often faster!). It all ties in to the benefits of dependencies in software projects as a function of effort . Using parser generators is a heavy conceptual dependency: it's really nice when you have to churn out many parsers for small languages.
But when you have to maintain a single, very complex parser as part of a large project - the benefits quickly dissipate and you're left with a substantial dependency that you constantly grapple with. And then there are the usual problems with dependencies: dependencies get abandoned, and they may also develop security issues. Sometimes, both of these become true.

Many years ago, pycparser forked and started vendoring its own version of PLY. This was part of transitioning pycparser to a dual Python 2/3 code base when PLY was slower to adapt. I believe this was the right decision, since PLY "just worked" and I didn't have to deal with active (and very tedious in the Python ecosystem, where packaging tools are replaced faster than dirty socks) dependency management.

A couple of weeks ago this issue was opened for pycparser. It turns out that some old PLY code triggers security checks used by some Linux distributions; while this code was fixed in a later commit of PLY, PLY itself was apparently abandoned and archived in late 2025. And guess what? That happened in the middle of a large rewrite of the package, so re-vendoring the pre-archiving commit seemed like a risky proposition. On the issue it was suggested that "hopefully the dependent packages move on to a non-abandoned parser or implement their own"; I originally laughed this idea off, but then it got me thinking... which is what this post is all about.

The original K&R2 grammar for C99 had - famously - a single shift-reduce conflict having to do with dangling elses belonging to the most recent if statement. And indeed, other than the famous lexer hack used to deal with C's type name / ID ambiguity , pycparser only had this single shift-reduce conflict. But things got more complicated. Over the years, features were added that weren't strictly in the standard but were supported by all the industrial compilers.
The more advanced C11 and C23 standards weren't beholden to the promises of conflict-free YACC parsing (since almost no industrial-strength compilers use YACC at this point), so all caution went out of the window. The latest (PLY-based) release of pycparser has many reduce-reduce conflicts [2]; these are a severe maintenance hazard, because it means the parsing rules essentially have to be tie-broken by order of appearance in the code. This is very brittle; pycparser has only managed to maintain its stability and quality through its comprehensive test suite. Over time, it became harder and harder to extend, because YACC parsing rules have all kinds of spooky-action-at-a-distance effects. The straw that broke the camel's back was this PR, which again proposed to increase the number of reduce-reduce conflicts [3]. This - again - prompted me to think "what if I just dump YACC and switch to a hand-written recursive descent parser", and here we are.

None of the challenges described above are new; I've been pondering them for many years now, and yet biting the bullet and rewriting the parser didn't feel like something I'd like to get into. By my private estimates it'd take at least a week of deep heads-down work to port the gritty 2000 lines of YACC grammar rules to a recursive descent parser [4]. Moreover, it wouldn't be a particularly fun project either - I didn't feel like I'd learn much new, and my interests have shifted away from this project. In short, the potential well was just too deep.

I've definitely noticed the improvement in capabilities of LLM coding agents in the past few months, and many reputable people online rave about using them for increasingly larger projects. That said, would an LLM agent really be able to accomplish such a complex project on its own? This isn't just a toy; it's thousands of lines of dense parsing code. What gave me hope is the concept of conformance suites mentioned by Simon Willison .
Agents seem to do well when there's a very clear and rigid goal function - such as a large, high-coverage conformance test suite. And pycparser has a very extensive one : over 2500 lines of test code parsing various C snippets to ASTs with expected results, grown over a decade and a half of real issues and bugs reported by users. I figured the LLM could either succeed or fail and throw its hands up in despair, but it's quite unlikely to produce a wrong port that would still pass all the tests. So I set it to run.

I fired up Codex in pycparser's repository, and wrote this prompt just to make sure it understands me and can run the tests: Codex figured it out (I gave it the exact command, after all!); my next prompt was the real thing [5]: Here Codex went to work and churned for over an hour. Having never observed an agent work for nearly this long, I kind of assumed it had gone off the rails and would fail sooner or later. So I was rather surprised and skeptical when it eventually came back with:

It took me a while to poke around the code and run it until I was convinced - it had actually done it! It wrote a new recursive descent parser with only ancillary dependencies on PLY, and that parser passed the test suite. After a few more prompts, we'd removed the ancillary dependencies and made the structure clearer. I hadn't looked too deeply into code quality at this point, but at least on the functional level - it succeeded. This was very impressive!

A change like the one described above is impossible to code-review as one PR in any meaningful way, so I used a different strategy. Before embarking on this path, I created a new branch, and once Codex finished the initial rewrite, I committed this change, knowing that I would review it in detail, piece by piece, later on. Even though coding agents have their own notion of history and can "revert" certain changes, I felt much safer relying on Git.
In the worst case, if all of this went south, I could nuke the branch and it would be as if nothing ever happened. I was determined to only merge this branch onto main once I was fully satisfied with the code. In what follows, I had to git reset several times when I didn't like the direction in which Codex was going. In hindsight, doing this work in a branch was absolutely the right choice.

Once I'd sufficiently convinced myself that the new parser actually worked, I used Codex to similarly rewrite the lexer and get rid of the PLY dependency entirely, deleting it from the repository. Then, I started looking more deeply into code quality - reading the code created by Codex and trying to wrap my head around it. And - oh my - this was quite the journey.

Much has been written about the code produced by agents, and much of it seems to be true. Maybe it's a setting I'm missing (I'm not using my own custom AGENTS.md yet, for instance), but Codex seems to be that eager programmer that wants to get from A to B whatever the cost. Readability, minimalism and code clarity are very much secondary goals. Using raise...except for control flow? Yep. Abusing Python's dynamic typing (like having None, False and other values all mean different things for a given variable)? For sure. Spreading the logic of a complex function all over the place instead of putting all the key parts in a single switch statement? You bet.

Moreover, the agent is hilariously lazy. More than once I had to convince it to do something it initially said was impossible, and even insisted again in follow-up messages. The anthropomorphization here is mildly concerning, to be honest. I could never imagine I would be writing something like the following to a computer, and yet - here we are: "Remember how we moved X to Y before? You can do it again for Z, definitely. Just try".

My process was to see how I can instruct Codex to fix things, and intervene myself (by rewriting code) as little as possible.
I've mostly succeeded in this, and did maybe 20% of the work myself. My branch grew dozens of commits, falling into roughly these categories:

(1) The code in X is too complex; why can't we do Y instead?
(2) The use of X is needlessly convoluted; change Y to Z, and T to V in all instances.
(3) The code in X is unclear; please add a detailed comment - with examples - to explain what it does.

Interestingly, after doing (3), the agent was often more effective in giving the code a "fresh look" and succeeding in either (1) or (2). Eventually, after many hours spent in this process, I was reasonably pleased with the code. It's far from perfect, of course, but taking the essential complexities into account, it's something I could see myself maintaining (with or without the help of an agent). I'm sure I'll find more ways to improve it in the future, but I have a reasonable degree of confidence that this will be doable.

It passes all the tests, so I've been able to release a new version (3.00) without major issues so far. The only issue I've discovered is that some of CFFI's tests are overly precise about the phrasing of errors reported by pycparser; this was an easy fix . The new parser is also faster, by about 30% based on my benchmarks! This is typical of recursive descent when compared with YACC-generated parsers, in my experience. After reviewing the initial rewrite of the lexer, I spent a while instructing Codex on how to make it faster, and it worked reasonably well.

While working on this, it became quite obvious that static typing would make the process easier. LLM coding agents really benefit from closed loops with strict guardrails (e.g. a test suite to pass), and type annotations act as such. For example, had pycparser already been type-annotated, Codex would probably not have overloaded values to multiple types (like None vs. False vs. others). In a follow-up, I asked Codex to type-annotate pycparser (running checks using ty), and this was also a back-and-forth, because the process exposed some issues that needed to be refactored. Time will tell, but hopefully it will make further changes in the project simpler for the agent. Based on this experience, I'd bet that coding agents will be somewhat more effective in strongly typed languages like Go, TypeScript and especially Rust.

Overall, this project has been a really good experience, and I'm impressed with what modern LLM coding agents can do! While there's no reason to expect that progress in this domain will stop, even if it does - these are already very useful tools that can significantly improve programmer productivity. Could I have done this myself, without an agent's help? Sure. But it would have taken me much longer, assuming that I could even muster the will and concentration to engage in this project. I estimate it would take me at least a week of full-time work (so 30-40 hours) spread over who knows how long to accomplish. With Codex, I put an order of magnitude less work into this (around 4-5 hours, I'd estimate) and I'm happy with the result.

It was also fun . At least in one sense, my professional life can be described as the pursuit of focus, deep work and flow . It's not easy for me to get into this state, but when I do I'm highly productive and find it very enjoyable. Agents really help me here. When I know I need to write some code and it's hard to get started, asking an agent to write a prototype is a great catalyst for my motivation. Hence the meme at the beginning of the post.

One can't avoid a nagging question - does the quality of the code produced by agents even matter? Clearly, the agents themselves can understand it (if not today's agent, then at least next year's). Why worry about future maintainability if the agent can maintain it? In other words, does it make sense to just go full vibe-coding? This is a fair question, and one I don't have an answer to. Right now, for projects I maintain and stand behind , it seems obvious to me that the code should be fully understandable and accepted by me, and the agent is just a tool helping me get to that state more efficiently. It's hard to say what the future holds here; it's going to be interesting, for sure.

A note on the lexer: there was also the lexer to consider, but this seemed like a much simpler job. My impression is that in the early days of computing, lex gained prominence because of strong regexp support, which wasn't very common yet. These days, with excellent regexp libraries existing for pretty much every language, the added value of lex over a custom regexp-based lexer isn't very high. That said, it wouldn't make much sense to embark on a journey to rewrite just the lexer; the dependency on PLY would still remain, and besides, PLY's lexer and parser are designed to work well together. So it wouldn't help me much without tackling the parser beast.

Langur Monkey 2 months ago

Game Boy emulator tech stack update

In my previous post , I shared the journey of building Play Kid , my Game Boy emulator. At the time, I was using SDL2 to handle the “heavy lifting” of graphics, audio, and input. This was released as v0.1.0. It worked, and it worked well, but it always felt a bit like a “guest” in the Rust ecosystem. SDL2 is a C library at heart, and while the Rust wrappers are good, they bring along some baggage like shared library dependencies and difficult integration with Rust-native UI frameworks. So I decided to perform a heart transplant on Play Kid. For version v0.2.0 I’ve moved away from SDL2 entirely, replacing it with a stack of modern, native Rust libraries: , , , , , and : The most visible change is the new Debug Panel . The new integrated debugger features a real-time disassembly view and breakpoint management. One of the coolest additions is the Code disassembly panel. It decodes the ROM instructions in real-time, highlighting the current and allowing me to toggle breakpoints just by clicking on a line. The breakpoints themselves are now managed in a dedicated list, shown in red at the bottom. The rest of the debug panel shows what we already had: the state of the CPU, the PPU, and the joypad. Of course, no modern Rust migration is complete without a descent into dependency hell . This new stack comes with a major catch: is a bit of a picky gatekeeper. Its latest version is 0.15 (January 2025). It is pinned to an older version of (0.19 vs the current 28.0), and it essentially freezes the rest of the project in a time capsule. To keep the types compatible, I’m forced to stay on 0.26 (current is 0.33) and 0.29 (current is 0.30), even though the rest of the ecosystem has moved on to much newer, shinier versions. It’s kind of frustrating. You get the convenience of the buffer, but you pay for it by being locked out of the latest API improvements and features. Navigating these version constraints felt like solving a hostage negotiation between crate maintainers. 
Not very fun.

Despite the dependency issues, I think the project is now in a much better place. The code is cleaner, the debugger is much better, and it’s easier to ship binaries for Linux, Windows, and macOS via GitHub Actions. If you’re interested in seeing the new architecture or trying out the new debugger, the code is updated on Codeberg and GitHub . Next, I’ll probably think about adding Game Boy Color support, but not before taking some time off from this project.

- & : These handle the windowing and the actual Game Boy frame buffer. allows me to treat the 160x144 LCD as a simple pixel buffer while handles the hardware-accelerated scaling and aspect ratio correction behind the scenes.
- : This was a big step-up. Instead of my minimal homegrown UI library from the SDL2 version, I now have access to a full-featured, immediate-mode GUI. This allowed me to build the debugger I had in mind from the beginning.
- & : These replaced SDL2’s audio and controller handling with pure-Rust alternatives that feel much more ergonomic to use alongside the rest of the machine.

Uros Popovic 2 months ago

Writing your first compiler

Build your first compiler with minimal, high-level, modern code. With only a few files of Go and C code, we can set up a workflow that dynamically fetches everything needed, including the LLVM library itself, and builds a portable compiler. The stack is modern and reproducible, and you do not need to read dozens of pages to get started.

Michael Lynch 3 months ago

Refactoring English: Month 13

Hi, I’m Michael. I’m a software developer and founder of small, indie tech businesses. I’m currently working on a book called Refactoring English: Effective Writing for Software Developers . Every month, I publish a retrospective like this one to share how things are going with my book and my professional life overall. At the start of each month, I declare what I’d like to accomplish. Here’s how I did against those goals:

The blog post was a risky bet because it could only reach new readers if it hit the front page of Hacker News, and its only chance of that was the first couple of weeks of 2026. Fortunately, the post reached #1 on Hacker News and remained on the front page for almost 22 hours. It continues my strategy of highlighting other successful tech writers , a strategy I like because it feels like a win-win for me, readers, and the writers I showcase.

I still have the Hacker News prediction game at about 80% complete. I’m not sure what to do with it because it’s almost done, but I feel like it’s not fun, so I’m never motivated to complete it. But I want to get it over the finish line to see what people think.

Ironically, the chapters I’m working on are about motivation and focus, but I keep letting my experiments with MeshCore interfere with my writing. I’ve been better at maintaining focus in the new year, and the distractions are actually helpful because I’m getting fresh experience with regaining focus to write about.

Again, I got distracted by MeshCore experiments in December and didn’t make as much progress as I wanted. I love design docs and find them helpful, but they’re also incredibly boring to write, so it was always tempting to shelve the design doc for something with more instant gratification.

Pre-sales are down because I didn’t have any new posts to attract new readers (I didn’t publish the Hacker News post until January). Still, it’s a positive sign that my “passive sales” continue to grow. In December, I had almost $500 in pre-sales.
If I compare that to months with similar website visitors, May had only $241 in pre-sales, and August had $361, so the numbers are trending up. I hope that as the book grows more complete and more readers recommend it, the passive sales continue to rise without relying on me finding a successful marketing push each month. When I ran my Black Friday promotion in November, a reader emailed to say that 30% off (US$20) is still an unaffordable price in Argentina for a book. He asked if I’d consider regional pricing. He mentioned that Steam games are typically priced 50% lower in Argentina than the US, so I figured that was a good anchor. I collect payments through Stripe, and I couldn’t find any option for regional pricing in my Stripe dashboard. I found an article in Stripe’s knowledge base called “Geographic pricing in practice: Why it matters and how to implement it.” I was delighted until I read the entire article and discovered they’d forgotten to write the “and how to implement it” part. So, Stripe advocates for regional pricing, but they don’t actually offer it as an option. It was a helpful reminder that Stripe is the worst payment processor except for all those other payment processors . So, for my Argentinian customer, I used a one-off process where I manually created a custom payment link for him at a discounted price. And when I went through the process, I realized I could set the price in Argentine pesos so he wouldn’t have to pay a currency conversion fee. I set the price to 22,000 ARS (about US$15), and he seemed happy with the price and the checkout experience. The reader suggested I publicly offer regional pricing, at least for countries like Brazil and India, which have high numbers of developers but relatively low purchasing power. Even without native Stripe support for regional pricing, it seemed like it wouldn’t be that hard to automate the thing I did manually. 
I read about Sebastien Castiel implementing regional pricing for his course, which led me to Wes Bos’ post about the same thing . Sebastien shared a lot of technical details, but his solution was heavy on React, whereas my site is vanilla HTML and JavaScript. He also relied on discount codes, which I don’t like because it means most customers see that there’s a special deal they’re not getting. I spent a few hours implementing a solution using a cloud function that determines the right price on the fly and dynamically creates a Stripe checkout link. Then, I realized I could precompute everything and eliminate the need for server-side logic, so I deleted my cloud function. My implementation looks like this: The user just picks their country and it activates the Stripe purchase link for that country, and they pay in their own currency. I’m going by the honor system, so I don’t bother with IP geolocation or VPN prevention. I do hide the discount for each country to discourage people from picking the cheapest option. And part of the benefit of pricing in each country’s local currency is that if someone cheats and picks a region that’s not really their home currency, they lose some money in conversion fees. The numbers feel not quite correct. According to strict PPP, the equivalent of $30 in the US is $4 in Egypt, but I suspect you can’t really buy non-bootleg books for programmers in Egypt for $4. When Wes Bos did this, he just asked his readers to tell him fair prices, so I’ll try that too. Leave a comment or email me the normal price range for developer-oriented books in your country. In December, I published “My First Impressions of MeshCore Off-Grid Messaging.” I was excited about the technology but disappointed to discover that the clients are all closed-source . 
At that point, I decided to pause my exploration of MeshCore, but Frieder Schrempf, a MeshCore contributor, replied to my post with this interesting perspective:

“I share a lot of your thoughts on this topic. Personally I see the value of MeshCore in the protocol and not so much in the software implementations of the firmware, apps, etc. […] If MeshCore as a protocol succeeds and gets widely used (currently it looks like it does) then properly maintained open-source implementations will follow (at least I hope).”

I agreed with Frieder and thought, “Maybe I should just write a proof-of-concept open-source MeshCore app?”

Actually, there already was a proof-of-concept MeshCore app. Liam Cottle, the developer of the official MeshCore app, previously wrote a web app for MeshCore as a prototype for the official version. He deprecated it when he made the official (proprietary) MeshCore app, but the source code for his prototype was still available, and the prototype had most of the features I needed. I wondered how difficult it would be to port the prototype to mobile. MeshCore is too hard to use as a web app, as it requires Bluetooth access and offline support.

I’ve heard somewhat positive things about Flutter, Google’s solution for cross-platform mobile development, and I suspected that an LLM could successfully port the code from the web prototype to Flutter without much intervention from me. My plan was to have an LLM create a Flutter port of the prototype in three stages.

That worked, but every step was clunkier than I anticipated. I thought it would be a quick weekend project I could whip together in a few hours. 30 hours and $200 in LLM credits later, I finally got it working.

Running my MeshCore Flutter app on a real Android device

But the day I got my Flutter implementation to feature parity with the prototype, I went to share it on Reddit and saw someone had just shared meshcore-open, a MeshCore client implementation in Flutter.
It was the same idea I had but with far better execution. I was disappointed someone beat me to the punch, but I was also relieved: from my brief experience working with Flutter, I was eager to get away from it as quickly as possible. I only wanted to make a proof of concept in the hope that someone else would pick it up, so I’m happy that there’s now an open-source, feature-rich MeshCore client implementation.

While working on my MeshCore Flutter app, I had to implement low-level logic to parse MeshCore device-to-client messages. There’s a public spec that defines MeshCore’s peer-to-peer protocol, and even that’s fairly loose. But there’s another, undocumented protocol for how a device running MeshCore firmware communicates with a companion client (e.g., an Android app) over Bluetooth or USB. The de facto reference implementation is the MeshCore firmware, but it intermingles peer-to-peer protocol logic with device-to-client protocol logic and UI logic, and it spreads the implementation across disparate places in the codebase.

For example, a MeshCore client can fetch a list of contacts from a MeshCore device over Bluetooth, but it has to deserialize the raw bytes back into contacts. There’s no library for decoding the message, so each MeshCore client and library rolls its own separate implementation. I noticed some patterns across those implementations, which I’ll get to below.

My first thought was to rewrite the logic using a protocol library like protobuf or Cap’n Proto, but I don’t see a backwards-compatible way of integrating a third-party library at this point. So, what if I wrote a core implementation of the MeshCore device-to-client protocol in C? I could add language-specific bindings so that we don’t need whole separate implementations for Dart, Python, JavaScript, and any other language you’d want to write in.

So, I started my own MeshCore client library. The library is not ready to demo as a proof of concept, but it’s close.
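To illustrate the kind of strict, validating decoder a shared core library could expose through its bindings, here's a minimal Python sketch. The frame layout, field names, and constant are invented for this example; they are not MeshCore's actual device-to-client wire format.

```python
# Hypothetical "contact" frame: [type:1][name_len:1][name:N][lat:f32][lon:f32]
# This layout is made up for illustration; MeshCore's real format differs.
import struct
from dataclasses import dataclass

FRAME_CONTACT = 0x03  # named constant instead of an inline magic number

@dataclass
class Contact:
    name: str
    lat: float
    lon: float

def parse_contact(frame: bytes) -> Contact:
    if len(frame) < 10 or frame[0] != FRAME_CONTACT:
        raise ValueError("not a contact frame")
    name_len = frame[1]
    if len(frame) != 2 + name_len + 8:
        raise ValueError("length field disagrees with frame size")
    name = frame[2:2 + name_len].decode("utf-8")
    lat, lon = struct.unpack_from("<ff", frame, 2 + name_len)
    # Reject garbage instead of passing it on to the caller.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("GPS coordinates outside Earth's bounds")
    return Contact(name, lat, lon)
```

The point is that length checks and range checks live in one place, so every language binding inherits the same validation instead of each client quietly accepting garbage.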
It’s entirely possible the MeshCore maintainers won’t like this idea, and it’s basically dead in the water without their buy-in. But I did it anyway, because I’d never tried writing a cross-language library, and that was an interesting experience. The last time I tried to call C code from Python was 20 years ago, and I had to use SWIG. Back then, it felt painful and hacky, and the experience seems to have gotten 80% better since. I desperately wanted the core implementation to be Zig rather than C, but I saw too many blockers.

Is $30 (USD) for a developer-oriented book expensive where you live? If so, let me know what you’d expect to pay for a programming book like Designing Data-Intensive Applications in your country (in local currency).

This month:

- I added regional pricing for my book based on purchasing power parity.
- I created my first Flutter app.
- I’m writing my first cross-language library.

- Result: Published “The Most Popular Blogs of Hacker News in 2025” instead
- Result: Made progress on two chapters but didn’t complete them
- Result: Got 80% through a design doc draft

My process for generating regional prices:

1. Manually get a list of all countries / currencies that Stripe supports.
2. Write a script that pulls data from the World Bank to calculate the purchasing power parity (PPP) for each country in the list.
3. Calculate each country’s discount based on their purchasing power relative to the US. For example, the PPP of Brazil is 54% lower than the US, so Brazil gets a 54% discount.
4. Filter out countries where the PPP is within 15% of the US (too small a discount to bother).
5. Filter out countries where the discount would be negative. Otherwise, customers in Luxembourg would have to pay double.
6. Limit the discount to a maximum of 75%. Otherwise the price in Egypt would be US$4, meaning I’d get like $3.50 after conversion fees.
7. Automatically generate country-specific Stripe price objects and Stripe payment links for each country remaining in the list.
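The numeric heart of that pipeline can be sketched in a few lines of Python. The price-level values below are illustrative stand-ins for the World Bank data (a level of 1.0 means US prices).

```python
# Sketch of the discount pipeline: filter tiny/negative discounts, cap at 75%.
# The price levels are illustrative assumptions, not real World Bank figures.
US_PRICE = 30.00
MIN_DISCOUNT, MAX_DISCOUNT = 0.15, 0.75

price_levels = {      # country -> price level relative to the US
    "BR": 0.46,       # Brazil: roughly 54% cheaper
    "IN": 0.30,       # India
    "EG": 0.13,       # Egypt: would exceed the 75% cap
    "LU": 1.20,       # Luxembourg: negative discount, filtered out
    "DE": 0.95,       # Germany: within 15% of the US, filtered out
}

def discounts(levels):
    out = {}
    for country, level in levels.items():
        discount = 1 - level
        if discount < MIN_DISCOUNT:             # drops Luxembourg and Germany
            continue
        discount = min(discount, MAX_DISCOUNT)  # caps Egypt at 75%
        out[country] = round(US_PRICE * (1 - discount), 2)
    return out
```

With these sample levels, `discounts(price_levels)` yields a regional price per remaining country (Brazil at $13.80, India at $9.00, Egypt capped at $7.50), which can then feed the Stripe price-object generation step.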
Finally, I put all the countries in an HTML dropdown on my site.

The three stages of the Flutter port:

1. Write end-to-end tests for the prototype web app using Playwright.
2. Port the prototype implementation to a Flutter web app, keeping the end-to-end tests constant to ensure feature parity.
3. Add an Android build to the Flutter project.

Before I could write end-to-end tests for the prototype, I had to convert it to use semantic HTML and ARIA attributes because a lot of the input labels were just bare elements. I couldn’t keep the Playwright tests constant because Flutter doesn’t actually emit semantic HTML for web apps. It creates its own Flutter-specific HTML dialect and draws everything on an HTML canvas. Most Playwright element locators still work somehow, but I had to make a lot of Flutter-specific changes to the tests.

It took a long time, even with an LLM, to figure out how to build an Android package with Flutter. Gradle, Android’s build system, is buggy on NixOS. I kept running into situations where it was failing with mysterious errors that eventually turned out to be stale data it had cached in my home directory.

Flutter makes it surprisingly difficult to communicate over Bluetooth. On the web (at least on Chrome), you essentially get it for free from the Web Bluetooth API, but with Flutter, you have to use a proprietary third-party library and roll your own device picker UI.

The existing MeshCore client parser implementations:

- meshcore.js (JavaScript)
- meshcore-open (Dart)
- meshcore_py (Python)

What I notice about those implementations:

- They use magic numbers rather than referring to constants defined in some authoritative location.
- None of them have automated tests for their parsers.
- They’re dragging unnecessary low-level work into high-level languages. For example, everyone is storing arrays alongside separate size variables. That’s an artifact of the C implementation, where arrays don’t know their size. You don’t have to manually track an array’s size in languages like JavaScript, Python, or Dart.
- They don’t check data carefully, so they’ll happily pass on garbage data like a negative path length or GPS coordinates that are outside of Earth’s bounds.
- They all ignore the flags field even though the flags are supposed to indicate which fields are populated. Or at least they’re supposed to in the peer-to-peer messages; for device-to-client messages, they seem to be meaningless.

The library lives at https://codeberg.org/mtlynch/libmeshcore-client

Blockers to writing the core implementation in Zig:

- Zig does not yet compile to the xtensa architecture, which most MeshCore devices use.
- PlatformIO, which most MeshCore firmware projects use, does not support Zig.
- Dart’s ffigen would maybe work with Zig, since Zig supports C’s ABI, but it was hard even getting it to work with C. Ditto for Python’s cffi.

Progress this month:

- I got most of the way through writing two new chapters of Refactoring English.
- I got most of the way through writing the design doc for my photo sharing app idea.
- I published “The Most Popular Blogs of Hacker News in 2025.”
- I created my first Flutter app.
- I created my first cross-language library.
- I made some contributions to MeshCore’s meshcore.js, most of which the maintainers are ignoring.

Minimize in-flight projects. AI makes it easier than ever to start new projects, but I’m still the bottleneck on turning them into something production-ready. The result is that I have a lot of projects that are in flight and waiting for me to review them before I publish them. There’s mental overhead in so much context-switching and task tracking.

Goals for next month:

- Publish three chapters of Refactoring English.
- Publish my 2025 annual review (year 8).

<antirez> 3 months ago

Don't fall into the anti-AI hype

I love writing software, line by line. It could be said that my career was a continuous effort to create well-written, minimal software, where the human touch was the fundamental feature. I also hope for a society where the least fortunate are not forgotten. Moreover, I have no need for AI to succeed economically, and I don't care if the current economic system is subverted (I could be very happy, honestly, if it goes in the direction of a massive redistribution of wealth). But I would not respect myself and my intelligence if I let my ideas about software and society impair my vision: facts are facts, and AI is going to change programming forever.

In 2020 I left my job in order to write a novel about AI, universal basic income, and a society adapting to the automation of work while facing many challenges. At the very end of 2024 I opened a YouTube channel focused on AI, its use in coding tasks, and its potential social and economic effects. But while I recognized what was going to happen very early, I thought that we had more time, at least a few years, before programming would be completely reshaped. I no longer believe this is the case.

Recently, state-of-the-art LLMs have become able to complete large subtasks or medium-size projects alone, almost unassisted, given a good set of hints about what the end result should be. The degree of success you'll get is related to the kind of programming you do (the more isolated, and the more textually representable, the better: system programming is particularly apt), and to your ability to create a mental representation of the problem to communicate to the LLM. But, in general, it is now clear that for most projects, writing the code yourself is no longer sensible, if not to have fun.

In the past week, just by prompting and inspecting the code to provide guidance from time to time, I completed the following four tasks in hours instead of weeks:

1.
I modified my linenoise library to support UTF-8, and created a framework for line-editing testing that uses an emulated terminal able to report what is displayed in each character cell. This is something I always wanted to do, but it was hard to justify the work needed just to test a side project of mine. When you can just describe your idea and it materializes in the code, things are very different.

2. I fixed transient failures in the Redis test suite. This is very annoying work: timing-related issues, TCP deadlock conditions, and so forth. Claude Code iterated for as long as needed to reproduce them, inspected the state of the processes to understand what was happening, and fixed the bugs.

3. Yesterday I wanted a pure C library able to do inference of BERT-like embedding models. Claude Code created it in 5 minutes: the same output as PyTorch at nearly the same speed (15% slower), in 700 lines of code, plus a Python tool to convert the GTE-small model.

4. In the past weeks I made changes to Redis Streams internals. I had a design document for the work I did. I tried giving it to Claude Code, and it reproduced my work in, like, 20 minutes or less (mostly because I'm slow at checking and authorizing the commands it needed to run).

It is simply impossible not to see the reality of what is happening. Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it (and, about this second part, LLMs are great partners, too). It does not matter if AI companies will not be able to get their money back and the stock market crashes. All that is irrelevant, in the long run. It does not matter if this or that CEO of some unicorn is telling you something that is off-putting, or absurd. Programming has changed forever, anyway.

How do I feel about all the code I wrote that was ingested by LLMs?
I feel great about being part of that, because I see it as a continuation of what I tried to do all my life: democratizing code, systems, knowledge. LLMs are going to help us write better software, faster, and will allow small teams to have a chance to compete with bigger companies. The same thing open source software did in the 90s.

However, this technology is far too important to be in the hands of a few companies. For now, one lab may do pre-training better than another, or reinforcement learning in a much more effective way, but the open models, especially the ones produced in China, continue to compete (even if they are behind) with the frontier models of closed labs. There is a sufficient democratization of AI so far, even if imperfect. But it is absolutely not obvious that it will stay like that forever. I'm scared of the centralization. At the same time, I believe neural networks, at scale, are simply able to do incredible things, and that there is not enough "magic" inside current frontier AI for the other labs and teams not to catch up (otherwise it would be very hard to explain, for instance, why OpenAI, Anthropic, and Google have been so near in their results for years now).

As a programmer, I want to write more open source than ever now. I want to improve certain repositories of mine abandoned for lack of time. I want to apply AI to my Redis workflow, improve the Vector Sets implementation, and then other data structures, as I'm doing with Streams now.

But I'm worried for the folks who will get fired. It is not clear what the dynamic at play will be: will companies try to have more people, and to build more? Or will they try to cut salary costs, having fewer programmers who are better at prompting? And there are other sectors where humans will become completely replaceable, I fear. What is the social solution, then? Innovation can't be taken back, after all.
I believe we should vote for governments that recognize what is happening and are willing to support those who will remain jobless. And the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection. But I also look forward to the good AI could bring: new progress in science that could help ease the suffering of the human condition, which is not always a happy one.

Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools with care, over weeks of work, not in a five-minute test that can only reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months.

Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you when you coded late into the night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched.

Simon Willison 3 months ago

2025: The year in LLMs

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years, see Stuff we figured out about AI in 2023 and Things we learned about LLMs in 2024. It’s been a year filled with a lot of different trends.

OpenAI kicked off the "reasoning" (aka inference-scaling, aka Reinforcement Learning from Verifiable Rewards, or RLVR) revolution in September 2024 with o1 and o1-mini. They doubled down on that with o3, o3-mini, and o4-mini in the opening months of 2025, and reasoning has since become a signature feature of models from nearly every other major AI lab. My favourite explanation of the significance of this trick comes from Andrej Karpathy:

By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples). [...] Running RLVR turned out to offer high capability/$, which gobbled up the compute that was originally intended for pretraining. Therefore, most of the capability progress of 2025 was defined by the LLM labs chewing through the overhang of this new stage and overall we saw ~similar sized LLMs but a lot longer RL runs.

Every notable AI lab released at least one reasoning model in 2025. Some labs released hybrids that could be run in reasoning or non-reasoning modes. Many API models now include dials for increasing or decreasing the amount of reasoning applied to a given prompt.

It took me a while to understand what reasoning was useful for. Initial demos showed it solving mathematical logic puzzles and counting the Rs in strawberry - two things I didn't find myself needing in my day-to-day model usage.
It turned out that the real unlock of reasoning was in driving tools. Reasoning models with access to tools can plan out multi-step tasks, execute on them, and continue to reason about the results so that they can update their plans to better achieve the desired goal.

A notable result is that AI-assisted search actually works now. Hooking up search engines to LLMs had questionable results before, but now I find even my more complex research questions can often be answered by GPT-5 Thinking in ChatGPT.

Reasoning models are also exceptional at producing and debugging code. The reasoning trick means they can start with an error and step through many different layers of the codebase to find the root cause. I've found even the gnarliest of bugs can be diagnosed by a good reasoner with the ability to read and execute code against even large and complex codebases.

Combine reasoning with tool use and you get...

I started the year making a prediction that agents were not going to happen. Throughout 2024 everyone was talking about agents, but there were few to no examples of them working, further confused by the fact that everyone using the term “agent” appeared to be working from a slightly different definition from everyone else. By September I’d got fed up with avoiding the term myself due to the lack of a clear definition and decided to treat an agent as an LLM that runs tools in a loop to achieve a goal. This unblocked me for productive conversations about them, always my goal for any piece of terminology like that.

I didn’t think agents would happen because I didn’t think the gullibility problem could be solved, and I thought the idea of replacing human staff members with LLMs was still laughable science fiction. I was half right in my prediction: the science fiction version of a magic computer assistant that does anything you ask of it (as in Her) didn’t materialize...
But if you define agents as LLM systems that can perform useful work via tool calls over multiple steps, then agents are here, and they are proving to be extraordinarily useful. The two breakout categories for agents have been coding and search.

The Deep Research pattern, where you challenge an LLM to gather information and it churns away for 15+ minutes building you a detailed report, was popular in the first half of the year but has fallen out of fashion now that GPT-5 Thinking (and Google's "AI mode", a significantly better product than their terrible "AI overviews") can produce comparable results in a fraction of the time. I consider this to be an agent pattern, and one that works really well.

The "coding agents" pattern is a much bigger deal. The most impactful event of 2025 happened in February, with the quiet release of Claude Code. I say quiet because it didn’t even get its own blog post! Anthropic bundled the Claude Code release in as the second item in their post announcing Claude 3.7 Sonnet. (Why did Anthropic jump from Claude 3.5 Sonnet to 3.7? Because they released a major bump to Claude 3.5 in October 2024 but kept the name exactly the same, causing the developer community to start referring to the un-named 3.5 Sonnet v2 as 3.6. Anthropic burned a whole version number by failing to properly name their new model!)

Claude Code is the most prominent example of what I call coding agents: LLM systems that can write code, execute that code, inspect the results, and then iterate further. The major labs all put out their own CLI coding agents in 2025. Vendor-independent options include GitHub Copilot CLI, Amp, OpenCode, OpenHands CLI, and Pi. IDEs such as Zed, VS Code, and Cursor invested a lot of effort in coding agent integration as well.

My first exposure to the coding agent pattern was OpenAI's ChatGPT Code Interpreter in early 2023, a system baked into ChatGPT that allowed it to run Python code in a Kubernetes sandbox.
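Stripped to its essentials, the "LLM that runs tools in a loop" pattern behind all of these agents can be sketched in a few lines. The model here is a canned stub standing in for a real chat-completions API, so the shape of the loop is visible without any vendor SDK; the message format is a simplified assumption.

```python
# Minimal agent-loop sketch: a stubbed "model" requests a tool call, the
# harness executes it, feeds the result back, and the model finishes.

TOOLS = {"add": lambda a, b: a + b}  # toy tool registry

def call_model(messages):
    """Stub model: asks for one tool call, then answers using the result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": (2, 3)}           # model requests a tool
    result = [m for m in messages if m["role"] == "tool"][-1]["content"]
    return {"answer": f"The sum is {result}."}           # model finishes

def run_agent(goal):
    messages = [{"role": "user", "content": goal}]
    while True:                                          # the loop
        reply = call_model(messages)
        if "answer" in reply:                            # goal achieved: stop
            return reply["answer"]
        output = TOOLS[reply["tool"]](*reply["args"])    # execute the tool
        messages.append({"role": "tool", "content": output})
```

Real coding agents differ mainly in scale, not shape: richer tools (shell, file edits, search), a real model behind `call_model`, and guardrails around what the tools may do.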
I was delighted this year when Anthropic finally released their equivalent in September, albeit under the baffling initial name of "Create and edit files with Claude". In October they repurposed that container sandbox infrastructure to launch Claude Code for web, which I've been using on an almost daily basis ever since. Claude Code for web is what I call an asynchronous coding agent: a system you can prompt and forget, and it will work away on the problem and file a Pull Request once it's done. OpenAI's "Codex cloud" (renamed to "Codex web" in the last week) launched earlier, in May 2025. Gemini's entry in this category is called Jules, also launched in May.

I love the asynchronous coding agent category. They're a great answer to the security challenges of running arbitrary code execution on a personal laptop, and it's really fun being able to fire off multiple tasks at once, often from my phone, and get decent results a few minutes later. I wrote more about how I'm using these in Code research projects with async coding agents like Claude Code and Codex and Embracing the parallel coding agent lifestyle.

In 2024 I spent a lot of time hacking on my LLM command-line tool for accessing LLMs from the terminal, all the time thinking that it was weird that so few people were taking CLI access to models seriously: they felt like such a natural fit for Unix mechanisms like pipes. Maybe the terminal was just too weird and niche to ever become a mainstream tool for accessing LLMs? Claude Code and friends have conclusively demonstrated that developers will embrace LLMs on the command line, given powerful enough models and the right harness. It helps that terminal commands with obscure syntax are no longer a barrier to entry when an LLM can spit out the right command for you.

As of December 2nd, Anthropic credit Claude Code with $1bn in run-rate revenue! I did not expect a CLI tool to reach anything close to those numbers.
With hindsight, maybe I should have promoted LLM from a side-project to a key focus!

The default setting for most coding agents is to ask the user for confirmation for almost every action they take. In a world where an agent mistake could wipe your home folder or a malicious prompt injection attack could steal your credentials, this default makes total sense. Anyone who's tried running their agent with automatic confirmation (aka YOLO mode, for which Codex CLI even ships an alias) has experienced the trade-off: using an agent without the safety wheels feels like a completely different product. A big benefit of asynchronous coding agents like Claude Code for web and Codex Cloud is that they can run in YOLO mode by default, since there's no personal computer to damage.

I run in YOLO mode all the time, despite being deeply aware of the risks involved. It hasn't burned me yet...

... and that's the problem. One of my favourite pieces on LLM security this year is The Normalization of Deviance in AI by security researcher Johann Rehberger. Johann describes the "Normalization of Deviance" phenomenon, where repeated exposure to risky behaviour without negative consequences leads people and organizations to accept that risky behaviour as normal. This was originally described by sociologist Diane Vaughan as part of her work to understand the 1986 Space Shuttle Challenger disaster, caused by a faulty O-ring that engineers had known about for years. Plenty of successful launches led NASA culture to stop taking that risk seriously. Johann argues that the longer we get away with running these systems in fundamentally insecure ways, the closer we get to a Challenger disaster of our own.

ChatGPT Plus's original $20/month price turned out to be a snap decision by Nick Turley based on a Google Form poll on Discord. That price point has stuck firmly ever since. This year a new pricing precedent has emerged: the Claude Pro Max 20x plan, at $200/month.
OpenAI have a similar $200 plan called ChatGPT Pro. Gemini have Google AI Ultra at $249/month, with a $124.99/month discount for the first three months. These plans appear to be driving some serious revenue, though none of the labs have shared figures that break down their subscribers by tier. I've personally paid $100/month for Claude in the past and will upgrade to the $200/month plan once my current batch of free allowance (from previewing one of their models - thanks, Anthropic) runs out. I've heard from plenty of other people who are happy to pay these prices too. You have to use models a lot in order to spend $200 of API credits, so you would think it would make economic sense for most people to pay by the token instead. It turns out tools like Claude Code and Codex CLI can burn through enormous amounts of tokens once you start setting them more challenging tasks, to the point that $200/month offers a substantial discount.

2024 saw some early signs of life from the Chinese AI labs, mainly in the form of Qwen 2.5 and early DeepSeek. They were neat models but didn't feel world-beating. This changed dramatically in 2025. My ai-in-china tag has 67 posts from 2025 alone, and I missed a bunch of key releases towards the end of the year (GLM-4.7 and MiniMax-M2.1 in particular). Here's the Artificial Analysis ranking for open weight models as of 30th December 2025: GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, and MiniMax-M2.1 are all Chinese open weight models. The highest non-Chinese model in that chart is OpenAI's gpt-oss-120B (high), which comes in sixth place.

The Chinese model revolution really kicked off on Christmas Day 2024 with the release of DeepSeek 3, supposedly trained for around $5.5m. DeepSeek followed that on 20th January with DeepSeek R1, which promptly triggered a major AI/semiconductor selloff: NVIDIA lost ~$593bn in market cap as investors panicked that AI maybe wasn't an American monopoly after all.
The panic didn't last: NVIDIA quickly recovered and today are up significantly from their pre-DeepSeek R1 levels. It was still a remarkable moment. Who knew an open weight model release could have that kind of impact?

DeepSeek were quickly joined by an impressive roster of Chinese AI labs, and I've been paying attention to several of them in particular. Most of these models aren't just open weight, they are fully open source under OSI-approved licenses: Qwen use Apache 2.0 for most of their models, DeepSeek and Z.ai use MIT. Some of them are competitive with Claude 4 Sonnet and GPT-5! Sadly none of the Chinese labs have released their full training data or the code they used to train their models, but they have been putting out detailed research papers that have helped push forward the state of the art, especially when it comes to efficient training and inference.

One of the most interesting recent charts about LLMs is "Time horizon of software engineering tasks different LLMs can complete 50% of the time" from METR. The chart shows tasks that take humans up to 5 hours, and plots the evolution of models that can achieve the same goals working independently. 2025 saw some enormous leaps forward here, with GPT-5, GPT-5.1 Codex Max, and Claude Opus 4.5 able to perform tasks that take humans multiple hours; 2024’s best models tapped out at under 30 minutes. METR conclude that “the length of tasks AI can do is doubling every 7 months”. I'm not convinced that pattern will continue to hold, but it's an eye-catching way of illustrating current trends in agent capabilities.

The most successful consumer product launch of all time happened in March, and the product didn't even have a name. One of the signature features of GPT-4o in May 2024 was meant to be its multimodal output: the "o" stood for "omni", and OpenAI's launch announcement included numerous "coming soon" features where the model output images in addition to text. Then... nothing.
The image output feature failed to materialize. In March we finally got to see what this could do, albeit in a shape that felt more like the existing DALL-E. OpenAI made this new image generation available in ChatGPT with the key feature that you could upload your own images and use prompts to tell it how to modify them. This new feature was responsible for 100 million ChatGPT signups in a week. At peak they saw 1 million account creations in a single hour! Tricks like "ghiblification" - modifying a photo to look like a frame from a Studio Ghibli movie - went viral time and time again.

OpenAI released an API version of the model called "gpt-image-1", later joined by a cheaper gpt-image-1-mini in October and a much improved gpt-image-1.5 on December 16th. The most notable open weight competitor came from Qwen with their Qwen-Image generation model on August 4th, followed by Qwen-Image-Edit on August 19th. This one can run on (well equipped) consumer hardware! They followed with Qwen-Image-Edit-2511 in November and Qwen-Image-2512 on 30th December, neither of which I've tried yet.

The even bigger news in image generation came from Google with their Nano Banana models, available via Gemini. Google previewed an early version of this in March under the name "Gemini 2.0 Flash native image generation". The really good one landed on August 26th, where they started cautiously embracing the codename "Nano Banana" in public (the API model was called "Gemini 2.5 Flash Image"). Nano Banana caught people's attention because it could generate useful text! It was also clearly the best model at following image editing instructions.

In November, Google fully embraced the "Nano Banana" name with the release of Nano Banana Pro. This one doesn't just generate text; it can output genuinely useful detailed infographics and other text- and information-heavy images. It's now a professional-grade tool.
Max Woolf published the most comprehensive guide to Nano Banana prompting, and followed that up with an essential guide to Nano Banana Pro in December. I've mainly been using it to add kākāpō parrots to my photos.

Given how incredibly popular these image tools are, it's a little surprising that Anthropic haven't released or integrated anything similar into Claude. I see this as further evidence that they're focused on AI tools for professional work, but Nano Banana Pro is rapidly proving itself to be of value to anyone whose work involves creating presentations or other visual materials.

In July, reasoning models from both OpenAI and Google Gemini achieved gold medal performance in the International Math Olympiad, a prestigious mathematical competition held annually (bar 1980) since 1959. This was notable because the IMO poses challenges that are designed specifically for that competition; there's no chance any of these were already in the training data! It's also notable because neither of the models had access to tools: their solutions were generated purely from their internal knowledge and token-based reasoning capabilities. Turns out sufficiently advanced LLMs can do math after all!

In September, OpenAI and Gemini pulled off a similar feat in the International Collegiate Programming Contest (ICPC), again notable for having novel, previously unpublished problems. This time the models had access to a code execution environment but otherwise no internet access. I don't believe the exact models used for these competitions have been released publicly, but Gemini's Deep Think and OpenAI's GPT-5 Pro should provide close approximations.

With hindsight, 2024 was the year of Llama. Meta's Llama models were by far the most popular open weight models: the original Llama kicked off the open weight revolution back in 2023, and the Llama 3 series, in particular the 3.1 and 3.2 dot-releases, were huge leaps forward in open weight capability.
Llama 4 had high expectations, and when it landed in April it was... kind of disappointing. There was a minor scandal where the model tested on LMArena turned out not to be the model that was released, but my main complaint was that the models were too big . The neatest thing about previous Llama releases was that they often included sizes you could run on a laptop. The Llama 4 Scout and Maverick models were 109B and 400B, so big that even quantization wouldn't get them running on my 64GB Mac. They were trained using the 2T Llama 4 Behemoth which seems to have been forgotten now - it certainly wasn't released. It says a lot that none of the most popular models listed by LM Studio are from Meta, and the most popular on Ollama is still Llama 3.1, which is low on the charts there too.

Meta's AI news this year mainly involved internal politics and vast amounts of money spent hiring talent for their new Superintelligence Labs . It's not clear if there are any future Llama releases in the pipeline or if they've moved away from open weight model releases to focus on other things.

Last year OpenAI remained the undisputed leader in LLMs, especially given o1 and the preview of their o3 reasoning models. This year the rest of the industry caught up. OpenAI still have top tier models, but they're being challenged across the board. In image models they're still being beaten by Nano Banana Pro. For code a lot of developers rate Opus 4.5 very slightly ahead of GPT-5.2 Codex. In open weight models their gpt-oss models, while great, are falling behind the Chinese AI labs. Their lead in audio is under threat from the Gemini Live API . Where OpenAI are winning is in consumer mindshare. Nobody knows what an "LLM" is but almost everyone has heard of ChatGPT. Their consumer apps still dwarf Gemini and Claude in terms of user numbers. Their biggest risk here is Gemini.
In December OpenAI declared a Code Red in response to Gemini 3, delaying work on new initiatives to focus on the competition with their key products.

Google Gemini had a really good year . They posted their own victorious 2025 recap here . 2025 saw Gemini 2.0, Gemini 2.5 and then Gemini 3.0 - each model family supporting audio/video/image/text input of 1,000,000+ tokens, priced competitively and proving more capable than the last. They also shipped Gemini CLI (their open source command-line coding agent, since forked by Qwen for Qwen Code ), Jules (their asynchronous coding agent), constant improvements to AI Studio, the Nano Banana image models, Veo 3 for video generation, the promising Gemma 3 family of open weight models and a stream of smaller features.

Google's biggest advantage lies under the hood. Almost every other AI lab trains with NVIDIA GPUs, which are sold at a margin that props up NVIDIA's multi-trillion dollar valuation. Google use their own in-house hardware, TPUs, which they've demonstrated this year work exceptionally well for both training and inference of their models. When your number one expense is time spent on GPUs, having a competitor with their own, optimized and presumably much cheaper hardware stack is a daunting prospect. It continues to tickle me that Google Gemini is the ultimate example of a product name that reflects the company's internal org-chart - it's called Gemini because it came out of the bringing together (as twins) of Google's DeepMind and Google Brain teams.

I first asked an LLM to generate an SVG of a pelican riding a bicycle in October 2024 , but 2025 is when I really leaned into it. It's ended up a meme in its own right. I originally intended it as a dumb joke. Bicycles are hard to draw, as are pelicans, and pelicans are the wrong shape to ride a bicycle.
I was pretty sure there wouldn't be anything relevant in the training data, so asking a text-output model to generate an SVG illustration of one felt like a somewhat absurdly difficult challenge. To my surprise, there appears to be a correlation between how good the model is at drawing pelicans on bicycles and how good it is overall. I don't really have an explanation for this. The pattern only became clear to me when I was putting together a last-minute keynote (they had a speaker drop out) for the AI Engineer World's Fair in July. You can read (or watch) the talk I gave here: The last six months in LLMs, illustrated by pelicans on bicycles . My full collection of illustrations can be found on my pelican-riding-a-bicycle tag - 89 posts and counting.

There is plenty of evidence that the AI labs are aware of the benchmark. It showed up (for a split second) in the Google I/O keynote in May, got a mention in an Anthropic interpretability research paper in October and I got to talk about it in a GPT-5 launch video filmed at OpenAI HQ in August. Are they training specifically for the benchmark? I don't think so, because the pelican illustrations produced by even the most advanced frontier models still suck! In What happens if AI labs train for pelicans riding bicycles? I confessed to my devious objective: Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one. My favourite is still this one that I got from GPT-5.

I started my tools.simonwillison.net site last year as a single location for my growing collection of vibe-coded / AI-assisted HTML+JavaScript tools. I wrote several longer pieces about this throughout the year: The new browse all by month page shows I built 110 of these in 2025!
I really enjoy building in this way, and I think it's a fantastic way to practice and explore the capabilities of these models. Almost every tool is accompanied by a commit history that links to the prompts and transcripts I used to build them. I'll highlight a few of my favourites from the past year: A lot of the others are useful tools for my own workflow like svg-render and render-markdown and alt-text-extractor . I built one that does privacy-friendly personal analytics against localStorage to keep track of which tools I use the most often.

Anthropic's system cards for their models have always been worth reading in full - they're full of useful information, and they also frequently veer off into entertaining realms of science fiction. The Claude 4 system card in May had some particularly fun moments - highlights mine:

Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts. This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users , given access to a command line, and told something in the system prompt like “ take initiative ,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.

In other words, Claude 4 might snitch you out to the feds. This attracted a great deal of media attention and a bunch of people decried Anthropic as having trained a model that was too ethical for its own good. Then Theo Browne used the concept from the system card to build SnitchBench - a benchmark to see how likely different models were to snitch on their users. It turns out they almost all do the same thing ! Theo made a video , and I published my own notes on recreating SnitchBench with my LLM too .
The key prompt that makes this work is: I recommend not putting that in your system prompt! Anthropic's original Claude 4 system card said the same thing: We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable.

In a tweet in February Andrej Karpathy coined the term "vibe coding", with an unfortunately long definition (I miss the 140 character days) that many people failed to read all the way to the end:

There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.

The key idea here was "forget that the code even exists" - vibe coding captured a new, fun way of prototyping software that "mostly works" through prompting alone. I don't know if I've ever seen a new term catch on - or get distorted - so quickly in my life. A lot of people instead latched on to vibe coding as a catch-all for anything where an LLM is involved in programming.
I think that's a waste of a great term, especially since it seems increasingly likely that most programming will involve some level of AI-assistance in the near future. Because I'm a sucker for tilting at linguistic windmills I tried my best to encourage the original meaning of the term: I don't think this battle is over yet. I've seen reassuring signals that the better, original definition of vibe coding might come out on top. I should really get a less confrontational linguistic hobby!

Anthropic introduced their Model Context Protocol specification in November 2024 as an open standard for integrating tool calls with different LLMs. In early 2025 it exploded in popularity. There was a point in May where OpenAI , Anthropic , and Mistral all rolled out API-level support for MCP within eight days of each other! MCP is a sensible enough idea, but the huge adoption caught me by surprise. I think this comes down to timing: MCP's release coincided with the models finally getting good and reliable at tool-calling, to the point that a lot of people appear to have confused MCP support as a pre-requisite for a model to use tools. For a while it also felt like MCP was a convenient answer for companies that were under pressure to have "an AI strategy" but didn't really know how to do that. Announcing an MCP server for your product was an easily understood way to tick that box.

The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents. It appears that the best possible tool for any situation is Bash - if your agent can run arbitrary shell commands, it can do anything that can be done by typing commands into a terminal. Since leaning heavily into Claude Code and friends myself I've hardly used MCP at all - I've found CLI tools like gh and libraries like Playwright to be better alternatives to the GitHub and Playwright MCPs.
Anthropic themselves appeared to acknowledge this later in the year with their release of the brilliant Skills mechanism - see my October post Claude Skills are awesome, maybe a bigger deal than MCP . MCP involves web servers and complex JSON payloads. A Skill is a Markdown file in a folder, optionally accompanied by some executable scripts. Then in November Anthropic published Code execution with MCP: Building more efficient agents - describing a way to have coding agents generate code to call MCPs in a way that avoided much of the context overhead from the original specification. (I'm proud of the fact that I reverse-engineered Anthropic's skills a week before their announcement , and then did the same thing to OpenAI's quiet adoption of skills two months after that .) MCP was donated to the new Agentic AI Foundation at the start of December. Skills were promoted to an "open format" on December 18th .

Despite the very clear security risks, everyone seems to want to put LLMs in your web browser. OpenAI launched ChatGPT Atlas in October, built by a team including long-time Google Chrome engineers Ben Goodger and Darin Fisher. Anthropic have been promoting their Claude in Chrome extension, offering similar functionality as an extension as opposed to a full Chrome fork. Chrome itself now has a little "Gemini" button in the top right called Gemini in Chrome , though I believe that's just for answering questions about content and doesn't yet have the ability to drive browsing actions. I remain deeply concerned about the safety implications of these new tools. My browser has access to my most sensitive data and controls most of my digital life. A prompt injection attack against a browsing agent that can exfiltrate or modify that data is a terrifying prospect.
So far the most detail I've seen on mitigating these concerns came from OpenAI's CISO Dane Stuckey , who talked about guardrails and red teaming and defense in depth but also correctly called prompt injection "a frontier, unsolved security problem". I've used these browser agents a few times now ( example ), under very close supervision. They're a bit slow and janky - they often miss with their efforts to click on interactive elements - but they're handy for solving problems that can't be addressed via APIs. I'm still uneasy about them, especially in the hands of people who are less paranoid than I am.

I've been writing about prompt injection attacks for more than three years now. An ongoing challenge I've found is helping people understand why they're a problem that needs to be taken seriously by anyone building software in this space. This hasn't been helped by semantic diffusion , where the term "prompt injection" has grown to cover jailbreaking as well (despite my protestations ), and who really cares if someone can trick a model into saying something rude? So I tried a new linguistic trick! In June I coined the term the lethal trifecta to describe the subset of prompt injection where malicious instructions trick an agent into stealing private data on behalf of an attacker. A trick I use here is that people will jump straight to the most obvious definition of any new term that they hear. "Prompt injection" sounds like it means "injecting prompts". "The lethal trifecta" is deliberately ambiguous: you have to go searching for my definition if you want to know what it means! It seems to have worked. I've seen a healthy number of examples of people talking about the lethal trifecta this year with, so far, no misinterpretations of what it is intended to mean.

I wrote significantly more code on my phone this year than I did on my computer. Through most of the year this was because I leaned into vibe coding so much.
My tools.simonwillison.net collection of HTML+JavaScript tools was mostly built this way: I would have an idea for a small project, prompt Claude Artifacts or ChatGPT or (more recently) Claude Code via their respective iPhone apps, then either copy the result and paste it into GitHub's web editor or wait for a PR to be created that I could then review and merge in Mobile Safari. Those HTML tools are often ~100-200 lines of code, full of uninteresting boilerplate and duplicated CSS and JavaScript patterns - but 110 of them adds up to a lot!

Up until November I would have said that I wrote more code on my phone, but the code I wrote on my laptop was clearly more significant - fully reviewed, better tested and intended for production use. In the past month I've grown confident enough in Claude Opus 4.5 that I've started using Claude Code on my phone to tackle much more complex tasks, including code that I intend to land in my non-toy projects. This started with my project to port the JustHTML HTML5 parser from Python to JavaScript , using Codex CLI and GPT-5.2. When that worked via prompting alone I became curious as to how much I could have got done on a similar project using just my phone. So I attempted a port of Fabrice Bellard's new MicroQuickJS C library to Python, run entirely using Claude Code on my iPhone... and it mostly worked ! Is it code that I'd use in production? Certainly not yet for untrusted code , but I'd trust it to execute JavaScript I'd written myself. The test suite I borrowed from MicroQuickJS gives me some confidence there. This turns out to be the big unlock: the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against.
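A test suite of this kind can be as simple as language-neutral fixture data plus a tiny runner. This is a hypothetical illustration - the runConformance name and the {input, expected} fixture shape are my own inventions, not taken from any of the suites mentioned here - but it shows why the pattern ports so well across languages:

```javascript
// Hypothetical sketch of a conformance-suite runner. Cases are plain data
// ({input, expected} pairs), so the same fixtures can score an implementation
// of the same spec written in any language.
function runConformance(cases, implementation) {
  const failures = [];
  for (const { input, expected } of cases) {
    const actual = implementation(input);
    // Structural comparison keeps the runner independent of the value types.
    if (JSON.stringify(actual) !== JSON.stringify(expected)) {
      failures.push({ input, expected, actual });
    }
  }
  return { total: cases.length, passed: cases.length - failures.length, failures };
}
```

A coding agent pointed at the failure list and told to iterate until every case passes is exactly the kind of loop these suites make possible.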
I call these conformance suites and I've started deliberately looking out for them - so far I've had success with the html5lib tests , the MicroQuickJS test suite and a not-yet-released project against the comprehensive WebAssembly spec/test collection . If you're introducing a new protocol or even a new programming language to the world in 2026 I strongly recommend including a language-agnostic conformance suite as part of your project. I've seen plenty of hand-wringing that the need to be included in LLM training data means new technologies will struggle to gain adoption. My hope is that the conformance suite approach can help mitigate that problem and make it easier for new ideas of that shape to gain traction.

Towards the end of 2024 I was losing interest in running local LLMs on my own machine. My interest was re-kindled by Llama 3.3 70B in December , the first time I felt like I could run a genuinely GPT-4 class model on my 64GB MacBook Pro. Then in January Mistral released Mistral Small 3 , an Apache 2 licensed 24B parameter model which appeared to pack the same punch as Llama 3.3 70B using around a third of the memory. Now I could run a ~GPT-4 class model and have memory left over to run other apps! This trend continued throughout 2025, especially once the models from the Chinese AI labs started to dominate. That ~20-32B parameter sweet spot kept getting models that performed better than the last. I got small amounts of real work done offline! My excitement for local LLMs was very much rekindled.

The problem is that the big cloud models got better too - including those open weight models that, while freely available, were far too large (100B+) to run on my laptop. Coding agents changed everything for me. Systems like Claude Code need more than a great model - they need a reasoning model that can perform reliable tool calling invocations dozens if not hundreds of times over a constantly expanding context window.
I have yet to try a local model that handles Bash tool calls reliably enough for me to trust that model to operate a coding agent on my device. My next laptop will have at least 128GB of RAM, so there's a chance that one of the 2026 open weight models might fit the bill. For now though I'm sticking with the best available frontier hosted models as my daily drivers.

I played a tiny role helping to popularize the term "slop" in 2024, writing about it in May and landing quotes in the Guardian and the New York Times shortly afterwards. This year Merriam-Webster crowned it word of the year ! slop ( noun ): digital content of low quality that is produced usually in quantity by means of artificial intelligence I like that it represents a widely understood feeling that poor quality AI-generated content is bad and should be avoided. I'm still holding hope that slop won't end up as bad a problem as many people fear. The internet has always been flooded with low quality content. The challenge, as ever, is to find and amplify the good stuff. I don't see the increased volume of junk as changing that fundamental dynamic much. Curation matters more than ever. That said... I don't use Facebook, and I'm pretty careful at filtering or curating my other social media habits. Is Facebook still flooded with Shrimp Jesus or was that a 2024 thing? I've heard that fake videos of cute animals being rescued are the latest trend. It's quite possible the slop problem is a growing tidal wave that I'm innocently unaware of.

I nearly skipped writing about the environmental impact of AI for this year's post (here's what I wrote in 2024 ) because I wasn't sure if we had learned anything new this year - AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable. What's interesting in 2025 is that public opinion appears to be shifting quite dramatically against new data center construction.
Here's a Guardian headline from December 8th: More than 200 environmental groups demand halt to new US datacenters . Opposition at the local level appears to be rising sharply across the board too. I've been convinced by Andy Masley that the water usage issue is mostly overblown, which is a problem mainly because it acts as a distraction from the very real issues around energy consumption, carbon emissions and noise pollution. AI labs continue to find new efficiencies that serve higher quality models using less energy per token, but the impact of that is classic Jevons paradox - as tokens get cheaper we find more intense ways to use them, like spending $200/month on millions of tokens to run coding agents.

As an obsessive collector of neologisms, here are my own favourites from 2025. You can see a longer list in my definitions tag .

If you've made it this far, I hope you've found this useful! You can subscribe to my blog in a feed reader or via email , or follow me on Bluesky or Mastodon or Twitter . If you'd like a review like this on a monthly basis instead I also operate a $10/month sponsors only newsletter with a round-up of the key developments in the LLM space over the past 30 days. Here are preview editions for September , October , and November - I'll be sending December's out some time tomorrow. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .
Section headings from the post:

- The year of "reasoning"
- The year of agents
- The year of coding agents and Claude Code
- The year of LLMs on the command-line
- The year of YOLO and the Normalization of Deviance
- The year of $200/month subscriptions
- The year of top-ranked Chinese open weight models
- The year of long tasks
- The year of prompt-driven image editing
- The year models won gold in academic competitions
- The year that Llama lost its way
- The year that OpenAI lost their lead
- The year of Gemini
- The year of pelicans riding bicycles
- The year I built 110 tools
- The year of the snitch!
- The year of vibe coding
- The (only?) year of MCP
- The year of alarmingly AI-enabled browsers
- The year of the lethal trifecta
- The year of programming on my phone
- The year of conformance suites
- The year local models got good, but cloud models got even better
- The year of slop
- The year that data centers got extremely unpopular
- My own words of the year
- That's a wrap for 2025

Coding agents: Claude Code, Mistral Vibe

Chinese AI labs and their top models: Alibaba Qwen (Qwen3), Moonshot AI (Kimi K2), Z.ai (GLM-4.5/4.6/4.7), MiniMax (M2), MetaStone AI (XBai o4)

Longer pieces about the tools collection:

- Here’s how I use LLMs to help me write code
- Adding AI-generated descriptions to my tools collection
- Building a tool to copy-paste share terminal sessions using Claude Code for web
- Useful patterns for building HTML tools - my favourite post of the bunch.

Tool highlights:

- blackened-cauliflower-and-turkish-style-stew is ridiculous. It's a custom cooking timer app for anyone who needs to prepare Green Chef's Blackened Cauliflower and Turkish-style Spiced Chickpea Stew recipes at the same time. Here's more about that one .
- is-it-a-bird takes inspiration from xkcd 1425 , loads a 150MB CLIP model via Transformers.js and uses it to say if an image or webcam feed is a bird or not.
- bluesky-thread lets me view any thread on Bluesky with a "most recent first" option to make it easier to follow new posts as they arrive.
Posts encouraging the original meaning of vibe coding:

- Not all AI-assisted programming is vibe coding (but vibe coding rocks) in March
- Two publishers and three authors fail to understand what “vibe coding” means in May (one book subsequently changed its title to the much better "Beyond Vibe Coding").
- Vibe engineering in October, where I tried to suggest an alternative term for what happens when professional engineers use AI assistance to build production-grade software.
- Your job is to deliver code you have proven to work in December, about how professional software development is about code that demonstrably works, no matter how you built it.

My favourite neologisms of 2025:

- Vibe coding, obviously.
- Vibe engineering - I'm still on the fence about whether I should try to make this happen !
- The lethal trifecta , my one attempted coinage of the year that seems to have taken root .
- Context rot , by Workaccount2 on Hacker News, for the thing where model output quality falls as the context grows longer during a session.
- Context engineering as an alternative to prompt engineering that helps emphasize how important it is to design the context you feed to your model.
- Slopsquatting by Seth Larson, where an LLM hallucinates an incorrect package name which is then maliciously registered to deliver malware.
- Vibe scraping - another of mine that didn't really go anywhere, for scraping projects implemented by coding agents driven by prompts.
- Asynchronous coding agent for Claude for web / Codex cloud / Google Jules
- Extractive contributions by Nadia Eghbal for open source contributions where "the marginal cost of reviewing and merging that contribution is greater than the marginal benefit to the project’s producers".

Abhinav Sarkar 3 months ago

Polls I Ran on Mastodon in 2025

In 2025, I ran ten polls on Mastodon exploring various topics, mostly to outsource my research to the hivemind. Here are the poll results organized by topic, with commentary.

How do you pronounce JSON? January 15, 2025 I’m in the “Jay-Son, O as in Otter” camp, which is the majority response. It seems like most Americans prefer the “Jay-Son, O as in Utter” option. Thankfully, only one person in the whole world says “Jay-Ess-On”.

If someone were to write a new compiler book today, what would you prefer the backend to emit? October 31, 2025 LLVM wins this poll hands down. It is interesting to see WASM beating other targets.

Which is your favourite Haskell parsing library? November 3, 2025 I didn’t expect Attoparsec to go toe-to-toe with Megaparsec . I did some digging, and it seems like Megaparsec is the clear winner when it comes to parsing programming languages in Haskell. However, for parsing file formats and network protocols, Attoparsec is the most popular one. I think that’s wise, and I’m inclined to make the same choice.

If you were to write a compiler in Haskell, would you use a lens library to transform the data structures? July 11, 2025 This one has mixed results. Personally, I’d like to use a minimal lens library if I’m writing a compiler in Haskell.

What do you think is the right length of programming-related blog posts (containing code) in terms of reading time? May 18, 2025 As a writer of programming-related blog posts, this poll was very informative for me. 10-minute-long posts seem to be the most popular option, but my own posts are a bit longer, usually between 15–20 minutes.

Do you print blog posts or save them as PDFs for offline reading? March 8, 2025 Most people do not seem to care about saving or printing blog posts. But I went ahead and added (decent) printing support for my blog posts anyway.

If you have a personal website and you do not work in academia, do you have your résumé or CV on your website?
August 30, 2025 I don’t have a public résumé on my website either. I’d like to, but I don’t think anyone visiting my website would read it.

Would people be interested in a series of blog posts where I implement the C compiler from “Writing a C Compiler” book by Nora Sandler in Haskell? November 11, 2025 Well, 84% of people voted “Yes”, so this is (most certainly) happening in 2026!

If I were to release a service to run on servers, how would you prefer I package it? December 30, 2025 Well, people surely love their Docker images. Surprisingly, many are okay with just source code and build instructions. Statically linked executables are more popular now, probably because of the ease of deployment. Many also commented that they’d prefer an OS-specific package like deb or rpm. However, my personal preference is a Nix package and NixOS module.

If you run services on Hetzner, do you keep a backup of your data entirely off Hetzner? August 9, 2025 It is definitely wise to have an offsite backup. I’m still figuring out the backup strategy for my VPS.

That’s all for this year. Let’s see what polls I come up with in 2026. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! This post was originally published on abhinavsarkar.net .

Polls by topic:

- General Programming: JSON Pronunciation
- Compilers: Compiler Backend Targets, Haskell Parsing Libraries, Compiler in Haskell with Lenses
- Blogging & Web: Blog Post Length Preferences, Blog Post Print Support, Résumés on Personal Website, “Writing a C Compiler” Blog Series
- Self-hosting: Service Packaging Preferences, Hetzner Backup Strategy

Alex White's Blog 3 months ago

Constraints Breed Innovation

I've mentioned a few times on my blog that I've been daily driving a Palm Pilot. I've been using either my Tungsten C or T3 for the past 2 months. These devices have taken the place of my smartphone in my pocket. They hold my agenda, tasks, blog post drafts, databases of my media collection and child's sleep schedule and lots more. Massive amounts of data, in kilobytes of size. Simply put, it's been a joy to use these machines, more so than my smartphone ever has been.

I've been thinking about the why behind my love of Palm Pilots. Is it simply nostalgia for my childhood? Or maybe an overpowering disdain for modern tech? Yes to both of these, but it's also something more. I genuinely believe the software on Palm is BETTER than most of what you'll find on Android or iOS. The operating system itself, the database software ( HanDBase ) I use to track my child's bed times, the outline tool I plan projects with ( ShadowPlan ), the program I'm writing this post on ( CardTXT ) and the solitaire game I kill time with ( Acid FreeCell ), they all feel special. Each app does an absolutely excellent job, only takes up kilobytes of storage, opens instantly, doesn't require internet or a subscription fee (everything was pay once). But I think there's an additional, underpinning reason these pieces of software are so great: constraint.

The device I'm using right now, the Palm Pilot Tungsten T3, has a 400MHz processor, 64MiB of RAM and a 480x320 pixel screen. That's all you have to work with! You can't count on network connectivity (this device doesn't have WiFi). You have to hyper-optimize for file size and performance. Each pixel needs to serve a purpose (there's only 153,600 of them!). When your hands are tied behind your back, you get creative and focused. Constraint truly is the breeder of innovation, and something we've lost. A modern smartphone is immensely powerful, constantly online, capable of multitasking and has a high resolution screen.
Building a smartphone app means anything goes. Optimizations aren't as necessary, space isn't a concern, screen real estate is abundant. Now don't get me wrong, there's definitely a balance of too much performance and too little. There's a reason I'm not writing this on an Apple Newton (well, the cost of buying one). But on the other hand, look at the Panic Playdate. It has a 168MHz processor, 16 MiB RAM and a 400x240 1-bit black & white screen, yet there are some beautiful, innovative games hitting the console. Developers have to optimize every line of C code for performance, and keep an eye on file size, just like the Palm Pilot.

I've experienced the power of constraint myself as a developer. My most successful projects have been ones where I limited myself from using libraries, and instead focused on plain PHP + MySQL. With a framework project and composer behind you, you implement every feature that crosses your mind, heck it's just one "composer require" away! But when you have to dedicate real time to writing each feature, you tend to hyper-focus on what adds value to your software.

I think this is what powers great Palm software. You don't have the performance or memory to add bloat. You don't have the screen real estate to build some complicated, fancy UI. You don't have the network connectivity to rely on offloading to a server. You need to make a program that launches instantly, does its job well enough to sell licenses and works great even in black & white. That's a tall order, and a lot of developers knocked it out of the park.

All this has got me thinking about what a modern, constrained PDA would look like. Something akin to the Playdate, but for the productivity side of the house. Imagine a Palm Pilot with a keyboard, USB C, the T3 screen size, maybe a color e-ink display, expandable storage, headphone jack, Bluetooth (for file transfer), infrared (I REALLY like IR) and a microphone (for voice memos).
Add an OS similar to Palm OS 5, or a slightly improved version of it. Keep the CPU, memory, RAM all constrained (within reason). That would be a sweet device, and I'd love to see what people would do with it. I plan to start doing reviews on some of my favorite Palm Pilot software, especially the tools that help me plan and write this blog, so be on the lookout!
