Posts in Rust (20 found)
Jeff Geerling 3 days ago

How to Securely Erase an old Hard Drive on macOS Tahoe

Apparently Apple thinks nobody with a modern Mac uses spinning rust (hard drives with platters) anymore. I plugged a hard drive from an old iMac into my Mac Studio using my Sabrent USB-to-SATA hard drive enclosure, opened up Disk Utility, clicked on the top-level disk in the sidebar, and clicked 'Erase'. Lo and behold, there's no 'Security Options' button there, as there had been since—I believe—the very first version of Disk Utility in Mac OS X!
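The command line does still appear to expose secure-erase levels even where the GUI has dropped the button. A rough sketch — the disk identifier `disk4` here is hypothetical, so double-check it against the `diskutil list` output (and the level table in `man diskutil`) before running anything destructive:

```shell
# Find the identifier of the external SATA disk (e.g. disk4 -- hypothetical).
diskutil list

# Unmount the disk, then overwrite it. Level 0 is a single pass of zeros;
# higher levels (1-4) use random data and/or multiple passes.
# WARNING: this irreversibly destroys everything on the target disk.
sudo diskutil unmountDisk /dev/disk4
sudo diskutil secureErase 0 /dev/disk4
```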

<antirez> 5 days ago

Implementing a clean room Z80 / ZX Spectrum emulator with Claude Code

Anthropic recently released a blog post describing an experiment in which the latest version of Opus, 4.6, was instructed to write a C compiler in Rust, in a “clean room” setup. The experiment methodology left me dubious about the kind of point they wanted to make. Why not provide the agent with the ISA documentation? Why Rust? Writing a C compiler is essentially a giant graph manipulation exercise: the kind of program that is harder to write in Rust. Also, in a clean room experiment, the agent should have access to all the information about well-established computer science progress related to optimizing compilers: there are a number of papers that could easily be synthesized into a handful of markdown files. SSA, register allocation, instruction selection and scheduling. Those things needed to be researched *first*, as a prerequisite, and the implementation would still be “clean room”. Not allowing the agent to access the Internet, nor any other compiler source code, was certainly the right call. Less understandable is the almost-zero steering principle, but this is coherent with a certain kind of experiment, if the goal was showcasing the completely autonomous writing of a large project. Yet we all know this is not how coding agents are used in practice, most of the time. Anyone who uses coding agents extensively knows very well that, even without touching the code, a few hints here and there completely change the quality of the result.

# The Z80 experiment

I thought it was time to try a similar experiment myself, one that would take one or two hours at most, and that was compatible with my Claude Code Max plan: I decided to write a Z80 emulator, and then a ZX Spectrum emulator (and even more, a CP/M emulator, see later) under conditions that I believe make more sense as a “clean room” setup. The result can be found here: https://github.com/antirez/ZOT.

# The process I used

1. I wrote a markdown file with the specification of what I wanted to do.
Just English, high-level ideas about the scope of the Z80 emulator to implement. I said things like: it should execute a whole instruction at a time, not a single clock step, since this emulator must be runnable on things like an RP2350 or similarly limited hardware. The emulator should correctly track the clock cycles elapsed (and I specified we could use this feature later to implement the ZX Spectrum contention with the ULA during memory accesses), provide memory access callbacks, and should emulate all the known official and unofficial instructions of the Z80. For the Spectrum implementation, performed as a successive step, I provided much more information in the markdown file: the kind of rendering I wanted in the RGB buffer, and how it needed to be optional so that embedded devices could render the scanlines directly as they transferred them to the ST77xx display (or similar); how it should be possible to interact with the I/O port to set the EAR bit to simulate cassette loading in a very authentic way; and many other desiderata I had about the emulator. This file also included the rules that the agent needed to follow, like:

* Accessing the internet is prohibited, but you can use the specification and test vectors files I added inside ./z80-specs.
* Code should be simple and clean, never over-complicate things.
* Each solid bit of progress should be committed to the git repository.
* Before committing, you should test that what you produced is high quality and that it works.
* Write a detailed test suite as you add more features. The tests must be re-executed at every major change.
* Code should be very well commented: things must be explained in terms that even people not well versed in certain Z80 or Spectrum internals should understand.
* Never stop for prompting, the user is away from the keyboard.
* At the end of this file, create a work-in-progress log, where you note what you already did and what is missing. Always update this log.
* Read this file again after each context compaction.

2. Then I started a Claude Code session and asked it to fetch all the useful documentation on the internet about the Z80 (later I did this for the Spectrum as well), and to extract only the useful factual information into markdown files. I also provided the binary files for the most ambitious test vectors for the Z80, the ZX Spectrum ROM, and a few other binaries that could be used to test whether the emulator actually executed the code correctly. Once all this information was collected (it is part of the repository, so you can inspect what was produced), I completely removed the Claude Code session to make sure that no contamination with source code seen during the search was possible.

3. I started a new session and asked it to check the specification markdown file and all the documentation available, and to start implementing the Z80 emulator. The rules were to never access the Internet for any reason (I supervised the agent while it was implementing the code, to make sure this didn’t happen) and to never search the disk for similar source code, as this was a “clean room” implementation.

4. For the Z80 implementation, I did zero steering. For the Spectrum implementation I used extensive steering for implementing the TAP loading. More about my feedback to the agent later in this post.

5. As a final step, I copied the repository to /tmp, removed the “.git” repository files completely, started a new Claude Code (and Codex) session, and claimed that the implementation was likely stolen or too strongly inspired by somebody else's work. The task was to check against all the major Z80 implementations whether there was evidence of theft. The agents (both Codex and Claude Code), after extensive search, were not able to find any evidence of copyright issues.
The only similar parts concerned well-established emulation patterns and things that are Z80-specific and can’t be done differently; the implementation looked distinct from all the other implementations in a significant way.

# Results

Claude Code worked for 20 or 30 minutes in total, and produced a Z80 emulator that was able to pass ZEXDOC and ZEXALL, in 1200 lines of very readable C code (1800 lines including comments and blank lines). The agent was prompted zero times during the implementation; it acted absolutely alone. It never accessed the internet, and the process it used to implement the emulator was one of continuous testing, interacting with the CP/M binaries implementing ZEXDOC and ZEXALL, writing just the CP/M syscalls needed to produce the output on the screen. Multiple times it also used the Spectrum ROM and other binaries that were available, or binaries it created from scratch, to see if the emulator was working correctly. In short: the implementation was performed in a way very similar to how a human programmer would do it, not by outputting a complete implementation from scratch, “uncompressing” it from the weights. Instead, different classes of instructions were implemented incrementally, and there were bugs that were fixed via integration tests, debugging sessions, dumps, printf calls, and so forth.

# Next step: the ZX Spectrum

I repeated the process again. I instructed the documentation-gathering session very precisely about the kind of details I wanted it to search for on the internet, especially the ULA interactions with RAM access, the keyboard mapping, the I/O port, how the cassette tape worked and the kind of PWM encoding used, and how it was encoded into TAP or TZX files.
As I said, this time the design notes were extensive, since I wanted this emulator to be specifically designed for embedded systems: only 48K emulation, optional framebuffer rendering, very little additional memory used (no big lookup tables for ULA/Z80 access contention), ROM not copied into RAM to avoid using an additional 16K of memory, but just referenced during initialization (so we have just the copy in the executable), and so forth. The agent was able to create very detailed documentation about the ZX Spectrum internals. I provided a few .z80 images of games, so that it could test the emulator in a real setup with real software. Again, I removed the session and started fresh.

The agent started working and finished 10 minutes later, following a process that really fascinates me, and that probably you know very well: you see the agent working using a number of diverse skills. It is expert in everything programming-related, so as it was implementing the emulator, it could immediately write detailed instrumentation code to “look” at what the Z80 was doing step by step, and how this changed the Spectrum emulation state. In this respect, I believe automatic programming is already super-human, not in the sense that it is currently capable of producing code that humans can’t produce, but in the concurrent usage of different programming languages, system programming techniques, DSP stuff, operating system tricks, math, and everything needed to reach the result in the most immediate way.

When it was done, I asked it to write a simple SDL-based integration example. The emulator was immediately able to run the Jetpac game without issues, with working sound, and very little CPU usage even on my slow Dell Linux machine (8% of a single core, including SDL rendering). Once the basic stuff was working, I wanted to load TAP files directly, simulating cassette loading.
This was the first time the agent missed a few things, specifically about the timing the Spectrum loading routines expected, and here we are in the territory where LLMs start to perform less efficiently: they can’t easily run the SDL emulator and see the border changing as data is received and so forth. I asked Claude Code to do a refactoring so that zx_tick() could be called directly instead of being internal to zx_frame(), and to make zx_frame() a trivial wrapper. This way it was much simpler to sync the EAR bit with what the loading routine expected, without callbacks or the wrong abstractions that it had implemented. After such a change, a few minutes later the emulator could load a TAP file, emulating the cassette, without problems. This is how it works now:

```c
/* Feed the tape signal to the EAR input one tick at a time,
 * until the emulator signals the end of the frame. */
do {
    zx_set_ear(zx, tzx_update(&tape, zx->cpu.clocks));
} while (!zx_tick(zx, 0));
```

I continued prompting Claude Code to make the key bindings more useful, plus a few more things.

# CP/M

One thing that I found really interesting was the ability of the LLM to inspect the COM files of the ZEXALL / ZEXDOC tests for the Z80, easily spot the CP/M syscalls that were used (a total of three), and implement them for the extended Z80 test (executed by make fulltest). So, at this point, why not implement a full CP/M environment? Same process again, same good result in a matter of minutes. This time I interacted with it a bit more for the VT100 / ADM-3 terminal escape conversions, reported things not working in WordStar initially, and in a few minutes everything I tested was working well enough (but there are fixes to do, like simulating a 2 MHz clock; right now it runs at full speed, making CP/M games impossible to use).

# What is the lesson here?

The obvious lesson is: always provide your agents with design hints and extensive documentation about what they are going to do. Such documentation can be obtained by the agent itself.
And, also, make sure the agent has a markdown file with the rules for how to perform the coding tasks, and a trace of what it is doing, updated and re-read quite often. But those tricks, I believe, are quite clear to everybody who has worked extensively with automatic programming in recent months. Thinking in terms of “what a human would need” is often the best bet, plus a few LLM-specific things, like the forgetting issue after context compaction, the continuous need to verify it is on the right track, and so forth.

Returning to the Anthropic compiler attempt: one of the steps where the agent failed was the one most strongly related to the idea of memorization of what is in the pretraining set: the assembler. With extensive documentation, I can’t see any way Claude Code (and, even more, GPT5.3-codex, which is, in my experience, more capable for complex stuff) could fail at producing a working assembler, since it is quite a mechanical process. This is, I think, in contradiction with the idea that LLMs memorize the whole training set and uncompress what they have seen. LLMs can memorize certain over-represented documents and code, and while they can extract such verbatim parts if prompted to do so, they don’t have a copy of everything they saw during training, nor do they spontaneously emit copies of already-seen code in their normal operation. We mostly ask LLMs to create work that requires assembling different knowledge they possess, and the result is normally something that uses known techniques and patterns, but that is new code, not a copy of some pre-existing code.
It is worth noting, too, that humans often follow a less rigorous process than the clean room rules detailed in this blog post: humans often download the code of different implementations related to what they are trying to accomplish, read it carefully, then try to avoid copying verbatim, but oftentimes they take strong inspiration. This is a process that I find perfectly acceptable, but it is important to keep in mind what happens in the reality of code written by humans. After all, information technology evolved so fast partly thanks to this massive cross-pollination effect. For all the above reasons, when I implement code using automatic programming, I don’t have problems releasing it MIT licensed, like I did with this Z80 project. In turn, this code base will constitute quality input for the training of the next LLMs, including open-weights ones.

# Next steps

To make my experiment more compelling, one should try to implement a Z80 and ZX Spectrum emulator without providing any documentation to the agent, and then compare the results of the two implementations. I didn’t find the time to do it, but it could be quite informative.

underlap 6 days ago

ipc-channel-mux router support

The IPC channel multiplexing crate, ipc-channel-mux, now includes a “router”. The router provides a means of automatically forwarding messages from subreceivers to Crossbeam receivers so that users can enjoy Crossbeam receiver features, such as selection (explained below). The absence of a router blocked the adoption of the crate by Servo, so it was an important feature to support.

Routing involves running a thread which receives from various subreceivers and forwards the results to Crossbeam channels. Without a separate thread, a receive on one of the Crossbeam receivers would block, and when a message became available on the subchannel, it wouldn’t be forwarded to the Crossbeam channel.

Before we explain routing further, we need to introduce a concept which may be unfamiliar to some readers. Suppose you have a set of data sources – servers, file descriptors, or, in our case, channels – which may or may not be ready to deliver data. To wait for one or more of these to be ready, one option is to poll the items in the set. But if none of the items are ready, what should you do? If you loop around and repeatedly poll the items, you’ll consume a lot of CPU. If you delay for a period of time before polling again and an item becomes ready before the period has elapsed, you won’t notice. So polling either consumes excessive CPU or reduces responsiveness. How do we balance the requirements of efficiency and responsiveness? The solution is to somehow block until at least one item is ready. That’s just what selection does.

In the context of IPC channel, this selection logic applies to a set of receivers, known as an IpcReceiverSet. An IpcReceiverSet holds a set of IPC receivers and, when requested, waits for at least one of the receivers to be ready and then returns a collection of the results from all the receivers which became ready. The purpose of routing is that users, such as Servo, can then select [1] [2] over a heterogeneous collection of IPC receivers and Crossbeam receivers.
By converting IPC receivers into Crossbeam receivers, it’s possible to use Crossbeam channel’s selection feature on a homogeneous collection of Crossbeam receivers to implement a select on the corresponding heterogeneous collection of IPC receivers and Crossbeam receivers.

Routing for ipc-channel-mux has the same requirement: to convert a collection of subreceivers to Crossbeam receivers so that Crossbeam channel’s selection feature can be used on a homogeneous collection of Crossbeam receivers to implement selection on the corresponding heterogeneous collection of subreceivers and Crossbeam receivers.

Let’s look at how this is implemented. The most obvious approach was to mirror the design of IPC channel routing and implement subchannel routing in terms of sets of subreceivers known as SubReceiverSets. Receiving from a collection of subreceivers could be implemented by attempting a non-blocking receive from each subreceiver of the collection in turn and returning any results. However, there is a difficulty: if none of the subreceivers returns a result, what should happen? If we loop around and repeatedly attempt to receive from each subreceiver in the collection, we’ll consume a lot of CPU. If we delay for a certain period of time, we won’t be responsive if a subreceiver becomes ready to return a result. The solution is to somehow block until at least one of the subreceivers is ready to return a result. A SubReceiverSet does just that. It holds a set of subreceivers and, when requested, returns a collection of the results from all the subreceivers which became ready. This is a specific example of the advantage of selection over polling, discussed above.

Remember that the results of a subreceiver are demultiplexed from the results of an IPC receiver (provided by the ipc-channel crate). (A diagram in the original post shows how a MultiReceiver sits between an IpcReceiver and the SubReceivers served by that IpcReceiver.) ipc-channel already implements an IpcReceiverSet.
So a SubReceiverSet can be implemented in terms of an IpcReceiverSet containing all the IPC receivers underlying the subreceivers in the set. There are some complications, however. When a subreceiver is added to a SubReceiverSet, there may be other subreceivers with the same underlying IPC receiver which do not belong to the set, and yet the IpcReceiverSet will return a message that could demultiplex either to a subreceiver in the set or to a subreceiver not in the set. Worse than that, subreceivers with the same underlying IPC receiver may be added to distinct SubReceiverSets. So if we use an IpcReceiverSet to implement a SubReceiverSet, more than one SubReceiverSet may need to share the same IpcReceiverSet.

There is one case where Servo uses an IpcReceiverSet directly, rather than via the router. So one option would be to avoid adding the equivalent of IpcReceiverSet to the API of ipc-channel-mux. Then there would be at most one SubReceiverSet instance, and so some of the complications might not arise. But there’s a danger that it would be possible to encounter the same complication using the router, e.g. if some subreceivers were added to the router and other subreceivers with the same underlying IPC channel as those added to the router were used directly.

Another complication of routing is that the router thread needs to receive messages from subchannels which originate outside that thread. So subreceivers need to be moved into the thread. In Rust terms, they need to be Send. Given that some subreceivers can be moved into the thread while other subreceivers which have not been moved into the thread can share the same underlying IPC channel, subreceivers (or at least substantial parts of their implementation) need to be Sync. To avoid polling, essentially it must be possible for a select operation on a SubReceiverSet to result in a select operation on an IpcReceiverSet comprising the underlying IpcReceiver(s).

I experimented with the situation where some subreceivers were added to the router and other subreceivers with the same underlying IPC channel as those added to the router were used directly.
This resulted in liveness and/or fairness issues when the thread using a subreceiver directly competed with the router thread. Both these threads would attempt to issue a select on an IpcReceiverSet. The cleanest solution initially appeared to be to make both of these depend on the router to issue the select operation. This came with some restrictions though, such as the stand-alone subreceiver not being able to receive any more messages after the router was shut down.

A radical alternative was to restructure the router API so that it would not be possible for some subreceivers to be added to the router while other subreceivers with the same underlying IPC channel were used directly. This may be a reasonable restriction for Servo, because receivers tend to be added to the router soon after the receiver’s channel is created. With this redesigned router API, in which subreceivers destined for routing are hidden from the API, the above liveness and fairness problems can be side-stepped.

v0.0.5 of the ipc-channel-mux crate includes the redesigned router API. v0.0.6 improves the throughput of both subchannel receives and routing. The next step is to try to improve the code structure, since the module has grown considerably and could do with some parts being split into separate modules. After that, I’ll need to see whether some of the missing features relative to ipc-channel [3] need to be added to ipc-channel-mux before it’s ready to be tried out in Servo.

[1] Another possibility, if some of the IPC receivers have been disconnected, is that select can return which IPC receivers have been disconnected. ↩︎
[2] Crossbeam selection is a little more general: it allows the user to wait for operations to complete, each of which may be a send or a receive. An arbitrary one of the completed operations is chosen and its resultant value is returned. ↩︎
[3] The main functional gaps in ipc-channel-mux compared to ipc-channel are shared memory transmission and non-blocking subchannel receive. ↩︎

baby steps 1 week ago

What it means that Ubuntu is using Rust

Righty-ho, I’m back from Rust Nation, and busily horrifying my teenage daughter with my (admittedly atrocious) attempts at doing an English accent. It was a great trip with a lot of good conversations and some interesting observations. I am going to try to blog about some of them, starting with some thoughts spurred by Jon Seager’s closing keynote, “Rust Adoption At Scale with Ubuntu”.

For some time now I’ve been debating with myself: has Rust “crossed the chasm”? If you’re not familiar with that term, it comes from a book that gives a kind of “pop-sci” introduction to the Technology Adoption Life Cycle. The answer, of course, is it depends on who you ask. Within Amazon, where I have the closest view, the answer is that we are “most of the way across”: Rust is squarely established as the right way to build at-scale data planes or resource-aware agents, and it is increasingly seen as the right choice for low-level code in devices and robotics as well – but there remains a lingering perception that Rust is useful for “those fancy pants developers at S3” (or wherever) but a bit overkill for more average development. On the other hand, within the realm of Safety Critical Software, as Pete LeVasseur wrote in a recent rust-lang blog post, Rust is still scrabbling for a foothold. There are a number of successful products, but most of the industry is in a “wait and see” mode, letting the early adopters pave the path.

The big idea that I at least took away from reading Crossing the Chasm and other references on the technology adoption life cycle is the need for “reference customers”. When you first start out with something new, you are looking for pioneers and early adopters that are drawn to new things:

What an early adopter is buying [..] is some kind of change agent. By being the first to implement this change in the industry, the early adopters expect to get a jump on the competition.
– from Crossing the Chasm

But as your technology matures, you have to convince people with a lower and lower tolerance for risk:

The early majority want to buy a productivity improvement for existing operations. They are looking to minimize discontinuity with the old ways. They want evolution, not revolution.

– from Crossing the Chasm

So what is most convincing to people trying something new? The answer is seeing that others like them have succeeded. You can see this at play in both the Amazon example and the Safety Critical Software example. Clearly, seeing Rust used for network services doesn’t mean it’s ready to be used in your car’s steering column. And even within network services, seeing a group like S3 succeed with Rust may convince other groups building at-scale services to try Rust, but doesn’t necessarily persuade a team to use Rust for their next CRUD service. And frankly, it shouldn’t! They are likely to hit obstacles.

All of this was on my mind as I watched the keynote by Jon Seager, the VP of Engineering at Canonical, which is the company behind Ubuntu. Similar to Lars Bergstrom’s epic keynote from years past on Rust adoption within Google, Jon laid out a pitch for why Canonical is adopting Rust that was at once visionary and yet deeply practical. “Visionary and yet deeply practical” is pretty much the textbook description of what we need to cross from early adopters to early majority. We need folks who care first and foremost about delivering the right results, but are open to new ideas that might help them do that better; folks who can stand on both sides of the chasm at once. Jon described how Canonical focuses their own development on a small set of languages: Python, C/C++, and Go, and how they had recently brought in Rust and were using it as the language of choice for new foundational efforts, replacing C, C++, and (some uses of) Python.
Jon talked about how he sees it as part of Ubuntu’s job to “pay it forward” by supporting the construction of memory-safe foundational utilities. Jon meant support both in terms of finances – Canonical is sponsoring the Trifecta Tech Foundation’s work to develop sudo-rs and ntpd-rs, and sponsoring the uutils org’s work on coreutils – and in terms of reputation. Ubuntu can take on the risk of doing something new, prove that it works, and then let others benefit. Remember how the Crossing the Chasm book described early majority people? They are “looking to minimize discontinuity with the old ways”. And what better way to do that than to have drop-in utilities that fit within their existing workflows?

With new adoption comes new perspectives. On Thursday night I was at a dinner organized by Ernest Kissiedu. Jon Seager was there along with some other Rust adopters from various industries, as were a few others from the Rust Foundation and the open-source project. Ernest asked them to give us their unvarnished takes on Rust. Jon made the provocative comment that we needed to revisit our policy of having a small standard library. He’s not the first to say something like that; it’s something we’ve been hearing for years and years – and I think he’s right! Though I don’t think the answer is just to ship a big standard library. In fact, it’s kind of a perfect lead-in to (what I hope will be) my next blog post, which is about a project I call “battery packs”.

The broader point, though, is that shifting from targeting “pioneers” and “early adopters” to targeting the “early majority” sometimes involves some uncomfortable changes:

Transition between any two adoption segments is normally excruciatingly awkward because you must adopt new strategies just at the time you have become most comfortable with the old ones. [..] The situation can be further complicated if the high-tech company, fresh from its marketing success with visionaries, neglects to change its sales pitch. [..]
The company may be saying “state-of-the-art” when the pragmatist wants to hear “industry standard”.

– Crossing the Chasm (emphasis mine)

Not everybody will remember it, but in 2016 there was a proposal called the Rust Platform. The idea was to bring in some crates and bless them as a kind of “extended standard library”. People hated it. After all, they said, why not just add the dependencies to your Cargo.toml? It’s easy enough. And to be honest, they were right – at least at the time. I think the Rust Platform is a good example of something that was a poor fit for early adopters, who want the newest thing and don’t mind finding the best crates, but which could be a great fit for the early majority.

Anyway, I’m not here to argue for one thing or another in this post, but more for the concept that we have to be open to adapting our learned wisdom to new circumstances. In the past, we were trying to bootstrap Rust into the industry’s consciousness – and we have succeeded. The task before us now is different: we need to make Rust the best option not just in terms of “what it could be” but in terms of “what it actually is” – and sometimes those are in tension.

Later in the dinner, the talk turned, as it often does, to money. Growing Rust adoption also comes with growing demands placed on the Rust project and its ecosystem. How can we connect the dots? This has been a big item on my mind, and I realize in writing this paragraph how many blog posts I have yet to write on the topic, but let me lay out a few interesting points that came up over this dinner and at other recent points. First, there are more ways to offer support than $$. For Canonical specifically, as they are an open-source organization through and through, what I would most want is to build stronger relationships between our organizations.
With the Rust for Linux developers, early on Rust maintainers were prioritizing and fixing bugs on behalf of RfL devs, but more and more, RfL devs are fixing things themselves, with Rust maintainers serving as mentors. This is awesome!

Second, there’s an interesting trend about $$ that I’ve seen crop up in a few places. We often think of companies investing in the open-source dependencies that they rely upon. But there’s an entirely different source of funding, and one that might be even easier to tap: companies that are considering Rust but haven’t adopted it yet. For those “would be” adopters, there are often individuals in the org who are trying to make the case for Rust adoption – these individuals are early adopters, people with a vision for how things could be, but they are trying to sell to their early majority company. And to do that, they often have a list of “table stakes” features that need to be supported; what’s more, they often have access to some budget to make these things happen. This came up when I was talking to Alexandru Radovici, the Foundation’s Silver Member Director, who said that many safety critical companies have money they’d like to spend to close various gaps in Rust, but they don’t know how to spend it. Jon’s investments in Trifecta Tech and the uutils org have the same character: he is looking to close the gaps that block Ubuntu from using Rust more.

Well, first of all, you should watch Jon’s talk. “Brilliant”, as the Brits have it. But my other big thought is that this is a crucial time for Rust. We are clearly transitioning in a number of areas from visionaries and early adopters towards that pragmatic majority, and we need to be mindful that doing so may require us to change some of the ways that we’ve always done things. I liked this paragraph from Crossing the Chasm:

To market successfully to pragmatists, one does not have to be one – just understand their values and work to serve them.
To look more closely into these values, if the goal of visionaries is to take a quantum leap forward, the goal of pragmatists is to make a percentage improvement – incremental, measurable, predictable progress. [..] To market to pragmatists, you must be patient. You need to be conversant with the issues that dominate their particular business. You need to show up at the industry-specific conferences and trade shows they attend.

Re-reading Crossing the Chasm as part of writing this blog post has really helped me square where Rust is – for the most part, I think we are still crossing the chasm, but we are well on our way. I think what we see now is a consistent trend where we have Rust champions who fit the “visionary” profile of early adopters successfully advocating for Rust within companies that fit the pragmatist, early majority profile.

It strikes me that open source is just an amazing platform for doing this kind of marketing. Unlike a company, we don’t have to do everything ourselves. We have to leverage the fact that open source helps those who help themselves – find those visionary folks in industries that could really benefit from Rust, bring them into the Rust orbit, and then (most important!) support and empower them to adapt Rust to their needs.

This last part may sound obvious, but it’s harder than it sounds. When you’re embedded in open source, it seems like a friendly place where everyone is welcome. But the reality is that it can be a place full of cliques and “oral traditions” that “everybody knows” 9 . People coming with an idea can get shut down for using the wrong word. They can readily mistake the, um, “impassioned” comments from a random contributor (or perhaps just a troll…) for the official word from project leadership. It only takes one rude response to turn somebody away.

So what will ultimately help Rust the most to succeed? Empathy in Open Source. Let’s get out there, find out where Rust can help people, and make it happen. Exciting times!
I am famously bad at accents. My best attempt at posh British sounds more like Apu from the Simpsons. I really wish I could pull off a convincing Greek accent, but sadly no.  ↩︎ Another of my pearls of wisdom is “there is nothing more permanent than temporary code”. I used to say that back at the startup I worked at after college, but years of experience have only proven it more and more true.  ↩︎ Russel Cohen and Jess Izen gave a great talk at last year’s RustConf about what our team is doing to help teams decide if Rust is viable for them. But since then another thing having a big impact is AI, which is bringing previously unthinkable projects, like rewriting older systems, within reach.  ↩︎ I have no idea if there is code in a car’s steering column, for the record. I assume so by now? For power steering or some shit?  ↩︎ Or am I supposed to call it “tea”? Or maybe “supper”? I can’t get a handle on British mealtimes.  ↩︎ Ernest is such a joy to be around. He’s quiet, but he’s got a lot of insights if you can convince him to share them. If you get the chance to meet him, take it! If you live in London, go to the London Rust meetup! Find Ernest and introduce yourself. Tell him Niko sent you and that you are supposed to say how great he is and how you want to learn from the wisdom he’s accrued over the years. Then watch him blush. What a doll.  ↩︎ If you can’t wait, you can read some Zulip discussion here.  ↩︎ The Battery Packs proposal I want to talk about is similar in some ways to the Rust Platform, but decentralized and generally better in my opinion– but I get ahead of myself!  ↩︎ Betteridge’s Law of Headlines has it that “Any headline that ends in a question mark can be answered by the word no ”. Well, Niko’s law of open-source 2 is that “nobody actually knows anything that ’everybody’ knows”.  ↩︎

Evan Schwartz 1 week ago

PSA: Your SQLite Connection Pool Might Be Ruining Your Write Performance

Update (Feb 18, 2026): After a productive discussion on Reddit and additional benchmarking, I found that the solutions I originally proposed (batched writes or using a synchronous connection) don't actually help. The real issue is simpler and more fundamental than I described: SQLite is single-writer, so any amount of contention at the SQLite level will severely hurt write performance. The fix is to use a single writer connection with writes queued at the application level, and a separate connection pool for concurrent reads. The original blog post text is preserved below, with retractions and updates marked accordingly. My apologies to the SQLx maintainers for suggesting that this behavior was unique to SQLx.

Write transactions can lead to lock starvation and serious performance degradation when using SQLite with SQLx, the popular async Rust SQL library. In retrospect, I feel like this should have been obvious, but it took a little more staring at suspiciously consistent "slow statement" logs than I'd like to admit, so I'm writing it up in case it helps others avoid this footgun.

SQLite is single-writer. In WAL mode, it can support concurrent reads and writes (or, technically, "write" singular), but no matter the mode there is only ever one writer at a time. Before writing, a process needs to obtain an EXCLUSIVE lock on the database. If you start a read transaction with a SELECT and then perform a write in the same transaction, the transaction will need to be upgraded to a write transaction with an exclusive lock:

A read transaction is used for reading only. A write transaction allows both reading and writing. A read transaction is started by a SELECT statement, and a write transaction is started by statements like CREATE, DELETE, DROP, INSERT, or UPDATE (collectively "write statements"). If a write statement occurs while a read transaction is active, then the read transaction is upgraded to a write transaction if possible.
( source ) Transactions started with BEGIN IMMEDIATE or BEGIN EXCLUSIVE also take the exclusive write lock as soon as they are started. In SQLx, you start a transaction by calling begin() on the pool, run statements against it, and then commit it. This type of transaction where you read and then write is completely fine: the transaction starts as a read transaction and then is upgraded to a write transaction for the write statement.

Update: This section incorrectly attributes the performance degradation to the interaction between async Rust and SQLite. The problem is actually that any contention for the EXCLUSIVE lock at the SQLite level, whether from single statements or batches, will hurt write performance.

The problem arises when you await within a write transaction. For example, this could happen if you execute multiple write statements within one transaction, awaiting each in turn. That pattern will cause serious performance degradation if you have multiple concurrent tasks that might be trying this operation, or any other write, at the same time. When the program reaches the first write statement, the transaction is upgraded to a write transaction with an exclusive lock. However, when you await, the task yields control back to the async runtime. The runtime may schedule another task before returning to this one. The problem is that this task is now holding an exclusive lock on the database. All other writers must wait for this one to finish. If the newly scheduled task tries to write, it will simply wait until it hits the busy timeout and returns an error. The original task might be able to make progress if no other concurrent writers are scheduled before it, but under higher load you might continuously have new tasks that block the original writer from progressing.

Starting a transaction with BEGIN IMMEDIATE will also cause this problem, because you will immediately take the exclusive lock and then yield control at the next await point. In practice, you can spot this issue in your production logs if you see a lot of SQLx "slow statement" warnings where the time is very close to your busy_timeout (which is 5 seconds by default).
This is the result of other tasks being scheduled by the runtime and then trying and failing to obtain the exclusive lock they need to write to the database while being blocked by a parked task.

SQLite's concurrency model (in WAL mode) is many concurrent readers with exactly one writer. Mirroring this architecture at the application level provides the best performance. Instead of a single connection pool, where connections may be upgraded to write at any time, use two separate pools: a pool with a single connection for writes, and a pool with multiple connections for reads. With this setup, write transactions serialize within the application. Tasks will queue waiting for the single writer connection, rather than all trying to obtain SQLite's EXCLUSIVE lock. In my benchmarks, this approach was ~20x faster than using a single pool with multiple connections.

An alternative to separate pools is wrapping writes in a Mutex, which achieves similar performance (95ms in the benchmarks). However, separate pools make the intent clearer and, if the reader pool is configured as read-only, prevent accidentally issuing a write on a reader connection.

Having separate pools works when reads and writes are independent, but sometimes you need to atomically read and then write based on the result. Sending such a transaction to the single write connection is fine if the read is extremely fast, such as a single lookup by primary key. However, if your application requires expensive reads that must precede writes in a single atomic transaction, the shared connection pool with moderate concurrency might outperform a single writer.

Retraction: Benchmarking showed that batched writes perform no better than the naive loop under concurrency, because 50 connections still contend for the write lock regardless of whether each connection issues 100 small INSERTs or one large one. QueryBuilder is still useful for reducing per-statement overhead, but it does not fix the contention problem.
We could safely replace the example code above with a single bulk INSERT to avoid the lock starvation problem. Note that if you do this with different numbers of values, you should disable statement caching for the query. By default, SQLx caches prepared statements, but each version of the query with a different number of arguments will be cached separately, which may thrash the cache. Retraction: Benchmarking showed that this did not actually improve performance.

Unfortunately, the fix for atomic writes to multiple tables is uglier and potentially very dangerous. To avoid holding an exclusive lock across an await point, you need to execute the whole transaction in one shot as a single raw SQL string. However, this can lead to catastrophic SQL injection attacks if you use it with user input, because a raw SQL string does not support binding and sanitizing query parameters. Note that you can technically run a transaction with multiple statements in a single call, but the docs say:

The query string may only contain a single DML statement: SELECT, INSERT, UPDATE, DELETE and variants. The SQLite driver does not currently follow this restriction, but that behavior is deprecated.

If you find yourself needing atomic writes to multiple tables with SQLite and Rust, you might be better off rethinking your schema to combine those tables or switching to a synchronous library with a single writer. Update: the most useful change would actually be making a distinction between a read transaction and a write transaction. Libraries like SQLx could enforce the distinction at compile time or runtime by inspecting the queries for the presence of write statements, or the reader pool could be configured as read-only.

Maybe, but it probably won't. If SQLx offered both a sync and async API (definitely out of scope) and differentiated between read and write statements, a write transaction could be executed synchronously, which would prevent it from being held across an await point.
However, SQLx is not an ORM and it probably isn't worth it for the library to have different methods for read versus write statements. Without that, there isn't a way to prevent write transaction locks from being held across await points while still allowing safe read transactions to be used across them. So, in lieu of type safety to prevent this footgun, I wrote up this blog post and this pull request to include a warning about this in the docs. Discuss on r/rust and Hacker News.

Jimmy Miller 1 week ago

Untapped Way to Learn a Codebase: Build a Visualizer

The biggest shock of my early career was just how much code I needed to read that others wrote. I had never dealt with this. I had a hard enough time understanding my own code. The idea of understanding hundreds of thousands or even millions of lines of code written by countless other people scared me. What I quickly learned is that you don't have to understand a codebase in its entirety to be effective in it. But just saying that is not super helpful. So rather than tell, I want to show.

In this post, I'm going to walk you through how I learn an unfamiliar codebase. But I'll admit, this isn't precisely how I would do it today. After years of working on codebases, I've learned quite a lot of shortcuts. Things that come with experience that just don't translate for other people. So what I'm going to present is a reconstruction. I want to show bits and parts of how I go from knowing very little to gaining knowledge and ultimately, asking the right questions. To do this, I will use just a few techniques: setting a goal, editing randomly, fixing things I find that are broken, reading to answer questions, and making a visualizer.

I want to do this on a real codebase, so I've chosen one whose purpose and scope I'm generally familiar with, but one that I've never contributed to or read: Next.js. But I've chosen to be a bit more particular than that. I'm particularly interested in learning more about the Rust bundler setup (turbopack) that Next.js has been building out. So that's where we will concentrate our time.

Trying to learn a codebase is distinctly different from trying to simply fix a bug or add a feature. In this post, we may use bugs, talk about features, make changes, etc. But we are not trying to contribute to the codebase, yet. Instead, we are trying to get our minds around how the codebase generally works. We aren't concerned with things like coding standards, common practices, or the development roadmap. We aren't even concerned with correctness. The changes we make are about seeing how the codebase responds so we can make sense of it.
I find starting at main to be almost always completely unhelpful. From main, yes, we have a single entry point, but now we are asking ourselves to understand the whole. But things actually get worse when dealing with a large codebase like this. There isn't even one main. Which main would we choose? So instead, let's start by figuring out what our library even consists of.

A couple of things to note. We have packages, crates, turbo, and turbopack. Crates are relevant here because we know we are interested in some of the Rust code, but we are also interested in turbopack in particular. A quick look at these shows that turbo, packages, and crates are probably not our target. Why do I say that? Because turbopack has its own crates folder. So there are 54 crates under turbopack.... This is beginning to feel a bit daunting. So why don't we take a step back and find a better starting point?

One starting point I find particularly useful is a bug report. I found this by simply looking at recently opened issues. When I first found it, it had no comments on it. In fact, I find bug reports with only reproducing instructions to be the most useful. Remember, we are trying to learn, not fix a bug. So I spent a little time looking at the bug report. It is fairly clear. It does indeed reproduce. But it has a lot of code. So, as is often the case, it is useful to reduce it to the minimal case. So that's what I did.

Here is the important code and the problem we are using to learn from. MyEnum here is dead code. It should not show up in our final bundle. But when we build and search for it, it is still there in the output. With a different setup, the code is completely gone from our build. So now we have our bug. But remember, our goal here is not to fix the bug, but to understand the code. So our goal is going to be to use this little mini problem to understand what code is involved in this bug. To understand the different ways we could fix this bug. To understand why this bug happened in the first place.
To understand some small slice of the turbopack codebase. So at each junction, we are going to resist the urge to simply find the offending code. We are going to take detours. We are going to ask questions. We hope that from the start of this process to the end, we no longer think of the code involved in this action as a black box. But we will intentionally leave ourselves with open questions. As I write these words, I have no idea where this will take us. I have not prepared this ahead of time. I am not telling you a fake tale from a codebase I already know. Yes, I will simplify and skip parts. But you will come along with me.

The first step for understanding any project is getting some part of it running. Well, I say that, but in my day job, I've been at companies where this is a multi-day or week-long effort. Sometimes because of a lack of access, sometimes from unclear instructions. If you find yourself in that situation, you now have a new task: understand it well enough to get it to build. Well, unfortunately, that is the scenario we find ourselves in. I can't think of a single one of these endeavors I've gone on to learn a codebase that didn't involve a completely undesirable, momentum-stopping side quest. For this one, it was as soon as I tried to make changes to the turbopack Rust code and get it working in my test project.

There are instructions on how to do this. In fact, we even get an explanation on why it is a bit weird. Since Turbopack doesn't support symlinks when pointing outside of the workspace directory, it can be difficult to develop against a local Next.js version. Neither linking nor direct imports quite cut it. An alternative is to pack the Next.js version you want to test into a tarball and add it to the pnpm overrides of your test application, using a script they provide. Okay, straightforward enough.

I start by finding somewhere in the turbopack repo that I think will be called more than once, and I add a log line that prints HERERE!!!!!!!. Yes.
Very scientific, I know. But I've found this to be a rather effective method of making sure my changes are picked up. So I do that, and make sure I've built and done the necessary things. I run the packing script. Then that script tells me to add some overrides and dependencies to my test project. I go to build my project and HERERE!!!!!!! does not show up at all...

I will save you the fun details here of looking through this system. But I think it's important to mention a few things. First, swc being a dependency immediately stood out to me. In my day job, I maintain a fork of swc (WHY???) for some custom stuff. I definitely won't pretend to be an expert on swc, but I know it's written in Rust. I know it's a native dependency. The changes I'm making are native dependencies. But I see no mention at all of turbopack. In fact, if I search in my test project, I get nothing. So I have a sneaking suspicion my turbopack code should be in that tar. So let's look at the tar. Ummm. That seems a bit small... Let's look at what's inside. Okay, I think we found our problem. There's really nothing in this at all. Definitely no native code.

After lots of searching, the culprit came down to a little set-plus-regex filter. In our case, the input came from this file, and the path being filtered was ours. Unfortunately, this setup causes our file to be filtered out. Why? Because it doesn't match the regex. The regex is looking for a dot with characters after it; we have none. So since we are already in the set (we just added ourselves), we filter ourselves out. How do we solve this problem? There are countless answers, really. I had Claude whip me up one without regex. But my gut says the sorting lets us do this much simpler. After this fix, the tar looks much better, and we can finally see HERERE!!!!!!! a lot.

Update: As I wrote this article, someone fixed this in a bit of a different way, keeping the regex and just tweaking it. A fairly practical decision. Okay, we now have something we can test.
But where do we even begin? This is one reason we chose this bug. It gives a few avenues to go down. First, the report says that these enums are not being "tree-shaken." Is that the right term? One thing I've learned from experience is to never assume that the end user is using terms in the same manner as the codebase. So this can be a starting point, but it might be wrong. With some searching around, we can actually see that there is a configuration for turning turbopackTreeShaking on or off. It was actually a bit hard to find exactly where the default for this was. It isn't actually documented. So let's just enable it and see what we get. Well, I think we figured out that the default is off. So one option is that we never "tree shake" anything. But that seems wrong.

At this point, I looked into tree shaking a bit in the codebase, and while I started to understand a few things, I've been at this point before. Sometimes it is good to go deep. But how much of this codebase do I really understand? If tree shaking is our culprit (seeming unlikely at this point), it might be good to know how code gets there. Here, we of course found a bug. But it is an experimental feature. Maybe we can come back and fix it? Maybe we can file a bug? Maybe this code just isn't at all ready for primetime. It's hard to know as an outsider.

Our "search around the codebase" strategy failed. So now we try a different tactic. We know a couple of things: our utilities.ts file is read and parsed, and it ends up in a file under a "chunks" directory. We now have two points we can use to try to trace what happens. Let's start with parsing. Luckily, here it is straightforward to find. When we look at this code, we can see that swc does the heavy lifting. First, it parses it into a TypeScript AST, then applies transforms to turn it into JavaScript. At this point, we don't write to a string, but if you edit the code and use an emitter, you can see the JavaScript output. Now, to find where we write the chunks. In most programs, this would be pretty easy.
Typically, there is a linear flow somewhere that just shows you the steps. Or if you can't piece one together, you can simply breakpoint and follow the flow. But Turbopack is a rather advanced system involving async Rust (more on this later). So, in keeping with the tradition of not trying to do things that rely too heavily on my knowledge, I have done the tried and true: log random things until they look relevant. And what I found made me realize that logging was not going to be enough. It was time for my tried and true learning technique, visualization.

Ever since my first job, I have been building custom tools to visualize codebases. Perhaps this is due to my aphantasia. I'm not really sure. Some of these visualizers make their way into general use for me. But more often than not, they are a means of understanding. When I applied for a job at Shopify working on YJIT, I built a simple visualizer but never got around to making it more useful than a learning tool. The same thing is true here, but this time, thanks to AI, it looks a bit more professional.

This time, we want to give a bit more structure than what we'd be able to do with a simple print. 1 We are trying to get events out that have a bunch of information. Mostly, we are interested in files and their contents over time. Looking through the codebase, we find that one key abstraction is an ident; this will help us identify files. We will simply find points that seem interesting, make a corresponding event, make sure it has idents associated with it, and send that event over a WebSocket. Then, with that raw information, we can have our visualizer stitch together what exactly happens.

If we take a look, we can see our code step through the process and ultimately end up in the bundle despite not being used. If you notice, though, between steps 3 and 4, our code changed a bit. We lost this PURE annotation. Why? Luckily, we tried to capture as much context as we could.
We can see that a boolean "Scope Hoisting" has been enabled. Could that be related? If we turn it off, our pure annotation is kept around, and as a result, our code is eliminated. If we take a step back, this can make sense. Something during the parse step is creating a closure around our enum code, but when it does so, it is marking that as a "pure" closure, meaning it has no side effects. Later, because this annotation is dropped, the minifier doesn't know that it can just get rid of this closure. As I've been trying to find time to write this up, it seems that people on the bug report have found this on their own as well.

So we've found the behavior of the bug. Now we need to understand why it is happening. Remember, we are fixing a bug as a means of understanding the software, not just to fix a bug. So what exactly is going on? Well, we are trying to stitch together two libraries. Software bugs are way more likely to occur on these seams. In this case, after reading the code for a while, the problem becomes apparent. SWC parses our code and turns it into an AST. But if you take a look at an AST, comments are not a part of it. So instead, SWC stores comments off in a hashmap where it can look them up by byte pos. For each node in the AST, it can see if there is a comment attached. But for the PURE comment, it doesn't actually need to look the comment up. It is not a unique comment that was in the source code. It is a pre-known meta comment. So rather than store each instance in the map, SWC makes a special value.

Now, this encoding scheme causes some problems for turbopack. Turbopack does not act on a single file; it acts across many files. In fact, for scope hoisting, we are trying to take files across modules and condense them into a single scope. So now, when we encounter one of these bytepos encodings, how do we know what module it belongs to?
The obvious answer to many might be to simply make a tuple like (module_id, byte_pos), and while that certainly works, it does come with tradeoffs. One of these is memory footprint. I didn't find an exact reason, but given the focus on performance in turbopack, I'd imagine this is one of the main motivations. Instead, we get a fairly clever encoding of module and bytepos into a single BytePos, aka a u32. I won't get into the details of the representation here; it involves some conditional logic. But needless to say, now that we are going from something focusing on one file to focusing on multiple, and trying to smuggle this module_id into our BytePos, we ended up missing one detail: PURE. Now our pure value is being interpreted as some module at some very high position instead of the proper bytes. To fix this bug, I found the minimal fix was simply to special-case the PURE sentinel in the encoding. With this, our enum is properly marked as PURE and disappears from the output!

Now remember, we aren't trying to make a bug fix. We are trying to understand the codebase. Is this the right fix? I'm not sure. I looked around the codebase, and there are a number of other swc sentinel values that I think need to also be handled (PLACEHOLDER and SYNTHESIZED). There is also the decoding path. For dummy, the decoding path panics. Should we do the same? Should we be handling pure at a higher level, where it never even goes through the encoder?

Update: As I was writing this, someone else proposed a fix. I did see that others had started to figure out the things I had determined from my investigation, but I was not confident enough that it was the right fix yet. In fact, the PR differs a bit from my local fix. It does handle the other sentinel, but at a different layer. It also chooses to decode with module 0, which felt a bit wrong to me. But again, these are decisions that people who work on this codebase long-term are better equipped to make than me.
I must admit that simply fixing this bug didn't quite help me understand the codebase. Not just because it is a fairly good size, but because I couldn't see this fundamental unit that everything was composed of. In some of the code snippets above, you will see types that mention Vc. This stands for ValueCell. There are a number of ways to try to understand these; you can check out the docs for turbo engine for some details. Or you can read the high-level overview that skips the implementation details for the most part.

You can think of these cells like the cells in a spreadsheet. They provide a level of incremental computation. When the value of some cell updates, we can invalidate stuff. Unlike a spreadsheet, the turbo engine is lazy. I've worked with these kinds of systems before. Some are very explicitly modeled after spreadsheets. Others are based on rete networks or propagators. I am also immediately reminded of salsa from the Rust analyzer team. I've also worked with big, complicated non-computational graphs. But even with that background, I know myself: I've never been able to really understand these things until I can visualize them. And while a general network visualizer can be useful (and might actually be quite useful if I used the aggregate graph), I've found that for my understanding, I vastly prefer starting small and exploring out the edges of the graph. So that's what I did.

But before we get to that visualization, I want to highlight something fantastic in the implementation: a central place for controlling a ton of the decisions that go into this system. The backend here lets us decide so many things about how the execution of our tasks will run. Because of this, we have one place we can insert a ton of tooling and begin to understand how this system works. As before, we are going to send things on a WebSocket. But unlike last time, our communication will actually be two-way. We are going to be controlling how the tasks run.
In my little test project, I edited a file, and my visualizer displayed the following. Admittedly, it is a bit janky, but there are some nice features. First, on the left, we can see all the pending tasks. In this case, something has marked our file read as dirty, so we are trying to read the file. We can see the contents of a cell that this task has. And we can see the dependents of this task.

Here is what it looks like once we release that task to run. We can now see 3 parse tasks have kicked off. Why 3? I'll be honest, I haven't looked into it. But a good visualization is about provoking questions, not only answering them. Did I get my visualization wrong because I misunderstood something about the system? Are there three different subsystems that want to parse, and we want to do them in parallel? Have we just accidentally triggered more parses than we should have? This is precisely what we want out of a visualizer.

Is it perfect? No. Would I ship this as a general visualizer? No. Am I happy with the style? Not in the least. But already it enables a look into the project I couldn't see before. Here we can actually watch the graph unfold as I execute more steps. What a fascinating view of a once opaque project. With this visualizer, I was able to make changes to my project and watch values as they flow through the systems. I made simple views for looking at code. If I extended this, I can imagine it being incredibly useful for debugging general issues, for seeing the ways in which things are scheduled, and for finding redundancies in the graph. Once I was able to visualize this, I really started to understand the codebase better. I was able to see all the values that didn't need to be recomputed when we made changes. The whole thing clicked.

This was an exercise in exploring a new codebase that is a bit different of a process than I see others take. It isn't an easy process, and it isn't quick. But I've found myself repeating this process over and over again.
For the turborepo codebase, this is just the beginning. This exploration was done over a few weekends, in the limited spare time I could find. But already I can start to put the big picture together. I can start to see how I could shape my tools to help me answer more questions. If you've never used tool building as a way to learn a codebase, I highly recommend it. One thing I always realize as I go through this process is just how hard it is to work interactively with our current software. Our languages, our tools, our processes are all written without ways to live code, without ways to inspect their inner workings. It is also incredibly hard to find a productive UI environment for this kind of live exploration. The running state of the visualizer contains all the valuable information. Any system that makes you retrace your steps to get the UI back to the state it was once in before you can visualize more is sorely lacking. So I always find myself in the browser, but immediately I am having to worry about performance. We have made massive strides in so many aspects of software development. I hope that we will fix this one as well.

The process, in short:

- Setting a goal
- Editing randomly
- Fixing things I find that are broken
- Reading to answer questions
- Making a visualizer

Our utilities.ts file is read and parsed. It ends up in a file under a "chunks" directory.

baby steps 2 weeks ago

Sharing in Dada

OK, let’s talk about sharing. This is the first of the Dada blog posts where things start to diverge from Rust in a deep way, and I think the first where we start to see some real advantages to the Dada way of doing things (and some of the tradeoffs I made to achieve those advantages). Let’s start with the goal: earlier, I said that Dada was like “Rust where you never have to type ”. But what I really meant is that I want a GC-like experience – without the GC. I also often use the word “composable” to describe the Dada experience I am shooting for. Composable means that you can take different things and put them together to achieve something new. Obviously Rust has many composable patterns – the APIs, for example. But what I have found is that Rust code is often very brittle: there are many choices when it comes to how you declare your data structures, and the choices you make will inform how those data structures can be consumed. Let’s create a type that we can use as a running example throughout the post: . In Rust, we might define a like so: Now, suppose that, for whatever reason, we are going to build up a character programmatically: So far, so good. Now suppose I want to share that same struct so it can be referenced from a lot of places without deep copying. To do that, I am going to put it in an : OK, cool! Now I have a that is readily sharable. That’s great. Side note, but this is an example of where Rust is composable: we defined once in a fully-owned way and we were able to use it mutably (to build it up imperatively over time) and then able to “freeze” it and get a read-only, shared copy of . This gives us the advantages of an imperative programming language (easy data construction and manipulation) and the advantages of a functional language (immutability prevents bugs when things are referenced from many disjoint places). Nice! Now, suppose that I have some other code, written independently, that just needs to store the character’s name. 
That code winds up copying the name into a lot of different places. So, just like we used to let us cheaply reference a single character from multiple places, it uses so it can cheaply reference the character’s name from multiple places: OK. Now comes the rub. I want to create a character-sheet widget from our shared character: Shoot, that’s frustrating! What I would like to do is to write or something similar (actually I’d probably like to just write , but anyhow) and get back an . But I can’t do that. Instead, I have to deeply clone the string and allocate a new . Of course any subsequent clones will be cheap. But it’s not great. I often find patterns like this arise in Rust: there’s a bit of an “impedance mismatch” between one piece of code and another. The solution varies, but it’s generally something like The goal with Dada is that we don’t have that kind of thing. So let’s walk through how that same example would play out in Dada. We’ll start by defining the class: Just as in Rust, we can create the character and then modify it afterwards: Cool. Now, I want to share the character so it can be referenced from many places. In Rust, we created an , but in Dada, sharing is “built-in”. We use the operator, which will convert the (i.e., fully owned character) into a : Now that we have a character, we can copy it around: When you have a shared object and you access its field, what you get back is a shared (shallow) copy of the field : To drill home how cool and convenient this is, imagine that I have a that I share with : and then I share it with . What I get back is a . And when I access the elements of that, I get back a : This is as if one could take a in Rust and get out a . So how is sharing implemented? The answer lies in a not-entirely-obvious memory layout. To see how it works, let’s walk how a would be laid out in memory: Here is a built-in type that is the basis for Dada’s unsafe code system. 
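The impedance mismatch described above is easy to reproduce in today's Rust. The names here are my own illustration, not taken from the post: starting from a shared character whose name field is a `String`, there is no free conversion to `Arc<str>` — you have to copy the string bytes into a fresh reference-counted allocation.

```rust
use std::sync::Arc;

struct Character {
    name: String,
    level: u32,
}

fn main() {
    // Build the character mutably, then freeze it behind an Arc.
    let mut c = Character { name: String::new(), level: 1 };
    c.name.push_str("Sif");
    c.level = 3;
    let shared: Arc<Character> = Arc::new(c);

    // The independently-written code wants an Arc<str>. There is no cheap
    // `shared.name` -> Arc<str> conversion: Arc::from copies the bytes
    // into a brand-new allocation with its own ref-count.
    let name: Arc<str> = Arc::from(shared.name.as_str());
    assert_eq!(&*name, "Sif");

    // Subsequent clones of the *new* Arc are cheap, as the post says...
    let name2 = Arc::clone(&name);
    assert_eq!(Arc::strong_count(&name), 2);
    // ...but the initial conversion was a deep copy, not a ref-count bump.
    assert_eq!(name2.len(), 3);
}
```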
Now imagine we have a like this: The character would be laid out in memory something like this (focusing just on the field): Let’s talk this through. First, every object is laid out flat in memory, just like you would see in Rust. So the fields of are stored on the stack, and the field is laid out flat within that. Each object that owns other objects begins with a hidden field, . This field indicates whether the object is shared or not (in the future we’ll add more values to account for other permissions). If the field is 1, the object is not shared. If it is 2, then it is shared. Heap-allocated objects (i.e., using ) begin with a ref-count before the actual data (actually this is at offset -4). In this case we have a so the actual data that follows are just simple characters. If I were to instead create a shared character: The memory layout would be the same, but the flag field on the character is now 2: Now imagine that we created two copies of the same shared character: What happens is that we will copy all the fields of and then, because is 2, we will increment the ref-counts for the heap-allocated data within: Now imagine we were to copy out the name field, instead of the entire character: …what happens is that: The result is that you get: This post showed how values in Dada work and showed how the permission propagates when you access a field. Permissions are how Dada manages object lifetimes. We’ve seen two so far. In future posts we’ll see the and permissions, which roughly correspond to and , and talk about how the whole thing fits together. This is the first post where we started to see a bit more of Dada’s character. Reading over the previous few posts, you could be forgiven for thinking Dada was just a cute syntax atop familiar Rust semantics. But as you can see from how works, Dada is quite a bit more than that. I like to think of Dada as “opinionated Rust” in some sense. Unlike Rust, it imposes some standards on how things are done. 
For example, every object (at least every object with a heap-allocated field) has a field. And every heap allocation has a ref-count. These conventions come at some modest runtime cost. My rule is that basic operations are allowed to do “shallow” operations, e.g., toggling the or adjusting the ref-counts on every field. But they cannot do “deep” operations that require traversing heap structures. In exchange for adopting conventions and paying that cost, you get “composability”, by which I mean that permissions in Dada (like ) flow much more naturally, and types that are semantically equivalent (i.e., you can do the same things with them) generally have the same layout in memory.

The options for resolving the impedance mismatch from earlier:

- clone some data – it’s not so big anyway, screw it (that’s what happened here).
- refactor one piece of code – e.g., modify the class to store an . Of course, that has ripple effects, e.g., we can no longer write anymore, but have to use or something.
- invoke some annoying helper – e.g., write to convert from an to a or write a to convert from a to a .

The steps when copying out the name field:

- traversing , we observe that the field is 2 and therefore is shared
- we copy out the fields from . Because the character is shared:
- we modify the field on the new string to 2
- we increment the ref-count for any heap values

The two permissions seen so far:

- the permission indicates a uniquely owned value ( , in Rust-speak);
- the permission indicates a copyable value ( is the closest Rust equivalent).

Remember that I have not implemented all this; I am drawing on my memory and notes from my notebooks. I reserve the right to change any and everything as I go about implementing. ↩︎
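A rough Rust analogy for the copy rule this post describes — copying a shared character copies the flat fields and bumps the ref-count of each heap field. `Rc` stands in for Dada's ref-count-before-the-data, and the `Flag` enum and field names are my own invention:

```rust
use std::rc::Rc;

#[derive(Clone, Copy, PartialEq)]
#[allow(dead_code)]
enum Flag {
    Owned,  // the post's flag value 1: not shared
    Shared, // the post's flag value 2: shared
}

struct Character {
    flag: Flag,
    name: Rc<str>, // heap data with a ref-count in front, as in Dada
    level: u32,
}

/// Copy a character the way the post describes: duplicate the flat
/// fields; because the object is shared, Rc::clone bumps each heap
/// field's ref-count instead of deep-copying it.
fn copy_character(c: &Character) -> Character {
    assert!(c.flag == Flag::Shared, "only shared values are freely copyable");
    Character { flag: c.flag, name: Rc::clone(&c.name), level: c.level }
}

fn main() {
    let c = Character { flag: Flag::Shared, name: Rc::from("Sif"), level: 3 };
    assert_eq!(Rc::strong_count(&c.name), 1);
    let c2 = copy_character(&c);
    // Shallow copy: one more reference to the same "Sif" bytes.
    assert_eq!(Rc::strong_count(&c.name), 2);
    assert_eq!(&*c2.name, "Sif");
}
```

The "shallow operations only" rule above is visible here: the copy touches the flag and one ref-count per heap field, but never walks the heap data itself.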

Anton Zhiyanov 2 weeks ago

Allocators from C to Zig

An allocator is a tool that reserves memory (typically on the heap) so a program can store its data structures there. Many C programs use the standard libc allocator, or at best, let you switch it out for another one like jemalloc or mimalloc. Unlike C, modern systems languages usually treat allocators as first-class citizens. Let's look at how they handle allocation and then create a C allocator following their approach. Rust • Zig • Odin • C3 • Hare • C • Final thoughts Rust is one of the older languages we'll be looking at, and it handles memory allocation in a more traditional way. Right now, it uses a global allocator, but there's an experimental Allocator API implemented behind a feature flag (issue #32838 ). We'll set the experimental API aside and focus on the stable one. The documentation begins with a clear statement: In a given program, the standard library has one "global" memory allocator that is used for example by and . Followed by a vague one: Currently the default global allocator is unspecified. It doesn't mean that a Rust program will abort an allocation, of course. In practice, Rust uses the system allocator as the global default (but the Rust developers don't want to commit to this, hence the "unspecified" note): The global allocator interface is defined by the trait in the module. It requires the implementor to provide two essential methods — and , and provides two more based on them — and : The struct describes a piece of memory we want to allocate — its size in bytes and alignment: Memory alignment Alignment restricts where a piece of data can start in memory. The memory address for the data has to be a multiple of a certain number, which is always a power of 2. Alignment depends on the type of data: CPUs are designed to read "aligned" memory efficiently. 
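The alignment rules can be checked directly against the `Layout` type just described — a small sketch, assuming a typical 64-bit target:

```rust
use std::alloc::Layout;

fn main() {
    // Alignment depends on the type and is always a power of two.
    assert_eq!(Layout::new::<u8>().align(), 1);  // any address is fine
    assert_eq!(Layout::new::<i32>().align(), 4); // addresses divisible by 4
    assert_eq!(Layout::new::<f64>().align(), 8); // addresses divisible by 8

    // A layout can also be built from a size and alignment by hand;
    // from_size_align rejects alignments that are not powers of two.
    let layout = Layout::from_size_align(24, 8).unwrap();
    assert_eq!(layout.size(), 24);
    assert!(Layout::from_size_align(24, 3).is_err());
}
```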
For example, if you read a 4-byte integer starting at address 0x03 (which is unaligned), the CPU has to do two memory reads — one for the first byte and another for the other three bytes — and then combine them. But if the integer starts at address 0x04 (which is aligned), the CPU can read all four bytes at once. Aligned memory is also needed for vectorized CPU operations (SIMD), where one processor instruction handles a group of values at once instead of just one. The compiler knows the size and alignment for each type, so we can use the constructor or helper functions to create a valid layout: Don't be surprised that a takes up 32 bytes. In Rust, the type can grow, so it stores a data pointer, a length, and a capacity (3 × 8 = 24 bytes). There's also 1 byte for the boolean and 7 bytes of padding (because of 8-byte alignment), making a total of 32 bytes. is the default memory allocator provided by the operating system. The exact implementation depends on the platform . It implements the trait and is used as the global allocator by default, but the documentation does not guarantee this (remember the "unspecified" note?). If you want to explicitly set as the global allocator, you can use the attribute: You can also set a custom allocator as global, like in this example: To use the global allocator directly, call the and functions: In practice, people rarely use or directly. Instead, they work with types like , or that handle allocation for them: The allocator doesn't abort if it can't allocate memory; instead, it returns (which is exactly what recommends): The documentation recommends using the function to signal out-of-memory errors. It immediately aborts the process, or panics if the binary isn't linked to the standard library. Unlike the low-level function, types like or call if allocation fails, so the program usually aborts if it runs out of memory: Allocator API • Memory allocation APIs Memory management in Zig is explicit. 
There is no default global allocator, and any function that needs to allocate memory accepts an allocator as a separate parameter. This makes the code a bit more verbose, but it matches Zig's goal of giving programmers as much control and transparency as possible. An allocator in Zig is a struct with an opaque self-pointer and a method table with four methods: Unlike Rust's allocator methods, which take a raw pointer and a size as arguments, Zig's allocator methods take a slice of bytes ( ) — a type that combines both a pointer and a length. Another interesting difference is the optional parameter, which is the first return address in the allocation call stack. Some allocators, like the , use it to keep track of which function requested memory. This helps with debugging issues related to memory allocation. Just like in Rust, allocator methods don't return errors. Instead, and return if they fail. Zig also provides type-safe wrappers that you can use instead of calling the allocator methods directly: Unlike the allocator methods, these allocation functions return an error if they fail. If a function or method allocates memory, it expects the developer to provide an allocator instance: Zig's standard library includes several built-in allocators in the namespace. asks the operating system for entire pages of memory, each allocation is a syscall: allocates memory into a fixed buffer and doesn't make any heap allocations: wraps a child allocator and allows you to allocate many times and only free once: The call frees all memory. Individual calls are no-ops. (aka ) is a safe allocator that can prevent double-free, use-after-free and can detect leaks: is a general-purpose thread-safe allocator designed for maximum performance on multithreaded machines: is a wrapper around the libc allocator: Zig doesn't panic or abort when it can't allocate memory. 
An allocation failure is just a regular error that you're expected to handle: Allocators • std.mem.Allocator • std.heap Odin supports explicit allocators, but, unlike Zig, it's not the only option. In Odin, every scope has an implicit variable that provides a default allocator: If you don't pass an allocator to a function, it uses the one currently set in the context. An allocator in Odin is a struct with an opaque self-pointer and a single function pointer: Unlike other languages, Odin's allocator uses a single procedure for all allocation tasks. The specific action — like allocating, resizing, or freeing memory — is decided by the parameter. The allocation procedure returns the allocated memory (for and operations) and an error ( on success). Odin provides low-level wrapper functions in the package that call the allocator procedure using a specific mode: There are also type-safe builtins like / (for a single object) and / (for multiple objects) that you can use instead of the low-level interface: By default, all builtins use the context allocator, but you can pass a custom allocator as an optional parameter: To use a different allocator for a specific block of code, you can reassign it in the context: Odin's provides two different allocators: When using the temp allocator, you only need a single call to clear all the allocated memory. Odin's standard library includes several allocators, found in the and packages. The procedure returns a general-purpose allocator: uses a single backing buffer for allocations, allowing you to allocate many times and only free once: detects leaks and invalid memory access, similar to in Zig: There are also others, such as or . Like Zig, Odin doesn't panic or abort when it can't allocate memory. Instead, it returns an error code as the second return value: Allocators • base:runtime • core:mem Like Zig and Odin, C3 supports explicit allocators. Like Odin, C3 provides two default allocators: heap and temp. 
An allocator in C3 is an interface with an additional option of zeroing or not zeroing the allocated memory: Unlike Zig and Odin, the and methods don't take the (old) size as a parameter — neither directly like Odin nor through a slice like Zig. This makes it a bit harder to create custom allocators because the allocator has to keep track of the size along with the allocated memory. On the other hand, this approach makes C interop easier (if you use the default C3 allocator): data allocated in C can be freed in C3 without needing to pass the size parameter from the C code. Like in Odin, allocator methods return an error if they fail. C3 provides low-level wrapper macros in the module that call allocator methods: These either return an error (the -suffix macros) or abort if they fail. There are also functions and macros with similar names in the module that use the global allocator instance: If a function or method allocates memory, it often expects the developer to provide an allocator instance: C3 provides two thread-local allocator instances: There are functions and macros in the module that use the temporary allocator: The macro releases all temporary allocations when leaving the scope: Some types, like or , use the temp allocator by default if they are not initialized: C3's standard library includes several built-in allocators, found in the module. is a wrapper around libc's malloc/free: uses a single backing buffer for allocations, allowing you to allocate many times and only free once: detects leaks and invalid memory access: There are also others, such as or . Like Zig and Odin, C3 can return an error in case of allocation failure: C3 can also abort in case of allocation failure: Since the functions and macros in the module use instead of , it looks like aborting on failure is the preferred approach. Memory Handling • core::mem::allocator • core::mem Unlike other languages, Hare doesn't support explicit allocators. 
The standard library has multiple allocator implementations, but only one of them is used at runtime. Hare's compiler expects the runtime to provide and implementations: The programmer isn't supposed to access them directly (although it's possible by importing and calling or ). Instead, Hare uses them to provide higher-level allocation helpers. Hare offers two high-level allocation helpers that use the global allocator internally: and . can allocate individual objects. It takes a value, not a type: can also allocate slices if you provide a second parameter (the number of items): works correctly with both pointers to single objects (like ) and slices (like ). Hare's standard library has three built-in memory allocators: The allocator that's actually used is selected at compile time. Like other languages, Hare returns an error in case of allocation failure: You can abort on error with : Or propagate the error with : Dynamic memory allocation • malloc.ha Many C programs use the standard libc allocator, or at most, let you swap it out for another one using macros: Or using a simple setter: While this might work for switching the libc allocator to jemalloc or mimalloc, it's not very flexible. For example, trying to implement an arena allocator with this kind of API is almost impossible. Now that we've seen the modern allocator design in Zig, Odin, and C3 — let's try building something similar in C. There are a lot of small choices to make, and I'm going with what I personally prefer. I'm not saying this is the only way to design an allocator — it's just one way out of many. Our allocator should return an error instead of if it fails, so we'll need an error enum: The allocation function needs to return either a tagged union (value | error) or a tuple (value, error). Since C doesn't have these built in, let's use a custom tuple type: The next step is the allocator interface. 
I think Odin's approach of using a single function makes the implementation more complicated than it needs to be, so let's create separate methods like Zig does: This approach to interface design is explained in detail in a separate post: Interfaces in C . Zig uses byte slices ( ) instead of raw memory pointers. We could make our own byte slice type, but I don't see any real advantage to doing that in C — it would just mean more type casting. So let's keep it simple and stick with like our ancestors did. Now let's create generic and wrappers: I'm taking for granted here to keep things simple. A more robust implementation should properly check if it is available or pass the type to directly. We can even create a separate pair of helpers for collections: We could use some macro tricks to make and work for both a single object and a collection. But let's not do that — I prefer to avoid heavy-magic macros in this post. As for the custom allocators, let's start with a libc wrapper. It's not particularly interesting, since it ignores most of the parameters, but still: Usage example: Now let's use that field to implement an arena allocator backed by a fixed-size buffer: Usage example: As shown in the examples above, the allocation method returns an error if something goes wrong. While checking for errors might not be as convenient as it is in Zig or Odin, it's still pretty straightforward: Here's an informal table comparing allocation APIs in the languages we've discussed: In Zig, you always have to specify the allocator. In Odin, passing an allocator is optional. In C3, some functions require you to pass an allocator, while others just use the global one. In Hare, there's a single global allocator. As we've seen, there's nothing magical about the allocators used in modern languages. While they're definitely more ergonomic and safe than C, there's nothing stopping us from using the same techniques in plain C.

Notes gathered from the sections above:

- The system allocator: on Unix platforms; on Windows.
- : alignment = 1. Can start at any address (0, 1, 2, 3...).
- : alignment = 4. Must start at addresses divisible by 4 (0, 4, 8, 12...).
- : alignment = 8. Must start at addresses divisible by 8 (0, 8, 16...).

Odin's two context allocators:

- is for general-purpose allocations. It uses the operating system's heap allocator.
- is for short-lived allocations. It uses a scratch allocator (a kind of growing arena).

C3's two default allocators:

- is for general-purpose allocations. It uses the operating system's heap allocator (typically a libc wrapper).
- is for short-lived allocations. It uses an arena allocator.

Hare's three built-in allocators:

- The default allocator is based on the algorithm from the Verified sequential malloc/free paper.
- The libc allocator uses the operating system's malloc and free functions from libc.
- The debug allocator uses a simple mmap-based method for memory allocation.
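The "allocate many times, free once" pattern that keeps recurring in this post (Zig's arena, Odin's temp allocator, C3's arena, the C arena above) fits in a few lines. Here is a hypothetical bump-arena sketch in Rust rather than C — an illustration of the technique, not any of these libraries' actual code:

```rust
/// A toy bump arena over a fixed buffer: `alloc` just advances an
/// offset (respecting alignment); `reset` frees everything at once.
struct Arena {
    buf: Vec<u8>,
    offset: usize,
}

impl Arena {
    fn with_capacity(cap: usize) -> Self {
        Arena { buf: vec![0; cap], offset: 0 }
    }

    /// Returns the offset of `size` bytes aligned to `align`,
    /// or None if the backing buffer is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        // Round the current offset up to the requested alignment.
        let start = (self.offset + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // out of memory: report an error, don't abort
        }
        self.offset = end;
        Some(start)
    }

    /// Free everything with one call; individual frees are no-ops.
    fn reset(&mut self) {
        self.offset = 0;
    }
}

fn main() {
    let mut arena = Arena::with_capacity(64);
    let a = arena.alloc(1, 1).unwrap();
    let b = arena.alloc(8, 8).unwrap();
    assert_eq!(a, 0);
    assert_eq!(b, 8); // bumped past 1 byte, then rounded up to alignment 8
    assert!(arena.alloc(100, 8).is_none()); // exhausted: error, not abort
    arena.reset();    // "free once"
    assert_eq!(arena.alloc(8, 8).unwrap(), 0);
}
```

Note how the failure mode matches the design discussed above: exhaustion is an ordinary return value for the caller to handle, not a panic.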

baby steps 3 weeks ago

Hello, Dada!

Following on my Fun with Dada post, this post is going to start teaching Dada. I’m going to keep each post short – basically just what I can write while having my morning coffee. 1 Here is a very first Dada program. I think all of you will be able to guess what it does. Still, there is something worth noting even in this simple program: “You have the right to write code. If you don’t write a function explicitly, one will be provided for you.” Early on I made the change to let users omit the function, and I was surprised by what a difference it made in how light the language felt. Easy change, easy win. Here is another Dada program. Unsurprisingly, this program does the same thing as the last one. “Convenient is the default.” Strings support interpolation (i.e., ) by default. In fact, that’s not all they support: you can also break them across lines very conveniently. This program does the same thing as the others we’ve seen: When you have a immediately followed by a newline, the leading and trailing newline are stripped, along with the “whitespace prefix” from the subsequent lines. Internal newlines are kept, so something like this: would print. Of course you could also annotate the type of the variable explicitly: You will find that it is . This in and of itself is not notable, unless you are accustomed to Rust, where the type would be . This is of course a perennial stumbling block for new Rust users, but more than that, I find it to be a big annoyance – I hate that I have to write or everywhere that I mix constant strings with strings that are constructed. Similar to most modern languages, strings in Dada are immutable. So you can create them and copy them around: OK, we really just scratched the surface here! This is just the “friendly veneer” of Dada, which looks and feels like a million other languages. Next time I’ll start getting into the permission system and mutation, where things get a bit more interesting. 
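The stripping rule described above (drop the leading and trailing newline, remove the common whitespace prefix, keep internal newlines) is easy to state as code. Here is a sketch in Rust — my reading of the rule, not Dada's actual implementation:

```rust
/// Dedent a multi-line literal the way the post describes:
/// strip one leading and one trailing newline, then remove the
/// common leading-whitespace prefix from the remaining lines.
fn dedent(s: &str) -> String {
    let s = s.strip_prefix('\n').unwrap_or(s);
    let s = s.strip_suffix('\n').unwrap_or(s);
    // The smallest indent across non-empty lines is the shared prefix.
    let prefix = s
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.len() - l.trim_start().len())
        .min()
        .unwrap_or(0);
    s.lines()
        .map(|l| if l.len() >= prefix { &l[prefix..] } else { l })
        .collect::<Vec<_>>()
        .join("\n") // internal newlines are kept
}

fn main() {
    let raw = "\n    Hello,\n      Dada!\n";
    assert_eq!(dedent(raw), "Hello,\n  Dada!");
}
```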
My habit is to wake around 5am and spend the first hour of the day doing “fun side projects”. But for the last N months I’ve actually been doing Rust stuff, like symposium.dev and preparing the 2026 Rust Project Goals . Both of these are super engaging, but all Rust and no play makes Niko a dull boy. Also a grouchy boy.  ↩︎

Simon Willison 3 weeks ago

How StrongDM's AI team build serious software without even looking at the code

Last week I hinted at a demo I had seen from a team implementing what Dan Shapiro called the Dark Factory level of AI adoption, where no human even looks at the code the coding agents are producing. That team was part of StrongDM, and they've just shared the first public description of how they are working in Software Factories and the Agentic Moment : We built a Software Factory : non-interactive development where specs + scenarios drive agents that write code, run harnesses, and converge without human review. [...] In kōan or mantra form: In rule form: Finally, in practical form: I think the most interesting of these, without a doubt, is "Code must not be reviewed by humans". How could that possibly be a sensible strategy when we all know how prone LLMs are to making inhuman mistakes ? I've seen many developers recently acknowledge the November 2025 inflection point , where Claude Opus 4.5 and GPT 5.2 appeared to turn the corner on how reliably a coding agent could follow instructions and take on complex coding tasks. StrongDM's AI team was founded in July 2025 based on an earlier inflection point relating to Claude Sonnet 3.5: The catalyst was a transition observed in late 2024: with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error. By December of 2024, the model's long-horizon coding performance was unmistakable via Cursor's YOLO mode . Their new team started with the rule "no hand-coded software" - radical for July 2025, but something I'm seeing significant numbers of experienced developers start to adopt as of January 2026. They quickly ran into the obvious problem: if you're not writing anything by hand, how do you ensure that the code actually works? Having the agents write tests only helps if they don't cheat and . 
This feels like the most consequential question in software development right now: how can you prove that software you are producing works if both the implementation and the tests are being written for you by coding agents? StrongDM's answer was inspired by Scenario testing (Cem Kaner, 2003). As StrongDM describe it: We repurposed the word scenario to represent an end-to-end "user story", often stored outside the codebase (similar to a "holdout" set in model training), which could be intuitively understood and flexibly validated by an LLM. Because much of the software we grow itself has an agentic component, we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term satisfaction to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user? That idea of treating scenarios as holdout sets - used to evaluate the software but not stored where the coding agents can see them - is fascinating . It imitates aggressive testing by an external QA team - an expensive but highly effective way of ensuring quality in traditional software. Which leads us to StrongDM's concept of a Digital Twin Universe - the part of the demo I saw that made the strongest impression on me. The software they were building helped manage user permissions across a suite of connected services. This in itself was notable - security software is the last thing you would expect to be built using unreviewed LLM code! [The Digital Twin Universe is] behavioral clones of the third-party services our software depends on. We built twins of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, replicating their APIs, edge cases, and observable behaviors. With the DTU, we can validate at volumes and rates far exceeding production limits. We can test failure modes that would be dangerous or impossible against live services. 
We can run thousands of scenarios per hour without hitting rate limits, triggering abuse detection, or accumulating API costs. How do you clone the important parts of Okta, Jira, Slack and more? With coding agents! As I understood it the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API, as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation. With their own, independent clones of those services - free from rate-limits or usage quotas - their army of simulated testers could go wild . Their scenario tests became scripts for agents to constantly execute against the new systems as they were being built. This screenshot of their Slack twin also helps illustrate how the testing process works, showing a stream of simulated Okta users who are about to need access to different simulated systems. This ability to quickly spin up a useful clone of a subset of Slack helps demonstrate how disruptive this new generation of coding agent tools can be: Creating a high fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against, but self-censored the proposal to build it. The techniques page is worth a look too. In addition to the Digital Twin Universe they introduce terms like Gene Transfusion for having agents extract patterns from existing systems and reuse them elsewhere, Semports for directly porting code from one language to another and Pyramid Summaries for providing multiple levels of summary such that an agent can enumerate the short ones quickly and zoom in on more detailed information as it is needed. StrongDM AI also released some software - in an appropriately unconventional manner. 
github.com/strongdm/attractor is Attractor , the non-interactive coding agent at the heart of their software factory. Except the repo itself contains no code at all - just three markdown files describing the spec for the software in meticulous detail, and a note in the README that you should feed those specs into your coding agent of choice! github.com/strongdm/cxdb is a more traditional release, with 16,000 lines of Rust, 9,500 of Go and 6,700 of TypeScript. This is their "AI Context Store" - a system for storing conversation histories and tool outputs in an immutable DAG. It's similar to my LLM tool's SQLite logging mechanism but a whole lot more sophisticated. I may have to gene transfuse some ideas out of this one! I visited the StrongDM AI team back in October as part of a small group of invited guests. The three person team of Justin McCarthy, Jay Taylor and Navan Chauhan had formed just three months earlier, and they already had working demos of their coding agent harness, their Digital Twin Universe clones of half a dozen services and a swarm of simulated test agents running through scenarios. And this was prior to the Opus 4.5/GPT 5.2 releases that made agentic coding significantly more reliable a month after those demos. It felt like a glimpse of one potential future of software development, where software engineers move from building the code to building and then semi-monitoring the systems that build the code. The Dark Factory. I glossed over this detail in my first published version of this post, but it deserves some serious attention. If these patterns really do add $20,000/month per engineer to your budget they're far less interesting to me. At that point this becomes more of a business model exercise: can you create a profitable enough line of products that you can afford the enormous overhead of developing software in this way? 
Building sustainable software businesses also looks very different when any competitor can potentially clone your newest features with a few hours of coding agent work. I hope these patterns can be put into play with a much lower spend. I've personally found the $200/month Claude Max plan gives me plenty of space to experiment with different agent patterns, but I'm also not running a swarm of QA testers 24/7! I think there's a lot to learn from StrongDM even for teams and individuals who aren't going to burn thousands of dollars on token costs. I'm particularly invested in the question of what it takes to have agents prove that their code works without needing to review every line of code they produce. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.

Why am I doing this? (implied: the model should be doing this instead)

- Code must not be written by humans
- Code must not be reviewed by humans
- If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
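The immutable-DAG context store mentioned above (cxdb) can be sketched in a few lines of stdlib Python. This is my own rough illustration of the general idea - content-addressed nodes that reference their parents, so conversation histories can branch without mutating earlier turns - not cxdb's actual design or API.

```python
# A rough sketch of an "AI context store" as an immutable DAG: each node is
# content-addressed by hash and points at its parents. Branching a
# conversation means appending a new node with the same parent; nothing is
# ever mutated. Illustrative only - not cxdb's real design.
import hashlib
import json

class ContextDAG:
    def __init__(self):
        self._nodes = {}  # content hash -> node dict

    def append(self, payload, parents=()):
        """Store an immutable node and return its content hash."""
        node = {"payload": payload, "parents": list(parents)}
        digest = hashlib.sha256(
            json.dumps(node, sort_keys=True).encode()
        ).hexdigest()
        self._nodes.setdefault(digest, node)  # identical content dedupes
        return digest

    def history(self, digest):
        """Walk ancestors depth-first, returning payloads oldest-first."""
        node = self._nodes[digest]
        out = []
        for parent in node["parents"]:
            out.extend(self.history(parent))
        out.append(node["payload"])
        return out
```

Because nodes are addressed by hash, two branches off the same prompt share their common prefix for free, and replaying either branch is just a walk back to the root.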

0 views
Simon Willison 3 weeks ago

Running Pydantic's Monty Rust sandboxed Python subset in WebAssembly

There's a jargon-filled headline for you! Everyone's building sandboxes for running untrusted code right now, and Pydantic's latest attempt, Monty, provides a custom Python-like language (a subset of Python) in Rust and makes it available as both a Rust library and a Python package. I got it working in WebAssembly, providing a sandbox-in-a-sandbox. Here's how they describe Monty: Monty avoids the cost, latency, complexity and general faff of using a full container-based sandbox for running LLM generated code. Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds. What Monty can do:

- Run a reasonable subset of Python code - enough for your agent to express what it wants to do
- Completely block access to the host environment: filesystem, env variables and network access are all implemented via external function calls the developer can control
- Call functions on the host - only functions you give it access to [...]

A quick way to try it out is via uv: Then paste this into the Python interactive prompt - the enables top-level await: Monty supports a very small subset of Python - it doesn't even support class declarations yet! But, given its target use-case, that's not actually a problem. The neat thing about providing tools like this for LLMs is that they're really good at iterating against error messages. A coding agent can run some Python code, get an error message telling it that classes aren't supported and then try again with a different approach. I wanted to try this in a browser, so I fired up a code research task in Claude Code for web and kicked it off with the following: Clone https://github.com/pydantic/monty to /tmp and figure out how to compile it into a python WebAssembly wheel that can then be loaded in Pyodide. The wheel file itself should be checked into the repo along with build scripts and passing pytest playwright test scripts that load Pyodide from a CDN and the wheel from a “python -m http.server” localhost and demonstrate it working Then a little later: I want an additional WASM file that works independently of Pyodide, which is also usable in a web browser - build that too along with playwright tests that show it working. Also build two HTML files - one called demo.html and one called pyodide-demo.html - these should work similar to https://tools.simonwillison.net/micropython (download that code with curl to inspect it) - one should load the WASM build, the other should load Pyodide and have it use the WASM wheel. These will be served by GitHub Pages so they can load the WASM and wheel from a relative path since the .html files will be served from the same folder as the wheel and WASM file. Here's the transcript, and the final research report it produced. I now have the Monty Rust code compiled to WebAssembly in two different shapes - as a bundle you can load and call from JavaScript, and as a wheel file which can be loaded into Pyodide and then called from Python in Pyodide in WebAssembly in a browser. Here are those two demos, hosted on GitHub Pages:

- Monty WASM demo - a UI over JavaScript that loads the Rust WASM module directly.
- Monty Pyodide demo - this one provides an identical interface but here the code is loading Pyodide and then installing the Monty WASM wheel.

As a connoisseur of sandboxes - the more options the better! - this new entry from Pydantic ticks a lot of my boxes. It's small, fast, widely available (thanks to Rust and WebAssembly) and provides strict limits on memory usage, CPU time and access to disk and network. It was also a great excuse to spin up another demo showing how easy it is these days to turn compiled code like C or Rust into WebAssembly that runs in both a browser and a Pyodide environment.
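The "iterate against error messages" loop described above can be reduced to a toy sketch. Both `run_sandboxed` and the list of candidate programs are stand-ins of my own invention - this is not Monty's real API, just the shape of the retry pattern a coding agent follows.

```python
# The retry-against-errors pattern: run LLM-written code in a restricted
# sandbox, and if it fails, feed the error back and try the next candidate.
# run_sandboxed is a hypothetical stand-in for a real sandbox like Monty.
def run_sandboxed(code):
    """Toy sandbox: rejects class definitions, like early Monty does."""
    if "class " in code:
        return None, "SyntaxError: class declarations are not supported"
    namespace = {}
    exec(code, namespace)  # fine for a demo; a real sandbox would not use exec
    return namespace.get("result"), None

def solve_with_retries(attempts, max_tries=3):
    """attempts: successive code candidates, as an LLM might produce them."""
    error = None
    for code in attempts[:max_tries]:
        result, error = run_sandboxed(code)
        if error is None:
            return result
    raise RuntimeError(f"gave up: {error}")
```

The point of the pattern is that a clear, fast error message ("classes aren't supported") is enough steering for the agent to route around a missing feature on its next attempt.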

0 views
Evan Schwartz 3 weeks ago

Scour - January Update

Hi friends, In January, Scour scoured 805,241 posts from 16,555 feeds (939 were newly added). I also rolled out a lot of new features that I'm excited to tell you about. Maybe because of some of these, I found more posts than usual that I thought were especially worth sharing. You can find them at the bottom of this post. Let's dive in! The Scour homepage has been completely revamped. It includes a new tagline, a more succinct description, and a live demo where you can try out my feed right from that page. Let me know what you think! Scour also finally has its own logo! (And it looks great on my phone's home screen, if I do say so myself! See below ) Have you ever wondered how Scour works? There is now a full documentation section, complete with detailed write-ups about Interests , Feeds , Reactions , How Ranking Works , and more. There are also guides specifically for RSS users and readers of Hacker News , arXiv , Reddit , and Substack . All of the docs have lots of interactive elements, which I wrote about in Building Docs Like a Product . My favorite one is on the Hacker News guide where you can search for hidden gems that have been submitted to HN but that have not reached the front page. Thanks to Tiago Ferreira , Andrew Doran , and everyone else who gave me the feedback that they wanted to understand more about how Scour works! Scour is now a Progressive Web App (PWA). That means you can install it as an icon on your home screen and access it easily. Just open Scour on your phone and follow the instructions there. Thanks to Adam Benenson for the encouragement to finally do this! This is one of the features I have most wanted as a user of Scour myself. When you're browsing the feed, Scour now keeps track of which items you've seen and scrolled past so it shows you new content each time you check it. If you don't want this behavior, you can disable it in the feed filter menu or change your default view to show seen posts. 
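The seen-post tracking described above boils down to a small amount of per-user state. A minimal sketch of the idea, purely illustrative and not Scour's actual implementation:

```python
# Hide posts the user has already scrolled past: keep a set of seen item
# IDs per user and filter them out unless "show seen posts" is enabled.
# Illustrative only - not Scour's real code.
def filter_feed(items, seen_ids, show_seen=False):
    """Return feed items, dropping already-seen ones by default."""
    if show_seen:
        return list(items)
    return [item for item in items if item["id"] not in seen_ids]

def mark_seen(items, seen_ids):
    """Record that the user scrolled past these items."""
    seen_ids.update(item["id"] for item in items)
```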
If you subscribe to specific feeds, as opposed to scouring all of them, it's now easier to find the feed for an article you liked. Click the "..." menu under the post, then "Show Feeds" to show the feeds where the item was found. When populating that list, Scour will now automatically search the website where the article was found to see if it has a feed that Scour wasn't already checking. This makes it easy to discover new feeds and follow websites or authors whose content you like. This was another feature I've wanted for a long time myself. Previously, when I liked an article, I'd copy the domain and try to add it to my feeds on the Feeds page. Now, Scour does that with the click of a button. Some of the most disliked and flagged articles on Scour had titles such as "The Top 10..." or "5 tricks...". Scour now automatically penalizes articles with titles like those. Because I'm explicitly trying to avoid using popularity in ranking, I need to find other ways to boost high-quality content and down-rank low-quality content. You can expect more of these types of changes in the future to increase the overall quality of what you see in your feed. Previously, posts found through Google News links would show Google News as the domain under the post. Now, Scour extracts the original link. You can now navigate your feed using just your keyboard. Type to get the list of available keyboard shortcuts. Finally, here are some of my favorite posts that I found on Scour in January. There were a lot!

- I appreciate this minimalist approach to coding agents: Pi: The Minimal Agent Within OpenClaw, even though it didn't yet convince me to switch away from Claude Code.
- A long and interesting take on which software tools will survive the AI era: Software Survival 3.0.
- Scour uses Litestream for backup. While this new feature isn't directly relevant, I'm excited that it's now powering Fly.io's new Sprites offering (so I expect it to be a little more actively developed): Litestream Writable VFS.
- This is a very cool development in embedding models: a family of different size (and, as a result, cost) models whose embeddings are interoperable with one another: The Voyage 4 model family: shared embedding space with MoE architecture.
- A thought-provoking piece from Every about How AI Made Pricing Hard Again. TL;DR: gone are the days when SaaS businesses had practically zero marginal cost for additional users or additional usage.
- A nice bit of UX design history about the gas tank arrow indicator on a car, with a lesson applied to AI: The Moylan Arrow: IA Lessons for AI-Powered Experiences.
- Helpful context for Understanding U.S. Intervention in Venezuela.
- Stoolap: an interesting new embedded database. Stoolap 0.2 Released For Modern Embedded SQL Database In Rust.
- I keep browsing fonts and, while I decided not to use this one for Scour, I think this is a neat semi-sans-serif from an independent designer: Heliotrope.

Happy Scouring! Have feedback for Scour? Post it on the feedback board and upvote others' suggestions to help me prioritize new features!
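The listicle-title penalty mentioned earlier could look something like the sketch below. The exact patterns and penalty factor are my guesses for illustration, not Scour's real ranking code.

```python
# Penalize listicle-style titles ("The Top 10 ...", "5 tricks ...") by
# scaling their ranking score down. Patterns and factor are illustrative.
import re

LISTICLE_PATTERNS = [
    re.compile(r"^\s*(the\s+)?top\s+\d+\b", re.IGNORECASE),
    re.compile(r"^\s*\d+\s+(tricks|tips|ways|reasons|things)\b", re.IGNORECASE),
]

def adjusted_score(title, score, penalty=0.5):
    """Scale the score down if the title looks like a listicle."""
    if any(p.search(title) for p in LISTICLE_PATTERNS):
        return score * penalty
    return score
```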

0 views
Evan Schwartz 1 month ago

Building Docs Like a Product

Stripe is famous for having some of the best product docs, largely because they are "designed to feel like an application rather than a traditional user manual" . I spent much of the last week building and writing the docs for Scour, and I am quite proud of the results. Scour is a personalized content feed, not an SDK or API, so I started by asking myself what the equivalent of working code or copyable snippets is for this type of product. The answer: interactive pieces of the product, built right into the docs themselves. The guide for Hacker News readers is one of the sections I'm most proud of. When describing Scour to people, I often start with the origin story of wanting a tool that could search for posts related to my interests from the thousands submitted to HN that never make it to the front page. Built right into the guide is a live search bar that searches posts that have been submitted to HN, but that have not been on the front page . Try it out! You might find some hidden gems. The guides for Redditors , Substack readers , and arXiv readers also have interactive elements that let you easily search for subreddits or newsletters, or subscribe to any of arXiv's categories. Logged in users can subscribe to those feeds right from the docs. Every time I went to explain some aspect of Scour, I first asked myself if there was a way to use a working example instead. On the Interests page, I wanted to explain that the topics you add to Scour can be any free-form text you want. Every time you load the page, this snippet loads a random set of interests that people have added on Scour. You can click any of them to go to the page of content related to that topic and add that interest yourself. While explaining how Scour recommends other topics to you, I thought what if I just included an actual topic recommendation for logged in users? (Graphic Design was actually a Scour recommendation for me, and a good one at that!) 
On various docs pages, I wanted to explain the settings that exist. Instead of linking to the settings page or describing where to find it, logged in users can just change the settings from within the docs. For example, on the Content Filtering page, you can toggle the setting to hide paywalled content right from the docs: There are numerous live examples throughout the docs. All of those use the same components as the actual Scour website. (Scour is built with the "MASH stack" , so these are all components.) The section explaining that you can show the feeds where any given post was found actually includes the recent post that was found in the most different feeds. (In the docs, you actually need to click the "..." button to show the feeds underneath the post, as shown below.) While building this out, I had a number of cases where I needed to show an example of some component, but where I couldn't show a live component. For example, in the Interest Recommendations section described above, I needed a placeholder for users that aren't logged in. I started building a separate component that looked like the normal interest component... and then stopped. This felt like the type of code that would eventually diverge from the original and I'd forget to update it. So, I went and refactored the original components so that they'd work for static examples too. The last piece of building a documentation experience that I would be happy to use was ensuring that there would be no broken links. No broken links across docs sections, and no broken links from the docs to parts of the application. Scour is built with the excellent HTTP routing library in Rust. The crate has a useful, albeit slightly tedious trait and derive macro. This lets you define HTTP routes as structs, which can be used by the router and anywhere else you might want to link to that page. 
Anywhere else we might want to link to the Interests docs, we can use the following to get the path: This way, Rust's type system enforces that any link to those docs will stay updated, even if I later move the paths around. I started working on these docs after a couple of users gave the feedback that they would love a page explaining how Scour works. There is now a detailed explanation of how Scour's ranking algorithm works , along with docs explaining everything else I could think of. Please keep the feedback coming! If you still have questions after reading through any of the docs, please let me know so I can keep improving them.
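The typed-route idea described above - routes defined as types that both register with the router and render their own paths, so the compiler catches stale links - can be transposed into a short Python sketch. The class and route names here are hypothetical, and this is my illustration of the design, not the Rust crate's actual API.

```python
# Typed routes: every link goes through a route type that renders its own
# path, so moving a path means changing one definition, and no string
# literal can silently go stale. Names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class InterestsDocs:
    PATH = "/docs/interests"  # class attribute, not a dataclass field

    def to_path(self):
        return self.PATH

@dataclass(frozen=True)
class InterestPage:
    interest: str
    PATH_TEMPLATE = "/interests/{interest}"

    def to_path(self):
        return self.PATH_TEMPLATE.format(interest=self.interest)

def link_to(route):
    """Every <a href> goes through the route type, never a raw string."""
    return route.to_path()
```

The payoff is the same as in the Rust version: a renamed or relocated route is a single-definition change, and anything still linking to it fails loudly rather than 404ing quietly.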

0 views
Giles's blog 1 month ago

Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)

I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2-small-sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone who was interested. I managed to get it done, but it was kind of tricky to get right. The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end. This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-) Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it? You could, of course, just dump the code on GitHub and share the weights somewhere.
If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on. That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library , using models that had been uploaded to their hub . What it would be nice to do is share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this: ...rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config then -- rather than like this , with its >100-line function. Here's what I had to do to get it working. To make it easier to follow along with this post, I've created a GitHub repo . As a starting point, I recommend you clone that, and then check out the tag: You'll see that there's a file, which contains my version of the GPT-2 style LLM code from Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". There's also a script called , which is some code to run a model and get it to predict the 20 next words after the string , and a config file for the LLM code called , which tells it the number of layers, attention heads, and so on. If you want to use it and see what it comes up with, you can download the model weights from one of my trains, and install the dependencies with (recommended) or by running it in a Python environment with the libraries listed in installed. 
You'll get something like this: Your output will probably vary (for this and the later examples), as you'd expect from sampled LLM output, but it should at least be reasonably coherent. So: let's get it on Hugging Face! Our goal of being able to run inference with Transformers' system relies on a couple of deeper levels of abstraction. The requires that the model be available for download -- complete with all of its code and weights -- using code like this: is the HF abstraction for models that generate text. If that flag is concerning you, it is indeed a bit scary-looking. But remember that our goal here is to share a model on HF that has its own code, and that means that anyone that downloads it will have to opt in to downloading and running the code -- the flag is how they do that opt-in. So it is, unfortunately, necessary. Now, that model will need a tokeniser in order to run. Perhaps not surprisingly, the HF system expects to be able to download that with similar code: With both of those working, appropriate code for our pretrained models, and a bit (well, to be fair, quite a lot of) configuration, we'll be all set. But that's quite a big jump. There is a more general class called ; it's much simpler, just wrapping a generic model that might be doing anything. If we support it, we'll still need to use all of that clunky inference code, but the model's code and weights will be on Hugging Face Hub, and can be downloaded and instantiated easily. So let's get that working first, just to work out the bugs and get the basic process down pat. Our goal is to be able to run this in a Python environment where we just have and installed: ...and then have a model that we can run inference on, just like the code in our repo , but without the hassle of having to download the weights ourselves. Definitely a QoL improvement, even if it's not the endgame. If you're following along with the git repo, the tag to check out for this section is . 
In this version, you'll see a new subdirectory to contain our HF wrapper code (which I've imaginatively called ); you'll see why we need that later. In there, I've added a symlink to the model code itself (also to be explained later), an empty file to make the directory a Python module, and two files with some Transformers code: Let's dig into what's going on in those two. The first thing to understand is that whole thing in the filenames. Transformers is designed to handle all kinds of different models -- for example, Meta's Llama models and Qwen's models have their own codebases. These widely-used public models have code that is already built in to the library, with "model types" like and or respectively -- but we don't have that advantage. Our code is not built in to the library. So we need a distinct name for our type of model, which will let the library know that it has its own code and it shouldn't try to rely on built-in stuff. I chose because my Hugging Face username is my initials, 1 , and this model is the implementation of the GPT-2 architecture I'm playing with. That feels like a solid pattern to me -- it's unlikely to clash with anything built in. But the format appears to be fairly free-form, so you can choose pretty much anything so long as you're consistent throughout your code, and so long as it doesn't clash with any of the built-ins. So, you need two files with those specific names: your-model-type , and your-model-type . Let's look at them now. They're really simple at this stage; here's the configuration one: Now, when Transformers is loading a model with , it's going to need to know how to configure it. At the very least, it will need to know what to pass into the . If you look at the code , it's taking a config dictionary with stuff like the number of layers, the number of attention heads, and what-have-you. That's going to be required to instantiate the model with the right setup so that it can load the weights that we're providing. 
There's other config stuff that will come there later, but that's all we have for now. It does this using the same pattern as the various methods we were looking at earlier: All we're doing here is defining what kind of thing that method will return when it's all set up properly. You can see that we're inheriting from a class -- this provides all of the infrastructure we're going to need to push things to HF. I don't think that the name of the config class technically matters, but it definitely seems like best practice to name it based on the model name -- so, we're using for our model. However, the is important -- it has to match the model type that we've chosen and used for our filenames. Apart from that, we're stashing away the config that we're provided on a field, and then calling our superclass , forwarding on any kwargs we got in our own . Now let's look at : Just as with the config, there's for us to inherit from 2 . We're defining the thing that will return when it's all set up properly. We tell transformers that this should be configured with the that we just defined using that class variable, but apart from that, we're basically just wrapping the that is defined in 3 . That is imported using a relative import using rather than : This is important -- it has to be that way, as we'll discover later. But for now: that's why we had to create the subdirectory and the symlink to -- a relative import in Python can only happen if you're not in the "root" module, so we would not have been able to do that kind of import if the files were at the top of our repo. Now, let's take a look at the . We're calling the superclass , as you'd expect, then we're creating an underlying wrapped . We're expecting a parameter, which has the underlying model's configuration stashed away in its field by its own , so we can pass that down to the wrapped model. 
Finally, we call this special function; that does some extra configuration, and prior to Transformers 5.0.0 you could get away without calling it, but now it's 100% necessary, as otherwise it will not initialise its internal fields relating to whether or not the model uses weight tying. Now let's take a look at how we actually use those to upload the model. That's back at the root of the repo, in the file . Before looking at the code, try running it: So, it takes a model config path -- that file we have to set the number of layers and so on -- and the path of a safetensors file containing the weights. It will then try to upload our HF-friendly wrapped version of the model -- code, weights and config -- to the Hub. Let's see how it works. We do some boilerplate imports, and then import our config and our model classes -- importantly, via the submodule. Don't worry, we're getting close to the explanation of why that is :-) A bit of argument-validating boilerplate and the loading of the model config file into a dictionary so that we can use it, and now we get to the meat of it: What this is doing is telling our to register itself so that it is a thing that will be returned by the call. This only applies locally for now, but by setting things up locally we're telling the library what it will need to push up to the hub later. Next: We're doing exactly the same for our model, saying that it should be returned from . We need to be explicit about which of the various model classes we want to register it for -- the config class can only be loaded from , whereas the model might be something we'd want to have returned from , or if it was a different kind of model, perhaps , or something else entirely. What we want to do here is expose the basic model using , so that's what we do. 
We're creating our config class, passing in that model configuration that we loaded from the file earlier, so that it will stash it on its field, then: ...we create our model wrapper using that config. We now have an instance of our custom model, but with uninitialised weights. So: ...we load in the weights that were specified on the command line. Note that we have to load them into the wrapped model. The file we have is specifically for the custom that we want to publish, not for the wrapped one. But that's easily done by using the field. Finally, the magic: This is where the Transformers library really shows its strength. It will push the model, which means it needs to push the weights that we loaded into its wrapped . Then it will look at the class that defines the model, and will push the file that has the source for that class. It will see that it also has a dependency on , and will push that and its source . It will also spot the setup we did with our two calls to the different methods above to register them for the and and push that too. And when it's pushing the source, it will try to push the source of any dependencies too. This is where we get the final explanation of why we had to put it in a submodule, and have a symlink to . The code doesn't want to upload loads of extra stuff -- for example, any libraries you're using. It wants to be sure that it's only uploading your model code. The logic it uses for deciding whether or not something is part of the uploadable set of files is "was it imported relatively from the or the file" -- that is, with a dot at the start of the module name, rather than . In order to do that kind of import, we needed to create a submodule. And in order to access our file we need a copy of it inside the submodule. I didn't want to have two actual copies of the file -- too easy to let them get out of sync -- so a symlink sorts that out. Hopefully that clears up any mystery about this slightly-strange file layout. 
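The repository layout described above can be sketched like this. The directory and file names here are placeholders (the post's actual names didn't survive syndication); the two wrapper file names follow Transformers' usual `configuration_<model-type>.py` / `modeling_<model-type>.py` convention.

```
repo/
├── model.py                          # the original model code
├── upload.py                         # the upload script
└── <wrapper-module>/                 # the HF wrapper submodule
    ├── __init__.py                   # empty; makes this a Python package
    ├── model.py -> ../model.py       # symlink, so a relative import works
    ├── configuration_<model-type>.py
    └── modeling_<model-type>.py
```

The symlink keeps a single copy of the model code while still letting the wrapper files import it relatively - which, as explained above, is what tells the Hub uploader that the file belongs to the uploadable set.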
Let's give it a go and see what it creates! In order to upload a model to the HF Hub, you'll need an account, of course, so create one if you don't have one. Next, create an access token with write access -- the option is in the "Access Tokens" section of the "Settings". Then you need to authorize your local machine to access the hub using that token; if you're using , then you can just run: If you're not, you'll need to download and install the HF CLI and then run That will store stuff on your machine so that you don't need to log in again in the future -- if you're concerned about security, there's an you can call, and you can completely trash the session by deleting the associated token from the HF website. Now, let's run our upload script! You'll need to change the target HF model name at the end of the command to one with your username before the slash, of course. Once you've done that, take a look at the model on Hugging Face. You'll see a rather ugly default model card, but let's ignore that for now and take a look at the "Files and versions" tab. You should see the following files: Now, let's look into that . It will look like this: The bit is just showing the name of the class that was used in the call. This will become useful later when we get onto the pipeline code, but doesn't matter right now -- the next one is more important. The is essentially saying, if someone does on this model, then use the class from here, and likewise for should use . It's what that stuff we did in the upload script set up. The is just the parameters that we're threading down to our underlying custom class; nothing exciting there. The is, of course, the floating point type we're using for the model, and the is our unique name for this particular architecture. And the is the version of the library used to upload it, presumably used to determine compatibility when downloading models with earlier or later versions. 
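Pieced together from the description above, the generated config file looks roughly like the following. Every class name, module name, and parameter here is a placeholder for illustration - the real values come from your own model type and config - but the keys match what the post walks through: the pushed class name, the auto-class mapping, the custom model parameters, the dtype, the unique model type, and the library version.

```json
{
  "architectures": ["MyModel"],
  "auto_map": {
    "AutoConfig": "configuration_mymodel.MyModelConfig",
    "AutoModel": "modeling_mymodel.MyModel"
  },
  "model_config": {
    "n_layers": 12,
    "n_heads": 12,
    "emb_dim": 768
  },
  "model_type": "mymodel",
  "torch_dtype": "float32",
  "transformers_version": "5.0.0"
}
```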
So, it looks like there's enough information across those files on the hub to instantiate and use our model! Let's give that a go. The best way to check it out thoroughly is to create a completely fresh directory, away from our existing ones, and a fresh environment: and then to try to use the model: So we can see where Transformers has put the downloaded code, inside a submodule that appears to have a GUID-like name. Now let's try to run some inference on it: So there we go! We've gone from a situation where we would have to publish the code and the safetensors in some way and tell people how to combine them, to a neatly-packaged model that we can download, fully set up, with just one line: But that inference loop is still a pig; if you've been working with LLM code then it's not too bad -- a basic bit of autoregression with top-k and temperature -- but it's definitely holding us back. What next? One obvious issue with the code above is that we still have that dependency on . If we're going to run inference using the simple HF object, it's going to need to know how to encode the input and decode the outputs. And if you have your own tokeniser (which, if you have a truly custom model, you probably do) then you won't have the luxury of being able to just install it into the target runtime env -- you would still need to copy file around. Now, as I said at the start, I'm not going to go into this in as much detail, because my use case was really simple -- although I was using , the specific tokeniser I was using from that library was the standard GPT-2 one. Transformers has its own version of that installed. So here I'll explain how you do things for models that use a built-in Transformers tokeniser. After that I'll give some pointers that you might find useful if you're using something more custom. The good news if you're using a "standard" tokeniser that is already built into the Transformers library is that you can tell your model to use it. 
The downside is that you can't do it by using the trick that we did above -- that is, you can't just import it and then add a line below our previous calls to register the model and config as auto classes: that will essentially do nothing. However, tokenisers do have their own push_to_hub method, and the target that you specify can be your model. So, for my own models, I'm using this: That is, we get the tokeniser for the built-in GPT-2 implementation (specifically the "fast" one, written in Rust), set the padding token to the end-of-sequence one for tidiness (not sure why that's not the case by default), and then push it to the model. If you're following along with the code, you can check out the tag to see that. The code goes immediately after we've pushed the model itself to the hub. So, run the upload again: And now we can do a completely fresh env without tiktoken: In there, we can see that works: (Note that I had to use here -- that appears to be new in Transformers 5.0.0.) And do our inference test: It may not be much shorter than the code we had when we just had the model, but it's an important step forward: we can now download and run inference on our custom model with none of the custom code -- neither the model itself nor the tokeniser -- on the machine where we're doing it. Everything is nicely packaged on the HF Hub. Now, what if you're using a tokeniser that's not already in Transformers? There are two possibilities here: As I said, I have not done either of these, but that's the direction I'd explore if I needed it. If you do either and want to share your experiences, then please do leave a comment below! And likewise, if and when I start writing things with custom tokenisers, I'll link to the details of how to upload them then. Anyway, we've got the tokeniser done to the level we need for this walkthrough, so let's do the QoL improvements so that we can run inference on the model using the nice HF abstraction.
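Before getting to the nicer abstraction, here's roughly what the "pig" of an inference loop boils down to: one top-k-with-temperature sampling step, sketched in plain Python with no model or tokeniser attached. The function name and the toy logits are illustrative, not from the post's repo:

```python
import math
import random

def sample_top_k(logits, k=3, temperature=0.8, rng=None):
    """Pick a token id from `logits` using top-k filtering and temperature.

    `logits` is a plain list of floats, one per vocabulary entry -- the
    bare-bones autoregressive sampling step the text refers to.
    """
    rng = rng or random.Random(0)
    # Keep only the k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    return rng.choices(top, weights=weights, k=1)[0]

# One step of a toy loop: the "model" here is just a stub list of logits.
logits = [0.1, 2.5, 0.3, 1.9, -0.7]
token = sample_top_k(logits, k=2, temperature=0.8)
assert token in (1, 3)  # only the two highest-logit tokens are candidates
```

In the real loop this runs once per generated token, appending each sampled id to the input before the next forward pass.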
Let's look at our target code for inference again: The version of the code that does this is in the repo on the tag , but I'll explain how it was put in place, with the logic behind each step. In order to run a text-generation pipeline, we're going to need to wrap our model in something that provides the interface for LLMs in the Hugging Face ecosystem: AutoModelForCausalLM. So, our first step is to put the plumbing in place so that we can use the from_pretrained method on that class to download our wrapped model. IMO it's cleanest to have two separate models, one for "simple" inference that is just a regular model -- the one we have right now -- and one supporting the richer interface that supports easy text generation. So we can start off by adding the basic structure: We can then add code to register that in our upload script -- the last line in this snippet, just below the two that already exist. That feels like it should be enough, but for reasons I've not been able to pin down, it's not -- you also need to massage the "auto-map" in the config object to make it all work properly. So after that code, after we've created the config object, we need this: With that in place, we could just upload our model -- from_pretrained would work just fine. But the model that it would return would not be any different to the one we've been using so far. To get that to work, we need to update the model to say that it can generate text. That's actually pretty easy. Firstly, we need it to inherit from a mixin class provided by Transformers: GenerationMixin. Now, the semantics of the forward method on this class are a bit different to the ones we had previously; we were just returning the outputs of the last layer of the underlying model, the logits. For this kind of model, we need to put them in a wrapper -- the reasoning behind this will become clearer when we get on to training. So our forward pass needs to change to look like this: Finally, some changes to our config class. For text generation, Transformers needs to know how many hidden layers the model has.⁴
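The shape of that change can be sketched without any Transformers dependency: instead of returning raw logits, the forward pass returns an object with .logits (and, later, .loss) attributes, which is what the generation machinery expects to find. The real wrapper classes live in transformers.modeling_outputs; the stand-in dataclass and stub forward below are just to show the shape:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CausalLMOutputSketch:
    """Stand-in for transformers' CausalLMOutput-style wrappers; the real
    ones also carry past_key_values, hidden_states, and so on."""
    logits: list
    loss: Optional[float] = None

def forward(input_ids):
    # A real model would run the network here; we fake one row of logits
    # per input position.
    logits = [[0.0] * 4 for _ in input_ids]
    # Instead of returning the raw logits, wrap them so that downstream
    # code can find `.logits` (and, during training, `.loss`).
    return CausalLMOutputSketch(logits=logits)

out = forward([1, 2, 3])
assert out.loss is None and len(out.logits) == 3
```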
In the case of the model I'm using to demonstrate, that's a parameter in the underlying configuration, so this can go inside the config class: Another change in the config that took me a while to puzzle out, and might catch you if you're in the same situation: Transformers, by default, assumes that the model caches previous inputs. So in an autoregressive loop, the first run of the model will get the full input; the next iteration of the loop, however, won't be passed the full new sequence, but rather just the token that was generated last time around. So you'll get a series of predicted tokens where the first one might make sense but the rest degenerate into gibberish: all of the tokens generated after the first had just the previous token as their context. Luckily, you just need to specify that your model doesn't have a cache in the config class as well, after the call to the superclass constructor: We're almost there! At this point, we actually have all of the code that we need for a working text-generation pipeline. But there's one final tweak. A model on the hub has a "default" model type, which is the one that we use when we do the original registration. You might remember that it appeared in the config.json in that single-element list keyed on architectures. Previously we had this in our upload script: That means that our default is the plain model. But when the pipeline creates a model for us, it will just use the default -- even for the text-generation task, it doesn't assume we want to use the text-generation wrapper. Luckily, that's a small change: we just upload our text-generation model instead of the basic one: With all of that in place, we can run the script, upload the model, and then in a fresh environment: Lovely! Now let's get it training. For this section, check out the tag. You'll see a new file, which has the training loop from the notebook I linked to at the start of this post. It will train the model on this dataset, which is essentially a bunch of chatbot-style transcripts in the Llama 2 format.
Its goal is to help fine-tune a base model to become an instruction-following one, though of course the model I'm using here is too tiny for that to work well! It's still a useful way of checking that training works, though. To save time, it only does one training epoch, which should be enough to get the loss down a bit. If you run it against one of my other models, you can see it working (you will need to tweak the batch size if you have less than 24 GiB of VRAM). You can see that it's at least trying to answer the question after training, even if its answer is completely wrong -- pretty much what you'd expect from the tiny model in question (163M parameters trained on about 3B tokens). In order to get it working with our custom models, we just need to return the loss as well as the logits from the forward method of our class: You can see that we're getting the targets for our predictions in the labels argument, along with an attention mask; we have to shift them ourselves (that is, the label for position t is the input at position t+1), and also apply the attention mask manually, and then we can do the normal PyTorch cross-entropy calculation. This makes some kind of sense. The model on HF does need to package its own loss function somehow -- cross entropy is, of course, going to be the most likely option for a causal LM, but there's no guarantee. And while I think that personally I would have just had forward return logits and package up the loss calculation elsewhere so as not to muddy the interface, I can see the convenience of having it there. Anyway, having done that, we can upload the model one final time, and then use that training code to run it. We have a working training loop! Once again, it's replying, even if it has no idea what the answer is, and starts looping in a typical small-model fashion. And with that, we're done. We've gone from having a custom model that was hard for other people to discover and work with, to something that plays well with the Hugging Face ecosystem.
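The shift-and-mask logic described above is easy to show in miniature, independent of PyTorch. The variable names follow the usual HF conventions (input_ids, attention_mask), but the numbers are made up:

```python
# Label shifting for causal-LM training: the logits at position t predict
# token t+1, so labels are the input ids shifted left by one position.
input_ids      = [5, 9, 2, 7]
attention_mask = [1, 1, 1, 0]   # last position is padding

shifted_labels = input_ids[1:]          # [9, 2, 7]
shifted_mask   = attention_mask[1:]     # [1, 1, 0]

# Only unmasked positions contribute to the loss.
pairs = [(pred_pos, label)
         for pred_pos, (label, m) in enumerate(zip(shifted_labels, shifted_mask))
         if m == 1]
assert pairs == [(0, 9), (1, 2)]  # position 0 predicts 9, position 1 predicts 2
```

With tensors, the same idea is the usual logits[..., :-1, :] versus labels[..., 1:] slicing before the cross-entropy call.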
The final step is to write a decent model card so that people know what to do with it -- that, of course, depends very much on your model. I was uploading a bunch of very similar models in one go, so I wound up writing a Jinja2 template and using the hub library's model card class to upload it, but that's just simple plumbing code -- you can see it here if you're interested. As I said at the start, this isn't a full tutorial -- it's just the code I needed to upload my own models, so it doesn't cover tokenisers that aren't already baked into Transformers -- and there are probably other gaps too. But hopefully it's useful as-is. If you find gaps that your model needs and work out how to solve them, then please do leave comments here -- if there are useful resources out there, either things I missed or things you've written, I'd be happy to link to them from this post. Thanks for reading! I'll be returning to my normal "LLM from scratch" series shortly... It's a fun coincidence that my initials are so similar to the architecture. Someday I should do something with my domain ...  ↩ I'm not sure why the capitalisation of the "t" is different -- vs -- but it seems very deliberate in the Transformers codebase, at least as of version 4.57.6. Some kind of backward-compatibility cruft, I assume. 5.0.0 provides an alias as well, so it looks like they're making things consistent in the future.  ↩ You might reasonably suggest that we could inherit from rather than wrapping it. I've chosen to wrap it instead because I generally prefer composition to inheritance -- the code generally works out nicer, to my mind. I'd suggest starting this way and then refactoring to use inheritance if you prefer later on.  ↩ No idea why, but it does ¯\_(ツ)_/¯  ↩ -- a file telling git (which is used to manage the models on the hub) which file types should use the Large File Storage plugin. Big binary files don't play nicely with git, so it uses LFS for them. We don't need to pay much more attention to that for our purposes.
-- that ugly model card. Updating that is useful, but out of scope for this post. . We'll come back to that one in a moment. -- a copy of the file we created locally with our class. -- again, the same file as the local one, uploaded due to that clever dependency-finding stuff. -- our weights. There should be an icon next to it to say that it's stored using the LFS system. -- once more, a file that was just copied up from our local filesystem. You're using the HF library. With that, you can save your tokeniser to a JSON file, then you could load that into a object, which provides a method to push it like I did with the one above. You've got something completely custom. Just like there is a and a , I believe you can also add a that defines a subclass of , and then you can push that to the Hub just like we did our model wrapper class. Working , , , and helpers. A working text-generation . Support for HF's abstraction for follow-on training and fine-tuning.

0 views
Langur Monkey 1 month ago

Yet another architectural update for Play Kid

I finished my previous post on Play Kid, only two days ago, with the following words: Next, I'll probably think about adding Game Boy Color support, but not before taking some time off from this project. Yeah, this was a lie. I have previously written about Play Kid, my Game Boy emulator. Here, I introduced it and talked about the base implementation and the quirks of the Game Boy CPU, PPU, and hardware in general. Here, I explained the tech stack update from SDL2 to Rust-native libraries. In that last post, I mentioned the dependency version hell I unwillingly descended into as a result of adopting as my rendering library. This forced me to stay on very old versions of and . I want my Game Boy emulator to use the latest crate versions for various reasons, so this could not be. I saw a simple and direct upgrade path, which consisted of adopting to manage the application life cycle, directly drawing the LCD to a texture, and dropping altogether. One of the things that nagged me about the crate is that its integration with was kind of lackluster. There was no easy way to render the frame buffer to an panel, so I had to render it centered inside the window. The immediate-mode GUI lived mostly in widgets on top of it. Unfortunately, these windows occluded the Game Boy LCD. In a proper debugger, you must be able to see the entire LCD plus the debug interface. So I dropped and adopted a render-to-texture approach. In it, you create the LCD texture at the beginning from the context, and then copy the LCD contents to it in the method. With this, we can easily render the texture in 's , and the debug interface in a to the right. This is the result: The debug panel, showing the machine state and a code disassembly. Some additional tweaks here and there, and the UI looks much more polished and professional in version 0.3.0. Version 0.4.0 enables loading ROM files from the UI.
Initially I thought about making the cartridge struct optional with , but this spiraled out of control fast. I found that making the full (which contains the , , , , etc.) optional worked much better, as there was only one reference to it in the top-level struct, . And, like so, you can dynamically load ROM files from the UI: Play Kid with the top menu bar and the 'Open ROM' menu entry. So, what's in the future for Play Kid? Well, there are a couple of features that I'd really like to add at some point: Save states —Currently, Play Kid emulates the SRAM by saving and restoring it from files for supported games. I would like to add saving and restoring the full state of the emulator in what is known as save states. Possibly, the crate can help with this. GBC —Of course, I would like to add Game Boy Color support. It is not trivial, but also not exceedingly complicated. I never owned a GBC, so I'd see this as a good opportunity to explore its game catalog.

0 views
Steve Klabnik 1 month ago

The most important thing when working with LLMs

Okay, so you’ve got the basics of working with Claude going. But you’ve probably run into some problems: Claude doesn’t do what you want it to do, it gets confused about what’s happening and goes off the rails, all sorts of things can go wrong. Let’s talk about how to improve upon that. The most important thing that you can do when working with an LLM is give it a way to quickly evaluate if it’s doing the right thing, and if it isn’t, point it in the right direction. This is incredibly simple, yet, like many simple things, also wildly complex. But if you can keep this idea in mind, you’ll be well equipped to become effective when working with agents. A long time ago, I used to teach programming classes. Many of these were to adults, but some of them were to children. Teenaged children, but children nonetheless. We used to do an exercise to try and help them understand the difference between talking in English and talking in Ruby, or JavaScript, or whatever kind of programming language rather than human language. The exercise went like this: I would have a jar of peanut butter, a jar of jelly, a loaf of bread, a spoon, and a knife. I would ask the class to take a piece of paper and write down a series of steps to make a peanut butter and jelly sandwich. They’d all then give me their algorithms, and the fun part for me began: find one that’s innocently written that I could hilariously misinterpret. For example, I might find one like: I’d read this aloud to the class, you all understand this is a recipe for a peanut butter and jelly sandwich, right? I’d take the jar of peanut butter and place it upon the unopened bag of bread. I’d do the same with the jar of jelly. This would of course, squish the bread, which feels slightly transgressive given that you’re messing up the bread, so the kids would love that. 
I’d then say something like “the bread is already together, I do not understand this instruction.” After the inevitable laughter died down, I’d make my point: the computer will do exactly what you say, but not what you mean. So you have to get good at figuring out when you said something different than what you mean. Sort of ironically, LLMs are kind of the inverse of this: they’ll sometimes try to figure out what you mean, and then do that, rather than simply doing what you say. But the core thing here is the same: semantic drift from what we intended our program to do, and what it actually does. The second lesson is something I came up with sometime, I don’t even remember how exactly. But it’s something I told my students a lot. And that’s this: If your program did everything you wanted without problems, you wouldn’t be programming: you’d be using your program. The act of programming is itself perpetually to be in a state where something is either inadequate or broken, and the job is to fix that. I also think this is a bit simplistic but also getting at something. I had originally come up with this in the context of trying to explain how you need to manage your frustration when programming; if you get easily upset by something not working, doing computer programming might not be for you. But I do think these two things combine into something that gets to the heart of what we do: we need to understand what it is we want our software to do, and then make it do that. Sometimes, our software doesn’t do something yet. Sometimes, it does something, but incorrectly. Both of these cases result in a divergence from the program’s intended behavior. So, how do we know if our program does what it should do? Well, what we’ve been doing so far is: This is our little mini software development lifecycle, or “SDLC.” This process works, but is slow. That’s great for getting the feel of things, but programmers are process optimizers by trade. 
One of my favorite tools for optimization is called Amdahl's law . The core idea is this, formulated in my own words: If you have a process that takes multiple steps, and you want to speed it up, if you optimize only one step, the maximum amount of speedup you'll get is determined by the portion of the process that step takes. In other words, imagine we have a three step process: This process takes a total of 13 minutes to complete. If we speed up step 3 by double, it goes from two minutes to one minute, and now our process takes 12 minutes. However, if we were able to speed up step 2 by double, we'd cut off five minutes, and our process would now take 8 minutes. We can use this style of analysis to guide our thinking in many ways, but the most common way, for me, is to decide where to put my effort. Given the process above, I'm going to look at step 2 first to try and figure out how to make it faster. That doesn't mean we can achieve the 2x speedup, but heck, if we get a 10% decrease in time, that's the same time as if we did get a 2x on step 3. So it's at least the place where we should start. I chose the above because, well, I think it properly models the proportion of time we're taking when doing things with LLMs: we spend some time asking it to do something, and we spend a bit more time reviewing its output. But we spend a lot of time clicking "accept edit," and a lot of time allowing Claude to execute tools. This will be our next step forward, as this will increase our velocity when working with the tools significantly. However, like with many optimization tasks, this is easier said than done. The actual mechanics of improving the speed of this step are simple at first: hit shift+tab to auto-accept edits, and choose "Yes, and don't ask again for commands" when you think the command is safe for Claude to run. By doing this, once you have enough commands allowed, your input for step 2 of our development loop can drop to zero.
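The arithmetic above is easy to sanity-check mechanically. Assuming the three steps take 1, 10, and 2 minutes (the text gives the 13-minute total and the ten- and two-minute steps; the one-minute first step is inferred):

```python
# Steps of the process, in minutes: ask (1), implement/monitor (10), review (2).
steps = [1, 10, 2]

def total_after_speedup(steps, index, factor):
    """Total process time after speeding up one step by `factor`."""
    sped = list(steps)
    sped[index] /= factor
    return sum(sped)

assert sum(steps) == 13
assert total_after_speedup(steps, 2, 2) == 12  # halving the 2-minute step
assert total_after_speedup(steps, 1, 2) == 8   # halving the 10-minute step
```

This is Amdahl's law in miniature: the biggest step bounds the payoff of any single optimization.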
Of course, it takes time for Claude to actually implement what you’ve asked, so it’s not like our 13 minute process drops to three, but still, this is a major efficiency step. But we were actively monitoring Claude for a reason. Claude will sometimes do incorrect things, and we need to correct it. At some point, Claude will say “Hey I’ve finished doing what you asked of me!” and it doesn’t matter how fast it does step 2 if we get to step 3 and it’s just incorrect, and we need to throw everything out and try again. So, how do we get Claude to guide itself in the right direction? A useful technique for figuring out what you should do is to consider the ending: where do we want to go? That will inform what we need to do to get there. Well, the ending of step 2 is knowing when to transition to step 3. And that transition is gated by “does the software do what it is supposed to do?” That’s a huge question! But in practice, we can do what we always do: start simple, and iterate from there. Right now, the transition from step 2 to step 3 is left up to Claude. Claude will use its own judgement to decide when it thinks that the software is working. And it’ll be right. But why leave that up to chance? I expect that some of you are thinking that maybe I’m belaboring this point. “Why not just skip to ? That’s the idea, right? We need tests.” Well on some level: yes. But on another level, no. I’m trying to teach you how to think here, not give you the answer. Because it might be broader than just “run the tests.” Maybe you are working on a project where the tests aren’t very good yet. Maybe you’re working on a behavior that’s hard to automatically test. Maybe the test suite takes a very long time, and so isn’t appropriate to be running over and over and over. Remember our plan from the last post? 
Where Claude finished the plan with this: These aren't "tests" in the traditional sense of a test suite, but they are objective measures that Claude can invoke itself to understand if it's finished the task. Claude could run after every file edit if it wanted to, and as soon as it sees , it knows that it's finished. You don't need a comprehensive test suite. You just need some sort of way for Claude to detect if it's done in some sort of objective fashion. Of course, we can do better. While giving Claude a way to know if it's done working is important, there's a second thing we need to pay attention to: when Claude isn't done working, can we guide it towards doing the right thing, rather than the wrong thing? For example, those of you who are of a similar vintage as myself may remember the output of early compilers. It was often… not very helpful. Imagine that we told Claude that it should run to know if things are working, and the only output from it was the exit code: 0 if we succeeded, 1 if we failed. That would accomplish our objective of letting Claude know when things are done, but it wouldn't help Claude know what went wrong when it returns 1. This is one reason why I think Rust works well with LLMs. Take this incorrect Rust program: The Rust compiler won't just say "yeah this program is incorrect," it'll give you this (as of Rust 1.93.0): The compiler will point out the exact place in the code itself of where there's an issue, and even make suggestions as to how to fix it. This goes beyond simply saying "it doesn't work" and instead nudges you to what might fix the problem. Of course, this isn't perfect, but if it's helpful more than not, that's a win. Of course, too much verbosity isn't helpful either. A lot of tooling has gotten much more verbose lately. Oftentimes, this is really nice as a human. Pleasant terminal output is, well… pleasant. But that doesn't mean that it's always good or useful.
For example, here’s the default output for : This is not bad output. It’s nice. But it’s also not useful for an LLM. We don’t need to read all of the tests that are passing, we really just want to see some sort of minimal output, and then what failed if something failed. In Cargo’s case, that’s for “quiet”: There is no point in giving a ton of verbose input to an LLM that it isn’t even going to need to use. If you’re feeding a tools’ output to an LLM, you should consider both what the tool does in the failure case, but also the success case. Maybe configure things to be a bit simpler for Claude. You’ll save some tokens and get better results. All of this has various implications for all sorts of things. For example, types are a great way to get quick feedback on what you’re doing. A comprehensive test suite that completes quickly is useful for giving feedback to the LLM. But that also doesn’t inherently mean that types must be better or that you need to be doing TDD; whatever gives you that underlying principle of “objective feedback for the success case and guidance for the failure case” will be golden, no matter what tech stack you use. This brings me to something that may be counter-intuitive, but I think is also true, and worth keeping in the back of your mind: what’s good for Claude is also probably good for humans working on your system. A good test suite was considered golden before LLMs. That it’s great for them is just a nice coincidence. At the end of the day, Claude is not a person, but it tackles programming problems in a similar fashion to how we do: take in the problem, attempt a solution, run the compiler/linter/tests, and then see what feedback it gets, then iterate. That core loop is the same, even if humans can exercise better judgement and can have more skill. And so even though I pitched fancy terminal output as an example of how humans and LLMs need different things, that’s really just a superficial kind of thing. 
Good error messages are still critical for both. We're just better at having terminal spinners not take up space in our heads while we're solving a problem, and can appreciate the aesthetics in a way that Claude does not. Incidentally, this is one of the things that makes me hopeful about the future of software development under agentic influence. Engineers always complain that management doesn't give us time to do refactorings, to improve the test suite, to clean our code. Part of the reason for this is that we often didn't do a good job of pitching how it would actually help accomplish business goals. But even if you're on the fence about AI, and upset that management is all about AI: explain to management that this stuff is a force multiplier for your agents. Use the time you've saved by doing things the agentic way towards improving your test suite, or your documentation, or whatever else. I think there's a chance that all of this stuff leads to higher quality codebases than ones filled with slop. But it also requires us to make the decisions that will lead us in that direction. That's what I have for you today: consider how you can help Claude evaluate its own work. Give it explicit success criteria, and make evaluating that criteria as simple and objective as possible. In the next post, we're gonna finally talk about . Can you believe that I've talked this much about how to use Claude and we haven't talked about ? There's good reason for that, as it turns out. We're going to talk a bit more about understanding how interacting with LLMs work, and how it can help us both improve step 1 in our process, but also continue to make step 2 better and better. Here's my post about this post on BlueSky.
The PB&J algorithm from the exercise: put the peanut butter on the bread; put the jelly on the bread; put the bread together. Our mini-SDLC so far: asking the LLM to do something by typing up what we want it to do; closely observing its behavior and course-correcting it when it goes off of the rails; eventually, after it says that it's finished, reviewing its output. (The step timings from the example: ten minutes; two minutes.)

0 views
Langur Monkey 1 month ago

Game Boy emulator tech stack update

In my previous post , I shared the journey of building Play Kid , my Game Boy emulator. At the time, I was using SDL2 to handle the “heavy lifting” of graphics, audio, and input. This was released as v0.1.0. It worked, and it worked well, but it always felt a bit like a “guest” in the Rust ecosystem. SDL2 is a C library at heart, and while the Rust wrappers are good, they bring along some baggage like shared library dependencies and difficult integration with Rust-native UI frameworks. So I decided to perform a heart transplant on Play Kid. For version v0.2.0 I’ve moved away from SDL2 entirely, replacing it with a stack of modern, native Rust libraries: , , , , , and : The most visible change is the new Debug Panel . The new integrated debugger features a real-time disassembly view and breakpoint management. One of the coolest additions is the Code disassembly panel. It decodes the ROM instructions in real-time, highlighting the current and allowing me to toggle breakpoints just by clicking on a line. The breakpoints themselves are now managed in a dedicated list, shown in red at the bottom. The rest of the debug panel shows what we already had: the state of the CPU, the PPU, and the joypad. Of course, no modern Rust migration is complete without a descent into dependency hell . This new stack comes with a major catch: is a bit of a picky gatekeeper. Its latest version is 0.15 (January 2025). It is pinned to an older version of (0.19 vs the current 28.0), and it essentially freezes the rest of the project in a time capsule. To keep the types compatible, I’m forced to stay on 0.26 (current is 0.33) and 0.29 (current is 0.30), even though the rest of the ecosystem has moved on to much newer, shinier versions. It’s kind of frustrating. You get the convenience of the buffer, but you pay for it by being locked out of the latest API improvements and features. Navigating these version constraints felt like solving a hostage negotiation between crate maintainers. 
Not very fun. Despite the dependency issues, I think the project is now in a much better place. The code is cleaner, the debugger is much better, and it’s easier to ship binaries for Linux, Windows, and macOS via GitHub Actions. If you’re interested in seeing the new architecture or trying out the new debugger, the code is updated on Codeberg and GitHub . Next, I’ll probably think about adding Game Boy Color support, but not before taking some time off from this project. & : These handle the windowing and the actual Game Boy frame buffer. allows me to treat the 160x144 LCD as a simple pixel buffer while handles the hardware-accelerated scaling and aspect ratio correction behind the scenes. : This was a big step-up. Instead of my minimal homegrown UI library from the SDL2 version, I now have access to a full-featured, immediate-mode GUI. This allowed me to build the debugger I had in mind from the beginning. & : These replaced SDL2’s audio and controller handling with pure-Rust alternatives that feel much more ergonomic to use alongside the rest of the machine.

0 views
ava's blog 1 month ago

my theme for 2026

Last year, I made a post called " My theme for 2025 ". Inchwyrm's post about the year of the wizard reminded me I should do another one for this year. My 2025 theme was 'learning'. I think I have managed that pretty well, even if it wasn't exactly the things I mentioned in the post. I just cannot make time for The Odin Project and Rust, and to make little games; I have to prioritize my studies, my volunteer work, staying up to date on data protection law and writing about it. Maybe one day :) The rest fits though: I passed everything I enrolled in last year, and finished the certification process in just 6 months. I started summarizing and translating cases for noyb.eu, and I was creative with my notebook and some pixel art. I learned a lot and I tried new things. My theme for this year is ' rejection '. I'm collecting them! A little while ago, the concept of collecting 1,000 no's was picked up by some blogs, and it helped me view rejection, criticism and other feedback in a more positive light. I want to grow, I want to try new things, and I want to become (positively) hardened by challenge. It feels uncomfortable and a part of me doesn't want to, but in a way, I also want to be humbled in a constructive way. This year, I will send out a lot of applications for both new work and a new apartment. That will undoubtedly result in a lot of no's; the market for both is just incredibly tough right now, and there always seems to be someone better. I have already received one rejection this year just a week after I sent out the application for something I thought for sure I'd at least get an interview from, so there's that. Other ongoing things that produce rejections: You can also help with something rejection-adjacent: This is your opportunity to give me constructive criticism on what has always bothered you about my blog's theme, writing, or my behavior. I want the pressure and polish to result in a version of me that is better. That's what I need right now.
I have relied long enough on mostly gut feelings, learning by myself and my own assessments of myself, and always thought I had to do it all alone; but I need outside feedback now, especially from people who want to see me grow and do better. I want to know how I can improve. Reply via email Published 23 Jan, 2026 I have submitted an idea to my workplace's idea management team and they are notorious for shooting down anything, but at least I tried. I'm sending out e-mails for a blog project I wanna do, and I have received no answer so far from the places/people I've messaged. I'll have to rethink my approach and then keep trying. Doing things that are a little embarrassing, like my post making known I am looking for work.

0 views
Anton Zhiyanov 1 month ago

Interfaces and traits in C

Everyone likes interfaces in Go and traits in Rust. Polymorphism without class-based hierarchies or inheritance seems to be the sweet spot. What if we try to implement this in C? Interfaces in Go • Traits in Rust • Toy example • Interface definition • Interface data • Method table • Method table in implementor • Type assertions • Final thoughts

An interface in Go is a convenient way to define a contract for some useful behavior. Take, for example, the honored io.Reader: anything that can read data into a byte slice provided by the caller is a Reader. Quite handy, because the code doesn't need to care where the data comes from — whether it's memory, the file system, or the network. All that matters is that it can read the data into a slice, and we can provide any kind of reader. Go's interfaces are structural, which is similar to duck typing: a type doesn't need to explicitly state that it implements Reader; it just needs to have a Read method. The Go compiler and runtime take care of the rest.

A trait in Rust is also a way to define a contract for certain behavior, such as the Read trait. Unlike in Go, a type must explicitly state that it implements a trait; the Rust compiler takes care of the rest. Either way, whether it's Go or Rust, the caller only cares about the contract (defined as an interface or trait), not the specific implementation.

Let's make an even simpler version of Reader, one without any error handling, and see how we can do this in C! The main building blocks in C are structs and functions, so let's use them. Our Reader will be a struct with a single field: a pointer to a read function with the right signature. To make the concrete reader fully dynamic, let's turn it into a struct with a function pointer too (I know, I know — just bear with me), implement the read "method", write the constructor, and finally a main function to drive it. See how easy it is to turn a concrete reader into a Reader: all we need is a type cast. Pretty cool, right? Not really.
Actually, this implementation is seriously flawed in almost every way (except for the interface definition).

Memory overhead. Each instance carries its own function pointers (8 bytes per function on a 64-bit system) as "methods", which isn't practical even if there are only a few of them. Regular objects should store data, not functions.

Layout dependency. Casting the concrete reader to the interface type only works if both structures have the same function-pointer field as their first member. If we try to implement another interface, everything falls apart: the two structs have different layouts, so the type conversion is invalid and causes undefined behavior.

Lack of type safety. Using a void pointer as the receiver means the caller can pass a value of any type, and the compiler won't even show a warning. C isn't a particularly type-safe language, but this is just too much. Let's try something else.

A better way is to store a reference to the actual object in the interface. We could have the read method take a typed pointer instead of a void pointer, but that would make the implementation more complicated without any real benefits, so I'll keep the void pointer. The concrete reader then only has its own data fields, and we can make the conversion type-safe: we add a method that returns the instance wrapped in the interface. The read implementation and the calling code remain quite simple.

This approach is much better than the previous one. Since our type now knows about the interface (through the conversion method), our implementation is more like a basic version of a Rust trait than a true Go interface. For simplicity, I'll keep using the term "interface". There is one downside, though: each instance has its own function pointer for every interface method. Since Reader only has one method, this isn't an issue. But if an interface has a dozen methods and the program uses a lot of these interface instances, it can become a problem. Let's fix this. Let's extract interface methods into a separate structure — the method table.
The interface references its methods through the method table field; the concrete reader struct and its read implementation don't change at all. The conversion method initializes the static method table and assigns it to the interface instance. The only difference on the calling side is that it calls the read method indirectly, via the method table. Now the instance always has a single pointer field for its methods, so even for large interfaces the interface value only uses 16 bytes (two pointers: the object and the method table). This approach also keeps all the benefits from the previous version. We can even add a separate helper so the client doesn't have to worry about implementation details.

There's another approach I've seen out there. I don't like it, but it's still worth mentioning for completeness. Instead of embedding the method table in the interface, we can place it in the implementation (the concrete struct). We initialize the method table in the constructor, the read function takes a pointer to the concrete type, and converting to the interface is a simple type cast. This keeps the concrete type pretty lightweight, only adding one extra field. But the cast only works because the method table is the first field in the struct. If we try to implement a second interface, things will break — just like in the very first solution. I think the "method table in the interface" approach is much better.

Go has an io.Copy function that copies data from a source (a reader) to a destination (a writer). There's an interesting comment in its documentation: if the source implements WriterTo, the copy is implemented by calling src.WriteTo(dst); otherwise, if the destination implements ReaderFrom, the copy is implemented by calling dst.ReadFrom(src). Inside the function, a type assertion checks whether the reader is not just a Reader, but also implements the WriterTo interface. The Go runtime handles these kinds of dynamic type checks.

Can we do something like this in C? I'd prefer not to make it fully dynamic, since trying to recreate parts of the Go runtime in C probably isn't a good idea. What we can do is add an optional method to the interface. Then we can easily check whether a given Reader also supports it. Still, this feels a bit like a hack.
I'd rather avoid using type assertions unless it's really necessary.

Interfaces (traits, really) in C are possible, but they're not as simple or elegant as in Go or Rust. The method table approach we discussed is a good starting point: it's memory-efficient, as type-safe as possible given C's limitations, and supports polymorphic behavior. To recap its benefits: the concrete struct stays lean and doesn't carry any interface-related fields; the conversion method hides the cast to the void pointer inside the implementation; the interface value itself is lightweight; conversion from the concrete type to the interface is easy; and we can implement multiple interfaces if needed. Here's the full source code if you are interested.

0 views
Steve Klabnik 1 month ago

Agentic development basics

In my last post, I suggested that you should start using Claude in your software development process via read-only means at first. The idea is just to get used to interacting with the AI, seeing what it can do, and seeing what it struggles with. Once you've got a handle on that part, it's time to graduate to writing code.

However, I'm going to warn you about this post: I hope that by the end of it, you're a little frustrated. This is because I don't think it's productive to skip straight to the tools and techniques that experienced users rely on. We have to walk before we run. And more importantly, we have to understand how and why we run. That is, I hope that this step will let you start producing code with Claude, but it will also show you some of the initial pitfalls of doing so, in order to motivate the techniques you're going to learn about in part 3. So with that in mind, let's begin.

Okay, I lied. Before we actually begin: you are using version control, right? If not, you may want to go learn a bit about it. Version control, like git (or my beloved jj), is pretty critical for software development, but it's in my opinion even more critical for this sort of development. You really want to be able to restore previous versions of the code, branch off and try things, and recover from mistakes. If you already use version control systems religiously, you might use this as an excuse to learn even more of their features; there are features I never bothered with in the past that I use with agents all the time now.

Okay, here's my first bit of advice: commit yourself to not writing code any more. I don't mean forever, I don't mean all the time. I mean: while you're trying to learn agentic software development, on the project that you're learning it with, just don't write any code manually. This might be a bit controversial! However, I think it is essential. Let me tell you a short story.

Many years ago, I was in Brazil. I wanted to try scuba diving for the first time.
Seemed like a good opportunity. Now, I don't remember the exact setup, but I do remember the hardest part for me. Our instructor told us to put the mask on, and then lean forward and put our faces in the water and breathe through the regulator. I simply could not do it. I got too in my head; it was like those "you are now breathing manually" memes. I forget if it was my idea or the instructor's, but what happened in practice: I just jumped in. My brain very quickly went from "but how do I do this properly" to "oh God, if you don't figure this out right the fuck now you're gonna fuckin die, idiot," and that's exactly what I needed. A few seconds later, I was breathing just fine. I just needed the shock to my system; I needed to commit. And I figured it out.

Now, I'm turning 40 and this happened long ago, and it's reached more of a personal myth status in my head, so maybe I got the details wrong, and maybe this is horribly irresponsible and someone who knows about diving can tell me if that experience was wrong. But the point has always stuck with me: sometimes, you just gotta go for it.

This dovetails into another part I snuck in there: "on the project that you're learning it with." I really think that you should start a new project for this endeavor, for a few reasons (noted at the end of the post). Pick something to get started with, and create a new repo. I suggest something you've implemented before, or maybe something you know how to do but have never bothered to take the time to actually build. A small CLI tool might be a good idea. Doesn't super matter what it is. For the purposes of this example, I'm going to build a task tracking CLI application. Because there aren't enough of those in the world.

I recommend making a new fresh directory, and initializing the project of your choice. I'm using Rust, of course, so I'll cargo new it. You can make Claude do it, but I don't think that starting from an initialized project is a bad idea either.
I'm more likely to go the do-it-myself route if I know I'm building something small, and more of the "make Claude do it" route if I'm doing, like, a web app with a frontend, backend, and several services. Anyway, point is: get your project to exist, and then just run Claude Code. I guess you should install it first, but anyway, you can just run claude to get started. At the time of writing, Claude will ask you if you trust the files in this folder; you want to say yes, because you just created it. You'll also get some screens asking if you want light or dark mode, to log in with your Anthropic account, stuff like that. And then you'll be in.

Claude will suggest you run /init to create a CLAUDE.md, but we're not gonna do that at the start. We need to talk about even more basic things than that first. You'll be at a prompt, and it'll even suggest something for you to get started with. Mine says "Try "fix typecheck errors"" at the moment.

We're gonna try to get Claude to modify a few lines of our program. A fresh Rust project produces a program that prints "Hello, world!", so I'll ask Claude this:

Hi claude! right now my program prints "Hello, world", can we have it print "Goodbye, world" instead?

Claude proposes an edit, and then asks for permission. Claude wants to edit a file, and so by default, it has to ask us before doing so. This is a terrible place to end up, but a great place to get started! We want to use these prompts at first to understand what Claude is doing, and sort of "code review" it as we go. More on that in a bit. Anyway, this is why you should answer "Yes" to this question, and not "Yes, allow all edits during this session." We want to keep reviewing the code for now. You want to be paying close attention to what Claude is doing, so you can build up some intuition about it. Before clicking yes, I want to talk about what Claude did in my case here.
Note that my prompt said: right now my program prints "Hello, world", can we have it print "Goodbye, world". Astute observers will have noticed that the program actually says "Hello, world!" and not "Hello, world". We also asked it to have it say "Goodbye, world", and it is showing a diff that will make it say "Goodbye, world!". This is a tiny difference, but it is also important to understand: Claude is going to try and figure out what you mean and do that. This is both the source of these tools' power and also the very thing that makes it hard to trust them. In this case, Claude was right: I didn't type the exact string when describing the current behavior, and I didn't mean to remove the exclamation point.

In the previous post, I said that you shouldn't be mean to Claude. I think it makes the LLM perform worse. So now it's time to talk about your own reaction to the above: did you go "yeah, Claude fucked up, it didn't do exactly what I asked?" or did you go "yeah, Claude did exactly what I asked?" I think it's important to try and let go of preconceived notions here, especially if your reaction was negative. I know this is kind of woo, just like "be nice to Claude," but you have to approach this as "this is a technology that works a little differently than I'm used to, and that's why I'm learning how to meet it on its own terms" rather than "it didn't work the way I expected it to, so it is wrong." A non-woo way of putting it is this: the right way to approach "it didn't work" in this context is "that's a skill issue on my part, and I'm here to sharpen my skills." Yes, there are limits to this technology, and it's not perfect. That's not the point. You're not doing that kind of work right now; you're learning.

Now, I should also say that, like, if you don't want to learn a new tool? 100% okay with me. Learned some things about a tool, and didn't like it? Sure! Some of you won't like agentic development. That's okay. No worries, thanks for reading, have a nice day. I mean that honestly.
But for those folks who do want to learn this, I'm trying to communicate that I think you'll have a better time learning it if you try to get into the headspace of "how do I get the results I want" rather than getting upset and giving up when it doesn't work out.

Okay, with that out of the way: if you asked a small enough question, Claude probably did the right thing. Let's accept. This might be a good time to commit & save your progress. You can use ! to put Claude Code into "bash mode" and run commands, so I just commit right there and I'm good. You can also use another terminal, I just figured I'd let you know. It's good for short commands.

Let's try something bigger. To do that, we're gonna invoke something called "plan mode." Claude Code has three (shhhh, we don't talk about the fourth yet) modes. The first one is the "ask to accept edits" mode. But if you hit shift+tab, you'll see "accept edits on" at the bottom left. We don't want to automatically accept edits. Hit shift+tab again, and you'll see "plan mode on". This is what we want. Plan mode.

Plan mode is useful any time you're doing work that's on the larger side, or just when you want to think through something before you begin. In plan mode, Claude cannot modify your files until you accept the plan. With plan mode, you talk to Claude about what you want to do, and you collaborate on a plan together. A nice thing about it is that you can communicate the things you are sure of, and also ask about the things you're not sure of. So let's kick off some sort of plan to build the most baby parts of our app. In my case, I'm prompting it with this:

hi claude! i want this application to grow into a task tracking app. right now, it's just hello world. I'd like to set up some command line argument parsing infrastructure, with a command that prints the version. can we talk about that?

Yes, I almost always open with "hi claude!"; feel free to not. And I always feel like a "can we talk about that" on the end is nice too. I try to talk to Claude the way I'd talk to a co-worker.
Obviously this would be too minor a thing to bother a co-worker about, but, you know, baby steps. Note that what I'm asking for is basically a slightly more complex "hello world", just getting some argument parsing up. You want something this size: you know how it should be done, it should be pretty simple, but it's not a fancy command.

In plan mode, Claude will respond to this by taking a look at your project, considering what you might need, and then coming up with a plan. It usually writes the plan out to a file somewhere. You can go read the file if you want to, but it's not needed: Claude is going to eventually present the plan to you directly, and you'll review it before moving on.

Claude will also probably ask you a question, or maybe even a couple, depending. There's a neat little TUI for responding to its questions; it can even handle multiple questions at once. For those of you who don't write Rust, Claude's suggestions here were pretty good! Clap is the default choice in the ecosystem, there are lighter-weight options too, and doing it yourself is always possible. I'm going to choose clap, it's great. If you're not sure about the question, you can arrow down to "Chat about this" and discuss it more.

Here's why you don't need to read the file: Claude will pitch you on its plan directly. This one is pretty good! Now, if you like the plan, you can select #3. Remember, we're not auto-accepting just yet! Don't worry about the difference between 1 and 2 now, we'll talk about it someday. But I actually want Claude to tweak this plan: the command it proposed isn't quite the interface I'd use. So I'm going to go down to four and tell Claude literally that. Claude replies, and then presents the menu again. See, this is where the leverage of "Claude figures out what I mean" can be helpful: I only mentioned one command, but it changed the related one as well.
However, there's a drawback too: we didn't tell Claude we wanted help output! Then again, this is also a positive: Claude considered help output so basic that it suggested it for our plan. It's up to you to decide if this is an overreach on Claude's part. In my case, I'm okay with it, because it's nearly automatic with clap and it's something I certainly want in my tool, so I'm going to accept this plan. Iterate with Claude until you're happy with the plan, and then do the same.

I'm not going to paste all of the diffs here, but for me, Claude then went and did the plan: it added the dependency to Cargo.toml, it added the needed code to main.rs, and it ran cargo build to try the build. Oh yeah, here's a menu we haven't seen yet, and it's a more interesting question than the auto-edit thing: Claude won't run commands without you signing off on them. If you're okay with letting Claude run this command every time without asking, you can choose 2, and if you want to confirm every time, you can choose 1. Completely up to you.

It then ran the new commands, and everything looked good. And that's it, we've built our first feature! Yeah, it's pretty small, and in this case we probably could have copy/pasted the documentation, but again, that's not the point right now: we're just trying to take very small steps forward to get used to working with the tool. We want it to be something that's quick for us to verify. We are spending more time here than we would if we did it by hand, but that time isn't wasted: it's time spent learning. As we ramp up the complexity of what we can accomplish, we'll start seeing speed gains. But we're deliberately going slow and doing little right now.

From here, that's exactly what I'd like you to do: figure out where the limits are. Try something slightly larger, slightly harder. Try to do stuff with just a prompt, and then throw that commit away and try it again with plan mode.
See what stuff you need plan mode for, and what you can get away with with just a simple prompt. I'll leave you with an example of a prompt I might try next: asking Claude for their opinion on what you should do. The next thing I'd do in this project is to switch back into plan mode and ask this:

what do you think is the best architecture for this tracking app? we haven't discussed any real features, design, or requirements. this is mostly a little sample application to practice with, and so we might not need a real database, we might get away with a simple file or files to track todos. but maybe something like sqlite would still be appropriate. if we wanted to implement a next step for this app, what do you think it should be?

The plan Claude suggested was pretty solid. But again, it's a demo app. The important part is, you can always throw this away if you don't like it. So try some things. Give Claude some opinions and see how they react. Try small features, try something larger. Play around. But push it until Claude fails.

At some point, you will run into problems. Maybe you already have! What you should do depends on the failure mode. The first failure you're gonna run into is "I don't like the code!" Maybe Claude just does a bad job. You have two options: the first is to just tell Claude to fix it. Claude made a mess, Claude can clean it up. In more extreme cases, you may want to simply throw the changes away and start again. Honestly, the second approach is better in a lot of cases, but I'm not going to recommend it just yet. The reason it's better is that it gives you some time to reflect on why Claude did a bad job. But we're gonna talk about that in the next post in this series! So for now, stick to the 'worse' option: just tell Claude to fix the problems you find.

The second kind of failure is one where Claude just really struggles to get things right.
It looks in the wrong places in your codebase, it tries to find a bug and can't figure it out, it misreads output and that leads it astray, etc. This kind of failure is harder to fix with the tools you have available to you right now. What matters is taking note of these failures, so that you can email them to me, haha. I mean, feel free to do that, and I can incorporate specific suggestions into the next posts, but also, just being able to reflect on what Claude struggles with is a good idea generally. You'll be able to fix them eventually, so knowing what you need to improve matters.

If Claude works long enough, you'll see something about "compaction." We haven't discussed things at a deep enough level to really understand this yet, so don't worry about it! You may want to note one thing, though: Claude tends to do worse after compaction, in my opinion. So one way to think about this is, "If I see compaction, I've tried to accomplish too large a task." Reflect on whether you could have broken it up into something smaller. We'll talk about this more in the next post.

So that's it! Let Claude write some code, in a project you don't care about. Try bigger and harder things until you're a bit frustrated with its failures. You will hit the limits, because you're not doing any of the intermediate techniques to help Claude do a good job. But my hope is, by running into these issues, you'll understand the motivation for those techniques, and will be able to better apply them in the future.

Here's my post about this post on BlueSky: steveklabnik.com/writing/agen...

A couple of notes on why I recommend a new project: You can be less precious about the code. This isn't messing up one of your projects, this is a throwaway scratch thing that doesn't matter. And "AI does better on greenfield projects" is not exactly true, but there's enough truth to it that I think you should do a new project. It's really more about secondary factors than it is actual greenfield vs brownfield development, but whatever, doesn't matter: start new.

1 views