Posts in Rust (20 found)
baby steps 2 days ago

Hello, Dada!

Following on my Fun with Dada post, this post is going to start teaching Dada. I'm going to keep each post short – basically just what I can write while having my morning coffee.1 Here is a very first Dada program. I think all of you will be able to guess what it does. Still, there is something worth noting even in this simple program: "You have the right to write code. If you don't write a function explicitly, one will be provided for you." Early on I made the change to let users omit the function, and I was surprised by what a difference it made in how light the language felt. Easy change, easy win. Here is another Dada program. Unsurprisingly, it does the same thing as the last one. "Convenient is the default." Strings support interpolation by default. In fact, that's not all they support: you can also break them across lines very conveniently. When you have an opening quote immediately followed by a newline, the leading and trailing newlines are stripped, along with the "whitespace prefix" from the subsequent lines. Internal newlines are kept. Of course, you could also annotate the type of the variable explicitly. You will find that it is Dada's string type. This in and of itself is not notable, unless you are accustomed to Rust, where the type of a string literal would be &'static str. That is of course a perennial stumbling block for new Rust users, but more than that, I find it to be a big annoyance – I hate that I have to write .to_string() everywhere that I mix constant strings with strings that are constructed. As in most modern languages, strings in Dada are immutable, so you can create them and copy them around. OK, we really just scratched the surface here! This is just the "friendly veneer" of Dada, which looks and feels like a million other languages. Next time I'll start getting into the permission system and mutation, where things get a bit more interesting.
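The multi-line string rule described above (drop the leading and trailing newline, then remove the common whitespace prefix, keeping internal newlines) behaves much like Python's textwrap.dedent. Here's a rough sketch of the idea in Python — an illustration of the described behavior, not Dada's actual implementation:

```python
import textwrap

def dada_style_strip(s: str) -> str:
    # Drop the newline right after the opening quote and the one
    # before the closing quote, then strip the common whitespace
    # prefix from the remaining lines. Internal newlines survive.
    s = s.removeprefix("\n").removesuffix("\n")
    return textwrap.dedent(s)

raw = "\n    Hello, Dada!\n    Internal newlines are kept.\n"
print(dada_style_strip(raw))
```

This prints the two lines without their four-space prefix, with the internal newline intact.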
My habit is to wake around 5am and spend the first hour of the day doing "fun side projects". But for the last N months I've actually been doing Rust stuff, like symposium.dev and preparing the 2026 Rust Project Goals. Both of these are super engaging, but all Rust and no play makes Niko a dull boy. Also a grouchy boy. ↩︎

Simon Willison 3 days ago

How StrongDM's AI team build serious software without even looking at the code

Last week I hinted at a demo I had seen from a team implementing what Dan Shapiro called the Dark Factory level of AI adoption, where no human even looks at the code the coding agents are producing. That team was part of StrongDM, and they've just shared the first public description of how they are working in Software Factories and the Agentic Moment:

We built a Software Factory: non-interactive development where specs + scenarios drive agents that write code, run harnesses, and converge without human review. [...] In kōan or mantra form: "Why am I doing this?" (implied: the model should be doing this instead). In rule form: code must not be written by humans, and code must not be reviewed by humans. Finally, in practical form: if you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.

I think the most interesting of these, without a doubt, is "Code must not be reviewed by humans". How could that possibly be a sensible strategy when we all know how prone LLMs are to making inhuman mistakes? I've seen many developers recently acknowledge the November 2025 inflection point, where Claude Opus 4.5 and GPT 5.2 appeared to turn the corner on how reliably a coding agent could follow instructions and take on complex coding tasks. StrongDM's AI team was founded in July 2025 based on an earlier inflection point relating to Claude Sonnet 3.5:

The catalyst was a transition observed in late 2024: with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error. By December of 2024, the model's long-horizon coding performance was unmistakable via Cursor's YOLO mode.

Their new team started with the rule "no hand-coded software" - radical for July 2025, but something I'm seeing significant numbers of experienced developers start to adopt as of January 2026. They quickly ran into the obvious problem: if you're not writing anything by hand, how do you ensure that the code actually works? Having the agents write tests only helps if they don't cheat.
This feels like the most consequential question in software development right now: how can you prove that the software you are producing works if both the implementation and the tests are being written for you by coding agents? StrongDM's answer was inspired by Scenario testing (Cem Kaner, 2003). As StrongDM describe it:

We repurposed the word scenario to represent an end-to-end "user story", often stored outside the codebase (similar to a "holdout" set in model training), which could be intuitively understood and flexibly validated by an LLM. Because much of the software we grow itself has an agentic component, we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term satisfaction to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?

That idea of treating scenarios as holdout sets - used to evaluate the software but not stored where the coding agents can see them - is fascinating. It imitates aggressive testing by an external QA team - an expensive but highly effective way of ensuring quality in traditional software. Which leads us to StrongDM's concept of a Digital Twin Universe - the part of the demo I saw that made the strongest impression on me. The software they were building helped manage user permissions across a suite of connected services. This in itself was notable - security software is the last thing you would expect to be built using unreviewed LLM code!

[The Digital Twin Universe is] behavioral clones of the third-party services our software depends on. We built twins of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, replicating their APIs, edge cases, and observable behaviors. With the DTU, we can validate at volumes and rates far exceeding production limits. We can test failure modes that would be dangerous or impossible against live services.
We can run thousands of scenarios per hour without hitting rate limits, triggering abuse detection, or accumulating API costs.

How do you clone the important parts of Okta, Jira, Slack and more? With coding agents! As I understood it, the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API, as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation. With their own, independent clones of those services - free from rate limits or usage quotas - their army of simulated testers could go wild. Their scenario tests became scripts for agents to constantly execute against the new systems as they were being built. This screenshot of their Slack twin also helps illustrate how the testing process works, showing a stream of simulated Okta users who are about to need access to different simulated systems. This ability to quickly spin up a useful clone of a subset of Slack helps demonstrate how disruptive this new generation of coding agent tools can be:

Creating a high fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against, but self-censored the proposal to build it.

The techniques page is worth a look too. In addition to the Digital Twin Universe they introduce terms like Gene Transfusion, for having agents extract patterns from existing systems and reuse them elsewhere; Semports, for directly porting code from one language to another; and Pyramid Summaries, for providing multiple levels of summary such that an agent can enumerate the short ones quickly and zoom in on more detailed information as it is needed. StrongDM AI also released some software - in an appropriately unconventional manner.
github.com/strongdm/attractor is Attractor, the non-interactive coding agent at the heart of their software factory. Except the repo itself contains no code at all - just three markdown files describing the spec for the software in meticulous detail, and a note in the README that you should feed those specs into your coding agent of choice!

github.com/strongdm/cxdb is a more traditional release, with 16,000 lines of Rust, 9,500 of Go and 6,700 of TypeScript. This is their "AI Context Store" - a system for storing conversation histories and tool outputs in an immutable DAG. It's similar to my LLM tool's SQLite logging mechanism but a whole lot more sophisticated. I may have to gene transfuse some ideas out of this one!

I visited the StrongDM AI team back in October as part of a small group of invited guests. The three-person team of Justin McCarthy, Jay Taylor and Navan Chauhan had formed just three months earlier, and they already had working demos of their coding agent harness, their Digital Twin Universe clones of half a dozen services, and a swarm of simulated test agents running through scenarios. And this was prior to the Opus 4.5/GPT 5.2 releases that, a month after those demos, made agentic coding significantly more reliable. It felt like a glimpse of one potential future of software development, where software engineers move from building the code to building and then semi-monitoring the systems that build the code. The Dark Factory.

I glossed over the cost of all this in my first published version of this post, but it deserves some serious attention. If these patterns really do add $20,000/month per engineer to your budget they're far less interesting to me. At that point this becomes more of a business model exercise: can you create a profitable enough line of products that you can afford the enormous overhead of developing software in this way?
Building sustainable software businesses also looks very different when any competitor can potentially clone your newest features with a few hours of coding agent work. I hope these patterns can be put into play with a much lower spend. I've personally found the $200/month Claude Max plan gives me plenty of space to experiment with different agent patterns, but I'm also not running a swarm of QA testers 24/7! I think there's a lot to learn from StrongDM even for teams and individuals who aren't going to burn thousands of dollars on token costs. I'm particularly invested in the question of what it takes to have agents prove that their code works without needing to review every line of code they produce.

You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.

Simon Willison 4 days ago

Running Pydantic's Monty Rust sandboxed Python subset in WebAssembly

There's a jargon-filled headline for you! Everyone's building sandboxes for running untrusted code right now, and Pydantic's latest attempt, Monty, provides a custom Python-like language (a subset of Python) in Rust and makes it available as both a Rust library and a Python package. I got it working in WebAssembly, providing a sandbox-in-a-sandbox. Here's how they describe Monty:

Monty avoids the cost, latency, complexity and general faff of using a full container-based sandbox for running LLM generated code. Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.

What Monty can do:

- Run a reasonable subset of Python code - enough for your agent to express what it wants to do
- Completely block access to the host environment: filesystem, env variables and network access are all implemented via external function calls the developer can control
- Call functions on the host - only functions you give it access to [...]

A quick way to try it out is via uv; then paste the example into a Python interactive prompt launched with top-level await enabled. Monty supports a very small subset of Python - it doesn't even support class declarations yet! But, given its target use-case, that's not actually a problem. The neat thing about providing tools like this for LLMs is that they're really good at iterating against error messages. A coding agent can run some Python code, get an error message telling it that classes aren't supported, and then try again with a different approach.

I wanted to try this in a browser, so I fired up a code research task in Claude Code for web and kicked it off with the following:

Clone https://github.com/pydantic/monty to /tmp and figure out how to compile it into a python WebAssembly wheel that can then be loaded in Pyodide. The wheel file itself should be checked into the repo along with build scripts and passing pytest playwright test scripts that load Pyodide from a CDN and the wheel from a "python -m http.server" localhost and demonstrate it working

Then a little later:

I want an additional WASM file that works independently of Pyodide, which is also usable in a web browser - build that too along with playwright tests that show it working. Also build two HTML files - one called demo.html and one called pyodide-demo.html - these should work similar to https://tools.simonwillison.net/micropython (download that code with curl to inspect it) - one should load the WASM build, the other should load Pyodide and have it use the WASM wheel. These will be served by GitHub Pages so they can load the WASM and wheel from a relative path since the .html files will be served from the same folder as the wheel and WASM file

Here's the transcript, and the final research report it produced. I now have the Monty Rust code compiled to WebAssembly in two different shapes - as a bundle you can load and call from JavaScript, and as a wheel file which can be loaded into Pyodide and then called from Python in Pyodide in WebAssembly in a browser. Here are those two demos, hosted on GitHub Pages:

- Monty WASM demo - a UI over JavaScript that loads the Rust WASM module directly.
- Monty Pyodide demo - this one provides an identical interface, but here the code is loading Pyodide and then installing the Monty WASM wheel.

As a connoisseur of sandboxes - the more options the better! - this new entry from Pydantic ticks a lot of my boxes. It's small, fast, widely available (thanks to Rust and WebAssembly) and provides strict limits on memory usage, CPU time and access to disk and network. It was also a great excuse to spin up another demo showing how easy it is these days to turn compiled code like C or Rust into WebAssembly that runs in both a browser and a Pyodide environment.
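The core sandboxing idea here - interpret only an allowed subset of the language, and route all host access through functions the developer explicitly grants - can be sketched in a few lines of plain Python. This is a toy illustration of the pattern using the standard ast module, not Monty's implementation or API:

```python
import ast

# Whitelist of AST node types the sandbox will interpret. Anything
# outside this set (imports, attribute access, classes, ...) is rejected
# before evaluation, so the guest code can't reach the host environment.
ALLOWED = (ast.Module, ast.Expr, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Name, ast.Load, ast.Call, ast.Add, ast.Sub, ast.Mult,
           ast.Div, ast.USub)

def run_sandboxed(source: str, host_funcs: dict):
    tree = ast.parse(source, mode="exec")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Call) and not isinstance(node.func, ast.Name):
            raise ValueError("only plain host-function calls are allowed")
    # Evaluate the final expression with no builtins; the only names in
    # scope are the host functions the caller chose to expose.
    last = tree.body[-1].value
    return eval(compile(ast.Expression(last), "<sandbox>", "eval"),
                {"__builtins__": {}}, dict(host_funcs))

print(run_sandboxed("double(20) + 2", {"double": lambda x: x * 2}))  # prints 42
```

Here `double` plays the role of a host function the developer has granted; `run_sandboxed("import os", {})` is rejected at the parse-check stage. Monty does far more (resource limits, its own interpreter in Rust), but the "deny by default, expose host calls explicitly" shape is the same.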

Evan Schwartz 6 days ago

Scour - January Update

Hi friends, In January, Scour scoured 805,241 posts from 16,555 feeds (939 were newly added). I also rolled out a lot of new features that I'm excited to tell you about. Maybe because of some of these, I found more posts than usual that I thought were especially worth sharing. You can find them at the bottom of this post. Let's dive in! The Scour homepage has been completely revamped. It includes a new tagline, a more succinct description, and a live demo where you can try out my feed right from that page. Let me know what you think! Scour also finally has its own logo! (And it looks great on my phone's home screen, if I do say so myself! See below.) Have you ever wondered how Scour works? There is now a full documentation section, complete with detailed write-ups about Interests, Feeds, Reactions, How Ranking Works, and more. There are also guides specifically for RSS users and readers of Hacker News, arXiv, Reddit, and Substack. All of the docs have lots of interactive elements, which I wrote about in Building Docs Like a Product. My favorite one is in the Hacker News guide, where you can search for hidden gems that have been submitted to HN but that have not reached the front page. Thanks to Tiago Ferreira, Andrew Doran, and everyone else who gave me the feedback that they wanted to understand more about how Scour works! Scour is now a Progressive Web App (PWA). That means you can install it as an icon on your home screen and access it easily. Just open Scour on your phone and follow the instructions there. Thanks to Adam Benenson for the encouragement to finally do this! This is one of the features I have most wanted as a user of Scour myself. When you're browsing the feed, Scour now keeps track of which items you've seen and scrolled past, so it shows you new content each time you check it. If you don't want this behavior, you can disable it in the feed filter menu or change your default view to show seen posts.
If you subscribe to specific feeds, as opposed to scouring all of them, it's now easier to find the feed for an article you liked. Click the "..." menu under the post, then "Show Feeds" to show feeds where the item was found. When populating that list, Scour will now automatically search the website where the article was found to see if it has a feed that Scour wasn't already checking. This makes it easy to discover new feeds and follow websites or authors whose content you like. This was another feature I've wanted for a long time myself. Previously, when I liked an article, I'd copy the domain and try to add it to my feeds on the Feeds page. Now, Scour does that with the click of a button. Some of the most disliked and flagged articles on Scour had titles such as "The Top 10..." or "5 tricks...". Scour now automatically penalizes articles with titles like those. Because I'm explicitly trying to avoid using popularity in ranking, I need to find other ways to boost high-quality content and down-rank low-quality content. You can expect more of these types of changes in the future to increase the overall quality of what you see in your feed. Previously, posts found through Google News links would show Google News as the domain under the post. Now, Scour extracts the original link. You can now navigate your feed using just your keyboard; there's a shortcut that brings up the list of all available keyboard shortcuts. Finally, here are some of my favorite posts that I found on Scour in January. There were a lot! Happy Scouring! Have feedback for Scour? Post it on the feedback board and upvote others' suggestions to help me prioritize new features! I appreciate this minimalist approach to coding agents: Pi: The Minimal Agent Within OpenClaw, even though it didn't yet convince me to switch away from Claude Code. A long and interesting take on which software tools will survive the AI era: Software Survival 3.0. Scour uses Litestream for backup.
While this new feature isn't directly relevant, I'm excited that it's now powering Fly.io's new Sprites offering (so I expect it to be a little more actively developed): Litestream Writable VFS. This is a very cool development in embedding models: a family of models of different sizes (and, as a result, costs) whose embeddings are interoperable with one another: The Voyage 4 model family: shared embedding space with MoE architecture. A thought-provoking piece from Every about How AI Made Pricing Hard Again. TL;DR: gone are the days when SaaS businesses had practically zero marginal cost for additional users or additional usage. A nice bit of UX design history about the gas tank arrow indicator on a car, with a lesson applied to AI: The Moylan Arrow: IA Lessons for AI-Powered Experiences. Helpful context for Understanding U.S. Intervention in Venezuela. Stoolap: an interesting new embedded database. Stoolap 0.2 Released For Modern Embedded SQL Database In Rust. I keep browsing fonts and, while I decided not to use this one for Scour, I think this is a neat semi-sans-serif from an independent designer: Heliotrope.

Evan Schwartz 1 week ago

Building Docs Like a Product

Stripe is famous for having some of the best product docs, largely because they are "designed to feel like an application rather than a traditional user manual". I spent much of the last week building and writing the docs for Scour, and I am quite proud of the results. Scour is a personalized content feed, not an SDK or API, so I started by asking myself what the equivalent of working code or copyable snippets is for this type of product. The answer: interactive pieces of the product, built right into the docs themselves. The guide for Hacker News readers is one of the sections I'm most proud of. When describing Scour to people, I often start with the origin story of wanting a tool that could search for posts related to my interests from the thousands submitted to HN that never make it to the front page. Built right into the guide is a live search bar that searches posts that have been submitted to HN, but that have not been on the front page. Try it out! You might find some hidden gems. The guides for Redditors, Substack readers, and arXiv readers also have interactive elements that let you easily search for subreddits or newsletters, or subscribe to any of arXiv's categories. Logged-in users can subscribe to those feeds right from the docs. Every time I went to explain some aspect of Scour, I first asked myself if there was a way to use a working example instead. On the Interests page, I wanted to explain that the topics you add to Scour can be any free-form text you want. Every time you load the page, a snippet loads a random set of interests that people have added on Scour. You can click any of them to go to the page of content related to that topic and add that interest yourself. While explaining how Scour recommends other topics to you, I thought: what if I just included an actual topic recommendation for logged-in users? (Graphic Design was actually a Scour recommendation for me, and a good one at that!)
On various docs pages, I wanted to explain the settings that exist. Instead of linking to the settings page or describing where to find it, logged-in users can just change the settings from within the docs. For example, on the Content Filtering page, you can toggle the setting to hide paywalled content right from the docs. There are numerous live examples throughout the docs. All of those use the same components as the actual Scour website. (Scour is built with the "MASH stack", so these are all Maud components.) The section explaining that you can show the feeds where any given post was found actually includes the recent post that was found in the most different feeds. (In the docs, you actually need to click the "..." button to show the feeds underneath the post, as shown below.) While building this out, I had a number of cases where I needed to show an example of some component, but where I couldn't show a live component. For example, in the Interest Recommendations section described above, I needed a placeholder for users that aren't logged in. I started building a separate component that looked like the normal interest component... and then stopped. This felt like the type of code that would eventually diverge from the original, and I'd forget to update it. So, I went and refactored the original components so that they'd work for static examples too. The last piece of building a documentation experience that I would be happy to use was ensuring that there would be no broken links: no broken links across docs sections, and no broken links from the docs to parts of the application. Scour is built with axum, the excellent HTTP routing library in Rust. The axum-extra crate has a useful, albeit slightly tedious, typed-path trait and derive macro. This lets you define HTTP routes as structs, which can be used by the router and anywhere else you might want to link to that page.
Anywhere else we might want to link to the Interests docs, we can use the following to get the path: This way, Rust's type system enforces that any link to those docs will stay updated, even if I later move the paths around. I started working on these docs after a couple of users gave the feedback that they would love a page explaining how Scour works. There is now a detailed explanation of how Scour's ranking algorithm works , along with docs explaining everything else I could think of. Please keep the feedback coming! If you still have questions after reading through any of the docs, please let me know so I can keep improving them.

Giles's blog 1 week ago

Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)

I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone who was interested. I managed to get it done, but it was kind of tricky to get right. The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end. This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do for custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-) Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it? You could, of course, just dump the code on GitHub and share the weights somewhere.
If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on. That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library, using models that had been uploaded to their hub. What would be nice is to share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it with a few lines of pipeline code, rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config, then a Trainer call -- rather than like this, with its >100-line function. Here's what I had to do to get it working. To make it easier to follow along with this post, I've created a GitHub repo. As a starting point, I recommend you clone that, and then check out the starting tag. You'll see that there's a file which contains my version of the GPT-2 style LLM code from Sebastian Raschka's book "Build a Large Language Model (from Scratch)". There's also a script with some code to run a model and get it to predict the next 20 words after a prompt string, and a config file for the LLM code, which tells it the number of layers, attention heads, and so on. If you want to use it and see what it comes up with, you can download the model weights from one of my training runs, and install the dependencies, or run it in a Python environment with the required libraries installed.
You'll get something like this: Your output will probably vary (for this and the later examples), as you'd expect from sampled LLM output, but it should at least be reasonably coherent. So: let's get it on Hugging Face! Our goal of being able to run inference with Transformers' pipeline system relies on a couple of deeper levels of abstraction. The pipeline requires that the model be available for download -- complete with all of its code and weights -- via AutoModelForCausalLM, which is the HF abstraction for models that generate text. If the trust_remote_code=True flag is concerning you, it is indeed a bit scary-looking. But remember that our goal here is to share a model on HF that has its own code, and that means that anyone that downloads it will have to opt in to downloading and running that code -- the flag is how they do that opt-in. So it is, unfortunately, necessary. Now, that model will need a tokeniser in order to run. Perhaps not surprisingly, the HF system expects to be able to download that with similar code: With both of those working, appropriate code for our pretrained models, and a bit (well, to be fair, quite a lot) of configuration, we'll be all set. But that's quite a big jump. There is a more general class called AutoModel; it's much simpler, just wrapping a generic model that might be doing anything. If we support it, we'll still need to use all of that clunky inference code, but the model's code and weights will be on Hugging Face Hub, and can be downloaded and instantiated easily. So let's get that working first, just to work out the bugs and get the basic process down pat. Our goal is to be able to run this in a Python environment where we just have the necessary libraries installed: ...and then have a model that we can run inference on, just like the code in our repo, but without the hassle of having to download the weights ourselves. Definitely a QoL improvement, even if it's not the endgame. If you're following along with the git repo, there's a tag to check out for this section.
In this version, you'll see a new subdirectory to contain our HF wrapper code; you'll see why we need that later. In there, I've added a symlink to the model code itself (also to be explained later), an empty __init__.py file to make the directory a Python module, and two files with some Transformers code. Let's dig into what's going on in those two. The first thing to understand is the model-type naming in the filenames. Transformers is designed to handle all kinds of different models -- for example, Meta's Llama models and Qwen's models have their own codebases. These widely-used public models have code that is already built in to the library, with their own "model types" -- but we don't have that advantage. Our code is not built in to the library. So we need a distinct name for our type of model, which will let the library know that it has its own code and it shouldn't try to rely on built-in stuff. I chose a name combining my Hugging Face username, which is my initials,1 with the architecture, since this model is the implementation of the GPT-2 architecture I'm playing with. That feels like a solid pattern to me -- it's unlikely to clash with anything built in. But the format appears to be fairly free-form, so you can choose pretty much anything so long as you're consistent throughout your code, and so long as it doesn't clash with any of the built-ins. So, you need two files named after your model type: a configuration file and a modeling file. Let's look at them now. They're really simple at this stage; here's the configuration one: Now, when Transformers is loading a model with from_pretrained, it's going to need to know how to configure it. At the very least, it will need to know what to pass into the model's constructor. If you look at the code, it's taking a config dictionary with stuff like the number of layers, the number of attention heads, and what-have-you. That's going to be required to instantiate the model with the right setup so that it can load the weights that we're providing.
There's other config stuff that will come there later, but that's all we have for now. It does this using the same pattern as the various methods we were looking at earlier: All we're doing here is defining what kind of thing that method will return when it's all set up properly. You can see that we're inheriting from a class -- this provides all of the infrastructure we're going to need to push things to HF. I don't think that the name of the config class technically matters, but it definitely seems like best practice to name it based on the model name -- so, we're using for our model. However, the is important -- it has to match the model type that we've chosen and used for our filenames. Apart from that, we're stashing away the config that we're provided on a field, and then calling our superclass , forwarding on any kwargs we got in our own . Now let's look at : Just as with the config, there's for us to inherit from 2 . We're defining the thing that will return when it's all set up properly. We tell transformers that this should be configured with the that we just defined using that class variable, but apart from that, we're basically just wrapping the that is defined in 3 . That is imported using a relative import using rather than : This is important -- it has to be that way, as we'll discover later. But for now: that's why we had to create the subdirectory and the symlink to -- a relative import in Python can only happen if you're not in the "root" module, so we would not have been able to do that kind of import if the files were at the top of our repo. Now, let's take a look at the . We're calling the superclass , as you'd expect, then we're creating an underlying wrapped . We're expecting a parameter, which has the underlying model's configuration stashed away in its field by its own , so we can pass that down to the wrapped model. 
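Putting the pieces described so far together, the two files might look roughly like this. All the names here are hypothetical stand-ins for your own model type, and a plain linear layer stands in for the wrapped custom model:

```python
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel

# configuration_my_model_type.py (hypothetical model type name)
class MyModelConfig(PretrainedConfig):
    model_type = "my-model-type"  # must match the chosen model type / filenames

    def __init__(self, model_config=None, **kwargs):
        # Stash the underlying model's own config dict on a field...
        self.model_config = model_config or {}
        # ...and forward any other kwargs to the superclass __init__.
        super().__init__(**kwargs)

# modeling_my_model_type.py
class MyModel(PreTrainedModel):
    config_class = MyModelConfig

    def __init__(self, config):
        super().__init__(config)
        # The real code wraps the custom model, imported relatively from the
        # submodule; a stub stands in here so the sketch is self-contained.
        self.model = nn.Linear(4, 4)
        self.post_init()
```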
Finally, we call this special function; that does some extra configuration, and prior to Transformers 5.0.0 you could get away without calling it, but now it's 100% necessary, as otherwise it will not initialise its internal fields relating to whether or not the model uses weight tying. Now let's take a look at how we actually use those to upload the model. That's back at the root of the repo, in the file . Before looking at the code, try running it: So, it takes a model config path -- that file we have to set the number of layers and so on -- and the path of a safetensors file containing the weights. It will then try to upload our HF-friendly wrapped version of the model -- code, weights and config -- to the Hub. Let's see how it works. We do some boilerplate imports, and then import our config and our model classes -- importantly, via the submodule. Don't worry, we're getting close to the explanation of why that is :-) A bit of argument-validating boilerplate and the loading of the model config file into a dictionary so that we can use it, and now we get to the meat of it: What this is doing is telling our to register itself so that it is a thing that will be returned by the call. This only applies locally for now, but by setting things up locally we're telling the library what it will need to push up to the hub later. Next: We're doing exactly the same for our model, saying that it should be returned from . We need to be explicit about which of the various model classes we want to register it for -- the config class can only be loaded from , whereas the model might be something we'd want to have returned from , or if it was a different kind of model, perhaps , or something else entirely. What we want to do here is expose the basic model using , so that's what we do. 
We're creating our config class, passing in that model configuration that we loaded from the file earlier, so that it will stash it on its field, then: ...we create our model wrapper using that config. We now have an instance of our custom model, but with uninitialised weights. So: ...we load in the weights that were specified on the command line. Note that we have to load them into the wrapped model. The file we have is specifically for the custom that we want to publish, not for the wrapped one. But that's easily done by using the field. Finally, the magic: This is where the Transformers library really shows its strength. It will push the model, which means it needs to push the weights that we loaded into its wrapped . Then it will look at the class that defines the model, and will push the file that has the source for that class. It will see that it also has a dependency on , and will push that and its source . It will also spot the setup we did with our two calls to the different methods above to register them for the and and push that too. And when it's pushing the source, it will try to push the source of any dependencies too. This is where we get the final explanation of why we had to put it in a submodule, and have a symlink to . The code doesn't want to upload loads of extra stuff -- for example, any libraries you're using. It wants to be sure that it's only uploading your model code. The logic it uses for deciding whether or not something is part of the uploadable set of files is "was it imported relatively from the or the file" -- that is, with a dot at the start of the module name, rather than . In order to do that kind of import, we needed to create a submodule. And in order to access our file we need a copy of it inside the submodule. I didn't want to have two actual copies of the file -- too easy to let them get out of sync -- so a symlink sorts that out. Hopefully that clears up any mystery about this slightly-strange file layout. 
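The flow of the upload script can be sketched like this, with minimal stand-ins for the wrapper classes (all names are hypothetical) and the actual push left commented out, since it needs an authenticated account:

```python
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel

# Minimal stand-ins for the wrapper classes (hypothetical names).
class MyModelConfig(PretrainedConfig):
    model_type = "my-upload-model"

    def __init__(self, model_config=None, **kwargs):
        self.model_config = model_config or {}
        super().__init__(**kwargs)

class MyModel(PreTrainedModel):
    config_class = MyModelConfig

    def __init__(self, config):
        super().__init__(config)
        # The real code builds the wrapped custom model from config.model_config.
        n = config.model_config.get("n_embd", 8)
        self.model = nn.Linear(n, n)
        self.post_init()

# 1. Register locally, so the library knows what to return from the Auto*
#    calls -- and, later, what needs to go up to the Hub.
AutoConfig.register("my-upload-model", MyModelConfig)
AutoModel.register(MyModelConfig, MyModel)

# 2. Create the config from the loaded model-config dict, then the model.
config = MyModelConfig(model_config={"n_embd": 8})
model = MyModel(config)

# 3. Load the pretrained weights into the *wrapped* model -- the safetensors
#    file is keyed for the custom model, not the wrapper.
state_dict = {"weight": torch.zeros(8, 8), "bias": torch.zeros(8)}  # stand-in
model.model.load_state_dict(state_dict)

# 4. The magic (needs an authenticated HF account, so commented out here):
# model.push_to_hub("your-username/your-model")
```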
Let's give it a go and see what it creates! In order to upload a model to the HF Hub, you'll need an account, of course, so create one if you don't have one. Next, create an access token with write access -- the option is in the "Access Tokens" section of the "Settings". Then you need to authorize your local machine to access the hub using that token; if you're using , then you can just run: If you're not, you'll need to download and install the HF CLI and then run That will store stuff on your machine so that you don't need to log in again in the future -- if you're concerned about security, there's an you can call, and you can completely trash the session by deleting the associated token from the HF website. Now, let's run our upload script! You'll need to change the target HF model name at the end of the command to one with your username before the slash, of course. Once you've done that, take a look at the model on Hugging Face. You'll see a rather ugly default model card, but let's ignore that for now and take a look at the "Files and versions" tab. You should see the following files: Now, let's look into that . It will look like this: The bit is just showing the name of the class that was used in the call. This will become useful later when we get onto the pipeline code, but doesn't matter right now -- the next one is more important. The is essentially saying, if someone does on this model, then use the class from here, and likewise for should use . It's what that stuff we did in the upload script set up. The is just the parameters that we're threading down to our underlying custom class; nothing exciting there. The is, of course, the floating point type we're using for the model, and the is our unique name for this particular architecture. And the is the version of the library used to upload it, presumably used to determine compatibility when downloading models with earlier or later versions. 
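For illustration, the generated config.json ends up with roughly this shape -- every name and value below is a hypothetical example, not the real file:

```json
{
  "architectures": ["MyModel"],
  "auto_map": {
    "AutoConfig": "configuration_my_model_type.MyModelConfig",
    "AutoModel": "modeling_my_model_type.MyModel"
  },
  "model_config": {
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768
  },
  "model_type": "my-model-type",
  "torch_dtype": "float32",
  "transformers_version": "4.57.1"
}
```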
So, it looks like there's enough information across those files on the hub to instantiate and use our model! Let's give that a go. The best way to check it out thoroughly is to create a completely fresh directory, away from our existing ones, and a fresh environment: and then to try to use the model: So we can see where Transformers has put the downloaded code, inside a submodule that appears to have a GUID-like name. Now let's try to run some inference on it: So there we go! We've gone from a situation where we would have to publish the code and the safetensors in some way and tell people how to combine them, to a neatly-packaged model that we can download, fully set up, with just one line: But that inference loop is still a pig; if you've been working with LLM code then it's not too bad -- a basic bit of autoregression with top-k and temperature -- but it's definitely holding us back. What next? One obvious issue with the code above is that we still have that dependency on . If we're going to run inference using the simple HF object, it's going to need to know how to encode the input and decode the outputs. And if you have your own tokeniser (which, if you have a truly custom model, you probably do) then you won't have the luxury of being able to just install it into the target runtime env -- you would still need to copy file around. Now, as I said at the start, I'm not going to go into this in as much detail, because my use case was really simple -- although I was using , the specific tokeniser I was using from that library was the standard GPT-2 one. Transformers has its own version of that installed. So here I'll explain how you do things for models that use a built-in Transformers tokeniser. After that I'll give some pointers that you might find useful if you're using something more custom. The good news if you're using a "standard" tokeniser that is already built into the Transformers library is that you can tell your model to use it. 
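Here's a sketch of that kind of inference loop -- basic autoregression with temperature and top-k -- under the assumption that the model returns logits shaped (batch, seq, vocab). A dummy model stands in for the downloaded one so the sketch runs without the Hub; in the real thing, the model would come from AutoModel.from_pretrained with trust_remote_code=True:

```python
import torch

def generate(model, tokens, n_new, temperature=0.8, top_k=50):
    # Basic autoregressive sampling: take the logits for the last position,
    # scale by temperature, keep only the top-k candidates, then sample.
    for _ in range(n_new):
        logits = model(tokens)[:, -1, :] / temperature
        kth_best = torch.topk(logits, top_k).values[:, -1:]
        logits[logits < kth_best] = -float("inf")  # drop everything outside top-k
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

class DummyModel:
    # Stand-in with the assumed (batch, seq, vocab) logits shape.
    def __call__(self, tokens):
        return torch.randn(tokens.shape[0], tokens.shape[1], 100)

out = generate(DummyModel(), torch.zeros(1, 3, dtype=torch.long), n_new=5)
```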
The downside is that you can't do it by using the trick that we did above -- that is, you can't just import it: ...and then add this below our previous calls to register the model and config as auto classes: That will essentially do nothing. However, tokenisers do have their own method, and the target that you specify can be your model. So, for my own models, I'm using this: That is, we get the tokeniser for the built-in GPT-2 implementation (specifically the "fast" one, written in Rust), set the padding token to the end-of-sequence one for tidiness (not sure why that's not the case by default), and then push it to the model. If you're following along with the code, you can check out the tag to see that. The code goes immediately after we've pushed the model itself to the hub. So, run the upload again: And now we can do a completely fresh env without tiktoken: In there, we can see that works: (Note that I had to use here -- that appears to be new in Transformers 5.0.0.) And do our inference test: It may not be much shorter than the code we had when we just had the , but it's an important step forward: we can now download and run inference on our custom model with none of the custom code -- neither the model itself nor the tokeniser -- on the machine where we're doing it. Everything is nicely packaged on the HF Hub. Now, what if you're using a tokeniser that's not already in Transformers? There are two possibilities here: As I said, I have not done either of these, but that's the direction I'd explore if I needed it. If you do either and want to share your experiences, then please do leave a comment below! And likewise, if and when I start writing things with custom tokenisers, I'll link to the details of how to upload them then. Anyway, we've got the tokeniser done to the level we need for this walkthrough, so let's do the QoL improvements so that we can run inference on the model using the nice HF abstraction. 
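A sketch of that tokeniser step -- the helper name is mine, and the repo id is a placeholder:

```python
from transformers import GPT2TokenizerFast

def push_tokeniser_to(repo_id):
    # Grab the built-in "fast" (Rust-based) GPT-2 tokeniser...
    tokeniser = GPT2TokenizerFast.from_pretrained("gpt2")
    # ...set the padding token to the end-of-sequence one for tidiness...
    tokeniser.pad_token = tokeniser.eos_token
    # ...and push it to our model's repo on the Hub (needs authentication).
    tokeniser.push_to_hub(repo_id)

# e.g. push_tokeniser_to("your-username/your-model")  # placeholder id
```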
Let's look at our target code for inference again: The version of the code that does this is in the repo on the tag , but I'll explain how it was put in place, with the logic behind each step. In order to run a text-generation pipeline, we're going to need to wrap our model in something that provides the interface for LLMs in the Hugging Face ecosystem: . So, our first step is to put the plumbing in place so that we can use the method on that class to download our wrapped model. IMO it's cleanest to have two separate models, one for "simple" inference that is just a regular model -- the we have right now -- and one supporting the richer interface that supports easy text generation. So we can start off by adding the basic structure to : We can then add code to register that to our script -- the last line in this snippet, just below the two that already exist. That feels like it should be enough, but for reasons I've not been able to pin down, it's not -- you also need to massage the "auto-map" in the object to make it all work properly. So after that code, after we've created the object, we need this: With that in place, we could just upload our model -- would work just fine. But the model that it would return would not be any different to the one we've been using so far. To get that to work, we need to update the model to say that it can generate text. That's actually pretty easy. Firstly, we need it to inherit from a mixin class provided by Transformers: Now, the semantics of the method on this class are a bit different to the ones we had previously; we were just returning the outputs of the last layer of the underlying model, the logits. For this kind of model, we need to put them in a wrapper -- the reasoning behind this will become clearer when we get on to training. So our forward pass needs to change to look like this: Finally, some changes to our config class. For text generation, Transformers needs to know how many hidden layers the model has 4 . 
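A sketch of what the text-generation wrapper might look like, with hypothetical names, a stub in place of the wrapped model, and the generation-related config tweaks folded in (the "n_layer" parameter name is an assumption about the underlying model's config):

```python
import torch
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel
from transformers.generation import GenerationMixin
from transformers.modeling_outputs import CausalLMOutput

class MyGenConfig(PretrainedConfig):
    model_type = "my-gen-model"  # hypothetical

    def __init__(self, model_config=None, **kwargs):
        self.model_config = model_config or {}
        # Generation needs the layer count; "n_layer" is assumed to be the
        # name in the underlying model's config -- yours may differ.
        self.num_hidden_layers = self.model_config.get("n_layer", 1)
        super().__init__(**kwargs)
        # Our model keeps no KV cache; without this, each generation step
        # after the first would receive only the last generated token.
        self.use_cache = False

class MyModelForCausalLM(PreTrainedModel, GenerationMixin):
    config_class = MyGenConfig

    def __init__(self, config):
        super().__init__(config)
        # Stand-in for the wrapped model: embed then project to a 10-token vocab.
        self.embed = nn.Embedding(10, 16)
        self.head = nn.Linear(16, 10)

    def forward(self, input_ids, **kwargs):
        # Wrap the raw logits in the output type the generation machinery expects.
        logits = self.head(self.embed(input_ids))
        return CausalLMOutput(logits=logits)
```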
In the case of the model I'm using to demonstrate, that's the parameter in the underlying configuration, so this can go inside the : Another change in the config that took me a while to puzzle out, and might catch you if you're in the same situation: Transformers, by default, assumes that the model caches previous inputs. So in an autoregressive loop starting with , the first run of the model will get the full input; let's say it returns . The next iteration of the loop, however, won't be passed the full new sequence , but rather just the token that was generated last time around, . So you'll get a series of predicted tokens where the first one might make sense but the rest degenerate into gibberish: All of the tokens generated after had just the previous token as their context. Luckily, you just need to specify that your model doesn't have a cache in the config class as well, after the call to the superclass : We're almost there! At this point, we actually have all of the code that we need for a working . But there's one final tweak. A model on the hub has a "default" model type, which is the one that we use when we do the original . You might remember that it appeared in the in that single-element list keyed on . Previously we had this in our upload script: That means that our default is the model. But when the pipeline creates a model for us, it will just use the default -- even for the text-generation task, it doesn't assume we want to use the . Luckily, that's a small change: we just upload our text-generation model instead of the basic one: With all of that in place, we can run the script, upload the model, and then in a fresh environment: Lovely! Now let's get it training. For this section, check out the tag. You'll see a new file, , which has the training loop from the notebook I linked to at the start of this post. It will train the model on this dataset , which is essentially a bunch of chatbot-style transcripts in the Llama 2 format. 
Its goal is to help fine-tune a base model to become an instruction-following one, though of course the model I'm using here is too tiny for that to work well! It's still a useful way of checking that training works, though. To save time, it only does one training epoch, which should be enough to get the loss down a bit. If you run against one of my other models, you can see it working (you will need to tweak the batch size if you have less than 24 GiB of VRAM). You can see that it's at least trying to answer the question after training, even if its answer is completely wrong -- pretty much what you'd expect from the tiny model in question (163M parameters trained on about 3B tokens). In order to get it working with our custom models, we just need to return the loss as well as the logits from the method of our class: You can see that we're getting the targets for our predictions in , and an attention mask; we have to shift them ourselves (that is, if the inputs are , then the labels will be ), and also apply the attention mask manually, and then we can do the normal PyTorch cross-entropy calculation. This makes some kind of sense. The model on HF does need to package its own loss function somehow -- cross entropy is, of course, going to be the most likely option for a causal LM, but there's no guarantee. And while I think that personally I would have just had return logits and package up the loss calculation elsewhere so as not to muddy the interface, I can see the convenience of having it there. Anyway, having done that, we can upload the model one final time, and then use that training code to run it. We have a working training loop! Once again, it's replying, even if it has no idea what the answer is, and starts looping in a typical small-model fashion. And with that, we're done. We've gone from having a custom model that was hard for other people to discover and work with, to something that plays well with the Hugging Face ecosystem. 
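The shift-and-mask loss calculation described above can be sketched as a standalone helper; the function name and tensor shapes here are my assumptions, not the post's actual code:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, labels, attention_mask):
    # Token i's logits predict token i+1, so shift: drop the last logit
    # position and the first label position.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:].clone()
    shift_mask = attention_mask[:, 1:]
    # Apply the attention mask by marking padded positions with the
    # ignore_index, so they don't contribute to the loss.
    shift_labels[shift_mask == 0] = -100
    # Normal PyTorch cross-entropy over the flattened positions.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

loss = causal_lm_loss(
    torch.randn(2, 5, 10),               # (batch, seq, vocab) logits
    torch.randint(0, 10, (2, 5)),        # the input ids double as labels
    torch.ones(2, 5, dtype=torch.long),  # no padding in this toy example
)
```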
The final step is to write a decent model card so that people know what to do with it -- that, of course, depends very much on your model. I was uploading a bunch of very similar models in one go, so I wound up writing a Jinja2 template and using the class to upload it, but that's just simple plumbing code -- you can see it here if you're interested. As I said at the start, this isn't a full tutorial -- it's just the code I needed to upload my own models, so it doesn't cover tokenisers that aren't already baked into Transformers -- and there are probably other gaps too. But hopefully it's useful as-is. If you find gaps that your model needs and work out how to solve them, then please do leave comments here -- if there are useful resources out there, either things I missed or things you've written, I'd be happy to link to them from this post. Thanks for reading! I'll be returning to my normal "LLM from scratch" series shortly... It's a fun coincidence that my initials are so similar to the architecture. Someday I should do something with my domain ...  ↩ I'm not sure why the capitalisation of the "t" is different -- vs -- but it seems very deliberate in the Transformers codebase, at least as of version 4.57.6. Some kind of backward-compatibility cruft, I assume. 5.0.0 provides a alias as well, so it looks like they're making things consistent in the future.  ↩ You might reasonably suggest that we could inherit from rather than wrapping it. I've chosen to wrap it instead because I generally prefer composition to inheritance -- the code generally works out nicer, to my mind. I'd suggest starting this way and then refactoring to use inheritance if you prefer later on.  ↩ No idea why, but it does ¯\_(ツ)_/¯  ↩ -- a file telling git (which is used to manage the models on the hub) which file types should use the Large File Storage (LFS) plugin. Big binary files don't play nicely with git, so it uses LFS for them. We don't need to pay much more attention to that for our purposes. 
-- that ugly model card. Updating that is useful, but out of scope for this post. . We'll come back to that one in a moment. -- a copy of the file we created locally with our class. -- again, the same file as the local one, uploaded due to that clever dependency-finding stuff. -- our weights. There should be an icon next to it to say that it's stored using the LFS system. -- once more, a file that was just copied up from our local filesystem. You're using the HF library. With that, you can save your tokeniser to a JSON file, then you could load that into a object, which provides a method to push it like I did with the one above. You've got something completely custom. Just like there is a and a , I believe you can also add a that defines a subclass of , and then you can push that to the Hub just like we did our model wrapper class. Working , , , and helpers. A working text-generation . Support for HF's abstraction for follow-on training and fine-tuning.

Langur Monkey 2 weeks ago

Yet another architectural update for Play Kid

I finished my previous post on Play Kid, only two days ago, with the following words: Next, I’ll probably think about adding Game Boy Color support, but not before taking some time off from this project. Yeah, this was a lie. I have previously written about Play Kid, my Game Boy emulator. Here , I introduced it and talked about the base implementation and the quirks of the Game Boy CPU, PPU, and hardware in general. Here , I explained the tech stack update from SDL2 to Rust-native libraries. In that last post, I mentioned the dependency version hell I unwillingly descended into as a result of adopting as my rendering library. This forced me to stay on very old versions of and . I want my Game Boy emulator to use the latest crate versions for various reasons, so this could not be. I saw a simple and direct upgrade path, which consisted of adopting to manage the application life cycle, directly drawing the LCD to a texture, and dropping altogether. One of the things that nagged me about the crate is that its integration with was kind of lackluster. There was no easy way to render the frame buffer to an panel, so I had to render it centered inside the window. The immediate-mode GUI lived mostly in widgets on top of it. Unfortunately, these windows occluded the Game Boy LCD. In a proper debugger, you must be able to see the entire LCD plus the debug interface. So I dropped and adopted a render-to-texture approach. In it, you create the LCD texture at the beginning from the context, and then copy the LCD contents to it in the method. With this, we can easily render the texture in ’s , and the debug interface in a to the right. This is the result: The debug panel, showing the machine state and a code disassembly. Some additional tweaks here and there, and the UI looks much more polished and professional in version 0.3.0. Version 0.4.0 enables loading ROM files from the UI. 
Initially I thought about making the cartridge struct optional with , but this spiraled out of control fast. I found that making the full (which contains the , , , , etc.) optional worked much better, as there was only one reference to it in the top-level struct, . And, like so, you can dynamically load ROM files from the UI: Play Kid with the top menu bar and the ‘Open ROM’ menu entry. So, what’s in the future for Play Kid? Well, there are a couple of features that I’d really like to add at some point: Save states —Currently, Play Kid emulates the SRAM by saving and restoring it from files for supported games. I would like to add saving and restoring the full state of the emulator in what is known as save states. Possibly, the crate can help with this. GBC —Of course, I would like to add Game Boy Color support. It is not trivial, but also not exceedingly complicated. I never owned a GBC, so I’d see this as a good opportunity to explore its game catalog.

Steve Klabnik 2 weeks ago

The most important thing when working with LLMs

Okay, so you’ve got the basics of working with Claude going. But you’ve probably run into some problems: Claude doesn’t do what you want it to do, it gets confused about what’s happening and goes off the rails, all sorts of things can go wrong. Let’s talk about how to improve upon that. The most important thing that you can do when working with an LLM is give it a way to quickly evaluate if it’s doing the right thing, and if it isn’t, point it in the right direction. This is incredibly simple, yet, like many simple things, also wildly complex. But if you can keep this idea in mind, you’ll be well equipped to become effective when working with agents. A long time ago, I used to teach programming classes. Many of these were to adults, but some of them were to children. Teenaged children, but children nonetheless. We used to do an exercise to try and help them understand the difference between talking in English and talking in Ruby, or JavaScript, or whatever kind of programming language rather than human language. The exercise went like this: I would have a jar of peanut butter, a jar of jelly, a loaf of bread, a spoon, and a knife. I would ask the class to take a piece of paper and write down a series of steps to make a peanut butter and jelly sandwich. They’d all then give me their algorithms, and the fun part for me began: find one that’s innocently written that I could hilariously misinterpret. For example, I might find one like: I’d read this aloud to the class: “you all understand this is a recipe for a peanut butter and jelly sandwich, right?” I’d take the jar of peanut butter and place it upon the unopened bag of bread. I’d do the same with the jar of jelly. This would, of course, squish the bread, which feels slightly transgressive given that you’re messing up the bread, so the kids would love that. 
I’d then say something like “the bread is already together, I do not understand this instruction.” After the inevitable laughter died down, I’d make my point: the computer will do exactly what you say, but not what you mean. So you have to get good at figuring out when you said something different than what you mean. Sort of ironically, LLMs are kind of the inverse of this: they’ll sometimes try to figure out what you mean, and then do that, rather than simply doing what you say. But the core thing here is the same: semantic drift between what we intended our program to do and what it actually does. The second lesson is something I came up with at some point; I don’t even remember exactly how. But it’s something I told my students a lot. And that’s this: If your program did everything you wanted without problems, you wouldn’t be programming: you’d be using your program. The act of programming is to be perpetually in a state where something is either inadequate or broken, and the job is to fix that. I think this is a bit simplistic too, but it’s also getting at something. I had originally come up with this in the context of trying to explain how you need to manage your frustration when programming; if you get easily upset by something not working, doing computer programming might not be for you. But I do think these two things combine into something that gets to the heart of what we do: we need to understand what it is we want our software to do, and then make it do that. Sometimes, our software doesn’t do something yet. Sometimes, it does something, but incorrectly. Both of these cases result in a divergence from the program’s intended behavior. So, how do we know if our program does what it should do? Well, what we’ve been doing so far is: This is our little mini software development lifecycle, or “SDLC.” This process works, but is slow. That’s great for getting the feel of things, but programmers are process optimizers by trade. 
One of my favorite tools for optimization is called Amdahl’s law . The core idea is this, formulated in my own words: If you have a process that takes multiple steps, and you want to speed it up, if you optimize only one step, the maximum amount of speedup you’ll get is determined by the portion of the process that step takes. In other words, imagine we have a three step process: This process takes a total of 13 minutes to complete. If we speed up step 3 by double, it goes from two minutes to one minute, and now our process takes 12 minutes. However, if we were able to speed up step 2 by double, we’d cut off five minutes, and our process would now take 8 minutes. We can use this style of analysis to guide our thinking in many ways, but the most common way, for me, is to decide where to put my effort. Given the process above, I’m going to look at step 2 first to try and figure out how to make it faster. That doesn’t mean we can achieve the 2x speedup, but heck, if we get a 10% decrease in time, that’s the same time as if we did get a 2x on step 3. So it’s at least the place where we should start. I chose the above because, well, I think it properly models the proportion of time we’re taking when doing things with LLMs: we spend some time asking it to do something, and we spend a bit more time reviewing its output. But we spend a lot of time clicking “accept edit,” and a lot of time allowing Claude to execute tools. This will be our next step forward, as this will increase our velocity when working with the tools significantly. However, like with many optimization tasks, this is easier said than done. The actual mechanics of improving the speed of this step are simple at first: hit to auto-accept edits, and “Yes, and don’t ask again for commands” when you think the is safe for Claude to run. By doing this, once you have enough commands allowed, your input for step 2 of our development loop can drop to zero. 
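That arithmetic can be checked in a couple of lines; the 1-, 10- and 2-minute step split is inferred from the totals in the text:

```python
def new_total(total, step, factor):
    # Amdahl-style bound: only the optimised step shrinks; the rest of the
    # process is untouched.
    return total - step + step / factor

# The three-step, 13-minute process from the text (steps of 1, 10 and 2 minutes):
print(new_total(13, 2, 2))   # halving step 3: 12.0 minutes, one minute saved
print(new_total(13, 10, 2))  # halving step 2: 8.0 minutes, five minutes saved
```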
Of course, it takes time for Claude to actually implement what you’ve asked, so it’s not like our 13 minute process drops to three, but still, this is a major efficiency step. But we were actively monitoring Claude for a reason. Claude will sometimes do incorrect things, and we need to correct it. At some point, Claude will say “Hey I’ve finished doing what you asked of me!” and it doesn’t matter how fast it does step 2 if we get to step 3 and it’s just incorrect, and we need to throw everything out and try again. So, how do we get Claude to guide itself in the right direction? A useful technique for figuring out what you should do is to consider the ending: where do we want to go? That will inform what we need to do to get there. Well, the ending of step 2 is knowing when to transition to step 3. And that transition is gated by “does the software do what it is supposed to do?” That’s a huge question! But in practice, we can do what we always do: start simple, and iterate from there. Right now, the transition from step 2 to step 3 is left up to Claude. Claude will use its own judgement to decide when it thinks that the software is working. And it’ll be right. But why leave that up to chance? I expect that some of you are thinking that maybe I’m belaboring this point. “Why not just skip to ? That’s the idea, right? We need tests.” Well on some level: yes. But on another level, no. I’m trying to teach you how to think here, not give you the answer. Because it might be broader than just “run the tests.” Maybe you are working on a project where the tests aren’t very good yet. Maybe you’re working on a behavior that’s hard to automatically test. Maybe the test suite takes a very long time, and so isn’t appropriate to be running over and over and over. Remember our plan from the last post? 
Where Claude finished the plan with this: These aren’t “tests” in the traditional sense of a test suite, but they are objective measures that Claude can invoke itself to understand if it’s finished the task. Claude could run after every file edit if it wanted to, and as soon as it sees , it knows that it’s finished. You don’t need a comprehensive test suite. You just need some sort of way for Claude to detect if it’s done in some sort of objective fashion. Of course, we can do better. While giving Claude a way to know if it’s done working is important, there’s a second thing we need to pay attention to: when Claude isn’t done working, can we guide it towards doing the right thing, rather than the wrong thing? For example, those of you who are of a similar vintage as myself may remember the output of early compilers. It was often… not very helpful. Imagine that we told Claude that it should run to know if things are working, and the only output from it was the exit code: 0 if we succeeded, 1 if we failed. That would accomplish our objective of letting Claude know when things are done, but it wouldn’t help Claude know what went wrong when it returns 1. This is one reason why I think Rust works well with LLMs. Take this incorrect Rust program: The Rust compiler won’t just say “yeah this program is incorrect”; it’ll give you this (as of Rust 1.93.0): The compiler will point out the exact place in the code where there’s an issue, and even make suggestions as to how to fix it. This goes beyond simply saying “it doesn’t work” and instead nudges you to what might fix the problem. Of course, this isn’t perfect, but if it’s helpful more than not, that’s a win. Of course, too much verbosity isn’t helpful either. A lot of tooling has gotten much more verbose lately. Oftentimes, this is really nice as a human. Pleasant terminal output is, well… pleasant. But that doesn’t mean that it’s always good or useful. 
For example, here’s the default output for : This is not bad output. It’s nice. But it’s also not useful for an LLM. We don’t need to read all of the tests that are passing, we really just want to see some sort of minimal output, and then what failed if something failed. In Cargo’s case, that’s for “quiet”: There is no point in giving a ton of verbose input to an LLM that it isn’t even going to need to use. If you’re feeding a tool’s output to an LLM, you should consider both what the tool does in the failure case, but also the success case. Maybe configure things to be a bit simpler for Claude. You’ll save some tokens and get better results. All of this has various implications for all sorts of things. For example, types are a great way to get quick feedback on what you’re doing. A comprehensive test suite that completes quickly is useful for giving feedback to the LLM. But that also doesn’t inherently mean that types must be better or that you need to be doing TDD; whatever gives you that underlying principle of “objective feedback for the success case and guidance for the failure case” will be golden, no matter what tech stack you use. This brings me to something that may be counter-intuitive, but I think is also true, and worth keeping in the back of your mind: what’s good for Claude is also probably good for humans working on your system. A good test suite was considered golden before LLMs. That it’s great for them is just a nice coincidence. At the end of the day, Claude is not a person, but it tackles programming problems in a similar fashion to how we do: take in the problem, attempt a solution, run the compiler/linter/tests, and then see what feedback it gets, then iterate. That core loop is the same, even if humans can exercise better judgement and have more skill. And so even though I pitched fancy terminal output as an example of how humans and LLMs need different things, that’s really just a superficial kind of thing. 
Good error messages are still critical for both. We’re just better at having terminal spinners not take up space in our heads while we’re solving a problem, and can appreciate the aesthetics in a way that Claude does not. Incidentally, this is one of the things that makes me hopeful about the future of software development under agentic influence. Engineers always complain that management doesn’t give us time to do refactorings, to improve the test suite, to clean our code. Part of the reason for this is that we often didn’t do a good job of pitching how it would actually help accomplish business goals. But even if you’re on the fence about AI, and upset that management is all about AI: explain to management that this stuff is a force multiplier for your agents. Use the time you’ve saved by doing things the agentic way towards improving your test suite, or your documentation, or whatever else. I think there’s a chance that all of this stuff leads to higher quality codebases than ones filled with slop. But it also requires us to make the decisions that will lead us in that direction. That’s what I have for you today: consider how you can help Claude evaluate its own work. Give it explicit success criteria, and make evaluating those criteria as simple and objective as possible. In the next post, we’re gonna finally talk about . Can you believe that I’ve talked this much about how to use Claude and we haven’t talked about ? There’s good reason for that, as it turns out. We’re going to talk a bit more about understanding how interacting with LLMs works, and how it can help us both improve step 1 in our process, but also continue to make step 2 better and better. Here’s my post about this post on BlueSky: Steve Klabnik @steveklabnik.com · Jan 22 Replying to Steve Klabnik Agentic development basics: steveklabnik.com/writing/agen... 
Agentic development basics — blog post by Steve Klabnik (steveklabnik.com). The most important thing when working with LLMs — steveklabnik.com/writing/the-...

- Put the peanut butter on the bread
- Put the jelly on the bread
- Put the bread together

- Asking the LLM to do something by typing up what we want it to do
- Closely observing its behavior and course correcting it when it goes off of the rails
- Eventually, after it says that it’s finished, reviewing its output

- Ten minutes
- Two minutes

Langur Monkey 2 weeks ago

Game Boy emulator tech stack update

In my previous post, I shared the journey of building Play Kid, my Game Boy emulator. At the time, I was using SDL2 to handle the “heavy lifting” of graphics, audio, and input. This was released as v0.1.0. It worked, and it worked well, but it always felt a bit like a “guest” in the Rust ecosystem. SDL2 is a C library at heart, and while the Rust wrappers are good, they bring along some baggage like shared library dependencies and difficult integration with Rust-native UI frameworks. So I decided to perform a heart transplant on Play Kid. For version v0.2.0 I’ve moved away from SDL2 entirely, replacing it with a stack of modern, native Rust libraries: , , , , , and : The most visible change is the new Debug Panel. The new integrated debugger features a real-time disassembly view and breakpoint management. One of the coolest additions is the Code disassembly panel. It decodes the ROM instructions in real-time, highlighting the current and allowing me to toggle breakpoints just by clicking on a line. The breakpoints themselves are now managed in a dedicated list, shown in red at the bottom. The rest of the debug panel shows what we already had: the state of the CPU, the PPU, and the joypad. Of course, no modern Rust migration is complete without a descent into dependency hell. This new stack comes with a major catch: is a bit of a picky gatekeeper. Its latest version is 0.15 (January 2025). It is pinned to an older version of (0.19 vs the current 28.0), and it essentially freezes the rest of the project in a time capsule. To keep the types compatible, I’m forced to stay on 0.26 (current is 0.33) and 0.29 (current is 0.30), even though the rest of the ecosystem has moved on to much newer, shinier versions. It’s kind of frustrating. You get the convenience of the buffer, but you pay for it by being locked out of the latest API improvements and features. Navigating these version constraints felt like solving a hostage negotiation between crate maintainers. 
Not very fun. Despite the dependency issues, I think the project is now in a much better place. The code is cleaner, the debugger is much better, and it’s easier to ship binaries for Linux, Windows, and macOS via GitHub Actions. If you’re interested in seeing the new architecture or trying out the new debugger, the code is updated on Codeberg and GitHub. Next, I’ll probably think about adding Game Boy Color support, but not before taking some time off from this project.

- & : These handle the windowing and the actual Game Boy frame buffer. allows me to treat the 160x144 LCD as a simple pixel buffer while handles the hardware-accelerated scaling and aspect ratio correction behind the scenes.
- : This was a big step-up. Instead of my minimal homegrown UI library from the SDL2 version, I now have access to a full-featured, immediate-mode GUI. This allowed me to build the debugger I had in mind from the beginning.
- & : These replaced SDL2’s audio and controller handling with pure-Rust alternatives that feel much more ergonomic to use alongside the rest of the machine.

ava's blog 2 weeks ago

my theme for 2026

Last year, I made a post called "My theme for 2025". Inchwyrm's post about the year of the wizard reminded me I should do another one for this year. My 2025 theme was 'learning'. I think I have managed that pretty well, even if it wasn't exactly the things I mentioned in the post. I just cannot make time for The Odin Project and Rust, and to make little games; I have to prioritize my studies, my volunteer work, staying up to date on data protection law and writing about it. Maybe one day :) The rest fits though: I passed everything I enrolled in last year, and finished the certification process in just 6 months. I started summarizing and translating cases for noyb.eu, and I was creative with my notebook and some pixel art. I learned a lot and I tried new things. My theme for this year is 'rejection'. I'm collecting them! A little while ago, the concept of collecting 1000 no's was picked up by some blogs, and it helped me view rejection, criticism and other feedback in a more positive light. I want to grow, I want to try new things, and I want to become (positively) hardened by challenge. It feels uncomfortable and a part of me doesn't want to, but in a way, I also want to be humbled in a constructive way. This year, I will send out a lot of applications for both new work and a new apartment. That will undoubtedly result in a lot of no's; the market for both is just incredibly tough right now, and there always seems to be someone better. I have already received one rejection this year just a week after I sent out the application for something I thought for sure I'd at least get an interview from, so there's that. Other on-going things that produce rejections:

- I have submitted an idea to my workplace's idea management team and they are notorious for shooting down anything, but at least I tried.
- I'm sending out e-mails for a blog project I wanna do, and I have received no answer so far from the places/people I've messaged. I'll have to rethink my approach and then keep trying.
- Doing things that are a little embarrassing, like my post making known I am looking for work.

You can also help with something rejection-adjacent: This is your opportunity to give me constructive criticism on what has always bothered you about my blog's theme, writing, or my behavior. I want the pressure and polish to result in a version of me that is better. That's what I need right now. I have relied long enough on mostly gut feelings, learning by myself and my own assessments of myself, and always thought I had to do it all alone; but I need outside feedback now, especially from people who want to see me grow and do better. I want to know how I can improve.

Reply via email · Published 23 Jan, 2026

Anton Zhiyanov 2 weeks ago

Interfaces and traits in C

Everyone likes interfaces in Go and traits in Rust. Polymorphism without class-based hierarchies or inheritance seems to be the sweet spot. What if we try to implement this in C? Interfaces in Go  • Traits in Rust  • Toy example  • Interface definition  • Interface data  • Method table  • Method table in implementor  • Type assertions  • Final thoughts An interface in Go is a convenient way to define a contract for some useful behavior. Take, for example, the honored : Anything that can read data into a byte slice provided by the caller is a . Quite handy, because the code doesn't need to care where the data comes from — whether it's memory, the file system, or the network. All that matters is that it can read the data into a slice: We can provide any kind of reader: Go's interfaces are structural, which is similar to duck typing. A type doesn't need to explicitly state that it implements ; it just needs to have a method: The Go compiler and runtime take care of the rest: A trait in Rust is also a way to define a contract for certain behavior. Here's the trait: Unlike in Go, a type must explicitly state that it implements a trait: The Rust compiler takes care of the rest: Either way, whether it's Go or Rust, the caller only cares about the contract (defined as an interface or trait), not the specific implementation. Let's make an even simpler version of — one without any error handling (Go): Usage example: Let's see how we can do this in C! The main building blocks in C are structs and functions, so let's use them. Our will be a struct with a single field called . This field will be a pointer to a function with the right signature: To make fully dynamic, let's turn it into a struct with a function pointer (I know, I know — just bear with me): Here's the "method" implementation: The is pretty obvious: And, finally, the function: See how easy it is to turn a into a : all we need is . Pretty cool, right? Not really. 
Actually, this implementation is seriously flawed in almost every way (except for the definition).

- Memory overhead. Each instance has its own function pointers (8 bytes per function on a 64-bit system) as "methods", which isn't practical even if there are only a few of them. Regular objects should store data, not functions.
- Layout dependency. Converting from to like only works if both structures have the same field as their first member. If we try to implement another interface: Everything will fall apart: and have different layouts, so type conversion in ⓧ is invalid and causes undefined behavior.
- Lack of type safety. Using a as the receiver in means the caller can pass any type, and the compiler won't even show a warning: C isn't a particularly type-safe language, but this is just too much.

Let's try something else. A better way is to store a reference to the actual object in the interface: We could have the method in the interface take a instead of a , but that would make the implementation more complicated without any real benefits. So, I'll keep it as . Then will only have its own fields: We can make the method type-safe: To make this work, we add a method that returns the instance wrapped in a interface: The and functions remain quite simple: This approach is much better than the previous one: Since our type now knows about the interface (through the method), our implementation is more like a basic version of a Rust trait than a true Go interface. For simplicity, I'll keep using the term "interface". There is one downside, though: each instance has its own function pointer for every interface method. Since only has one method, this isn't an issue. But if an interface has a dozen methods and the program uses a lot of these interface instances, it can become a problem. Let's fix this. Let's extract interface methods into a separate structure — the method table. 
The interface references its methods through the field: and don't change at all: The method initializes the static method table and assigns it to the interface instance: The only difference in is that it calls the method on the interface indirectly using the method table ( instead of ): stays the same: Now the instance always has a single pointer field for its methods. So even for large interfaces, it only uses 16 bytes ( + fields). This approach also keeps all the benefits from the previous version: We can even add a separate helper so the client doesn't have to worry about implementation details: There's another approach I've seen out there. I don't like it, but it's still worth mentioning for completeness. Instead of embedding the method table in the interface, we can place it in the implementation ( ): We initialize the method table in the constructor: now takes a pointer: And converts to with a simple type cast: This keeps pretty lightweight, only adding one extra field. But the cast only works because is the first field in . If we try to implement a second interface, things will break — just like in the very first solution. I think the "method table in the interface" approach is much better. Go has an function that copies data from a source (a reader) to a destination (a writer): There's an interesting comment in its documentation: If implements , the copy is implemented by calling . Otherwise, if implements , the copy is implemented by calling . Here's what the function looks like: is a type assertion that checks if the reader is not just a , but also implements the interface. The Go runtime handles these kinds of dynamic type checks. Can we do something like this in C? I'd prefer not to make it fully dynamic, since trying to recreate parts of the Go runtime in C probably isn't a good idea. What we can do is add an optional method to the interface: Then we can easily check if a given is also a : Still, this feels a bit like a hack. 
I'd rather avoid using type assertions unless it's really necessary. Interfaces (traits, really) in C are possible, but they're not as simple or elegant as in Go or Rust. The method table approach we discussed is a good starting point. It's memory-efficient, as type-safe as possible given C's limitations, and supports polymorphic behavior. Here's the full source code if you are interested:

- The struct is lean and doesn't have any interface-related fields.
- The method takes a instead of a .
- The cast from to is handled inside the method.
- We can implement multiple interfaces if needed.

- Lightweight structure.
- Easy conversion from to .
- Supports multiple interfaces.

Steve Klabnik 2 weeks ago

Agentic development basics

In my last post, I suggested that you should start using Claude in your software development process via read-only means at first. The idea is just to get used to interacting with the AI, seeing what it can do, and seeing what it struggles with. Once you’ve got a handle on that part, it’s time to graduate to writing code. However, I’m going to warn you about this post: I hope that by the end of it, you’re a little frustrated. This is because I don’t think it’s productive to skip to the tools and techniques that experienced users use yet. We have to walk before we run. And more importantly, we have to understand how and why we run. That is, I hope that this step will let you start producing code with Claude, but it will also show you some of the initial pitfalls when doing so, in order to motivate the techniques you’re going to learn about in part 3. So with that in mind, let’s begin. Okay I lied. Before we actually begin: you are using version control, right? If not, you may want to go learn a bit about it. Version control, like git (or my beloved jj) is pretty critical for software development, but it’s in my opinion even more critical for this sort of development. You really want to be able to restore to previous versions of the code, branch off and try things, and recover from mistakes. If you already use version control systems religiously, you might use this as an excuse to learn even more features of them. I never bothered with s in the past, but I use s with agents all the time now. Okay, here’s my first bit of advice: commit yourself to not writing code any more. I don’t mean forever, I don’t mean all the time, I mean, while you’re trying to learn agentic software development, on the project that you’re learning it with, just don’t write any code manually. This might be a bit controversial! However, I think it is essential. Let me tell you a short story. Many years ago, I was in Brazil. I wanted to try scuba diving for the first time. 
Seemed like a good opportunity. Now, I don’t remember the exact setup, but I do remember the hardest part for me. Our instructor told us to put the mask on, and then lean forward and put our faces in the water and breathe through the regulator. I simply could not do it. I got too in my head, it was like those “you are now breathing manually” memes. I forget if it was my idea or the instructor’s idea, but what happened in practice: I just jumped in. My brain very quickly went from “but how do I do this properly” to “oh God, if you don’t figure this out right the fuck now you’re gonna fuckin die idiot” and that’s exactly what I needed to do it. A few seconds later, I was breathing just fine. I just needed the shock to my system, I needed to commit. And I figured it out. Now, I’m turning 40 and this happened long ago, and it’s reached more of a personal myth status in my head, so maybe I got the details wrong, maybe this is horribly irresponsible and someone who knows about diving can tell me if that experience was wrong, but the point has always stuck with me: sometimes, you just gotta go for it. This dovetails into another part I snuck into there: “on the project that you’re learning it with.” I really think that you should start a new project for this endeavor. There are a few reasons for this: Pick something to get started with, and create a new repo. I suggest something you’ve implemented before, or maybe something you know how to do but have never bothered to take the time to actually build. A small CLI tool might be a good idea. Doesn’t super matter what it is. For the purposes of this example, I’m going to build a task tracking CLI application. Because there aren’t enough of those in the world. I recommend making a new fresh directory, and initializing the project of your choice. I’m using Rust, of course, so I’ll . You can make Claude do it, but I don’t think that starting from an initialized project is a bad idea either. 
I’m more likely to go the route if I know I’m building something small, and more of the “make Claude do it” route if I’m doing like, a web app with a frontend, backend, and several services. Anyway point is: get your project to exist, and then just run Claude Code. I guess you should install it first , but anyway, you can just to get started. At the time of writing, Claude will ask you if you trust the files in this folder, you want to say yes, because you just created it. You’ll also get some screens asking if you want to do light or dark mode, to log in with your Anthropic account, stuff like that. And then you’ll be in. Claude will ask you to run to create a CLAUDE.md, but we’re not gonna do that at the start. We need to talk about even more basic things than that first. You’ll be at a prompt, and it’ll even suggest something for you to get started with. Mine says “Try “fix typecheck errors"" at the moment. We’re gonna try to get Claude to modify a few lines of our program. The in Rust produces this program: so I’ll ask Claude this: Hi claude! right now my program prints “Hello, world”, can we have it print “Goodbye, world” instead? Claude does this for me: And then asks this: Claude wants to edit a file, and so by default, it has to ask us permission to do so. This is a terrible place to end up, but a great place to get started! We want to use these prompts at first to understand what Claude is doing, and sort of “code review” it as we go. More on that in a bit. Anyway, this is why you should answer “Yes” to this question, and not “Yes, allow all edits during this session.” We want to keep reviewing the code for now. You want to be paying close attention to what Claude is doing, so you can build up some intuition about it. Before clicking yes, I want to talk about what Claude did in my case here. 
Note that my prompt said right now my program prints “Hello, world”, can we have it print “Goodbye, world” Astute observers will have noticed that it actually says and not . We also asked it to have it say and it is showing a diff that will make it say . This is a tiny difference, but it is also important to understand: Claude is going to try and figure out what you mean and do that. This is both the source of these tools’ power and also the very thing that makes it hard to trust them. In this case, Claude was right, I didn’t type the exact string when describing the current behavior, and I didn’t mean to remove the . In the previous post, I said that you shouldn’t be mean to Claude. I think it makes the LLM perform worse. So now it’s time to talk about your own reaction to the above: did you go “yeah Claude fucked up it didn’t do exactly what I asked?” or did you go “yeah Claude did exactly what I asked?” I think it’s important to try and let go of preconceived notions here, especially if your reaction here was negative. I know this is kind of woo, just like “be nice to Claude,” but you have to approach this as “this is a technology that works a little differently than I’m used to, and that’s why I’m learning how to meet it on its own terms” rather than “it didn’t work the way I expected it to, so it is wrong.” A non-woo way of putting it is this: the right way to approach “it didn’t work” in this context is “that’s a skill issue on my part, and I’m here to sharpen my skills.” Yes, there are limits to this technology and it’s not perfect. That’s not the point. You’re not doing that kind of work right now, you’re doing . Now, I should also say that like, if you don’t want to learn a new tool? 100% okay with me. Learned some things about a tool, and didn’t like it? Sure! Some of you won’t like agentic development. That’s okay. No worries, thanks for reading, have a nice day. I mean that honestly. 
But for those folks who do want to learn this, I’m trying to communicate that I think you’ll have a better time learning it if you try to get into the headspace of “how do I get the results I want” rather than getting upset and giving up when it doesn’t work out. Okay, with that out of the way, if you asked a small enough question, Claude probably did the right thing. Let’s accept. This might be a good time to commit & save your progress. You can use to put Claude Code into “bash mode” and run commands, so I just and I’m good. You can also use another terminal, I just figured I’d let you know. It’s good for short commands. Let’s try something bigger. To do that, we’re gonna invoke something called “plan mode.” Claude Code has three (shhhh, we don’t talk about the fourth yet) modes. The first one is the “ask to accept edits” mode. But if you hit , you’ll see at the bottom left. We don’t want to automatically accept edits. Hit again, and you’ll see this: This is what we want. Plan mode. Plan mode is useful any time you’re doing work that’s on the larger side, or just when you want to think through something before you begin. In plan mode, Claude cannot modify your files until you accept the plan. With plan mode, you talk to Claude about what you want to do, and you collaborate on a plan together. A nice thing about it is that you can communicate the things you are sure of, and also ask about the things you’re not sure of. So let’s kick off some sort of plan to build the most baby parts of our app. In my case, I’m prompting it with this: hi claude! i want this application to grow into a task tracking app. right now, it’s just hello world. I’d like to set up some command line argument parsing infrastructure, with a command that prints the version. can we talk about that? Yes, I almost always type , feel free to not. And I always feel like a “can we talk about that” on the end is nice too. I try to talk to Claude the way I’d talk to a co-worker. 
Obviously this would be too minor of a thing to bother to talk to a co-worker about, but like, you know, baby steps. Note that what I’m asking is basically a slightly more complex “hello world”, just getting some argument parsing up. You want something this sized: you know how it should be done, it should be pretty simple, but it’s not a fancy command. With plan mode, Claude will end up responding to this by taking a look at your project, considering what you might need, and then coming up with a plan. Here’s the first part of Claude’s answer to me: It’ll then come up with a plan, and it usually writes it out to a file somewhere: You can go read the file if you want to, but it’s not needed. Claude is going to eventually present the plan to you directly, and you’ll review it before moving on. Claude will also probably ask you a question or maybe even a couple, depending. There’s a neat little TUI for responding to its questions, it can even handle multiple questions at once: For those of you that don’t write Rust, this is a pretty good response! Clap is the default choice in the ecosystem, arg is a decent option too, and doing it yourself is always possible. I’m going to choose clap, it’s great. If you’re not sure about the question, you can arrow down to “Chat about this” and discuss it more. Here’s why you don’t need to read the file: Claude will pitch you on its plan: This is pretty good! Now, if you like this plan, you can select #3. Remember, we’re not auto accepting just yet! Don’t worry about the difference between 1 and 2 now, we’ll talk about it someday. But, I actually want Claude to tweak this plan, I wouldn’t run , I would do . So I’m going to go down to four and type literally that to Claude: I wouldn’t run , I would do . And Claude replies: and then presents the menu again. See, this is where the leverage of “Claude figures out what I mean” can be helpful: I only told it about , but it also changed to as well. 
However, there’s a drawback too: we didn’t tell Claude we wanted to have help output! However, this is also a positive: Claude considered help output to be so basic that it’s suggesting it for our plan. It’s up to you to decide if this is an overreach on Claude’s part. In my case, I’m okay with it because it’s so nearly automatic with Clap and it’s something I certainly want in my tool, so I’m going to accept this plan. Iterate with Claude until you’re happy with the plan, and then do the same. I’m not going to paste all of the diffs here, but for me, Claude then went and did the plan: it added the dependency to , it added the needed code to , it ran to try and do the build. Oh yeah, here’s a menu we haven’t seen yet: This is a more interesting question than the auto-edit thing. Claude won’t run commands without you signing off on them. If you’re okay with letting Claude run this command every time without asking, you can choose 2, and if you want to confirm every time, you can type 1. Completely up to you. Now it ran , and . Everything looked good, so I see: And that’s it, we’ve built our first feature! Yeah, it’s pretty small, and in this case, we probably could have copy/pasted the documentation, but again, that’s not the point right now: we’re just trying to take very small steps forward to get used to working with the tool. We want it to be something that’s quick for us to verify. We are spending more time here than we would if we did it by hand, because that time isn’t wasted: it’s time learning. As we ramp up the complexity of what we can accomplish, we’ll start seeing speed gains. But we’re deliberately going slow and doing little right now. From here, that’s exactly what I’d like you to do: figure out where the limits are. Try something slightly larger, slightly harder. Try to do stuff with just a prompt, and then throw that commit away and try it again with plan mode. 
See what stuff you need plan mode for, and what you can get away with using just a simple prompt. I’ll leave you with an example of a prompt I might try next: asking Claude for their opinion on what you should do. The next thing I’d do in this project is to switch back into planning mode and ask this: what do you think is the best architecture for this tracking app? we haven’t discussed any real features, design, or requirements. this is mostly a little sample application to practice with, and so we might not need a real database, we might get away with a simple file or files to track todos. but maybe something like sqlite would still be appropriate. if we wanted to implement a next step for this app, what do you think it should be? Here’s what Claude suggested: This plan is pretty solid. But again, it’s a demo app. The important part is, you can always throw this away if you don’t like it. So try some things. Give Claude some opinions and see how they react. Try small features, try something larger. Play around. But push it until Claude fails. At some point, you will run into problems. Maybe you already have! What you should do depends on the failure mode. The first failure you’re gonna run into is “I don’t like the code!” Maybe Claude just does a bad job. You have two options: the first is to just tell Claude to fix it. Claude made a mess, Claude can clean it up. In more extreme cases, you may want to just simply or and start again. Honestly, the second approach is better in a lot of cases, but I’m not going to always recommend it just yet. The reason is that it gives you some time to reflect on why Claude did a bad job. But we’re gonna talk about that as the next post in this series! So for now, stick to the ‘worse’ option: just tell Claude to fix problems you find. The second kind of failure is one where Claude just really struggles to get things right. 
It looks in the wrong places in your codebase, it tries to find a bug and can’t figure it out, it misreads output and that leads it astray, etc. This kind of failure is harder to fix with the tools you have available to you right now. What matters is taking note of them, so that you can email them to me, haha. I mean, feel free to do that, and I can incorporate specific suggestions into the next posts, but also, just being able to reflect on what Claude struggles with is a good idea generally. You’ll be able to fix them eventually, so knowing what you need to improve matters. If Claude works long enough, you’ll see something about “compaction.” We haven’t discussed things at a deep enough level to really understand this yet, so don’t worry about it! You may want to note one thing though: Claude tends to do worse after compaction, in my opinion. So one way to think about this is, “If I see compaction, I’ve tried to accomplish too large a task.” Reflect on whether you could have broken this up into something smaller. We’ll talk about this more in the next post. So that’s it! Let Claude write some code, in a project you don’t care about. Try bigger and harder things until you’re a bit frustrated with its failures. You will hit the limits, because you’re not doing any of the intermediate techniques to help Claude do a good job. But my hope is, by running into these issues, you’ll understand the motivation for those techniques, and will be able to better apply them in the future. Here’s my post about this post on BlueSky: Steve Klabnik @steveklabnik.com · Jan 7: Agentic development basics: steveklabnik.com/writing/agen...
You can be less precious about the code. This isn’t messing up one of your projects, this is a throwaway scratch thing that doesn’t matter. “AI does better on greenfield projects” is not exactly true, but there’s enough truth to it that I think you should do a new project. It’s really more about secondary factors than it is actual greenfield vs brownfield development but whatever, doesn’t matter: start new.

1 view
Corrode 2 weeks ago

Gama Space

Space exploration demands software that is reliable, efficient, and able to operate in the harshest environments imaginable. When a spacecraft deploys a solar sail millions of kilometers from Earth, there’s no room for memory bugs, race conditions, or software failures. This is where Rust’s robustness guarantees become mission-critical. In this episode, we speak with Sebastian Scholz, an engineer at Gama Space, a French company pioneering solar sail and drag sail technology for spacecraft propulsion and deorbiting. We explore how Rust is being used in aerospace applications, the unique challenges of developing software for space systems, and what it takes to build reliable embedded systems that operate beyond Earth’s atmosphere. CodeCrafters helps you become proficient in Rust by building real-world, production-grade projects. Learn hands-on by creating your own shell, HTTP server, Redis, Kafka, Git, SQLite, or DNS service from scratch. Start for free today and enjoy 40% off any paid plan by using this link . Gama Space is a French aerospace company founded in 2020 and headquartered in Ivry-sur-Seine, France. The company develops space propulsion and orbital technologies with a mission to keep space accessible. Their two main product lines are solar sails for deep space exploration using the sun’s infinite energy, and drag sails—the most effective way to deorbit satellites and combat space debris. After just two years of R&D, Gama successfully launched their satellite on a SpaceX Falcon 9. The Gama Alpha mission is a 6U cubesat weighing just 11 kilograms that deploys a large 73.3m² sail. With 48 employees, Gama is at the forefront of making space exploration more sustainable and accessible. Sebastian Scholz is an engineer at Gama Space, where he works on developing software systems for spacecraft propulsion technology. His work involves building reliable, safety-critical embedded systems that must operate flawlessly in the extreme conditions of space. 
Sebastian brings expertise in systems programming and embedded development to one of the most demanding environments for software engineering.

- GAMA-ALPHA - The demonstration satellite launched in January 2023
- Ada - Safety-focused programming language used in aerospace
- probe-rs - Embedded debugging toolkit for Rust
- hyper - Fast and correct HTTP implementation for Rust
- Flutter - Google’s UI toolkit for cross-platform development
- UART - Very common low-level communication protocol
- Hamming Codes - Error correction used to correct bit flips
- Rexus/Bexus - European project for sub-orbital experiments by students
- Embassy - The EMBedded ASsYnchronous framework
- CSP - The Cubesat Space Protocol
- std::num::NonZero - A number in Rust that can’t be 0
- std::ffi::CString - A null-byte terminated String
- Rust in Production: KSAT - Our episode with Vegard about using Rust for Ground Station operations
- Rust in Production: Oxide - Our episode with Steve, mentioning Hubris
- Hubris - Oxide’s embedded operating system
- ZeroCopy - Transmute data in-place without allocations
- std::mem::transmute - Unsafe function to treat a memory section as a different type than before
- Gama Space Website
- Gama Space on LinkedIn
- Gama Space on Crunchbase

0 views
Brain Baking 3 weeks ago

Another Major Bike Service

Last month I handed in my bike for another major repair service. It was sorely needed: a slight push on the pedals caused the chain to drop a gear, the front light wiring was broken since forever, and shifting in general always required two good clicks on the handlebar instead of just one. This year, the bike turns ten. The previous one was stolen on a weekday evening after parking it right across the old courthouse—isn’t that ironic? Of course that was entirely my fault: I kind of might have slightly forgotten to lock it. But still, who does that? The local bicycle repair expert had their hands full: the entire back cassette gear together with the chain was replaced, the seat post was replaced (I didn’t even know it was broken), the front light rewired, and the right shifter on the bar got replaced. Everything together cost me about . The result is a spotless gear system that’s lovely to ride: A closeup of the replaced cassette gear and chain. Yes, there once was a chain guard/fender in front of that chain protecting it from mud but that brittle plastic thing broke down long ago. This does mean the chain is open for attacks from road salt after snowy days like last week. I forgot to clean it and in just three days the entire chain was covered in rust—the new chain! After another trip to the bike shop for more mud remover and chain protector/oil, that problem was luckily solved. My wife laughs at me for regularly cleaning and oiling the gears and chain. I hate a squeaky bike. I shudder when encountering other cyclists with poorly maintained bikes that you can hear weeping (weep-weep-weep) as they push their pedals. I want to hear exactly nothing and feel nothing but smoothness when I exert force on my pedals. For some reason, that’s hugely satisfying for me. So yes, I try to keep the mud and sand out. But somehow, I forgot about the road salt: if you zoom in on the above photo you’ll still spot spots (ha!) of rust here and there.
I guess that means I’ll be repeating the cleaning process later today. The reason why the entire cassette was replaced is that apparently, the wear and tear on the gears gradually grinds the short edges of the teeth that lock into the chain into very spiky ones. As a result, as you push on the pedals to move the chain, the gear no longer consistently “locks” into it, causing slipping. If you sometimes “fall through” when biking, it’s time to inspect the gears. Did a cogwheel transform into a giant shuriken that would make every ninja jealous? Then perhaps it’s time to visit the bike shop. This wasn’t the first time the chain and gear(s) got replaced—the last time was in 2021. The not-so-cheap price tag does raise the question of whether buying a new bike is the better option, but I really like my current bike. Besides, spreading the repair cost out over four-ish years makes it much more bearable. Riding a new bike to and from work on a daily basis would deteriorate the cogwheels just as fast unless I buy a very fancy e-bike with a belt drive. Also, small repairs like chain adjustments I can do myself. At least I think I can. I don’t have any fancy biking stats to share: I don’t keep track of that. For me, my bike symbolises simplicity and freedom. I hope to be able to ride the Trek 1 for at least five more years. I just found out that Trek is an American brand, while here in Belgium and The Netherlands we basically drown in excellent bike manufacturers. I’ll take note of that should I ever decide to replace it.  ↩︎ Related topics: / bike / By Wouter Groeneveld on 20 January 2026.  Reply via email .

0 views
Steve Klabnik 1 month ago

Getting started with Claude for software development

2025 was an interesting year in many ways. One way in which it was interesting for me is that I went from an AI hater to a pretty big user. And so I’ve had a few requests for a “using Claude” guide, so I figure new year, why not give it a shot? The lack of this kind of content was something that really frustrated me starting out, so it feels like a good thing to contribute to the world. This post is going to be for software developers who are interested in learning about using these tools as of early 2026. I’m going to spend this post talking about some background, and then the first steps towards getting your feet wet. If folks like it, I’ll follow up with more. There’s a lot here. I’m going to be speaking about Claude directly, because it’s the tool I use the most, but a lot of this should apply to other platforms as well. The first thing I want to say on this topic is that there’s a reason that this is the first post of possibly many: there’s a lot here, as I just said above. This matters more than you might think at first. Becoming productive with LLMs is not actually easy, no matter what other people tell you. A lot of advice in this space is given by people who don’t have a teaching background, and have forgotten how much work they’ve put in to get to where they are. I liken it to vim: everyone acknowledges that modal editing is a different way of working. We joke about how hard it is to learn how to quit vim if you’ve accidentally started it up. But many people also acknowledge that the learning curve is worth it, due to the power you get. I think of LLMs like vim: they’re not super easy to get real results from, but the time invested can be worth it. It’s also worth saying up front: maybe it’s not worth it, for you. I don’t fault anyone for not wanting to spend time learning a new tool, especially in a space that’s moving as fast as this one.
Effectively everything I’m going to talk about in this post has really only come into its own in the last 12-14 months. Maybe in another 12 months this post will be useless. I don’t know. But just like we might not find the time to learn vim to be worth it over just using a more normal editor, that doesn’t mean that deciding that all of this isn’t worth your time isn’t a rational, reasonable decision to make. You’re not going to be “left behind” or whatever some of the boosters say, in the same way that you don’t need to learn vim to do software dev. We aren’t doing vim vs emacs wars here. We’re saying “hey if you want to learn vim this is how I think you can, and if not, that’s fine.” Furthermore, because there’s so much to cover, this post is going to be background and step 1. Because otherwise it would be too dang long. You can’t just read ten thousand words on this stuff and be an expert, you have to go actually use the things. So you should be taking time between each post to go and do that, and so not putting it all in one place should give you natural breakpoints to go and actually try this stuff out. The very first thing I want to say on this entire topic is something that I think about a lot. I have generally had better outcomes with LLMs than a lot of people I know. And that’s bothered me. And I’m not sure exactly why that is. But I do have one idea. I like to approach this space in a … maybe “scientific” way is too strong, but at least a rational one. I try things out, discard what doesn’t seem to work, and keep what seems to work. I try and think critically about this space. I do think that the whole “vibe” term, while complicated in this space, is also important. Vibes do matter, actually. I have some more science-y and some more folks-y reasons that I believe this. But I do think that the attitude you bring towards this process partially dictates your success, and I think you should be conscious of that while you go on this journey. 
Is that too woo-y for you? Okay, let me make it concrete: I un-ironically believe that swearing at Claude makes it perform worse. I think you will get better results working with an LLM if you treat them like you’d treat a respected co-worker, and you will get worse results if you berate, insult, or otherwise mistreat them. This matters because I think that for a lot of LLM-skeptical people who give this a shot, they may not actually go “Hey claude what’s your fucking problem” (though I have literally seen this happen), but they will tend to let their frustrations show a bit more when things don’t work out. Use your emotional regulation skills. It’s very okay to be critical in response to whatever Claude does, but do it in a way that wouldn’t get you reported to HR in a healthy company. Do this: “Why did you do it that way? I would have preferred if we did <this> instead.” Not this: “Stop making such basic mistakes. You know that we do <this> and not <that>, idiot.” I think that being kind to people is good for you, but even if you’re a misanthrope, consider this a skill for getting increased output from the tool. I think a bit of anthropomorphization is actually a good thing here. We’ll come back to that later during the more practical steps, but basically, that’s the higher level principle at work: an LLM is not a person. But it is working based off of language that people use. That’s its API. And so interacting with it in the way you’d interact with a co-worker is, in my mind, the right way to do it. Maybe I’ll elaborate on this belief someday. Or maybe not. I do this for personal belief reasons more than anything else. But it is something I want to share. Okay! Now that we’ve got that out of the way, let’s talk about the various ways you can use Claude! There’s a number of them, actually, but I want to focus on two: on the web at https://claude.ai , and with Claude Code . Using Claude in these ways is fundamentally different. Both have pros and cons.
For real actual software development, you want to use Claude Code. This is due to the “agentic loop”, which you’ll learn more about in a bit. But for your first steps, using it via the web is okay. It’s mostly just important to know that your experience using the web interface is not going to be the same as using Claude Code. If I only had access to the web interface, I wouldn’t be so bullish on this stuff. But it is useful, especially when getting your feet wet, as long as you can understand that they’re meaningfully different. This gets into another topic that matters: money. Another reason I do not fault anyone for not spending time with these tools is that vim is free, whereas Claude is very much not. However. There are three major factors in the money equation: Claude Web vs Claude Code, which models you have access to, and the actual cost. Let’s talk about them. You can load up https://claude.ai and talk to Claude right now for free. But you cannot use Claude Code without paying. So if you want to start incredibly small, using the website at first before you fork over any money can make sense. Again, that’s fine, just know that the experience is different. But it may be a good way to start. In 2024 and 2025, there was a good argument that you needed to be on a paid plan because that’s how you got access to the latest models. While this is still true to some degree, models have advanced far enough that the changes are less important over time. I do think that in the first half of 2026, it still does matter to a degree. Basically, the difference between Claude 3, 4, and 4.5 is significant, but for me, Claude 4 was good enough a year ago to get real work done. I’m not 100% sure which one you get for free today, but it’s at least fine. And I think that by the time the next round of models come out, the ones you’ll have access to for free will be basically good enough to make this question moot.
But do know that you get what you pay for, and paying for things does get you better performance. (Speaking of models, you’ll hear Claude referred to in three ways: Haiku, Sonnet, and Opus. As the names imply, worst to best there, though also, fastest to slowest. Sonnet, especially the 4.5 version, is pretty good for everything. Opus 4.5 is wonderful. Haiku is great for certain things.) As for actual cost: there’s $20/month, $100/month, and $200/month plans, as well as “you pay per API call.” You might be tempted to think “I’ll just pay per API call and keep my usage down.” This is a reasonable thing to think, and also a terrible mistake to make. You get a lot of bang for your buck with the plans. To give you an idea, I recently hit my weekly limit last night on the $200/month plan, and my estimated usage for that week (which again, I’m paying $50 for) would have been $1440.73 if I were paying by the API call. Now, I am a very heavy user, but the point stands: as someone trying out these tools, it is way easy to spend more than $20 of API tokens. If you want to give these tools a real shot, come up with a spare $20, sign up for the cheap plan, and then cancel after your experiment is over. You get access to Claude Code and you’ve capped your spend. It’s a win/win. There’s some good secondary effects of trying to be frugal here but I think that’s more of an intermediate than an advanced topic, to be honest. I think worrying about the money while you build these skills is a distraction. Cap your spend via a plan so that way you can not stress out about breaking the bank. Okay, with all of that background out of the way: let’s talk about your first steps here. Everyone is interested in the ability of LLMs to generate code. But I think that’s actually step 2, not step 1. The way I want you to start using these tools is purely read-only at first. 
This is also why the website is okay to get started with; Claude Code is far better at generating code than the site is, but we’re not going to start by writing code. Find some code you’ve written recently. It can be literally anything. Load up https://claude.ai , and type: Hi Claude! Can we talk about this code? And then paste your code in. You don’t need any fancy prompting techniques. You don’t even need to say what language it is. Just give it some code. It could be ten lines, it could be a hundred. I wouldn’t recommend a thousand to start. Claude will probably respond with some sort of basic analysis of what you’ve done, and then a question. I gave it ~50 lines of code a friend and I were discussing recently, and it gave me this back: Sure! This looks like <description of what it does>. You’ve got <three things that the code does>. What’s on your mind about it? Are you thinking through the design, running into a specific issue, or wanting feedback on a particular aspect? From here, you have a ton of options of which way to go, but they really depend on what you’ve pasted in. Here are some fun prompt ideas: Do you think this code is idiomatic? If you could improve one thing about this code, what might it be? If I wanted to modify this code to do <something>, how would you go about doing that? Are there any bugs in this code? Are there any security implications of this code I may not have thought about? And so on. Anyway, the goal here is to just get used to this whole thing. It’s a bit weird! It’s very different than talking to a compiler. If Claude says something you disagree with, push back a little, just like you would a co-worker: I’m not sure I agree with that. The reason why is that in some other part of the system, there’s <behavior> and so that would impact this sort of decision. Why did you suggest that? I’d like to understand more. Claude will absolutely not be right all of the time. And that’s okay!
The goal is to work together, not that this is a magic tool that suddenly solves all of your problems. Once you’ve done this a few times, you might want to graduate to Claude Code. The reason for this is that you can start to scale up your questions. Once you’ve installed it and logged in, you’ll be at a terminal prompt. It might bug you about creating a CLAUDE.md, don’t worry about that for now. Continue having conversations with Claude about your codebase. The reason that this is a big step up is that before, you had to paste all of the code in. Now, Claude can go find your code itself. Some prompts for you to try: Please give me a code review of my codebase and suggest five things I could do to improve it. Can you find any bugs in <component>? I’m curious about the performance of <component>, can we talk about it? One thing I like to do here is have Claude double check my intuitions. A few months ago, working on an application in Rust, I was considering a refactoring. I hadn’t done it because I was worried that it would be tedious, take a while, and maybe not improve the codebase. It might! But it might not. And putting in the day or two to do the refactor wasn’t really worth it just to find out, if that time might be wasted. So, I asked Claude. This is an example of a bit longer of a prompt: Hi Claude! I am considering refactoring my code. In a function like this: <paste code>, I don’t like how I did things, and I’m considering doing it like this instead: <paste code>. However, I know that changes the signature, which impacts other code in the codebase. A few questions for you: 1. how many function signatures would need to be updated if I made this change? 2. can you show me what the code would look like if I did this refactoring on one of my simpler endpoints? 3. can you show me what the code would look like if I did this refactoring on one of my most complex endpoints?
Claude came back and said something like “250 signatures would need to change, here’s the before and after using these two examples from your codebase.” Now, Claude isn’t perfect: maybe it was actually 260 signatures. But the point is, this helped me characterize my intuition here: it would be a non-trivial amount of work. But I also got to see its impact on real code I had written, which helped me decide if this refactoring would actually help me in some of the more hairy parts of the codebase. Note that there’s not really any special “prompt engineering” going on here. You don’t need to do “as a senior software engineer” or stuff like that. Just talk to it like you’d talk to a person. It’s fine. That doesn’t mean that prompts are useless, but this sort of optimization is an intermediate to advanced topic, and frankly, I’m skeptical that at this point the “as an x” technique even helps. More on that someday. The point is, you can start asking more complex questions as you get more comfortable with the tool. Because Claude works asynchronously, you can just fire off questions like these in the background, and come back to them when it’s done. Well, sorta. Let’s talk about permissions before we wrap this up. By default, Claude will put you in an “ask before edits” mode. This is a good way to start. It’ll check in with you before doing certain things, and you can say yes or no. Please consider what it’s about to do, and give the answer you’re comfortable with. Advanced users basically let Claude do whatever it wants, but you’re not there yet, and there’s risks involved that aren’t obvious to you just yet as a new user, so even though it can be a bit annoying to say yes every time it asks, I’d encourage you to start off with minimal permissions. 
It gives you the option to say “commands like this one are okay for the rest of my session” and so when it wants to or something, that can be nice to agree to, but I’d encourage you to not use it for writing code just yet, and tell it no if it asks. We’ll do that in a follow-up post. So that’s my intro to getting started with Claude. Spend $20, talk to it like you’d talk to a person, and use it as a means of getting feedback on your code, don’t have it write anything just yet. Graduate to larger and larger questions as you get comfortable with what it can do. Gently push back when you think it gets out of line. But your goal here is a baseline understanding of what the tool is capable of, not to vibe code out an entire app in an afternoon. These skills may seem too basic, but I promise you, it gets harder from here, and so you’ll want a solid foundation in read-only questions before we graduate to having Claude write some code. I hope this was helpful to you. Here’s my post about this post on BlueSky: Getting started with Claude for software development: steveklabnik.com/writing/gett...

0 views

Reflecting on 2025, preparing for 2026

As I do every year, it's that time to reflect on the year that's been, and talk about some of my hopes and goals for the next year! I'll be honest, this one is harder to write than last year's. It was an emotionally intense year in a lot of ways. Here's to a good 2026! Where last year I got sick and had time black holes from that, this year I lost time to various planned surgeries. I didn't get nearly as much done, because it was also hard to stay focused with all the attacks on trans rights happening. Without further ado, what'd I get up to? I helped coaching clients land jobs and improve their lives at work and beyond. I started coaching informally in 2024, and in 2025 I took on some clients formally. During the year, I helped clients improve their skills, build their confidence, and land great new jobs. I also helped clients learn how to balance their work and home life, how to be more productive and focused, and how to navigate a changing industry. This was one of the most rewarding things I did all year. I hope to do more of it this coming year! If you want to explore working together, email me or schedule an intro . I solved interesting problems at work. This reflection is mostly private, because it's so intertwined with work that's confidential. I learned a lot, and also got to see team members blossom into their own leadership roles. It is really fun watching people grow over time. I took on some consulting work. I had some small engagements to consult with clients, and those were really fun. Most of the work was focused on performance-sensitive web apps and networked code, using (naturally) Rust. This is something I'll be expanding this year! I've left my day job and am spinning up my consulting business again. More on that soon, but for now, email me if you want help with software engineering (especially web app performance) or need a principal engineer to step in and provide some engineering leadership. I wrote some good blog posts.
This year, my writing output dropped to about 1/3 of what it was last year. Despite the reduction, I wrote some pretty good posts that I'm really happy with! I took a break intentionally to spend some time dealing with everything going on around me, and that helped a lot. I didn't get back to consistent weekly posts, but I intend to in 2026. My hernias were fixed. During previous medical adventures, some hernias were found. I got those fixed [1]! Recovering from hernia repair isn't fun, but wasn't too bad in the long run. It resolved some pain I'd had for a while, which I hadn't realized was unusual pain. (Story of my life, honestly.) Long-awaited surgery! In addition to the hernia repair, I had another planned surgery done. The recovery was long, and is still ongoing. My medical leave was 12 weeks, and I'm going to continue recovering for about the first year in various forms. This has brought me so much deep relief, I can't even put it in words. Performed a 30-minute set at West Philly Porchfest. I did a solo set in West Philly Porchfest! All the arrangements were done by me, and I performed all the parts live (well, one part used a pre-sequenced arpeggiator). I played my wind synth as my main instrument, layering parts over top of myself with a looper, and I also played the drum parts. You can watch a few of the pieces in a YouTube playlist . Wrote and recorded two pieces of original music. This was one of my goals from 2024, and I'm very proud that I got it done. The first piece of music, Anticipation , came from an exercise a music therapist had me do. I took the little vignette and expanded it into a full piece, but more importantly, the exercise gave me an approach to composition. I'd like to rerecord Anticipation sometime, since I've grown as a musician significantly across the year. My second piece I'm even happier with. It's called Little Joys , and I'm just tickled that I was able to write this.
I played it on my alto sax (piped through a pedal board) and programmed the other parts using a sequencer. One of my poems was published! I've written a lot more poetry this year. One of my close friends told me that I should get one of them published to have more people read it. They thought it was a good and important poem. That gave me the confidence to submit some poems, and one of them was accepted! (The one they told me to submit was not yet accepted anywhere, but fingers crossed.) You can read my poem, "my voice", in the December issue of Lavender Review . Every year when I write this, I realize I got a lot done. This year was a lot, filled with way more creative output than previous years. How does it stack up against what I wanted to do last year ? I am really proud of how much I did on my goals. I might be unhappy with my slipping on if it were a "normal" year where the government isn't trying to strip my rights, but you know what? I'll take it. Especially since I prioritized my health and happiness. So, what would I like to get out of this new year, 2026? These aren't my predictions for what will happen, nor are they concrete goals. They're more of a reflection on what I'd like this coming year to be. This is what I'm dreaming 2026 will be like for me. Keep my rights (and maybe regain ground). A perennial goal, I'd like to be able to stay where I am and have access to, I don't know, doctors and bathrooms. We've held a lot of ground this year. Hopefully some of what was lost can be regained. I'm going to keep doing what I can, and that includes living my best life and being positive representation for all others who are under attack. Maintain relationships with friends and family. I want to keep up with my friends and family and continue having regular chats with those I care about. We're a social species, and we rely on each other for support. 
I'm going to keep being there for the people I care about when they need me, and keep accepting their help as well when I need them. Spin up my business. I'm going out on my own, and I'm going to be offering my software engineering services again. By the end of the year, this will hopefully be thrumming along to support me and my family. Publish weekly blog posts (sustainably). I'm back in the saddle! This is the first post of 2026, and they're going to hopefully keep coming regularly. To make it sustainable, I'm going to explore if Patreon is a viable option to offset some of the time it takes to make the blog worth reading. Record a short album. I have a track in progress, and I have four more track ideas planned. I accidentally started writing an EP, I think??? This year I would love to actually finish that and release it. Publish more poetry. Writing poetry this year was very meaningful, and it's deeply important to me. I want to get more of it published so that I can share it with people who will also be able to get deep importance from it. That's it! Wow, the year was a lot. I've put a lot of myself in this post. If you've read this far, thank you so much for reading. If you've not read this far, then how're you reading this sentence anyway? 2025 had a lot in it. There were some very good things I am very grateful for. There were some very scary and bad things that I wish had never happened. All told, it's been a long few years jammed into one calendar year. I hope that 2026 will be a little calmer, with less of the bad. Maybe it can feel like just one year. Regardless, I'm going to hold as much joy in the world this year as I can. Please join me in that. Let's fill 2026 with as much joy as we can, and make the world shine in spite of everything. The surgeon really meshed me up! ↩ ❓ Once again, I wanted to keep my rights. It's a perennial goal, and I did keep my rights in the state/community I live in. 
I'm awarding this one a question mark since my rights were under assault, and there are now many more places I cannot safely travel to. That means it's not a full miss, but not a win either. ✅ No personal-time side projects went into production! Yet another year that I toyed with the idea and again talked myself out of it. I'm taking it off the list for 2026, since the urge wasn't really even there this time. ✅ Maintained relationships with friends and family. I've had regular, scheduled calls with some people close to me. I've visited people, supported them when they needed me, and asked for support when I needed it. ❓ I did a little consulting and coaching, but didn't explore many ways to make this (playful exploration like I do on here) my living. I'm giving this the question mark of dubiousity, since I don't think I got much information from the year toward the questions I wanted to answer. ✅ Kept my mental health strong! There were certainly some challenges. What I'm proud of most is that I recognized those challenges and made space for myself. That's why I stopped blogging regularly: I needed the space to get through things with intact mental health. ❓ Did some ridiculous fun projects with code, but not as much as I wanted. The main project was making it so I can type using my keyboard (you know, like a piano, not the thing with letters on it). I had aspired to do more, and I'm glad I let myself relax on this. ✅ Wrote some original music! ✅ Also recorded that original music! It's on my bandcamp page .

Farid Zakaria 1 month ago

Bespoke software is the future

At Google, some of the engineers would joke, self-deprecatingly, that the software internally was not particularly exceptional; rather, Google’s dominance was an example of the power of network effects: software that is custom tailored to work well together. Outside of Google, or similar FAANG companies, this is often cited as indulgent “NIH” (Not Invented Here) syndrome; the prevailing practice elsewhere is to pick generalized software solutions, preferably open-source, off-the-shelf. The problem with these generalized solutions is that, well, they are generalized and rarely fit well together. 🙄 Engineers are trained to be DRY (Don’t Repeat Yourself), and love abstractions. As a tool tries to solve more problems, the abstraction becomes leakier and more ill-fitting. It becomes a general-purpose tax. If you only need 10% of a software solution, you pay for the remaining 90% via the abstractions it imposes. 🫠 Internally to a company, however, we are taught that unused code is a liability. We often celebrate negative pull requests as valuable clean-up work with the understanding that smaller codebases are simpler to understand, operate, and optimize. Yet for most of our infrastructure tooling, we continue to bloat solutions and tout support despite minuscule user bases. This is probably one of the areas I am most excited about: the ability to leverage LLMs for software creation. I recently spent time investigating linkers in previous posts such as LLVM’s lld . I found LLVM to be a pretty polished codebase with lots of documentation. Despite the high quality, navigating the codebase is challenging as it’s a mass of interfaces and abstractions needed to support multiple object file formats, 13+ ISAs, a slew of features (e.g. linker scripts ), and multiple operating systems. Instead, I leveraged LLMs to help me design and write µld , a tiny opinionated linker in Rust that only targets ELF, x86_64, static linking, and a barebones feature set.
It shouldn’t be a surprise to anyone that the end result is a codebase that I can audit, learn from, and easily grow to support additional improvements and optimizations. The surprising bit, especially to me, was how easy it was to author within a very short period of time (1-2 days). That means smaller companies, without the coffers of FAANG companies, can also pursue bespoke, custom-tailored software for their needs. This future is well-suited for tooling such as Nix . Nix is the perfect vehicle to help build custom tooling, as you have a playground that is designed to build the world, similar to a monorepo. We need to begin to cut away legacy in our tooling and build software that solves specific problems. The end result will be smaller, easier to manage, and better integrated. Where this might have seemed unattainable for most, LLMs will democratize this possibility. I’m excited for the bespoke future.

Brain Baking 1 month ago

2025 In Video Games

It’s that time of the year—the time to publish the yearly notes summarizing playtime statistics and providing a personal opinion on recent and vintage Game Of The Year (GOTY) contestants. In 2023 , Pizza Tower and Tactics Ogre: Reborn were examples of superb recent games that even made it to the Top 100 List , while DUSK and Plants vs. Zombies scored high in the vintage list (both also on the Top 100). In 2024 , Skald and the Paper Mario remake were the great ones, but the most memorable experience was no doubt playing Ultima Underworld for the first time together for the DOS Game Club. For 2025, the number of games recorded on my retro gaming site remains the same as the previous year—27—but this year I also started occasionally reviewing board games that I replay at least ten times. Here’s this year’s collage of the games I (re)played this year in chronological order: A collage of the 2025 GOTY contestants. I have yet to write a review for Shotgun King so let’s keep that one out. It’s a small indie roguelike that’s fun but doesn’t have a lot to offer. Also, since this post is called 2025 in Video Games , let’s ignore the board games for now and keep that for a future post where I summarise my Board Game Geek statistics. Some more useless stats, based on user input from How Long To Beat (HLTB): Last year, about 50% of my gaming time took place on the Switch. That’s dropped to 40%. Or has it? Remove the six board games and you’ve got 52% so nope, I’m still primarily a Nintendo (handheld) gamer. I have a bunch of cartridges waiting to be played and I believe even a few cases still in shrink wrap (yeah I know), so for the coming year, that’s not likely to change either. I don’t need a Switch 2 just yet. For more details on those divisions by platform, I again reused last year’s script to generate a graph that summarizes the platforms and calculates an average score (rated on 5, see about the rating system ) for each platform: A bar chart of (average) scores per platform.
Most mediocre plays came from platforms where I was hunting down card games for my feature write-up on card games back in September. Filtering all games that are scored as either great (4/5) or amazing (5/5), we end up with the following lists, where I further cherry-picked the best of the best: The Recent GOTY list: Cough, “recent”, cough. Yeah, again—I know. What can I say, I’m a retro gamer, and the “new games” I play are usually repurposed old ones, go figure. This seems to be especially apparent this year. Those Nightdive Studios boomer shooter remakes are beyond awesome, you’ve got to try them! The Vintage GOTY list: I found 2024 to be a meagre year for me when it comes to “the great ones”—because I don’t play many of those within the year of release. I have the same feeling for this year, looking back at the play log. There are many great games I highly enjoyed, such as Wonder Boy with the awesome art and music and the ability to switch back and forth between retro and remastered versions, or Hoyle Card Games , the PC classic that’s hard to beat when it comes to trumping the trump. I love Celeste and Castlevania Dominus Collection but those were replays of games I know by heart, so I’m ruling them out. We’ve got to draw the line somewhere. And then there’s Inscryption . What a game. No, what an experience that was! I was on the edge of my seat almost every single in-game minute. I played it in January (read my thoughts but beware of the spoilers) and didn’t encounter a game that challenged my expectations that much ever since. There’s no need for a debate or a voting round: Inscryption is my “Game of the Other Year”. It’s in the Top100 . As for the GOTY of 2025-ish; that’s got to be one of the Nightdive remakes. Both Blood: Refreshed Supply and the Outlaws remaster have been released recently and I haven’t yet had the chance to touch either of them.
If I had, I think Blood might have been the winner as that’s the only Build Engine game I never truly played back in the nineties. Screw it. DOOM + DOOM II is my GOTY. Just the music alone: And that’s from the new Legacy of Rust expansion. I’ll leave the discovery of Andrew Hulshult’s DOOM riffs up to you. Obviously, DOOM + DOOM II (2024) kicked out and replaced DOOM (1993) in the Top100. Cheers to 2026. My hopes are high for opening that shrink wrap. Related topics: / games / goty / lists / yearnote / By Wouter Groeneveld on 30 December 2025.  Reply via email .
total #games: 27
total hours: 175.8
average hours: 6.51
average a day: 0.5
longest game: 28.0 hours; ‘Castlevania Dominus Collection’
shortest game: 0.0 hours; Hoyle Card Games 2002
Division by platform:
Platform: pc (5/27)
Platform: ds (3/27)
Platform: boardgames (6/27)
Platform: gameboycolor (1/27)
Platform: switch (11/27)
Platform: snes (1/27)
💖 Guncho (pc; 2024)
💖 Shogun Showdown (switch; 2023)
💖 Rise Of The Triad: Ludicrous Edition (switch; 2023)
💖 Prince of Persia: The Lost Crown (switch; 2024)
💖 DOOM + DOOM II (pc; 2024)
💖 Castlevania Dominus Collection (switch; 2024)
💖 Hoyle Card Games 2002 (pc; 2002)
💖 Wonder Boy: The Dragon’s Trap (switch; 2017)
💖 Tangle Tower (switch; 2019)
💖 Celeste (switch; 2018)
💖 Inscryption (switch; 2021)

Dangling Pointers 1 month ago

CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices

CHERIoT RTOS: An OS for Fine-Grained Memory-Safe Compartments on Low-Cost Embedded Devices Saar Amar, Tony Chen, David Chisnall, Nathaniel Wesley Filardo, Ben Laurie, Hugo Lefeuvre, Kunyan Liu, Simon W. Moore, Robert Norton-Wright, Margo Seltzer, Yucong Tao, Robert N. M. Watson, and Hongyan Xia SOSP'25 This paper is a companion to a previous paper which described the CHERIoT hardware architecture. This work presents an OS that doesn’t look like the systems you are used to. The primary goal is memory safety (and security more broadly). Why rewrite your embedded code in Rust when you can switch to a fancy new chip and OS instead? Recall that a CHERI capability is a pointer augmented with metadata (bounds, access permissions). CHERI allows a more restrictive capability to be derived from a less restrictive one (e.g., reduce the bounds or remove access permissions), but not the other way around. CHERIoT RTOS doesn’t have the notion of a process, instead it has a compartment. A compartment comprises code and compartment-global data. Compartment boundaries are trust boundaries. I think of it like a microkernel operating system. Example compartments in CHERIoT include: Boot loader Context switcher Heap allocator Thread scheduler The boot loader is fully trusted and is the first code to run. The hardware provides the boot loader with the ultimate capability. The boot loader then derives more restrictive capabilities, which it passes to other compartments. You could imagine a driver compartment which is responsible for managing a particular I/O device. The boot loader would provide that compartment with a capability that enables the compartment to access the MMIO registers associated with the device. There is no user space/kernel space distinction here, only a set of compartments, each with a unique set of capabilities. Fig. 
3 illustrates a compartment: Source: https://dl.acm.org/doi/10.1145/3731569.3764844 Sealed Capabilities The CHERIoT hardware architecture supports sealing of capabilities. Sealing a capability is similar to deriving a more restrictive one, only this time the derived capability is useless until it is unsealed by a compartment which holds a capability with unsealing permissions. I think of this like a client encrypting some data before storing it on a server. The data is useless to everyone except for the client who can decrypt it. Cross-compartment function calls are similar to system calls and are implemented with sealed capabilities. Say compartment A needs to be able to call a function exported by compartment B. At boot, the boot loader derives a “function call” capability which is a pointer into the export table associated with B, seals that capability, and passes it to compartment A at initialization. The boot loader also gives the switcher a capability which allows it to unseal the function call capability. When A wants to call the function exported by B, it passes the sealed capability to the switcher. The switcher then unseals the capability and uses it to read metadata about the exported function from B’s export table. The switcher uses this metadata to safely perform the function call. Capability sealing also simplifies inter-compartment state management. Say compartment A calls into a networking compartment to create a TCP connection. The networking compartment can allocate a complicated tree of objects and then return a sealed capability which points to that tree. Compartment A can hold on to that capability and pass it as a parameter for future networking function calls (which the networking compartment will unseal and then use). The networking compartment doesn’t need to track per-connection objects in its global state. The heap compartment handles memory allocation for all compartments. There is just one address space shared by all compartments, but capabilities make the whole thing safe.
As described in the previous summary, when an allocation is freed, the heap allocator sets associated revocation bits to zero. This prevents use-after-free bugs (in conjunction with the CHERIoT hardware load filter). Similar to garbage collection, freed memory is quarantined (not reused) until a memory sweep completes which ensures that no outstanding valid capabilities are referencing the memory to be reused. The allocator supports allocation capabilities which can enforce per-compartment quotas. If you’ve had enough novelty, you can rest your eyes for a moment. The CHERIoT RTOS supports threads, and they mostly behave like you would expect. The only restriction is that threads are statically declared in code. Threads begin execution in the compartment that declares them, but then threads can execute code in other compartments via cross-compartment calls. Each compartment is responsible for managing its own state with proper error handling. If all else fails, the OS supports micro-reboots, where a single compartment can be reset to a fresh state. The cross-compartment call mechanism supported by the switcher enables the necessary bookkeeping for micro-reboots. The steps to reboot a single compartment are: Stop new threads from calling into the compartment (these calls fail with an error code) Fault all threads which are currently executing in the compartment (this will also result in error codes being returned to other compartments) Release all resources (e.g., heap data) which have been allocated by the compartment Reset all global variables to their initial state I wonder how often a micro-reboot of one compartment results in an error code which causes other compartments to micro-reboot. If a call into a compartment which is in the middle of a micro-reboot can fail, then I could see that triggering a cascade of micro-reboots. The ideas here remind me of Midori , which relied on managed languages rather than hardware support. 
I wonder which component is better to trust, an SoC or a compiler?

Tenderlove Making 1 month ago

Can Bundler Be as Fast as uv?

At RailsWorld earlier this year, I got nerd sniped by someone. They asked “why can’t Bundler be as fast as uv?” Immediately my inner voice said “YA, WHY CAN’T IT BE AS FAST AS UV????” My inner voice likes to shout at me, especially when someone asks a question so obvious I should have thought of it myself. Since then I’ve been thinking about and investigating this problem, going so far as to give a presentation at XO Ruby Portland about Bundler performance . I firmly believe the answer is “Bundler can be as fast as uv” (where “as fast” has a margin of error lol). Fortunately, Andrew Nesbitt recently wrote a post called “How uv got so fast” , and I thought I would take this opportunity to review some of the highlights of the post and how techniques applied in uv can (or can’t) be applied to Bundler / RubyGems. I’d also like to discuss some of the existing bottlenecks in Bundler and what we can do to fix them. If you haven’t read Andrew’s post, I highly recommend giving it a read . I’m going to quote some parts of the post and try to reframe them with RubyGems / Bundler in mind. Andrew opens the post talking about rewriting in Rust: uv installs packages faster than pip by an order of magnitude. The usual explanation is “it’s written in Rust.” That’s true, but it doesn’t explain much. Plenty of tools are written in Rust without being notably fast. The interesting question is what design decisions made the difference. This is such a good quote. I’m going to address “rewrite in Rust” a bit later in the post. But suffice it to say, I think if we eliminate bottlenecks in Bundler such that the only viable option for performance improvements is to “rewrite in Rust”, then I’ll call it a success. I think rewrites give developers the freedom to “think outside the box” and try techniques they might not have tried. In the case of uv, I think it gave the developers a good way to say “if we don’t have to worry about backwards compatibility, what could we achieve?”.
I suspect it would be possible to write a uv in Python (PyUv?) that approaches the speeds of uv, and in fact much of the blog post goes on to talk about performance improvements that aren’t related to Rust. pip’s slowness isn’t a failure of implementation. For years, Python packaging required executing code to find out what a package needed. I didn’t know this about Python packages, and it doesn’t really apply to Ruby Gems, so I’m mostly going to skip this section. Ruby Gems are tar files, and one of the files in the tar file is a YAML representation of the GemSpec. This YAML file declares all dependencies for the Gem, so RubyGems can know, without evaling anything, what dependencies it needs to install before it can install any particular Gem. Additionally, RubyGems.org provides an API for asking about dependency information, which is actually the normal way of getting dependency info (again, no eval required). There’s only one other thing from this section I’d like to quote: PEP 658 (2022) put package metadata directly in the Simple Repository API, so resolvers could fetch dependency information without downloading wheels at all. Fortunately RubyGems.org already provides the same information about gems. Reading through the number of PEPs required, as well as the amount of time it took to get the standards in place, was very eye-opening for me. I can’t help but applaud folks in the Python community for doing this. It seems like a mountain of work, and they should really be proud of themselves. I’m mostly going to skip this section except for one point: Ignoring requires-python upper bounds. When a package says it requires python<4.0, uv ignores the upper bound and only checks the lower. This reduces resolver backtracking dramatically since upper bounds are almost always wrong. Packages declare python<4.0 because they haven’t tested on Python 4, not because they’ll actually break. The constraint is defensive, not predictive. I think this is very very interesting.
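The same trick can be sketched against RubyGems’ own requirement objects. The `relax_upper_bounds` helper below is hypothetical (not existing Bundler code): it drops `<` and `<=` constraints and keeps only the lower bounds, leaving pessimistic (`~>`) constraints alone for simplicity:

```ruby
require "rubygems" # Gem::Requirement / Gem::Version ship with Ruby

# Hypothetical helper: ignore upper-bound constraints, the way uv ignores
# requires-python upper bounds during resolution.
def relax_upper_bounds(requirement)
  kept = requirement.requirements.reject { |op, _v| op == "<" || op == "<=" }
  return Gem::Requirement.default if kept.empty? # ">= 0"
  Gem::Requirement.new(kept.map { |op, v| "#{op} #{v}" })
end

strict  = Gem::Requirement.new([">= 2.6", "< 4.0"])
relaxed = relax_upper_bounds(strict)
ruby_4  = Gem::Version.new("4.1")

strict.satisfied_by?(ruby_4)  # => false
relaxed.satisfied_by?(ruby_4) # => true
```

Whether skipping the check buys Bundler anything measurable is exactly the open question above.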
I don’t know how much time Bundler spends on doing “required Ruby version” bounds checking, but it feels like if uv can do it, so can we. I really love that Andrew pointed out optimizations that could be made that don’t involve Rust. There are three points in this section that I want to pull out: Parallel downloads. pip downloads packages one at a time. uv downloads many at once. Any language can do this. This is absolutely true, and is a place where Bundler could improve. Bundler currently has a problem when it comes to parallel downloads, and needs a small architectural change as a fix. The first problem is that Bundler tightly couples installing a gem with downloading the gem. You can read the installation code here . The problem with the method in question is that it inextricably links downloading the gem with installing it. This is a problem because we could be downloading gems while installing other gems, but we’re forced to wait because the installation method couples the two operations. Downloading gems can trivially be done in parallel since the files are just archives that can be fetched independently. The second problem is the queuing system in the installation code. After gem resolution is complete, and Bundler knows what gems need to be installed, it queues them up for installation. You can find the queueing code here . The code takes some effort to understand. Basically it allows gems to be installed in parallel, but only gems that have already had their dependencies installed. So for example, if you have a linear chain of three gems, each depending on the next, then no gems will be installed (or downloaded) in parallel. To demonstrate this problem in an easy-to-understand way, I built a slow Gem server . It generates a three-gem dependency chain, then starts a Gem server.
The Gem server takes 3 seconds to return any Gem, so if we point Bundler at this Gem server and then profile Bundler, we can see the impact of the queueing system and download scheme. In my test app, the Gemfile depends on the chained gems. If we profile bundle install with Vernier, we can see the swim lanes in the marker chart. The chart shows that we get no parallelism during installation. We spend 3 seconds downloading the first gem, then we install it. Then we spend 3 seconds downloading the second gem, then we install it. Finally we spend 3 seconds downloading the third gem, and we install it. Timing the process shows we take over 9 seconds to install (3 seconds per gem). Contrast this with a Gemfile containing three gems that have no dependencies on each other, but still take 3 seconds each to download. Timing for that Gemfile shows it takes about 4 seconds. We were able to install the same number of gems in a fraction of the time. This is because Bundler is able to download siblings in the dependency tree in parallel, but unable to handle other relationships. There is actually a good reason that Bundler insists dependencies are installed before the gems themselves: native extensions. When installing native extensions, the installation process must run Ruby code (the extconf.rb file). Since the extconf.rb could require dependencies be installed in order to run, we must install dependencies first. For example, a native extension might depend on a build-time gem that is only used during the installation process, so that gem needs to be installed before the extension can be compiled and installed. However, if we were to decouple downloading from installation it would be possible for us to maintain the “dependencies are installed first” business requirement but speed up installation. In the chained case, we could have been downloading the later gems at the same time as the first gem (or even while waiting on earlier gems to be installed). Additionally, pure Ruby gems don’t need to execute any code on installation.
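As an aside, the claim that downloads can trivially run in parallel is easy to demonstrate in plain Ruby. This sketch simulates the slow-server experiment above; `fake_download` is a hypothetical stand-in for a real HTTP fetch of a .gem archive:

```ruby
# Simulated slow gem server: each "download" blocks for ~0.3s. Three
# sequential fetches would take ~0.9s; one thread per fetch takes ~0.3s,
# because Ruby's GVL is released while a thread sleeps, just as it is
# released during blocking I/O.
def fake_download(name)
  sleep 0.3
  "#{name}.gem contents"
end

start    = Process.clock_gettime(Process::CLOCK_MONOTONIC)
threads  = %w[a b c].map { |name| Thread.new { fake_download(name) } }
archives = threads.map(&:value) # Thread#value joins and returns the result
elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start

archives # => ["a.gem contents", "b.gem contents", "c.gem contents"]
```

The same thread-per-download pattern applies to real archive fetches, since blocking network I/O also releases the GVL.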
If we knew that we were installing a pure Ruby gem, it would be possible to relax the “dependencies are installed first” business requirement and get even more performance increases. The above case could install all three gems in parallel since none of them execute Ruby code during installation. I would propose we split installation into 4 discrete steps: download the gem, unpack the gem, compile the gem, and install the gem. Downloading and unpacking can be done trivially in parallel. We should unpack the gem to a temporary folder so that if the process crashes or the machine loses power, the user isn’t stuck with a half-installed gem. After we unpack the gem, we can discover whether the gem is a native extension or not. If it’s not a native extension, we “install” the gem simply by moving the temporary folder to the “correct” location. This step could even be a “hard link” step as discussed in the next point. If we discover that the gem is a native extension, then we can “pause” installation of that gem until its dependencies are installed, then resume (by compiling) at an appropriate time. Side note: a Bundler alternative exists that works mostly in this manner today. Here is a timing of the same case from above. Let’s move on to the next point: Global cache with hardlinks. pip copies packages into each virtual environment. uv keeps one copy globally and uses hardlinks I think this is a great idea, but I’d actually like to split the idea in two. First, RubyGems and Bundler should have a combined, global cache, full stop. I think that global cache should live in a single shared location, and we should store files there when they are downloaded. Currently, both Bundler and RubyGems will use a Ruby version specific cache folder. In other words, if you install Rails on two different versions of Ruby, you get two copies of Rails and all its dependencies. Interestingly, there is an open ticket to implement this , it just needs to be done. The second point is hardlinking on installation.
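Hard linking itself is cheap to experiment with from Ruby’s standard library. This sketch (with made-up paths and file contents) unpacks a gem once into a shared cache and hard links it into a per-Ruby-version directory:

```ruby
require "fileutils"
require "tmpdir"

links = Dir.mktmpdir do |root|
  # One unpacked copy in a hypothetical shared cache...
  cache = File.join(root, "cache", "rack-3.1.8")
  FileUtils.mkdir_p(cache)
  File.write(File.join(cache, "rack.rb"), "module Rack; end\n")

  # ...hard linked into a per-Ruby-version gems directory instead of
  # unpacking (or copying) a second time.
  dest = File.join(root, "3.4.0", "gems", "rack-3.1.8")
  FileUtils.mkdir_p(dest)
  Dir.children(cache).each do |f|
    FileUtils.ln(File.join(cache, f), File.join(dest, f))
  end

  # Both names point at the same inode -- only one copy on disk.
  File.stat(File.join(dest, "rack.rb")).nlink
end

links # => 2
```

A real implementation would also need to fall back to copying when the cache and destination sit on different filesystems, since hard links can’t cross filesystem boundaries.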
The idea here is that rather than unpacking the gem multiple times, once per Ruby version, we simply unpack once and then hard link per Ruby version. I like this idea, but I think it should be implemented after some technical debt is paid: namely, implementing a global cache and unifying Bundler / RubyGems code paths. On to the next point: PubGrub resolver Actually Bundler already uses a Ruby implementation of the PubGrub resolver. You can see it here . Unfortunately, RubyGems still uses the molinillo resolver . In other words, you use a different resolver depending on whether you run bundle install or gem install. I don’t really think this is a big deal since the vast majority of users will be doing bundle install most of the time. However, I do think this discrepancy is some technical debt that should be addressed, and I think this should be addressed via unification of the RubyGems and Bundler codebases (today they both live in the same repository, but the code isn’t necessarily combined). Let’s move on to the next section of Andrew’s post: Andrew first mentions “Zero-copy deserialization”. This is of course an important technique, but I’m not 100% sure where we would utilize it in RubyGems / Bundler. I think that today we parse the YAML spec on installation, and that could be a target. But I also think we could install most gems without looking at the YAML gemspec at all. Thread-level parallelism. Python’s GIL forces parallel work into separate processes, with IPC overhead and data copying. This is an interesting point. I’m not sure what work pip needed to do in separate processes. Installing a pure Ruby gem is mostly an IO-bound task, with some zlib processing mixed in. Both of these things (IO and zlib processing) release Ruby’s GVL, so it’s possible for us to do things truly in parallel. I imagine this is similar for Python / pip, but I really have no idea. Given the stated challenges with Python’s GIL, you might wonder whether Ruby’s GVL presents similar parallelism problems for Bundler.
I don’t think so, and in fact I think Ruby’s GVL gets kind of a bad rap. It prevents us from running CPU-bound Ruby code in parallel. Ractors address this, and Bundler could possibly leverage them in the future, but since installing gems is mostly an IO-bound task I’m not sure what the advantage would be (possibly the version solver, but I’m not sure what can be parallelized in there). The GVL does allow us to run IO-bound work in parallel with CPU-bound Ruby code. CPU-bound native extensions are allowed to release the GVL , allowing Ruby code to run in parallel with the native extension’s CPU-bound code. In other words, Ruby’s GVL allows us to safely run work in parallel. That said, the GVL can work against us because releasing and acquiring the GVL takes time . If you have a system call that is very fast, releasing and acquiring the GVL could end up being a large percentage of that call. For example, if you do a read with a very small buffer, you could encounter a situation where GVL bookkeeping is the majority of the time. A bummer is that Ruby gem packages usually contain lots of very small files, so this problem could be impacting us. The good news is that this problem can be solved in Ruby itself, and indeed some work is being done on it today . No interpreter startup. Every time pip spawns a subprocess, it pays Python’s startup cost. Obviously Ruby has this same problem. That said, we only start Ruby subprocesses when installing native extensions. I think native extensions make up the minority of gems installed, and even when installing a native extension, it isn’t Ruby startup that is the bottleneck. Usually the bottleneck is compilation / linking time (as we’ll see in the next post). Compact version representation. uv packs versions into u64 integers where possible, making comparison and hashing fast. This is a cool optimization, but I don’t think it’s actually Rust specific. Comparing integers is much faster than comparing version objects.
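As a rough illustration of the technique in Ruby (a made-up layout with 16 bits each for minor and patch, not uv’s actual encoding):

```ruby
# Pack "major.minor.patch" into one integer: major in the high bits,
# then 16 bits each for minor and patch. Plain integer comparison then
# matches version ordering for any version whose segments fit in 16 bits.
def pack_version(str)
  major, minor, patch = str.split(".").map(&:to_i)
  (major << 32) | ((minor || 0) << 16) | (patch || 0)
end

pack_version("1.10.0") > pack_version("1.9.9") # => true
```

Prerelease tags and versions with more segments would need a richer scheme, which is presumably where uv’s real encoding gets interesting.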
The idea is that you take a version number and pack each of its parts into a single integer, with, for example, the major, minor, and patch numbers each occupying a fixed range of bits. It should be possible to use this trick in Ruby and encode versions as integer immediates, which would unlock performance in the resolver. Rust has an advantage here: compiled native code comparing u64s will always be faster than Ruby, even with immediates. However, I would bet that with YJIT or ZJIT in play, this gap could be closed enough that no end user would notice the difference between a Rust or Ruby implementation of Bundler.

I started refactoring the version object so that we might start doing this, but we ended up reverting it because of backwards compatibility (I am jealous of uv in that regard). I think the right way to do this is to refactor the solver entry point and ensure all version requirements are encoded as integer immediates before entering the solver. We could keep the existing API as the “user facing” one and design a more internal API that the solver uses.

I am very interested in reading the version encoding scheme in uv. My intuition is that minor numbers tend to get larger than major numbers, so would minor numbers get more dedicated bits? Would it even matter with 64 bits?

I’m going to quote Andrew’s last 2 paragraphs:

uv is fast because of what it doesn’t do, not because of what language it’s written in. The standards work of PEP 518, 517, 621, and 658 made fast package management possible. Dropping eggs, pip.conf, and permissive parsing made it achievable. Rust makes it a bit faster still.

pip could implement parallel downloads, global caching, and metadata-only resolution tomorrow. It doesn’t, largely because backwards compatibility with fifteen years of edge cases takes precedence. But it means pip will always be slower than a tool that starts fresh with modern assumptions.

I think these are very good points.
The difference is that in RubyGems and Bundler, we already have the infrastructure in place for writing a “fast as uv” package manager. The difficult part is dealing with backwards compatibility and navigating two legacy codebases. I think this is the real advantage the uv developers had. That said, I am very optimistic that we could “repair the plane mid-flight”, so to speak, and have the best of both worlds: backwards compatibility and speed.

I mentioned at the top of the post that I would address “rewrite it in Rust”, and I think Andrew’s own quote mostly does that for me. I think we could have 99% of the performance improvements while still maintaining a Ruby codebase. Of course if we rewrote it in Rust, you could squeeze out an extra 1%, but would it be worthwhile? I don’t think so.

I have a lot more to say about this topic, and I feel like this post is getting kind of long, so I’m going to end it here. Please look out for part 2, which I’m tentatively calling “What makes Bundler / RubyGems slow?” This post was very much “can we make RubyGems / Bundler do what uv does?” (the answer is “yes”). In part 2 I want to get more hands-on by discussing how to profile Bundler and RubyGems, what specifically makes them slow in the real world, and what we can do about it.

I want to end this post by saying “thank you” to Andrew for writing such a great post about how uv got so fast.

Download the gem
Unpack the gem
Compile the gem
Install the gem
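As a footnote to the “unpack” step above: the hard-link idea from the top of the post would replace per-Ruby-version unpacking with linking out of a shared cache. A rough sketch, with invented paths and helper names rather than the real RubyGems layout:

```ruby
require "fileutils"

# Sketch: a gem is unpacked once into a shared cache directory, and each
# Ruby version's install directory then hard links those files instead of
# unpacking the archive again. Hard links copy no file data, so the second
# and later "installs" are nearly free.
def link_unpacked_gem(cache_dir, gem_full_name, install_dir)
  src_root = File.join(cache_dir, gem_full_name)
  Dir.glob("**/*", base: src_root).each do |rel|
    src = File.join(src_root, rel)
    dst = File.join(install_dir, rel)
    if File.directory?(src)
      FileUtils.mkdir_p(dst)
    else
      FileUtils.mkdir_p(File.dirname(dst))
      FileUtils.ln(src, dst, force: true) # hard link, no data copied
    end
  end
end
```

One caveat the real work would have to handle: hard links require cache and install directories to live on the same filesystem, which is part of why a unified global cache comes first.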
