Posts in Python (20 found)
Xe Iaso Yesterday

First look at the DGX Spark

Disclaimer: I'm considering this post as a sponsored post. I was not paid by NVIDIA to work on this, but I did receive a DGX Spark from them pre-release and have been diligently testing it and filing bugs.

I've had access to the NVIDIA DGX Spark for over a month now. Today I'm gonna cover my first impressions and let you know what I've been up to with it. In a nutshell, this thing is a beast. It's one of the most powerful devices in my house and in a pinch I'd be okay with using it as my primary workstation. It's got a mix of a CPU that's got enough punch to do software development with a GPU that's in that sweet spot between consumer and datacenter tier. Not to mention 128Gi of RAM. When I've been using this thing, the main limit is my imagination…and my poor understanding of Python environment management.

I think that it's best to understand the DGX Spark as a devkit for NVIDIA's Grace datacentre processors. It's incredibly powerful for what it is: a device that can fit on your desk and run AI models right there.

A DGX Spark on top of a desk with typical computer things around it such as a laptop, a coffee mug, a keyboard, and a mouse.

The DGX Spark is tiny. It's about as wide as the screen of a Steam Deck OLED, or about halfway between the size of a Mac mini M1 and a Mac mini M4.

This thing is also stupidly power efficient. I've been unable to cause my office to get warm in a way that is attributable to the DGX Spark alone. On average, rendering Final Fantasy 14 in one of the major player hub areas ends up making my tower use more power than the DGX Spark does while doing AI finetuning. I'll talk more about this in the future.

One of the most interesting things about this device is that it's got an Arm chip, CUDA, and unified RAM. This combination means that in practice you need to compile all of the Python packages you're using from source. Pip usually handles this well enough, but it does mean that for many Python packages it will take longer to install from source than it will on an x86 system. I assume this will be ironed out as the ecosystem matures.

The power efficiency can't be overstated though. I've tried to make my office warm using the DGX Spark and I have failed. I'm seeing it pull a maximum of 70 watts. I get about 30-40 tokens per second with gpt-oss:120b:

The user asks: "Summarize this post: --- ...". So we need to provide a summary. Follow developer instructions: be polite, bubbly, lowercase when possible unless serious. No JSON. Summarize the content. The post is long, covering themes: loss of craft, historical analogy with weavers and Luddites, decline of senior titles, AI replacing expertise, vibe coding, security concerns, AI's impact on art, etc. Provide concise summary. Use informal tone, lowercase. End without period unless serious. Probably okay.

this post is a sprawling meditation on how we keep “rolling the ladder up behind us” – building a future that erodes the very expertise that made it possible. it starts with a historic look at cloth‑making: skilled weavers were displaced by power looms, turning a durable, treasured craft into cheap, disposable fast‑fashion. the author draws a parallel to today’s software world, warning that we’re replacing senior developers and deep craft with hype‑driven AI tools and “vibe coding” platforms that promise instant results but degrade long‑term quality.

key points:
craft is disappearing – both in weaving and coding, the knowledge of masters is vanishing, leaving only fragmented R&D notes or AI‑generated shortcuts.
senior titles are at risk – companies favor hiring senior talent without nurturing the next generation, so the pool of true “seniors” will run out.
AI as a double‑edged sword – generative tools can reduce drudgery but are being marketed as the next industrial revolution while actually shifting value to owners and creating insecure, low‑quality products.
vibe coding & AI assistants – slick UX masks the fact that many tools are subscription traps, security hazards, and can erode programmers’ skills.
artistic impact – similar to how AI floods art spaces with cheap, low‑effort outputs, software development risks becoming a flood of “good enough” code.
security concerns – model‑context‑protocol servers can expose secrets and run unchecked code, highlighting the need for sandboxed, capability‑based designs.
broader societal worry – the author (also the CEO of a small AI‑security startup) sees a winner‑take‑all capitalism fueled by AI, with the cost falling on workers, artists, and even the environment.

the piece ends with a bleak outlook: without deliberate effort to preserve and pass on true craft, future generations will inherit ladders that only go up for the privileged few, leaving the rest scrambling. it’s a call to value high‑quality, human‑made work over quick, AI‑generated shortcuts.

With flash attention on gpt-oss:120b at a 128k context window, it uses about 70Gi of RAM: I assume that the unaccounted 4Gi or so of RAM is in the CPU RAM overhead of the Ollama model runner process.

So far I've been using the Spark in place of cloud GPUs for every AI thing I've needed to do at work. In general, I haven't really noticed any differences between the GPU in the cloud and the Spark on my home network. The only real rough edge is that I need to use this one blessed NVIDIA-authored Docker image to run iPython notebooks. It's easy enough though. Usually my Docker command looks like:

And then it Just Works™.

The main thing I've been doing with it is inference of GPT-OSS 120b via Ollama. I've been doing latency and power usage testing by setting up a Discord bot and telling people that the goal is to jailbreak the bot into telling you how to make a chocolate cake. Nobody has been able to make my room warm.

This whole experience has been a bit of a career bucket list item for me. I've never had access to prerelease hardware like this before, and being able to see what reviewers have to deal with before things are available to the masses is enlightening. I've ended up filing GPU driver bugs using my tower as a "known good" reference.

I've been slowly sinking my teeth into learning how AI training actually works, using this device to do it. I've mostly been focusing on finetuning GPT-2 and using that to learn the important parts of dataset cleaning, tokenization, and more. Let me know if you want to hear more about that and if you want me to release my practice models.

At the very least though, here are the things I have in the pipeline that this device enables:

Finetuning at home: how to make your own AI models do what you want
Some rough outlines and/or overviews for how I want to use classical machine learning models to enhance Anubis and do outlier detection
If I can somehow get Final Fantasy 14 running on it, some benchmarking in comparison to my gaming tower (if you know how to get amd64 games running well on aarch64, DM me!)

I also plan to make a comprehensive review video. Details to be announced soon. I hope this was interesting. Thanks for early access to the device NVIDIA!
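A quick aside on the tokens-per-second figure quoted above: Ollama's non-streaming API responses include eval_count and eval_duration fields, so decode throughput can be estimated with a short script like the sketch below. This is a guess at one way to measure it, not necessarily how the author did; the model name and prompt are only examples.

```python
# Rough sketch: estimate decode tokens/sec from Ollama's /api/generate response.
# Assumes an Ollama server on the default localhost:11434 and that the model
# (here gpt-oss:120b, as an example) is already pulled.
import json
import urllib.request

payload = json.dumps({
    "model": "gpt-oss:120b",
    "prompt": "Write a two-sentence story about a teapot.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_count is the number of generated tokens, eval_duration is in nanoseconds.
tokens_per_second = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"{tokens_per_second:.1f} tokens/sec")
```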

0 views

JIT: so you want to be faster than an interpreter on modern CPUs…

Since my previous blog entry about my JIT compiler for PostgreSQL, sadly not much has happened due to a lack of time, but still some things were done (the biggest improvement was the port to ARM64, a few optimizations, implementing more opcodes…). But I am often asking myself how to really beat the interpreter… And on “modern” CPUs, with a well written interpreter, that's far more complicated than many would imagine. So in order to explain all this and show how I am planning to improve performance (possibly of the interpreter itself too, thus making this endeavor self-defeating), let's first talk about…

If you already know about all the topics mentioned in this title, feel free to jump to the next section. Note that the following section is over-simplified to make the concepts more accessible.

I am writing this blog post on a Zen 2+ CPU. If I upgraded to a Zen 3 CPU, same motherboard, same memory, I would get an advertised 25% performance jump in single thread benchmarks while the CPU frequency would be only 2% higher. Why such a discrepancy?

Since the 90s and the Pentium-class CPUs, x86 has followed RISC CPUs into the super-scalar era. Instead of running one instruction per cycle, when conditions are right, several instructions can be executed at the same time. Let's consider the following pseudo-code:

X and Y can be calculated at the same time. The CPU can execute these on two integer units, fetch the results and store them. The only issue is the computation of Z: everything must be done before this step, making it impossible for the CPU to go further without waiting for the previous results. But now, what if the code was written as follows:

Every step would require waiting for the previous one, slowing down the CPU terribly. Hence the most important technique used to implement superscalar CPUs: out-of-order execution. The CPU will fetch the instructions, dispatch them into several instruction queues, and resolve dependencies to compute Y before computing Z1 in order to have it ready sooner. The CPU is spending less time idling, thus the whole thing is faster.

But, alas, what would happen with the following function? Should the CPU wait for X and Y before deciding which Z to compute? Here is the biggest trick: it will try its luck and compute something anyway. This way, if its bet was right, a lot of time will be saved; otherwise the mistaken result will be dropped and the proper computation will be done instead. This is called branch prediction. It has been the source of many fun security issues (hello Meltdown), but the performance benefits are so huge that one would never consider disabling it.

Most interpreters will operate on an intermediate representation, using opcodes instead of directly executing from an AST or similar. So you could use the following main loop for an interpreter:

This is how many, many interpreters were written. But this has a terrible drawback, at least when compiled that way: it has branches all over the place from a single starting point (most if not all optimizing compilers will generate a jump table to optimize the dispatch, but this will still jump from the same point). The CPU will have a hard time predicting the right jump, and thus loses a lot of performance. If this was the only way an interpreter could be written, generating a function by stitching the code together would save a lot of time, likely giving a more than 10% performance improvement. If one looks at Python, removing this switch made the interpreter 15 to 20% faster.
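Before moving on to how projects avoid this, here is a toy bytecode interpreter in Python that shows the shape being described: a single big dispatch point that branches on every opcode. The real PostgreSQL executor is C, and the branch-prediction effects discussed above only show up at that level, so treat this purely as an illustration of the structure, not as anything performance-relevant.

```python
# Toy illustration only: one dispatch point that branches on every opcode,
# the structure described above as hard on the branch predictor in C.
def run(program, stack=None):
    stack = stack or []
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "PUSH_CONST":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "EQ":
            b, a = stack.pop(), stack.pop()
            stack.append(a == b)
        elif op == "DONE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1

# Evaluates "40 + 2 = 42".
print(run([("PUSH_CONST", 40), ("PUSH_CONST", 2), ("ADD", None),
           ("PUSH_CONST", 42), ("EQ", None), ("DONE", None)]))  # True
```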
Many projects, including PostgreSQL, use this same technique, called “computed gotos”. After a first pass to fill in “label” targets in each step, the execution would be:

When running a short sequence of operations in a loop, the jumps will be far more predictable, making the branch predictor's job easier, and thus improving the speed of the interpreter.

Now that we have a very basic understanding of modern CPUs and the insane level of optimization they reach, let's talk about fighting the PostgreSQL interpreter on performance. I will not discuss optimizing the tuple deforming part (aka going from the on-disk structure to the “C” structure used by the code); this will be a topic for a future blog post when I implement it in my compiler.

As you may know, PostgreSQL has a very complete type system with operator overloading. Even this simple query ends up being a call to int4eq, a strict function that will perform the comparison. Since it is a strict function, PostgreSQL must check that the arguments are not null, otherwise the function is not called and the result will be null. If you execute a very basic query like the one in the title, PostgreSQL will have the following opcodes:

The EEOP_FUNCEXPR_STRICT_2 will perform the null check, and then call the function. If we unroll all the opcodes in real C code, we end up with the following:

We can already spot one optimization: why do we check the two arguments, including our constant, against null? It will never change for the entire run of this query, and thus each comparison is going to use an ALU and branch depending on that comparison. But of course the CPU will notice the corresponding branch pattern, and will thus be able to remain active and feed its other units. What is the real cost of such a pointless comparison? For this purpose, I've broken a PostgreSQL instance and replaced all FUNCEXPR_STRICT with a check on one argument only, and one with no STRICT check (do not try this at home!). Running a simple SELECT * FROM demo WHERE a = 42 ten times on a 100-million-row table, with no index, here are the two perf results:

So, even if this is not the optimization of the century, it's not that expensive to make, so… why not do it? (Patch coming to pgsql-hackers soon.)

But a better optimization is to go all-in on inlining. Indeed, instead of jumping through a pointer to the int4eq code (again, something that the CPU will optimize a lot), one could have a special opcode for this quite common operation. With this change alone (but keeping the two null checks, so there are still optimizations possible), we end up with the following perf results:

Let's sum up these results. The biggest change comes, quite obviously, from inlining the int4eq call. Why is it that much better? Because it reduces by quite a lot the number of instructions to run, and it removes a call to an address stored in memory. And this is again an optimization I could do in my JIT compiler that can also be done on the interpreter with the same benefits. The biggest issue here is that you must keep the number of opcodes within (unspecified) limits: too many opcodes could make the compiler's job far worse.

Well. At first, I thought the elimination of null checks could not be implemented easily in the interpreter. The first draft in my compiler was certainly invalid, but gave me interesting numbers (around 5%, as seen above) and made me want to go ahead.
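Carrying the toy Python interpreter from earlier a little further, the int4eq-style specialization can be pictured as adding one fused opcode that does the comparison directly, instead of going through a generic "null-check both arguments, then call a strict function through a pointer" opcode. Again, this is only a sketch of the idea; the real change lives in PostgreSQL's C executor, and the opcode names here are made up.

```python
# Toy sketch of the idea: a generic strict-function-call opcode versus a
# specialized fused opcode for the common "column = integer constant" case.
def int4eq(a, b):
    return a == b

def run(program, row):
    stack, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "LOAD_COLUMN":
            stack.append(row[arg])
        elif op == "FUNCEXPR_STRICT":
            # Generic path: null-check both arguments, then call through a pointer.
            func, const = arg
            val = stack.pop()
            stack.append(None if val is None or const is None else func(val, const))
        elif op == "EQ_COLUMN_CONST_INT":
            # Specialized path: the constant only needs checking once at plan time,
            # so only the column value gets a null check and the comparison is
            # done inline with no indirect call.
            val = stack.pop()
            stack.append(None if val is None else val == arg)
        elif op == "DONE":
            return stack.pop()
        pc += 1

row = {"a": 42}
generic = [("LOAD_COLUMN", "a"), ("FUNCEXPR_STRICT", (int4eq, 42)), ("DONE", None)]
fused = [("LOAD_COLUMN", "a"), ("EQ_COLUMN_CONST_INT", 42), ("DONE", None)]
print(run(generic, row), run(fused, row))  # True True
```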
And I realized that implementing it cleanly in the interpreter was far easier than implementing it in my JIT compiler… Then I went with optimizing another common case, the call to int4eq, and, well… One could also add an opcode for that in the interpreter, and thus the performance gains of the JIT compiler are going to be minimal compared to the interpreter.

Modern CPUs don't make my job easy here. Most of the cost of an interpreter is taken away by the branch predictor and the other optimizations implemented in silicon. So is all hope lost? Am I to declare the interpreter the winner against the limitations of the copy-patch method I have available for my JIT? Of course not, see you in the next post to discuss the biggest interpreter bottleneck!

PS: help welcome. Last year I managed to spend some time working on this during my work time. Since then I've changed jobs, and can hardly get any time for this. I also tried to get some sponsoring to work on this and present at future PostgreSQL conferences, with no luck :/ If you can help in any way on this project, feel free to reach out to me (code contributions, sponsoring, missions, job offers, nudge nudge wink wink). Since I've been alone on this, a lot of things are scribbles on scratch paper; I benchmark code and stuff in my head when life gives me some boring time, but testing it for real is of course far better. I have some travel planned soon, so I hope the next part will be released before next year, with interesting results since my experiments have been as successful as anticipated.

0 views
Justin Duke 5 days ago

Adding imports to the Django shell

I was excited to finally remove from my file when 5.2 dropped because they added support for automatic model import. However, I found myself missing one other little escape hatch that exposed, which was the ability to import other arbitrary modules into the namespace. Django explains how to bring in modules without a namespace, but I wanted to be able to inoculate my shell, since most of my modules follow a similar structure (exposing a single function). It took the bare minimum of sleuthing to figure out how to hack this in for myself, and now here I am to share that sleuthing with you. Behold, a code snippet that is hopefully self-explanatory:
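The snippet itself didn't survive the formatting here, but the general shape is a custom shell command that extends Django 5.2's automatic imports. The sketch below assumes the documented get_auto_imports() hook on the built-in shell command; the dotted module paths are made-up placeholders.

```python
# myapp/management/commands/shell.py
# A sketch (not the post's exact snippet): extend Django 5.2's shell
# auto-imports with extra modules via the get_auto_imports() hook.
from django.core.management.commands import shell


class Command(shell.Command):
    def get_auto_imports(self):
        # Placeholder dotted paths; swap in your own modules.
        return super().get_auto_imports() + [
            "myproject.billing",
            "myproject.emails",
        ]
```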

0 views
Simon Willison 6 days ago

Claude can write complete Datasette plugins now

This isn't necessarily surprising, but it's worth noting anyway. Claude Sonnet 4.5 is capable of building a full Datasette plugin now. I've seen models complete aspects of this in the past, but today is the first time I've shipped a new plugin where every line of code and test was written by Claude, with minimal prompting from myself.

The plugin is called datasette-os-info. It's a simple debugging tool - all it does is add a JSON page which dumps out as much information as it can about the OS it's running on. Here's a live demo on my TIL website. I built it to help experiment with changing the Docker base container that Datasette uses to publish images to one that uses Python 3.14.

Here's the full set of commands I used to create the plugin. I started with my datasette-plugin cookiecutter template:

This created a directory with the initial and and files. Here's an example of that starter template. I created a virtual environment for it, installed the initial test dependencies and ran to check that worked:

Then I fired up Claude Code in that directory in YOLO mode: (I actually used my shortcut which runs the above.)

Then, in Claude, I told it how to run the tests: When that worked, I told it to build the plugin: ... and that was it!

Claude implemented the plugin using Datasette's register_routes() plugin hook to add the page, and then without me prompting it to do so built this basic test as well. It ran the new test, spotted a bug (it had guessed a non-existent parameter), fixed the bug and declared itself done.

I built myself a wheel: Then uploaded that to an S3 bucket and deployed it to test it out using . It did exactly what I had hoped - here's what that page looked like:

I dread to think how long it would have taken me to figure out all of those different useful OS details!

To finish up, I had Claude fill out the README via a couple of extra prompts: Then had a brief conversation about how safe the output was: Claude said:

Yes, there are several potentially sensitive items in the output:

Most sensitive:
hostname - Could reveal internal naming schemes or organization details
Environment variables:
HOME - Reveals user's home directory path
USER - Reveals the username
PATH - Reveals filesystem structure and software installation locations
VIRTUAL_ENV - Could reveal project paths and directory structure
python_executable - Reveals filesystem paths and user directories

Moderately sensitive:
CPU count and memory info - Could reveal infrastructure capacity
Kernel version details - Could potentially be used to identify specific vulnerabilities

Considerations:
This endpoint is accessible without authentication by default (at /-/os)
If you're running Datasette on a public server, this information would be visible to anyone
The information could help attackers understand your infrastructure setup

Suggestions: You might want to:
Add authentication requirements to this endpoint
Make certain fields optional/configurable
Add a warning in the README that this exposes system information
Consider filtering out or redacting sensitive paths (replace usernames with ***, etc.)

Would you like me to add any of these protections to the plugin?

In this case I'm fine with the current behavior provided people understand the risks, so I told it: And then (because the warning it added was quite long): And requested an extra README note: Three last prompts: ... and that was the project finished.

I pushed it to GitHub, configured Trusted Publishing for it on PyPI and posted the 0.1 release, which ran this GitHub Actions publish.yml and deployed that release to datasette-os-info on PyPI.

Now that it's live you can try it out without even installing Datasette using a one-liner like this:

That's using the CLI option to show what that path in the Datasette instance would return, as described in the Datasette documentation. I've shared my full Claude Code transcript in a Gist.

A year ago I'd have been very impressed by this. Today I wasn't even particularly surprised that this worked - the coding agent pattern implemented by Claude Code is spectacularly effective when you combine it with pre-existing templates, and Datasette has been around for long enough now that plenty of examples of plugins have made it into the training data for the leading models.
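For readers who haven't written a Datasette plugin before, the register_routes() hook mentioned above looks roughly like the sketch below. This is not the datasette-os-info source; the route path and the handful of fields returned are illustrative only.

```python
# A minimal sketch of a Datasette plugin in the same shape as the post describes:
# register_routes() adds a JSON endpoint. Not the actual datasette-os-info code.
import os
import platform
import sys

from datasette import hookimpl
from datasette.utils.asgi import Response


async def os_info(request):
    # Dump a few easily available OS details as JSON.
    return Response.json(
        {
            "platform": platform.platform(),
            "python_version": sys.version,
            "python_executable": sys.executable,
            "cpu_count": os.cpu_count(),
        }
    )


@hookimpl
def register_routes():
    # Serve the page at /-/os-example (the real plugin's path may differ).
    return [(r"^/-/os-example$", os_info)]
```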

1 view

Python 3.14 Is Here. How Fast Is It?

In November of 2024 I wrote a blog post titled "Is Python Really That Slow?" , in which I tested several versions of Python and noted the steady progress the language has been making in terms of performance. Today is the 8th of October 2025, just a day after the official release of Python 3.14. Let's rerun the benchmarks to find out how fast the new version of Python is!

0 views
Sean Goedecke 1 week ago

GPT-5-Codex is a better AI researcher than me

In What’s the strongest AI model you can train on a laptop in five minutes? I tried my hand at answering a silly AI-research question. You can probably guess what it was. I chatted with GPT-5 to help me get started with the Python scripts and to bounce ideas off, but it was still me doing the research. I was coming up with the ideas, running the experiments, and deciding what to do next based on the data. The best model I could train was a 1.8M param transformer which produced output like this:

Once upon a time, there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time.

Since then, OpenAI has released GPT-5-Codex, and supposedly uses it (plus Codex, their CLI coding tool) to automate a lot of their product development and AI research. I wanted to try the same thing. Codex-plus-me did a much better job than me alone 1. Here’s an example of the best output I got from the model I trained with Codex:

Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it.

What was the process like to get there? I want to call it “vibe research”. Like “vibe coding”, it’s performing a difficult technical task by relying on the model. I have a broad intuitive sense of what approaches are being tried, but I definitely don’t have a deep enough understanding to do this research unassisted. A real AI researcher would get a lot more out of the tool. Still, it was very easy to get started. I gave Codex the path to my scratch directory, told it “continue the research”, and it immediately began coming up with ideas and running experiments on its own. In a way, the “train in five minutes” challenge is a perfect fit, because the feedback loop is so short.

The basic loop of doing AI research with Codex (at least as an enthusiastic amateur) looks something like this:

1. Codex makes a change to the training script and does three or four runs (this takes ~20 minutes overall)
2. Based on the results, Codex suggests two or three things that you could try next
3. I pick one of them (or very occasionally suggest my own idea) and return to (1).

After two days I did paste the current research notes into GPT-5-Pro, which helped a bit, but the vast majority of my time was spent in this loop. As we’ll see, the best ideas were ones Codex already came up with.

I chewed through a lot of tokens doing this. That’s OK with me, since I paid for the $200-per-month plan 2, but if you don’t want to do that you’ll have to space out your research a bit more. I restarted my Codex process every million tokens or so. It didn’t have any issue continuing where it left off from its previous notes, which was nice.

I ran Codex with . By default it didn’t have access to MPS, which meant it could only train models on the CPU. There’s probably some more principled way of sandboxing it, but I didn’t bother to figure it out. I didn’t run into any runaway-agent problems, unless you count crashing my laptop a few times by using up too much memory.
Here’s a brief summary of how the research went over the four or five days I spent poking at it. I stayed with the TinyStories dataset for all of this, partially because I think it’s the best choice and partially because I wanted a 1:1 comparison between Codex and my own efforts.

Codex and I started with a series of n-gram models: instead of training a neural network, n-gram models just store the conditional probabilities of a token based on the n tokens that precede it. These models are very quick to produce (seconds, not minutes) but aren’t very good. The main reason is that even a 5-gram model cannot include context from more than five tokens ago, so they struggle to produce coherent text across an entire sentence. Here’s an example:

Once upon a time , in a small school . ” they are friends . they saw a big pond . he pulled and pulled , but the table was still no attention to grow even more . she quickly ran to the house . she says , ” sara said . ” you made him ! ” the smooth more it said , for helping me decorate the cake .

It’s not terrible! There are short segments that are entirely coherent. But it’s kind of like what AI skeptics think LLMs are like: just fragments of the original source, remixed without any unifying through-line. The perplexity is 18.5, worse than basically any of the transformers I trained in my last attempt. Codex trained 19 different n-gram models, of which the above example (a 4-gram model) was the best 3.

In my view, this is one of the strengths of LLM-based AI research: it is trivial to tell the model “go and sweep a bunch of different values for the hyperparameters”. Of course, you can do this yourself. But it’s a lot easier to just tell the model to do it.

After this, Codex spent a lot of time working on transformers. It trained ~50 normal transformers with different sizes, numbers of heads, layers, and so on. Most of this wasn’t particularly fruitful. I was surprised that my hand-picked hyperparameters from my previous attempt were quite competitive - though maybe it shouldn’t have been a shock, since they matched the lower end of the Chinchilla scaling laws. Still, eventually Codex hit on an 8.53 perplexity model (3 layers, 4 heads, and a dimension of 144), which was a strict improvement over my last attempt.

I’m not really convinced this was an architectural improvement. One lesson from training fifty different models is that there’s quite a lot of variance between different seeds. A perplexity improvement of just over 1 is more or less what I was seeing on a “lucky seed”. This was an interesting approach for the challenge: going for pure volume and hoping for a lucky training run. You can’t do this with a larger model, since it takes so long to train 4, but the five-minute limit makes it possible.

The next thing Codex tried - based on some feedback I pasted in from GPT-5-Pro - was “shallow fusion”: instead of training a new model, updating the generation logic to blend the transformer-predicted tokens with an n-gram model, a “kNN head” (which looks up hidden states that are “nearby” the current hidden state of the transformer and predicts their tokens), and a “cache head” that makes the model more likely to repeat words that are already in the context. This immediately dropped perplexity down to 7.38: a whole other point lower than our best transformer. I was excited about that, but the generated content was really bad:

Once upon a time,, in a small house, there lived a boy named Tim. Tim loved to play outside with his ball. One Mr. Skip had a lot of fun. He ran everywhere every day. One One day, Tim was playing with his ball new ball near his house. Tim was playing with his his ball and had a lot of fun. But then, he saw a big tree and decided to climb it. Tim tried to climb the tree, but he was too big. He was too small to reach the top of the tree. But the tree was too high. The little tree was too high for him. Soon, Tim was near the tree. He was brave and climbed the tree. But when he got got to the top, he was sad. Tim saw a bird on

What happened? I over-optimized for perplexity. As it turns out, the pure transformers that were higher-perplexity were better at writing stories. They had more coherence over the entire length of the story, they avoided generating weird repetition artifacts (like ”,,”), and they weren’t as mindlessly repetitive.

I went down a bit of a rabbit hole trying to think of how to score my models without just relying on perplexity. I came up with some candidate rubrics, like grammatical coherence, patterns of repetition, and so on, before giving up and just using LLM-as-a-judge. To my shame, I even generated a new API key for the LLM before realizing that I was talking to a strong LLM already via Codex, and I could just ask Codex to rate the model outputs directly.

The final and most successful idea I tried was distilling a transformer from an n-gram teacher model. First, we train an n-gram model, which only takes ~10 seconds. Then we train a transformer - but for the first 200 training steps, we push the transformer towards predicting the tokens that the n-gram model would predict. After that, the transformer continues to train on the TinyStories data as usual. Here’s an example of some output:

Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it.

I think this is pretty good! It has characters that continue throughout the story. It has a throughline - Ben’s lost toy - though it confuses “toy” and “friend” a bit. It’s a coherent story, with a setup, problem, solution and moral. This is much better than anything else I’ve been able to train in five minutes.

Why is it better? I think the right intuition here is that transformers need to spend a lot of initial compute (say, two minutes) learning how to construct grammatically-correct English sentences. If you begin the training by spending ten seconds training an n-gram model that can already produce sort-of-correct grammar, you can speedrun your way to learning grammar and spend an extra one minute and fifty seconds learning content.

I really like this approach. It’s exactly what I was looking for from the start: a cool architectural trick that genuinely helps, but only really makes sense for this weird challenge 5. I don’t have any illusions about this making me a real AI researcher, any more than a “vibe coder” is a software engineer. Still, I’m surprised that it actually worked. And it was a lot of fun!
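To make the n-gram-teacher trick more concrete, here is a heavily simplified sketch of the schedule described above: a KL term towards a bigram teacher for the first couple hundred steps, then plain next-token cross-entropy. The tiny embedding "student" and the random token stream are stand-ins for the real transformer and the TinyStories data, so this only illustrates the loss schedule, not the author's training code.

```python
# Toy sketch of distilling from an n-gram teacher for the first few hundred steps.
import torch
import torch.nn.functional as F

vocab_size, warmup_steps, total_steps = 100, 200, 1000
tokens = torch.randint(0, vocab_size, (10_000,))  # stand-in for a real corpus

# "Train" the bigram teacher: counts of next token given current token.
counts = torch.ones(vocab_size, vocab_size)  # add-one smoothing
counts.index_put_((tokens[:-1], tokens[1:]),
                  torch.ones(len(tokens) - 1), accumulate=True)
teacher_probs = counts / counts.sum(dim=1, keepdim=True)

# Stand-in student: an embedding table mapping current token -> next-token logits.
student = torch.nn.Embedding(vocab_size, vocab_size)
opt = torch.optim.AdamW(student.parameters(), lr=1e-2)

for step in range(total_steps):
    idx = torch.randint(0, len(tokens) - 1, (32,))
    x, y = tokens[idx], tokens[idx + 1]
    logits = student(x)
    loss = F.cross_entropy(logits, y)
    if step < warmup_steps:
        # Distillation term: pull the student's next-token distribution
        # towards the bigram teacher's for the first warmup_steps steps.
        log_q = F.log_softmax(logits, dim=-1)
        loss = loss + F.kl_div(log_q, teacher_probs[x], reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```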
I’ve pushed up the code here if you want to pick up from where I left off, but you may be better off just starting from scratch with Codex or your preferred coding agent.

edit: this post got some comments on Hacker News. The tone is much more negative than on my previous attempt, which is interesting - maybe the title gave people the mistaken impression that I think I’m a strong AI researcher!

1. “Alone” here is relative - I did use ChatGPT and a bit of Copilot to generate some of the training code in my last attempt. I just didn’t use any agentic tooling.
2. My deal with myself was that if I ever have a month where I use fewer than 2M tokens, I’ll cancel the plan.
3. There are a lot of clever tricks involved here: Kneser-Ney smoothing, interpolating unigram/bigram/trigram probabilities on a specific schedule, deliberately keeping the sentinel token, etc. I didn’t spend the time understanding all of these things deeply - that’s vibe research for you - so I won’t write too much about it.
4. Unless you’re a big AI lab. I am 100% convinced that the large labs are spending a lot of compute just re-training on different seeds in the hope of getting a lucky run.
5. I was suspicious that I just got a lucky seed, but I compared ~40 generations with and without the distillation and the distilled model really was better at producing correct-looking stories.

0 views
Simon Willison 1 week ago

Embracing the parallel coding agent lifestyle

For a while now I've been hearing from engineers who run multiple coding agents at once - firing up several Claude Code or Codex CLI instances at the same time, sometimes in the same repo, sometimes against multiple checkouts or git worktrees.

I was pretty skeptical about this at first. AI-generated code needs to be reviewed, which means the natural bottleneck on all of this is how fast I can review the results. It's tough keeping up with just a single LLM given how fast they can churn things out - where's the benefit from running more than one at a time if it just leaves me further behind?

Despite my misgivings, over the past few weeks I've noticed myself quietly starting to embrace the parallel coding agent lifestyle. I can only focus on reviewing and landing one significant change at a time, but I'm finding an increasing number of tasks that can still be fired off in parallel without adding too much cognitive overhead to my primary work.

Here are some patterns I've found for applying parallel agents effectively.

The first category of tasks I've been applying this pattern to is research. Research tasks answer questions or provide recommendations without making modifications to a project that you plan to keep.

A lot of software projects start with a proof of concept. Can Yjs be used to implement a simple collaborative note writing tool with a Python backend? The libraries exist, but do they work when you wire them together? Today's coding agents can build a proof of concept with new libraries and resolve those kinds of basic questions. Libraries too new to be in the training data? Doesn't matter: tell them to check out the repos for those new dependencies and read the code to figure out how to use them.

If you need a reminder about how a portion of your existing system works, modern "reasoning" LLMs can provide a detailed, actionable answer in just a minute or two. It doesn't matter how large your codebase is: coding agents are extremely effective with tools like grep and can follow codepaths through dozens of different files if they need to. Ask them to make notes on where your signed cookies are set and read, or how your application uses subprocesses and threads, or which aspects of your JSON API aren't yet covered by your documentation. These LLM-generated explanations are worth stashing away somewhere, because they can make excellent context to paste into further prompts in the future.

Now we're moving on to code edits that we intend to keep, albeit with very low stakes. It turns out there are a lot of problems that really just require a little bit of extra cognitive overhead which can be outsourced to a bot.

Warnings are a great example. Is your test suite spitting out a warning that something you are using is deprecated? Chuck that at a bot - tell it to run the test suite and figure out how to fix the warning. No need to take a break from what you're doing to resolve minor irritations like that.

There is a definite knack to spotting opportunities like this. As always, the best way to develop that instinct is to try things - any small maintenance task is something that's worth trying with a coding agent. You can learn from both their successes and their failures.

Reviewing code that lands on your desk out of nowhere is a lot of work. First you have to derive the goals of the new implementation: what's it trying to achieve? Is this something the project needs? Is the approach taken the best for this current project, given other future planned changes?
A lot of big questions before you can even start digging into the details of the code. Code that started from your own specification is a lot less effort to review. If you already decided what to solve, picked the approach and worked out a detailed specification for the work itself, confirming it was built to your needs can take a lot less time.

I described my more authoritarian approach to prompting models for code back in March. If I tell them exactly how to build something, the work needed to review the resulting changes is a whole lot less taxing.

My daily drivers are currently Claude Code (on Sonnet 4.5), Codex CLI (on GPT-5-Codex), and Codex Cloud (for asynchronous tasks, frequently launched from my phone). I'm also dabbling with GitHub Copilot Coding Agent (the agent baked into the GitHub.com web interface in various places) and Google Jules, Google's currently-free alternative to Codex Cloud.

I'm still settling into patterns that work for me. I imagine I'll be iterating on my processes for a long time to come, especially as the landscape of coding agents continues to evolve.

I frequently have multiple terminal windows open running different coding agents in different directories. These are currently a mixture of Claude Code and Codex CLI, running in YOLO mode (no approvals) for tasks where I'm confident malicious instructions can't sneak into the context. (I need to start habitually running my local agents in Docker containers to further limit the blast radius if something goes wrong.)

I haven't adopted git worktrees yet: if I want to run two agents in isolation against the same repo I do a fresh checkout, often into .

For riskier tasks I'm currently using asynchronous coding agents - usually Codex Cloud - so if anything goes wrong the worst that can happen is my source code getting leaked (since I allow it to have network access while running). Most of what I work on is open source anyway so that's not a big concern for me.

I occasionally use GitHub Codespaces to run VS Code's agent mode, which is surprisingly effective and runs directly in my browser. This is particularly great for workshops and demos since it works for anyone with a GitHub account, no extra API key necessary.

This category of coding agent software is still really new, and the models have only really got good enough to drive them effectively in the past few months - Claude 4 and GPT-5 in particular. I plan to write more as I figure out the ways of using them that are most effective. I encourage other practitioners to do the same!

Jesse Vincent wrote How I'm using coding agents in September, 2025, which describes his workflow for parallel agents in detail, including having an architect agent iterate on a plan which is then reviewed and implemented by fresh instances of Claude Code.

In The 7 Prompting Habits of Highly Effective Engineers Josh Bleecher Snyder describes several patterns for this kind of work. I particularly like this one: Send out a scout. Hand the AI agent a task just to find out where the sticky bits are, so you don't have to make those mistakes.

I've tried this a few times with good results: give the agent a genuinely difficult task against a large codebase, with no intention of actually landing its code, just to get ideas from which files it modifies and how it approaches the problem.

0 views
Ahead of AI 1 week ago

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

How do we actually evaluate LLMs? It's a simple question, but one that tends to open up a much bigger discussion. When advising or collaborating on projects, one of the things I get asked most often is how to choose between different models and how to make sense of the evaluation results out there. (And, of course, how to measure progress when fine-tuning or developing our own.)

Since this comes up so often, I thought it might be helpful to share a short overview of the main evaluation methods people use to compare LLMs. Of course, LLM evaluation is a very big topic that can't be exhaustively covered in a single resource, but I think that having a clear mental map of these main approaches makes it much easier to interpret benchmarks, leaderboards, and papers.

I originally planned to include these evaluation techniques in my upcoming book, Build a Reasoning Model (From Scratch), but they ended up being a bit outside the main scope. (The book itself focuses more on verifier-based evaluation.) So I figured that sharing this as a longer article with from-scratch code examples would be nice.

In Build A Reasoning Model (From Scratch), I am taking a hands-on approach to building a reasoning LLM from scratch. If you liked "Build A Large Language Model (From Scratch)", this book is written in a similar style in terms of building everything from scratch in pure PyTorch. Reasoning is one of the most exciting and important recent advances in improving LLMs, but it's also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book, I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live.

PS: There's a lot happening on the LLM research front right now. I'm still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article.

But now, let's discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses.

There are four common ways of evaluating trained LLMs in practice: multiple choice, verifiers, leaderboards, and LLM judges, as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories.

Figure 1: An overview of the 4 different evaluation methods covered in this article.

Furthermore, the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation, as shown in the figure above. (There are also other measures, such as training loss, perplexity, and rewards, but they are usually used internally during model development.)

The following subsections provide brief overviews and examples of each of the four methods. We begin with a benchmark-based method: multiple-choice question answering.

Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu). To illustrate this approach, figure 2 shows a representative task from the MMLU dataset.
Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly.

Multiple-choice benchmarks, such as MMLU, test an LLM's knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests.

Note that figure 2 shows a simplified version of multiple-choice evaluation, where the model's predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring. I implemented them here on GitHub. (As this builds on the concepts explained here, I recommend checking this out after completing this article.)

The following subsections illustrate how the MMLU scoring shown in figure 2 can be implemented in code.

First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub. Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via

In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience.

Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

For this, we will work with an example from the MMLU dataset:

Next, we define a function to format the LLM prompts. Let's execute the function on the MMLU example to get an idea of what the formatted LLM input looks like: The output is:

How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes?

The model prompt, as shown above, provides the model with a list of the different answer choices and ends with text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required.
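Since the article's own code listing isn't reproduced above, here is a minimal sketch of this letter-matching scoring (not the book's exact code): format the question with its choices, take whatever short continuation the model produces, and grab the first A/B/C/D it prints. The example's choices and answer field, and the pretend model output, are made up for illustration; in practice the generated string would come from your model's generate function.

```python
# Simplified MMLU-style scoring: compare the first A/B/C/D letter the model
# prints against the dataset's correct answer.
import re

def format_prompt(example):
    letters = ["A", "B", "C", "D"]
    lines = [example["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def extract_letter(generated_text):
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else None

example = {  # illustrative values, not the actual MMLU record
    "question": "How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes?",
    "choices": ["7", "11", "16", "8"],
    "answer": "D",
}

prompt = format_prompt(example)
generated = "D. 8"  # pretend this came from the model given `prompt`
print(prompt)
print("correct:", extract_letter(generated) == example["answer"])
```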
You can load examples from the MMLU dataset directly via the datasets library (which can be installed or ):

Above, we used the subset; to get a list of the other subsets, use the following code:

Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM:

Then, with all that setup out of the way, we define the main scoring function below, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of letter A/B/C/D that the model prints.

We can then check the generated letter using the function from the code block above as follows: The result is:

As we can see, the generated answer is incorrect ( ) in this case. This was just one of the 270 examples from the subset in MMLU. The screenshot (Figure 4) below shows the performance of the base model and reasoning variant when executed on the complete subset. The code for this is available here on GitHub.

Figure 4: Base and reasoning model performance on the MMLU subset

Assuming the questions have an equal answer probability, a random guesser (with uniform probability choosing A, B, C, or D) is expected to achieve 25% accuracy. So both the base and reasoning models are not very good.

Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model's predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input.

Figure 5: Other MMLU scoring methods are described and shared on GitHub here

However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple-choice benchmarks like MMLU is that they only measure an LLM's ability to select from predefined options, and they are thus not very useful for evaluating reasoning capabilities beyond checking if and how much knowledge the model has forgotten compared to the base model. They do not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn't necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps.

Related to multiple-choice question answering discussed in the previous section, verification-based approaches quantify an LLM's capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below.

Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset.

When we compare the extracted answer with the provided answer, as shown in the figure above, we can employ external tools, such as code interpreters or calculator-like tools/software.
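As a small illustration of the verification idea for math-style answers (a sketch, not the book's verifier): extract the final boxed expression from the model's free-form response and compare it, after light normalization, to the reference answer.

```python
# Minimal verifier sketch: pull the last \boxed{...} answer out of a free-form
# response and compare it to the reference after trivial normalization.
import re

def extract_boxed(text):
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(response, reference):
    answer = extract_boxed(response)
    if answer is None:
        return False
    normalize = lambda s: s.replace(" ", "").rstrip(".")
    return normalize(answer) == normalize(reference)

response = (
    "First double 3 to get 6, then add 1. "
    "The final answer is \\boxed{7}."
)
print(is_correct(response, "7"))  # True
```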
The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development.

I wrote a comprehensive 35-page chapter on this topic in my "Build a Reasoning Model (From Scratch)" book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you'll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub.)

Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub

So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, none of the aforementioned methods evaluate LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely LLM leaderboards.

Figure 8: A mental model of the topics covered in this book with a focus on the judgment- and benchmark-based evaluation methods covered in this appendix.

Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous section, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs.

A popular leaderboard is LM Arena (formerly Chatbot Arena), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9.

Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer.

These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10.

Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks

In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9. The list below represents pairwise votes where the first model is the winner:

In the list above, each tuple in the votes list represents a pairwise preference between two models, written as . So, means that a user preferred GPT-5 over a Claude-3 model answer. In the remainder of this section, we will turn the list into a leaderboard. For this, we will use the popular Elo rating system, which was originally developed for ranking chess players.

Before we look at the concrete code implementation, in short, it works as follows. Each model starts with a baseline score. Then, after each comparison and the preference vote, the model's rating is updated.
(In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers the current model over a highly ranked model, the current model will get a relatively large rating update and rank higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with rating points getting subtracted instead of added.)

The code to turn these pairwise rankings into a leaderboard is shown in the code block below. The function defined above takes the votes as input and turns them into a leaderboard, as follows:

This results in the following leaderboard ranking, where the higher the score, the better:

So, how does this work? For each pair, we compute the expected score of the winner using the standard Elo formula:

E_winner = 1 / (1 + 10^((R_loser - R_winner) / 400))

This value is the model's predicted chance to win in a no-draw setting based on the current ratings. It determines how large the rating update is. First, each model starts at the same baseline rating. If the two ratings (winner and loser) are equal, we have an expected score of 0.5, which indicates an even match. In this case, the updates are:

Now, if a heavy favorite (a model with a high rating) wins, we have an expected score close to 1. The favorite gains only a small amount and the loser loses only a little:

However, if an underdog (a model with a low rating) wins, we have an expected score close to 0, and the winner gets almost the full points while the loser loses about the same magnitude:

The Elo approach updates ratings after each match (model comparison), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can happen especially when an upset happens early versus late. To reduce this order effect, we can shuffle the votes pairs, run the function multiple times, and average the ratings.

Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development.

LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley-Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term "Elo" remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub.

Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub.
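Since the article's own code block didn't survive the formatting here, below is a small re-implementation of the Elo update it describes, with assumed constants (a 1,000-point starting rating and K = 32; the article's exact values may differ). The votes list mirrors the (winner, loser) tuples described above, with illustrative model names.

```python
# Elo-style leaderboard from pairwise (winner, loser) votes.
# Starting rating and K-factor are assumptions for this sketch.
def compute_leaderboard(votes, base_rating=1000.0, k=32.0):
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, base_rating)
        r_l = ratings.setdefault(loser, base_rating)
        # Expected score of the winner: 1 / (1 + 10 ** ((R_loser - R_winner) / 400))
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        # Winner scored 1, loser scored 0; surprising wins move ratings more.
        ratings[winner] = r_w + k * (1.0 - expected_w)
        ratings[loser] = r_l - k * (1.0 - expected_w)
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))

votes = [  # illustrative pairwise preferences; the first entry is the preferred model
    ("GPT-5", "Claude-3"),
    ("Claude-3", "Llama-4"),
    ("GPT-5", "Llama-4"),
]
for model, rating in compute_leaderboard(votes).items():
    print(f"{model:10s} {rating:7.1f}")
```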
Method 4: Judging responses with other LLMs

In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU, which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don’t account for synonyms, word changes, and so on. One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons, as they involve human feedback (as well as the challenges associated with collecting this feedback). A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM’s response to a reference response and judge the response quality accordingly, as illustrated in Figure 12.

Figure 12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer.

In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (E.g., one of the many examples is Phudge; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is also that evaluating an answer is often easier than generating one.

To implement a judge-based model evaluation as shown in Figure 12 programmatically in Python, we could load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate. Alternatively, we could use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of the section, we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion parameter gpt-oss open-weight model by OpenAI, as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article.

4.1 Implementing an LLM-as-a-judge approach in Ollama

Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text using LLMs (inference) and does not support training or fine-tuning LLMs.

To execute the following code, please install Ollama by visiting the official website at https://ollama.com and follow the provided instructions for your operating system:
- For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select “yes.”
- For Linux users: Use the installation command available on the Ollama website.

Before implementing the model evaluation code, let’s first download the gpt-oss model and verify that Ollama is functioning correctly by using it from the command line terminal.
Execute the ollama run gpt-oss:20b command on the command line (not in a Python session) to try out the 20 billion parameter gpt-oss model. The first time you execute this command, the 20 billion parameter gpt-oss model, which takes up 14 GB of storage space, will be automatically downloaded. Note that the gpt-oss:20b in the ollama run gpt-oss:20b command refers to the 20 billion parameter gpt-oss model. Using Ollama with the gpt-oss:20b model requires approximately 13 GB of RAM. If your machine does not have sufficient RAM, you can try using a smaller model, such as the 4 billion parameter qwen3:4b model via ollama run qwen3:4b, which only requires around 4 GB of RAM. For more powerful computers, you can also use the larger 120-billion parameter gpt-oss model by replacing gpt-oss:20b with gpt-oss:120b. However, keep in mind that this model requires significantly more computational resources.

Once the model download is complete, we are presented with a command-line interface that allows us to interact with the model. For example, try asking the model, “What is 1+2?” You can end this ollama run gpt-oss:20b session using the input /bye.

In the remainder of this section, we will use the Ollama API. This approach requires that Ollama is running in the background. There are three different options to achieve this:
1. Run the ollama serve command in the terminal (recommended). This runs the Ollama backend as a server, usually on http://localhost:11434. Note that it doesn’t load a model until it’s called through the API (later in this section).
2. Run the ollama run gpt-oss:20b command as before, but keep the session open and don’t exit it via /bye. As discussed earlier, this opens a minimal convenience wrapper around a local Ollama server. Behind the scenes, it uses the same server API as ollama serve.
3. Use the Ollama desktop app. Opening the desktop app runs the same backend automatically and provides a graphical interface on top of it.

Figure 13: Two different options to keep the Ollama server (/application) running so we can use it via the Ollama API in Python.

Ollama runs locally on our machine by starting a local server-like process. When running ollama serve in the terminal, as described above, you may encounter an error message saying that the address is already in use. If that’s the case, try starting the server on a different port (and if that address is also in use, increment the port number by one until you find an address that is not in use). The following code verifies that the Ollama session is running properly before we use Ollama to evaluate the test set responses generated in the previous section. Ensure that the output of that check indicates that Ollama is running. If it does not, please verify that the ollama serve command or the Ollama application is actively running (see Figure 13).

In the remainder of this article, we will interact with the local gpt-oss model, running on our machine, through the Ollama REST API using Python. The following function demonstrates how to use the API. As an example, asking the function “What is 1+2?” yields the response “3”. (It differs from what we’d get if we ran ollama run or used the Ollama application due to different default settings.) Using the function, we can evaluate the responses generated by our model with a prompt that includes a grading rubric asking the gpt-oss model to rate our target model’s responses on a scale from 1 to 5 based on a correct answer as a reference.
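Since the original helper function is not included in this excerpt, here is a minimal sketch of what it can look like, using Ollama's REST chat endpoint on the default local address. The function names and the deterministic generation options are my own assumptions, not the article's exact code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address of the local Ollama server

def ollama_is_running(url=OLLAMA_URL):
    """Check that the local Ollama server answers on its root endpoint."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status == 200
    except OSError:
        return False

def query_model(prompt, model="gpt-oss:20b", url=f"{OLLAMA_URL}/api/chat"):
    """Send a single-turn chat request to the local Ollama server and return the reply text."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return the full answer as one JSON object
        "options": {"seed": 123, "temperature": 0},  # make the judge more deterministic
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["message"]["content"]

print("Ollama running:", ollama_is_running())
print(query_model("What is 1+2?"))  # prints something like "3"
```

Setting stream to False keeps the judging loop simple, since the whole answer arrives in a single JSON response rather than as a token stream.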
The prompt we use for this is shown below. The hardcoded answer in the prompt is intended to represent the response produced by our own model in practice. For illustration purposes, we hardcode a plausible model answer here rather than generating it dynamically. (However, feel free to use the Qwen3 model we loaded at the beginning of this article to generate a real model answer.) Next, let’s generate the rendered prompt for the Ollama model. Ending the prompt this way incentivizes the model to generate the answer. Let’s see how the gpt-oss:20b model judges the response. As we can see, the answer receives the highest score, which is reasonable, as it is indeed correct.

While this was a simple example stepping through the process manually, we could take this idea further and implement a for-loop that iteratively queries the model (for example, the Qwen3 model we loaded earlier) with questions from an evaluation dataset, evaluates the answers via gpt-oss, and calculates the average score. You can find an implementation of such a script, where we evaluate the Qwen3 model on the MATH-500 dataset, here on GitHub.

Figure 14: A comparison of the Qwen3 0.6B base and reasoning variants on the first 10 examples in MATH-500 evaluated by gpt-oss:20b as a judge. You can find the code here on GitHub.

Related to symbolic verifiers and LLM judges, there is a class of learned models called process reward models (PRMs). Like judges, PRMs can evaluate reasoning traces beyond just the final answer, but unlike general judges, they focus specifically on the intermediate steps of reasoning. And unlike verifiers, which check correctness symbolically and usually only at the outcome level, PRMs provide step-by-step reward signals during training in reinforcement learning. We can categorize PRMs as “step-level judges,” which are predominantly developed for training, not pure evaluation. (In practice, PRMs are difficult to train reliably at scale. For example, DeepSeek R1 did not adopt PRMs and instead relied on verifiers for its reasoning training.)

Judge-based evaluations offer advantages over preference-based leaderboards, including scalability and consistency, as they do not rely on large pools of human voters. (Technically, it is possible to outsource the preference-based rating behind leaderboards to LLM judges as well.) However, LLM judges also share similar weaknesses with human voters: results can be biased by model preferences, prompt design, and answer style. Also, there is a strong dependency on the choice of judge model and rubric, and they lack the reproducibility of fixed benchmarks.

In this article, we covered four different evaluation approaches: multiple choice, verifiers, leaderboards, and LLM judges. I know this was a long article, but I hope you found it useful for getting an overview of how LLMs are evaluated. A from-scratch approach like this can be verbose, but it is a great way to understand how these methods work under the hood, which in turn helps us identify weaknesses and areas for improvement. That being said, you are probably wondering, “What is the best way to evaluate an LLM?” Unfortunately, there is no single best method since, as we have seen, each comes with different trade-offs.
In short:

Multiple-choice
(+) Relatively quick and cheap to run at scale
(+) Standardized and reproducible across papers (or model cards)
(-) Measures basic knowledge recall
(-) Does not reflect how LLMs are used in the real world

Verifiers
(+) Standardized, objective grading for domains with ground truth
(+) Allows free-form answers (with some constraints on final answer formatting)
(+) Can also score intermediate steps if using process verifiers or process reward models
(-) Requires verifiable domains (for example, math or code), and building good verifiers can be tricky
(-) Outcome-only verifiers evaluate only the final answer, not reasoning quality

Arena-style leaderboards (human pairwise preference)
(+) Directly answers “Which model do people prefer?” on real prompts
(+) Allows free-form answers and implicitly accounts for style, helpfulness, and safety
(-) Expensive and time-intensive for humans
(-) Does not measure correctness, only preference
(-) Nonstationary populations can affect stability

LLM-as-a-judge
(+) Scalable across many tasks
(+) Allows free-form answers
(-) Dependent on the judge’s capability (ensembles can make this more robust)
(-) Depends on rubric choice

While I am usually not a big fan of radar plots, one can be helpful here to visualize these different evaluation areas, as shown below.

Figure 15: A radar chart showing conceptually that we ideally want to pay attention to different areas when evaluating an LLM to identify its strengths and weaknesses.

For instance, a strong multiple-choice rating suggests that the model has solid general knowledge. Combine that with a strong verifier score, and the model is likely also answering technical questions correctly. However, if the model performs poorly on LLM-as-a-judge and leaderboard evaluations, it may struggle to write or articulate responses effectively and could benefit from some RLHF. So, the best evaluation combines multiple areas. But ideally, it also uses data that directly aligns with your goals or business problems. For example, suppose you are implementing an LLM to assist with legal or law-related tasks. It makes sense to run the model on standard benchmarks like MMLU as a quick sanity check, but ultimately you will want to tailor the evaluations to your target domain, such as law. You can find public benchmarks online that serve as good starting points, but in the end, you will want to test with your own proprietary data. Only then can you be reasonably confident that the model has not already seen the test data during training.

In any case, model evaluation is a very big and important topic. I hope this article was useful in explaining how the main approaches work, and that you took away a few useful insights for the next time you look at model evaluations or run them yourself. As always, Happy tinkering!

This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I’m confident you’ll get a lot out of these; they explain how LLMs work at a depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I’d really appreciate a brief review. It helps us authors a lot!
Your support means a great deal! Thank you!

Reasoning is one of the most exciting and important recent advances in improving LLMs, but it’s also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book, I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live.

PS: There’s a lot happening on the LLM research front right now. I’m still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let’s discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses.

Understanding the main evaluation methods for LLMs

There are four common ways of evaluating trained LLMs in practice: multiple choice, verifiers, leaderboards, and LLM judges, as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories.

Figure 1: An overview of the four different evaluation methods covered in this article.

Furthermore, the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation, as shown in the figure above. (There are also other measures, such as training loss, perplexity, and rewards, but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods.

Method 1: Evaluating answer-choice accuracy

We begin with a benchmark-based method: multiple-choice question answering. Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu ). To illustrate this approach, Figure 2 shows a representative task from the MMLU dataset.

Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks, such as MMLU, test an LLM’s knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests.

Note that Figure 2 shows a simplified version of multiple-choice evaluation, where the model’s predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring. I implemented them here on GitHub. (As this builds on the concepts explained here, I recommend checking this out after completing this article.) The following subsections illustrate how the MMLU scoring shown in Figure 2 can be implemented in code.

1.2 Loading the model

First, before we can evaluate it on MMLU, we have to load the pre-trained model.
Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub. Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via pip or uv.

Code block 1: Loading a pre-trained model

1.3 Checking the generated answer letter

In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience.

Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

For this, we will work with an example from the MMLU dataset. Next, we define a function to format the LLM prompts.

Code block 2: Formatting the MMLU prompt

Let’s execute the function on the MMLU example to get an idea of what the formatted LLM input looks like. The output begins with the question “How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes?” The model prompt, as shown above, provides the model with a list of the different answer choices and ends with a text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required.

Loading different MMLU samples

You can load examples from the MMLU dataset directly via the datasets library (which can be installed via pip or uv). Above, we used a single subset; to get a list of the other subsets, you can query the available dataset configurations (see the code sketch at the end of this section).

Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM. Then, with all that setup out of the way, we define the main scoring function below, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of letter A/B/C/D that the model prints.

Code block 3: Extracting the generated letter

We can then check the generated letter using the function from the code block above. As we can see, the generated answer is incorrect in this case. This was just one of the 270 examples from that subset in MMLU. The screenshot (Figure 4) below shows the performance of the base model and reasoning variant when executed on the complete subset. The code for this is available here on GitHub.

Figure 4: Base and reasoning model performance on the MMLU subset

Assuming the questions have an equal answer probability, a random guesser (with uniform probability of choosing A, B, C, or D) is expected to achieve 25% accuracy. So both the base and the reasoning model are not very good.
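The original code blocks are not part of this excerpt, so here is a minimal sketch of the prompt formatting and answer-letter extraction described above, together with loading MMLU via the datasets library. The concrete example item, the high_school_mathematics subset (whose test split has the 270 examples mentioned above), and the helper names are my assumptions, not necessarily the article's exact code.

```python
import re
from datasets import load_dataset, get_dataset_config_names  # pip install datasets

# Hypothetical MMLU-style item; real items are loaded from the hub below.
example = {
    "question": "How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes?",
    "choices": ["7", "11", "16", "8"],
    "answer": 3,  # index of the correct choice, i.e., letter "D"
}

def format_prompt(ex):
    """Render an MMLU item as a multiple-choice prompt that ends in 'Answer:'."""
    letters = "ABCD"
    lines = [ex["question"]]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(ex["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def extract_letter(generated_text):
    """Return the first standalone A/B/C/D in the generated continuation, if any."""
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else None

prompt = format_prompt(example)
predicted = extract_letter("I think the answer is D.")  # stand-in for the model output
correct = "ABCD"[example["answer"]]
print(predicted == correct)  # True for this stand-in answer

# Loading real MMLU items and listing the available subsets:
mmlu = load_dataset("cais/mmlu", "high_school_mathematics", split="test")
print(len(mmlu))                                  # expected: 270 examples in this subset
print(get_dataset_config_names("cais/mmlu")[:5])  # first few subset names
```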
Multiple-choice answer formats

Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model’s predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input.

Figure 5: Other MMLU scoring methods are described and shared on GitHub here

However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple-choice benchmarks like MMLU is that they only measure an LLM’s ability to select from predefined options and are thus not very useful for evaluating reasoning capabilities beyond checking whether and how much knowledge the model has forgotten compared to the base model. They do not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn’t necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps.

Method 2: Using verifiers to check answers

Related to the multiple-choice question answering discussed in the previous section, verification-based approaches quantify an LLM’s capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below.

Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset.

When we compare the extracted answer with the provided answer, as shown in the figure above, we can employ external tools, such as code interpreters or calculator-like tools/software.
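To make the verification step concrete, here is a minimal sketch of an outcome-level verifier in the spirit of Figure 6: it extracts the final boxed answer from a free-form solution and compares it against the reference. The function names and the numeric-comparison shortcut are my own simplifications; real verifiers handle many more answer formats (equivalent fractions, units, LaTeX variants, and so on).

```python
import re
from fractions import Fraction

def extract_final_answer(model_output):
    """Pull the contents of the last \\boxed{...} from a free-form solution."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    return matches[-1].strip() if matches else None

def verify(candidate, reference):
    """Compare answers numerically when possible, otherwise as normalized strings."""
    if candidate is None:
        return False
    try:
        return Fraction(candidate) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return candidate.replace(" ", "") == reference.replace(" ", "")

solution = "First compute 12 * 4 = 48, then subtract 6. The answer is \\boxed{42}."
print(extract_final_answer(solution))                 # 42
print(verify(extract_final_answer(solution), "42"))   # True
print(verify(extract_final_answer(solution), "41"))   # False
```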

Sean Goedecke 1 week ago

How I influence tech company politics as a staff software engineer

Many software engineers are fatalistic about company politics. They believe that it’s pointless to get involved, because 1 : The general idea here is that software engineers are simply not equipped to play the game at the same level as real political operators . This is true! It would be a terrible mistake for a software engineer to think that you ought to start scheming and plotting like you’re in Game of Thrones . Your schemes will be immediately uncovered and repurposed to your disadvantage and other people’s gain. Scheming takes practice and power, and neither of those things are available to software engineers. It is simply a fact that software engineers are tools in the political game being played at large companies, not players in their own right. However, there are many ways to get involved in politics without scheming. The easiest way is to actively work to make a high-profile project successful . This is more or less what you ought to be doing anyway, just as part of your ordinary job. If your company is heavily investing in some new project - these days, likely an AI project - using your engineering skill to make it successful 2 is a politically advantageous move for whatever VP or executive is spearheading that project. In return, you’ll get the rewards that executives can give at tech companies: bonuses, help with promotions, and positions on future high-profile projects. I wrote about this almost a year ago in Ratchet effects determine engineer reputation at large companies . A slightly harder way (but one that gives you more control) is to make your pet idea available for an existing political campaign . Suppose you’ve wanted for a while to pull out some existing functionality into its own service. There are two ways to make that happen. The hard way is to expend your own political capital: drum up support, let your manager know how important it is to you, and slowly wear doubters down until you can get the project formally approved. The easy way is to allow some executive to spend their (much greater) political capital on your project . You wait until there’s a company-wide mandate for some goal that aligns with your project (say, a push for reliability, which often happens in the wake of a high-profile incident). Then you suggest to your manager that your project might be a good fit for this. If you’ve gauged it correctly, your org will get behind your project. Not only that, but it’ll increase your political capital instead of you having to spend it. Organizational interest comes in waves. When it’s reliability time, VPs are desperate to be doing something . They want to come up with plausible-sounding reliability projects that they can fund, because they need to go to their bosses and point at what they’re doing for reliability, but they don’t have the skillset to do it on their own. They’re typically happy to fund anything that the engineering team suggests. On the other hand, when the organization’s attention is focused somewhere else - say, on a big new product ship - the last thing they want is for engineers to spend their time on an internal reliability-focused refactor that’s invisible to customers. So if you want to get something technical done in a tech company, you ought to wait for the appropriate wave . It’s a good idea to prepare multiple technical programs of work, all along different lines. Strong engineers will do some of this kind of thing as an automatic process, simply by noticing things in the normal line of work. 
For instance, you might have rough plans: When executives are concerned about billing, you can offer the billing refactor as a reliability improvement. When they’re concerned about developer experience, you can suggest replacing the build pipeline. When customers are complaining about performance, you can point to the Golang rewrite as a good option. When the CEO checks the state of the public documentation and is embarrassed, you can make the case for rebuilding it as a static site. The important thing is to have a detailed, effective program of work ready to go for whatever the flavor of the month is. Some program of work will be funded whether you do this or not. However, if you don’t do this, you have no control over what that program is. In my experience, this is where companies make their worst technical decisions : when the political need to do something collides with a lack of any good ideas. When there are no good ideas, a bad idea will do, in a pinch. But nobody prefers this outcome. It’s bad for the executives, who then have to sell a disappointing technical outcome as if it were a success 4 , and it’s bad for the engineers, who have to spend their time and effort building the wrong idea. If you’re a very senior engineer, the VPs (or whoever) will quietly blame you for this. They’ll be right to! Having the right idea handy at the right time is your responsibility. You can view all this in two different ways. Cynically, you can read this as a suggestion to make yourself a convenient tool for the sociopaths who run your company to use in their endless internecine power struggles. Optimistically, you can read this as a suggestion to let executives set the overall priorities for the company - that’s their job, after all - and to tailor your own technical plans to fit 3 . Either way, you’ll achieve more of your technical goals if you push the right plan at the right time. edit: this post got some attention on Hacker News . The comments were much more positive than on my other posts about politics, for reasons I don’t quite understand. This comment is an excellent statement of what I write about here (but targeted at more junior engineers). This comment (echoed here ) references a Milton Friedman quote that applies the idea in this post to political policy in general, which I’d never thought of but sounds correct: Only a crisis—actual or perceived—produces real change. When that crisis occurs, the actions that are taken depend on the ideas that are lying around. That, I believe, is our basic function: to develop alternatives to existing policies, to keep them alive and available until the politically impossible becomes politically inevitable. There’s a few comments calling this approach overly game-playing and self-serving. I think this depends on the goal you’re aiming at. The ones I referenced above seem pretty beneficial to me! Finally, this comment is a good summary of what I was trying to say: Instead of waiting to be told what to do and being cynical about bad ideas coming up when there’s a vacumn and not doing what he wants to do, the author keeps a back log of good and important ideas that he waits to bring up for when someone important says something is priority. He gets what he wants done, compromising on timing. I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). 
If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all.

- Technical decisions are often made for completely selfish reasons that cannot be influenced by a well-meaning engineer
- Powerful stakeholders are typically so stupid and dysfunctional that it’s effectively impossible for you to identify their needs and deliver solutions to them
- The political game being played depends on private information that software engineers do not have, so any attempt to get involved will result in just blundering around
- Managers and executives spend most of their time playing politics, while engineers spend most of their time doing engineering, so engineers are at a serious political disadvantage before they even start

- to migrate the billing code to stored-data-updated-by-webhooks instead of cached API calls
- to rip out the ancient hand-rolled build pipeline and replace it with Vite
- to rewrite a crufty high-volume Python service in Golang
- to replace the slow CMS frontend that backs your public documentation with a fast static site

1. I was prompted to write this after reading Terrible Software’s article Don’t avoid workplace politics and its comments on Hacker News. Disclaimer: I am talking here about broadly functional tech companies (i.e. ones that are making money). If you’re working somewhere that’s completely dysfunctional, I have no idea whether this advice would apply at all. ↩
2. What it takes to make a project successful is itself a complex political question that every senior+ engineer is eventually forced to grapple with (or to deliberately avoid, with consequences for their career). For more on that, see How I ship projects at large tech companies . ↩
3. For more along these lines, see Is it cynical to do what your manager wants? ↩
4. Just because they can do this doesn’t mean they want to. ↩


The RAG Obituary: Killed by Agents, Buried by Context Windows

I’ve been working in AI and search for a decade: first building Doctrine, the largest European legal search engine, and now building Fintool , an AI-powered financial research platform that helps institutional investors analyze companies, screen stocks, and make investment decisions. After three years of building, optimizing, and scaling LLMs with retrieval-augmented generation (RAG) systems, I believe we’re witnessing the twilight of RAG-based architectures. As context windows explode and agent-based architectures mature, my controversial opinion is that the current RAG infrastructure we spent so much time building and optimizing is on the decline.

In late 2022, ChatGPT took the world by storm. People started endless conversations, delegating crucial work, only to realize that the underlying model, GPT-3.5, could only handle 4,096 tokens... roughly six pages of text! The AI world faced a fundamental problem: how do you make an intelligent system work with knowledge bases that are orders of magnitude larger than what it can read at once? The answer became Retrieval-Augmented Generation (RAG), an architectural pattern that would dominate AI for the next three years.

GPT-3.5 could handle 4,096 tokens, and the next model, GPT-4, doubled that to 8,192 tokens, about twelve pages. This wasn’t just inconvenient; it was architecturally devastating. Consider the numbers: a single SEC 10-K filing contains approximately 51,000 tokens (130+ pages). With 8,192 tokens, you could see less than 16% of a 10-K filing. It’s like reading a financial report through a keyhole!

RAG emerged as an elegant solution borrowed directly from search engines. Just as Google displays 10 blue links with relevant snippets for your query, RAG retrieves the most pertinent document fragments and feeds them to the LLM for synthesis. The core idea is beautifully simple: if you can’t fit everything in context, find the most relevant pieces and use those . It turns LLMs into sophisticated search result summarizers. Basically, LLMs can’t read the whole book but they can know who dies at the end; convenient!

Long documents need to be chunked into pieces, and that’s when the problems start. Those digestible pieces are typically 400-1,000 tokens each, which is roughly 300-750 words. The problem? It isn’t as simple as cutting every 500 words. Consider chunking a typical SEC 10-K annual report. The document has a complex hierarchical structure:
- Item 1: Business Overview (10-15 pages)
- Item 1A: Risk Factors (20-30 pages)
- Item 7: Management’s Discussion and Analysis (30-40 pages)
- Item 8: Financial Statements (40-50 pages)

After naive chunking at 500 tokens, critical information gets scattered:
- Revenue recognition policies split across 3 chunks
- A risk factor explanation broken mid-sentence
- Financial table headers separated from their data
- MD&A narrative divorced from the numbers it’s discussing

If you search for “revenue growth drivers,” you might get a chunk mentioning growth but miss the actual numerical data in a different chunk, or the strategic context from MD&A in yet another chunk!
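To make the failure mode concrete, here is a minimal sketch of the naive fixed-size chunking described above, using words as a rough stand-in for tokens. The filing text is a hypothetical placeholder, not real data.

```python
def naive_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size word windows, ignoring all document structure."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Imagine `filing_text` holds the full extracted text of a 10-K (placeholder here).
filing_text = "Item 7. Management's Discussion and Analysis ... revenue table header ... " * 2000
chunks = naive_chunks(filing_text)
# A table that happens to start near a window boundary ends up with its header in
# chunks[n] and its numbers in chunks[n + 1] -- exactly the scattering described above.
print(len(chunks), "chunks of roughly 500 words each")
```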
At Fintool, we’ve developed sophisticated chunking strategies that go beyond naive text splitting:
- Hierarchical Structure Preservation: We maintain the nested structure from Item 1 (Business) down to sub-sections like geographic segments, creating a tree-like document representation
- Table Integrity: Financial tables are never split—income statements, balance sheets, and cash flow statements remain atomic units with headers and data together
- Cross-Reference Preservation: We maintain links between narrative sections and their corresponding financial data, preserving the “See Note X” relationships
- Temporal Coherence: Year-over-year comparisons and multi-period analyses stay together as single chunks
- Footnote Association: Footnotes remain connected to their referenced items through metadata linking

Each chunk at Fintool is enriched with extensive metadata:
- Filing type (10-K, 10-Q, 8-K)
- Fiscal period and reporting date
- Section hierarchy (Item 7 > Liquidity > Cash Position)
- Table identifiers and types
- Cross-reference mappings
- Company identifiers (CIK, ticker)
- Industry classification codes

This allows for more accurate retrieval, but even our intelligent chunking can’t solve the fundamental problem: we’re still working with fragments instead of complete documents!

Once you have the chunks, you need a way to search them. One way is to embed your chunks. Each chunk is converted into a high-dimensional vector (typically 1,536 dimensions in most embedding models). These vectors live in a space where, theoretically, similar concepts are close together. When a user asks a question, that question also becomes a vector. The system finds the chunks whose vectors are closest to the query vector using cosine similarity. It’s elegant in theory; in practice, it’s a nightmare of edge cases. Embedding models are trained on general text and struggle with specialized terminology. They find similarities, but they can’t distinguish between “revenue recognition” (accounting policy) and “revenue growth” (business performance).

Consider this example. Query: “What is the company’s litigation exposure?” RAG searches for “litigation” and returns 50 chunks:
- Chunks 1-10: Various mentions of “litigation” in boilerplate risk factors
- Chunks 11-20: Historical cases from 2019 (already settled)
- Chunks 21-30: Forward-looking safe harbor statements
- Chunks 31-40: Duplicate descriptions from different sections
- Chunks 41-50: Generic “we may face litigation” warnings

What RAG reports: $500M in litigation (from the Legal Proceedings section). What’s actually there:
- $500M in Legal Proceedings (Item 3)
- $700M in Contingencies note (“not material individually”)
- $1B new class action in Subsequent Events
- $800M indemnification obligations (different section)
- $2B probable losses in footnotes (keyword “probable,” not “litigation”)

The actual exposure is $5B, 10x what RAG found. Oops!

By late 2023, most builders realized pure vector search wasn’t enough. Enter hybrid search: combine semantic search (embeddings) with traditional keyword search (BM25). This is where things get interesting. BM25 (Best Matching 25) is a probabilistic retrieval model that excels at exact term matching.
Unlike embeddings, BM25:
- Rewards Exact Matches: When you search for “EBITDA,” you get documents with “EBITDA,” not “operating income” or “earnings”
- Handles Rare Terms Better: Financial jargon like “CECL” (Current Expected Credit Losses) or “ASC 606” gets proper weight
- Document Length Normalization: Doesn’t penalize longer documents
- Term Frequency Saturation: Multiple mentions of “revenue” don’t overshadow other important terms

At Fintool, we’ve built a sophisticated hybrid search system:
1. Parallel Processing: We run semantic and keyword searches simultaneously
2. Dynamic Weighting: Our system adjusts weights based on query characteristics: specific financial metrics? BM25 gets 70% weight. Conceptual questions? Embeddings get 60% weight. Mixed queries? 50/50 split with result analysis
3. Score Normalization: Different scoring scales are normalized using min-max scaling for BM25 scores, cosine similarity (already normalized) for embeddings, and z-score normalization for outlier handling

So in the end, the embedding search and the keyword search each retrieve chunks, and the search engine combines them using Reciprocal Rank Fusion (a minimal sketch follows below). RRF merges rankings so items that consistently appear near the top across systems float higher, even if no system put them at #1!

So now you think it’s done, right? But hell no! Here’s what nobody talks about: even after all that retrieval work, you’re not done. You need to rerank the chunks one more time to get a good retrieval, and it’s not easy. Rerankers are ML models that take the search results and reorder them by relevance to your specific query, limiting the number of chunks sent to the LLM. Not only are LLMs context-poor, they also struggle when dealing with too much information . It’s vital to reduce the number of chunks sent to the LLM for the final answer.

The reranking pipeline:
1. Initial search retrieval with embeddings + keywords gets you 100-200 chunks
2. Reranker ranks the top 10
3. Top 10 are fed to the LLM to answer the question

Here is the challenge with reranking:
- Latency Explosion: Reranking adds between 300 and 2,000 ms per query. Ouch.
- Cost Multiplication: It adds significant extra cost to every query. For instance, Cohere Rerank 3.5 costs $2.00 per 1,000 search units, making reranking expensive.
- Context Limits: Rerankers typically handle only a limited number of chunks (Cohere Rerank supports only 4096 tokens), so if you need to rerank more than that, you have to split the work into parallel API calls and merge them!
- Another Model to Manage: One more API, one more failure point

Reranking is one more step in a complex pipeline. What I find difficult with RAG is what I call the “cascading failure problem”:
1. Chunking can fail (split tables) or be too slow (especially when you have to ingest and chunk gigabytes of data in real time)
2. Embedding can fail (wrong similarity)
3. BM25 can fail (term mismatch)
4. Hybrid fusion can fail (bad weights)
5. Reranking can fail (wrong priorities)

Each stage compounds the errors of the previous stage. Beyond the complexity of hybrid search itself, there’s an infrastructure burden that’s rarely discussed. Running production Elasticsearch is not easy. You’re looking at maintaining TB+ of indexed data for comprehensive document coverage, which requires 128-256GB of RAM minimum just to get decent performance. The real nightmare comes with re-indexing. Every schema change forces a full re-indexing that takes 48-72 hours for large datasets.
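For reference, here is a minimal sketch of the Reciprocal Rank Fusion step mentioned above. The chunk IDs and the two input rankings are hypothetical; k = 60 is the constant commonly used in the original RRF formulation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking.

    `rankings` is a list like [bm25_ranking, embedding_ranking], each ordered
    from best to worst. The k constant dampens the influence of any single
    system's top result.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["chunk_07", "chunk_12", "chunk_33"]       # keyword hits
embedding_ranking = ["chunk_12", "chunk_98", "chunk_07"]  # semantic hits
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused[:2])  # chunk_12 and chunk_07 float to the top: both systems agree on them
```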
On top of that, you’re constantly dealing with cluster management, sharding strategies, index optimization, cache tuning, backup and disaster recovery, and version upgrades that regularly include breaking changes.

Here are some structural limitations:
1. Context Fragmentation - Long documents are interconnected webs, not independent paragraphs - A single question might require information from 20+ documents - Chunking destroys these relationships permanently
2. Semantic Search Fails on Numbers - “$45.2M” and “$45,200,000” have different embeddings - “Revenue increased 10%” and “Revenue grew by a tenth” rank differently - Tables full of numbers have poor semantic representations
3. No Causal Understanding - RAG can’t follow “See Note 12” → Note 12 → Schedule K - Can’t understand that discontinued operations affect continuing operations - Can’t trace how one financial item impacts another
4. The Vocabulary Mismatch Problem - Companies use different terms for the same concept - “Adjusted EBITDA” vs “Operating Income Before Special Items” - RAG retrieves based on terms, not concepts
5. Temporal Blindness - Can’t distinguish Q3 2024 from Q3 2023 reliably - Mixes current period with prior period comparisons - No understanding of fiscal year boundaries

These aren’t minor issues. They’re fundamental limitations of the retrieval paradigm.

Three months ago, I stumbled on an innovation in retrieval that blew my mind. In May 2025, Anthropic released Claude Code, an AI coding agent that works in the terminal. At first, I was surprised by the form factor. A terminal? Are we back in 1980? No UI? Back then, I was using Cursor, a product that excelled at traditional RAG. I gave it access to my codebase to embed my files, and Cursor ran a search on my codebase before answering my query. Life was good. But when testing Claude Code, one thing stood out: it was better and faster, and not because their RAG was better, but because there was no RAG. Instead of a complex pipeline of chunking, embedding, and searching, Claude Code uses direct filesystem tools:
1. Grep (Ripgrep) - Lightning-fast regex search through file contents - No indexing required: it searches live files instantly - Full regex support for precise pattern matching - Can filter by file type or use glob patterns - Returns exact matches with context lines
2. Glob - Direct file discovery by name patterns - Finds files like `**/*.py` or `src/**/*.ts` instantly - Returns files sorted by modification time (recency bias) - Zero overhead—just filesystem traversal
3. Task Agents - Autonomous multi-step exploration - Handle complex queries requiring investigation - Combine multiple search strategies adaptively - Build understanding incrementally - Self-correct based on findings

By the way, Grep was invented in 1973. It’s so... primitive. And that’s the genius of it. Claude Code doesn’t retrieve. It investigates:
- Runs multiple searches in parallel (Grep + Glob simultaneously)
- Starts broad, then narrows based on discoveries
- Follows references and dependencies naturally
- No embeddings, no similarity scores, no reranking

It’s simple, it’s fast, and it’s based on a new assumption: that LLMs will go from context-poor to context-rich. Claude Code proved that with sufficient context and intelligent navigation, you don’t need RAG at all.
The agent can:
- Load entire files or modules directly
- Follow cross-references in real-time
- Understand structure and relationships
- Maintain complete context throughout investigation

This isn’t just better than RAG—it’s a fundamentally different paradigm. And what works for code can work for any long documents that are not coding files. The context window explosion made Claude Code possible:

2022-2025, the context-poor era:
- GPT-4: 8K tokens (~12 pages)
- GPT-4-32k: 32K tokens (~50 pages)

2025 and beyond, the context revolution:
- Claude Sonnet 4: 200k tokens (~700 pages)
- Gemini 2.5: 1M tokens (~3,000 pages)
- Grok 4-fast: 2M tokens (~6,000 pages)

At 2M tokens, you can fit an entire year of SEC filings for most companies. The trajectory is even more dramatic: we’re likely heading toward 10M+ context windows by 2027, with Sam Altman hinting at billions of context tokens on the horizon. This represents a fundamental shift in how AI systems process information. Equally important, attention mechanisms are rapidly improving—LLMs are becoming far better at maintaining coherence and focus across massive context windows without getting “lost” in the noise.

Claude Code demonstrated that with enough context, search becomes navigation:
- No need to retrieve fragments when you can load complete files
- No need for similarity when you can use exact matches
- No need for reranking when you follow logical paths
- No need for embeddings when you have direct access

It’s mind-blowing. LLMs are getting really good at agentic behaviors, meaning they can organize their work into tasks to accomplish an objective. Here’s what tools like ripgrep bring to the search table:
- No Setup: No index. No overhead. Just point and search.
- Instant Availability: New documents are searchable the moment they hit the filesystem (no indexing latency!)
- Zero Maintenance: No clusters to manage, no indices to optimize, no RAM to provision
- Blazing Fast: For a 100K line codebase, Elasticsearch needs minutes to index. Ripgrep searches it in milliseconds with zero prep.
- Cost: $0 infrastructure cost vs a lot of $$$ for Elasticsearch

So back to our previous example on SEC filings. An agent can understand SEC filing structure intrinsically:
- Hierarchical Awareness: Knows that Item 1A (Risk Factors) relates to Item 7 (MD&A)
- Cross-Reference Following: Automatically traces “See Note 12” references
- Multi-Document Coordination: Connects 10-K, 10-Q, 8-K, and proxy statements
- Temporal Analysis: Compares year-over-year changes systematically

For searches across thousands of companies or decades of filings, it might still use hybrid search, but now as a tool for agents:
- Initial broad search using hybrid retrieval
- Agent loads full documents for top results
- Deep analysis within full context
- Iterative refinement based on findings

My guess is that traditional RAG is now a search tool among others and that agents will always prefer grep and reading the whole file because they are context-rich and can handle long-running tasks.
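As an illustration of this agent-as-searcher pattern, here is a minimal sketch of the kind of tool loop described above: shelling out to ripgrep and reading whole files into context. The file paths and the query are hypothetical, this is not Claude Code's actual implementation, and it assumes the rg binary is installed on the PATH.

```python
import subprocess
from pathlib import Path

def grep(pattern, directory="filings/"):
    """Run ripgrep and return matching lines with file names and line numbers."""
    result = subprocess.run(
        ["rg", "--line-number", "--ignore-case", pattern, directory],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()

def read_file(path):
    """Load a whole document into the agent's context -- no chunking."""
    return Path(path).read_text()

# A simplified navigation loop in the spirit described above: search broadly,
# then follow the cross-references the model asks for.
hits = grep(r"operating lease", "filings/")          # step 1: broad keyword search
note_hits = grep(r"Note 12", "filings/")             # step 2: follow "See Note 12"
full_note = read_file("filings/notes/note_12.txt")   # hypothetical path to the note
# The LLM reads `full_note` in its context window and decides the next search itself.
```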
Consider our $6.5B lease obligation question as an example: Step 1: Find “lease” in main financial statements → Discovers “See Note 12” Step 2: Navigate to Note 12 → Finds “excluding discontinued operations (Note 23)” Step 3: Check Note 23 → Discovers $2B additional obligations Step 4: Cross-reference with MD&A → Identifies management’s explanation and adjustments Step 5: Search for “subsequent events” → Finds post-balance sheet $500M lease termination Final answer: $5B continuing + $2B discontinued - $500M terminated = $6.5B The agent follows references like a human analyst would. No chunks. No embeddings. No reranking. Just intelligent navigation. Basically, RAG is like a research assistant with perfect memory but no understanding: - “Here are 50 passages that mention debt” - Can’t tell you if debt is increasing or why - Can’t connect debt to strategic changes - Can’t identify hidden obligations - Just retrieves text, doesn’t comprehend relationships Agentic search is like a forensic accountant: - Follows the money systematically - Understands accounting relationships (assets = liabilities + equity) - Identifies what’s missing or hidden - Connects dots across time periods and documents - Challenges management assertions with data 1. Increasing Document Complexity - Documents are becoming longer and more interconnected - Cross-references and external links are proliferating - Multiple related documents need to be understood together - Systems must follow complex trails of information 2. Structured Data Integration - More documents combine structured and unstructured data - Tables, narratives, and metadata must be understood together - Relationships matter more than isolated facts - Context determines meaning 3. Real-Time Requirements - Information needs instant processing - No time for re-indexing or embedding updates - Dynamic document structures require adaptive approaches - Live data demands live search 4. Cross-Document Understanding Modern analysis requires connecting multiple sources: - Primary documents - Supporting materials - Historical versions - Related filings RAG treats each document independently. Agentic search builds cumulative understanding. 5. Precision Over Similarity - Exact information matters more than similar content - Following references beats finding related text - Structure and hierarchy provide crucial context - Navigation beats retrieval The evidence is becoming clear. While RAG served us well in the context-poor era, agentic search represents a fundamental evolution. The potential benefits of agentic search are compelling: - Elimination of hallucinations from missing context - Complete answers instead of fragments - Faster insights through parallel exploration - Higher accuracy through systematic navigation - Massive infrastructure cost reduction - Zero index maintenance overhead The key insight? Complex document analysis—whether code, financial filings, or legal contracts—isn’t about finding similar text. It’s about understanding relationships, following references, and maintaining precision. The combination of large context windows and intelligent navigation delivers what retrieval alone never could. RAG was a clever workaround for a context-poor era . It helped us bridge the gap between tiny windows and massive documents, but it was always a band-aid. The future won’t be about splitting documents into fragments and juggling embeddings. It will be about agents that can navigate, reason, and hold entire corpora in working memory. 
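As an illustration of the reference-following behaviour in the walkthrough above, here is a hedged sketch of the core move: spot a citation like "See Note 12", jump to that section, and repeat. Everything here (file layout, regex, note-per-file naming) is an assumption for illustration; real filings need far more robust parsing.

```python
import re
from pathlib import Path

NOTE_REF = re.compile(r"See Note (\d+)", re.IGNORECASE)

def follow_references(start_file: Path, notes_dir: Path, max_hops: int = 5) -> list[str]:
    """Walk 'See Note N' references like an analyst would, collecting each section."""
    visited, trail = set(), []
    text = start_file.read_text(errors="ignore")
    for _ in range(max_hops):
        match = NOTE_REF.search(text)
        if not match or match.group(1) in visited:
            break
        note = match.group(1)
        visited.add(note)
        note_path = notes_dir / f"note_{note}.txt"  # assumed layout: one note per file
        if not note_path.exists():
            break
        text = note_path.read_text(errors="ignore")
        trail.append(f"Note {note}: {text[:200]}...")
    return trail
```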
We are entering the post-retrieval age. The winners will not be the ones who maintain the biggest vector databases, but the ones who design the smartest agents to traverse abundant context and connect meaning across documents. In hindsight, RAG will look like training wheels. Useful, necessary, but temporary. The next decade of AI search will belong to systems that read and reason end-to-end. Retrieval isn’t dead—it’s just been demoted.

0 views
Simon Willison 2 weeks ago

Designing agentic loops

Coding agents like Anthropic's Claude Code and OpenAI's Codex CLI represent a genuine step change in how useful LLMs can be for producing working code. These agents can now directly exercise the code they are writing, correct errors, dig through existing implementation details, and even run experiments to find effective code solutions to problems. As is so often the case with modern AI, there is a great deal of depth involved in unlocking the full potential of these new tools. A critical new skill to develop is designing agentic loops . One way to think about coding agents is that they are brute force tools for finding solutions to coding problems. If you can reduce your problem to a clear goal and a set of tools that can iterate towards that goal a coding agent can often brute force its way to an effective solution. My preferred definition of an LLM agent is something that runs tools in a loop to achieve a goal . The art of using them well is to carefully design the tools and loop for them to use. Agents are inherently dangerous - they can make poor decisions or fall victim to malicious prompt injection attacks , either of which can result in harmful results from tool calls. Since the most powerful coding agent tool is "run this command in the shell" a rogue agent can do anything that you could do by running a command yourself. To quote Solomon Hykes : An AI agent is an LLM wrecking its environment in a loop. Coding agents like Claude Code counter this by defaulting to asking you for approval of almost every command that they run. This is kind of tedious, but more importantly, it dramatically reduces their effectiveness at solving problems through brute force. Each of these tools provides its own version of what I like to call YOLO mode, where everything gets approved by default. This is so dangerous , but it's also key to getting the most productive results! Here are three key risks to consider from unattended YOLO mode. If you want to run YOLO mode anyway, you have a few options: Most people choose option 3. Despite the existence of container escapes I think option 1 using Docker or the new Apple container tool is a reasonable risk to accept for most people. Option 2 is my favorite. I like to use GitHub Codespaces for this - it provides a full container environment on-demand that's accessible through your browser and has a generous free tier too. If anything goes wrong it's a Microsoft Azure machine somewhere that's burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository. There are plenty of other agent-like tools that run code on other people's computers. Code Interpreter mode in both ChatGPT and Claude can go a surprisingly long way here. I've also had a lot of success (ab)using OpenAI's Codex Cloud . Coding agents themselves implement various levels of sandboxing, but so far I've not seen convincing enough documentation of these to trust them. Update : It turns out Anthropic have their own documentation on Safe YOLO mode for Claude Code which says: Letting Claude run arbitrary commands is risky and can result in data loss, system corruption, or even data exfiltration (e.g., via prompt injection attacks). To minimize these risks, use in a container without internet access. You can follow this reference implementation using Docker Dev Containers. 
Locking internet access down to a list of trusted hosts is a great way to prevent exfiltration attacks from stealing your private source code. Now that we've found a safe (enough) way to run in YOLO mode, the next step is to decide which tools we need to make available to the coding agent. You can bring MCP into the mix at this point, but I find it's usually more productive to think in terms of shell commands instead. Coding agents are really good at running shell commands! If your environment allows them the necessary network access, they can also pull down additional packages from NPM and PyPI and similar. Ensuring your agent runs in an environment where random package installs don't break things on your main computer is an important consideration as well! Rather than leaning on MCP, I like to create an AGENTS.md (or equivalent) file with details of packages I think they may need to use. For a project that involved taking screenshots of various websites I installed my own shot-scraper CLI tool and dropped the following in : Just that one example is enough for the agent to guess how to swap out the URL and filename for other screenshots. Good LLMs already know how to use a bewildering array of existing tools. If you say "use playwright python " or "use ffmpeg" most models will use those effectively - and since they're running in a loop they can usually recover from mistakes they make at first and figure out the right incantations without extra guidance. In addition to exposing the right commands, we also need to consider what credentials we should expose to those commands. Ideally we wouldn't need any credentials at all - plenty of work can be done without signing into anything or providing an API key - but certain problems will require authenticated access. This is a deep topic in itself, but I have two key recommendations here: I'll use an example to illustrate. A while ago I was investigating slow cold start times for a scale-to-zero application I was running on Fly.io . I realized I could work a lot faster if I gave Claude Code the ability to directly edit Dockerfiles, deploy them to a Fly account and measure how long they took to launch. Fly allows you to create organizations, and you can set a budget limit for those organizations and issue a Fly API key that can only create or modify apps within that organization... So I created a dedicated organization for just this one investigation, set a $5 budget, issued an API key and set Claude Code loose on it! In that particular case the results weren't useful enough to describe in more detail, but this was the project where I first realized that "designing an agentic loop" was an important skill to develop. Not every problem responds well to this pattern of working. The thing to look out for here are problems with clear success criteria where finding a good solution is likely to involve (potentially slightly tedious) trial and error . Any time you find yourself thinking "ugh, I'm going to have to try a lot of variations here" is a strong signal that an agentic loop might be worth trying! A few examples: A common theme in all of these is automated tests . The value you can get from coding agents and other LLM coding tools is massively amplified by a good, cleanly passing test suite. Thankfully LLMs are great for accelerating the process of putting one of those together, if you don't have one yet. Designing agentic loops is a very new skill - Claude Code was first released in just February 2025! 
I'm hoping that giving it a clear name can help us have productive conversations about it. There's so much more to figure out about how to use these tools as effectively as possible. The three key risks of unattended YOLO mode mentioned above: Bad shell commands deleting or mangling things you care about. Exfiltration attacks where something steals files or data visible to the agent - source code or secrets held in environment variables are particularly vulnerable here. Attacks that use your machine as a proxy to attack another target - for DDoS or to disguise the source of other hacking attacks. The options if you want to run YOLO mode anyway: Run your agent in a secure sandbox that restricts the files and secrets it can access and the network connections it can make. Use someone else's computer. That way if your agent goes rogue, there's only so much damage they can do, including wasting someone else's CPU cycles. Take a risk! Try to avoid exposing it to potential sources of malicious instructions and hope you catch any mistakes before they cause any damage. The two key credential recommendations: Try to provide credentials to test or staging environments where any damage can be well contained. If a credential can spend money, set a tight budget limit. Examples of problems that respond well to an agentic loop: Debugging: a test is failing and you need to investigate the root cause. Coding agents that can already run your tests can likely do this without any extra setup. Performance optimization: this SQL query is too slow, would adding an index help? Have your agent benchmark the query and then add and drop indexes (in an isolated development environment!) to measure their impact. Upgrading dependencies: you've fallen behind on a bunch of dependency upgrades? If your test suite is solid an agentic loop can upgrade them all for you and make any minor updates needed to reflect breaking changes. Make sure a copy of the relevant release notes is available, or that the agent knows where to find them itself. Optimizing container sizes: Docker container feeling uncomfortably large? Have your agent try different base images and iterate on the Dockerfile to try to shrink it, while keeping the tests passing.
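To go with the "runs tools in a loop to achieve a goal" definition above, here is a bare-bones sketch of such a loop. The model call is stubbed out, and the allow-list is a crude stand-in for the sandboxing and approval policies discussed in the post; none of this is Claude Code's actual architecture.

```python
import shlex
import subprocess

ALLOWED = {"ls", "cat", "rg", "echo"}  # crude stand-in for a sandbox/approval policy

def call_model(goal: str, history: list[str]) -> str:
    """Stub for an LLM call that proposes the next shell command (or 'done')."""
    return "done" if history else "echo hello from the loop"

def agent_loop(goal: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        command = call_model(goal, history)
        if command == "done":
            break
        if shlex.split(command)[0] not in ALLOWED:
            history.append(f"refused: {command}")
            continue
        result = subprocess.run(shlex.split(command), capture_output=True, text=True)
        history.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return history

print(agent_loop("demonstrate the loop"))
```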

3 views
Simon Willison 2 weeks ago

Claude Sonnet 4.5 is probably the "best coding model in the world" (at least for now)

Anthropic released Claude Sonnet 4.5 today , with a very bold set of claims: Claude Sonnet 4.5 is the best coding model in the world. It's the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains in reasoning and math. Anthropic gave me access to a preview version of a "new model" over the weekend which turned out to be Sonnet 4.5. My initial impressions were that it felt like a better model for code than GPT-5-Codex, which has been my preferred coding model since it launched a few weeks ago . This space moves so fast - Gemini 3 is rumored to land soon so who knows how long Sonnet 4.5 will continue to hold the "best coding model" crown. The pricing is the same as the previous Sonnet: $3/million input tokens and $15/million output tokens. This remains significantly cheaper than Claude Opus - $15/$75 - but still quite a bit more than GPT-5 and GPT-5-Codex, both at $1.25/$10. The claude.ai web interface (not yet the Claude iPhone native app) recently added the ability for Claude to write and then directly execute code in a sandboxed server environment, using Python and Node.js. I wrote about that in detail three weeks ago. Anthropic's implementation of this code interpreter pattern is more powerful than ChatGPT's equivalent because it can directly clone code from GitHub and install software packages from NPM and PyPI. Sonnet 4.5 absolutely shines at using this tool! Here's my most ambitious experiment with it. I started like this: It did exactly as I asked, checking out my LLM repo , installing it along with all of its dependencies, and running the tests. ✅ All 466 tests passed in 167.69 seconds (about 2 minutes 47 seconds) Then I posed it a wildly ambitious challenge: As you can see in the resulting conversation transcript , it did a really impressive job of this! It churned away across dozens of different tool calls, eventually outputting the following summary of its activity: I've successfully experimented with adding tree-structured conversations to the LLM CLI tool. Here's what was accomplished: 1. Database Schema Enhancement 2. Comprehensive Utility Module ( ) 3. Complete Test Suite ( ) 4. Integration Verification The implementation is production-ready with comprehensive testing. Next steps would be: All code is tested, documented, and ready to use! Did I mention I kicked off this experiment by entering prompts on my phone? I told it to - you can explore the contents of the file it made me in this Gist . Update : I shared a longer version of this experiment where I had the model experiment with a complex message hashing mechanism as well. I've pushed the code to a branch , which is best explained by the message_matching_notes.md document written by Claude. I released llm-anthropic 0.19 adding support for the new model. Then I ran my classic pelican benchmark , first with thinking turned on : And then without : These are pretty good - they are recognizably pelicans! - though not quite as good as GPT-5-Codex which is better at drawing bicycles. I also tried having it describe this image: This image shows a large gathering of pelicans crowded together along what appears to be a waterfront or shoreline. The birds are densely packed, creating a chaotic scene with dozens of brown pelicans visible. Their distinctive long bills and pouches are clearly visible among the mass of feathered bodies. In the background, there's calm water and a dark tree line or forested area. 
The lighting suggests this was taken during early morning or late afternoon, giving the scene a golden, warm tone. The pelicans appear to be resting or congregating together, which is typical behavior for these colonial seabirds. The release of this model has been very well coordinated. My embargo on talking about it was due to lift at 10am Pacific today, and I got an email from them linking to their blog post at 10am on the dot. It's also already live on OpenRouter and in Cursor and GitHub Copilot and no doubt a whole bunch of other places as well. Anthropic also shipped a new Claude Code VS Code extension today, plus a big upgrade to the Claude Code terminal app. Plus they rebranded their confusingly named Claude Code SDK to the Claude Agent SDK instead, emphasizing that it's a tool for building agents beyond just customizing the existing Claude Code product. That's available for both TypeScript and Python . The detail bullets under the numbered items in Claude's summary: - Added column to the table via migration - Enables tree-structured (branching) conversations instead of just linear sequences - Fully backward compatible - existing data unaffected - 12 helper functions for tree operations: Navigation (children, parent, siblings, paths), Analysis (depth, descendants, size), Queries (roots, leaves, statistics), Visualization (ASCII tree display) - 16 tests covering all tree operations - Tests linear chains, branching, multiple roots, forests - Includes realistic debugging scenario - All tests pass ✅ - Updated existing migration tests to expect new column - All 6 migration tests pass - Total: 22/22 tests passing ✅ - README.md - Quick start guide - IMPLEMENTATION_SUMMARY.md - Complete technical overview - MIGRATION_CODE.md - Database migration details - TEST_RESULTS.md - Full test coverage report - tree_notes.md - Design decisions and development notes - tree_utils.py - Utility functions module - test_tree_conversations.py - Test suite - Multiple branches from any conversation point - Multiple roots per conversation (forest structure) - Rich analytics (depth, branching factor, tree size) - ASCII tree visualization - Cycle detection for safety - Integrate into LLM package - Add CLI commands ( , ) - Update to accept
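For a sense of what "tree-structured conversations" means in practice, here is a hypothetical sketch of the data shape and two navigation helpers. It is not Simon's actual schema or the code Claude produced; the table, column, and function names are assumptions for illustration.

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    # A parent pointer is what turns a linear log into a tree:
    # each response references the response it branched from (NULL for roots).
    conn.execute("""
        CREATE TABLE IF NOT EXISTS responses (
            id INTEGER PRIMARY KEY,
            parent_response_id INTEGER REFERENCES responses(id),
            prompt TEXT,
            reply TEXT
        )""")

def children(conn: sqlite3.Connection, response_id: int) -> list[int]:
    rows = conn.execute(
        "SELECT id FROM responses WHERE parent_response_id = ?", (response_id,))
    return [r[0] for r in rows]

def path_to_root(conn: sqlite3.Connection, response_id: int) -> list[int]:
    """Walk parent pointers upward - the branch a given reply belongs to."""
    path, current = [], response_id
    while current is not None:
        path.append(current)
        row = conn.execute(
            "SELECT parent_response_id FROM responses WHERE id = ?", (current,)).fetchone()
        current = row[0] if row else None
    return list(reversed(path))
```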

0 views
Armin Ronacher 2 weeks ago

90%

“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code” — Dario Amodei Three months ago I said that AI changes everything. I came to that after plenty of skepticism. There are still good reasons to doubt that AI will write all code, but my current reality is close. For the infrastructure component I started at my new company, I’m probably north of 90% AI-written code. I don’t want to convince you — just share what I learned. In parts, because I approached this project differently from my first experiments with AI-assisted coding. The service is written in Go with few dependencies and an OpenAPI-compatible REST API. At its core, it sends and receives emails. I also generated SDKs for Python and TypeScript with a custom SDK generator. In total: about 40,000 lines, including Go, YAML, Pulumi, and some custom SDK glue. I set a high bar, especially that I can operate it reliably. I’ve run similar systems before and knew what I wanted. Some startups are already near 100% AI-generated. I know, because many build in the open and you can see their code. Whether that works long-term remains to be seen. I still treat every line as my responsibility, judged as if I wrote it myself. AI doesn’t change that. There are no weird files that shouldn’t belong there, no duplicate implementations, and no emojis all over the place. The comments still follow the style I want and, crucially, often aren’t there. I pay close attention to the fundamentals of system architecture, code layout, and database interaction. I’m incredibly opinionated. As a result, there are certain things I don’t let the AI do. I know it won’t reach the point where I could sign off on a commit. That’s why it’s not 100%. As a contrast: another quick prototype we built is a mess of unclear database tables, markdown file clutter in the repo, and boatloads of unwanted emojis. It served its purpose — validate an idea — but wasn’t built to last, and we had no expectation to that end. I began in the traditional way: system design, schema, architecture. At this stage I don’t let the AI write code, but I loop it in as a kind of rubber duck. The back-and-forth helps me see mistakes, even if I don’t need or trust the answers. I did get the foundation wrong once. I initially argued myself into a more complex setup than I wanted. That’s a part where I later used the LLM to redo a larger part early and clean it up. For AI-generated or AI-supported code, I now end up with a stack that looks like something I often wanted, but was too hard to do by hand: Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM. The fact that I no longer have to write SQL because the AI does it for me is a game changer. I also use raw SQL for migrations now. OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer.
This approach works better with AI-generated code. The OpenAPI specification is now the canonical source that both the clients and the server shim are based on. Today I use Claude Code and Codex. Each has strengths, but the constant is Codex for code review after PRs. It’s very good at that. Claude is indispensable still when debugging and needing a lot of tool access (e.g.: why do I have a deadlock, why is there corrupted data in the database etc.). The two working together is where it’s most magical. Claude might find the data, Codex might understand it better. I cannot stress enough how bad the code from these agents can be if you’re not careful. While they understand system architecture and how to build something, they can’t keep the whole picture in scope. They will recreate things that already exist. They create abstractions that are completely inappropriate for the scale of the problem. You constantly need to learn how to bring the right information to the context. For me, this means pointing the AI to existing implementations and giving it very specific instructions on how to follow along. I generally create PR-sized chunks that I can review. There are two paths to this: Agent loop with finishing touches: Prompt until the result is close, then clean up. Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge. It requires intuition to know when each approach is more likely to lead to the right results. Familiarity with the agent also helps you understand when a task will not go anywhere, avoiding wasted cycles. The most important piece of working with an agent is the same as regular software engineering. You need to understand your state machines, how the system behaves at any point in time, your database. It is easy to create systems that appear to behave correctly but have unclear runtime behavior when relying on agents. For instance, the AI doesn’t fully comprehend threading or goroutines. If you don’t keep the bad decisions at bay early, you won’t be able to operate the system in a stable manner later. Here’s an example: I asked it to build a rate limiter. It “worked” but lacked jitter and used poor storage decisions. Easy to fix if you know rate limiters, dangerous if you don’t. Agents also operate on conventional wisdom from the internet and in turn do things I would never do myself. They love to use dependencies (particularly outdated ones). They love to swallow errors and take away all tracebacks. I’d rather uphold strong invariants and let code crash loudly when they fail, than hide problems. If you don’t fight this, you end up with opaque, unobservable systems. For me, this has reached the point where I can’t imagine working any other way. Yes, I could probably have done it without AI. But I would have built a different system in parts because I would have made different trade-offs. This way of working unlocks paths I’d normally skip or defer. Here are some of the things I enjoyed a lot on this project: Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes. It allows me to directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands-on evaluation. Trying out things: I tried three different OpenAPI implementations and approaches in a day.
You need to know what you do, but if set up well, refactoring becomes easy. Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike became a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs. Adopting new patterns: While they suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test. That turns out to be super useful. It would have been quite an involved project to migrate to. Claude did it in an hour for all tests. SQL quality: It writes solid SQL that I could never remember myself. I just need to review it, which I can do. But to this day I’m bad at remembering SQL when writing it by hand. Is 90% of code going to be written by AI? I don’t know. What I do know is that for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way. At the same time, for me, AI doesn’t own the code. I still review every line, shape the architecture, and carry the responsibility for how it runs in production. But the sheer volume of what I now let an agent generate would have been unthinkable even six months ago. That’s why I’m convinced this isn’t some far-off prediction. It’s already here — just unevenly distributed — and the number of developers working like this is only going to grow. That said, none of this removes the need to actually be a good engineer. If you let the AI take over without judgment, you’ll end up with brittle systems and painful surprises (data loss, security holes, unscalable software). The tools are powerful, but they don’t absolve you of responsibility.
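Referring back to the rate limiter anecdote above, here is a sketch of the kind of detail that is easy to miss, written in Python here rather than the project's Go. The jitter on the retry delay is the piece the agent reportedly left out; the in-memory dict is a deliberately naive storage choice, standing in for whatever shared store a real service would need.

```python
import random
import time

class SlidingWindowLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits: dict[str, list[float]] = {}  # in-memory only; fine for a sketch

    def allow(self, key: str) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds)."""
        now = time.monotonic()
        window_start = now - self.window
        recent = [t for t in self.hits.get(key, []) if t > window_start]
        self.hits[key] = recent
        if len(recent) < self.max_requests:
            recent.append(now)
            return True, 0.0
        # Retry after the oldest hit leaves the window, plus jitter so that
        # many throttled clients don't all come back at the same instant.
        retry_after = (recent[0] + self.window) - now
        return False, retry_after + random.uniform(0, 0.25 * self.window)
```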

0 views
Sean Goedecke 2 weeks ago

What is "good taste" in software engineering?

Technical taste is different from technical skill. You can be technically strong but have bad taste, or technically weak with good taste. Like taste in general, technical taste sometimes runs ahead of your ability: just like you can tell good food from bad without being able to cook, you can know what kind of software you like before you’ve got the ability to build it. You can develop technical ability by study and repetition, but good taste is developed in a more mysterious way. Here are some indicators of software taste: I think taste is the ability to adopt the set of engineering values that fit your current project . Aren’t the indicators above just a part of skill? For instance, doesn’t code look good if it’s good code ? I don’t think so. Let’s take an example. Personally, I feel like code that uses map and filter looks nicer than using a for loop. It’s tempting to think that this is a case of me being straightforwardly correct about a point of engineering. For instance, map and filter typically involve pure functions, which are easier to reason about, and they avoid an entire class of off-by-one iterator bugs. It feels to me like this isn’t a matter of taste, but a case where I’m right and other engineers are wrong. But of course it’s more complicated than that. Languages like Golang don’t contain map and filter at all, for principled reasons. Iterating with a for loop is easier to reason about from a performance perspective, and is more straightforward to extend to other iteration strategies (like taking two items at a time). I don’t care about these reasons as much as I care about the reasons in favour of map and filter - that’s why I don’t write a lot of for loops - but it would be far too arrogant for me to say that engineers who prefer for loops are simply less skilled. In many cases, they have technical capabilities that I don’t have. They just care about different things. In other words, our disagreement comes down to a difference in values . I wrote about this point in I don’t know how to build software and you don’t either . Even if the big technical debates do have definite answers, no working software engineer is ever in a position to know what those answers are, because you can only fit so much experience into one career. We are all at least partly relying on our own personal experience: on our particular set of engineering values. Almost every decision in software engineering is a tradeoff. You’re rarely picking between two options where one is strictly better. Instead, each option has its own benefits and downsides. Often you have to make hard tradeoffs between engineering values : past a certain point, you cannot easily increase performance without harming readability, for instance 1 . Really understanding this point is (in my view) the biggest indicator of maturity in software engineering. Immature engineers are rigid about their decisions. They think it’s always better to do X or Y. Mature engineers are usually willing to consider both sides of a decision, because they know that both sides come with different benefits. The trick is not deciding if technology X is better than Y, but whether the benefits of X outweigh Y in this particular case . In other words, immature engineers are too inflexible about their taste . They know what they like, but they mistake that liking for a principled engineering position. What defines a particular engineer’s taste? In my view, your engineering taste is composed of the set of engineering values you find most important .
For instance: Resiliency . If an infrastructure component fails (a service dies, a network connection becomes unavailable), does the system remain functional? Can it recover without human intervention? Speed . How fast is the software, compared to the theoretical limit? Is work being done in the hot path that isn’t strictly necessary? Readability . Is the software easy to take in at a glance and to onboard new engineers to? Are functions relatively short and named well? Is the system well-documented? Correctness . Is it possible to represent an invalid state in the system? How locked-down is the system with tests, types, and asserts? Do the tests use techniques like fuzzing? In the extreme case, has the program been proven correct by formal methods like Alloy ? Flexibility . Can the system be trivially extended? How easy is it to make a change? If I need to change something, how many different parts of the program do I need to touch in order to do so? Portability . Is the system tied down to a particular operational environment (say, Microsoft Windows, or AWS)? If the system needs to be redeployed elsewhere, can that happen without a lot of engineering work? Scalability . If traffic goes up 10x, will the system fall over? What about 100x? Does the system have to be over-provisioned or can it scale automatically? What bottlenecks will require engineering intervention? Development speed . If I need to extend the system, how fast can it be done? Can most engineers work on it, or does it require a domain expert? There are many other engineering values: elegance, modern-ness, use of open source, monetary cost of keeping the system running, and so on. All of these are important, but no engineer cares equally about all of these things. Your taste is determined by which of these values you rank highest. For instance, if you value speed and correctness more than development speed, you are likely to prefer Rust over Python. If you value scalability over portability, you are likely to argue for a heavy investment in your host’s (e.g. AWS) particular quirks and tooling. If you value resiliency over speed, you are likely to want to split your traffic between different regions. And so on 2 . It’s possible to break these values down in a more fine-grained way. Two engineers who both deeply care about readability could disagree because one values short functions and the other values short call-stacks. Two engineers who both care about correctness could disagree because one values exhaustive test suites and the other values formal methods. But the principle is the same - there are lots of possible engineering values to care about, and because they are often in tension, each engineer is forced to take some more seriously than others. I’ve said that all of these values are important. Despite that, it’s possible to have bad taste. In the context of software engineering, bad taste means that your preferred values are not a good fit for the project you’re working on . Most of us have worked with engineers like this. They come onto your project evangelizing about something - formal methods, rewriting in Golang, Ruby meta-programming, cross-region deployment, or whatever - because it’s worked well for them in the past. Whether it’s a good fit for your project or not, they’re going to argue for it, because it’s what they like. Before you know it, you’re making sure your internal metrics dashboard has five nines of reliability, at the cost of making it impossible for any junior engineer to understand. 
In other words, most bad taste comes from inflexibility . I will always distrust engineers who justify decisions by saying “it’s best practice”. No engineering decision is “best practice” in all contexts! You have to make the right decision for the specific problem you’re facing. One interesting consequence of this is that engineers with bad taste are like broken compasses. If you’re in the right spot, a broken compass will still point north. It’s only when you start moving around that the broken compass will steer you wrong. Likewise, many engineers with bad taste can be quite effective in the particular niche where their preferences line up with what the project needs. But when they’re moved between projects or jobs, or when the nature of the project changes, the wheels immediately come off. No job stays the same for long, particularly in these troubled post-2021 times . Good taste is a lot more elusive than technical ability. That’s because, unlike technical ability, good taste is the ability to select the right set of engineering values for the particular technical problem you’re facing . It’s thus much harder to identify if someone has good taste: you can’t test it with toy problems, or by asking about technical facts. You need there to be a real problem, with all of its messy real-world context. You can tell you have good taste if the projects you’re working on succeed. If you’re not meaningfully contributing to the design of a project (maybe you’re just doing ticket-work), you can tell you have good taste if the projects where you agree with the design decisions succeed, and the projects where you disagree are rocky. Importantly, you need a set of different kinds of projects. If it’s just the one project, or the same kind of project over again, you might just be a good fit for that. Even if you go through many different kinds of projects, that’s no guarantee that you have good taste in domains you’re less familiar with 3 . How do you develop good taste? It’s hard to say, but I’d recommend working on a variety of things, paying close attention to which projects (or which parts of the project) are easy and which parts are hard. You should focus on flexibility : try not to acquire strong universal opinions about the right way to write software. What good taste I have I acquired pretty slowly. Still, I don’t see why you couldn’t acquire it fast. I’m sure there are prodigies with taste beyond their experience in programming, just as there are prodigies in other domains. Of course this isn’t always true. There are win-win changes where you can improve several usually-opposing values at the same time. But mostly we’re not in that position. Like I said above, different projects will obviously demand a different set of values. But the engineers working on those projects will still have to draw the line somewhere, and they’ll rely on their own taste to do that. That said, I do think good taste is somewhat transferable. I don’t have much personal experience with this so I’m leaving it in a footnote, but if you’re flexible and attentive to the details in domain A, you’ll probably be flexible and attentive to the details in domain B.

1 views
Michael Lynch 2 weeks ago

Get xkcd Cartoons at 2x Resolution

I recently learned a neat trick from Marcus Buffett : xkcd has an undocumented way to get images of the cartoons at double their normal resolution. For example, xkcd #2582 “Data Trap” lists this URL as the direct link: That leads to this 275x275px image, quite small by today’s standards: But there’s a trick! Add a to the filename in the URL: That gives you a version that’s 2x the original resolution, at 550x550px: I couldn’t find the 2x versions documented or linked anywhere except for some old, obscure reddit posts. I hope this knowledge saves you from stretching out poor, low-resolution xkcd comics. Update (2025-10-02) : My friend Matthew Riley let me know that the URL actually appears in xkcd’s HTML in the image’s attribute : Matthew also did some sleuthing at the Internet Archive and noticed that Randall Munroe added the version for older cartoons recently, as #124 “Blogofractal” had no attribute in this 2023 snapshot , but this 2024 snapshot includes . Not every xkcd is available in 2x resolution. I vibe-coded a Python script to check which cartoons have a 2x-resolution version and published the current list: The earliest version with 2x available is #124, “Blogofractal” ( 2x version ). Starting with #1084, “Server Problem” ( 2x version ), almost every xkcd is available at 2x-resolution. https://imgs.xkcd.com/comics/data_trap.png https://imgs.xkcd.com/comics/data_trap_2x.png List of 2x-resolution xkcd Cartoons #1683 “Digital Data” : The 2x version is lower quality than the original .
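A rough sketch of the kind of check the vibe-coded script presumably performs: ask xkcd's official JSON endpoint for a comic's image URL, derive the `_2x` filename, and see whether it exists. The actual script isn't reproduced here, so treat this as an approximation rather than the published code.

```python
import json
import urllib.error
import urllib.request

def has_2x(comic_number: int) -> bool:
    """Check whether xkcd serves a double-resolution image for a given comic."""
    meta_url = f"https://xkcd.com/{comic_number}/info.0.json"
    with urllib.request.urlopen(meta_url) as resp:
        img = json.load(resp)["img"]  # e.g. https://imgs.xkcd.com/comics/data_trap.png
    stem, _, ext = img.rpartition(".")
    candidate = f"{stem}_2x.{ext}"
    try:
        head = urllib.request.Request(candidate, method="HEAD")
        with urllib.request.urlopen(head) as check:
            return check.status == 200
    except urllib.error.HTTPError:
        return False

# Per the list above, has_2x(2582) ("Data Trap") should be True,
# while comics earlier than #124 should come back False.
```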

0 views
Ash's Blog 3 weeks ago

How a String Library Beat OpenCV at Image Processing by 4x

To my great surprise, one of the biggest current users of my StringZilla library in the Python ecosystem is one of the world’s most widely used Image Augmentation libraries - Albumentations with over 100 million downloads on PyPI, growing by 5 million every month. Last year, Albumentations swapped parts of OpenCV - the world’s most widely used image processing library with 32 million monthly downloads in Python - for my strings library 🤯

0 views
Marc Brooker 3 weeks ago

Seven Years of Firecracker

Time flies like an arrow. Fruit flies like a banana. Back at re:Invent 2018, we shared Firecracker with the world. Firecracker is open source software that makes it easy to create and manage small virtual machines. At the time, we talked about Firecracker as one of the key technologies behind AWS Lambda, including how it’d allowed us to make Lambda faster, more efficient, and more secure. A couple years later, we published Firecracker: Lightweight Virtualization for Serverless Applications (at NSDI’20). Here’s me talking through the paper back then: The paper went into more detail into how we’re using Firecracker in Lambda, how we think about the economics of multitenancy ( more about that here ), and how we chose virtualization over kernel-level isolation (containers) or language-level isolation for Lambda. Despite these challenges, virtualization provides many compelling benefits. From an isolation perspective, the most compelling benefit is that it moves the security-critical interface from the OS boundary to a boundary supported in hardware and comparatively simpler software. It removes the need to trade off between kernel features and security: the guest kernel can supply its full feature set with no change to the threat model. VMMs are much smaller than general-purpose OS kernels, exposing a small number of well-understood abstractions without compromising on software compatibility or requiring software to be modified. Firecracker has really taken off, in all three ways we hoped it would. First, we use it in many more places inside AWS, backing the infrastructure we offer to customers across multiple services. Second, folks use the open source version directly, building their own cool products and businesses on it. Third, it was the motivation for a wave of innovation in the VM space. In this post, I wanted to write a bit about two of the ways we’re using Firecracker at AWS that weren’t covered in the paper. Bedrock AgentCore Back in July, we announced the preview of Amazon Bedrock AgentCore . AgentCore is built to run AI agents. If you’re not steeped in the world of AI right now, you might be confused by the many definitions of the word agent . I like Simon Willison’s take : An LLM agent runs tools in a loop to achieve a goal. 1 Most production agents today are programs, mostly Python, which use a framework that makes it easy to interact with tools and the underlying AI model. My favorite one of those frameworks is Strands , which does a great job of combining traditional imperative code with prompt-driven model-based interactions. I build a lot of little agents with Strands, most being less than 30 lines of Python (check out the strands samples for some ideas ). So where does Firecracker come in? AgentCore Runtime is the compute component of AgentCore. It’s the place in the cloud that the agent code you’ve written runs. When we looked at the agent isolation problem, we realized that Lambda’s per-function model isn’t rich enough for agents. Specifically, because agents do lots of different kinds of work on behalf of different customers. So we built AgentCore runtime with session isolation . Each session with the agent is given its own MicroVM, and that MicroVM is terminated when the session is over. Over the course of a session (up to 8 hours), there can be multiple interactions with the user, and many tool and LLM calls. But, when it’s over the MicroVM is destroyed and all the session context is securely forgotten. This makes interactions between agent sessions explicit (e.g. 
via AgentCore Memory or stateful tools), with no interactions at the code level, making it easier to reason about security. Firecracker is great here, because agent sessions vary from milliseconds (single-turn, single-shot, agent interactions with small models), to hours (multi-turn interactions, with thousands of tool calls and LLM interactions). Context varies from zero to gigabytes. The flexibility of Firecracker, including the ability to grow and shrink the CPU and memory use of VMs in place, was a key part of being able to build this economically. Aurora DSQL We announced Aurora DSQL, our serverless relational database with PostgreSQL compatibility, in December 2024. I’ve written about DSQL’s architecture before , but here I wanted to highlight the role of Firecracker. Each active SQL transaction in DSQL runs inside its own Query Processor (QP), including its own copy of PostgreSQL. These QPs are used multiple times (for the same DSQL database), but only handle one transaction at a time. I’ve written before about why this is interesting from a database perspective. Instead of repeating that, let’s dive down to the page level and take a look from the virtualization level. Let’s say I’m creating a new DSQL QP in a new Firecracker for a new connection in an incoming database. One way I could do that is to start Firecracker, boot Linux, start PostgreSQL, start the management and observability agents, load all the metadata, and get going. That’s not going to take too long. A couple hundred milliseconds, probably. But we can do much better. With clones . Firecracker supports snapshot and restore , where it writes down all the VM memory, registers, and device state into a file, and then can create a new VM from that file. Cloning is the simple idea that once you have a snapshot you can restore it as many times as you like. So we boot up, start the database, do some customization, and then take a snapshot. When we need a new QP for a given database, we restore the snapshot. That’s orders of magnitude faster. This significantly reduces creation time, saving the CPU used for all that booting and starting. Awesome. But it does something else too: it allows the cloned microVMs to share unchanged ( clean ) memory pages with each other, significantly reducing memory demand (with fine-grained control over what is shared). This is a big saving, because a lot of the memory used by Linux, PostgreSQL, and the other processes on the box aren’t modified again after start-up. VMs get their own copies of pages they write to (we’re not talking about sharing writable memory here), ensuring that memory is still strongly isolated between each MicroVM. Another knock-on effect is the shared pages can also appear only once in some levels of the CPU cache hierarchy, further improving performance. There’s a bit more plumbing that’s needed to make some things like random numbers work correctly in the cloned VMs 2 . Last year, I wrote about our paper Resource management in Aurora Serverless . To understand these systems more deeply, let’s compare their approaches to one common challenge: Linux’s approach to memory management. At a high level, in stock Linux’s mind, an empty memory page is a wasted memory page. So it takes basically every opportunity it can to fill all the available physical memory up with caches, buffers, page caches, and whatever else it may think it’ll want later. This is a great general idea.
But in DSQL and Aurora Serverless, where the marginal cost of a guest VM holding onto a page is non-zero, it’s the wrong one for the overall system. As we say in the Aurora serverless paper , Aurora Serverless fixes this with careful tracking of page access frequency: − Cold page identification: A kernel process called DARC [8] continuously monitors pages and identifies cold pages. It marks cold file-based pages as free and swaps out cold anonymous pages. This works well, but is heavier than what we needed for DSQL. In DSQL, we take a much simpler approach: we terminate VMs after a fixed period of time. This naturally cleans up all that built-up cruft without the need for extra accounting. DSQL can do this because connection handling, caching, and concurrency control are handled outside the QP VM. In a lot of ways this is similar to the approach we took with MVCC garbage collection in DSQL. Instead of PostgreSQL’s , which needs to carefully keep track of references to old versions from the set of running transactions, we instead bound the set of running transactions with a simple rule (no transaction can run longer than 5 minutes). This allows DSQL to simply discard versions older than that deadline, safe in the knowledge that they are no longer referenced. Simplicity, as always, is a system property.
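As a rough illustration of the snapshot step described above: Firecracker exposes its API over a Unix socket, so pausing a microVM and writing a full snapshot looks roughly like the sketch below. The endpoint paths and field names follow the public Firecracker API documentation as I remember it; treat them as assumptions and double-check against the current spec before relying on this.

```python
import http.client
import json
import socket

class FirecrackerAPI(http.client.HTTPConnection):
    """Minimal HTTP-over-Unix-socket client for Firecracker's API server."""
    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock

def snapshot(socket_path: str, snapshot_path: str, mem_path: str) -> None:
    api = FirecrackerAPI(socket_path)
    headers = {"Content-Type": "application/json"}
    # The VM has to be paused before a consistent snapshot can be taken.
    api.request("PATCH", "/vm", json.dumps({"state": "Paused"}), headers)
    api.getresponse().read()
    api.request("PUT", "/snapshot/create", json.dumps({
        "snapshot_type": "Full",
        "snapshot_path": snapshot_path,
        "mem_file_path": mem_path,
    }), headers)
    api.getresponse().read()
```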

0 views
Justin Duke 4 weeks ago

Another reason our pytest suite is slow

I wrote two days ago about how our pytest suite was slow, and how we could speed it up by blessing a suite-wide fixture that was scoped to . This was true. But, like a one-year-old with a hammer, I found myself so gratified by the act of swinging that I found myself also trying to pinpoint another performance issue: why does it take so long to run a single smoke test

0 views
Justin Duke 1 months ago

Why our pytest suite is slow

The speed of Buttondown's pytest suite (which I've written about here , here , and here ) is a bit of a scissor for my friends and colleagues: depending on who you ask, it is (at around three minutes when parallelized on Blacksmith) either quite fast given its robustness or unfathomably slow

0 views
Brain Baking 1 months ago

On Having the Patience To Solitaire

Traditionally, the magical realm of trick taking and card gaming in general could only be experienced after gathering three other contestants willing to sit together. In part one of this series on card games, we explored some traditional Flemish variants always played with four—although Jokeren is an exception. Games that require a team such as Kwajongen or Spades simply cannot be played without inviting others to the table. And yet, that doesn’t mean that the deck of playing cards has to stay inside the cupboard if you happen to be alone. There’s only one requirement besides the deck itself: Patience . It only recently dawned on me that Patience and Solitaire mean the same thing: the former is the common name for solo tile-laying card games in Europe while the latter seems to be used the most in the USA and Canada. For me, Patience will always equal Klondike , as in 1990 a digital variant of that particular card game shipped with Windows 3.0: Playing Solitaire on Windows 3.1 with the spooky castle card backs. You can relive those nostalgic moments right here in your browser using PCjs . As visible in the Windows title bar in the above screenshot, PCjs boots an American English version of Windows 3(.1), so if you’re in Europe, you’ll have to imagine it reads Patience instead. Susan Kare designed the card faces. Her webshop at kareprints.com even has some prints still available. I urgently need to convince my wife that we have to hang a Kare Queen of Hearts prominently in the hallway, because reasons. While exploring the patience realm, I discovered that there are endless variations of the individual systematic card arranging exercise, of which undoubtedly Klondike is the most popular. The 42nd edition of Hoyle’s Official Rules published in 1943 mentions an impressive 19 and includes classics ( Cribbage , Klondike , Forty Thieves ) as well as ones obscure to me ( Idiot’s Delight , Streets and Alleys , Rainbow ). When I was little, the neighbour of my grandparents—the same ones who introduced me to card gaming—taught me how to lay down a “clock”. In The Clock , you lay down twelve stacks of four cards each, and a thirteenth start pile in the centre. After flipping and removing a card from the start pile, you flip a card from the pile lying at the hour matching that card’s value. The objective is simply to reveal all the cards before the fourth king shows up and ends the game. I forgot about this game for nearly two decades. Back then, preparing The Clock tableau felt special: would I be able to make it this time? I think it felt special because I was enamoured with card games thanks to my grandparents. Thinking about The Clock now leaves me very much unimpressed: it’s not even a game. You have zero input on it, it’s a pure game of chance. Flip a card, move to the next value. Oh, tough luck, a king. Even implementing a condensed The Clock solver in Python takes less than twenty lines of code: S, K, H, and R are the Dutch abbreviations for the suits (Schuppen, Klaveren, Harten, Ruiten). Running this simply results in or . Yay, I won. Oh no, I lost. Luckily, not all Patience games are as depressingly deterministic as The Clock . While Klondike is all right, alternatives like Spider Solitaire and FreeCell have a much higher skill level, meaning your decisions in placing the cards do matter. The Nintendo DS card game collection Ultimate Card Games —a quite decent one worth your time—even contains some stats for each Patience type, informing the player if it’s a more skill-based or more luck-based game they’re engaging in.
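The condensed Clock solver mentioned above didn't survive the trip into this page, so here is a minimal reconstruction (not the original code): it deals thirteen piles of four, starts from the centre pile, and lets each revealed card's rank dictate the next pile. Suits are carried along only so the Dutch abbreviations make sense; they never affect the outcome.

```python
import random

SUITS = "SKHR"  # Schuppen, Klaveren, Harten, Ruiten

def play_clock() -> bool:
    deck = [(rank, suit) for suit in SUITS for rank in range(1, 14)]
    random.shuffle(deck)
    piles = [deck[i * 4:(i + 1) * 4] for i in range(13)]  # piles[12] is the centre
    current, revealed = 12, 0
    while piles[current]:
        rank, _suit = piles[current].pop()
        revealed += 1
        current = rank - 1  # the card's value points at the next pile (a king sends you back to the centre)
    return revealed == 52   # won only if every card was turned over

wins = sum(play_clock() for _ in range(10_000))
print(f"Won {wins} of 10000 games ({wins / 100:.1f}%)")  # roughly 1 in 13
```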
A session of Spider Solitaire in Cosmigo's Ultimate Card Games. Note the 3D-rendered table space on the top screen where the card tableau is faithfully mimicked. If you’re the kind of person that exclusively rolls on Patience, you can alternatively check out Cosmigo’s Solitaire Overload DS game boasting more than thirty tableau laying challenges. I much prefer Ultimate Card Games ’s 3D effort and trick taking inclusions but Overload is there in case you just can’t get enough. After all, who loves playing these Patience games except for bored teenagers attending computer classes that stumbled upon after aimlessly wandering through the Windows start menu 1 ? Even that is a thing of the past as Microsoft, in their infinite wisdom, decided that the game should no longer be shipped with the OS. Patience card games are making a comeback thanks to the rising interest in board gaming in general. For instance, Isaludo is a free document describing ten engaging modern Patience alternatives to the worn-out Hoyle era variants. One of those, The Emissary , is the precursor to For Northwood! , a solo trick taking card game independently released and available at your local game store—once they finally manage to reprint it, that is. Yes, you’ve read that right: it’s not a tableau laying game but a trick taking one for just one person! That alone should be enough to check it out. For Northwood! is not the only “modern” solo card game originating in the standard 52-deck space. Another big hitter is the recent Regicide . You don’t need to buy anything to play the game: just parse the official rules and get going. The remarkable thing here is, again, that it is not a dry tableau lying game. In Regicide , you’re fighting back against the ancient regime, trying to overthrow the jacks, queens, and kings, by recruiting other cards acting as your army in the “tavern” (the drawing deck). Regicide ’s mechanics are as simple as they are clever, but don’t expect a walk in the park: by the time you face the first few kings, you’ll be scrambling to do enough damage and defend from incoming attacks. It’s amazing to see how much creativity sparks from a simple deck of classic playing cards. Do yourself a favour and dust off that old pack you have laying around. Skip that stupid Clock and pick up a(n imaginative) sword instead. If you find yourself wanting more, the upcoming Regicide Legacy variant will be sure to keep you busy. And then there’s the print-and-play realm where all you need is a printer, a standard deck of cards, and a pen to either make your way through a dungeon ( 52 Realms: Adventures ) or shoot your way out of a saloon ( Fliptown ). The latter, also sold as a boxed game, recently went through yet another successful Kickstarter announcing a standalone expansion called Fliptown: New Frontier . Of course I took out my wallet and pressed that pledge button. I’ll let you know in a few months if it was worth it. This article is part three in a series on trick taking and card games . Stay tuned for the next part! Smarter kids will secretly install or find and start a local multiplayer session using the Microsoft Hearts Network . Windows for Workgroups 3.11 was the first “network-ready” Windows where Hearts was used to showcase simultaneous network play.  ↩︎ Related topics: / card games / trick taking / screenshots / By Wouter Groeneveld on 14 September 2025.  Reply via email .

0 views