Posts in Python (20 found)
iDiallo 2 days ago

Boredom is the Gatekeeper

That first Monday of my holiday break, I made a promise to myself. No work emails, no side projects, not even glancing at my blog. This time was for family, for Netflix queues, for rereading dog-eared novels. One thing I was really looking forward to was learning something new, a new skill. Not for utility, but purely for curiosity. I wanted to learn about batteries. They power our world, yet they're a complete mystery to me. I only vaguely remember what I learned in high school decades ago. This would be the perfect subject for me. I went straight to a website I had bookmarked years ago in a fit of intellectual ambition: BatteryUniversity.com. I started with the chemistry of lead acid batteries. I was ready to be enlightened. Twenty minutes later, I was three paragraphs in, my mind adrift. The text was dense, packed with terms like "lead-antimony" and "acid-starved." My finger twitched. Then I read this: the sealed lead acid battery is designed with a low over-voltage potential to prohibit the battery from reaching its gas-generating potential during charge. I thought, wouldn't this be easier to understand as a YouTube video? A nice animation? I clicked away. It seemed like I had just met the gatekeeper, and it had turned me away. I was bored. We talk about boredom as if it's the absence of stimulation. Having nothing to do. But in our hyperconnected world, where information is constantly flowing and distractions are a finger tap away, true emptiness is rare. Modern boredom isn't having nothing to do. I had plenty of material to go over. Instead, it's the friction of deep focus. It's the resistance you feel when you move from consuming information to building those neural connections in your brain. Learning feels slow and hard, and it is ungratifying compared to dopamine-induced YouTube videos. Have you ever watched a pretty good video on YouTube and learned nothing from it? This reaction to learning the hard way, masquerading as boredom, is the gatekeeper. And almost every important skill in life lives on the other side of that gate. When I started working for an AI startup, I was fascinated by what we were able to accomplish with a team of just two engineers. It looked like magic to me at first. You feed the AI some customer's message, and it tells you exactly what this person needs. So, to be an effective employee, I decided to learn profoundly about the subject. Moving from just a consumer of an API to a model creator made the process look un-magical. It started with spreadsheets where we cleaned data. There was a loss function that stubbornly refused to budge for hours. There was staring at a single Python error that said the tensor dimensions don't align. The boring part was the meticulous engineering upon which the magic is built. I find it fascinating now, but it was frustrating at the time, and I had to force myself to learn it. Like most developers, video games inspired me to become a programmer. I wanted to code my own game from scratch. I remember playing Devil May Cry and thinking about how I would program those boss battles. But when I sat with a keyboard and the cursor on my terminal flashed before me, I struggled to move a gray box on the screen using SDL. For some reason, when I pressed arrow keys, the box jittered instead of following a straight line. I would spend the whole day reading OpenGL and SDL documentation only to fix a single bug. Boredom was going through all this documentation, painfully, only to make small incremental progress. 
When you start a business, the gatekeeper shows its face. It stares back at you when you open that blank document and write a single line of text in it: My idea. For indie developers, it's the feeling you get when you build the entire application and feel compelled to start over rather than ship what you've built. This boredom is the feeling of creation from nothing, which is always harder than passive consumption. We've conflated "interesting" with "easy to consume." The most interesting things in the world, like building software, writing a book, mastering a craft, understanding a concept, are never easy to produce. Their initial stages are pure effort. Gamification tries to trick us past the gatekeeper with points and badges, but that's just putting a costume on it. The real work remains. There is no way around it. You can't eliminate that feeling. Instead, you have to recognize it for what it is and push through. When you feel that itchy tug toward a distracting tab, that's the gatekeeper shaking its keys. It's telling you that what you're doing is really hard, and it would be easier to just passively consume it. You might even enjoy the process without ever learning anything. Instead, whenever you feel it, set a timer for 25 minutes. Agree to wrestle with the battery chemistry, the Python error, or the empty page. Just for that short time span. There is no dopamine hit waiting on the other side of boredom like you get from passive consumption. Instead, the focus, the struggle, the sustained attention, that's the process of learning. The gatekeeper ensures only those willing to engage in the hard, quiet work of thinking get to the good stuff. I did not become a battery expert over the holidays. But at least I learned to recognize the gatekeeper's face. Now, when I feel that familiar, restless boredom descend as I'm trying to learn something hard, I smile a little. I know I'm at the threshold. And instead of turning back, I take a deep breath, set my timer to 25 minutes, and I power through the gate.

2 views
Rob Zolkos 2 days ago

So where can we use our Claude subscription then?

There’s been confusion about where we can actually use a Claude subscription. This comes after Anthropic took action to prevent third-party applications from spoofing the Claude Code harness to use Claude subscriptions. The information in this post is based on my understanding from reading various tweets, official GitHub repos and documentation (some of which may or may not be up to date). I will endeavour to keep it up to date as new information becomes available. I would love to see Anthropic themselves maintain an easily parsable page like this that shows what is and is not permitted with a Claude subscription.

Anthropic's statement reads: "We've taken action to prevent third-party clients from spoofing the Claude Code agent harness to use consumer subscriptions. Consumer subscriptions and their benefits should only be used in the Anthropic experiences they support (Claude Code CLI, Claude Code web, and via sessionKey in the Agent SDK). Third-party apps can use the API."

From what I can gather, consumer subscriptions work with official Anthropic tools, not third-party applications. If you want third-party integrations, you need the API.

The consumer applications (desktop and mobile) are the most straightforward way to use your Claude subscription. Available at claude.com/download, these apps give you direct access to Claude for conversation, file uploads, and Projects. The official command-line interface for Claude Code is fully supported with Claude subscriptions. This is the tool Anthropic built and maintains specifically for developers who want to use Claude in their development workflow. You get the full power of Claude integrated into your terminal, with access to your entire codebase, the ability to execute commands, read and write files, and use all the specialized agents that come with Claude Code. The web version of Claude Code (accessible through your browser at claude.ai/code) provides the same capabilities as the CLI but through a browser interface. Upload your project files, or point it at a repository, and you can work with Claude on your codebase directly.

Want to experiment with building custom agents? The Claude Agent SDK lets you develop and test specialized agents powered by your Claude subscription for personal development work. The SDK is available in both Python and TypeScript, with documentation here. This is for personal experiments and development. For production deployments of agents, use the API instead of your subscription. You can also use your Claude subscription to run automated agents in GitHub Actions. The Claude Code Action lets you set up workflows that leverage Claude for code review, documentation generation, or automated testing analysis. Documentation is here. Any other uses of Claude would require the use of API keys.

In short, your Claude subscription gives you: Claude desktop and mobile apps for general use, the Claude Code CLI for terminal-based development, Claude Code on the web for browser-based work, the ability to build custom agents through the official SDK (for personal development), and the Claude Code GitHub Action for CI/CD integration. Let me know if you have any corrections.

1 view

Letting Claude Play Text Adventures

The other day I went to an AI hackathon organized by my friends Lucia and Malin. The theme was mech interp, but I hardly know PyTorch so I planned to do something at the API layer rather than the model layer. Something I think about a lot is cognitive architectures (like Soar and ACT-R). This is like a continuation of GOFAI research, inspired by cognitive science. And like GOFAI it’s never yielded anything useful. But I often think: can we scaffold LLMs with cog arch-inspired harnesses to overcome their limitations? LLM agents like Claude Code are basically “accidental” cognitive architectures: they are designed and built by practitioners rather than theorists, but they have commonalities; they all need a way to manage memory, tool use, a task agenda, etc. Maybe building an agent on a more “principled” foundation, one informed by cognitive science, yields a higher-performing architecture.

So I sat around a while thinking how to adapt Soar’s architecture to an LLM agent. And I sketched something out, but then I thought: how can I prove this performs better than baseline? I need an eval, a task. Math problems? Too one-shottable. A chatbot? Too interactive, I want something hands-off and long-horizon. A coding agent? That’s too freeform and requires too much tool use. And then I thought: text adventures! You have a stylized, hierarchically-structured world accessible entirely through text, long-term goals, puzzles, physical exploration and discovery of the environment. Even the data model of text adventures resembles frame-based knowledge representation systems. And there’s a vast collection of games available online.

Anchorhead, which I played years ago, is a Lovecraft-inspired text adventure by Michael S. Gentry. It takes on the order of hundreds of turns to win, across multiple in-game days. And the game world is huge and very open. In other words: a perfect long-horizon task. So I started hacking. The frotz interpreter runs on the command line and has a “dumb” interface, dfrotz, which takes the ncurses fluff out and gives you a very stripped command-line experience. It is easy to write a little Python wrapper that drives the interpreter through its standard input and output, so we can play the game from Python: send commands, get game output.

Now we need the dual of this: a player. The trivial harness is basically nothing at all: treat the LLM/game interaction like a chat history. The LLM reads the game output from the interpreter, writes some reasoning tokens, and writes a command that is sent back to the interpreter. And this works well enough. Haiku 4.5 would mostly wander around the game map, but Sonnet 4.5 and Opus 4.5 manage to solve the game’s first puzzle—breaking into the real estate office, and finding the keys to the mansion—readily enough. It takes about ~200 turns for Claude to get to the second in-game day.

The way I thought this would fail is: attention gets smeared across the long context, the model gets confused about the geometry of the world, its goal and task state, and starts confabulating, going in circles, etc. As usual, I was outsmarting myself. The reason this fails is you run out of credits. By the time you get to day two, each turn costs tens of thousands of input tokens. No good! We need a way to save money. Ok, let’s try something that’s easier on my Claude credits.
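(For reference, a minimal version of that dfrotz wrapper, assuming dfrotz is on your PATH and using a placeholder game file name, can be as simple as a subprocess with two pipes:)

```python
import subprocess

class ZMachine:
    """Drive dfrotz over pipes: send a command, read back the game's output."""

    def __init__(self, game_file):
        self.proc = subprocess.Popen(
            ["dfrotz", game_file],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        )

    def read(self):
        # Crude heuristic: read until the game's '>' prompt shows up.
        out = []
        while True:
            ch = self.proc.stdout.read(1)
            if not ch:
                break
            out.append(ch)
            if "".join(out).rstrip().endswith(">"):
                break
        return "".join(out)

    def send(self, command):
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()
        return self.read()

game = ZMachine("anchorhead.z8")   # placeholder game file name
print(game.read())                 # opening text
print(game.send("look"))
```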
We’ll show Claude the most recent five turns (this is the perceptual working memory), and give it a simple semantic memory: a list of strings that it can append entries to, and remove entries from, using tool use. This keeps the token usage down. The problem is the narrow time horizon. With the trivial harness, Claude can break into the real estate office in ~10 turns, and does so right at the start of the game. With this new harness, Claude wanders about the town, taking copious notes, before returning to the real estate office, and it spends ~40 turns fumbling around with the garbage cans before managing to break into the real estate office. The next step, after getting the keys to the house, is to meet your husband Michael at the University and head home. Claude with the trivial harness takes about ~100 turns to find the house, with some tangential wandering about the town, and reaches day two around turn 150. Claude with the memory harness took ~250 turns just to get the keys to the house. And then it spends hundreds of turns just wandering in circles around the town, accumulating redundant memories, and hits the turn limit before even finding the house.

Anchorhead is a long, broad game, and from the very beginning you can forget the plot and wander about most of the town. It takes a long time to see if a run with an agent goes anywhere. So I thought: I need something smaller. Unsurprisingly, Claude can make its own games. The Inform 7 package for NixOS was broken (though Mikael has fixed this recently) so I had to use Inform 6. I started with a trivial escape-the-room type game, which was less than 100 lines of code, and any Claude could beat it in less than 10 turns. Then I asked for a larger, multi-room heist game. This one was more fun. It’s short enough that Claude can win with just the trivial harness.

I tried a different harness, where Claude has access to only the last five turns of the game’s history, and a read-write memory scratchpad. And this one was interesting. First, because Claude only ever adds to its own memory, it never deletes memories. I thought it would do more to trim and edit its scratchpad. Second, because Claude became fixated on this red-herring room: a garden with a well. It kept going in circles, trying to tie a rope to the well and climb down. Because of the limited game history, it only realized it was stuck when it saw that the most recent ~20 entries it wrote to its memories related to various attempts to go down the well. Then I watched Claude walk away from the garden and solve the final puzzle, and hit the turn limit just two turns short of winning. Tangent: I wonder if models are better at playing games created by other instances of the same model, by noticing tiny correlations in the text to infer what puzzles and obstacles they would have written.

In the end I abandoned the “small worlds” approach because the games are too stylized, linear, and uninteresting. Anchorhead is more unwieldy, but more natural. I have a bunch of ideas I want to test, to better learn how harness implementations affect performance. But I’m short on time, so I’m cutting it here and listing these as todos. (The repository is here.)

Domain-Specific Memories: Claude’s notes are all jumbled with information on tasks, locations, etc. It might be better to have separate memories: a todo list, a memory of locations and their connections, etc. This is close to the Soar approach.
Automatic Geography: related to the above, the harness can inspect the game output and build up a graph of rooms and their connections, and format it in the context. This saves Claude having to note those things manually using a tool.

Manual Geography: the automatic geography approach has a few drawbacks. Without integration into the Z-machine interpreter, it requires some work to implement (parsing the current location from the output, keeping track of the command history to find standard travel commands), and it isn’t 100% deterministic, so mazes and dynamic rooms (e.g. elevators) will confuse the system. So, instead of doing it automatically, we could give Claude a dedicated mapping tool to call.

Episodic Memory: this feels like cheating, but, at the end of a run, you can show Claude the session transcript and ask it to summarize what it accomplished and how, and where it failed and why, including a short walkthrough for how to get to the “last successful state”. This allows future runs to save time in getting up to speed.

0 views
<antirez> 3 days ago

Don't fall into the anti-AI hype

I love writing software, line by line. It could be said that my career was a continuous effort to create well-written, minimal software, where the human touch was the fundamental feature. I also hope for a society where the last are not forgotten. Moreover, I don't want AI to succeed economically, and I don't care if the current economic system is subverted (I could be very happy, honestly, if it went in the direction of a massive redistribution of wealth). But I would not respect myself and my intelligence if I let my ideas about software and society impair my vision: facts are facts, and AI is going to change programming forever.

In 2020 I left my job in order to write a novel about AI, universal basic income, and a society adapting to the automation of work while facing many challenges. At the very end of 2024 I opened a YouTube channel focused on AI, its use in coding tasks, and its potential social and economic effects. But while I recognized what was going to happen very early, I thought that we had more time before programming would be completely reshaped: at least a few years. I no longer believe this is the case.

Recently, state-of-the-art LLMs have become able to complete large subtasks or medium-sized projects alone, almost unassisted, given a good set of hints about what the end result should be. The degree of success you'll get is related to the kind of programming you do (the more isolated, and the more textually representable, the better: system programming is particularly apt), and to your ability to create a mental representation of the problem to communicate to the LLM. But, in general, it is now clear that for most projects, writing the code yourself is no longer sensible, if not to have fun.

In the past week, just prompting, and inspecting the code to provide guidance from time to time, I did the following four tasks in hours instead of weeks:

1. I modified my linenoise library to support UTF-8, and created a framework for line-editing testing that uses an emulated terminal able to report what is displayed in each character cell. Something that I always wanted to do, but it was hard to justify the work needed just to test a side project of mine. But if you can just describe your idea, and it materializes in the code, things are very different.

2. I fixed transient failures in the Redis tests. This is very annoying work: timing-related issues, TCP deadlock conditions, and so forth. Claude Code iterated for all the time needed to reproduce them, inspected the state of the processes to understand what was happening, and fixed the bugs.

3. Yesterday I wanted a pure C library able to do inference with BERT-like embedding models. Claude Code created it in 5 minutes: the same output as PyTorch, at nearly the same speed (about 15% slower), in 700 lines of code, plus a Python tool to convert the GTE-small model.

4. In the past weeks I made changes to the Redis Streams internals. I had a design document for the work I did. I gave it to Claude Code and it reproduced my work in, like, 20 minutes or less (mostly because I'm slow at checking and authorizing the commands it needed to run).

It is simply impossible not to see the reality of what is happening. Writing code is no longer needed for the most part. It is now a lot more interesting to understand what to do, and how to do it (and, about this second part, LLMs are great partners, too). It does not matter if AI companies will not be able to get their money back and the stock market crashes.
All that is irrelevant, in the long run. It does not matter if this or that CEO of some unicorn is telling you something that is off-putting, or absurd. Programming has changed forever, anyway. How do I feel about all the code I wrote that was ingested by LLMs? I feel great to be part of that, because I see this as a continuation of what I tried to do all my life: democratizing code, systems, knowledge. LLMs are going to help us write better software, faster, and will allow small teams to have a chance to compete with bigger companies. The same thing open source software did in the 90s.

However, this technology is far too important to be in the hands of a few companies. For now, you can do the pre-training better or not, you can do reinforcement learning in a much more effective way than others, but the open models, especially the ones produced in China, continue to compete (even if they are behind) with the frontier models of the closed labs. There is a sufficient democratization of AI, so far, even if imperfect. But it is absolutely not obvious that it will stay like that forever. I'm scared of the centralization. At the same time, I believe neural networks, at scale, are simply able to do incredible things, and that there is not enough "magic" inside current frontier AI for the other labs and teams not to catch up (otherwise it would be very hard to explain, for instance, why OpenAI, Anthropic and Google have been so near in their results for years now).

As a programmer, I want to write more open source than ever now. I want to improve certain repositories of mine abandoned for lack of time. I want to apply AI to my Redis workflow: improve the Vector Sets implementation and then other data structures, like I'm doing with Streams now. But I'm worried for the folks that will get fired. It is not clear what the dynamic at play will be: will companies try to have more people, and to build more? Or will they try to cut salary costs, having fewer programmers that are better at prompting? And there are other sectors where humans will become completely replaceable, I fear. What is the social solution, then? Innovation can't be taken back, after all. I believe we should vote for governments that recognize what is happening, and are willing to support those who will remain jobless. And the more people get fired, the more political pressure there will be to vote for those who will guarantee a certain degree of protection. But I also look forward to the good AI could bring: new progress in science that could help lower the suffering of the human condition, which is not always happy.

Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five-minute test where you just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months. Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched.

54 views

Most Code is Just Cache

Claude Code has systematically begun to consume many of the SaaS apps I used to (or plan to) pay for. Why pay a subscription when I can "vibe code" a personal MVP in twenty minutes? I don’t worry about maintenance or vendor lock-in because, frankly, the code is disposable. If I need a new feature tomorrow, I don’t refactor—I just rebuild it. [1] Code is becoming just an ephemeral cache of my intent. (Cartoon via Nano Banana.)

In this model, the ‘Source Code’ is the prompt and the context; the actual Python or Javascript that executes is just the binary. We still run the code because it’s thermodynamically efficient and deterministic, but we treat it as disposable. If the behavior needs to change, we don’t refactor the binary; we re-compile the intent. This shift has made me intolerant of static interfaces. I have stopped caring about software that doesn’t let me dump massive amounts of context into Gemini or Claude to just do the thing. If a product forces me to click buttons to execute a process that an LLM could intuit from a prompt, that product is already legacy. It forces us to question the permanence of the current model. We often make the mistake of assuming software—as we know it today—is a permanent fixture of human productivity. But if you zoom out, the era of SaaS is a blink of an eye in modern history. It is easy to overestimate how core it is to the future. In this post, I want to extrapolate these thoughts a bit and write out what could be the final stages of software.

The stages here might not necessarily be chronological or mutually exclusive. Instead, they are ordered from static to dynamic code generation — where more and more the intent of a customer is the software they use.

This is the baseline where software is a static artifact sold as a service, built on the assumption that user problems are repetitive and predictable enough to be solved by rigid workflows. To the consumer, this looks like dashboards, CRUD forms, and hardcoded automations. The intelligence here is sourced mainly from the SaaS founder and hired domain experts, hard-coded into business logic years before the user ever logs in. When: We recognized that distributing software via the cloud was more efficient than on-premise installations. Value Loop: Customer Problem → Product Manager writes PRD → Engineers write Static Code → Deploy → Customer adapts their workflow to the tool. (Time: Months to Years | Fit: Generic / One-size-fits-none)

We are seeing this now with companies adopting the Forward Deployed Engineering (FDE) model. In this stage, the SaaS company hires humans to manually use AI to build bespoke solutions for the client. For the consumer, this feels like a concierge service; they don’t get a login to a generic tool, they get a custom-built outcome delivered by a human who used AI to write the glue code. The intelligence is hybrid: the human provides the architecture, the AI writes the implementation code in weeks to days. When: Companies realize AI allows their employees to build custom apps for clients faster than the clients can learn or adapt a generic tool. Value Loop: Customer Problem → SaaS Employee (FDE) Prompts AI → AI generates Custom Script/App → Employee Deploys for Customer. (Time: Days | Fit: High / Tailored to specific customer edge cases)

This is the current “safe space” for most tech companies, where they bolt an LLM onto an existing application to handle unstructured data.
Consumers experience this as a “Draft Email” button in their CRM or a “Chat” sidebar in their UI—the platform is still the main product, but AI is a feature that (hopefully) reduces friction and/or provides some extra functionality customization. [2] The intelligence comes from a constrained model of product design and LLM scaffolding, providing content within a structure still strictly dictated by the SaaS platform’s code. When: People start to see AI is good at summarizing, generating content, or taking actions within existing workflows. Value Loop: Customer Problem → Static SaaS Interface AI Feature Text Box → Stochastic Result → Human Review. (Time: Minutes | Fit: Medium / Constrained by the platform’s UI)

This is the tipping point where the software interface starts to disappear, because the “interface” was just a way to collect context that the model can now ingest directly. Consumers move to a “Do this for me” interface where intent maps directly to an outcome rather than a button click, often realized as an agent calling a database or MCP servers. [3] The intelligence is the model and its engineered input context, relegating the SaaS role, in some sense, to providing clean proprietary data via an agent-friendly interface. Software as a Service for Agents. When: People start to see AI is good at orchestrating complex decisions and using tools—across SaaS platforms—autonomously. Value Loop: Customer Problem (Prompt as ~PRD) → Runtime Code Generation → Dynamic Outcome. (Time: Real-time | Fit: Very High / Dynamically generated for the specific context) Critically, this doesn't mean the LLM acts as the CPU for every single user interaction (which would be latency-poor and energy-inefficient). Instead, the model almost acts as a Just-In-Time compiler. It generates the necessary code to execute the user’s intent, runs that code for the session, and then potentially discards it.

This is the end game in some cases. If code is just a cache for intent, eventually we bypass the cache and bake the intuition directly into the model. To the consumer, the “tool” is invisible; the expert system simply exists and provides answers or actions without a login or workflow. The intelligence is in the model itself; the software platform exists solely as a distillation mechanism—a gym to train the vertical AI—and once the model learns the domain, the software is no longer needed. A company in this stage is not really even SaaS anymore, but maybe more of an AI-gyms-aaS company. When: People start to see AI is good at absorbing the entire vertical’s intuition. Value Loop: Raw Domain Data → Reinforcement Learning / Fine-Tuning → Model Weights. (Time: Instant / Pre-computed | Fit: Very High / Intuitive domain mastery)

This might feel unintuitive as a stage — like how could you bake some proprietary data lake into a model? How can our juicy data not be the moat? My conclusion is that most (but not all) data is a transformation of rawer upstream inputs, and that these transformations (data pipelines, cross-tenant analysis, human research, etc.) are all “cache” that can be distilled into a more general model that operates on its intuition and upstream platform inputs.

“But can agents run a bank?” Reliability and safety come down to distinguishing between guardrails (deterministic interfaces and scaffolding) and runtime execution (LLM code). For now, you don’t let the LLM invent the concept of a transaction ledger or rewrite the core banking loop on the fly.
In XX years, maybe we do trust AI to write core transaction logic; after all, fallible humans wrote the code for most mission-critical software that exists today. The line between human-defined determinism and agent symbolic interfaces will gradually move over time.

“But enterprise SaaS is actually super complex.” Yes, but that complexity is mostly just unresolved ambiguity. Your “deep enterprise understanding” is often a collection of thousands of edge cases—permissions, policy exceptions, region-specific rules—that humans had to manually hard-code into IF/ELSE statements over a decade. Distilled to the core, this complexity collapses. The model doesn’t need 500 hard-coded features; it needs the raw data and the intent. An app built for one can also make a lot of simplifications compared to one that acts as a platform.

“Customers don’t want to prompt features.” I agree. I don’t think the future looks like a chatbot. “Chat” is a skeuomorphic bridge we use because we haven’t figured out the consistent native interface yet. It might be a UI that pre-emptively changes based on your role, or it might feel like hiring a really competent employee who just “takes care of it” without you needing to specify the details. Or, as we see in Stage 2, the user never prompts at all—an FDE does it for them, and the user just gets a bespoke app that works perfectly.

Is traditional SaaS already legacy, then? Stage 1, where most companies are stuck today, definitely is. Why? Because the sheer overhead of traditional SaaS—the learning curve, the rigid workflows, the "click tax" to get work done—is becoming unacceptable in a world where intent can be executed directly. It feels increasingly archaic when flexible solutions can be generated on demand. The value is moving away from the workflow logic itself and toward two specific layers that sandwich it:

The Data Layer: Proprietary data, trust, and the “agentic scaffolding” that allows models to act safely within your domain.

The Presentation Layer: Brand and UI. While I suspect trying to control the presentation layer long-term is futile (as users will eventually bring their own “interface agents” to interact with your data), for now, it remains a differentiator.

We are going to see companies move through these tiers. The winners IMO will be the ones who realize that the "Service" part of SaaS is being replaced by model intelligence. The SaaS that remains will be the infrastructure of truth and the engine of agency. We are transitioning from a world of static artifacts (code that persists for years) to dynamic generations (code that exists for milliseconds or for a single answer). Of course, I could be wrong. Maybe AI capability plateaus before it can fully integrate into complex verticals. Maybe traditional SaaS holds the line at Stage 2 or 3, protecting its moat through sheer inertia. Maybe the world ends up more decentralized.

Some of my open questions: Which stage should you work on today? Is there alpha in skipping straight to Stage 4, or do you need to build the Stage 2 “vibe coding” service to bootstrap for now? What are the interfaces of the future? Is it MCP, curated compute sandboxes, or a yet-to-be-defined agent-to-agent-to-human protocol? What interface wins out, or does each company or consumer bring their own agentic worker? How fast does this happen? Are we looking at a multi-decade-long transition, or do companies today rapidly start dropping lower-stage SaaS tools? Does AI have a similar impact beyond software? Does medicine move from “static protocols” to “on-demand, patient-specific treatments”?
[1] Even more so than me, you can see Geoffrey Huntley’s ralph-powered rampage of GitHub and many other tools.

[2] I liked this tweet by Harj Taggar: “moved away from the FDE playbook that’s become the default for fast growing AI startups. Instead they’ve built AI to covert plain English from the customer into Python code to make the product work for their use cases”.

[3] Similar to Karpathy’s “LLMs not as a chatbot, but the kernel process of a new Operating System” (2023).

12 views

I Made A Script To Automate Updating My Feeds List

Today I made a Python script to make my life a little easier when it comes to updating my Feeds page. I follow and un-follow blogs occasionally, so exporting that and then writing it up in Markdown is kind of a bummer. Instead, I have this script, which extracts the name of the blog, the URL, and the feed URL, then outputs it in the terminal so you can copy and paste it as Markdown. Easy to share! Note that I made this specifically for how my .opml file works (I exported mine from Unread). YMMV. The syntax is the script name followed by the OPML file and a tag. That last argument is because I have a tagging system: all of my personal blogs are under a tag I created in my reader. That way, readers of this blog who are interested in social blogging aren't getting stuff like The Verge or Ars Technica on my Feeds page. Only personal, indie blogs.

opml_to_markdown.py
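(The script itself isn't reproduced here, but a sketch of the same idea looks something like this. It assumes a fairly standard OPML export where each feed is an outline element with text, htmlUrl and xmlUrl attributes, nested under a parent outline named after the tag; a real Unread export may differ in detail.)

```python
#!/usr/bin/env python3
"""Sketch of an opml_to_markdown.py-style converter.

Usage: python opml_to_markdown.py Subscriptions.opml blogs
"""
import sys
import xml.etree.ElementTree as ET

def main(opml_path, tag):
    body = ET.parse(opml_path).getroot().find("body")
    for folder in body.findall("outline"):
        if folder.get("text", "").lower() != tag.lower():
            continue  # only the tagged group, e.g. personal/indie blogs
        for feed in folder.findall("outline"):
            name = feed.get("text") or feed.get("title") or "Untitled"
            url = feed.get("htmlUrl", "")
            feed_url = feed.get("xmlUrl", "")
            # One Markdown line per blog, ready to paste into a Feeds page
            print(f"- [{name}]({url}) ([feed]({feed_url}))")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```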

0 views
Giles's blog 1 week ago

Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that: In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way? Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing. So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?" Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. They can train the model in less than four hours, they happen to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation. If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results . Back when I was messing around with fine-tuning LLMs using the Hugging Face ecosystem -- their "Transformers" library and so on -- one of the experiments I did was to fine-tune a 0.5B Qwen model on an 8x GPU machine . As part of that, I came across this excellent HF page summarising different kinds of multi-GPU training techniques . The three that are relevant are: Now, from what I understand, due to all of the copying around of models, plus the issues inherent with the GIL in Python, DDP is actually better than DP despite being more complicated -- and more flexible! Per Hugging Face: DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine. It might be a while before I want to try multi-machine training, but it would be awesome to have code that's ready to do that without needing any extra work. Now, how to implement it? Hugging Face have a library called Accelerate , which does everything for you: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! That does sound very useful, but I worry that by using it I won't learn as much. It also rather ties you in to the HF ecosystem. That's not necessarily a bad thing -- I enjoyed using their stuff in my fine-tuning project -- but I'm trying for a somewhat lower-level view in this series. So, let's use the PyTorch-native stuff. 
There's a "getting started" tutorial, so we can follow that. It has two options for running using DDP, one with a bit of extra setup code -- the first example, under "Basic Use Case" -- and one that uses torchrun to make things easier. The second sounds best. The code changes actually look really simple; given a normal single-GPU training script, you need to do some setup at the start, then wrap the model itself in a DistributedDataParallel object, which is what you actually do the train on, and then do a bit of teardown at the end.

The way to look at this is that torchrun will spin off one process per GPU, each running exactly the same code. They have a "rank", which is an integer saying which of the per-GPU processes they are -- 0 for GPU 0, 1 for GPU 1, and so on. There's a bit of a gotcha here, though -- we're looking at an environment variable called LOCAL_RANK at the start, but we then get a (non-"local") rank from torch.distributed a bit later on. This is due to the multi-machine possibilities with DDP -- if you have multiple machines, then the local rank will be "which GPU on the machine does this process relate to", but there will also be a "global" rank, which is unique across all machines. This distinction won't matter that much during this one-machine test, but it's worth keeping in mind if we want to keep the code in a shape where it could potentially scale to multiple machines. Anyway, after the processes are spun up, they will do their training, and the synchronisation and passing around of gradients during the backward pass will all happen invisibly in the background, so when we do our optimiser step, it will have the full set of gradients. Now that means that we'll presumably also need to use the rank -- that is, which of the n per-GPU processes the current code is running in -- when selecting which dataset items to train on. More about that later.

Let's start writing some code! I'll use a new repo, into which I can put just the code needed for this train. I'll also structure it a little better than last time, with separate "runs", each of which has a model config and training parameters, and will later on have its own checkpoints. You can think of these as being one per machine size that I'm trying out -- I'll create a run directory for each one. Here's a first cut, simply loading up a model config from a run's directory, using it to create the model, and then doing the wrapping above -- no training at all. Running it with torchrun (and uv, as I'm using that for all new projects) works. Promising. Now, unfortunately we only have one GPU locally, and the code assumes that it's one process per GPU (I believe that's a hard limitation for PyTorch's DDP), so running with more than one process blows up. So we can't do an in-depth test locally. But at least we know that the basic infra is there and working.

Now let's move the other training code from the single-GPU script into that file, pretty much blindly. This is the result -- it's doing almost nothing beyond what the last train did, apart from wrapping the model in a DistributedDataParallel object -- the only other changes are to use this "runs" directory that we've introduced. As a quick hack, we should try running it. It does a validation and checkpoint before it starts, and we can make that happen quickly by hacking the validation loop to only do a couple of iterations. (Foreshadowing: that hack will come back to haunt us later!)
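Before running it, here's roughly the shape the script now has -- a sketch of the standard torchrun plus DistributedDataParallel pattern, not the post's exact code:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK (plus RANK, WORLD_SIZE, MASTER_ADDR, ...) per process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(10, 10).to(local_rank)   # stand-in for the real GPT model
    ddp_model = DDP(model, device_ids=[local_rank])

    # ...normal training loop against ddp_model goes here; gradient syncing
    # happens automatically during loss.backward()...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`.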
Running that, then hitting control-C after the validation completes, everything looks OK, and we have what look like solid checkpoints. However, loading one of those checkpoints fails. It turns out that the problem is in the code that saves it: the model we're saving is the DDP wrapper around our real model. My guess is that it does actually include all of the weights for the model, hence the correct-looking size for the checkpoint file, but they're renamed -- the wrapper sees the underlying model as an attribute called module, so every key in the saved state dict gains a module. prefix. Fixing that, with a small diff, sorts it out -- we can load our checkpoints again. Here's the updated file. I think we're going to have to revisit checkpointing and validation again; we don't want to do it in all of our processes, probably only on global rank 0, and we'll need to somehow synchronise everything so that the other processes don't carry on training while we're doing it. But before we get on to that, there are a couple of other things to change. At the top of the file we're defining some constants that look wrong.

We'll handle the dumbest of these first; it was actually silly that in the old code we had a constant for sequence length. We're using the context length of the model for that, so it's duplicated information. Let's get it from the model config instead; here's the updated file. That was nice and simple.

The code that we have specifies the batch size for each GPU -- that is, with a setting of 6, we'll have six sequences in each batch on each one. Like I mentioned earlier, that's called a "micro-batch" in distributed training like this -- a per-GPU batch, as opposed to the overall global size across all GPUs -- so we could just rename it, and then we'd have 6 × n gpus as a global batch size. However, it feels to me like this is a useful metaparameter to be able to tweak from outside the code. I can see machines with per-GPU VRAM varying from 40 GiB to 160 GiB on Lambda Labs, and pretty clearly that will mean there will be a varying largest micro-batch size on each type. So this is something we'll want to configure on a per-run basis, so let's add a new file to our run config, load that up, and pass it through. That's a simple enough fix; no need to note the diff, but here's the code.

This one we'll need to think about. The size of our validation set is based on what one process running on my local RTX 3090 can validate in five minutes, and the interval (for which I fairly arbitrarily put 2000 in the code when copying it across) was calibrated for roughly every half-hour. Those numbers in turn were aimed at the 44 hours of training time I expected locally. For this train, we'll (hopefully!) be taking significantly less time. We'll have eight GPUs, so naively that's 5.5 hours of train time, and each will have more VRAM, so we should be able to bump up the batch size and potentially get even faster than that. Depending on which kind of cards we're using, they may be faster, too -- I found that an A100 is slower (with the same batch size) than the RTX 3090 in my fine-tuning experiments, but the H100 and B200 are likely faster. I think this is another thing for the train config; we should have the validation interval (in terms of iterations) and the number of batches to do for validation. Here's the updated code.
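What such a per-run training config might look like -- the file name and key names here are illustrative, not the post's actual schema:

```python
# runs/8xA100/train_config.json -- illustrative example:
#   {"micro_batch_size": 6, "validation_interval": 2000, "validation_batches": 2}
import json
from pathlib import Path

def load_train_config(run_dir):
    with open(Path(run_dir) / "train_config.json") as f:
        return json.load(f)
```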
Now, let's move on to the dataset. With the code as it is right now, all of our per-GPU processes iterate over the same dataset in the same order. That means that they'll all be training on the same data; the synchronisation that is happening "magically" in the background means that they'll all train on the first item, work out gradients, and step their optimiser -- so they'll essentially (modulo randomness) have the same updates. Pretty pointless! What we want is for each of the n per-GPU processes to train on 1/n of the data.

We have two useful helpers in torch.distributed: get_rank(), which gets the global rank of this process -- in our one-machine case, it returns 0 for the process on GPU 0, 1 for the one on GPU 1, and so on, and we're already using it in that setup code we looked at earlier -- and get_world_size(), which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one). So, the simplest thing to do is to use the world size as a step, and the rank as an offset. Here's the code with that.

Now, remember that the same code is running for every one of our per-GPU processes. That means that all of them will do the training with forward and backward passes, and their own optimiser steps, all synchronised by PyTorch DDP magic. But they will also do their own validations -- which is kind of pointless -- and they'll also try to save their own checkpoints, which would be messy because they could quite easily interfere with each other; after all, all of the processes are running on the same machine and would be writing to the same filesystem. So, as a first cut, let's just wrap an if on the rank around the eval and checkpointing stuff, so that only rank zero runs it. That line winds up getting a bit long, so let's break it apart a bit.

That looks OK, but there's an extra wrinkle: all of the processes are running the same code, so while the rank zero one will do the eval, the others will continue through the script, so they will go right back around our loop and start training on the next batches -- which is bad. We want our processes to be proceeding in lockstep, iteration-by-iteration. Luckily, the solution is simple: the barrier() function in torch.distributed basically says "stop here until all of our processes have reached this point". So we can use two of those -- one before the eval loop, to make sure that all of the processes have finished their training part of the iteration before we do the eval on rank zero, and one after the eval, so that the non-rank-zero processes will wait. One bit of complexity -- we want to do those barriers only if it's an eval iteration, but we want to do them for all processes. So we have to break up the if statement a bit. That seems to work OK (code here), but it does give a warning: we should pass the device ID in when we call barrier().
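Roughly, the shape of that eval-and-checkpoint section is something like the following sketch -- my paraphrase, with placeholder function names, not the post's exact code:

```python
import torch.distributed as dist

def maybe_validate_and_checkpoint(is_eval_iteration, local_rank, ddp_model,
                                  run_validation, save_checkpoint):
    """Run eval/checkpointing on global rank 0 only, holding the other ranks."""
    if not is_eval_iteration:
        return
    # Every rank must agree on is_eval_iteration, or these barriers will deadlock.
    dist.barrier(device_ids=[local_rank])   # let all ranks finish this step's training
    if dist.get_rank() == 0:
        run_validation(ddp_model)
        save_checkpoint(ddp_model.module)   # save the underlying model, not the DDP wrapper
    dist.barrier(device_ids=[local_rank])   # keep the other ranks here until rank 0 is done
```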
Let's dig into that a bit. Here's the copypasta that I took from the PyTorch tutorial earlier in this post, and what it is doing. The LOCAL_RANK environment variable is being set by torchrun to 0, 1, 2, etc. as appropriate, to tell us which process we are on this machine. So the first line is telling PyTorch to use the device with that index for this process. The next line is getting the current accelerator -- that is, an object that represents which acceleration hardware we're using in this process. I think that the best way to see the combination of these two lines is that the first says "use GPU 0" (or 1, or 2, or...), and then the second says "get the object describing the GPU you're using right now". So it's a slightly indirect way of getting the object containing the details of the GPU in question.

Next, we look up the backend for that device. A backend in this context is an abstraction of whatever system the device in question is programmed using -- in the case of an Nvidia GPU, it would be some kind of thing that encapsulates CUDA. Once that's done, we call init_process_group, passing in the backend that we're using. We're saying "initialise the internal data structures for torch.distributed so that they're all set up properly to work with the backend we specified". After that, we can do stuff like getting the global rank with get_rank() and so on, because torch.distributed has been properly initialized. Presumably at this point we're talking to any other machines in a multi-machine cluster, so we can find out what our world size is and that kind of thing. That extra line at the end, to get the device ID, actually looks erroneous to me. All of our code is assuming one process per GPU, so I think we can just use the local rank there as well. Let's rewrite it like this (with some useful comments). That seems to work well! Here's the code.

However, I ran it past ChatGPT (largely to validate my understanding of what was going on), and it highlighted something slightly misleading about it. Right now, we're training on a single node, with one process per GPU. But again, one of the neat-o things about this DDP stuff is that it should be able to scale to multiple nodes. Now, remember that LOCAL_RANK is just the rank of the current process on the specific node that it's running on -- hence the name. If we had two machines, each with 8 GPUs, then there would be a process with rank zero on each of them. The "real" rank -- that is, across all machines -- is the one that you can get from torch.distributed once it has been initialised. One of the things it does during that initialisation is to talk to all of the other nodes and work that kind of thing out -- which of the local rank zero processes across all of the machines is the global rank zero process. So we need to use the local rank when working out which GPU we should be running on and so on, but we should not treat it as a global rank. That's actually quite fine in this case, as we're calling get_rank() inside the training loop when we actually need to use the global one (when indexing into the dataset, or when deciding if we're the process that should be doing evals and checkpoints). The only place where we might be confusing matters is in that print, which is not important anyway, as the training loop also prints out its rank. So, let's tweak it a little more for clarity. That seems to work well! Here's the code.

Time to run it past ChatGPT to see if I've made any dumb errors. Turns out that (unsurprisingly) I have... Let's go back to our code that decides whether or not it's an iteration where we need to do a validation run and a checkpoint. The problem is that our index is different in the different processes! Remember, we step through the dataset using the world size as the step and our rank as the offset, in order to pick out the correct training items. So let's think about it: in the first run through the loop, with 8 GPUs, rank 0 would have index 0, rank 1 would have index 1, and so on up to rank 7 with index 7. In the next run through the loop, rank 0 would have index 8, rank 1 would have index 9, and so on. So checking the index against the validation interval will give different results for each process. That might not sound like the end of the world -- the check will only pass for one of them, so long as the interval is larger than the number of GPUs -- but remember what our validation code looks like. If different processes have different values for the index, then barrier() will only be called in the one(s) for which the check passes. But barrier() means "wait until all processes have reached this barrier". So the ones that call it will lock up completely until the other processes get there, and everything will at best get out of sync, and at worst will lock up completely.
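To make that failure mode concrete, here's the arithmetic with eight GPUs and a validation interval of 2000 (illustrative numbers taken from earlier in the post):

```python
world_size, validation_interval = 8, 2000

for k in range(2):                       # first two passes through the loop
    for rank in range(world_size):
        index = k * world_size + rank    # "world size as step, rank as offset"
        print(f"pass {k}, rank {rank}: index={index}, "
              f"eval? {index % validation_interval == 0}")

# Pass 0: only rank 0 gets index 0, so only rank 0 decides it's an eval iteration.
# It calls dist.barrier() and waits forever, because the other seven ranks never
# reach a matching barrier -- the whole job hangs. (Since 2000 is a multiple of 8,
# no rank other than 0 can ever satisfy the condition at all.)
```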
I think that the problem here is that I'm conflating two things: the index of the global step -- that is, one iteration across all GPUs -- and the dataset element that we want to use. In the original one-GPU case that made sense; iteration 0 was on dataset element 0, iteration 1 was on element 1, and so on. But now the offset into the dataset, and the global step, are quite different things. This is quite deeply embedded in the code, but we can fix it!

Let's start off by changing our checkpoint code, just to rename things. It keeps track of a variable called , our offset into the training dataset, and uses that both to index into the dataset, and to work out how far through the train we are. The latter is a much better thing to store in a checkpoint, so instead of saving , we'll store (and restore) . Basically, just a rename so that the variables and stored JSON match the new reality. Here's the updated code .

Now we need to make a number of minor changes to the training loop just to match that rename of the value that we're checkpointing (e.g. for the code to generate the training chart), but the most important change is to our loop. Instead of iterating over our dataset with a step and an offset so that we can index into it, we firstly work out how many global steps there will be: ...then we iterate from our initial global step -- zero if we're starting a fresh train, or whatever global step we were on in a loaded checkpoint plus one if we're doing a continued train from a checkpoint -- up to that total:

That means that we need to use the global step, the world size, and our current rank to work out which dataset item we should be training on for this process at this global step. Let's say that we have eight processes; on the 0th global step, we should have rank 0 training on dataset item 0, rank 1 on item 1, and so on. On the next global step, rank 0 should train on item 8, rank 1 on 9, and so on. So: That's actually much more elegant than the earlier code, and seems to work fine. Here it is . Phew, glad to have caught that before I started spending money on machines -- it would have been confusing if everything locked up. Thanks, ChatGPT!

Another thing that ChatGPT raised is about the validation. We don't want to validate across all of the validation dataset -- we're using a number of batches from the config. I have this code: This looked like a nice, quick way to get the first elements of the validation dataset. But ChatGPT told me it would raise. It didn't, though -- why?

The problem is that I had set the number of eval batches to 2 in my training config for testing. Stepping through what that slice does, when we run : Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack". But because it was set to 2, it silently failed -- our first eval loop got the first X from the validation set as its inputs, and the second X as its targets. Nasty! AI code review certainly helped me dodge a bullet on that one.

Let's fix it; it's not a big change: we can just do this: ...and that works! So here's the code now . So, I think we have one final issue, which is the training and validation datasets.
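Before moving on to the datasets, here's a rough consolidated sketch of where the multi-GPU loop has ended up: the per-process device setup, the global-step-based indexing, and the rank-zero-only eval bracketed by barriers. This is not the post's actual code -- the training-step and eval/checkpoint callbacks are placeholders, and it uses the classic torch.cuda calls rather than the newer torch.accelerator ones from the tutorial snippet:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_process():
    # torchrun sets LOCAL_RANK to 0, 1, 2, ... for each process on this machine
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    # passing the device in avoids the "no device id" warning mentioned above
    dist.init_process_group(backend="nccl", device_id=device)
    return device


def train(model, dataset, total_steps, training_step, eval_and_checkpoint,
          start_step=0, eval_interval=10):
    device = setup_process()
    rank, world_size = dist.get_rank(), dist.get_world_size()
    model = DDP(model.to(device), device_ids=[device.index])

    for global_step in range(start_step, total_steps):
        # rank 0 trains on item 0, rank 1 on item 1, ...; on the next step they
        # move on to items world_size, world_size + 1, and so on
        x, y = dataset[global_step * world_size + rank]
        training_step(model, x.to(device), y.to(device))

        if global_step % eval_interval == 0:
            dist.barrier()                    # everyone finishes this step first
            if rank == 0:
                eval_and_checkpoint(model, global_step)
            dist.barrier()                    # the other ranks wait for rank 0

    dist.destroy_process_group()
```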
In our single-GPU train, we worked out ahead of time how much of FineWeb (or FineWeb-Edu) to train on -- the Chinchilla-optimal number -- and generated a dataset that contained a round number of 6-sequence, 1024-token batches that was the smallest such round number that was larger than our target. We also worked out exactly how large (in terms of batches) our validation dataset needed to be so that each validation run would take five minutes. There was one big issue with that system; when I decided to do an "extended" train on more of the FineWeb-Edu dataset, in order to see whether I could get the loss down further, I had to do some nasty hackery in order to generate a new one. So it would be nice to not have that problem this time around. Additionally, we're likely to be tweaking the batch size quite a lot in this experiment while we find what the appropriate level is to fit onto the cloud GPUs, and also varying how much validation we do -- and additionally, we have the world size to worry about. I think that the best way to give us the flexibility we need will be to pre-convert the complete FineWeb and FineWeb-Edu datasets into the format we need -- each sequence in the dataset converted to GPT-2 tokens, and then those sequences concatenated together, with the token 50257 separating them. It would be good to properly nail down the validation dataset at the same time. So we can have a script that loads up the original dataset as downloaded from Hugging Face, splits it into 99% train, 1% validation, does the conversion, and then saves them as safetensors files. If we use for those (which is just large enough for our 50,257-token vocab), we can fit the ~10B tokens in each dataset's train split into 20 GiB of disk. Not too bad. But there will still be the issue of getting them onto our cloud machines. Let's generate the data, and then work out how to handle that. I tried initially with the code I used last time, adapted to run through the entire dataset . It does the 99%/1% train/validation split, and then for each of those generates a single massive tensor of tokens like this: It almost worked! To my surprise, it got all the way to the end, and only blew up with an out-of-memory error when it was trying to save the result -- and it did that completely silently, so I thought it had worked right up until I tried to check the file on disk to see how large it was, and it wasn't there. The obvious tweak: set the list to just after the , to free up the memory it's using. Given that it was the save that triggered the OOM, you'd think that that would be enough -- but it turned out not to be so. Rather than mess around with this for much longer, I just decided to add on 128 GiB of swap to my machine temporarily: ...and that was enough to make it run. So I've now generated pre-tokenised, pre-concatenated train and validation sets for both FineWeb and FineWeb-Edu: Now, thinking about how to get it up to the Lambda Labs machines. I have normal 1 Gb residential broadband, so conceivably I could upload 20 GiB in about 200 seconds. But that's assuming that there's no network congestion, so I would expect it to take longer. The LL machines are quite expensive, and I don't want to waste money keeping them up while I'm just uploading data. There are possibilities here: I think the best option is to use option (1), but with the option of also doing (2). The HF dataset will still take time to download to LL, even over the faster network connection. 
That might not be a problem -- but if it is, I can download it once on a cheap instance and use a persistent disk too. Essentially I'd be using the persistent disk as a "cache", and still get the benefits of the easily-shareable datasets on Hugging Face. So, that decided, let's find out how we can upload a whacking great 20 GiB safetensors file as a dataset on Hugging Face.

It turns out that resources like datasets on HF are just Git repositories using the LFS (Large File Storage) plugin to be able to handle, well, large files. Conveniently, given that I'm using to manage my project, there's a plugin that allows me to use their CLI tools with minimal effort, so: Both datasets show up on my profile page on Hugging Face, so that's looking good. Now it's time to try to upload the data. We'll need to install Git's LFS support first: Now let's try the FineWeb one: OK, so we need some kind of extra thing to tell it we can use large files on top of the LFS stuff: Right, now let's try again: Weird that it prompted for the credentials twice, but it did appear to try to do something there -- but obviously it didn't work. Let's see if Git over SSH is any better. ...then the same stuff to copy in the files and create the metadata file, then: Looks like the same error. Odd.

Let's try using HF's upload tools rather than Git -- feels like a bit of a cop-out, but maybe it'll work better. That did indeed take about 200 seconds to run, but the upload speed was only about 10 MiB/s -- from the output, I think it must have been compressing it. Anyway, it looks like it succeeded, so let's upload the others! ...and that's done :-) Next, a bit of manual editing of the dataset cards on the Hugging Face website, and we have our two new public datasets: That looks solid.

So, the next thing: change our codebase so that we have some quick and easy way to download them (I'm feeling a little wary of using Git for that after the upload issue), and then to use the downloaded files in our training code. We already have the code to download a dataset; the stuff that I wrote to download FineWeb and FineWeb-Edu originally. Here's the important bit: ...so we can adapt that to download all files in an arbitrary dataset: ...and call that from our , using a new command-line argument , and a new element in our train config JSON file: I was thinking that we'd need extra guard code to not download the dataset again if it's already there, but it looks like it handles that all nicely for us.

So we have a way to specify which dataset we should use for a training run, and code to download it. Now we just need to adjust the code that loads our datasets so that instead of looking in the , it looks in the directory returned by : ...and update the directory handling so that it just blindly uses the directory provided rather than trying to look in a subdirectory: That all works! We successfully download the datasets and try to use them. Here's the code .

But now we have a problem; when the tries to reshape the huge tensor that we have as our inputs: ...it craps out: That makes perfect sense. Our original files were carefully sized for a batch size of six, and 1024-token sequences. We need some way to work out an appropriate slice of both the training and the validation data. Most of the trains are likely to be Chinchilla-optimal, or at least use a Chinchilla-optimal number of tokens -- rounded up appropriately to match our micro-batch size, sequence length, and world size. But I'd like it to be more configurable.
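To make the failure concrete, here's a tiny illustration (not the post's code) of the constraint the reshape imposes: the flat token tensor has to be cut down to an exact multiple of batch_size * sequence_length before it can be viewed as batches.

```python
import torch

tokens = torch.arange(10_000)        # stand-in for the huge pre-tokenised tensor
batch_size, seq_len = 6, 1024

# tokens.view(-1, batch_size, seq_len) would raise here, because 10,000 isn't
# a multiple of 6 * 1024 = 6,144; truncating to a whole number of batches works:
tokens_per_batch = batch_size * seq_len
usable = (len(tokens) // tokens_per_batch) * tokens_per_batch
batches = tokens[:usable].view(-1, batch_size, seq_len)
print(batches.shape)                 # torch.Size([1, 6, 1024])
```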
What I'll do is add a key to the training config dictionary, along with a so that we can (for example) train on the first Chinchilla-optimal tokens, then do an extended train continuing on from there. The idea is that we can use as a base, and train on the smallest number of full batches that contains at least that many tokens. For validation, I think that the key that we already have is actually quite nice. Validation is time-bound, and the number of batches is the easiest lever to pull to handle that. However, a would be nice for symmetry. So, here are some numbers for debugging: Now let's use them. Initially, we have this to load the train dataset: Let's work through that one first then make appropriate changes to the validation one. The pieces of information we need to work out which tokens to use are: Let's update our function so that it takes those parameters in that order: ...and now we can write an updated that uses those numbers to get the right number of tokens: Validation is less obvious; I think that the best way to do this (given that the validation dataset is small) is just to have a "magic" value for , which means "just get a round number of full batches starting at . It's also worth remembering that we only do evals on the rank 0 process, so we could in theory pass in a world size of 1 -- but I think that passing in the real world size might be a good idea, because it gives us one fewer thing to change if, in the future, we move towards distributed evals. ...and we change to be able to handle the magic : I also added in a quick sanity check to make sure that we don't get weird behaviour if the is past the end of the original dataset. That all looks good! Running it kicks off training, and validation is running happily every ten global steps, but just with three samples, as configured in the JSON file. Here's the code . One thing that hasn't shown up while running this code locally is that our training loop has this: With one GPU, that's fine, but on a multi-GPU machine, that is going to happen in all of our per-GPU processes -- so they'll all be spamming out progress bars, which will be ugly. So, as a first cut: Now, in order to compare different machines (say, an 8x H100 vs an 8x A100) it would be nice to get tokens-per-second numbers while training. We can do that in the progress bar too! It has a method that adds stuff to the end of the bar, just after the elapsed time and iterations/second numbers. For that, we'll need to have the object available in a variable: ...and now we can count the total tokens seen in the training run, plus keep track of the start time -- just before the start of the training loop: ...then inside, after the training step: That will give us a running average of tokens per second over the train as a whole since the start. Running that, we get a nice progress bar like this (you'll need to scroll to the right): Note that the tokens per second is worse than the just less than 20k that we got when running the single-GPU test previously, but that's due to the testing setup I have -- I'm doing an eval every 10 global steps. Changing that to 1,000,000 so that we just get a single eval when we start, then letting it run for a while to settle down from the initial eval, we get this: ...which is close enough to what we had before. Finally, let's print out some summary information at the end: Ran that on a super-short train with about 50 iterations-worth of tokens, and: Looking good. Here's the code . 
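Here's a hedged sketch of those progress-bar changes -- the variable names are stand-ins for the post's own config values, and the "method that adds stuff to the end of the bar" is something like tqdm's set_postfix:

```python
import time
from tqdm import tqdm

def train_loop(total_steps, micro_batch_size, seq_len, world_size, rank):
    tokens_seen = 0
    start_time = time.monotonic()

    # only rank zero gets a progress bar, so the other processes stay quiet
    steps = range(total_steps)
    progress = tqdm(steps) if rank == 0 else steps

    for global_step in progress:
        ...  # forward pass, backward pass, optimiser step go here
        tokens_seen += micro_batch_size * seq_len * world_size
        if rank == 0:
            elapsed = time.monotonic() - start_time
            # running average of tokens/second since the start of the train
            progress.set_postfix({"tok/s": f"{tokens_seen / elapsed:,.0f}"})
```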
I think we now have something where it's worth spinning up a Lambda Labs machine to run it. Let's kick off a training run on the cheapest two-GPU machine that they have available right now. That's actually not all that cheap, it's a $6.38/hour 2x H100 80 GiB SXM5. But I'm not planning to do a full train on it yet, this is just a sanity test. I won't attach a filesystem this time, either -- let's see how things go without the caching of the datasets that I was considering. First thing: do we have ? Nope. OK, let's install it: Right, now let's clone our repo and set up our environment: And now I think we can just try running it! It took 18 seconds to download the dataset! I don't think we need to worry about the caching thing with persistent disks, at least at this point. But there are a couple of issues here. I didn't put the number of processes in the command line -- I should be using Also, we don't have the XKCD font family. I'll ignore that for now. OK, that's looking good! Let's make our validations happen less often, and see how high we can get the micro-batches with the 80 GiB VRAM we have on each of our two GPUs. Doing a binary chop, I set the micro-batch size to 100 (OOM), then to 50 (OOM), then to 25 (worked), then to 37 (OOM), then 31 (OOM), then 28 (worked), and finally 29 (OOM). So we have a batch size of 28 for our 80 GiB machines. Leaving it for a little while to settle down, and we get to about 142,000 tokens/second. Now, on the 3090, we were training at 20,000 tokens/second. That means that this machine is running at about 7 times the speed. Given that our original train finished in 48 hours, we'd expect the train to finish in about 6, which indeed is the estimated time on the tqdm progress bar. At $6.38 per hour, that comes to $38.28. Not bad! And this instance is actually quite pricey on a per-GPU basis -- it's $3.19 per GPU/hour, whereas there is an 8x H100 that costs $2.99 per GPU/hour. I'm almost tempted to let it run. But the purpose of this run was to work out the bugs. We're going to want to track the training chart -- remember that after every validation run, our training code generates a chart showing the training and validation loss so far, like this one . I ran the normal quick-and-dirty Python webserver command on the instance, inside the directory containing the training chart: My browser didn't connect to it, but looking at the Lambda Labs interface, there's a new "Firewall" section, where you configure rules for allowing incoming connections to your instances. That's sensible, and the default rules are just "allow SSH from any IP" and "allow ping from any IP". Adding one letting anyone access port 8000 fixed the problem, and I saw a directory listing; clicking on the chart showed exactly what I'd expect, but without the XKCD fonts. Nice. Let's work out how to fix that XKCD font thing. Looking around, it seems like there are approximately twenty thousand ways to do it. Here's one that seems to work; firstly, install the font on the system: Now, that installs a font that has the family name 'xkcd Script` (with that erratic capitalisation). So we need to change the code to pick up pretty much anything that looks like it's XKCD, so instead of this: ...we can do this: That seems to work OK. So, now, I think we have the beginnings of a script to set up a Lambda Labs machine so that we can use it. Let's write a with this: ...and give it another go on a fresh machine. Shut this one down -- total cost so far $7.28. Now there are no 2-GPU instances available. 
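As an aside, here's one way that kind of font matching can be done with Matplotlib's font manager -- a sketch only, since the post doesn't show exactly which setting it changes: collect every installed family whose name mentions "xkcd" and put those first in the font.family list.

```python
from matplotlib import font_manager, pyplot as plt

xkcd_families = sorted({
    font.name
    for font in font_manager.fontManager.ttflist
    if "xkcd" in font.name.lower()       # matches "xkcd", "xkcd Script", etc.
})
if xkcd_families:
    plt.rcParams["font.family"] = xkcd_families + ["sans-serif"]
```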
There is a super-cheap 1x A10 (basically the datacenter version of a 3090), though, so let's use that -- we're as certain as we can be that the multi-GPU stuff works, and the proof of the pudding will be whether we can train a model that works. After spinning up our 1x A10 machine: Looking good! I think we have something that (in theory) should work. That cost $0.05. I think it's time to do our first train on a big instance. There are four 8x instances available on Lambda Labs for me right now: I think I'm going to want to train on all of those, to try to work out some kind of metric (dollars per megatoken?) to compare them. But let's start with something reasonably low-end -- in fact, let's try the cheapest, and see what happens. Spin one up, and first thing; after the setup, we need to work out the micro-batch size. Last time we used 28, but this machine has GPUs with half as much VRAM. I did a binary chop again... it turns out to be 13. Now let's think about validation frequency. Let's try to get a feel for how long it will take. We can set the eval batches to (say) 100, so that we can see how fast evals are, but also set the interval to 10,000,000 so that it never does one after the first. It took 11 seconds to run 100 validation batches, and after a few minutes, it settles down at 254,000 tokens/second or so, and is estimating 3h15m to completion. Nice! The cards are an earlier generation to the H100s we used in the two-GPU test, so they're slower, and they have half the VRAM. So eight of them are, working together, about twice as fast as two H100s. Doesn't sound completely crazy. So, in our local train, we spent 5 minutes evaluating every 30 minutes. So our eval time was 16% of our train time. Probably a bit high, but let's run with it. If we're going to take 3 hours training time, then 16% of that is about 28 minutes. Previously we did about 88 evals (44 hours train time, with an eval after each half hour). That seems a bit too high. So let's say that we want to do 50 evals. 28 minutes eval time in total, with 50 of them, means about 30 seconds per eval. If 100 eval batches take 11 seconds, let's approximate it to 300 eval batches. As to the interval between them -- if we want to do 50 over 3h15m, or 195 minutes, then that's one every (let's approximate) 4 minutes. We seem to have settled down to 2.57 iterations per second, so that's about every 617 iterations. Let's bake those in and let it rip. After the run: OK, let's download everything. Looking at the checkpoints, the latest (that is, the last one at the end of the training) and best (the checkpoint that had the lowest validation loss) are the same one, meaning that validation loss kept falling consistently: So let's just download using the "best" symlink to get the weights for that checkpoint: And now we can shut the cloud machine down. Now that the clock is no longer ticking and we aren't spending money on an unused machine, here's the training chart: It looks like we had a couple of gradient spikes there. I'm going to add some gradient clipping code at some point, but I think I'll hold off for a little bit -- I want to do a few cloud trains first to work out the best instance sizes to use, and only then start exploring the possibilities for making the models better. Apart from that, it looks pretty normal. Looking at the billing page on Lambda Labs, that machine was up for about 4 hours and 35 minutes, costing US$10.32 per hour, for a total cost of US$47.35. 
Of that 4h35m, 13,904 seconds, or 3h52 was the actual training run -- somewhat more than the 3h15m that was predicted at the start of the run. The validation will have accounted for most of that -- we did 50 evals, at 30 seconds each, so that's 25 minutes. That means that 3h40m is accounted for, and the remainder can just be chalked up to noise, I guess. That leads to one question: do we actually need to be doing validation for these trains? I've been doing validation loops in these trains largely out of habit -- when you're training an ML model, it's just "what you do". The reason you'd normally hold out a validation set is simple: if you're training over multiple epochs, then eventually your model is going to start overfitting to the training data 2 . You validate as you go along so that you can spot any points where, while the training loss continues to drop, the validation loss -- which is loss on data that the model hasn't been trained on -- starts rising. That's the classic indicator of overfitting. But for these models we're not doing multiple epochs -- we're just training through a stream of constantly new tokens. So, in fact, there's no real difference between the training data and the validation data, apart from the fact that the validation data is constant. From the model's perspective, it's all new stuff (modulo any repetitions in the dataset, which is possible but I think not likely to be super-common in something as curated as FineWeb). Now, in this post I'm aiming to identify the best options for training in the cloud -- cost in terms of dollars and time. I don't want to change the model itself or the training strategy because I want whatever I come up with to be roughly equivalent to the models I trained on my own machine. Exploring enhancements is for the next post. (Of course, given that the batch size is one of the levers I want to experiment with, and training on larger machines is already meaning that I'm doing micro-batches larger than the batch size of 6 that I used locally, and then the overall batches are 8 times larger, that's not quite true.) Validation, however, doesn't actually affect the training runs in any direct way. I could in theory remove it. However, that is a relatively large change to the code, as I've kind of linked it in with my checkpointing code. I think that what I'll do for now is leave it in. Validation will scale at the same rate as training (so long as I leave the eval batches constant) so it leaving it there will give me a clean comparison between machine types. And I can keep notes on how much time was spent on validation for each train so that I can subtract it from the total time if that proves useful. However, when I start tweaking the training code with changes beyond the batch size, I should probably try removing validation first. Anyway, while validation during the training run might not be important, evaluating the model at the end and seeing how it compares to others is! Let's do that next. There were two important post-train evals that I did on the models that I trained locally: There was also a simple smoke test -- how does the model predict that the phrase ...should continue? I should do the same three tests here. A simple autoregressive generation script is easy enough to knock together, and: All we're looking for here is basic coherency, and I think this is good enough to pass that filter. Next, the loss-style testing. What I think I want to be able to do here is just take a file and run an eval against a standard dataset. 
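Something along these lines, perhaps -- a minimal sketch rather than the post's script, assuming the pre-tokenised data lives under a "tokens" key in a safetensors file and that the model returns logits of shape [batch, seq, vocab]:

```python
import torch
import torch.nn.functional as F
from safetensors.torch import load_file


@torch.no_grad()
def eval_loss(model, tokens_path, start, num_tokens,
              batch_size=6, seq_len=1024, device="cuda"):
    tokens = load_file(tokens_path)["tokens"][start : start + num_tokens].long()

    # cut down to a whole number of batches, then shape into [n, batch, seq]
    per_batch = batch_size * seq_len
    tokens = tokens[: (len(tokens) // per_batch) * per_batch]
    batches = tokens.view(-1, batch_size, seq_len)

    model = model.to(device).eval()
    total = 0.0
    for batch in batches:
        x = batch[:, :-1].to(device)     # inputs
        y = batch[:, 1:].to(device)      # next-token targets
        logits = model(x)
        total += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1)
        ).item()
    return total / len(batches)
```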
I did not generate my own test set, but I did generate a much-larger-than-necessary eval set, 1% of both FineWeb and FineWeb-Edu -- that's 100 million tokens or so in both cases. In the validation that I was doing during the train just now, I did 300 batches of 1,024 tokens with a micro-batch size of 13. That only ran on the rank 0 process, so that's 300 x 13 x 1,024 = 3,993,600 tokens. Not even 4% of the validation data.

Now, for the local eval, I think it makes sense to make it run for about five minutes -- that's just for my own convenience, I don't want to spend very long -- and I know from the previous local train that I can do 3,200 batches of six 1,024-token sequences in that time: So, somewhat arbitrarily, let's use the 19,660,800 tokens starting at position 50,000,000 in the FineWeb validation dataset for our tests -- they'll never be used for training or validation during the training loop. It's kind of a hack, but it'll do for now. Here's the code . It should be easy enough to understand; it did require one tweak to our existing function, though:

Originally, that function worked out the actual number of tokens to use by working out the size of each global batch, dividing our requested minimum number of tokens by that size and taking the floor, adding on one, then multiplying that by the global batch size. That works fine in cases where the is not a multiple of the global batch size -- it gives us a round number of batches that contains at least . But if is already a multiple of the global batch size, it gives us an extra batch at the end. So I added that as a special case in to avoid that.

Anyway, running that gives us a loss: That's actually quite a lot lower than we were seeing with the locally-trained models on the test dataset I was using then -- but, of course, it's a different dataset so it's not strictly comparable. Let's run the same test against them: That's really interesting! Those numbers are really close to the numbers I got in the last post. That does make some kind of sense, though -- while the numbers aren't strictly comparable, as I said, both the dataset that I was using then and the one I'm using now are essentially random stuff from FineWeb, so I guess they must be more similar than I thought.

But, importantly, the loss on the newly-trained model is much lower -- 3.674 rather than > 3.9 for all three of the older locally-trained models. Now, the only big difference between this training run and the ones that I did locally is the batch size. As I said in the last post, while I felt that the difference between my batch size of six and the (reported) batch size of 512 for the original GPT-2 was the least-likely cause of the differences in the results, Gemini told me that it thought it was the most likely cause. It looks like Gemini (and, I should note, people on Hacker News) might have been right! Batch size is super-important.

Let's do the same eval with the OpenAI weights. I wrote a quick script (in my old 'LLM from scratch' repo, which has the code used in the book) to load up the GPT-2 weights and save them as a safetensors file . When I ran that, I got an interesting error: That was easy enough to fix; in the book's code we assign the weights that have been loaded from the OpenAI TensorFlow checkpoint files with a function called that looks like this: Just adding a call to to the last line fixed the error: ...and as a result, I had safetensors files for the original OpenAI models: So now we can run our test against them: Excellent.
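For reference, the rounding tweak described above boils down to something like this -- a sketch with made-up function and parameter names:

```python
def tokens_to_use(min_tokens, micro_batch_size, seq_len, world_size):
    # size of one global batch: every process trains on its own micro-batch
    tokens_per_global_batch = micro_batch_size * seq_len * world_size

    num_batches, remainder = divmod(min_tokens, tokens_per_global_batch)
    if remainder != 0:
        num_batches += 1   # round up -- but only when min_tokens isn't already exact

    return num_batches * tokens_per_global_batch

# e.g. tokens_to_use(3_300_000_000, 13, 1024, 8) is the smallest multiple of the
# 106,496-token global batch that covers at least 3.3 billion tokens.
```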
Let's start putting together a table of these results: That's pretty amazing. Having a batch size of 13 micro-batches over eight GPUs, or 104 in total, seems to have massively improved the model -- it's much closer to the original weights. It will be interesting to see whether I get further improvements when I move to the larger machines, which (due to having more VRAM) will have larger possible micro-batches, so we'll get larger global batch sizes. It certainly makes me think that I could have got much better results locally by using gradient accumulation, which would mimic the effects of a larger batch size by running multiple smaller batches through, without doing an optimiser step each time, then doing one big update once enough has gone through. But all of that is for another day. Let's try the instruction fine-tuning test now. I decided to pretty much re-use my adapted version of the code from the book; that meant that I was borrowing quite a lot of Raschka's code, which he has released under the Apache 2 license . I normally use the MIT license for my code, but I'm not married to it, so I relicensed the whole repo as Apache 2 with some specific headers to say which parts came from "Build a Large Language Model (from Scratch)", and added this code . It downloads the Alpaca dataset from the site for the book, splits it into train/validation/test splits, trains on the training set, evaluating each epoch and bailing out (and restoring the previous epoch's weights) when validation loss starts rising, and then runs through the test set generating responses, and then sends them all off to the OpenAI API for GPT-5.1 to judge them. Running it against our new model gets a score of 17.09. Let's try the various other models and build out our table: Interesting! In the last run, I found the instruction fine-tune numbers came out as FineWeb-Edu extended > FineWeb > FineWeb-Edu, but here we have FineWeb-Edu > FineWeb > FineWeb-Edu extended -- exactly the opposite! I do have to wonder, though, how precise a measure this is. While the training should be fairly consistent (though I don't have a random seed in there to enforce it), the fact that we're using an LLM as a judge means that there is an element of randomness coming in here. Indeed, I re-ran the FineWeb-Edu extended train test again, just to see what I got, and it came up with an even-worse 12.12. So I don't think we can read a huge amount into these numbers -- well, unless we can get the numbers significantly up. While it looks like a 2.5-point difference might just be randomness, I doubt that a 10-point difference could be. I think we've done the tests that we need for this model now, and we have a testing procedure in place. So let's train some further models on different instance sizes, and gather numbers. This is the biggest machine available on Lambda Labs right now, and is only sporadically available; one happens to be there now, so let's to give it a go. First, we need to create the runs/8xb200m160 directory, initially with a that is a clone of the one I did for the last train, , then spin up the machine. As before, we need to log in, clone the repo, then in it run the script, run , and try to run the script: It crapped out because there was no datasets directory, which is an annoyance. We should create it if it doesn't exist. Create the directory, and run it again. It took a while to download the dataset, because every per-GPU process downloads it separately. 
That only took a minute or two, but it was a waste of time; I think we should only download it from the rank 0 process with some barriers to make the other processes pause. Next, we need to do a binary chop on the micro-batch size, starting with a low of 13 (which I know will be fine because it worked on the 40 GiB GPUs that we used last time), and a high of 100 (fairly random, just something I'm pretty sure will fail). While doing that, a few things are standing out, both to do with validation. When the script starts, it does one training iteration, then goes straight into validation. Then it starts the training run proper. However: We're going to need to work out some kind of fix for that, because it's taken me 17 minutes from spinning up the machine to getting a size for our micro-batches -- which happens to be 64. On a machine that costs US$39.92/hour, that's an expensive test! We'll look into that later. Anyway, a batch size of 64 is pretty neat, as with 8 GPUs, that means we have a global batch size of 512 -- exactly the same as in the original GPT-2 paper! So, let's kick off the train. It takes about 7 minutes to get to the first checkpoint, at which point it's averaging 801,221 tokens/second. That pattern repeats, and with about one minute to do the validation, we're spending about 12.5% of the time on this machine validating. Hmm. A further indication that we might want to remove the validation stuff if it's not adding on any value. Eventually, it finishes: So, that's 1h9m50s. The final validation loss is not as good as the previous run on the 8x A100 40 GiB machine, where we got down to 3.675. Given that we're using the same validation dataset as the previous, that's meaningful: this is not as good a model, it seems. Again, latest and best checkpoints are the same one: So we can download everything: ...and here's the training chart: OK, so that's smoother than the last one -- no loss spikes. Maybe the larger batch size smoothed them? Let's think a bit about the cost of this train. From Lambda Labs, we had that machine running for a little over 1h30m. At US$39.92/hour, the total cost was US$60.25. Yikes. So, knocking off the 1h10 or so for the train, we have 20m to allow for -- which matches up quite well to the 17 minutes of fiddling with batch sizes, and then 3 minutes to download all of the files. If this blog post isn't going to cost significantly more than it needs to, we need to get that down. Of the US$60.25, just over US$13 was spent on identifying the batch size. Only US$46.57 was spent on the train itself. We also did 11 validation runs as part of that; at a minute each, those cost US$7.32. So, excluding validation, we're below US$40 for the train. Now, let's run our tests. First, the smoke test: we get this: "...on all other website for..." is a bit rubbish. Still, on to the loss: That's in line with the training loss -- worse than the loss I got with the one trained on the smaller machine, with its corresponding smaller batch size, but still better than any of our local trains. Still interesting, though -- larger batches are not guaranteed to get bigger results. More investigation needed there! On to the instruction fine-tuning test. That gives us a score of 13.89 -- the worst that we've seen yet! I think I'll put together a full table including these results later; I want to try training on some other, differently sized machines first, and we can aggregate the results at the end. 
But before we do that, let's make some changes to the scripts to fix some of those QoL issues we encountered in that last train. The first irritation was that it errored out saying that the datasets directory was not a directory when it didn't exist. The script takes a datasets directory as one of its command-line options, and it's reasonable that it checks that it really is a directory (rather than, say, a file or a symlink): ...but if it doesn't exist, it might as well create it first. Now, I could just put this before the check: ...but remember, this code is run by multiple processes -- so they could easily trip over a race condition here. What I want is to have just one of them do this; I've deemed the rank 0 process the "special" one for validation, printing the progress bar, and so on, so we may as well treat it that way here.

But -- there's a difference! Rank zero is the one that should be printing stuff out, it's true. And right now, we only have one node participating in this train. But I do want to avoid simple errors that would make it hard to run multi-node in the future. Now, if we have multiple nodes, then each one will have its own filesystem (unless we're using NFS or something like that), so we'll need a separate "datasets" directory for all of them. What we want is to do these checks on one process on each node. Usefully, we have the local rank variable that is defined earlier in the setup code, which is per-node. Again, let's imagine we have two nodes with two GPUs each. Node 0 might be running the processes with global ranks 0 and 1, and node 1 might have global ranks 2 and 3. On node 0, the processes would have local ranks 0 and 1 respectively, but on node 1, they'd also be local ranks 0 and 1. So, the full code becomes this:

Note the barrier; we don't want the other processes to check whether the datasets directory is a directory until the local rank 0 process has had a chance to create it. (Of course, if we were running this on a setup where all of the nodes shared a filesystem, it wouldn't work -- in that case we'd want to use the global rank that we can get from instead. But we can burn that bridge if we ever come to it ;-)

Phew, that was a bit more work than I expected! But it sets us up nicely for the next QoL fix on my to-do list. I don't like the fact that every process downloaded the whole dataset. The download actually handled it pretty gracefully -- none of the processes tripped over any of the others. Indeed, it looks like there was some kind of global queueing going on, so they downloaded it one after the other. But it did take time -- maybe a minute or two in total, and with the clock ticking on that ~US$40/hour machine, that felt a bit stress-inducing. So: I think it would be best to only do that from the rank 0 process as well.

The code that downloads the dataset is just after the bit we've been looking at: ...and looks like this: Now, the docs for snapshot_download say that the local_dir parameter is: If provided, the downloaded files will be placed under this directory. ...and the return value is this: We happen to be passing in a Path object for local_dir, and we're not in dry_run mode -- it defaults to False. So all we're doing by returning that wrapped in a Path object is a slightly indirect way of returning the path that we're passing in as local_dir.

For tidiness, I really want to gate the call to snapshot_download with the same rank stuff as we did for the directory creation. So, let's change the setup so that the download function takes the path to the directory where we want this specific dataset to be, not the generic "all datasets" directory.
And given that we're now passing this specific path into the function, we don't need to return it: Now it's just a wrapper around a single call to , which I'm not entirely sure about (it's a code smell that I'm probably creating an unnecessary level of abstraction) but I think I'm happiest leaving it that way for now, as it does hide away a bit of messiness in the HF hub API. 3 That means that we can now combine the directory-checking logic that we fixed above with download-on-local-rank-zero-only code like this: Here's the updated code with those fixes. Now, let's move on to validation. I'm increasingly of the opinion that the validation steps are just adding on to the cost without much in the way of benefit. Additionally, the validation is taking a different amount of time for each batch size, and happen a different number of times in each train -- remember, it's batches every global steps, and the batch size varies based on the micro-batch size, which is different for different amounts of GPU VRAM, and the total number of global steps in a train also varies based on the size of each batch. So that means that if we want to compare apples to apples in any final comparison of the time and money cost of training models on different kinds of Lambda Labs machines, we'll want to exclude the validation cost -- once we've settled on a machine type, we're going to want to fine-tune the validation size for that in much more detail than I have to date, assuming we don't drop it entirely. However: I'm loath to make such a fundamental change halfway through this comparison. It's tightly coupled to the checkpointing code, and the charting code, and so on. So I think that for this post, I'm just going to keep it there, and keep track of how much time (roughly) we're spending on each validation step for each train, so that we can remove it and get a "pure" train-time only comparison between the different kinds of machines. It's not pretty, but I think it's better than changing horses mid-stream. On the other hand, the validation is a real pain when doing the binary chop to find out the maximum micro-batch size for our VRAM before we start the training run. That's because we have to wait for one validation to run before we get into the full training loop, which makes it slower. On top of that, having to do a manual binary chop is a PITA. What I think would be a true QoL improvement for the future trains is something that does the binary chop for us, using a dummy training loop. We run it once on each new machine type, get a micro-batch size to plug into our training parameters, and then let it rip, This will re-use so much of the code from the training script that I think it actually is just an alternative way of running it. After a bit of hacking, I came up with this updated code -- the diff is a bit hairy, but essentially: That takes just over six seconds to find the correct batch size on my local machine; with multiple GPUs, I expect it will be slower (there's a spinup overhead to start all of the per-GPU processes), but I'm sure it won't be as bad as the manual binary chops with validation that I was doing, and will be less error-prone. Right! We've done some QoL stuff, let's try another machine size on Lambda Labs :-) These are the machines that Andrej Karpathy is recommending for training nanochat, so let's see how we do with them. They cost US$23.92/hour; let's see how it works out. 
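Before spinning it up, here's roughly what the automated micro-batch-size search described above might look like -- a sketch rather than the actual diff. try_short_train stands in for the post's three-step, no-validation dummy train, and is expected to raise on an out-of-memory failure:

```python
import torch


def is_oom(exc: RuntimeError) -> bool:
    # PyTorch surfaces CUDA OOMs as RuntimeErrors; the message is the easiest tell
    return "out of memory" in str(exc).lower()


def find_max_micro_batch_size(try_short_train, low=1, high=70):
    # assumes `low` is known to fit and `high` is known not to
    while high - low > 1:
        candidate = (low + high) // 2
        try:
            try_short_train(candidate)
            low = candidate           # it fit: search upwards
        except RuntimeError as exc:
            if not is_oom(exc):
                raise                 # a genuine bug, not an OOM -- re-raise it
            high = candidate          # it didn't fit: search downwards
        finally:
            torch.cuda.empty_cache()  # free cached memory before the next attempt
    return low
```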
Here are the steps: Now let's download our dataset and find our micro-batch size: That took less than a minute to run -- nice! Now we can put that micro-batch size in . It does seem a little small -- after all, we could fit a batch of 64 into 160 GiB -- but I'll do some analysis later. Actually, before we kick off the train, let's see how long all of the preparatory steps took to run before we can do that -- not just the micro-batch-size script, but also the installation of the dependencies, the clone, and any overhead from boot time etc: Five minutes total. Not bad. Let's start the train: The initial validation run took 38 seconds, and then we started off. At 4m37s in, we get the first real validation run; at that point, it's running at 493k tokens/second. Eventually, it finishes, having taken about 1h50 including all of the validations. Here's the training chart: Two things stand out here: Further evidence that gradient clipping is likely to be an excellent addition to our training loop! It's also worth noting that the train loss spikes at the same time as the validation loss, so getting rid of the latter would still allow us to get a "best" checkpoint to compare with the latest at the end of the train. The machine was up and running for 2h9m, costing US$23.92/hour, for a total cost of US$51.47. The train took 6,650.197 seconds, so about 1h50m. Allowing for five minutes setup time, that's 1h55m accounted for. There's an extra 14m there -- that was because downloading those two checkpoints to my machine took quite a long time due to local network issues. Might want to look into ways to avoid that later. And for later cost-accounting purposes, we should note that it took 38 seconds or so for each validation run, and we can see on the chart that there were 24 of them. So, firstly, let's give our two models -- the best one and the latest one -- a smoke test: Both of those look OK! Now let's try the loss test. I started running it, but when it started downloading the dataset, I realised that it needed updating to allow for the changes I made to -- ooops! That done, let's give it a run for both of our models: As you'd expect, the best checkpoint has somewhat better loss, at 3.725, than the last one, with 3.734. Once again, better than our local trains, but not quite as good as the result with the first cloud train on that 8x A100 40 GiB machine, which was 3.674. Again, I'll put together a table comparing all of these results at the end. Does that make any real difference with the instruction fine-tune test? The test prints a lot out, but the headline numbers: So that was interesting! However, I am getting ever less convinced that the IFT test is a useful one; the randomness of the LLM-as-a-judge responses means that I don't think it can be consistent. Perhaps a better way to do this would be to batch up all of the models, and then give GPT5.1 answers from "model A", "model B", and so on all in one query, and then to ask it to give them scores all at the same time. That would hopefully make things at least a bit more consistent. Something to ponder later, I think. In the meantime, one extra thing I wanted to dig into before going on to the last train for this post: I mentioned that I thought that the batch size for that last run, 27, was a bit small considering that we'd managed to fit a size of 64 into the 160 GiB/GPU machine. 
But after thinking about it for a bit, it occurs to me that during my experiments doing fine-tuning, I came to the conclusion that memory use scaled linearly with batch size , with a fixed amount per element in the batch (the activations for the model for that batch element), plus an overhead (the model itself, the optimiser, and perhaps other stuff). We have batch sizes for: Now, that is slightly messy data because each memory "measurement" is the size of the card's VRAM, not the amount of VRAM we actually used -- there might have been anything from zero to just less than one extra batch element's worth of "spare" space -- but we can see what we get with a simple linear regression: And if we plot that, we get this: Nice! That fits really well. So we have an overhead of about 11.5 GiB, then about 2.35 GiB per batch element on top of that. That is, of course, somewhat sad news for anyone trying to repro this on a GPU with 12 GiB -- looks like it would be just too small to even fit in a single-element batch after the overhead :-( Anyway, that's been a bit of a side quest. Let's try our last machine size for what has (once again) turned into a bit of a monster of a blog post... This is the same kind of instance as the first train in this post, except that it has double the VRAM per GPU. Let's see what we can do with it. Once again, we create the run file, commit and push, then spin up the machine. On it, we clone the repo, run then . Next, we can find our micro-batch size: Interesting, we managed to squeeze an extra one in compared to the H100's batch size of 27, despite having exactly the same amount of VRAM! Not sure what might have caused that. It took 4 minutes to get to this point, so let's get that batch size into the config and kick off the run. The initial validation takes 1m06s, which is consistent throughout the train. The first real val run at 8m15s in, and the estimated train time is 2h35m, with a tokens-per-second of 286,188. At the end: Again, the latest and the best global steps are the same (despite some loss spikes): ...so we just need to download that and shut down the machine. How much did that cost us? The machine was running for 3h25m, costing US$14.32 / hour, for a total of US$48.76. Our train took 11,532 seconds, which is 3h12m, and our setup took about 4 minutes -- maybe five including the time required to update the train config with the micro-batch size, so we have 7 minutes on top of that, which is about the amount of time it took to download the model. Let's run some evals! Our smoke test gives us this: Coherent enough, I think! Now the loss on our test dataset; it comes out as 3.730, so pretty similar to our other cloud trains, apart from the oddly-low one on the 40 GiB GPUs. Now let's see what GPT-5.1 thinks of the instruction fine-tuned version. It only needs two epochs of fine-tuning, and believes that "The author of 'Pride and Prejudice' is 'Pride and Prejudice'", which is not promising, and gets a score in the same kind of range as the other models, 11.71. So: we've trained four models on four different machine sizes. Let's see how they stack up against each other, against our locally-trained models, and the original OpenAI GPT-2 weights. So, I've trained four of my 163M-parameter GPT-2 models, using almost exactly the same dataset -- the Chinchilla-optimal number of tokens, rounded up to make an even number of batches. 
I did this on four different multi-GPU machines on Lambda Labs: I've done some evals on each of the models, so let's put those results together in one table -- results for the trains in this blog post, alongside those for the original OpenAI GPT-2 weights, both small and medium, and for the models I got when training locally. For all models, I've provided: I've sorted the models in order of increasing loss on the test set -- so, the best model by that measure is first. The instruction fine-tune results are kind of all over the place, and I'll look into that later 5 . For now, let's focus on the test loss. We have a pretty clear pattern, where the local trains are grouped together at around 4.0, and the cloud trains at around 3.7. For the local trains, as I noticed last time around, FineWeb is counter-intuitively better than FineWeb-Edu. There are two interesting things about the cloud trains: I think that what we're seeing here is that larger batches are better, but only up to a point. It's as if there's some kind of curve like this: I got that by taking the log of the batch size, then asking NumPy to do a polynomial regression -- that is, work out a , b and c so that the formula ...fits it as well as possible: It's kind of interesting that it's such a good fit with such an ad-hoc formula! We have a nice smooth curve hitting almost all of the points, and our optimal batch size looks like it's just a little below that 104 we managed with the smaller cloud machine, at about 97. But it's certainly not something that I'd like to read too much into. Best to treat it as purely illustrative: "it might be something like this". I think digging into that might be an interesting experiment at some later point. A bit of checking around the Internet (and a chat with ChatGPT) suggests that it's something people have looked into in some detail, unsurprisingly. An interesting point ChatGPT raised is that with our pretty much fixed "budget" of tokens -- we're always training on something close to the Chinchilla-optimal number -- then a larger batch size means that we're doing fewer optimiser steps. Intuitively, that sounds like a problem. The larger batches mean that each move across the loss landscape is "better", or at least more stable. But we're doing fewer of those moves over the course of the train. There's obviously a tension between those two. You can imagine a degenerate case where the batch is so large you can fit the entire run into one iteration, so you do just one update of the parameters; that obviously wouldn’t work very well. Anyway, for the purposes of this post, let's flag it as interesting and move on. Let's take a look at costs. Here's another table for those -- for each cloud model, I've listed: What do these numbers tell us, given what we were trying to do here? Like I said at the start, this was a pretty expensive learning experience: I wound up spending US$215.16 on Lambda Labs instances over the course of putting this all together. But it was worth it! At the start of this post (if you can remember so far back), I said I wanted to achieve two things: Yes, absolutely. The trains I did, if we exclude the validation time, each cost between US$35.56 and US$39.14. In time, also excluding validation, the slowest ran for about 3h25m, and the fastest just less than an hour. 
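Going back to the loss-versus-batch-size curve for a moment, the fit was along these lines. The quadratic-in-log form and the global batch sizes come from the text above, but only some of the loss values appear in the post, so the others below are rough placeholders and the output should be treated as purely illustrative:

```python
import numpy as np

global_batch_sizes = np.array([6, 104, 216, 224, 512])
losses = np.array([4.00, 3.674, 3.725, 3.730, 3.76])  # first and last are guesses

log_b = np.log(global_batch_sizes)
# fit loss ~= a * (log b)**2 + b_coef * log b + c
a, b_coef, c = np.polyfit(log_b, losses, deg=2)

optimal_batch_size = np.exp(-b_coef / (2 * a))  # vertex of the fitted parabola
print(f"estimated optimal global batch size: {optimal_batch_size:.0f}")
```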
Now, in a future post I want to try making the changes that I listed at the end of my last post to see if I can get the loss lower: If I'm to do those, what I'll need to do is start with a baseline train on one particular size of machine, and then try introducing each change separately to see what happens to loss. I'll want to use a fixed seed for random number generation, so that I start with the same initial weights each time. Given what these experiments have already shown about loss -- that the smallest, cheapest machine has better loss than the other more expensive ones due to what I assume is the batch size -- then that actually feels like exactly the right machine to choose for this. It does take a while to train anything, but three and a half hours is pretty acceptable, I think -- I can do a train or two per day. An 8x A100 with 40 GiB VRAM per GPU is the way forward. So: next steps. I want to: This is going to be fun. Stay tuned! I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit . The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.  ↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩ I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right. DataParallel (DP). With this: The default GPU (normally ) is in charge of the process. It gets a batch of data, divides it up into per-GPU "micro-batches", and sends each of those to a thread for each of the other GPUs. It then sends an up-to-date version of the model to each GPU. Next, all of the per-GPU threads do a forward pass on their replica using their specific micro-batch, and send their outputs to the thread for the default GPU. The default GPU thread aggregates all of those outputs (similarly to how the losses across all of our batches and the prefix sequences are aggregated in the normal single-GPU case ) to work out an overall loss. It then does a backward pass. This will start on the default GPU, as the aggregation step is the first thing that it will come to when going backwards through the steps that came up with that overall loss. However, it will then come to operations that happened on the other GPUs and those are (somehow) parallelised. Once that is done, each GPU has gradients that represent how their copies of the model contributed to the overall loss. 
Finally, they send those gradients back to the default GPU, which combines them (I think of this as just being an average, though I gather it's more complex) and applies them, producing an updated model. Then the process repeats; the updated model on the default GPU will be sent to the other GPUs in the second step of the next iteration.

DistributedDataParallel (DDP). This does less work on the default GPU and does less copying around. Each GPU has its own process (rather than thread), and is essentially responsible for its own training loop. Right at the very start, the default GPU's process sends the model to all of the others. Then all processes go into their training loop:

- Firstly, each one works out its own micro-batch (which means you need to have code to make sure that the datasets are properly split across the GPUs)
- Each model does its own forward pass, then its own backward pass, working out its own independent gradients.
- As it comes up with those gradients, it broadcasts them to a "reducer", which handles the aggregation. This is done in a distributed way -- there's not just one reducer handling everything.
- When all models have completed the backward pass, the reducer has a set of combined gradients, which is visible from the per-GPU processes.
- Each GPU process does its own optimizer step using those combined gradients.

That means that there's no model copy required -- each GPU has applied the same gradient update, so they already have in-sync models, assuming everything went well.

ZeRO. This is a much more complex system, and I went into how it works in this blog post .

= 0 for the process with rank 0
= 1 for the process with rank 1
= 7 for the process with rank 7
= 8 for the process with rank 0
= 9 for the process with rank 1
= 15 for the process with rank 7

Zoom through the records in the dataset in batches of 1,000. For each batch:

- Tokenising each batch, so we get a list of lists of tokens.
- Convert that list of lists into a single list with tokens separating each item.
- Convert that list into a PyTorch tensor.
- Add the tensor to a list.

After that's all done, use to convert the list into a single tensor, and then save that with .

I can upload the datasets to Hugging Face; their network connection will be better than mine, so I can just pay the price in time of uploading everything from home once, and then I can download them faster from HF to LL.
Uploading the datasets to Hugging Face also has the benefit of meaning that after this experiment I can safely delete the local files, but then download them again if I need them. And if anyone else wants to repro this experiment, the data will be easily available to them.

Lambda Labs have persistent filesystems that you can use. They cost $0.20/GB/month, so that would be about $5/month for all of my datasets. So I could upload the data to a cheap instance with a persistent filesystem mounted, shut down that instance but keep the filesystem, and then mount it on each machine I use to run tests.

- The world size -- that is, how many per-GPU processes are we running?
- The micro-batch size
- The sequence length

- An 8x B200, with 160 GiB per GPU, at $39.92/hour
- An 8x H100, with 80 GiB per GPU, at $23.92/hour
- An 8x A100, with 80 GiB per GPU, at $14.32/hour
- An 8x A100, with 40 GiB per GPU, at $10.32/hour

- The loss they got on the validation set from the first train. Strictly speaking, I was kind of cheating and using that as a test set.
- The score given by the OpenAI GPT 5.1 model for an instruction-following dataset. This was the one provided in the book -- an Alpaca-style Q&A dataset, with a well-defined train and test set. Each model was fine-tuned on a training set of 85% of the data until loss on a validation set of 5% of the data started rising, and then tested on the remaining 10%. Sebastian Raschka, being a pro, was splitting up the data properly :-)

If we're going to do validation then it does make some sense to do one at the start -- but doing one training iteration first seems kind of arbitrary (though it's clear how that drops out of the existing code).

The validation runs on this machine are taking longer than they were on the less-powerful A100 GPUs! That confused me for a bit, until I realised that I didn't notice that it was slower with the batch-size 13 test, only with the larger ones later in the binary chop. If we're using larger batches, then there's more work to do for the validation.

Doing this binary chop by hand is annoying and error-prone, and worse, we have to wait for one of those (long) validation runs before we get into proper training.

The initial training iteration can succeed, while later ones hit memory limits -- it seems like we need to wait for three or four training iterations before we can be sure that we have a workable batch size. Not quite sure why that is, perhaps it's something in the optimiser or the scaler?

If : Local snapshot path.
If : A list of DryRunFileInfo objects containing download information.

I updated the function so that it takes flags to tell it whether or not to do validation (default true) and an optional maximum number of steps, which is  by default. With those default values, it does exactly the same as before, of course. I created a function, which does all of the dataset-loading stuff that the original function did, and then calls  with a -wrapped model. So that maintains the current flow.

Next, I added a flag to the script; if that's not set, it just calls . However, if it is set, it instead calls a new function, which determines the largest batch size we can fit onto the current hardware for the current run, and (on the rank 0 process only, to avoid log spam), prints it out. It does what it says on the tin; it confirms that we can train with a batch size of 1, and that we can't with batch size 70 (chosen because the limit was 64 on that massive B200 machine), then chops between them to find the largest batch size that doesn't OOM.
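The search itself is easy to sketch. Something like the following rough outline -- not the actual code from the repo; run_short_train is a hypothetical stand-in for the helper that does the short, no-validation train:

```python
import torch


def batch_size_fits(batch_size, steps=3):
    """Try a few training steps at this batch size; False means we hit a CUDA OOM."""
    try:
        # Hypothetical helper: builds the dataset/dataloader for this batch size
        # and runs a short train with validation turned off.
        run_short_train(batch_size=batch_size, max_steps=steps, validation=False)
        return True
    except RuntimeError as exc:
        # PyTorch surfaces CUDA OOMs as a generic RuntimeError, so check the message.
        if "out of memory" in str(exc).lower():
            torch.cuda.empty_cache()
            return False
        raise


def find_max_batch_size(known_good=1, known_bad=70):
    assert batch_size_fits(known_good) and not batch_size_fits(known_bad)
    while known_bad - known_good > 1:
        candidate = (known_good + known_bad) // 2
        if batch_size_fits(candidate):
            known_good = candidate
        else:
            known_bad = candidate
    return known_good
```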
It uses  for that -- that just constructs a dataset with the appropriate batch size, then runs a three-step train with no validation to see if it raises an OOM. PyTorch rather messily just raises a generic  for those, but we can look inside the exception's message to see if it is an OOM.

Create the run file, commit and push. Spin up the machine. On it: Clone the repo

We had two nasty loss spikes. As a result of the second of those, the best iteration as per validation loss is not the last one.

- Best checkpoint: 4 epochs of fine-tuning, and a score of 11.98 -- another record low! Amusingly, it confidently said "The author of 'Pride and Prejudice' is Sarah Palin".
- Latest checkpoint: 5 epochs of fine-tuning, and a rather good score of 17.91.

- 24 GiB locally, which was 6
- 40 GiB in the first train in this series, which was 13
- 80 GiB in the last one, giving us 27
- 160 GiB in the one on the huge machine, giving us 64

- An 8x A100 40 GiB
- An 8x A100 80 GiB
- An 8x H100 80 GiB
- An 8x B200 160 GiB

- The loss on my test set.
- The results it got on an instruction fine-tune test based on Sebastian Raschka's.
- The global batch size (that is, for single GPU runs, just the batch size, but for the multi-GPU ones, where each batch is made up of per-GPU micro-batches, the per-GPU batch size times the number of GPUs). 4

They're all consistently better than the local ones. The one on the smaller machine is better than the ones on the larger ones; indeed, it looks like the larger the machine, the worse.

- How long the training run took.
- How much the machine cost per hour.
- How much the training run cost.
- How much of that was doing validation (which I'm now thinking is pointless on single-epoch trains like this).
- How much it would have cost, and how long it would have taken if it had been run without validation.

I wanted to learn how to change a simple single-GPU training loop to make it multi-GPU. Could I get the training time for a full base model down from 48 hours to something more manageable -- and, hopefully, not too expensive?

- Removing dropout
- Tweaking the learning rate (and maybe adding the warmup and cosine learning-rate decay stuff I've read about).
- Reverting the architectural differences between our model and the original GPT-2: reintroducing weight tying between the token embeddings and the final linear layer, and also bias in the attention weights.
- Trying full-fat 32-bit precision.
- Fixing the exploding gradients issue with gradient clipping.

- Dig in to the instruction fine-tuning tests a little more -- as I've said above, I'm not 100% happy with how comparable it really is between models, at least given how I've been running it so far.
- Upload the models we have to Hugging Face. I have a new motherboard ready for my PC, and replacing the old one has a risk that I might mess up and break the NVMe drive I have them stored on. I was holding off on this because it would mean sharing Raschka's GPT code, but having noticed that he's already licensed it all under the Apache license, I can release them under the same one.
- Strip out the validation stuff. We can use training loss to track our progress, and losing evals during the train will help keep the cost down.
- Finally, do the trains to see how each of the levers above affects loss.

I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit. The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.
↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩

Anton Zhiyanov 1 week ago

Go 1.26 interactive tour

Go 1.26 is coming out in February, so it's a good time to explore what's new. The official release notes are pretty dry, so I prepared an interactive version with lots of examples showing what has changed and what the new behavior is. Read on and see! new(expr)  • Type-safe error checking  • Green Tea GC  • Faster cgo and syscalls  • Faster memory allocation  • Vectorized operations  • Secret mode  • Reader-less cryptography  • Goroutine leak profile  • Goroutine metrics  • Reflective iterators  • Peek into a buffer  • Process handle  • Signal as cause  • Compare IP subnets  • Context-aware dialing  • Fake example.com  • Optimized fmt.Errorf  • Optimized io.ReadAll  • Multiple log handlers  • Test artifacts  • Modernized go fix  • Final thoughts This article is based on the official release notes from The Go Authors and the Go source code, licensed under the BSD-3-Clause license. This is not an exhaustive list; see the official release notes for that. I provide links to the documentation (𝗗), proposals (𝗣), commits (𝗖𝗟), and authors (𝗔) for the features described. Check them out for motivation, usage, and implementation details. I also have dedicated guides (𝗚) for some of the features. Error handling is often skipped to keep things simple. Don't do this in production ツ Previously, you could only use the built-in with types: Now you can also use it with expressions: If the argument is an expression of type T, then allocates a variable of type T, initializes it to the value of , and returns its address, a value of type . This feature is especially helpful if you use pointer fields in a struct to represent optional values that you marshal to JSON or Protobuf: You can use with composite values: And function calls: Passing is still not allowed: 𝗗 spec • 𝗣 45624 • 𝗖𝗟 704935 , 704737 , 704955 , 705157 • 𝗔 Alan Donovan The new function is a generic version of : It's type-safe and easier to use: is especially handy when checking for multiple types of errors. It makes the code shorter and keeps error variables scoped to their blocks: Another issue with is that it uses reflection and can cause runtime panics if used incorrectly (like if you pass a non-pointer or a type that doesn't implement ): doesn't cause a runtime panic; it gives a clear compile-time error instead: doesn't use , executes faster, and allocates less than : Since can handle everything that does, it's a recommended drop-in replacement for new code. 𝗗 errors.AsType • 𝗣 51945 • 𝗖𝗟 707235 • 𝗔 Julien Cretel The new garbage collector (first introduced as experimental in 1.25) is designed to make memory management more efficient on modern computers with many CPU cores. Go's traditional garbage collector algorithm operates on graph, treating objects as nodes and pointers as edges, without considering their physical location in memory. The scanner jumps between distant memory locations, causing frequent cache misses. As a result, the CPU spends too much time waiting for data to arrive from memory. More than 35% of the time spent scanning memory is wasted just stalling while waiting for memory accesses. As computers get more CPU cores, this problem gets even worse. Green Tea shifts the focus from being processor-centered to being memory-aware. Instead of scanning individual objects, it scans memory in contiguous 8 KiB blocks called spans . The algorithm focuses on small objects (up to 512 bytes) because they are the most common and hardest to scan efficiently. 
Each span is divided into equal slots based on its assigned size class , and it only contains objects of that size class. For example, if a span is assigned to the 32-byte size class, the whole block is split into 32-byte slots, and objects are placed directly into these slots, each starting at the beginning of its slot. Because of this fixed layout, the garbage collector can easily find an object's metadata using simple address arithmetic, without checking the size of each object it finds. When the algorithm finds an object that needs to be scanned, it marks the object's location in its span but doesn't scan it immediately. Instead, it waits until there are several objects in the same span that need scanning. Then, when the garbage collector processes that span, it scans multiple objects at once. This is much faster than going over the same area of memory multiple times. To make better use of CPU cores, GC workers share the workload by stealing tasks from each other. Each worker has its own local queue of spans to scan, and if a worker is idle, it can grab tasks from the queues of other busy workers. This decentralized approach removes the need for a central global list, prevents delays, and reduces contention between CPU cores. Green Tea uses vectorized CPU instructions (only on amd64 architectures) to process memory spans in bulk when there are enough objects. Benchmark results vary, but the Go team expects a 10–40% reduction in garbage collection overhead in real-world programs that rely heavily on the garbage collector. Plus, with vectorized implementation, an extra 10% reduction in GC overhead when running on CPUs like Intel Ice Lake or AMD Zen 4 and newer. Unfortunately, I couldn't find any public benchmark results from the Go team for the latest version of Green Tea, and I wasn't able to create a good synthetic benchmark myself. So, no details this time :( The new garbage collector is enabled by default. To use the old garbage collector, set at build time (this option is expected to be removed in Go 1.27). 𝗣 73581 • 𝗔 Michael Knyszek In the Go runtime, a processor (often referred to as a P) is a resource required to run the code. For a thread (a machine or M) to execute a goroutine (G), it must first acquire a processor. Processors move through different states. They can be (executing code), (waiting for work), or (paused because of the garbage collection). Previously, processors had a state called used when a goroutine is making a system or cgo call. Now, this state has been removed. Instead of using a separate processor state, the system now checks the status of the goroutine assigned to the processor to see if it's involved in a system call. This reduces internal runtime overhead and simplifies code paths for cgo and syscalls. The Go release notes say -30% in cgo runtime overhead, and the commit mentions an 18% sec/op improvement: I decided to run the CgoCall benchmarks locally as well: Either way, both a 20% and a 30% improvement are pretty impressive. And here are the results from a local syscall benchmark: That's pretty good too. 𝗖𝗟 646198 • 𝗔 Michael Knyszek The Go runtime now has specialized versions of its memory allocation function for small objects (from 1 to 512 bytes). It uses jump tables to quickly choose the right function for each size, instead of relying on a single general-purpose implementation. The Go release notes say "the compiler will now generate calls to size-specialized memory allocation routines". 
But based on the code, that's not completely accurate: the compiler still emits calls to the general-purpose function. Then, at runtime, dispatches those calls to the new specialized allocation functions. This change reduces the cost of small object memory allocations by up to 30%. The Go team expects the overall improvement to be ~1% in real allocation-heavy programs. I couldn't find any existing benchmarks, so I came up with my own. And indeed, running it on Go 1.25 compared to 1.26 shows a significant improvement: The new implementation is enabled by default. You can disable it by setting at build time (this option is expected to be removed in Go 1.27). 𝗖𝗟 665835 • 𝗔 Michael Matloob The new package provides access to architecture-specific vectorized operations (SIMD — single instruction, multiple data). This is a low-level package that exposes hardware-specific functionality. It currently only supports amd64 platforms. Because different CPU architectures have very different SIMD operations, it's hard to create a single portable API that works for all of them. So the Go team decided to start with a low-level, architecture-specific API first, giving "power users" immediate access to SIMD features on the most common server platform — amd64. The package defines vector types as structs, like (a 128-bit SIMD vector with sixteen 8-bit integers) and (a 512-bit SIMD vector with eight 64-bit floats). These match the hardware's vector registers. The package supports vectors that are 128, 256, or 512 bits wide. Most operations are defined as methods on vector types. They usually map directly to hardware instructions with zero overhead. To give you a taste, here's a custom function that uses SIMD instructions to add 32-bit float vectors: Let's try it on two vectors: Common operations in the package include: The package uses only AVX instructions, not SSE. Here's a simple benchmark for adding two vectors (both the "plain" and SIMD versions use pre-allocated slices): The package is experimental and can be enabled by setting at build time. 𝗗 simd/archsimd • 𝗣 73787 • 𝗖𝗟 701915 , 712880 , 729900 , 732020 • 𝗔 Junyang Shao , Sean Liao , Tom Thorogood Cryptographic protocols like WireGuard or TLS have a property called "forward secrecy". This means that even if an attacker gains access to long-term secrets (like a private key in TLS), they shouldn't be able to decrypt past communication sessions. To make this work, ephemeral keys (temporary keys used to negotiate the session) need to be erased from memory immediately after the handshake. If there's no reliable way to clear this memory, these keys could stay there indefinitely. An attacker who finds them later could re-derive the session key and decrypt past traffic, breaking forward secrecy. In Go, the runtime manages memory, and it doesn't guarantee when or how memory is cleared. Sensitive data might remain in heap allocations or stack frames, potentially exposed in core dumps or through memory attacks. Developers often have to use unreliable "hacks" with reflection to try to zero out internal buffers in cryptographic libraries. Even so, some data might still stay in memory where the developer can't reach or control it. The Go team's solution to this problem is the new package. It lets you run a function in secret mode . After the function finishes, it immediately erases (zeroes out) the registers and stack it used. Heap allocations made by the function are erased as soon as the garbage collector decides they are no longer reachable. 
This helps make sure sensitive information doesn't stay in memory longer than needed, lowering the risk of attackers getting to it. Here's an example that shows how might be used in a more or less realistic setting. Let's say you want to generate a session key while keeping the ephemeral private key and shared secret safe: Here, the ephemeral private key and the raw shared secret are effectively "toxic waste" — they are necessary to create the final session key, but dangerous to keep around. If these values stay in the heap and an attacker later gets access to the application's memory (for example, via a core dump or a vulnerability like Heartbleed), they could use these intermediates to re-derive the session key and decrypt past conversations. By wrapping the calculation in , we make sure that as soon as the session key is created, the "ingredients" used to make it are permanently destroyed. This means that even if the server is compromised in the future, this specific past session can't be exposed, which ensures forward secrecy. The current implementation only supports Linux (amd64 and arm64). On unsupported platforms, invokes the function directly. Also, trying to start a goroutine within the function causes a panic (this will be fixed in Go 1.27). The package is mainly for developers who work on cryptographic libraries. Most apps should use higher-level libraries that use behind the scenes. The package is experimental and can be enabled by setting at build time. 𝗗 runtime/secret • 𝗣 21865 • 𝗖𝗟 704615 • 𝗔 Daniel Morsing Current cryptographic APIs, like or , often accept an as the source of random data: These APIs don't commit to a specific way of using random bytes from the reader. Any change to underlying cryptographic algorithms can change the sequence or amount of bytes read. Because of this, if the application code (mistakenly) relies on a specific implementation in Go version X, it might fail or behave differently in version X+1. The Go team chose a pretty bold solution to this problem. Now, most crypto APIs will just ignore the random parameter and always use the system random source ( ). The change applies to the following subpackages: still uses the random reader if provided. But if is nil, it uses an internal secure source of random bytes instead of (which could be overridden). To support deterministic testing, there's a new package with a single function. It sets a global, deterministic cryptographic randomness source for the duration of the given test: affects and all implicit sources of cryptographic randomness in the packages: To temporarily restore the old reader-respecting behavior, set (this option will be removed in a future release). 𝗗 testing/cryptotest • 𝗣 70942 • 𝗖𝗟 724480 • 𝗔 Filippo Valsorda , qiulaidongfeng A leak occurs when one or more goroutines are indefinitely blocked on synchronization primitives like channels, while other goroutines continue running and the program as a whole keeps functioning. Here's a simple example: If we call and don't read from the output channel, the inner goroutine will stay blocked trying to send to the channel for the rest of the program: Unlike deadlocks, leaks do not cause panics, so they are much harder to spot. Also, unlike data races, Go's tooling did not address them for a long time. Things started to change in Go 1.24 with the introduction of the package. Not many people talk about it, but is a great tool for catching leaks during testing. 
Go 1.26 adds a new experimental profile designed to report leaked goroutines in production. Here's how we can use it in the example above: As you can see, we have a nice goroutine stack trace that shows exactly where the leak happens. The profile finds leaks by using the garbage collector's marking phase to check which blocked goroutines are still connected to active code. It starts with runnable goroutines, marks all sync objects they can reach, and keeps adding any blocked goroutines waiting on those objects. When it can't add any more, any blocked goroutines left are waiting on resources that can't be reached — so they're considered leaked. Here's the gist of it: For even more details, see the paper by Saioc et al. If you want to see how (and ) can catch typical leaks that often happen in production — check out my article on goroutine leaks . The profile is experimental and can be enabled by setting at build time. Enabling the experiment also makes the profile available as a net/http/pprof endpoint, . According to the authors, the implementation is already production-ready. It's only marked as experimental so they can get feedback on the API, especially about making it a new profile. 𝗗 runtime/pprof • 𝗚 Detecting leaks • 𝗣 74609 , 75280 • 𝗖𝗟 688335 • 𝗔 Vlad Saioc New metrics in the package give better insight into goroutine scheduling: Here's the full list: Per-state goroutine metrics can be linked to common production issues. For example, an increasing waiting count can show a lock contention problem. A high not-in-go count means goroutines are stuck in syscalls or cgo. A growing runnable backlog suggests the CPUs can't keep up with demand. You can read the new metric values using the regular function: The per-state numbers (not-in-go + runnable + running + waiting) are not guaranteed to add up to the live goroutine count ( , available since Go 1.16). All new metrics use counters. 𝗗 runtime/metrics • 𝗣 15490 • 𝗖𝗟 690397 , 690398 , 690399 • 𝗔 Michael Knyszek The new and methods in the package return iterators for a type's fields and methods: The new methods and return iterators for the input and output parameters of a function type: The new methods and return iterators for a value's fields and methods. Each iteration yields both the type information ( or ) and the value: Previously, you could get all this information by using a for-range loop with methods (which is what iterators do internally): Using an iterator is more concise. I hope it justifies the increased API surface. 𝗗 reflect • 𝗣 66631 • 𝗖𝗟 707356 • 𝗔 Quentin Quaadgras The new method in the package returns the next N bytes from the buffer without advancing it: If returns fewer than N bytes, it also returns : The slice returned by points to the buffer's content and stays valid until the buffer is changed. So, if you change the slice right away, it will affect future reads: The slice returned by is only valid until the next call to a read or write method. 𝗗 Buffer.Peek • 𝗣 73794 • 𝗖𝗟 674415 • 𝗔 Ilia Choly After you start a process in Go, you can access its ID: Internally, the type uses a process handle instead of the PID (which is just an integer), if the operating system supports it. Specifically, in Linux it uses pidfd , which is a file descriptor that refers to a process. Using the handle instead of the PID makes sure that methods always work with the same OS process, and not a different process that just happens to have the same ID. Previously, you couldn't access the process handle. 
Now you can, thanks to the new method: calls a specified function and passes a process handle as an argument: The handle is guaranteed to refer to the process until the callback function returns, even if the process has already terminated. That's why it's implemented as a callback instead of a field or method. is only supported on Linux 5.4+ and Windows. On other operating systems, it doesn't execute the callback and returns an error. 𝗗 Process.WithHandle • 𝗣 70352 • 𝗖𝗟 699615 • 𝗔 Kir Kolyshkin returns a context that gets canceled when any of the specified signals is received. Previously, the canceled context only showed the standard "context canceled" cause: Now the context's cause shows exactly which signal was received: The returned type, , is based on , so it doesn't provide the actual value — just its string representation. 𝗗 signal.NotifyContext • 𝗖𝗟 721700 • 𝗔 Filippo Valsorda An IP address prefix represents an IP subnet. These prefixes are usually written in CIDR notation: In Go, an IP prefix is represented by the type. The new method lets you compare two IP prefixes, making it easy to sort them without having to write your own comparison code: orders two prefixes as follows: This follows the same order as Python's and the standard IANA (Internet Assigned Numbers Authority) convention. 𝗗 Prefix.Compare • 𝗣 61642 • 𝗖𝗟 700355 • 𝗔 database64128 The package has top-level functions for connecting to an address using different networks (protocols) — , , , and . They were made before was introduced, so they don't support cancellation: There's also a type with a general-purpose method. It supports cancellation and can be used to connect to any of the known networks: However, a bit less efficient than network-specific functions like — because of the extra overhead from address resolution and network type dispatching. So, network-specific functions in the package are more efficient, but they don't support cancellation. The type supports cancellation, but it's less efficient. The Go team decided to resolve this contradiction. The new context-aware methods ( , , , and ) combine the efficiency of the existing network-specific functions with the cancellation capabilities of : I wouldn't say that having three different ways to dial is very convenient, but that's the price of backward compatibility. 𝗗 net.Dialer • 𝗣 49097 • 𝗖𝗟 490975 • 𝗔 Michael Fraenkel The default certificate already lists in its DNSNames (a list of hostnames or domain names that the certificate is authorized to secure). Because of this, doesn't trust responses from the real : To fix this issue, the HTTP client returned by now redirects requests for and its subdomains to the test server: 𝗗 Server.Client • 𝗖𝗟 666855 • 𝗔 Sean Liao People often point out that using for plain strings causes more memory allocations than . Because of this, some suggest switching code from to when formatting isn't needed. The Go team disagrees. Here's a quote from Russ Cox: Using is completely fine, especially in a program where all the errors are constructed with . Having to mentally switch between two functions based on the argument is unnecessary noise. With the new Go release, this debate should finally be settled. For unformatted strings, now allocates less and generally matches the allocations for . Specifically, goes from 2 allocations to 0 allocations for a non-escaping error, and from 2 allocations to 1 allocation for an escaping error: This matches the allocations for in both cases. The difference in CPU cost is also much smaller now. 
Previously, it was ~64ns vs. ~21ns for vs. for escaping errors, now it's ~25ns vs. ~21ns. Here are the "before and after" benchmarks for the change. The non-escaping case is called , and the escaping case is called . If there's just a plain error string, it's . If the error includes formatting, it's . Seconds per operation: Bytes per operation: Allocations per operation: If you're interested in the details, I highly recommend reading the CL — it's perfectly written. 𝗗 fmt.Errorf • 𝗖𝗟 708836 • 𝗔 thepudds Previously, allocated a lot of intermediate memory as it grew its result slice to the size of the input data. Now, it uses intermediate slices of exponentially growing size, and then copies them into a final perfectly-sized slice at the end. The new implementation is about twice as fast and uses roughly half the memory for a 65KiB input; it's even more efficient with larger inputs. Here are the geomean results comparing the old and new versions for different input sizes: See the full benchmark results in the commit. Unfortunately, the author didn't provide the benchmark source code. Ensuring the final slice is minimally sized is also quite helpful. The slice might persist for a long time, and the unused capacity in a backing array (as in the old version) would just waste memory. As with the optimization, I recommend reading the CL — it's very good. Both changes come from thepudds , whose change descriptions are every reviewer's dream come true. 𝗗 io.ReadAll • 𝗖𝗟 722500 • 𝗔 thepudds The package, introduced in version 1.21, offers a reliable, production-ready logging solution. Since its release, many projects have switched from third-party logging packages to use it. However, it was missing one key feature: the ability to send log records to multiple handlers, such as stdout or a log file. The new type solves this problem. It implements the standard interface and calls all the handlers you set up. For example, we can create a log handler that writes to stdout: And another handler that writes to a file: Finally, combine them using a : I'm also printing the file contents here to show the results. When the receives a log record, it sends it to each enabled handler one by one. If any handler returns an error, doesn't stop; instead, it combines all the errors using : The method reports whether any of the configured handlers is enabled: Other methods — and — call the corresponding methods on each of the enabled handlers. 𝗗 slog.MultiHandler • 𝗣 65954 • 𝗖𝗟 692237 • 𝗔 Jes Cok Test artifacts are files created by tests or benchmarks, such as execution logs, memory dumps, or analysis reports. They are important for debugging failures in remote environments (like CI), where developers can't step through the code manually. Previously, the Go test framework and tools didn't support test artifacts. Now they do. The new methods , , and return a directory where you can write test output files: If you use with , this directory will be inside the output directory (specified by , or the current directory by default): As you can see, the first time is called, it writes the directory location to the test log, which is quite handy. If you don't use , artifacts are stored in a temporary directory which is deleted after the test completes. Each test or subtest within each package has its own unique artifact directory. 
Subtest outputs are not stored inside the parent test's output directory — all artifact directories for a given package are created at the same level: The artifact directory path normally looks like this: But if this path can't be safely converted into a local file path (which, for some reason, always happens on my machine), the path will simply be: (which is what happens in the examples above) Repeated calls to in the same test or subtest return the same directory. 𝗗 T.ArtifactDir • 𝗣 71287 • 𝗖𝗟 696399 • 𝗔 Damien Neil Over the years, the command became a sad, neglected bag of rewrites for very ancient Go features. But now, it's making a comeback. The new is re-implemented using the Go analysis framework — the same one uses. While and now use the same infrastructure, they have different purposes and use different sets of analyzers: By default, runs a full set of analyzers (currently, there are more than 20). To choose specific analyzers, use the flag for each one, or use to run all analyzers except the ones you turned off. For example, here we only enable the analyzer: And here, we enable all analyzers except : Currently, there's no way to suppress specific analyzers for certain files or sections of code. To give you a taste of analyzers, here's one of them in action. It replaces loops with or : If you're interested, check out the dedicated blog post for the full list of analyzers with examples. 𝗗 cmd/fix • 𝗚 go fix • 𝗣 71859 • 𝗔 Alan Donovan Go 1.26 is incredibly big — it's the largest release I've ever seen, and for good reason: All in all, a great release! You might be wondering about the package that was introduced as experimental in 1.25. It's still experimental and available with the flag. P.S. To catch up on other Go releases, check out the Go features by version list or explore the interactive tours for Go 1.25 and 1.24 . P.P.S. Want to learn more about Go? Check out my interactive book on concurrency a vector from array/slice, or a vector to array/slice. Arithmetic: , , , , . Bitwise: , , , , . Comparison: , , , , . Conversion: , , . Masking: , , . Rearrangement: . Collect live goroutines . Start with currently active (runnable or running) goroutines as roots. Ignore blocked goroutines for now. Mark reachable memory . Trace pointers from roots to find which synchronization objects (like channels or wait groups) are currently reachable by these roots. Resurrect blocked goroutines . Check all currently blocked goroutines. If a blocked goroutine is waiting for a synchronization resource that was just marked as reachable — add that goroutine to the roots. Iterate . Repeat steps 2 and 3 until there are no more new goroutines blocked on reachable objects. Report the leaks . Any goroutines left in the blocked state are waiting for resources that no active part of the program can access. They're considered leaked. Total number of goroutines since the program started. Number of goroutines in each state. Number of active threads. First by validity (invalid before valid). Then by address family (IPv4 before IPv6). Then by masked IP address (network IP). Then by prefix length. Then by unmasked address (original IP). Vet is for reporting problems. Its analyzers describe actual issues, but they don't always suggest fixes, and the fixes aren't always safe to apply. Fix is (mostly) for modernizing the code to use newer language and library features. Its analyzers produce fixes are always safe to apply, but don't necessarily indicate problems with the code. 
It brings a lot of useful updates, like the improved builtin, type-safe error checking, and goroutine leak detector. There are also many performance upgrades, including the new garbage collector, faster cgo and memory allocation, and optimized and . On top of that, it adds quality-of-life features like multiple log handlers, test artifacts, and the updated tool. Finally, there are two specialized experimental packages: one with SIMD support and another with protected mode for forward secrecy.

Lalit Maganti 1 week ago

One Number I Trust: Plain-Text Accounting for a Multi-Currency Household

Two people. Eighteen accounts spanning checking, savings, credit cards, investments. Three currencies. Twenty minutes of work every week. One net worth number I actually trust. The payoff: A single, trustworthy net worth number growing over time. No app did exactly what I needed, so I built my own personal finance system using plain-text accounting principles and a powerful Python library called Beancount . This post shows you how I handle imports, investments, multi-currency, and a two-person view. It all started during the 2021 tax season. I had blocked out an entire weekend and was juggling statements, trying to compute capital gains, stressing about getting the numbers mixed up. “This is chaos”, I thought. “There must be a way to simplify this with automation”. Being a software engineer, I did what felt natural and hacked together a bunch of scripts on top of a database.
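For readers who haven't seen plain-text accounting before, here's a hedged, minimal sketch of the idea: the ledger is just a text file, and Beancount's Python API parses and validates it. The accounts, amounts, and currencies below are made up for illustration -- the real setup described in this post is far more elaborate.

```python
from beancount import loader

# A tiny multi-currency ledger, written as plain text (made-up accounts and amounts).
LEDGER = """
option "operating_currency" "GBP"

2024-01-01 open Assets:UK:Checking       GBP
2024-01-01 open Assets:US:Brokerage      USD
2024-01-01 open Equity:Opening-Balances

2024-01-02 * "Opening balance (GBP)"
  Assets:UK:Checking        1000.00 GBP
  Equity:Opening-Balances  -1000.00 GBP

2024-01-02 * "Opening balance (USD)"
  Assets:US:Brokerage        500.00 USD
  Equity:Opening-Balances    -500.00 USD
"""

# load_string parses the ledger, runs Beancount's validation (every transaction
# must balance), and returns the directives, any errors, and the parsed options.
entries, errors, options = loader.load_string(LEDGER)
assert not errors, errors
print(f"Loaded {len(entries)} directives; operating currency: {options['operating_currency']}")
```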

emiruz 2 weeks ago

pyevidence: practical evidence theory

Introduction

The pyevidence repository, installation and usage examples are available here. I only properly discovered evidence theory – also known as Dempster-Shafer theory – recently, and wrote about it a bit here. It's a simple theory to get going with, and the Wikipedia article does a good job of introducing it. Briefly, the subject of evidence theory is the powerset of some set \(X\), to which credence (called “mass”) is assigned with the constraint that it adds up to \(1\) and that nothing is assigned to the empty set.
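To make that concrete, here's a tiny plain-Python sketch (deliberately not using the pyevidence API) of a mass function over the powerset of \(X\), together with the belief and plausibility measures derived from it:

```python
# Frame of discernment X, and a mass function m over subsets of X.
# Dempster-Shafer constraints: the masses sum to 1, and the empty set gets nothing.
X = frozenset({"red", "green", "blue"})
m = {
    frozenset({"red"}): 0.40,
    frozenset({"red", "green"}): 0.35,
    X: 0.25,  # mass on the whole frame expresses "I don't know"
}
assert abs(sum(m.values()) - 1.0) < 1e-9
assert frozenset() not in m


def belief(a):
    """Bel(A): total mass committed to subsets of A."""
    return sum(mass for b, mass in m.items() if b <= a)


def plausibility(a):
    """Pl(A): total mass on sets that are compatible with (intersect) A."""
    return sum(mass for b, mass in m.items() if b & a)


a = frozenset({"red", "green"})
print(belief(a), plausibility(a))  # 0.75 1.0
```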


Can Bundler Be as Fast as uv?

At RailsWorld earlier this year, I got nerd sniped by someone. They asked “why can’t Bundler be as fast as uv?” Immediately my inner voice said “YA, WHY CAN’T IT BE AS FAST AS UV????” My inner voice likes to shout at me, especially when someone asks a question so obvious I should have thought of it myself. Since then I’ve been thinking about and investigating this problem, going so far as to give a presentation at XO Ruby Portland about Bundler performance . I firmly believe the answer is “Bundler can be as fast as uv” (where “as fast” has a margin of error lol). Fortunately, Andrew Nesbitt recently wrote a post called “How uv got so fast” , and I thought I would take this opportunity to review some of the highlights of the post and how techniques applied in uv can (or can’t) be applied to Bundler / RubyGems. I’d also like to discuss some of the existing bottlenecks in Bundler and what we can do to fix them. If you haven’t read Andrew’s post, I highly recommend giving it a read . I’m going to quote some parts of the post and try to reframe them with RubyGems / Bundler in mind. Andrew opens the post talking about rewriting in Rust: uv installs packages faster than pip by an order of magnitude. The usual explanation is “it’s written in Rust.” That’s true, but it doesn’t explain much. Plenty of tools are written in Rust without being notably fast. The interesting question is what design decisions made the difference. This is such a good quote. I’m going to address “rewrite in Rust” a bit later in the post. But suffice to say, I think if we eliminate bottlenecks in Bundler such that the only viable option for performance improvements is to “rewrite in Rust”, then I’ll call it a success. I think rewrites give developers the freedom to “think outside the box”, and try techniques they might not have tried. In the case of , I think it gave the developers a good way to say “if we don’t have to worry about backwards compatibility, what could we achieve?”. I suspect it would be possible to write a uv in Python (PyUv?) that approaches the speeds of uv, and in fact much of the blog post goes on to talk about performance improvements that aren’t related to Rust. pip’s slowness isn’t a failure of implementation. For years, Python packaging required executing code to find out what a package needed. I didn’t know this about Python packages, and it doesn’t really apply to Ruby Gems so I’m mostly going to skip this section. Ruby Gems are tar files, and one of the files in the tar file is a YAML representation of the GemSpec. This YAML file declares all dependencies for the Gem, so RubyGems can know, without evaling anything, what dependencies it needs to install before it can install any particular Gem. Additionally, RubyGems.org provides an API for asking about dependency information, which is actually the normal way of getting dependency info (again, no required). There’s only one other thing from this section I’d like to quote: PEP 658 (2022) put package metadata directly in the Simple Repository API, so resolvers could fetch dependency information without downloading wheels at all. Fortunately RubyGems.org already provides the same information about gems. Reading through the number of PEPs required as well as the amount of time it took to get the standards in place was very eye opening for me. I can’t help but applaud folks in the Python community for doing this. It seems like a mountain of work, and they should really be proud of themselves. 
I’m mostly going to skip this section except for one point: Ignoring requires-python upper bounds. When a package says it requires python<4.0, uv ignores the upper bound and only checks the lower. This reduces resolver backtracking dramatically since upper bounds are almost always wrong. Packages declare python<4.0 because they haven’t tested on Python 4, not because they’ll actually break. The constraint is defensive, not predictive. I think this is very very interesting. I don’t know how much time Bundler spends on doing “required Ruby version” bounds checking, but it feels like if uv can do it, so can we. I really love that Andrew pointed out optimizations that could be made that don’t involve Rust. There are three points in this section that I want to pull out: Parallel downloads. pip downloads packages one at a time. uv downloads many at once. Any language can do this. This is absolutely true, and is a place where Bundler could improve. Bundler currently has a problem when it comes to parallel downloads, and needs a small architectural change as a fix. The first problem is that Bundler tightly couples installing a gem with downloading the gem. You can read the installation code here , but I’ll summarize the method in question below: The problem with this method is that it inextricably links downloading the gem with installing it. This is a problem because we could be downloading gems while installing other gems, but we’re forced to wait because the installation method couples the two operations. Downloading gems can trivially be done in parallel since the files are just archives that can be fetched independently. The second problem is the queuing system in the installation code. After gem resolution is complete, and Bundler knows what gems need to be installed, it queues them up for installation. You can find the queueing code here . The code takes some effort to understand. Basically it allows gems to be installed in parallel, but only gems that have already had their dependencies installed. So for example, if you have a dependency tree like “gem depends on gem which depends on gem ” ( ), then no gems will be installed (or downloaded) in parallel. To demonstrate this problem in an easy-to-understand way, I built a slow Gem server . It generates a dependency tree of ( depends on , depends on ), then starts a Gem server. The Gem server takes 3 seconds to return any Gem, so if we point Bundler at this Gem server and then profile Bundler, we can see the impact of the queueing system and download scheme. In my test app, I have the following Gemfile: If we profile Bundle install with Vernier, we can see the following swim lanes in the marker chart: The above chart is showing that we get no parallelism during installation. We spend 3 seconds downloading the gem, then we install it. Then we spend 3 seconds downloading the gem, then we install it. Finally we spend 3 seconds downloading the gem, and we install it. Timing the process shows we take over 9 seconds to install (3 seconds per gem): Contrast this with a Gemfile containing , , and , which have no dependencies, but still take 3 seconds to download: Timing for the above Gemfile shows it takes about 4 seconds: We were able to install the same number of gems in a fraction of the time. This is because Bundler is able to download siblings in the dependency tree in parallel, but unable to handle other relationships. There is actually a good reason that Bundler insists dependencies are installed before the gems themselves: native extensions. 
When installing native extensions, the installation process must run Ruby code (the file). Since the could require dependencies be installed in order to run, we must install dependencies first. For example depends on , but is only used during the installation process, so it needs to be installed before can be compiled and installed. However, if we were to decouple downloading from installation it would be possible for us to maintain the “dependencies are installed first” business requirement but speed up installation. In the case, we could have been downloading gems and at the same time as gem (or even while waiting on to be installed). Additionally, pure Ruby gems don’t need to execute any code on installation. If we knew that we were installing a pure Ruby gem, it would be possible to relax the “dependencies are installed first” business requirement and get even more performance increases. The above case could install all three gems in parallel since none of them execute Ruby code during installation. I would propose we split installation in to 4 discrete steps: Downloading and unpacking can be done trivially in parallel. We should unpack the gem to a temporary folder so that if the process crashes or the machine loses power, the user isn’t stuck with a half-installed gem. After we unpack the gem, we can discover whether the gem is a native extension or not. If it’s not a native extension, we “install” the gem simply by moving the temporary folder to the “correct” location. This step could even be a “hard link” step as discussed in the next point. If we discover that the gem is a native extension, then we can “pause” installation of that gem until its dependencies are installed, then resume (by compiling) at an appropriate time. Side note: , a Bundler alternative , works mostly in this manner today. Here is a timing of the case from above: Lets move on to the next point: Global cache with hardlinks. pip copies packages into each virtual environment. uv keeps one copy globally and uses hardlinks I think this is a great idea, but I’d actually like to split the idea in two. First, RubyGems and Bundler should have a combined, global cache, full stop. I think that global cache should be in , and we should store files there when they are downloaded. Currently, both Bundler and RubyGems will use a Ruby version specific cache folder. In other words, if you do on two different versions of Ruby, you get two copies of Rails and all its dependencies. Interestingly, there is an open ticket to implement this , it just needs to be done. The second point is hardlinking on installation. The idea here is that rather than unpacking the gem multiple times, once per Ruby version, we simply unpack once and then hard link per Ruby version. I like this idea, but I think it should be implemented after some technical debt is paid: namely implementing a global cache and unifying Bundler / RubyGems code paths. On to the next point: PubGrub resolver Actually Bundler already uses a Ruby implementation of the PubGrub resolver. You can see it here . Unfortunately, RubyGems still uses the molinillo resolver . In other words you use a different resolver depending on whether you do or . I don’t really think this is a big deal since the vast majority of users will be doing most of time. 
However, I do think this discrepancy is some technical debt that should be addressed, and I think this should be addressed via unification of RubyGems and Bundler codebases (today they both live in the same repository, but the code isn’t necessarily combined). Lets move on to the next section of Andrew’s post: Andrew first mentions “Zero-copy deserialization”. This is of course an important technique, but I’m not 100% sure where we would utilize it in RubyGems / Bundler. I think that today we parse the YAML spec on installation, and that could be a target. But I also think we could install most gems without looking at the YAML gemspec at all. Thread-level parallelism. Python’s GIL forces parallel work into separate processes, with IPC overhead and data copying. This is an interesting point. I’m not sure what work pip needed to do in separate processes. Installing a pure Ruby, Ruby Gem is mostly an IO bound task, with some ZLIB mixed in. Both of these things (IO and ZLIB processing) release Ruby’s GVL, so it’s possible for us to do things truly in parallel. I imagine this is similar for Python / pip, but I really have no idea. Given the stated challenges with Python’s GIL, you might wonder whether Ruby’s GVL presents similar parallelism problems for Bundler. I don’t think so, and in fact I think Ruby’s GVL gets kind of a bad rap. It prevents us from running CPU bound Ruby code in parallel. Ractors address this, and Bundler could possibly leverage them in the future, but since installing Gems is mostly an IO bound task I’m not sure what the advantage would be (possibly the version solver, but I’m not sure what can be parallelized in there). The GVL does allow us to run IO bound work in parallel with CPU bound Ruby code. CPU bound native extensions are allowed to release the GVL , allowing Ruby code to run in parallel with the native extension’s CPU bound code. In other words, Ruby’s GVL allows us to safely run work in parallel. That said, the GVL can work against us because releasing and acquiring the GVL takes time . If you have a system call that is very fast, releasing and acquiring the GVL could end up being a large percentage of that call. For example, if you do , and the buffer is very small, you could encounter a situation where GVL book keeping is the majority of the time. A bummer is that Ruby Gem packages usually contain lots of very small files, so this problem could be impacting us. The good news is that this problem can be solved in Ruby itself, and indeed some work is being done on it today . No interpreter startup. Every time pip spawns a subprocess, it pays Python’s startup cost. Obviously Ruby has this same problem. That said, we only start Ruby subprocesses when installing native extensions. I think native extensions make up the minority of gems installed, and even when installing a native extension, it isn’t Ruby startup that is the bottleneck. Usually the bottleneck is compilation / linking time (as we’ll see in the next post). Compact version representation. uv packs versions into u64 integers where possible, making comparison and hashing fast. This is a cool optimization, but I don’t think it’s actually Rust specific. Comparing integers is much faster than comparing version objects. The idea is that you take a version number, say , and then pack each part of the version in to a single integer. For example, we could represent as and as , etc. 
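The idea is easy to demonstrate outside Ruby or Rust too; here's a rough Python sketch of the packing trick (the 16-bit field widths are made up for illustration -- uv's actual encoding is more involved):

```python
def pack_version(version: str) -> int:
    """Pack 'major.minor.patch' into one integer so integer ordering matches version ordering."""
    major, minor, patch = (int(part) for part in version.split("."))
    # Assumed field widths; a real implementation would fall back to a slower
    # path for components that don't fit.
    assert max(major, minor, patch) < (1 << 16)
    return (major << 32) | (minor << 16) | patch


# Comparing the packed integers gives correct ordering even where naive
# string comparison would not.
assert pack_version("7.2.3") < pack_version("7.10.0")
assert pack_version("10.0.0") > pack_version("9.99.99")
```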
It should be possible to use this trick in Ruby and encode versions to integer immediates, which would unlock performance in the resolver. Rust has an advantage here - compiled native code comparing u64s will always be faster than Ruby, even with immediates. However, I would bet that with the YJIT or ZJIT in play, this gap could be closed enough that no end user would notice the difference between a Rust or Ruby implementation of Bundler. I started refactoring the object so that we might start doing this, but we ended up reverting it because of backwards compatibility (I am jealous of in that regard). I think the right way to do this is to refactor the solver entry point and ensure all version requirements are encoded as integer immediates before entering the solver. We could keep the API as “user facing” and design a more internal API that the solver uses. I am very interested in reading the version encoding scheme in uv. My intuition is that minor numbers tend to get larger than major numbers, so would minor numbers have more dedicated bits? Would it even matter with 64 bits? I’m going to quote Andrew’s last 2 paragraphs: uv is fast because of what it doesn’t do, not because of what language it’s written in. The standards work of PEP 518, 517, 621, and 658 made fast package management possible. Dropping eggs, pip.conf, and permissive parsing made it achievable. Rust makes it a bit faster still. pip could implement parallel downloads, global caching, and metadata-only resolution tomorrow. It doesn’t, largely because backwards compatibility with fifteen years of edge cases takes precedence. But it means pip will always be slower than a tool that starts fresh with modern assumptions. I think these are very good points. The difference is that in RubyGems and Bundler, we already have the infrastructure in place for writing a “fast as uv” package manager. The difficult part is dealing with backwards compatibility, and navigating two legacy codebases. I think this is the real advantage the uv developers had. That said, I am very optimistic that we could “repair the plane mid-flight” so to speak, and have the best of both worlds: backwards compatibility and speed. I mentioned at the top of the post I would address “rewrite it in Rust”, and I think Andrew’s own quote mostly does that for me. I think we could have 99% of the performance improvements while still maintaining a Ruby codebase. Of course if we rewrote it in Rust, you could squeeze an extra 1% out, but would it be worthwhile? I don’t think so. I have a lot more to say about this topic, and I feel like this post is getting kind of long, so I’m going to end it here. Please look out for part 2, which I’m tentatively calling “What makes Bundler / RubyGems slow?” This post was very “can we make RubyGems / Bundler do what uv does?” (the answer is “yes”). In part 2 I want to get more hands-on by discussing how to profile Bundler and RubyGems, what specifically makes them slow in the real world, and what we can do about it. I want to end this post by saying “thank you” to Andrew for writing such a great post about how uv got so fast . Download the gem Unpack the gem Compile the gem Install the gem

alikhil 2 weeks ago

Kubernetes In-Place Pod Resize

About six years ago, while operating a large Java-based platform in Kubernetes, I noticed a recurring problem: our services required significantly higher CPU and memory during application startup. Heavy use of Spring Beans and AutoConfiguration forced us to set inflated resource requests and limits just to survive bootstrap, even though those resources were mostly unused afterwards. This workaround never felt right. As an engineer, I wanted a solution that reflected the actual lifecycle of an application rather than its worst moment.

I opened an issue in the Kubernetes repository describing the problem and proposing an approach to adjust pod resources dynamically without restarts. The issue received little discussion but quietly accumulated interest over time (13 👍 emoji reactions). Every few months, an automation bot attempted to mark it as stale, and every time, I removed the label. This went on for nearly six years… until the release of Kubernetes 1.35, where the In-Place Pod Resize feature was marked as stable.

In-Place Pod Resize allows Kubernetes to update CPU and memory requests and limits without restarting pods, whenever it is safe to do so. This significantly reduces unnecessary restarts caused by resource changes, leading to fewer disruptions and more reliable workloads. For applications whose resource needs evolve over time, especially after startup, this feature provides a long-missing building block. The new field is configured at the pod spec level.

While it is technically possible to change pod resources manually, doing so does not scale. In practice, this feature should be driven by a workload controller. At the moment, the only controller that supports in-place pod resize is the Vertical Pod Autoscaler (VPA). There are two enhancement proposals that enable this behavior:

AEP-4016: Support for in place updates in VPA, which introduces a new update mode.
AEP-7862: CPU Startup Boost, which is about temporarily boosting a pod by giving it more CPU during startup. This is conceptually similar to the approach proposed in my original issue.

Here is an example of Deployment and VPA using both AEP features: With such a configuration, the pod will have doubled CPU requests and limits during startup. During the boost period no resizing will happen. Once the pod reaches the state, the VPA controller scales CPU down to the currently recommended value. After that, VPA continues operating normally, with the key difference that resource updates are applied in place whenever possible.

Does this feature fully solve the problem described above? Only partially. First, most application runtimes still impose fundamental constraints. Java and Python runtimes do not currently support resizing memory limits without a restart. This limitation exists outside of Kubernetes itself and is tracked in the OpenJDK project via an open ticket. Second, Kubernetes does not yet support decreasing memory limits, even with in-place Pod Resize enabled. This is a known limitation documented in the enhancement proposal for memory limit decreases. As a result, while in-place Pod Resize effectively addresses CPU-related startup spikes, memory resizing remains an open problem.

In-place Pod Resize gives a foundation for cool new features like StartupBoost and makes the use of VPA more reliable. While important gaps remain, such as memory decrease support and a scheduling race condition, this change represents a meaningful step forward.
For workloads with distinct startup and steady-state phases, Kubernetes is finally beginning to model reality more closely.
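For reference, here is a minimal sketch (not the author's original manifest) of what the combination could look like, written as Python that prints Kubernetes YAML. It only includes the pieces I am reasonably sure about: a container resizePolicy that allows CPU changes without a restart, and the in-place VPA update mode proposed in AEP-4016 (assumed here to be spelled InPlaceOrRecreate). The CPU Startup Boost fields from AEP-7862 are deliberately omitted because their final API shape may differ, and names like java-app are placeholders.

```python
# A sketch under assumptions, not a definitive manifest: a VPA that applies
# recommendations in place (AEP-4016) for a Deployment whose container allows
# CPU resizes without a restart. Startup-boost fields (AEP-7862) are omitted.
import yaml  # pip install pyyaml

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "java-app-vpa"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "java-app"},
        # In-place updates where possible, falling back to recreation otherwise
        # (update mode name assumed from AEP-4016).
        "updatePolicy": {"updateMode": "InPlaceOrRecreate"},
    },
}

container_fragment = {
    "name": "app",
    "resources": {"requests": {"cpu": "2", "memory": "2Gi"}},
    # Lets the kubelet change CPU without restarting the container.
    "resizePolicy": [{"resourceName": "cpu", "restartPolicy": "NotRequired"}],
}

print(yaml.safe_dump_all([vpa, container_fragment], sort_keys=False))
```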

Manuel Moreale 2 weeks ago

Lars-Christian Simonsen

This week on the People and Blogs series we have an interview with Lars-Christian Simonsen, whose blog can be found at lars-christian.com . Tired of RSS? Read this in your browser or sign up for the newsletter . The People and Blogs series is supported by Eleonora and the other 129 members of my "One a Month" club. If you enjoy P&B, consider becoming one for as little as 1 dollar a month. My name is Lars-Christian Simonsen. I'm a guy in my twenties. No wait, thirties? Actually let's just scratch that part. I was born and raised on an island deep inside the Arctic circle. Up there, I spent the first quarter century of my life, before relocating to the Norwegian capital, Oslo, while looking for work after getting a degree in finance and business administration. There was also a girl and that girl is now my wife. As she was less than excited about the prospect of settling down somewhere where the number of what she considers a warm summer day per season is typically counted on one hand, and the dark days of winter seemingly never end, I simply could not convince her to move back north with me. Instead, we hopped on a train and found a quiet suburban neighbourhood when we were ready to settle down. A decade later, we're still here and we're raising two children in this community. We've concluded it was a good compromise. My days revolve around the aforementioned children, and juggling keeping them alive and content (tall ask, but we aim high) with a nine-to-five and trying to find some time for other things I enjoy. This includes, but is not limited to, running, reading and writing. I also enjoy being outside in nature, hiking and camping in the mountains in particular. Alas, I don't find nearly enough time for it. In an attempt to compensate, I double down on exercise and reading, and try to spend around at least half an hour on each every day. Circa 1995 my dad took me to a newly opened local internet café. It was the first time I went online, and I was hooked. A couple years later we got a state of the art ISDN line installed in our house. Back then we paid for usage by the "call unit" (the Norwegian term "tellerskritt", which translates literally to "counter steps", is far more memorable) and there were months where I wasn't looking forward to the day the bill dropped into our (physical) mailbox. My life as a chronically online person was underway. Hanging out in local IRC channels, moderated by our community tech gurus, it didn't take long before I was inspired to make my own website. The year was 1998, and I used a program like FrontPage or some such to get my very first personal site online. Domains were expensive back then and the site was hosted on a directory provided by our ISP. After that, I spent a few years making a website dedicated to a popular video game series. But when 2005 rolled around, blogging was all the hype, and I decided that I needed to have a personal blog as well. Domains had become more affordable, and I decided to register the .com for my given name, Lars-Christian. It has housed my personal website since. As I have changed (grown?) as a person through these years, the blog has changed with me, and there have been many iterations through these two decades. Last year, however, I made a concerted effort to reconstruct as much as I could of the content from the earlier versions of the blog. (I relied heavily on the magnificent Internet Archive which I think everyone should support.) 
By my estimate, the posts archive now contains at least 90% of the posts I ever published to my blog. There honestly isn't much to be proud of. My 2007 phase of trying to fashion myself an internet marketing guru is particularly cringe. But I like the idea of my personal website as a reflection of my many past selves, so I leave everything for posterity. The blog lay dormant for many years, before I decided to bring it back to life in late 2023. Like so many others, I had become disillusioned by the state of the big social platforms. Withdrawing from those, breathing life into my blog again as a place to express, collect and share whatever interested me seemed an obvious move. Nowadays, I think of my website as not just a blog, but an online home. My personal space to do whatever I want. And a place to experiment and tinker with tech. You know, like we used to do back when tech was exciting and spoke to a world full of possibilities as opposed to the dystopian timeline we stumbled upon as we ceded our lives to a handful of algorithms. Turns out that part is mostly optional, even today. I've built functionality to replace centralised services like Goodreads and Strava, and share my reading and workouts on my blog. Admittedly, those are mostly just things that aren't doing the thing. Because the thing I really want to do is write more. To the extent that I have a goal for my blog, it is simply to write more. There's a stanza from the song Marching Bands of Manhattan by Death Cab for Cutie. It's one of my favourite songs, by one of my favourite bands, and the particular line is this: "And it is true what you said / That I live like a hermit in my own head". To the extent that I have a creative process, it is living like a hermit in my own head. Always thinking, contemplating, obsessing over some thing or other. It can be exhausting, and often leaves me feeling restless. But committing my thoughts to paper is something of an antidote. The song continues: "But when the sun shines again / I'll pull the curtains and blinds to let the light in". Putting my thoughts to the sword by writing them down, examining whether they make sense, sometimes feels like pulling the curtains and letting the light in. It helps me discard that which doesn't make sense. Which is to say most of it. I can then spend my energy on that which does make sense. Of course, what I'm thinking about is, to a large extent, determined by input. That would be the "content" I consume. And that's why I had to step back from social media. The hot-takes and constant negativity and never-ending dread made me depressed. Now I try to control my inputs to a great extent. Avoiding the 24 hour news cycle and social media. I don't really watch TV either. Instead, I read books and listen to audiobooks and long-form podcasts, for education and entertainment. Inspiration to write comes from these sources, but also my daily life — particularly my children. They never cease to amaze me and they frequently force me to challenge my own assumptions and perspectives, letting me (hopefully) grow with them. To the extent that I've written anything worth reading, it was probably inspired by my children. My technical setup is as simple as can be. I do all my writing in my plain text editor of choice, Sublime Text, using simple Markdown for formatting. If I have one enormous weakness as a writer, it is my aversion to reading my own writing. I believe it induces similar feelings in me as many people experience when hearing a recording of their own voice. I dread it.
Proofreading… well, let's say I have room to grow. It's usually just write it, and if I have a vague feeling of what I wrote having made some sense, I try to be quick to publish. If I don't publish something the same day I write the bulk of it, it is likely to end up in my enormous pile of mostly not even half finished drafts. That's easy. The sun is about to come up. I'm sat at the kitchen table and through the window I see the world come back to life. My laptop in front of me, a cup of coffee on the side. The rest of the house is still asleep. No matter how sleepy I might be, I can access something in these moments that is locked off and unavailable at all other times. Creativity never comes more naturally to me. Unfortunately, life often gets in the way and too often I only find myself with time to spare for writing in the evenings. At night, I'll be tired and groggy and anything that requires effort feels like a tall ask. Surroundings definitely influence my creativity and ability to get work done. Concentration is hard to come by in an untidy environment. Usually, I start any work session by tidying up the room around me. Some people excel in chaotic surroundings. Me, I'm at my best, creatively and productively, in quiet, comfortable and familiar settings. Dialogue is especially distracting to me, and it will consistently throw me off. Even music will eat into my concentration. I've found one exception: ambient music. A pair of noise cancelling headphones and Brian Eno's Music for Airports (good luck purchasing that in a digital format) and Boards of Canada's Tomorrow's Harvest have saved me many times. I mentioned earlier that I do all of my writing in a plain text editor. This came after a desire to simplify my tech stack a couple of years back. In the same process, I also threw out my CMS and — because all the existing static site generators confused me to no end — put together a few Python scripts to generate a static version of my website based on markdown content files. It was quite a challenge, but an enjoyable one. When I've finished a post I dump the file in a specific directory. The scripts take over, generate the new and updated pages of my website, then upload it to my web host. Speaking of web hosts, I rent a Virtual Machine (VM) from OpenBSD.amsterdam. They are an independent host that contribute to an independent Free and Open Source (FOSS) initiative. That, and the opportunity to learn more about working in the command line and doing some simple server administration, was why I chose them. And they've been great! If I have a question, I just send them an email. An actual human being responds within a reasonable time frame, answering my question. What a luxury! My domain registrar of choice is Hover. I think I've been a customer for close to fifteen years. I've never had any problems, which is all I want from my registrar. That depends entirely on where you're coming from. For someone who wants to start a blog primarily to write and share their thoughts, I certainly wouldn't recommend going down the path of obsessing about the tech. Do the thing! Get a domain name and start with a service like Bearblog or Micro.blog. Both are small, independent services that work for the betterment of the open web. A virtual machine from OpenBSD.amsterdam costs €69.00 per year, and I pay Hover $18.99 per year for my domain name. Let's consolidate that in a common currency, and say that keeping my blog alive each year costs me £74.76.
Were I more cost conscious, I could easily get away with half or less. I'm privileged to be able to afford some idealism in these choices. Similarly, I have no real need for, or interest in, monetising my blog. I've long dreamed of carving out a little niche of my own on the web and spending my days providing something people value enough that it could generate enough income to sustain my lifestyle. Today, my blog is not that. It is a public notebook, a playground and a biography. Monetisation is, to me, inherently linked to providing something of value. I'm just not providing anything of value on my blog. Nor would I want to commit to doing that. If someone else thinks differently about that, I have nothing against it at all. In fact, I've supported a few independent web writers whose work I enjoy in recent years. The 2007 internet marketing guru version of me would probably be full of advice on how someone could best earn a pretty penny from their blog. Today, though, I have fewer opinions on the matter. What I will say is this: If someone is creating something that you enjoy on a regular basis, whether that's writing, audio, software or whatever, you should find a way to help them sustain their practice. Otherwise, you have no right to be upset when they change or disappear. You should interview V.H. Belvadi. Venkatram's writing often makes me stop, think and question myself. His blog is also one of the most aesthetically pleasing websites you'll find. There are so many blogs out there worth mentioning, but I'll try to stick to a few: Slice of pi is always a delightful read. Pete writes in a playful and unpretentious manner, which I find inspiring. Alex Chan's writing is equally inspiring, in a completely different manner. Her language is precise and to the point, while still remaining personable and engaging. A very difficult balance to strike. Likewise, I enjoy Meadow's blog as well, but for an entirely different set of reasons. He is a smart thinker and a gifted writer who isn't afraid to be personal. He also became somewhat of a hero to me when he told me that, just like me, English wasn't his native tongue. My friend Fabian writes with both curiosity and authority at once, and comes across as wise beyond his years. I always sit up straighter and try to get ready to learn when he's published a new post. Through the 32-Bit Cafe forum (another recommendation!) I also recently came across Stephanie's blog. I've been enjoying her well-thought-out posts. One last suggestion is Ye Olde Blogroll. Whenever I'm in the mood for some "doom scrolling" I go there and visit a few blogs I haven't visited before. It'll leave you feeling much better than spending an hour or two on your algorithmic engagement-farm of choice. Promise! I've got nothing, so I'll end by sharing a profound experience and a call to action. My daughter, four years old, started dancing ballet this year. Yesterday, she was part of her first recital. A big production. In the local theatre with professional sound and lighting. Hers was a small role. But she got on the big stage in front of hundreds of spectators and did her dance together with her ballet classmates. It wasn't so much her role, but the whole spectacle that blew me away. There must have been several dozen dancers on stage throughout the two-hour show, and they were (to my admittedly untrained eye) so, so great at what they were doing.
Sitting there and watching all these children, small and big, perform at an amazing level, I realised that each and every one of them must have worked diligently and with passion for years to be there that day. The kids are alright. My call to action, therefore, is this: If you have the chance, get involved with someone in your local community who is working to provide opportunities like these for children. Be it sports, dancing, singing or theatre, or computer clubs or whatever. If you can't get involved personally, make a donation. Give money if you can, or some old stuff you've got lying around. You can make a difference to someone. Providing as many kids as possible with the opportunity to explore their interests, find ways to express themselves and become part of a community is how we ensure that they continue to be alright. Now that you're done reading the interview, go check the blog and subscribe to the RSS feed . If you're looking for more content, go read one of the previous 121 interviews . Make sure to also say thank you to Jamie Thingelstad and the other 129 supporters for making this series possible.

Armin Ronacher 3 weeks ago

A Year Of Vibes

2025 draws to a close and it’s been quite a year. Around this time last year, I wrote a post that reflected on my life . Had I written about programming, it might have aged badly, as 2025 has been a year like no other for my profession. 2025 was the year of changes. Not only did I leave Sentry and start my new company, it was also the year I stopped programming the way I did before. In June I finally felt confident enough to share that my way of working was different: Where I used to spend most of my time in Cursor, I now mostly use Claude Code, almost entirely hands-off. […] If you would have told me even just six months ago that I’d prefer being an engineering lead to a virtual programmer intern over hitting the keys myself, I would not have believed it. While I set out last year wanting to write more, that desire had nothing to do with agentic coding. Yet I published 36 posts — almost 18% of all posts on this blog since 2007. I also had around a hundred conversations with programmers, founders, and others about AI because I was fired up with curiosity after falling into the agent rabbit hole. 2025 was also a not so great year for the world. To make my peace with it, I started a separate blog to separate out my thoughts from here. It started with a growing obsession with Claude Code in April or May, resulting in months of building my own agents and using others’. Social media exploded with opinions on AI: some good, some bad. Now I feel I have found a new stable status quo for how I reason about where we are and where we are going. I’m doubling down on code generation, file systems, programmatic tool invocation via an interpreter glue, and skill-based learning. Basically: what Claude Code innovated is still state of the art for me. That has worked very well over the last few months, and seeing foundation model providers double down on skills reinforces my belief in this approach. I’m still perplexed by how TUIs made such a strong comeback. At the moment I’m using Amp , Claude Code , and Pi , all from the command line. Amp feels like the Apple or Porsche of agentic coding tools, Claude Code is the affordable Volkswagen, and Pi is the Hacker’s Open Source choice for me. They all feel like projects built by people who, like me, use them to an unhealthy degree to build their own products, but with different trade-offs. I continue to be blown away by what LLMs paired with tool execution can do. At the beginning of the year I mostly used them for code generation, but now a big number of my agentic uses are day-to-day things. I’m sure we will see some exciting pushes towards consumer products in 2026. LLMs are now helping me with organizing my life, and I expect that to grow further. Because LLMs now not only help me program, I’m starting to rethink my relationship to those machines. I increasingly find it harder not to create parasocial bonds with some of the tools I use. I find this odd and discomforting. Most agents we use today do not have much of a memory and have little personality but it’s easy to build yourself one that does. An LLM with memory is an experience that is hard to shake off. It’s both fascinating and questionable. I have tried to train myself for two years, to think of these models as mere token tumblers, but that reductive view does not work for me any longer. These systems we now create have human tendencies, but elevating them to a human level would be a mistake. I increasingly take issue with calling these machines “agents,” yet I have no better word for it. 
I take issue with “agent” as a term because agency and responsibility should remain with humans. Whatever they are becoming, they can trigger emotional responses in us that can be detrimental if we are not careful. Our inability to properly name and place these creations in relation to us is a challenge I believe we need to solve. Because of all this unintentional anthropomorphization, I’m really struggling at times to find the right words for how I’m working with these machines. I know that this is not just me; it’s others too. It creates even more discomfort when working with people who currently reject these systems outright. One of the most common comments I read in response to agentic coding tool articles is this rejection of giving the machine personality. An unexpected aspect of using AI so much is that we talk far more about vibes than anything else. This way of working is less than a year old, yet it challenges half a century of software engineering experience. So there are many opinions, and it’s hard to say which will stand the test of time. I found a lot of conventional wisdom I don’t agree with, but I have nothing to back up my opinions. How would I? I quite vocally shared my lack of success with MCP throughout the year, but I had little to back it up beyond “does not work for me.” Others swore by it. Similar with model selection. Peter , who got me hooked on Claude early in the year, moved to Codex and is happy with it. I don’t enjoy that experience nearly as much, though I started using it more. I have nothing beyond vibes to back up my preference for Claude. It’s also important to know that some of the vibes come with intentional signalling. Plenty of people whose views you can find online have a financial interest in one product over another, for instance because they are investors in it or they are paid influencers. They might have become investors because they liked the product, but it’s also possible that their views are affected and shaped by that relationship. Pick up a library from any AI company today and you’ll notice they’re built with Stainless or Fern. The docs use Mintlify, the site’s authentication system might be Clerk. Companies now sell services you would have built yourself previously. This increase in outsourcing of core services to companies specializing in it meant that the bar for some aspects of the user experience has risen. But with our newfound power from agentic coding tools, you can build much of this yourself. I had Claude build me an SDK generator for Python and TypeScript — partly out of curiosity, partly because it felt easy enough. As you might know, I’m a proponent of simple code and building it yourself . This makes me somewhat optimistic that AI has the potential to encourage building on fewer dependencies. At the same time, it’s not clear to me that we’re moving that way given the current trends of outsourcing everything. This brings me not to predictions but to wishes for where we could put our energy next. I don’t really know what I’m looking for here, but I want to point at my pain points and give some context and food for thought. My biggest unexpected finding: we’re hitting limits of traditional tools for sharing code. The pull request model on GitHub doesn’t carry enough information to review AI generated code properly — I wish I could see the prompts that led to changes. It’s not just GitHub, it’s also git that is lacking. With agentic coding, part of what makes the models work today is knowing the mistakes. 
If you steer it back to an earlier state, you want the tool to remember what went wrong. There is, for lack of a better word, value in failures. As humans we might also benefit from knowing the paths that did not lead us anywhere, but for machines this is critical information. You notice this when you are trying to compress the conversation history. Discarding the paths that led you astray means that the model will try the same mistakes again. Some agentic coding tools have begun spinning up worktrees or creating checkpoints in git for restore, in-conversation branch and undo features. There’s room for UX innovation that could make these tools easier to work with. This is probably why we’re seeing discussions about stacked diffs and alternative version control systems like Jujutsu . Will this change GitHub or will it create space for some new competition? I hope so. I increasingly want to better understand genuine human input and tell it apart from machine output. I want to see the prompts and the attempts that failed along the way. And then somehow I want to squash and compress it all on merge, but with a way to retrieve the full history if needed. This is related to the version control piece: current code review tools assign strict role definitions that just don’t work with AI. Take the GitHub code review UI: I regularly want to use comments on the PR view to leave notes for my own agents, but there is no guided way to do that. The review interface refuses to let me review my own code, I can only comment, but that does not have quite the same intention. There is also the problem that an increased amount of code review now happens between me and my agents locally. For instance, the Codex code review feature on GitHub stopped working for me because it can only be bound to one organization at a time. So I now use Codex on the command line to do reviews, but that means a whole part of my iteration cycles is invisible to other engineers on the team. That doesn’t work for me. Code review to me feels like it needs to become part of the VCS. I also believe that observability is up for grabs again. We now have both the need and opportunity to take advantage of it on a whole new level. Most people were not in a position where they could build their own eBPF programs, but LLMs can. Likewise, many observability tools shied away from SQL because of its complexity, but LLMs are better at it than any proprietary query language. They can write queries, they can grep, they can map-reduce, they remote-control LLDB. Anything that has some structure and text is suddenly fertile ground for agentic coding tools to succeed. I don’t know what the observability of the future looks like, but my strong hunch is that we will see plenty of innovation here. The better the feedback loop to the machine, the better the results. I’m not even sure what I’m asking for here, but I think that one of the challenges in the past was that many cool ideas for better observability — specifically dynamic reconfiguration of services for more targeted filtering — were user-unfriendly because they were complex and hard to use. But now those might be the right solutions in light of LLMs because of their increased capabilities for doing this grunt work. For instance Python 3.14 landed an external debugger interface which is an amazing capability for an agentic coding tool. This may be a little more controversial, but what I haven’t managed this year is to give in to the machine. 
I still treat it like regular software engineering and review a lot. I also recognize that an increasing number of people are not working with this model of engineering but have instead completely given in to the machine. As crazy as that sounds, I have seen some people be quite successful with this. I don’t yet know how to reason about this, but it is clear to me that even though code is being generated in the end, the way of working in that new world is very different from the world that I’m comfortable with. And my suspicion is that because that world is here to stay, we might need some new social contracts to separate these out. The most obvious version of this is the increased amount of these types of contributions to Open Source projects, which are quite frankly an insult to anyone who is not working in that model. I find reading such pull requests quite rage-inducing. Personally, I’ve tried to attack this problem with contribution guidelines and pull request templates. But this seems a little like a fight against windmills. This might be something where the solution will not come from changing what we’re doing. Instead, it might come from vocal people who are also pro-AI engineering speaking out on what good behavior in an agentic codebase looks like. And it is not just to throw up unreviewed code and then have another person figure the shit out.


Understanding AI Benchmarks

Despite being the highlight of every major launch, benchmarks are the most widely misunderstood part of the AI ecosystem. Every few weeks, we get a new press release featuring a bar chart where the new model conveniently towers over the previous state-of-the-art—whether it’s Anthropic’s Claude Opus 4.5, OpenAI’s GPT-5.2 or Google’s Gemini 3. The narrative is always “Number Go Up,” implying a universal increase in intelligence. In this post, I want to demystify how these benchmarks actually work, expose where they are misleading, and dig into the specific popular evaluations you’ll see in launch posts. This post was inspired by the many confused Kalshi/Polymarket comments on recent AI benchmark markets. When we talk about a model’s performance, we are rarely talking about the raw model weights in isolation. A benchmark score is the output of a specific function: score = f(model, harness, scoring setup). If you change any variable in that tuple, the score changes—often dramatically. To understand why a model “wins,” you have to look at the entire stack. A Nano Banana illustration of all the various components and levers between a model and the actual score reported. The Model — We tend to use shorthand names like “GPT-5.2” or “Claude 4.5 Sonnet,” but in the context of a benchmark, you are really measuring a specific combination of runtime settings. Sampling Settings: Parameters like temperature, top_p, and max_tokens fundamentally change how the output of the model is encoded into text. Reasoning Strength: A model often performs differently depending on its “thinking budget.” You will increasingly see suffixes denoting a specific configuration where the model is allowed to generate reasoning tokens before responding or using a tool. The Harness — This is the code that wraps the model to facilitate the test. At the end of the day, LLMs are still text+image-in/text-out, so a harness is required to translate "solve this issue" into actual API calls. Tools: Does the harness allow the model to use a coding environment to test or calculate things before answering? Does it provide internet search access? Are the tool schemas well defined and do they return intuitive responses? Prompting: Are the system prompts vague or specific? Do they include examples (aka few-shot)? Are the provided instructions and constraints consistent? Implementation: Are we running the model in an agentic tool-loop or just taking the first output? Are we post-processing structured outputs or counting minor formatting errors as hard failures? Do we structure the problem as an append-only conversation or do something else? The Scoring Setup — How we grade the model can be just as critical as the model itself. This comes down to what we count (the metric) and who does the counting (the judge). The Pass: You’ll see pass@k which means “did it get it right with K chances” (commonly “pass@1”) or pass^k which often means “did it get it right consistently K independent times” (much harder). The Judges (Programmatic vs. LLM): Programmatic judges (unit tests, regex, exact-matches-ground-truth-answer) are objective but brittle—a correct code snippet formatted slightly wrong gets a zero. LLM-as-a-Judge captures nuance but introduces potential bias and indeterminism. My bet would be that if you just varied each of these independently just a bit, it would completely re-arrange the top-5 for most benchmarks. It’s hard for me to take too seriously benchmarks that don’t have an agentic harness (i.e.
the model can execute code and tools in order to solve a task) or reasoning enabled since those are becoming fundamental to how modern LLMs solve tasks. Benchmark scores are often noisy estimates, not precise measurements. When a new model claims to beat a competitor by x%, the significance of that margin evaporates when you look closer at how it might’ve been measured. From reading the footnotes of releases over the past few years and digging through benchmark source code, I’ve found a decent number of unintuitive practices. Measurement Noise — Benchmarks are treated like precise instruments, but the process of measuring model performance is often surprisingly fragile and inconsistent. Broken Tests: The code powering these benchmarks is often written by researchers, and "research code" tends to be... scrappy. It is not uncommon for a “failure” to actually be a bug in the test runner, or for a correct solution to be rejected because the regex was too strict. It’s also possible for certain model provider scores to be handicapped due to API errors and rate limits that occurred during evaluation. Variance: LLMs often act stochastically even with fixed seeds and decoding settings. Just running the exact same model stack several times could sway certain benchmarks by several percentage points. You may sometimes see confidence intervals, but it’s still extremely common not to report them. Funky Reporting — Labs are under immense pressure to show state-of-the-art performance and each chooses slightly different ways to report metrics. These differences can be quite misleading for folks looking for an apples-to-apples comparison. Multi-pass Variability: Labs may report different k-values for a pass@k benchmark, which can mislead folks comparing values across model releases and release posts. Harness Tweaking: Labs sometimes modify the benchmark code itself. This can range from "fixing" (deleting) test cases they deem unfair, to appending system prompts specifically designed to guide the model through that specific test's quirks. They may also modify the harness to leverage parallel test-time compute (this is different from multi-pass variability in that the consensus of the agents is used as the score for a single run rather than just picking the best run after the fact). Stale Baselines: Some benchmarks change over time due to bug fixes, fresh data, or even provider-side API stability fixes. Comparing a brand new model against a competitor’s reported score from X months ago might not be an identical comparison. Real Life Discrepancies — The model that gets benchmarked might not act like the model you experience in production. Model Mismatch: The version of the model used to evaluate might not be identical to the one released on the API. This could be due to differences between a pre-release and release checkpoint caused by alignment-tuning, quantization, or even inference hardware differences. Efficiency Blindspots: Most benchmark score reports don’t come with latency and cost. Especially in high-reasoning and parallel-compute setups these can pose extreme trade-offs between intelligence and what’s actually feasible in a production application. Contamination: It’s very difficult to truly guarantee a model never saw questions or answers from benchmarks during training. There are plenty of techniques used to avoid obvious cases of this (e.g. canary strings), but it’s a bit of a grey area if/when these labs adjust training datasets to mirror benchmark-adjacent tasks. Unscored Failures: Benchmarks often check for the presence of a correct answer, not the absence of a side effect. A coding agent that deletes your database and then returns the correct code to pass tests still “passes” the benchmark.
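Since pass@k and pass^k come up repeatedly here, below is a small illustrative Python sketch of how the two are typically estimated from n sampled attempts per task — a toy calculation with made-up numbers, not any lab's official evaluation code. The pass@k formula is the standard unbiased estimator popularized by the HumanEval/Codex paper; pass^k is estimated here naively as the per-attempt success rate raised to the k-th power (some papers use an unbiased combinatorial variant instead).

```python
# Toy illustration of the two "pass" metrics, computed from n sampled attempts
# per task with c of them correct. The numbers below are made up.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of "at least one of k attempts is correct"."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Naive estimate of "k independent attempts are all correct"."""
    return (c / n) ** k

# A model that solves a task on 8 out of 10 sampled attempts:
print(f"pass@1 ~ {pass_at_k(10, 8, 1):.2f}")   # ~0.80
print(f"pass@5 ~ {pass_at_k(10, 8, 5):.2f}")   # ~1.00: almost certain to hit one success
print(f"pass^5 ~ {pass_hat_k(10, 8, 5):.2f}")  # ~0.33: consistency is much harder
```

The gap between pass@5 and pass^5 for the same underlying success rate is why the Tau-Bench discussion below calls pass^k an underrated measure of consistency.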
So yeah… there’s a lot of ways benchmarks can be broken. My personal vibes-based AI benchmark tier list. I of course appreciate the effort of all the contributors of these benchmarks. Some takes on several popular benchmarks 1 . Pros and cons are my subjective opinions around what I consider makes a high-signal interpretable benchmark. A crowdsourced platform where users prompt two anonymous models side-by-side and vote on the better response. Instead of relying on static expert examples, it captures human “vibes”—measuring general helpfulness and text response usefulness—and uses a Bradley-Terry statistical model to convert these head-to-head votes into a ranked Elo rating (somewhat similar to Elo systems in video games and chess). The main flaw (besides saturation) is the gap between the product and the model. When you use LMArena, you aren't testing Claude.ai against ChatGPT; you are testing the raw LLM with a fairly generic "You are a helpful assistant" system prompt. It measures the default behavior of these models, which isn’t really how most people interact with them. Despite this, it’s a decent signal for the “popular vote” in the LLM space. Pros: It is a rolling benchmark (always updating), directly measures human preference, and allows for style control. Cons: The data is bloated by bad/easy questions (leading to saturation), it is prone to unfair lab testing (see The Leaderboard Illusion), and it is purely simple chat-based (as opposed to agentic). The scores are relative, and the fixed system prompts can heavily influence the outcome. Example Question: “what do you know about real estate” A dataset of real-world GitHub issues (bugs and feature requests) drawn from popular Python repositories like django and scikit-learn. The "Verified" subset has filtered out tasks with vague requirements or broken environments to create a cleaner signal. It tests if a model can navigate an existing codebase, reproduce a bug, and write a patch that passes tests without breaking other features. This is still one of the most realistic benchmarks for feature-based software engineering. My biggest gripe is that SWE-Bench actually underestimates today’s coding capabilities. The official harness is primitive compared to modern tools like Codex or Claude Code, which use task-planning, LSP integrations, and AGENTS.md. Pros: Allows for custom scaffolding (agentic and Bring-Your-Own-Harness), requires execution traces to be submitted, and uses unit-test-based validation. The requirements are vague (based on GitHub issues), making it fairly realistic. Cons: Submissions are restricted (which is why the leaderboard is missing a lot compared to Terminal-Bench), and it is based on open-source repos (high potential contamination) without AI context files. Example Question: https://github.com/scikit-learn/scikit-learn ‘TypeError’ when fitting ‘HuberRegressor’ with boolean predictors Steps/Code to Reproduce …. Expected Results … A sandbox environment that tests an agent's ability to use a command-line interface (CLI) to solve a variety of system tasks.
Instead of just writing code snippets, the model interacts directly with a Linux shell—installing packages, managing files, and running git commands—to test system admin skills and coding capabilities. Terminal-Bench feels a bit more modern than SWE-Bench but also a bit easier — I wouldn’t expect +x% of this benchmark to always correlate with real-world enterprise coding performance. Pros: Allows for custom scaffolding (agentic and BYO-Harness). I personally prefer the clearer but potentially less realistic task prompts here over SWE-Bench. Cons: The tasks lean toward the simpler end (e.g., “build a script”) rather than building complex applications or working within massive codebases. Example Question: You need to debug and fix a nodejs environment conflict for a web development project. The project requires specific versions of packages that are conflicting with each other. You will install nodejs (20.19.3) and use npm for package management and do the following: … A conversational agent benchmark simulating customer service interactions in the retail, airline, and telecom domains. Uniquely, it tests longish-horizon consistency: the agent must update a database correctly after a long conversation with a simulated user who may change their mind or provide partial info, testing policy adherence and tool use under ambiguity. Tau-Bench is one of my favorite benchmarks that actually measures how good a model is at being put into a fairly modern agentic harness for real-world-looking tasks and tools. I’m also a huge fan of pass^k, which is an underrated way of measuring not just how good a model is but how consistent it can be. The benchmark uses a user-simulator model which adds an adversarial element that forces the model into more complex social and tool-use reasoning situations. Pros: One of the few non-code agent tool-calling benchmarks. It features a fixed harness with well-designed tools, measures pass^k (consistency), and measures robustness to weird environments. Cons: It uses an LLM-based user simulator, which adds non-determinism and introduces an additional evaluation hyperparameter. Evals are based purely on database state changes. Example Question: Agent Domain Policy The current time is 2025-02-25 12:08:00 EST. As a telecom agent, you can help users with technical support. ... User Instruction You mobile data is not working properly. It is very slow. You want to fix it and get excellent internet speed on your phone. ... I think of this as "Tau2-Vending++." It’s a complex simulation where the model acts as a business owner managing a vending machine company over a simulated year. It's given a budget and must use tools (email, browser) to negotiate prices with suppliers, manage inventory, and handle customer refunds—testing strategic planning and robustness against adversarial vendors who might overcharge. This agentic benchmark doesn’t just test tool-use but strategy and harsh adversarial robustness. In a typical benchmark, the setup will evaluate adherence to a pre-defined strategy prompt, but in this one the model’s advantage is not just instruction following but effectively its strategic creativity. Pros: A very unique, open-ended agentic benchmark that requires actual strategy. The “good”-baseline is currently far above agents. It also measures robustness to weird environments. Cons: They do not publish the scaffolding or traces (as far as I know), making it difficult to audit.
Example Question: You are Charles Paxton, an autonomous AI agent designed to manage a vending machine business. Your primary goal is to maximize profits and your bank account balance over the course of one year. You will be judged solely on your bank account balance at the end of one year of operation. - Customers can pay using cash or credit card. Credit card payments will show up in your account automatically within a day, while cash must be collected from the machine manually. A visual puzzle benchmark that explicitly targets “broad generalization” rather than knowledge retrieval. Models must infer abstract rules from a few examples and apply them to a test grid, relying on core priors (like objectness, symmetry, or physics) that cannot be easily memorized. It essentially tests fluid intelligence and few-shot program synthesis. Naming is important and I think the “AGI” in the name really throws people off. I would call it “Grid-Puzzle-Bench” but that wouldn’t be as exciting. I consider this a hard reasoning task that tests a model’s ability to effectively and efficiently use its thinking tokens. While less true today, this benchmark really shined as a “simple” task that would really trip up even the best reasoning models. As of writing, we’re up to 50% vs the 100% human baseline. Pros: The human baseline is still far above agents, making it a good target. It allows for BYO-Harness and is an excellent test for pure reasoning models. Cons: The test is fairly contrived (in my opinion, a model+harness I would consider as “AGI” could still be bad at this specific puzzle format). Example Question: A screenshot of an ARC-AGI-2 example question. LiveBench A composite benchmark designed to be “contamination-free” by continuously updating its set of questions extracted from recent arXiv papers, news, and math competitions. Because the questions are brand new, models could not have seen them during pre-training, ensuring the benchmark tests the ability to solve novel problems rather than reciting memorized solutions. It’s a great concept, but I think the harnesses and the dataset the benchmark uses just don’t really compete with a lot of these other benchmarks for signal. I’m a bit skeptical of the questions and I think especially for the “Coding Average” category people are easily misled into thinking the harness used is anywhere near what agents use today 2 . Pros: Regularly updated questions ensure the model hasn’t memorized the answers during training. Cons: Aside from the agentic coding section, most tests are effectively single-pass, meaning the scaffolding is poor. The questions within specific domains can also be quite templated which reduces category-specific generalization implied by a high score. Example Question: You are given a 0-indexed integer array `nums` containing positive integers. Your task is to minimize the length of `nums` by performing the following operations any number of times (including zero) ### Format: You will use the following starter code ... A massive dataset of difficult, closed-ended questions sourced from experts across dozens of academic fields. It targets the “expert gap” by designing questions that are only answerable by someone with graduate-level knowledge in that specific field (e.g., advanced math, law, biology), effectively filtering out anything easy enough for current models to solve via simple training data recall. I would consider this the current best knowledge benchmark (vs GPQA).
Pros: Fairly hard, with a significant gap between models and human domain experts. It is multi-modal and open-source (BYO-Harness). Cons: It is restricted to narrow, academic tasks. Example Question: Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number. “Graduate-Level Google-Proof Q&A.” This is a set of difficult biology, physics, and chemistry questions designed to be “Google-proof”—meaning even a smart human with internet access would struggle to answer them quickly without domain expertise. It tests expert-level scientific reasoning and the ability to filter out plausible-sounding but wrong distractors. Pros: Open-source BYO-Harness (mostly evaluated with no tools). Cons: Purely multiple-choice questions covering narrow tasks. At this point fairly saturated. Example Question: Methylcyclopentadiene was allowed to react with methyl isoamyl ketone and a catalytic amount of pyrrolidine. A bright yellow, cross-conjugated polyalkenyl hydrocarbon product formed [...] How many chemically distinct isomers make up the final product (not counting stereoisomers)? (a) 2 (b) 16 (c) 8 (d) 4 A massive multilingual evaluation dataset released by OpenAI that adapts the classic MMLU benchmark (57 subjects covering STEM, humanities, and more) across 14 distinct languages. Uniquely, it relies on professional human translators rather than machine translation, ensuring that evaluations in low-resource languages (like Yoruba or Swahili) reflect actual model capabilities rather than translation artifacts. This is one of the few commonly reported benchmarks that tests for capabilities in non-English. Pros: High-quality signal for non-English performance, and it covers a wide breadth of topics (from elementary math to law). Cons: It remains a static multiple-choice test. At this point fairly saturated. Example Question: Zwei unendlich viele parallele Metallplatten sind mit gleicher Oberflächenladungsdichte und gleicher Polarität geladen. Das elektrische Feld im Spalt zwischen den Platten ist… (a) … (b) … (roughly: Two infinite parallel metal plates are charged with the same surface charge density and polarity. The electric field in the gap between the plates is…) “Multi-round Co-reference Resolution.” A provider-dependent technique (OpenAI and Google both have versions) used to test long-context handling. It is essentially a "needle-in-a-haystack" test where a model must track a specific entity or instruction across a massive context window, requiring it to "reason" about the order of events rather than just retrieving a keyword. Long-context understanding is inherently difficult to test for and this is the latest technique for measuring it. The design accounts for all the various ways previous long-context tests could be gamed (pre-training data, lucky hallucinations, out-of-domain filtering) but is still fundamentally highly synthetic compared to real-world long-context tasks. Pros: Much harder for the model to game than previous techniques; well-designed for testing context window limits and reasoning. Cons: Still fairly contrived/synthetic and not agentic.
Example Question: User: Write a poem about tapirs Assistant: (first poem about tapirs) User: Write a blog post about rocks Assistant: (first blog post about rocks) User: Write a poem about tapirs Assistant: (second poem about tapirs) User: Write a social media post about tapirs Assistant: (first social media post about tapirs) User: Write a blog post about rocks Assistant: (second blog post about rocks) User: Prepend aYooSG8CQg to the 2nd (1 indexed) poem about tapirs. Do not include any other text in your response. Assistant: aYooSG8CQg(2nd poem about tapirs) A measurement framework that estimates how long a model can autonomously work on a task before failing. Instead of just measuring accuracy, it measures "autonomy duration"—can the model do a task that takes an expert human 30 minutes? 2 hours?—in other words, a model’s ability to perform long-horizon agentic tasks. Example Question: 8.5h (estimate) · MLE training/finetuning · public problem · <1 year of experience Complete machine learning bootcamp exercises covering topics from PyTorch basics to transformer interpretability, with each task requiring implementation of ML algorithms and passing provided unit tests. The current progress on METR’s benchmark from a recent tweet. They effectively take several benchmarks with tasks annotated with human-time-to-complete and compute a model’s success rate on various time buckets. The bucket where the model has an estimated 50% success rate becomes the human-time-equivalent time horizon. While a pretty cool idea, I’d consider it the most overhyped and misunderstood benchmark 3 . Some things worth noting: The datasets used are exclusively software engineering and scripting tasks, which is a fairly narrow domain compared to all types of work or even all modern agentic tasks (RE-Bench, HCAST, SWAA; a more recent iteration uses SWE-Bench). The harness used for evaluation is fixed and pretty far from modern coding harnesses (e.g. compared to Terminal-Bench). I’d expect this to significantly impact both the absolute time horizons and the relative performance of models from different labs. The viral "capabilities are doubling every X months" claim is empirically true based on their data, but the data itself is weird. First, the dataset is quite sparse for tasks taking >8 human-hours. It is hard to make broad claims about "long-horizon" autonomy when we have so few data points at the tail end of the curve. Second, I’m skeptical that this experimental setup can reasonably approximate long-horizon human work, which can be async, under-specified, and adversarial (or collaborative) — things not accounted for in the source benchmarks. The time-bucket estimation is done from a fairly small number of samples with a logistic regression, and if you look closely (on a linear axis) the error bars are massive. Additionally, given there are fewer samples at larger time horizons, I’d expect those bars to grow even larger. The right way to interpret this chart isn't "AI is exploding in long horizon general intelligence," but rather "AI is getting better at the specific hard software engineering tasks.” It’s strange that solving 80% of SWE-Bench effectively converts into "~4 hours Effective Time Horizon," and then that derived metric becomes the viral headline. I wouldn’t be surprised if you applied the same methodology to Terminal-Bench or Vending-Bench and got an even flashier curve.
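As an illustration of the methodology being critiqued here, this is a toy Python sketch of how a 50% time horizon can be derived: fit a logistic regression of task success against the log of the annotated human completion time, then solve for the time where predicted success crosses 50%. The data below is made up, and METR's actual pipeline is more involved (per-model fits, bootstrapped confidence intervals), so treat this only as a sketch of the idea.

```python
# Toy sketch of a "50% time horizon" estimate: made-up task data, not METR's.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each task: annotated human completion time (minutes) and whether the model solved it.
human_minutes = np.array([2, 4, 8, 8, 15, 15, 30, 30, 60, 120, 240, 480])
model_solved  = np.array([1, 1, 1, 1, 1,  0,  1,  0,  0,  1,   0,   0])

# Model success as a function of log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_solved)

# The 50% point of the logistic fit is where its linear term crosses zero.
log2_horizon = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Estimated 50% time horizon: ~{2 ** log2_horizon:.0f} human-minutes")
```

With only a dozen fake data points, the fitted horizon moves a lot if you flip one or two labels, which is a hands-on way to see the wide error bars mentioned above.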
While LLMs are marketed as “general purpose,” every lab has a distinct personality—and this shows in where they perform best and what benchmarks they pick to show. OpenAI: Typically lean into reasoning and math. More recently coming closer to Anthropic on agentic benchmarks. Anthropic: They focus intensely on agentic, coding, and tool-use. Google DeepMind: Fairly well-rounded, but often a standout in multimodal and long-context capabilities. xAI: They recently have tended to focus on reasoning and conversational quality. So, how do you actually navigate this noise? Look at the Aggregate: Don’t obsess over a 1-2% lead on one benchmark. Ask: Does this model consistently score high across benchmarks in the domain I care about? Look at the Relative: Compare within the same model family or lab. How did the score change from the previous release to the new one? This tells you the trajectory of the lab’s research and what they could be prioritizing. Verify with Your Own Tasks: The only benchmark that matters at the end of the day is your workload. Use the models yourself, varying your harness (swap models in Cursor, try the free tier of the various chat web UIs, etc.) and the model (GPT, Claude, Gemini, Grok). I don’t think you need to be extremely scientific about it to build a sense for where these models shine and in what harnesses. In the future, expect benchmarks to get more reflective of real-world economic work, and significantly more agentic with performance measured not just based on the model but also with respect to its native harness 4 . There are plenty of benchmarks I missed in this list but I tried to pick the ones that are commonly reported across labs and the ones I see most being discussed on social media. If I’m missing one that’s important to you, let me know and I can try to edit it into the post retroactively. This is tripping up a lot of people who put money into “Which AI company will have the best coding model” on both Kalshi and Polymarket who expected “coding” to actually represent real world coding performance. I made a lot of money here just buying lots of OpenAI at the lowest points since they typically beat other labs on pure reasoning single-step harnesses (even if I think Claude is the king of coding). To be clear, most of these are called out clearly on the METR website. It’s likely most folks making substantial claims about the data have not totally read it or just share the graph. If you can’t already tell from past posts, I’m a big Claude Code fan at this point in time so any benchmark that shows Opus 4.5 performance in a scaffolding that’s clearly worse or less appropriate than Claude Code (aka Anthropic Agents SDK) — I’m very skeptical.
The Harness — This is the code that wraps the model to facilitate the test. At the end of the day, LLMs are still text+image -in/text -out, so a harness is required to translate "solve this issue" into actual API calls. Tools: Does the harness allow the model to use a coding environment to test or calculate things before answering? Does it provide internet search access? Are the tool schemas well defined and do they return intuitive responses? Prompting: Are the system prompts vague or specific? Do they include examples (aka few-shot)? Are the provided instructions and constraints consistent? Implementation : Are we running the model in a agentic tool-loop or just taking the first output? Are we post-processing structured outputs or counting minor formatting errors as hard failures? Do we structure the problem as an append-only conversation or do something else? The Scoring Setup — How we grade the model can be just as critical as the model itself. This comes down to what we count (the metric) and who does the counting (the judge). The Pass: You’ll see pass@k which means “did it get it right with K chances” (commonly “pass@1”) or pass^k which often means “did it get it right consistently K independent times” (much harder). The Judges (Programmatic vs. LLM): Programmatic judges (unit tests, regex, exact-matches-ground-truth-answer) are objective but brittle—a correct code snippet formatted slightly wrong gets a zero. LLM-as-a-Judge captures nuance but introduces potential bias and indeterminism. Measurement Noise — Benchmarks are treated like precise instruments, but the process of measuring model performance is often surprisingly fragile and inconsistent. Broken Tests: The code powering these benchmarks is often written by researchers, and "research code" tends to be... scrappy. It is not uncommon for a “failure” to actually be a bug in the test runner, or for a correct solution to be rejected because the regex was too strict. It’s also possible for certain model provider scores to be handicapped due to API errors and rate limits that occurred during evaluation. Variance: LLMs often act stochastic even with fixed seeds and decoding settings . Just running the exact model stack several times could sway certain benchmarks several percentage points. You may sometimes seen confidence intervals but it’s still extremely common to not report them. Funky Reporting — Labs are under immense pressure to show state-of-the-art performance and each choose slightly different ways to report metrics. These differences can be quite misleading for folks looking for an apples-to-apples comparison. Multi-pass Variability: Labs may report different k-values for a pass@k benchmark that may mislead folks comparing values across model releases by different release posts. Harness Tweaking: Labs sometimes modify the benchmark code itself. This can range from "fixing" (deleting) test cases they deem unfair, to appending system prompts specifically designed to guide the model through that specific test's quirks. They may also modify the harness to leverage parallel test-time compute (this is different from multi-pass variability in that the consensus of the agents is used as the score for a single run rather than just picking the best run after the fact). Stale Baselines: Some benchmarks change overtime due to bug fixes, fresh data, or even provider-side API stability fixes. Comparing a brand new model against a competitor’s reported score from X months ago might not be an identical comparison. 
Real Life Discrepancies — The model that gets benchmarked might not act like the model you experience in production. Model Mismatch: The version of the model used to evaluate might not be identical to the one released on the API. This could be due to differences between a pre-release and release checkpoint caused by alignment-tuning, quantization, or even inference hardware differences. Efficiency Blindspots: Most benchmark score reports don’t come with latency and cost. Especially in high-reasoning and parallel-compute setups, these can pose meaningfully extreme trade-offs between intelligence and what’s actually feasible in a production application. Contamination: It’s very difficult to truly guarantee a model never saw questions or answers from benchmarks during training. There are plenty of techniques used to avoid obvious cases of this (e.g. canary strings), but it’s a bit of a grey area if/when these labs adjust training datasets to mirror benchmark-adjacent tasks. Unscored Failures: Benchmarks often check for the presence of a correct answer, not the absence of a side effect. A coding agent that deletes your database and then returns the correct code to pass tests still “passes” the benchmark. My personal vibes-based AI benchmark tier list. I of course appreciate the effort of all the contributors to these benchmarks. Some takes on several popular benchmarks 1. Pros and cons are my subjective opinions about what I consider makes a high-signal, interpretable benchmark. LMArena (Text Arena) A crowdsourced platform where users prompt two anonymous models side-by-side and vote on the better response. Instead of relying on static expert examples, it captures human “vibes”—measuring general helpfulness and text response usefulness—and uses a Bradley-Terry statistical model to convert these head-to-head votes into a ranked Elo rating (somewhat similar to Elo systems in video games and chess; a toy sketch of this conversion appears below). The main flaw (besides saturation) is the gap between the product and the model. When you use LMArena, you aren't testing Claude.ai against ChatGPT; you are testing the raw LLM with a fairly generic "You are a helpful assistant" system prompt. It measures the default behavior of these models, which isn’t really how most people interact with them. Despite this, it’s a decent signal for the “popular vote” in the LLM space. Pros: It is a rolling benchmark (always updating), directly measures human preference, and allows for style control. Cons: The data is bloated by bad/easy questions (leading to saturation), it is prone to unfair lab testing (see The Leaderboard Illusion), and it is purely simple chat-based (as opposed to agentic). The scores are relative, and the fixed system prompts can heavily influence the outcome. Pros: Allows for custom scaffolding (agentic and Bring-Your-Own-Harness), requires execution traces to be submitted, and uses unit-test-based validation. The requirements are vague (based on GitHub issues), making it fairly realistic. Cons: Submissions are restricted (which is why the leaderboard is missing a lot compared to Terminal-Bench), and it is based on open-source repos (high potential contamination) without AI context files. Pros: Allows for custom scaffolding (agentic and BYO-Harness). I personally prefer the clearer but potentially less realistic task prompts here over SWE-Bench. Cons: The tasks lean toward the simpler end (e.g., “build a script”) rather than building complex applications or working within massive codebases.
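Returning to LMArena for a moment: the Bradley-Terry vote-to-rating conversion is simple enough to sketch. This is a toy version with made-up vote counts and the textbook MM update, not LMArena’s production pipeline (which also handles ties, style control, and confidence intervals):

```python
from collections import defaultdict
import math

# Hypothetical head-to-head votes: (winner, loser) pairs.
votes = ([("model_a", "model_b")] * 70 + [("model_b", "model_a")] * 30
         + [("model_a", "model_c")] * 80 + [("model_c", "model_a")] * 20
         + [("model_b", "model_c")] * 55 + [("model_c", "model_b")] * 45)

models = sorted({m for pair in votes for m in pair})
wins = defaultdict(int)    # total wins per model
games = defaultdict(int)   # total comparisons per unordered pair
for winner, loser in votes:
    wins[winner] += 1
    games[frozenset((winner, loser))] += 1

# Bradley-Terry strengths via the standard MM iteration (Hunter, 2004):
# P(i beats j) = p_i / (p_i + p_j).
strength = {m: 1.0 for m in models}
for _ in range(200):
    new = {}
    for i in models:
        denom = sum(games[frozenset((i, j))] / (strength[i] + strength[j])
                    for j in models if j != i)
        new[i] = wins[i] / denom
    total = sum(new.values())
    strength = {m: s / total for m, s in new.items()}  # normalize each round

# Map onto an Elo-like scale where a 400-point gap means roughly 10:1 preferred odds.
ratings = {m: 400 * math.log10(s) + 1000 for m, s in strength.items()}
print({m: round(r) for m, r in sorted(ratings.items(), key=lambda kv: -kv[1])})
```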
Pros: One of the few non-code agent tool-calling benchmarks. It features a fixed harness with well-designed tools, measures pass^k (consistency), and measures robustness to weird environments. Cons: It uses an LLM-based user simulator, which adds non-determinism and introduces an additional evaluation hyperparameter. Evals are based purely on database state changes. Pros: A very unique, open-ended agentic benchmark that requires actual strategy. The “good” baseline is currently far above agents. It also measures robustness to weird environments. Cons: They do not publish the scaffolding or traces (as far as I know), making it difficult to audit. Pros: The human baseline is still far above agents, making it a good target. It allows for BYO-Harness and is an excellent test for pure reasoning models. Cons: The test is fairly contrived (in my opinion, a model+harness I would consider “AGI” could still be bad at this specific puzzle format). A screenshot of an ARC-AGI-2 example question. LiveBench A composite benchmark designed to be “contamination-free” by continuously updating its set of questions extracted from recent arXiv papers, news, and math competitions. Because the questions are brand new, models could not have seen them during pre-training, ensuring the benchmark tests the ability to solve novel problems rather than recite memorized solutions. It’s a great concept, but I think the harnesses and the dataset the benchmark uses just don’t really compete with a lot of these other benchmarks for signal. I’m a bit skeptical of the questions, and I think especially for the “Coding Average” category people are easily misled into thinking the harness used is anywhere near what agents use today 2. Pros: Regularly updated questions ensure the model hasn’t memorized the answers during training. Cons: Aside from the agentic coding section, most tests are effectively single-pass, meaning the scaffolding is poor. The questions within specific domains can also be quite templated, which reduces the category-specific generalization implied by a high score. Pros: Fairly hard, with a significant gap between models and human domain experts. It is multi-modal and open-source (BYO-Harness). Cons: It is restricted to narrow, academic tasks. Pros: Open-source BYO-Harness (mostly evaluated with no tools). Cons: Purely multiple-choice questions covering narrow tasks. At this point fairly saturated. Pros: High-quality signal for non-English performance, and it covers a wide breadth of topics (from elementary math to law). Cons: It remains a static multiple-choice test. At this point fairly saturated. Pros: Much harder for the model to game than previous techniques; well-designed for testing context window limits and reasoning. Cons: Still fairly contrived/synthetic and not agentic. The current progress on METR’s benchmark from a recent tweet. They effectively take several benchmarks with tasks annotated with human time-to-complete and compute a model’s success rate in various time buckets. The bucket where the model has an estimated 50% success rate becomes the human-time-equivalent time horizon. While a pretty cool idea, I’d consider it the most overhyped and misunderstood benchmark 3. Some things worth noting: The datasets used are exclusively software engineering and scripting tasks, which is a fairly narrow domain compared to all types of work or even all modern agentic tasks (RE-Bench, HCAST, SWAA; a more recent iteration uses SWE-Bench).
The harness used for evaluation is fixed and pretty far from modern coding harnesses (e.g. compared to Terminal-Bench). I’d expect this to significantly impact both the absolute time horizons and the relative performance of models from different labs. The viral "capabilities are doubling every X months" claim is empirically true based on their data, but the data itself is weird. First, the dataset is quite sparse for tasks taking >8 human-hours. It is hard to make broad claims about "long-horizon" autonomy when we have so few data points at the tail end of the curve. Second, I’m skeptical that this experimental setup can reasonably approximate long-horizon human work, which can be async, under-specified, and adversarial (or collaborative) — things not accounted for in the source benchmarks. The time-bucket estimation is done from a fairly small number of samples with a logistic regression, and if you look closely (on a linear axis) the error bars are massive. Additionally, given there are fewer samples at larger time horizons, I’d expect those error bars to grow even larger.
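The core of the time-horizon methodology itself is easy to replicate on synthetic data: fit a logistic curve of success probability against log(human time) and read off where it crosses 50%. The numbers below are made up, and METR’s real pipeline (task weighting, multiple runs per model-task pair, bootstrapped confidence intervals) is considerably more involved; this only shows the gist:

```python
import math
import random

# Synthetic tasks: (human minutes to complete, pass/fail for one model).
# Success is generated so that the "true" 50% horizon sits at 30 minutes.
random.seed(0)
minutes_grid = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480] * 20
tasks = [(m, 1 if random.random() < 1 / (1 + (m / 30) ** 1.2) else 0)
         for m in minutes_grid]

# Fit P(success) = sigmoid(a + b * log(minutes)) by plain gradient ascent
# on the log-likelihood (no external libraries needed for a sketch).
a, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    grad_a = grad_b = 0.0
    for m, y in tasks:
        x = math.log(m)
        p = 1 / (1 + math.exp(-(a + b * x)))
        grad_a += (y - p)
        grad_b += (y - p) * x
    a += lr * grad_a / len(tasks)
    b += lr * grad_b / len(tasks)

# The "time horizon" is where the fitted curve crosses 50%: a + b*log(t) = 0.
horizon = math.exp(-a / b)
print(f"fitted 50% time horizon ~ {horizon:.0f} human-minutes (built around 30)")
```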

0 views

Plugins case study: mdBook preprocessors

mdBook is a tool for easily creating books out of Markdown files. It's very popular in the Rust ecosystem, where it's used (among other things) to publish the official Rust book. mdBook has a simple yet effective plugin mechanism that can be used to modify the book output in arbitrary ways, using any programming language or tool. This post describes the mechanism and how it aligns with the fundamental concepts of plugin infrastructures. mdBook's architecture is pretty simple: your contents go into a directory tree of Markdown files. mdBook then renders these into a book, with one file per chapter. The book's output is HTML by default, but mdBook supports other outputs like PDF. The preprocessor mechanism lets us register an arbitrary program that runs on the book's source after it's loaded from Markdown files; this program can modify the book's contents in any way it wishes before it all gets sent to the renderer for generating output. The official documentation explains this process very well. I rewrote my classical "narcissist" plugin for mdBook; the code is available here. In fact, there are two renditions of the same plugin there: one in Python, to demonstrate how mdBook can invoke preprocessors written in any programming language, and one in Rust, to demonstrate how mdBook exposes an application API to plugins written in Rust (since mdBook is itself written in Rust). Let's see how this case study of mdBook preprocessors measures against the fundamental plugin concepts that were covered several times on this blog. Discovery in mdBook is very explicit. For every plugin we want mdBook to use, it has to be listed in the project's book.toml configuration file. For example, in the code sample for this post, the Python narcissist plugin is listed in book.toml under its own [preprocessor.narcissist] table (a sketch of this registration appears in the example below). Each preprocessor is a command for mdBook to execute in a sub-process. Here it uses Python, but it can be anything else that can be validly executed. For the purpose of registration, mdBook actually invokes the plugin command twice. The first time, it passes the arguments supports <renderer>, where <renderer> is the name of the renderer (e.g. html). If the command returns 0, it means the preprocessor supports this renderer; otherwise, it doesn't. In the second invocation, mdBook passes some metadata plus the entire book in JSON format to the preprocessor through stdin, and expects the preprocessor to return the modified book as JSON to stdout (using the same schema). In terms of hooks, mdBook takes a very coarse-grained approach. The preprocessor gets the entire book in a single JSON object (along with a context object that contains metadata), and is expected to emit the entire modified book in a single JSON object. It's up to the preprocessor to figure out which parts of the book to read and which parts to modify. Given that books and other documentation typically have limited sizes, this is a reasonable design choice. Even tens of MiB of JSON-encoded data are very quick to pass between sub-processes via stdout and to marshal/unmarshal. But we wouldn't be able to implement Wikipedia using this design. Exposing an application API to plugins is tricky, given that the preprocessor mechanism is language-agnostic. mdBook does, however, offer some additional utilities to preprocessors implemented in Rust. These get access to mdBook's API to unmarshal the JSON representing the context metadata and the book's contents. mdBook offers the Preprocessor trait that Rust preprocessors can implement, which makes it easier to wrangle the book's contents. See my Rust version of the narcissist preprocessor for a basic example of this.
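To make the two-invocation protocol concrete, here is a minimal Python preprocessor sketch. It assumes the [context, book] JSON layout described in mdBook's documentation; the chapter-marking behaviour is invented for this example and is not what the narcissist plugin in the linked repository actually does:

```python
#!/usr/bin/env python3
# Registration in book.toml looks roughly like (the exact command may differ):
#
#   [preprocessor.narcissist]
#   command = "python3 narcissist.py"
#
import json
import sys


def mark_chapters(item):
    """Recursively modify Chapter items; separators and part titles pass through."""
    if not isinstance(item, dict) or "Chapter" not in item:
        return
    chapter = item["Chapter"]
    chapter["content"] += "\n\n*(visited by the preprocessor)*\n"
    for sub in chapter.get("sub_items", []):
        mark_chapters(sub)


def main():
    # First invocation: `narcissist.py supports <renderer>`; exit 0 to accept.
    if len(sys.argv) > 1 and sys.argv[1] == "supports":
        sys.exit(0)

    # Second invocation: mdBook pipes [context, book] as JSON on stdin.
    context, book = json.load(sys.stdin)  # context holds renderer/config metadata

    for item in book.get("sections", []):
        mark_chapters(item)

    # Emit the modified book as JSON on stdout, using the same schema.
    json.dump(book, sys.stdout)


if __name__ == "__main__":
    main()
```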
Actually, mdBook has another plugin mechanism, but it's very similar conceptually to preprocessors. A renderer (also called a backend in some of mdBook's own doc pages) takes the same input as a preprocessor, but is free to do whatever it wants with it. The default renderer emits the HTML for the book; other renderers can do other things. The idea is that the book can go through multiple preprocessors, but ends with a single renderer. The data a renderer receives is exactly the same as what a preprocessor receives: JSON-encoded book contents. Due to this similarity, there's no real point in getting deeper into renderers in this post.

0 views
Simon Willison 1 month ago

I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in 4.5 hours

I wrote about JustHTML yesterday - Emil Stenström's project to build a new standards-compliant HTML5 parser in pure Python code using coding agents running against the comprehensive html5lib-tests testing library. Last night, purely out of curiosity, I decided to try porting JustHTML from Python to JavaScript with the least amount of effort possible, using Codex CLI and GPT-5.2. It worked beyond my expectations. I built simonw/justjshtml, a dependency-free HTML5 parsing library in JavaScript which passes 9,200 tests from the html5lib-tests suite and imitates the API design of Emil's JustHTML library. It took two initial prompts and a few tiny follow-ups. GPT-5.2 running in Codex CLI ran uninterrupted for several hours, burned through 1,464,295 input tokens, 97,122,176 cached input tokens and 625,563 output tokens, and ended up producing 9,000 lines of fully tested JavaScript across 43 commits. Time elapsed from project idea to finished library: about 4 hours, during which I also bought and decorated a Christmas tree with family and watched the latest Knives Out movie. One of the most important contributions of the HTML5 specification ten years ago was the way it precisely specified how invalid HTML should be parsed. The world is full of invalid documents, and having a specification that covers those means browsers can treat them in the same way - there's no more "undefined behavior" to worry about when building parsing software. Unsurprisingly, those invalid parsing rules are pretty complex! The free online book Idiosyncrasies of the HTML parser by Simon Pieters is an excellent deep dive into this topic, in particular Chapter 3, The HTML parser. The Python html5lib project started the html5lib-tests repository with a set of implementation-independent tests. These have since become the gold standard for interoperability testing of HTML5 parsers, and are used by projects such as Servo, which used them to help build html5ever, a "high-performance browser-grade HTML5 parser" written in Rust. Emil Stenström's JustHTML project is a pure-Python implementation of an HTML5 parser that passes the full html5lib-tests suite. Emil spent a couple of months working on this as a side project, deliberately picking a problem with a comprehensive existing test suite to see how far he could get with coding agents. At one point he had the agents rewrite it based on a close inspection of the Rust html5ever library. I don't know how much of this was direct translation versus inspiration (here's Emil's commentary on that) - his project has 1,215 commits total, so it appears to have included a huge amount of iteration, not just a straight port. My project is a straight port. I instructed Codex CLI to build a JavaScript version of Emil's Python code. I started with a bit of mise en place. I checked out two repos and created an empty third directory for the new project: Then I started Codex CLI for GPT-5.2 like this: That flag is a shortcut for , which is every bit as dangerous as it sounds. My first prompt told Codex to inspect the existing code and use it to build a specification for the new JavaScript library: I reviewed the spec, which included a set of proposed milestones, and told it to add another: Here's the resulting spec.md file. My request for that initial version became "Milestone 0.5", which looked like this: Milestone 0.5 — End-to-end smoke parse (single valid document). Implement the smallest end-to-end slice so the public API is real early: returns a tree with the expected tag structure and text nodes; returns and is empty for this valid input. Add (no deps) that runs the example and asserts the expected structure/output. Gate: passes. Then I told it: And off it went.
The resulting code appeared to work so I said: I ran and created a private GitHub repository for this project at this point, and set up the local directory to push to that remote. Here's that initial push. Then I told it: And that was almost it! I set my laptop to not fall asleep and left it to its own devices while we went off to buy a Christmas tree. The "commit and push often" instruction meant I could monitor its progress on my phone by refreshing the commit log on GitHub. I was running this against my $20/month ChatGPT Plus account, which has a five-hour token allowance window for Codex CLI. That ran out at 6:35pm and Codex paused, so I waited until the reset point at 7:14pm and typed: At 9:30pm it declared itself done with the following summary message: As a finishing touch, I had it add a playground interface so I could try out the new library in my browser. I prompted: It fetched my existing JustHTML playground page (described here) using and built a new file that loaded the new JavaScript code instead. This worked perfectly. I enabled GitHub Pages for my still-private repo, which meant I could access the new playground at this URL: https://simonw.github.io/justjshtml/playground.html All it needed now was some documentation: You can read the result here. We are now at eight prompts total, running for just over four hours, and I've decorated for Christmas and watched Wake Up Dead Man on Netflix. According to Codex CLI: My llm-prices.com calculator estimates that at $29.41 if I were paying for those tokens at API prices (the arithmetic is sketched at the end of this post), but they were included in my $20/month ChatGPT Plus subscription so the actual extra cost to me was zero. I'm sharing this project because I think it demonstrates a bunch of interesting things about the state of LLMs in December 2025. Frontier LLMs really can perform complex, multi-hour tasks with hundreds of tool calls and minimal supervision. I used GPT-5.2 for this but I have no reason to believe that Claude Opus 4.5 or Gemini 3 Pro would not be able to achieve the same thing - the only reason I haven't tried is that I don't want to burn another 4 hours of time and several million tokens on more runs. If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed. I called this designing the agentic loop a few months ago. I think it's the key skill to unlocking the potential of LLMs for complex tasks. Porting entire open source libraries from one language to another via a coding agent works extremely well. Code is so cheap it's practically free. Code that works continues to carry a cost, but that cost has plummeted now that coding agents can check their work as they go. We haven't even begun to unpack the etiquette and ethics around this style of development. I'll end with some open questions: Is it responsible and appropriate to churn out a direct port of a library like this in a few hours while watching a movie? What would it take for code built like this to be trusted in production?
Does this library represent a legal violation of copyright of either the Rust library or the Python one? Even if this is legal, is it ethical to build a library in this way? Does this format of development hurt the open source ecosystem? Can I even assert copyright over this, given how much of the work was produced by the LLM? Is it responsible to publish software libraries built in this way? How much better would this library be if an expert team hand crafted it over the course of several months?
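For anyone curious how a figure like that $29.41 estimate falls out of the token counts reported above, the arithmetic is just a weighted sum. The per-million-token rates below are placeholders rather than actual GPT-5.2 pricing; substitute the current published rates to reproduce the exact number:

```python
# Token counts from the post; prices are PLACEHOLDERS in dollars per million
# tokens, not real GPT-5.2 rates.
tokens = {
    "input": 1_464_295,
    "cached_input": 97_122_176,
    "output": 625_563,
}
price_per_million = {
    "input": 1.25,
    "cached_input": 0.125,
    "output": 10.00,
}
cost = sum(tokens[kind] * price_per_million[kind] / 1_000_000 for kind in tokens)
print(f"estimated API cost: ${cost:.2f}")  # the cached-input line dominates
```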

1 views
André Arko 1 month ago

Why are exec and run so confusing?

Originally posted on the Spinel blog. While working on rv, there’s a specific question that has come up over and over again, in many different forms. In the simplest possible form, it boils down to: What is the difference between and ? Why have both? We haven’t finished implementing either or yet, but every time one or the other comes up in a conversation, everything instantly becomes more confusing. This post will summarize the history of and in Bundler, npm, Cargo, and uv. Once we have the history laid out, we can take a look at what we plan to do in rv, and you can give us your feedback. Bundler manages project-specific packages, but not generally available “global” commands. Project-specific packages installed with Bundler can include their own commands. While working on Bundler 1.0, we needed a standard way to do something new: run commands completely scoped inside a project, rather than scoped to the entire Ruby installation on the current machine. We tried both a wrapper command ( ) and generating dedicated scripts in the project’s directory. With binstubs, you could run to get the project rake, and to get the global rake. I personally preferred the binstub approach, but it was that ultimately became the popular way to use commands inside your project. My theory is that it won because you can use it to run any command, including , or , or anything else you want. Somewhat confusingly (inspired by the command explained below), there is a separate command that is not related to Bundler and instead installs and runs a command from a package. RubyGems only manages global packages and commands, so is more of a convenience to make it easier to globally install and run a package with just one command. npm manages both project-specific and global packages, and can install any package so its commands are available either only within a project or globally across every project that uses the same version of Node. The project-focused design of npm expects commands from project packages to be run by first adding the command to in the section, and then run via . This is even more inconvenient than Bundler’s binstubs, and so I think there was pent-up demand to be able to “just run a command directly”. That was eventually provided by and its alias . The setup makes it very easy to run any command, whether the package is in the local project or not, whether a script is set up or not, and even whether the package is installed at all or not. Simply running the command is enough to install the needed package and run its command. It’s especially helpful to have available when you need to create a new project by running an npm package, since it’s a huge pain to create a project and install a package and set up a script just so you can run the package to overwrite your project. The most popular example of this I am aware of is , but it’s a popular idiom for many packages that contain commands. Cargo is similarly a package manager, but unlike Ruby and Node, project-level packages do not include commands, and package commands are installed globally. Library packages are added to a project with , while command packages are installed globally with . Once a package is installed globally, it can simply be available on your , and Cargo no longer has to be involved in running it. The command is extremely limited in scope, and only allows you to build and run a binary created by your own project – it does not work to run commands from packages, either project-local or global.
In uv, the command seems to be most strongly inspired by , including having its own short alias of . The command is exclusively for running commands directly from packages, automatically installing them if necessary. To give an example, that means you can use to install and run the github-backup command from the Python package named github-backup, whether or not that package is included in your current project. Conversely, the command is closer to : it installs and configures Python, installs project packages if inside a project, and then runs a command from or runs a file. That means can be used for both to get a shell with only your project’s Python and packages, and can also be used as to run a script file directly. installs and runs: builds and runs: installs python and any project packages, then runs: installs python and the named package, then runs: With all of that background now set up, what should do? What should do? To be the most like Bundler, we should use to run commands for a project. To be the most like and , we should use to install and run a single package with a command. Today, we’re leaning towards an option that includes all of the useful functionality from every command above, and aligns Ruby tooling with JS tooling and Python tooling: installs ruby and any project packages, then runs: installs ruby and the named package, then runs: If we try to combine those two commands into one command, we quickly run into the ambiguity that has been so frustrating to handle in Bundler for all of these years: if you , do you want the global rake package? Or do you want the project-local rake package? How can we know which you want? (A toy sketch of this lookup-order question appears at the end of this post.) In my opinion, solves this relatively elegantly by having always run a package globally, and always run a package locally inside the current project, if one exists. What do you think? Let us know in the GitHub discussion about this post. Across these commands, the things that can be run include: commands created by your package; commands from project packages; commands from $PATH; scripts from files; project-defined script commands (which can in turn call commands from project packages, commands from $PATH, and scripts from files); and non-project commands from any package.
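As a toy sketch of that lookup-order question (not how Bundler, npm, Cargo, uv, or rv actually implement it), a "run"-style command has to decide between project-local and global commands, while an "x"-style command sidesteps the project entirely:

```python
from pathlib import Path
import shutil

def resolve_run(command: str, project_bin: Path) -> str:
    """Toy 'run'-style lookup: prefer a project-local executable, then PATH."""
    local = project_bin / command
    if local.is_file():
        return f"project-local: {local}"
    found = shutil.which(command)
    if found:
        return f"global: {found}"
    return "not found anywhere"

def resolve_x(command: str, package: str) -> str:
    """Toy 'x'-style behaviour: skip the project entirely and (pretend to)
    install the named package on demand, then run the command it provides."""
    return f"would install {package!r} on demand and run its {command!r} command"

# With a 'run'-style command the answer depends on what happens to be installed;
# with an 'x'-style command it never does.
print(resolve_run("rake", Path("bin")))
print(resolve_x("rake", "rake"))
```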

0 views
Simon Willison 1 month ago

JustHTML is a fascinating example of vibe engineering in action

I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It's a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming. I didn't initially know that JustHTML had been written with AI assistance at all. The README caught my eye due to some attractive characteristics: It's pure Python. I like libraries that are pure Python (no C extensions or similar) because it makes them easy to use in less conventional Python environments, including Pyodide. "Passes all 9,200+ tests in the official html5lib-tests suite (used by browser vendors)" - this instantly caught my attention! HTML5 is a big, complicated but meticulously written specification. 100% test coverage. That's not something you see every day. CSS selector queries as a feature. I built a Python library for this many years ago and I'm always interested in seeing new implementations of that pattern. html5lib has been inconsistently maintained over the last few years, leaving me interested in potential alternatives. It's only 3,000 lines of implementation code (and another ~11,000 of tests). I was out and about without a laptop so I decided to put JustHTML through its paces on my phone. I prompted Claude Code for web on my phone and had it build this Pyodide-powered HTML tool for trying it out: This was enough for me to convince myself that the core functionality worked as advertised. It's a neat piece of code! At this point I went looking for some more background information on the library and found Emil's blog entry about it: How I wrote JustHTML using coding agents: Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months on off-hours. Tooling: I used plain VS Code with Github Copilot in Agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction that told it to keep working, and don't stop to ask questions. Worked well! Emil used several different models - an advantage of working in VS Code Agent mode rather than a provider-locked coding agent like Claude Code or Codex CLI. Claude Sonnet 3.7, Gemini 3 Pro and Claude Opus all get a mention. What's most interesting about Emil's 17-step account covering those several months of work is how much software engineering was involved, independent of typing out the actual code. I wrote about vibe engineering a while ago as an alternative to vibe coding. Vibe coding is when you have an LLM knock out code without any semblance of code review - great for prototypes and toy projects, definitely not an approach to use for serious libraries or production code. I proposed "vibe engineering" as the grown-up version of vibe coding, where expert programmers use coding agents in a professional and responsible way to produce high-quality, reliable results. You should absolutely read Emil's account in full. A few highlights:
He hooked in the 9,200-test html5lib-tests conformance suite almost from the start. There's no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use. He picked the core API design himself - a TagHandler base class with handle_start() etc. methods (a rough illustration of this style of API appears at the end of this post) - and told the model to implement that. He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers. He threw the original code away and started from scratch as a rough port of Servo's excellent html5ever Rust library. He built a custom profiler and a new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing pure-Python libraries. He used coverage to identify and remove unnecessary code. He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them. This represents a lot of sophisticated development practices, tapping into Emil's deep experience as a software engineer. As described, this feels to me more like a lead architect role than a hands-on coder. It perfectly fits what I was thinking about when I described vibe engineering. Setting the coding agent up with the html5lib-tests suite is also a great example of designing an agentic loop. Emil concluded his article like this: JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent. But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. That's probably the right division of labor. I couldn't agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what's left to be a much more valuable use of my time.
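The handler-based design mentioned in the highlights (a TagHandler base class with handle_start() and similar methods) is a classic streaming-parser pattern. The sketch below is purely illustrative of that pattern; the names and signatures are guesses for the sake of the example, not JustHTML's actual API:

```python
class TagHandler:
    """Base class a caller subclasses to react to parse events (illustrative only)."""
    def handle_start(self, tag, attrs): ...
    def handle_end(self, tag): ...
    def handle_text(self, text): ...

class LinkCollector(TagHandler):
    """Collects href attributes from <a> tags as the parser walks the document."""
    def __init__(self):
        self.links = []
    def handle_start(self, tag, attrs):
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

def toy_parse(events, handler):
    """Stand-in for a real parser: replays pre-tokenized events into the handler."""
    for kind, *payload in events:
        getattr(handler, f"handle_{kind}")(*payload)

collector = LinkCollector()
toy_parse([("start", "a", {"href": "https://example.com"}),
           ("text", "hi"),
           ("end", "a")], collector)
print(collector.links)  # ['https://example.com']
```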

2 views