Posts in Python (20 found)

Running Python code in a sandbox with MicroPython and WASM

I've been experimenting with different approaches to running code in a sandbox for several years now, but my latest attempt feels like it might finally have all of the characteristics I've been looking for. I've released it as an alpha package called micropython-wasm , and I'm using it for a code execution sandbox plugin for Datasette Agent called datasette-agent-micropython . My key open source projects - Datasette , LLM , even sqlite-utils - all support plugins. I absolutely love plugins as a mechanism for extending software. A carefully designed plugin system reduces the risk involved in trying new things to almost nothing - even the wildest ideas won't leave a lasting influence on the core application itself. My software can grow a new feature overnight and I don't even have to review a pull request! There's one major drawback: my plugin systems all use Python and Pluggy , and plugin code executes with full privileges within my applications. A buggy or malicious plugin could break everything or leak private data. I'd love to be able to run plugin-style code in an environment where it is unable to read unapproved files, connect to a network, or generally operate in a way that's risky or harmful to the rest of the application or the user's computer. My interest covers more than just plugins. For Datasette in particular there are many features I'd like to support where arbitrary code execution would be useful. I've already experimented with this for Datasette Enrichments , where code can be used to transform values stored in a table. I'd love to build a mechanism where you can run code on a schedule that fetches JSON from an approved location, runs a tiny bit of code to reformat it into a list of dictionaries, then inserts those as rows in a SQLite database table. My goal is to execute code safely within my own Python applications. Here's what I need: Web browsers operate in the most hostile environment imaginable when it comes to malicious code. 
Their job is to download and execute untrusted code from the web on almost every page load. Given this, JavaScript engines should be excellent candidates for sandboxes. Sadly those engines are also extremely complicated, and are not designed for easy embedding in other projects. Most of the v8-in-Python projects I've seen are infrequently maintained and come with warnings not to use them with completely untrusted code. WebAssembly is a much better candidate. It was designed from the start to support all of the characteristics I care about and has been tested in browsers for nearly a decade. The wasmtime Python library is actively maintained and has binary wheels. WebAssembly engines like wasmtime run WebAssembly binaries. Some programming languages like Rust are easy to compile directly to WebAssembly. Dynamic languages like JavaScript and Python are harder - they support language primitives like , which means they need a full interpreter available at runtime. To run Python we need a full Python interpreter compiled to WebAssembly, wired up in a way that makes it easy to feed it code, hook up host functions and access the results. Pyodide offers an outstanding package for running Python using WebAssembly in the browser, but using Pyodide in server-side Python isn't supported. The most recent advice I could find was from October 2024 stating "Pyodide is built by the Emscripten toolchain and can only run in a browser or Node.js". The other day I decided to take a look at MicroPython as an option for this. The MicroPython site says: MicroPython is a lean and efficient implementation of the Python 3 programming language that includes a small subset of the Python standard library and is optimised to run on microcontrollers and in constrained environments. WebAssembly sure feels like a constrained environment to me! 
I had GPT-5.5 Pro do some research for me , which turned up this PR against MicroPython by Yamamoto Takahashi titled "Experimental WASI support for ports/unix". It then produced this research.md document , so I let Codex Desktop and GPT-5.5 high loose on it to see what would happen: It worked. I now had a prototype Python library that could execute Python code inside a WebAssembly sandbox! The trickiest piece to solve was persistent interpreter state. The WASM build we are using here exposes a single entry point which starts the interpreter, runs the code and then stops the interpreter at the end. This works fine for one-off scripts, but for Datasette Agent I want variables and functions to stay resident in memory so I can reuse them across multiple code execution calls. A neat thing about working with coding agents is that you can get from an idea to a proof of concept quickly. I prompted: After some iteration we got to a version of this that works! In Python code you can now do this: Under the hood this starts a thread, sets up a request queue and then sends messages to that queue for the command, each time waiting on a reply queue for the result of that execution. Inside WASM the MicroPython interpreter blocks waiting for a host function to return the next line of code, which it runs on before calling when each block has been successfully executed. The other piece of complexity was supporting host functions, so my Python library could selectively expose functions that could then be called by code running in MicroPython. Codex ended up solving this with 78 lines of C , which ends up compiled into the 362KB WebAssembly blob I'm distributing with the package. I am by no means a C programmer, but I've read the C and had two different models explain it to me (here's Claude's explanation ) and I've subjected it to a barrage of tests. 
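The thread-plus-queues arrangement described above can be sketched in plain Python. This is only an illustration of the pattern, not the code in micropython-wasm: the class name `PersistentSession`, the `result` variable convention, and the use of Python's built-in `exec()` as a stand-in for the MicroPython interpreter are all assumptions made for the example, and `exec()` provides no sandboxing at all.

```python
import queue
import threading


class PersistentSession:
    """Hypothetical sketch: one worker thread owns long-lived interpreter state."""

    def __init__(self):
        self._requests = queue.Queue()  # host -> worker: code to run
        self._replies = queue.Queue()   # worker -> host: outcome of each run
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        # This dict plays the role of the interpreter's memory. It is created
        # once and survives across requests, so names defined by one run()
        # call are still visible to the next.
        namespace = {}
        while True:
            code = self._requests.get()  # block until the host sends code
            if code is None:             # sentinel value: shut down
                break
            try:
                exec(code, namespace)    # stand-in only: NOT a sandbox
                # Convention for this sketch: report whatever is in `result`.
                self._replies.put(("ok", namespace.get("result")))
            except BaseException as exc:
                # Always reply, so the caller never blocks forever.
                self._replies.put(("error", repr(exc)))

    def run(self, code):
        """Send one block of code and wait for the worker's reply."""
        self._requests.put(code)
        return self._replies.get()

    def close(self):
        """Ask the worker to stop and wait for it to exit."""
        self._requests.put(None)
        self._thread.join()
```

Calling `run("x = 2")` and then `run("result = x * 21")` on the same session returns `("ok", 42)` from the second call, because the namespace outlives each individual request.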
The great thing about working with WebAssembly is that if the C turns out to be fatally flawed the worst that can happen is the WebAssembly execution will fail with an exception. I can live with that risk. Memory limits are directly supported by wasmtime. CPU limits are a little harder: wasmtime offers a "fuel" concept to limit how many operations a WebAssembly call can execute, and that's the correct fit for this problem, but the units are hard to reason about. I'm experimenting with a 20 million default "fuel" setting now but I'm not confident that it's the most appropriate value. The alpha is now live on PyPI. You can try it from your own Python code as described in the README. I've also added a simple CLI mode in version 0.1a2 which means you can try it using without first installing it like so: You can also try it in Datasette Agent like this: Then navigate to http://127.0.0.1:8001/-/agent and run the prompt: Having complained about immature, loosely-maintained sandboxing libraries, it's deeply ironic that I've now built my own! I deliberately slapped an alpha release version on it, and I'm not ready to recommend it to anyone who isn't willing to take a significant risk. I've put it through enough testing that I'm OK using it myself. I've shipped my first plugin that uses it, datasette-agent-micropython. I've also locked GPT-5.5 xhigh in that Datasette Agent plugin and challenged it to break out of the sandbox and so far it has not managed to. I'm hoping this implementation can convince some companies with professional security teams and high-stakes problems to commit to using Python in WebAssembly as a sandboxing approach and open source their own solutions.

What I want from a sandbox:

- Dependencies that cleanly install from PyPI, including binary wheels across multiple platforms if necessary. I don't want people using my software to have to take any extra steps beyond directly installing my Python package.
- Executed code must be subject to both memory and CPU limits. I don't want to crash my application or the user's computer.
- File access must be strictly controlled. Either no filesystem access at all or I get to define exactly which files can be read and which files can be written to.
- Network access is controlled as well. Sandboxed code should not be able to communicate with anything without going through a layer I fully control.
- Support for interaction with host functions. A sandbox isn't much use if I can't carefully expose selected platform features to the code that it's running.
- It has to be robust, supported, and clearly documented. I've lost count of the number of sandbox projects I've seen in repos with warnings that they aren't actively maintained!
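To make the "fuel" idea from this post concrete, here is a toy analogue in plain CPython. It is not how wasmtime works: wasmtime meters WebAssembly operations inside the engine, whereas this sketch swaps in `sys.settrace` and charges one unit per executed source line. The names `run_with_fuel` and `OutOfFuel` are made up for this example, and the tracer offers no real isolation.

```python
import sys


class OutOfFuel(Exception):
    """Raised when the code uses up its budget (hypothetical name)."""


def run_with_fuel(source, fuel):
    """Run `source`, charging one unit of fuel per executed line.

    Returns (namespace, remaining_fuel). Toy analogue only: wasmtime counts
    WebAssembly operations, not Python source lines.
    """
    remaining = fuel

    def tracer(frame, event, arg):
        nonlocal remaining
        if event == "line":
            remaining -= 1
            if remaining < 0:
                raise OutOfFuel(f"budget of {fuel} lines exhausted")
        return tracer  # keep tracing this frame and any nested calls

    namespace = {}
    compiled = compile(source, "<untrusted>", "exec")
    previous = sys.gettrace()  # preserve any debugger or coverage tracer
    sys.settrace(tracer)
    try:
        exec(compiled, namespace)
    finally:
        sys.settrace(previous)
    return namespace, remaining
```

A loop such as `while True: n += 1` is stopped once the budget runs out, while a single line like `sum(range(10**8))` costs one unit however long it takes. That mismatch between units and real work is a small-scale version of why fuel values are hard to reason about. Code that catches the exception would carry on unmetered, which is one of several reasons this is a teaching aid and not a sandbox.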

Giles's blog Yesterday

Using Safetensors with Flax

I'm porting my PyTorch LLM code to JAX , using Flax as the neural network layer. For various reasons I wanted to use Safetensors to store checkpoints of the model. It took a little while to get it working; here's the trick I learned. If you look at the Safetensors docs, you'll see that it doesn't mention a JAX implementation -- indeed, searching for "safetensors jax" at the time I'm writing this gives you a link to this GitHub repo by Alvaro Bartolome -- which was last updated in 2023. However, if you look more closely at the docs, they do have a link to the Flax API . I feel this is somewhat misnamed, as it is actually a JAX API. There's no reference (again, as of the time of writing) to Flax in the source -- it's all just JAX code. And in fact Bartolome's library uses it under the hood. There is one problem, though. The API works with simple single-level dictionaries, with strings mapping directly to JAX arrays. For example, the function has this signature: This can cause problems if you're not careful. If you look at the Flax documentation on checkpointing , it suggests that you use Orbax 1 , which has its own API and file format, but then goes on to say: When interacting with checkpoint libraries (like Orbax), you may prefer to work with Python built-in container types. In this case, you can use the and API to convert an to and from pure nested dictionaries. I initially put two and two together -- that and the dictionary-based API for Safetensors -- and got five, and tried feeding one of those "pure" dicts into Safetensors. I got a very confusing error: It's worth digging in to why that happens. The problem is that although Safetensors is expecting a dict of strings mapping to tensors, it doesn't check that that is what it actually gets. And while the dictionaries from are "pure", they are also nested (as the docs say!). 
Even for the simple model I was working with, I got a structure like this: So, we had strings mapping to dicts, and those dicts mapped from strings to the JAX arrays. More complex models would have had deeper dict structures. Now, internally inside Safetensors, the Flax/JAX API is a simple wrapper. It iterates over the keys in the dictionary it's been provided with, and tries to convert their respective values into NumPy arrays. It does that by passing them into NumPy's function, which accepts things like lists, tuples, and NumPy arrays, and converts them into arrays. JAX's own class exposes an interface that it recognises, so they're converted without trouble. Once it's done that, it passes the result to a lower-level Rust implementation that actually converts everything to Safetensors format. But because Safetensors didn't check types, in my case it was iterating over the top level of the dict, trying to convert the values to NumPy arrays, and got something like this: That is -- because it assumed that the values in the top-level dict were JAX s, it blindly tried to convert them to NumPy arrays. But they were dicts (that happened to map from strings to arrays) -- and if you ask to create an array based on a random object, it happily does so and wraps that object in a NumPy array, with a of . When that is then fed into the lower-level Rust code that is trying to write the file, it encounters NumPy arrays that have a it can't handle, -- hence that error: It all makes sense when you read through the code, but I was a bit perplexed for a while! I think all this might be the reason why Bartolome created his GitHub repo. In the README, he says that: There are no plans from HuggingFace to extend safetensors to support anything more than tensors e.g. , see their response at huggingface/safetensors/discussions/138 . 
So the motivation to create is to easily provide a way to serialize using safetensors as the tensor storage format. However, you don't need to use that library to serialise simple Flax models. Consider how PyTorch models get serialised to Safetensors; my LLMs have keys with names like , , and . They're "flat" dictionaries mapping strings to PyTorch Tensors, similar to what Safetensors wants for these Flax ones, but they use dots to separate different levels, with integers for list items and strings for field names. Looking at the pure-dict structure I had for my model: ...you can see that you could walk the dictionary structure to generate keys like and . That would be easy enough to code up. But -- as Adithya Dsilva points out on GitHub -- you can get there even faster by using . That returns a (non-dict) structure like this: If you iterate over that , you get tuples where the first element is that tuple of strings, like , and the second is a object wrapping the JAX . The tuples mirror the dot-separated string format in the PyTorch-style Safetensors files. objects also implement an interface that can understand, so you can quickly and easily convert the to a regular dict for Safetensors: (You need to wrap in a because if you have a in your model, the item in the tuple will get an integer index rather than a string). You can go the other way pretty easily too; given a model, you can load the saved checkpoint into it like this (because accepts raw JAX s in place of explicit s): A little more work than I'd ideally like, but given that it can be tucked away in general / functions, not too big a deal. Hope that's of use for other people coming across this problem!

1. I'm beginning to feel a bit swamped with all of these libraries with names ending in -ax. It reminds me of the names of the characters in Asterix's village ...
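The "walk the dictionary structure" approach mentioned in this post can be sketched with the standard library alone. This is an illustrative stand-in, not the Flax helper the post goes on to use: the function names `flatten_dict` and `unflatten_dict` and the sample key names are made up for the example, and plain floats stand in for JAX arrays.

```python
def flatten_dict(nested, prefix=()):
    """Flatten a nested dict into {"a.b.c": leaf}, joining path parts with dots."""
    flat = {}
    for key, value in nested.items():
        path = prefix + (str(key),)  # str() so integer list indices work too
        if isinstance(value, dict):
            flat.update(flatten_dict(value, path))
        else:
            flat[".".join(path)] = value
    return flat


def unflatten_dict(flat):
    """Rebuild a nested dict from dotted keys (all keys come back as strings)."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Hypothetical parameter tree; a real one would hold arrays, not floats.
params = {"layer1": {"kernel": 0.5, "bias": 0.1}, "layer2": {"kernel": 0.25}}
flat = flatten_dict(params)
# flat == {"layer1.kernel": 0.5, "layer1.bias": 0.1, "layer2.kernel": 0.25}
```

Every value in `flat` is now a leaf, which is the single-level string-to-tensor shape that the post says Safetensors expects. The sketch assumes no key contains a dot.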

Jim Nielsen 4 days ago

An Ode to the Exacting Pedantry of Computers

The very first computer programming class I ever took introduced me to the idea of there being different kinds of numbers, like integers, floats, and doubles (it was a C++ course). “You mean, when I assign a variable, I have to say up front what kind of number this is?” It was such an odd concept to me. A number is a number. Why do I have to say it’s this kind of number or that kind of number? I dropped out of that class. A few years later, I decided I wanted to try programming again. So I took another intro class. This time they were teaching with Python instead of C++, so you can imagine my excitement to learn that I didn’t have to think of numbers in this way anymore! It felt like the computer was meeting me partway. Over time, I came to learn how pedantic computers are. They require a kind of exacting precision in saying what you want them to do. And they’ll only ever do exactly what you tell them to do, nothing more, nothing less. If there was a bug in your program, that wasn’t because the computer was doing something you told it not to. The computer was only ever doing exactly what you told it to do. A “bug” was very likely a flaw in your conception of how the program should execute, not the actual execution. It was a failure on your part to be more precise, to imagine a scenario where something happened that you didn’t anticipate — and therefore didn’t tell the program how to handle. “Do what I mean, not what I say!” But now, with LLMs, that kind of exacting precision in language and thought is disappearing. You can have a thought, ask the LLM to build it, and it will fill in all the details you didn’t specify or anticipate. All those pesky details which previously would’ve made you reflect, “Oh, I didn’t think of that. 
Maybe I should design this differently…” Or, “Oh, well now that I have to think about this some more, I can see that it might not actually be a very good idea…” The pedantic friction, which seemed like such a nuisance, was actually acting as a kind of tool for sharpening and improving your thinking and output. The exacting nature of the computer required you to think more. LLMs, however, have significantly lessened that friction. You can think less and move faster. And yet, that feels like our job as software makers: to think, to anticipate, to explicitly articulate intent. As a software user, I’d rather folks spend more time thinking so that I, in turn, have a better experience. This is preferable to giving me more stuff faster that’s only partly conceived. As an industry it feels like we’re headed in a direction where we think it’s better to ship more faster and fix the effects of half-conceived intent later, than to spend more time upfront discovering, sculpting, and specifying intent. That’s one thing writing code by hand has taught me: intent (what you want to build and how you want it to work) is shaped through the act of articulating it. That hard work is not required of us anymore. The LLM will fill in the details. The exacting pedantry of the computer is going away, and in its place are assumptions about intent, many of which we don’t even know about until our users run into their effects.


AI Doesn't Have ROI

If you liked this piece, you should subscribe to my premium newsletter. It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA , Anthropic and OpenAI’s finances , and the AI bubble writ large . My Hater's Guides To the SaaSpocalypse , Private Credit and Private Equity are essential to understanding our current financial system, and my guide to how OpenAI Kills Oracle pairs nicely with my Hater's Guide To Oracle . Over the last three weeks , I’ve published an exhaustive three-part guide to how the AI bubble might collapse, the events that might trigger it, and the consequences.  Subscribing to premium is both great value and makes it possible to write these large, deeply-researched free pieces every week.  Something changed in the last week. Shortly after Uber COO Andrew Macdonald said that it was “getting harder to justify” spending money on AI as it was “very hard to draw a line” from that spend to useful consumer features ( after its CTO said Uber burned its entire annual token budget in four months ), Axios’ Madison Mills reported that one company had accidentally spent $500 million in the space of a month on Anthropic’s models after failing to set spend limits. A few days later, Mills would report that other companies were now looking for ways to reduce their AI spend . That’s because, as I’ve said before , nobody can actually measure the ROI of AI, or even create a standard measurement of the cost of a task thanks to the inevitable hallucination-prone nature of LLMs and the ever-growing list of different harnesses and “agentic” (sigh) interfaces. 
Every different prompt and project and interaction can go wrong in a way that is hard to predict or plan for other than having an eternal vigilance that the supposed “intelligence” doesn’t do something catastrophically stupid, because LLMs have no thoughts, consciousness or ability to learn outside of pre and post-training.  If you can’t measure how good something is, how much it might cost, or what your return on investment might be, it’s fair to ask why you’re even paying for it in the first place. People are (reasonably!) harping on about the ROI problem, but I think the “can’t really measure the cost” part is an even bigger problem.  Yesterday, Microsoft’s GitHub Copilot moved all customers to token-based billing from a premium request model ( as I reported a week before everyone ) as users had been allowed to burn thousands of dollars of tokens on a $39-a-month subscription .  Customers are irate. One burned through 50% of their monthly credits in a single prompt , another burned 60% in the space of a few hours , another 31% in a single prompt , another estimated that they’d burn their monthly credits in the space of a single five hour session , another burned nearly half of their credits in eight prompts , another around 14% of their credits in two prompts , and another lamented that GitHub Copilot had gone from their favorite subscription to their most-stressful overnight after burning 33% of their monthly balance in a few hours . And, to be clear, this is during a promotional period where you get $11 or $21 in free monthly credits: These users — much like the users of effectively every subsidized AI subscription — never really knew how much anything they did cost, because Microsoft intentionally hid the actual cost of prompts and allowed users to spend obscene amounts as a way of boosting growth for GitHub Copilot.  This problem is industry-wide. 
Every single user of every single AI subscription service is having their tokens subsidized and the actual cost of AI obfuscated. As a result, every frothy, fluffy hype-piece about Claude Code or AI in general is a kalopsia — the belief that something is more beautiful than it really is.  Think of it like this: if you’re using an AI subscription with rate limits but no actual costs , any mistakes a model makes — such as getting stuck in a loop or just doing the wrong thing — can be dismissed as the troubled nature of early-stage technology, because the “cost” was $20, $100, or $200 for the entire month. Anthropic, OpenAI and every other AI company deliberately obfuscated these costs because they knew that the second a user actually had to pay for the fuckups of an AI model they’d scream like they were being stung to death by bees. This issue bubbled to the surface in the last few months because Anthropic and OpenAI both quietly moved all of their enterprise customers to token-based billing in Q1 2026 , and because these enterprise customers are run by Business Idiots with no connection to actual work , CEOs encouraged (or actively incentivized ) their workers to use AI as much as possible, in some cases even making one’s AI use a KPI that could cost them their job.  These same workers were conditioned — through their use of AI subscription products that hide the true costs — to use them as if they cost nothing , all while being screamed at by useless middle managers to “make sure to adopt AI at scale,” all while never, ever having any awareness of what a particular unit of work cost. This was always a recipe for destruction. The overwhelming majority of AI users are completely divorced from and actively trained to ignore the true cost of AI tokens, which means they naturally use these services in a way that’s actively uneconomical. 
Every frothy hype-piece you’ve read has been written by somebody who has been conned into ignoring the true cost of AI, all in service of spreading a technology that’s unreliable, inconsistent and expensive at its core, and never, ever seems to get cheaper.  OpenAI, Anthropic and other AI companies have actively conspired to mislead the world about the true costs of AI, and it was working great right up until they decided to try charging what it actually cost. Less than a quarter into the shift to token-based billing, enterprises are freaking the fuck out, with Walmart setting token limits on its internal “Code Puppy” AI coding tool , with a spokesperson saying that it “wanted employees to apply AI in ways that create value” mere days after Amazon SVP Dave Treadwell told employees to “ not use AI just for the sake of using AI .” The last few years of AI hype have been built on lies. Every company has conspired to make you think that AI is affordable and sustainable, that profitability was possible, that hallucinations were fixable, and that any problems you faced today were a result of being in “ the early innings .” In reality, the AI industry has absorbed over a trillion dollars, effectively all tech talent, the majority of startup funding, the majority of media coverage, the art and work of millions of people, and been given chance after chance after chance to fix the obvious, glaring issues.  Every time a skeptic dared to stand out and say that none of this made sense, they were told that it was just like Uber ( it’s not ) or that Amazon Web Services cost a lot of money ( it cost $52 billion over the course of 14 years and was cash-flow positive in nine ), that “costs always come down,” and that everything would magically be alright as long as they were patient for an indeterminate amount of time. 
Four years and a trillion dollars in, AI is more expensive, its companies more cash-intensive, its products just as unreliable, and its boosters more desperate than ever to make you ignore reality as a means of empowering one of a few ultra-rich oafs. Products from OpenAI and Anthropic are built to ingratiate and coddle losers while creating work-shaped outputs that are good enough to impress braindead executives, imbeciles and middle management hall monitors that don’t do any real work, and the reason it’s worked this long is that both companies intentionally misled everybody about how much the real costs were. I must repeat myself: AI is more expensive today than it was three years ago, and it is not getting cheaper. Sam Altman’s comments about “ intelligence too cheap to meter ” were lies. NVIDIA’s Blackwell GPUs didn’t make it cheaper, and its Vera Rubin GPUs won’t either. Google’s TPUs won’t do it, Amazon’s Trainium or Inferentia chips won’t do it, Vera Rubin CPUs won’t do it, OpenAI’s chips won’t do it, and no, DeepSeek won’t do it either.  People chose — and still choose — to believe that AI would get cheaper because they think things got cheaper over time in the past, which is sort of true but not remotely similar in any way, because the cost of running and training AI models comes from using the hardware as well as its upfront cost. Large Language Models require expensive GPUs thanks to their reliance on power-intensive parallel processing, and larger, more-complex models in turn require more GPUs to both train and run inference with. And three generations in, NVIDIA GPUs don’t appear to be bringing the cost down at all, which heavily-suggests that the inherent business model of generative AI is broken. People love to compare AI to the Dot Com Bubble ( AI is far, far worse ) because it’s much easier to rationalize bad behavior than accept that we’re facing the largest misallocation of capital of all time. 
The Dot Com Bubble was really two bubbles: one around eCommerce and internet startups, and one around telecommunications infrastructure. Per Justin Kollar, the telecommunications bubble grew because of a fundamental misunderstanding of demand: As a result, infrastructure was built far in excess of what demand existed, because most people weren’t online, and those who were had very slow internet connections. Per me: Here’s a critical difference between AI and the Dot Com Bubble: when people actually lit up the dark fiber, the underlying internet service was faster, better and cheaper than a dial-up connection. Services like TheGlobe, WebVan, and Pets Dot Com lost incredible sums of money not because of the costs associated with accessing their services, but because of their unrealistic and unsustainable business models. Their eventual functional forms (Facebook, Instacart, and Chewy) didn’t require fundamental scientific breakthroughs in how goods were delivered or internet services were accessed. Their failures were a result of poorly run businesses that lost money by expanding too rapidly or spending $400 to acquire each customer. Dell and CoreWeave just turned on the first Vera Rubin GPUs, and you’ll notice nobody is saying the words “profitable” or “sustainable,” because NVIDIA is interested in making stuff more expensive, not more efficient. According to CEO Jensen Huang, AI data centers, which currently cost somewhere in the region of $50 billion per gigawatt, will cost between $80 billion and $100 billion per gigawatt in the future. Does this sound like it’s getting cheaper to you? Even if said data center packs theoretically more “power,” what does that “power” do for the customer running compute on it? Is it cheaper? More efficient? How do we not have these answers?
All of this is to say that the Dot Com Bubble happened due to irrational exuberance and growth lust, and what was recovered at the end came not from scientific breakthroughs but the fact that the useful infrastructure existed and could be adapted and used to make things cheaper and more efficient. That isn’t the case with AI data centers, AI startups or anything else to do with the AI Bubble. Every few days somebody makes a post like this suggesting that “the internet didn’t go away” and “railways didn’t go away” when their bubbles popped, but I think this is a fundamental misunderstanding of what AI is . An AI data center full of AI GPUs is useful for AI and very little else. There are GPU-powered analytics tools, GPU-powered modeling and scientific applications, but the nature of GPUs — good at doing the same thing across big data sets in parallel, but bad at handling many little independent tasks — makes them impractical for most of what modern computing demands. The entire Dot Com Redemption storyline comes from the idea that it “left behind useful infrastructure,” by which they mean “cabling that allowed hundreds of millions of people to use the internet.” While there was some amount of further construction and capex to handle, the end result was useful fiber that connected people with a faster connection at a lower cost. No such story exists for AI. AI data centers are ruinously expensive , requiring billions in upfront funding with operating costs so high that they, at best, run at a loss for the first five or six years of service, if they ever recover their original costs at all. A rack of Vera Rubin or Blackwell GPUs will cost as much to run in five years as they do today, as will an incomplete data center cost just as much to finish construction, connect to the grid or acquire behind-the-meter (IE: generators) power for.  
In the aftermath of the Dot Com Bubble, dead startups flooded the market with cheap server and office gear, which allowed plucky founders to cobble together their own services. A single Sun Microsystems Ultra Enterprise 3000 cost $43,000 ($89,000 in today’s money) and had a power draw of between 1,200W and 1,500W, but could run an entire company’s infrastructure. A single B200 Blackwell GPU uses 1,200W, and more-complex AI coding tasks can take up four to twelve of them for a single user’s output. Put simply, you can’t really do very much with a few of these GPUs, and what you can do isn’t profitable, scalable or valuable. Similarly, dark fiber could be lit up with the right transceivers and networking gear to create internet access. AI data centers are effectively large boxes with custom cooling built for a very limited subset of chips. Adapting them to other uses would require gutting the data center, which would mean that the vast majority of the capital expenditures were wasted. Even if you were able to buy a hundred Blackwell GPUs from a dead neocloud, you, as a regular person, couldn’t do anything with them. In fact, nobody really could, because you’d still need a physical data center and bespoke cooling, which means that even if the chips were free, the associated construction capex or, at the very least, physical colocation space would still cost a great deal of money. The internet and railways didn’t go away because their upfront costs were the only real costs that mattered. Even if somebody were able to pick up a cheap AI data center full of the latest generations of GPUs, the underlying operating expenses are awful, and the only way to make them even close to generating a profit is to have consistent use of all your GPUs.
There’s a cost to having them sit idle — both in electricity and personnel — and unless the plan is to have them sit in a data center turned off until you can find somebody else to sell them to, you’ll have to come up with a business model for your AI services that actually makes a profit…which nobody appears to have done, even with unlimited capital and the entire focus of the tech industry. Then there’s the issue of training , which is entirely made up of opex. If you want to train a new model, you’ll likely need thousands — or even tens of thousands — of H100 or H200 GPUs, and they’ll cost just as much electricity whether or not you make anything useful. A failed or unhelpful training run could cost tens of millions or hundreds of millions of dollars , and that will require financial backing that won’t exist. While there could be a theoretical future of LLMs run at their true cost (IE: unaffordable for most) as I covered in last week’s premium newsletter , that would require demand, and as I’ve discussed above, the demand for AI services is a mirage built on subsidized subscriptions, and companies paying the actual costs are already screaming for mercy.  Once the bubble bursts, any excitement for AI — and by extension excitement to spend money on AI — goes out the window. AI startups won’t get funded . AI token budgets won’t get greenlit . AI data centers won’t be able to raise debt .  Every part of this bubble relies upon the momentum of hype to substantiate every link in the chain. Hype must exist around the nebulous concept of an “ AI factory ” to raise debt to buy NVIDIA GPUs and build data centers, hype must exist around AI software to convince enterprises to keep buying services from OpenAI and Anthropic, hype must exist around theoretical demand and outcomes from AI services to fund AI startups, and hype must exist perpetually in the media to make everybody ignore AI’s ruinous costs.  
This hype was unsustainable without buckets of lies, misinformation and a captured tech and business media. The value of AI has been inflated by the vagueness of how it’s discussed. For example, major media outlets will gladly write that “AI can build software,” but said sentence suggests that you can just type “build me Slack 2” into Claude and have it fart out a fully-functional, production-ready piece of software, rather than a quasi-functional mound of code-slop that can do enough to trick a business idiot or lazy journalist, but little else. Said vagueness created a society-wide gravitational pull of consensus that you needed to be behind AI now, because it’s just like the new internet, except bigger, and if you say it’s not you’re going to be really embarrassed. Creating this pressure was necessary, because without a society-wide aggression against those who didn’t adopt these tools, AI might have actually had to stand on its own merits. The fact that AI companies backed by the full manufactured consent of the markets and most of the economy still had to subsidize their products shows exactly how flimsy their value truly is. The only way to inflate the AI bubble both on a hardware and software level was to mislead the general public and investors on the costs and efficacy of AI models. Now that organizations are having to pay the actual cost of AI, suddenly they’re concerned about its outcomes, and everybody has become a little hysterical. Late last week, SemiAnalysis wrote one of the most insane articles I’ve ever read, AI Dark Output: The Visible Cost of Invisible Output, saying that “AI output will be real before it is measurable,” and, well, whatever the fuck this is: SemiAnalysis is a semiconductor analyst firm with an obvious reason to keep the AI bubble inflated, and if they’re writing a piece that amounts to “AI has a return on investment, you just can’t see it,” things are getting desperate.
Here’s how they define “Dark Output”: That “substitution dark output” is explained using a theoretical example of “...a simple legal document which in theoretical GDP should have the same inflation adjusted value to a user whether a lawyer drafts it or AI drafts it,” which is nonsense.   When you pay a lawyer, you don’t pay them to “create an output,” you buy their experience and time and ability to find and adapt case law to reach an outcome, such as in the process of filing stuff, avoiding or actively participating in litigation. Just because AI can fart out an approximation of what a human output may look like — likely riddled with hallucinations — doesn’t mean that said output was created with any “experience.” Models don’t think , they have no experiences , and even if a lawyer is prompting them , that doesn’t mean that the lawyer’s discernment or taste is reflected in the final output. Then there’s this bit: We’re four fucking years into it but we’re still using hypotheticals. Are “...the simplest documents now completed by AI and not lawyers”? You don’t get a lawyer to write a document because they’re the only ones who can write it — you get it to mitigate the risk using the experience of the law firm, both in the associate drafting the document and the partner overseeing it. This flimsy, half-assed logic is how the AI bubble got inflated in the first place. Supposedly smart people continually show a total lack of awareness of how jobs work at basically every level, and in this case — where it should be theoretically possible to find and talk to a lawyer doing this — the supposed “dark output” includes “the research done to complete this article.”  You may be wondering what that “new work done by AI that wasn’t previously being done by humans because AI made it cheap” is, and the answer is “literature reviews” and “summarizing the last six months of email,” and I wish I was kidding. 
But don’t worry, “...there are anecdotal signs that a large fraction of current token spend is for new work that wasn’t previously paid for rather than replacing existing work.” Have you ever noticed that every story about AI job loss reads like it was written by The Riddler? For example, last year a ton of outlets reported that “Oxford Economics had proven that entry-level workers were being replaced with AI,” but in reality, the study said that “... there are signs that entry-level positions are being displaced by artificial intelligence at higher rates ” with no actual data beyond post-2022 employment declines in some fields that AI might be able to do.  Similarly, CNBC’s brainless headline that an MIT study found that AI “could already replace 11.7% of the US workforce” was entirely based on a labor simulation tool rather than any economic analysis of the actual shit AI can do and what it’s doing in the real world. That’s because AI job loss is a fucking myth. Every company laying off people because of “the power of AI” is doing so because their shareholders are mad and because they know they’ll get headlines.  And if it were actually happening there’d be fucking riots in the streets! Unemployment would be spiking! Things would be burning!  The thing that everybody wants you to avoid thinking about is that if AI worked as advertised, there would be obvious, impossible-to-ignore economic signs: For all of these things to happen, AI would have to be both flawless , hallucination free, a completely different product capable of autonomous intelligence and having unique ideas.  The reason that we can’t measure “AI job loss” is because AI can’t do jobs. It can be used to replace some specific contract positions with extremely shitty versions that don’t scale , but it does not replace jobs because it is incapable of human work. 
It cannot speak to colleagues, it cannot accrue experience, it does not have instincts or culture or taste or anything other than whatever training data has been crammed up its ass or through endless post-training.  Nevertheless, the threat of AI job loss has been enough to allow both Sam Altman and Dario Amodei to raise hundreds of billions of dollars lying about it, and now that both of them have walked back their job loss scare-propaganda , every oaf and moron that believed them without actually checking should be booted out of their representative industries. It’s fucking embarrassing! You should all be ashamed of yourselves! As I said above, the ROI of AI should be really easy to measure if it actually existed.   If AI was magically able to build and maintain software, we’d have small companies that could build and deploy at the scale of a hyperscaler, and hyperscalers would, in theory, be expanding their margins so aggressively that it would create a new golden age of software revenues…or they’d become entirely infrastructure providers, as anybody else could compete on software. But on a far-simpler level, it would be extremely obvious. Anybody can access ChatGPT, Claude or Gemini, effectively anywhere in the world. The theoretical “power” of AI is that it “just does stuff,” and the proliferation of LLMs would mean that somebody would’ve “done” some “stuff” that we could point at with exceptional ease. Random guys in the midwest would be pumping out profitable, functional, and feature-rich software. Lawsuits would be won by pro se plaintiffs with incredible counsel from a theoretical “ country of geniuses in a data center .” Four years in, we’d have one major AI-powered company demolishing the competition in any industry, or every industry would become so prevalent with (powerful) AI that it would effectively reduce the cost of the service to nothing.  We’d be able to point to companies that adopted AI and then completely fucking exploded. 
We’d be able to point to useless coworkers who were now doing impressive, meaningful work. There would be widespread economic upheaval, as the concept of a “large company” would lose meaning, because those theoretical “geniuses in the data center” would be automating all the work. There also wouldn’t be so many pieces insisting that AI is super powerful and so many quotes from Business Idiots saying it’s “real.” We wouldn’t talk about what AI could do at all. We wouldn’t need Anthropic to lie that Mythos was too powerful to release only to release it several months later. We wouldn’t have to talk about the fucking potential at all because we’d be able to point to what was going on because it would be obvious! Last week, Bain & Co. released a study of 951 executives from companies with more than $100 million in revenue, and unsurprisingly, the data did not declaratively explain what the ROI of AI was: 10% of…what? What’s the cost you saved on? 10% of $10 million is a lot for a company with $100 million in revenue, but 10% of $1000 isn’t, much like 20% or 30% isn’t either! Yet there are two punchlines to come: This also assumes that those savings are enough to warrant future spending, which…this data does not actually prove. Thankfully, Bain did manage to publish one of the single-funniest quotes of the AI bubble: Put another way, the technology “worked (?),” but did not provide value in doing so. Sounds like it didn’t fuckin’ work to me! Bain had one other crucial bit of advice: Just so we’re clear, Bain & Co, a management consultancy with billions in annual revenue, is advising its clients that they should make sure that they’re getting some sort of return on their investment? And that reinvesting in something that doesn’t have a return on investment would be bad? If AI was real, these fucknuts would be replaced first! They’d replace everybody who wrote this report! You don’t need somebody to tell you this, and if you do you’re a fucking moron!
Thankfully, the AI industry is saved, as Sam Altman had the following to say about AI’s remarkable costs: Motherfucker you are the industry! You are the one that has to work this out! OpenAI is the AI industry! You are OpenAI’s CEO! You lazy, ignorant, dog-brained loser! This was an opportunity for “journalist” David Faber to push back, and here’s how that went: This is how the AI bubble inflated! This is how it happened! It happened every time a journalist asked a meaningful question and then immediately diverted to a totally different imaginary topic that made the subject feel good! David Faber, resign and give your job to somebody who has an iota of courage or pride in their work! Unbelievable! Sam Altman is worth billions of dollars, and OpenAI is allegedly worth $852 billion too, and the best he can give us is “teehee, someone else will work it out,” because Sam Altman is a loser that ingratiates other losers empowered by losers to sell loser technology to other losers, and the only way that he’s been able to do this is because the people that should know better are sitting around with their thumbs up their asses asking him whether there will be data centers in space. If AI had ROI, we wouldn’t be debating whether it had ROI. We wouldn’t discuss its potential, or whether it could, theoretically, under different circumstances, in the future, in a way that nobody can describe be super powerful and do all of the stuff it can’t do today. If AI had ROI, we’d be able to point with specificity to inarguable examples of economic impacts. AI boosters can jerk their binguses all they like about how Spotify’s CEO said its best engineers don’t write any code anymore. What does that mean? Is Spotify shipping better features, and are those features launching at a rapid clip? Is the software more secure, or stable? Spotify’s design still looks like absolute dogshit! Most software is worse! Things keep breaking everywhere, and in many cases it’s because of AI coding tools!
In fact, I’d be willing to believe that AI had a negative economic impact, increasing operating expenses across the board and giving some software engineers prompt-based concussions by automating some coding in a way that makes them lazy and bad at writing software by speeding up the process of writing code with so much of it that it’s impossible to review it all ( see Mo Bitar’s video ). LLMs appear to be able to write some code sometimes and do so at high speed , and ingratiates software engineers that don’t really care about writing software by making them feel like they wrote it.  While it might allow some things to go theoretically faster, the overall economic impact of AI-generated code appears to be worse code, worse software, and massive, multi-million dollar bills from Anthropic and Cursor . I will concede that some software engineers seem to like these things, and that many software engineers appear to be using them, but I am yet to see a single one who obsessively posts about their token spend create anything of note or worth, and none of these people appear to be able to point to the actual ROI of all that AI they’re using. I realize I’m painting with a broad brush, so let me get a broader one: I believe anyone who relies on LLMs for anything is a mark.  I don’t give a shit if you use them to spit out a script or do some simple sideline part of your job, or transcribe or dictate into them, or if you’ve used them as a search engine (and even then, you best check every source!), but the moment you rely on and run your entire process on these things, I immediately doubt your ability to do anything, or at the very least wonder how gullible you truly are when somebody ingratiates you enough. Why? 
Because every single “AI setup” I’ve seen anyone ever use involves a rube goldberg machine of bullshit deterministic scripts to try and bring the hallucination-guaranteed nature of LLMs to heel, usually to the point that you’re doing more work making the LLM work than you did before they existed, and you’re only proud of it because you feel like you’re special. There are, of course, exceptions. I’ve talked to a few people who describe LLMs normally, without hype, who tell very specific stories of very specific outcomes that save indeterminate amounts of time. There are some that have used LLMs to create python scripts to search and organize data, to which I say “you’re impressed with Python, not LLMs.”  If all we’re left with from this era is the ability for some people to write Python scripts without learning Python, this is still an egregious and horrifying waste of capital.  Remember: what you are using is the end result of over a trillion dollars of investment. It is only made possible through manufactured consent that actively misinforms people about the current and future capabilities of LLMs. They didn’t raise hundreds of billions of dollars by talking about any product currently on the market, and that’s because the current products are not very good products. You are all the victims of a con. No matter how “well” your Breakfast Machine of different API calls and if-this-then-that automations may or may not function, you have been sold a bill of goods for “artificial intelligence” that is impossibly stupid. When some of you are pushed to prove the ROI of AI, you immediately return to boring talking points about Uber, or the Dot Com Bubble, or some other slop fed to you by people actively conning you at this very moment.  I mean this with as much empathy as I can muster: if you’re a huge AI booster, why do you defend this so vociferously? What is it about my criticism that hurts? Is it that I’m yucking your yum? 
Is it that I don’t immediately ingest and regurgitate the theoretical idea that the thing you’re using all the time is or may become sentient? Is it because I’m not impressed? I think it’s far more likely that people are angry that I’m asking simple questions that should have satisfying answers, and don’t. I’m also fundamentally unimpressed with anything I’ve seen an LLM do, because my requirement for software or hardware is that it works as advertised, and the very fundament of the AI con is that LLMs are sold based on their theoretical capabilities. The reason nobody can show you the ROI from AI is that AI does not have a return on investment. Large Language Models can speed up some things in a way that becomes less valuable and less accurate as the complexity of the task grows, and more investment in AI data centers does not appear to do anything other than expand the number of tasks that an LLM can attempt. While some people have been able to get something out of generative AI, that something never seems to be a tangible or impressive achievement. Every “successful” AI story is a result of either ignoring the obvious problems with LLMs or mitigating them at a great cost for an aggressively expensive and mediocre result. LLMs are sold as “AI,” a technology best-known for automating things, yet they can’t be trusted to run anything on their own. Instead, they manipulate the user into covering up their errors, explaining away their failures, coddling their meager returns and crediting them with the actual labor that LLMs are meant to automate away. They do so by their investors and executives conning the media and the markets with outright lies and half-truths that exploit society’s weak points.
The media and markets are informed by people that neither understand technology nor history, and Business Idiots that have reached the heights of their careers through diplomacy and ratfucking that care only about attention and adulation for things that other people do.  LLMs coddle the easily-led and narcissistic into believing that the model is doing the work as the human being has to constantly cater to the model’s inefficiencies and inabilities, using more energy and resources than any technology ever made.  And yet with all the money, all the attention, all the resources, all the land, all the power, all the affordances and excuses and endless fucking applause for mediocrity, nobody can actually point to the ROI of AI, because it doesn’t exist outside of it burping out stolen content and enriching and ingratiating billionaire dullards. Even at a hundredth of the price I’d be dismissive, because everything I’ve seen is so decidedly unexceptional. I realize that some will say I’m dismissive of LLMs’ capabilities, and I’m sorry — I’m just not impressed. You spent a trillion dollars to make it somewhat easier to code some things sometimes but not in such a way that it actually results in anything, research reports that nobody will read, shitty powerpoint decks and excel spreadsheets, and art that looks like stock images because that’s exactly what it was trained on.  This shit needs to work every time without fail and be absolutely flawless and autonomous.  You are paying for a tool. You are paying for software. You are a customer. Your job is not to explain to others why this is exciting, nor is it your job to cover up for its mistakes. If you truly love this stuff you should be either secure enough in doing so that you don’t feel compelled to defend it or be demeaning to those that disagree. The fact that I have to write that sentence is proof that something is very, very wrong with the AI industry, and that LLMs are about far more than software.  
If you liked this piece, you should subscribe to my premium newsletter. It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 10,000 to 18,000 words, including vast, detailed analyses of the biggest events and companies in the AI bubble.  The foundation of software would be destroyed, as literally anyone could create and maintain any software they desired . Literally nobody would buy any software because they’d just type “computer make me a Slack clone for my organization” and it would magically appear on AWS.  The SaaSpocalypse ( see my premium here ) is a media and market-based hallucination where the collapsing growth of software companies is being explained as “AI taking their business” versus “private equity and venture capital overvalued software companies between 2018 and 2022 to the point that Apollo’s John Zito said “ all the marks are wrong ,” which is very bad, but nothing to do with AI. Accountancy would completely collapse, as nobody would need anyone but ChatGPT to do their taxes. Law schools would collapse, because legal internships would become useless and law firms would no longer have need for the thousands of new associates, because ChatGPT could just draft it all.  Legal salaries would also dramatically collapse. Research in effectively every discipline would collapse, because you could ask for a detailed report and said report would be better than any human being creates. The entirety of scientific research would change, because you could now automate many different disciplines out of existence.

Martin Fowler 4 days ago

Fragments: June 2

Greg Wilson has noticed that lots of folks are using dodgy metrics to figure out if AI tools are worth their costs. Would you measure lines of code generated, or tickets closed? Or would you send out a survey asking whether developers feel more productive? Each of those approaches is flawed in a different way; He lists lots of common metrics, and why they are flawed. Sadly he doesn’t give any suggestions on what would be better. In my view, since we cannot measure productivity , any metrics are weak evidence at the best of times. I do somewhat use one of his flawed measures: “Asking Developers If They Feel More Productive”. While I acknowledge the problems he gives with this measure, I find that in an environment where decent measures are hard to find, even such a dim light is the best we have. In this situation these kinds of qualitative metrics may not be conclusive, but they are useful . ❄                ❄                ❄                ❄                ❄ Benedict Evans observes that extensive automation didn’t mean the demise of professions in the past. we spent a century automating accounting: we built calculating machines, punch cards, mainframes, data processing, databases, PCs, spreadsheets, ERPs, cloud… in fact, we built half of the tech industry around automating this. Yet the number of accountants kept going up. He goes into the myriad of problems that exist when we’re trying to forecast the impact of a technology on jobs. There’s the much-talked-about Jevons paradox - once something becomes cheaper, people do it more, which can increase demand. Often this leads to the nature of jobs changing, even if it’s called the same thing. Accountants today aren’t doing exactly the same work that they did in 1970 or 1980 ‘but more’ - they’re still called ‘accountants’ but the job is different. New technology often starts out being used for ‘the old thing but more’, but it rarely ends up like that. 
Technologies often affect whole businesses - consider the impact of the internet on news publishing. Did anyone observing the rise of smart phones in the early 2000s realize that one consequence would be a change in the economics of taxis due to the rise of ride-sharing apps? The conclusion is that it is, at the very least, almost impossible to forecast the impact of AI on our work. ❄                ❄                ❄                ❄                ❄ Stephen O’Grady looks at how closed and open models have performed on benchmarks over time. Closed models are setting the pace of innovation, and constantly breaking new ground from a capabilities standpoint. Open models are chasing them, and the cycle times seem to be getting shorter. There are no clear capability moats, and what is frontier today is table stakes tomorrow. It took 13-18 months for open models to catch up to GPT-4 on these benchmarks, but only 2-7 months to catch up to GPT-4o. There are a bunch of caveats to this analysis, which he lists, but it’s a worthwhile survey of how various kinds of models perform against the various measures we are trying to assess them with. ❄                ❄                ❄                ❄                ❄ One of the starkest examples of sloppy AI use is hallucinated citations - a give-away of both usage of LLMs and carelessness driving them. GPTZero is a company that makes tools to detect AI writing. I’ve no insight as to whether their tool is effective or not, but they do publish investigations of AI usage, and have published several articles highlighting hallucinated citations. One post focuses on Ernst & Young Canada’s report on cyber threats to loyalty systems and found that more than half its references were hallucinations. The post uses a lot of extremely annoying animations in how it presents its information (breaking Safari’s reader mode in the process).
But the harm that this kind of AI-generated report can do goes further than just some misled humans: Publishing a report online is essentially a form of data injection into the pool of knowledge that is the internet. When the report includes fake information (either vibed citations or false claims) it can “poison the well” by misleading future researchers, especially if the report is published by a well-known consulting firm and hosted on a high-traffic website. ❄                ❄                ❄                ❄                ❄ As LLMs get more capable in programming, we are rightly worried that people will use them to attack software systems. But these models can also be used for defense, allowing teams to find bugs before attackers do. Some folks from Mozilla posted an article on how they’ve used AI models to identify and fix an unprecedented number of latent security bugs in Firefox. Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it’s cheap and easy to prompt an LLM to find a “problem” in code, but slow and expensive to respond to it. It is difficult to overstate how much this dynamic changed for us over a few short months. This was due to a combination of two main factors. First, the models got a lot more capable. Second, we dramatically improved our techniques for harnessing these models — steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise. During 2025, there were 17-31 security bugs fixed each month. In April 2026, they fixed 423. ❄                ❄                ❄                ❄                ❄ Pavel Voronin riffs on Unmesh Joshi’s post on What is Code. He observes that cruft in a codebase (technical debt) has always added friction to software development.
But the consequences of this cruft are compounded when LLMs are using existing code as context for future work. In a degraded codebase, the model does not see “technical debt” as debt. It sees examples. It sees precedent. It sees a style to continue. LLMs multiply what’s currently happening. I hear reports that good code might take the place of much of what’s put in markdown, because LLMs will imitate what’s already in the code base. But bad code multiplies too. Inevitably he introduces another variation of rampant debt metaphors: Cognitive debt accumulates when a team uses abstractions it no longer understands. Generative debt accumulates when a codebase contains confused concepts that models are likely to continue. Cognitive debt is about what the team no longer understands. Generative debt is about what the model is now likely to reproduce. ❄                ❄                ❄                ❄                ❄ Jason Koebler, from the very worthwhile 404 Media, has written a plaintive essay on how AI-generated slop is driving us crazy. Not just because it’s filling the web with this slop, but also because of how it’s making us humans react to slop and the threat of slop. We review our own writing and notice: it’s not just reading AI slop that hurts us, it’s the risk that we write something that looks like AI slop. If I use phrasing that AI copied from me, does it seem like I’m copying AI? This has led to the appearance of “humanizers” - AI tools that make our writing look less like AI. Humanizers add typos, randomly replace words, remove “AI tells,” and sometimes insert random characters. It’s another step on the way to the Zombie internet: I called it the Zombie Internet because the truth is that large parts of the internet are not just bots talking to bots or bots talking to people. It’s people talking to bots, people talking to people, people creating “AI agents” and then instructing them to interact with people.
[…] It’s my email inbox, in which I used to occasionally get poorly-formatted, poorly written, extremely long emails from delusional people who were positive the CIA had imprisoned them in a virtual torture chamber using undisclosed secret technology but where I now get well-formatted, passably written, extremely long emails from delusional people who are positive they have proven AI sentience and have the AI transcripts to prove it. ❄                ❄                ❄                ❄                ❄ Andy Osmani points out that spawning lots of agents is like launching a bunch of parallel processes that all rely on a single orchestrating thread - yourself . Python has the Global Interpreter Lock (GIL). You can spawn as many threads as you want but only one executes python bytecode at a time because they must acquire the lock. You are the GIL of your AI agents. They all can run at once. But when any of their work needs genuine understanding of the architecture or resolving merge conflicts, that work has to acquire the lock. There is one lock. You hold it. This means you must design the workflow with the agents with that GIL in mind. You shouldn’t launch more agents than you can properly review. It’s handy to separate background tasks that can be offloaded to an agent from complex tasks that require applied attention. Don’t use that precious brain for things that the machine can verify itself. [And I’d add - do get the machine to build tools that ease human verification. For example, it’s better to surface test case data in tables rather than buried in assert statements.] Spawning agents is not the skill. Anyone can run 20. The real skill is designing the system around the one serial resource that cannot be cloned or parallelized. That resource is your attention. ❄                ❄                ❄                ❄                ❄ Jamie Hurst is a Principal Engineer at booking.com, where he works in developer experience with a focus on AI tooling. 
He’s written realistically about the gains and losses of using LLMs in this work. The cost of building has collapsed, but the cost of aligning organisationally has not. If anything, it’s gone up. When three different teams can each produce a working solution to the same problem in the time it used to take to write a proposal, the bottleneck moves from engineering to coordination. He thinks he’s able to do more as a senior engineer, but is concerned about how sustainable it is, both for him personally and for the organization he works for. He’s able to shape directions for multiple workstreams at once, in a way that he couldn’t three years ago. But one loss is that he doesn’t have enough time for mentoring, which will exact a toll on his employer in the longer term. He also finds he doesn’t have enough time to think. The productivity gains from AI got captured by output volume rather than output quality. The org’s expectations rose to absorb the speed-up, and the slack that used to exist between tasks, the unstructured time where strategic thinking actually happens, got eaten first because it’s invisible on a dashboard. I’m at a point in my career where thinking is supposed to be most of the job, and most of it now happens on holiday because the working week doesn’t accommodate it.

Giles's blog 1 week ago

On first looking into JAX

Much have I travell'd in the realms of gold, And many goodly states and kingdoms seen; Round many western islands have I been Which bards in fealty to Apollo hold. Oft of one wide expanse had I been told That deep-brow'd Homer ruled as his demesne; Yet did I never breathe its pure serene Till I heard Chapman speak out loud and bold: Then felt I like some watcher of the skies When a new planet swims into his ken; Or like stout Cortez when with eagle eyes He star'd at the Pacific -- and all his men Look'd at each other with a wild surmise -- Silent, upon a peak in Darien. John Keats, On First Looking into Chapman's Homer I've been working with PyTorch quite a lot for the last couple of years, and feel like I've come to a reasonably solid understanding of how it all fits together. Working through Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", training my own LLMs locally and in the cloud , rebuilding Andrej Karpathy's 2015-vintage RNNs -- over time, it all adds up! But, of course, there are other frameworks, and one I kept hearing about was JAX . While it's less dominant than PyTorch, it has a reputation for a certain cleanliness, a certain purity. And having spent time over the last couple of weeks working through the tutorials, and translating small PyTorch examples into it, I've been really impressed. In this post I want to give an overview -- to report back to beginners like me, still living in PyTorch-land, on my new discovery. Less like Herschel discovering Uranus, and more like a 16th-century European coming back after having discovered something that the people who lived there were perfectly well aware of. What is this JAX thing, and how does it differ from PyTorch? I think that the main differences between PyTorch and JAX are something like this, but a little less strident: Having overstated my claims, let me dig in and perhaps walk them back a bit. 
Once I've gone through them, I'll do a walkthrough of porting a simple PyTorch training loop to JAX, which should illustrate the points well. Finally, I'll wrap up with the counterargument. JAX is wonderful and shiny, and 30+ years of industry experience and cynicism makes me fear that it might be doomed :-( But let's start with the positive! [Happy face on.] A simple example that nicely contrasts the different philosophies of the two frameworks is what the core of a training loop looks like. Here's how you might write one in PyTorch: This is kind of mechanistic. You're telling the computer what to do, step by step: Now let's look at a parallel JAX implementation: It's clearly very different. No explicit backward pass, no gradient-zeroing, and the forward pass and loss calculation are baked into a separate function. But why is it shaped that way? Let's think about what we're actually doing in our training loop. The gradients are the partial derivative of the loss function ℒ against the weights W : Now, I'm being a bit sloppy with that notation, because ℒ is a function, and it -- in the mathematical formulation -- takes the weights as a parameter. So it would be better written like this: But that's still not quite right. In a real training loop, we're doing this in the context of a particular input batch, X , and its associated targets, Y . 2 We might write that mathematically as this: ...where you can read the colon as "given". Now let's look again at the JAX code to work out the gradients: That's an almost-perfect mirror of the maths! The function takes a function , and returns another function, , which takes the same arguments. When you call , instead of returning the result of , it will return the derivative of with respect to its first argument, given the values of the others. 3 How is it doing that magic? Let's look at a simple concrete example: If you do the initial call to : ...then it just wraps in a helper function. 
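As a minimal sketch of the "function in, derivative function out" idea described above: the function `f` and its argument names here are hypothetical, chosen only for illustration, and are not taken from the original post. The only library call assumed is `jax.grad`, which by default differentiates with respect to the first argument.

```python
import jax


def f(x, y):
    # Hypothetical toy function: x squared, scaled by y.
    return x ** 2 * y


# jax.grad takes a function and returns a new function with the same
# arguments; calling it yields df/dx (the first argument) rather than f.
df_dx = jax.grad(f)

# Analytically, d/dx (x^2 * y) = 2 * x * y, so at x=3.0, y=2.0 this is 12.0.
result = df_dx(3.0, 2.0)
print(result)
```

Note that the inputs are floats; `jax.grad` expects real-valued inputs for the argument being differentiated.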
It's when you call that the magic happens. ...will print out this: The first parameter -- the one with respect to which we're asking for the derivative -- is replaced by a object. Because it's wrapping a float, it can be used like one, so the function executes as expected. But it also keeps track of what happens to this variable as the code executes, and essentially builds up what in PyTorch would be represented by the computation graph. So: while in PyTorch, the variables that you pass in to a function that you need gradients for need to be special PyTorch objects that can keep a reference to those gradients -- the parameter that pops up frequently in PyTorch code -- in JAX, it's all handled by variables being automatically wrapped in these special tracers. Once it has the results of the function as a whole, including the chain of operations that was traced, it can automatically do a backward pass, and we're done. That's really nifty! Now, the example above was a toy one, with just one parameter. In a real training loop, you're differentiating against a set of weights, and those will be something more complex. But handles that gracefully. Let's see what happens if we pass in an array as the first parameter: So, we've got partial derivatives with respect to the elements of the array that was the first parameter -- just what we'd need for a single-layer neural network without bias. But what about something more complicated? For something like (say) an LLM, we have quite a lot of structure to our weights: our input embeddings, output head, all of the layers with their attention and feed-forward weights, and so on. handles that by understanding basic Python structures -- things that can be mapped to what JAX calls PyTrees. PyTrees are nested tree structures of dictionaries, lists, tuples and so on, where the leaves are numbers or JAX arrays 4 . 
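To make the PyTree point concrete, here is a small sketch under stated assumptions: the parameter layout (a dict with keys `"w"` and `"b"`) and the `loss` function are invented for illustration and are not the post's own code. It shows that the gradients come back with the same structure as the parameters that were passed in.

```python
import jax
import jax.numpy as jnp


def loss(params, x):
    # params is a dict (a PyTree) with hypothetical keys "w" and "b".
    return jnp.sum(params["w"] * x + params["b"])


params = {"w": jnp.array([1.0, 2.0]), "b": jnp.array(0.5)}
x = jnp.array([3.0, 4.0])

# Differentiate with respect to the first argument, the whole dict.
grads = jax.grad(loss)(params, x)

# grads is a dict with the same keys as params:
#   d loss / d w equals x,
#   d loss / d b equals 2.0, because b is broadcast over two elements.
print(grads)
```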
If you ask for gradients of a variable that can be represented by a PyTree, you get them back in a form that mirrors that PyTree: If you combine that with JAX's tree-aware function, you can combine those gradients with the original parameters to update them as you train. I'll show you how that works later on, when we go through an example of porting some PyTorch code to JAX. So, all of that cool stuff was made possible by the tracer objects, which are passed in instead of the real parameters, and keep track of the computation graph (just like the graph that PyTorch attaches directly to the variables). But tracers are more generally useful than that; they really come into their own with the next JAX difference: the JIT. Imagine that you've built some kind of nifty model in PyTorch. As part of it, you do a calculation something like this: You decide that this is generally useful, so you code it up as a CUDA kernel and make it available to the community, like Erik Kaunismäki has with his "MaxSim" kernel. Maybe later on, it will get added to the PyTorch library as a standard component. There are a lot of optimisations like that built into PyTorch; people found that there were higher-level abstractions on top of basic tensor operations that were generally useful, so they coded up lower-level optimised versions. For example, in the LLM I've been working with, there is an implementation of LayerNorm . But PyTorch has its own one built in . And there's a CUDA implementation that it will use automatically if it has the appropriate hardware available. There is a problem, though. Imagine that someone else is working on a different kind of model in the future. And for reasons completely unrelated to the MaxSim calculations that Kaunismäki nicely optimised, they happen to need to do the same calculations. 
Now, there are two things that can happen from there: The first is not ideal; but the second isn't great either, if what they're using it for is not a MaxSim operation in reality, just something that happens to look the same mathematically. In the general case: all optimisations that get into PyTorch have to be carefully named so that they reflect the exact level of abstraction that they're targeting. And when people are writing PyTorch models, they need to actually know which optimised abstractions are available, and where to apply them. Now let's look at JAX. It has an innocuous-looking decorator, , and you can use it by adding a single line before your function: Behind that single line is a huge amount of useful infrastructure. Just like , it's a function that takes one function and returns another, without necessarily running the underlying code. 5 But when you call the wrapped function for the first time, some impressive stuff happens: This will essentially execute the code twice: The first time through, it will create another of those tracer objects; this time, though, it won't wrap the number -- it will just know that it is a wrapper for a float. It will call the Python code with that tracer, and all of the operations in the function will be run, but the result that comes out at the end will essentially just be a representation of what calculations were done in an abstract sense -- like the computation graph that was used for working out gradients, but without specific numbers in it. JAX has a nice way to display these representations as what it calls JAXPRs, and the JAXPR for that function's representation when called with a float parameter will look something like this: That JAXPR can be compiled into the appropriate code for the platform where you're running it -- x86 machine code, compiled CUDA, the equivalent for AMD or Google Tensor Processing Units (TPUs), and will be cached. 
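The trace-then-compile-then-cache behaviour described above can be sketched as follows. The function `scaled_square` is a hypothetical stand-in rather than the post's example; `jax.jit` and `jax.make_jaxpr` are the only JAX calls assumed.

```python
import jax
import jax.numpy as jnp


def scaled_square(x):
    # Hypothetical example function.
    return 3.0 * x ** 2 + 1.0


fast = jax.jit(scaled_square)

# First call with a float scalar: the function is traced, compiled and cached.
a = fast(2.0)
# Another float scalar has the same metadata, so the cached version is reused.
b = fast(5.0)
# A vector has different shape metadata, so a separate version is compiled.
c = fast(jnp.array([1.0, 2.0]))

# make_jaxpr shows the traced representation for a float scalar input.
print(jax.make_jaxpr(scaled_square)(2.0))
```

The exact text of the printed JAXPR varies between JAX versions, so it is not reproduced here.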
The key for the cache will be meta-information about the parameter -- in this case, something like "a 32-bit floating-point scalar". Next, the compiled code -- not the original Python -- is run with the actual value of the parameter, the that we provided. Now, of course, the advantage of doing this is that when you call it with a different floating-point number -- say, -- then you don't need to do the compilation again. You can just rely on the cached version. And the fact that the compiled code is cached based on the metadata means that if you call with a vector, then it will compile a new version for that, and likewise for a matrix version. 6 This is all really nifty, and you can see how it would help right away. But for me, at least, an excellent extra benefit is how it can save people like Erik Kaunismäki the bother of writing custom kernels. The compilation that happens, taking the representation that it got from the tracing process and turning it into backend code, goes through an optimising compiler, XLA . And that compiler can recognise "standard" operations and combine them together. This won't be at the level of "standard operations" like MaxSim, of course -- more, "this looks like a convolution, let's use the standard kernel". But it does mean that instead of someone having to take code written in Python and hand-port it over to CUDA to get a GPU speedup, the same expertise can be put into improving the optimisation part of XLA to get a speedup for all code. That's pretty amazing. However... If you want something like the JIT to work properly, you need to limit the kind of code that it works with. In particular, it needs to be functional. A function must always return the same value when given the same inputs -- so this is fine: ...but this will cause problems: ...because could be changed. 
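A minimal sketch of the kind of impure function being described, assuming a hypothetical global named `scale` (the name and values are illustrative, not the post's): the value the global holds during the first, traced call is what ends up in the compiled version.

```python
import jax

scale = 2.0  # hypothetical global state


@jax.jit
def impure(x):
    # Reads a global: whatever value it has during tracing is baked in.
    return x * scale


first = impure(10.0)   # traced and compiled while scale == 2.0
scale = 3.0
second = impure(10.0)  # same input metadata, so the cached version is reused

# Both results are 20.0; the change to `scale` is not seen.
print(first, second)
```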
Specifically -- because the global had the value during the initial traced run of the function, that value will essentially get hard-coded into the cached JITted version, so both prints in the second example will output . Something slightly surprising comes out of this -- something that makes JAX code look very different to PyTorch. How we handle randomness needs to completely change. Consider this code: As a whole, it's deterministic. But it breaks the functional requirement that the function can only depend on its inputs. Both calls to take the same input, but they return different results. Even worse, if we were to do something that consumed randomness between those two calls to , for example: ...we'd get different results. The state of the random number generator is global state kept outside the function, just like in the example above. A naive solution to this might be to make the state of the RNG explicit as a variable -- you can imagine a library that worked something like this: That looks more functional, but when you think about it, we haven't actually fixed the problem. We're passing the same variable in in both cases, along with the same number, but we're getting different results. It's not global, but it's still mutable behind the scenes. What you'd actually need to do to make it purely functional would be something like this: The function is generating a new random integer and returning both that and the new state of the RNG, then we pass that back along with our result. We've made the random state variables immutable, and so it's functional. But the API is getting pretty ugly pretty quickly. So JAX does something that is equivalent, but a bit cleaner. There's a concept of a key , which needs to be passed into any function that consumes randomness: That's kind of like the that we have in the first version of the code above. 
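As a hedged illustration of explicit random keys, the sketch below uses `jax.random.PRNGKey`, `jax.random.randint` and `jax.random.split`. The seed and bounds are arbitrary choices for the example, and the reassignment pattern at the end is just one common way to thread keys through code.

```python
import jax

key = jax.random.PRNGKey(42)  # arbitrary seed for illustration

# The key is immutable: the same key gives the same result every time.
# Arguments are (key, shape, minval, maxval); () means a scalar result.
a = jax.random.randint(key, (), 0, 10)
b = jax.random.randint(key, (), 0, 10)

# To "move on", split the key into new ones. Rebinding the name `key`
# is fine in functional code; nothing is mutated in place.
key, subkey = jax.random.split(key)
c = jax.random.randint(subkey, (), 0, 10)

print(int(a), int(b), int(c))
```

Here `a` and `b` are always equal, while `c` is drawn from a different key and so is an independent sample in the range 0 to 9.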
But it's immutable; when you use it, like this: ...it will not be changed, so no matter how many times you call it with the same key, that function will return the same value. (Note that takes an inclusive lower bound and an exclusive upper bound, like Python's , but unlike the stdlib's . It also needs to know the shape of the result -- for a scalar, for a 1x2 array, and so on.) If you want it to "move on" to a new state, you use the function, which takes an existing key and returns two (or more) new ones. So you can do something like this: Now, that and stuff is a bit ugly, but while it's not OK to mutate the contents of variables in functional code, it's absolutely fine to assign a new value to an existing one, so what I've found myself doing is writing stuff like this: However, there are more powerful ways to use ; I'm not confident enough at using it yet to go into that, though, so I'll hold back for now. I suspect (assuming I keep using JAX) I'll be posting about them in the future. OK: so the JIT means that we have to write functional code, which makes things a bit fiddly -- no more global state. And that has a surprisingly big knock-on effect with randomness. But there's another thing that comes out of the JIT and the way it does tracing. It's not a functional thing (though some of the docs seem to almost be treating it that way), but is caused by the same kind of constraints. It's not part of my four theses above, but I think it's important enough to call out in its own subsection. Imagine this function: It's purely functional, so no problem there. But let's think about what the JIT is trying to do. It wants to convert the function into a simple sequence of operations, so it will create a tracer for a floating-point scalar, then call with it. When it hits that statement, there will be a problem. The tracer is meant to represent any arbitrary float, so should it take the branch or not? There's no good answer. 
It doesn't know which branch to follow -- whether the sequence should be "square it and return the result" or just "return it directly" -- and will fail with a somewhat obscure error message: So this gives a hard constraint on functions that you want to JIT: by default, they can't base control flow on the values you pass in. There is a workaround -- but it comes with tradeoffs. Let's take a slightly sideways route to explain it. Firstly, although you cannot do control flow based on the value of a parameter -- which the tracer doesn't know -- you can base it on other information that actually is stored in the tracer. Let's say that we called like this: The tracer that would be passed in when trying to trace the function would be something representing a 2x2 array. The shape of the parameter is part of the tracer, even though the values aren't. So you could do something like this: ...and it would work. It's worth thinking explicitly why this is. When you call a JITted function, it will create a tracer that contains information about the type of thing you passed in as a parameter -- scalar versus array, and if it's an array, the array's shape. It then runs the function with the tracer, gets the sequence of operations, compiles them and then stores the result in a cache keyed on the metadata -- type and, if appropriate, shape -- that it used to create the tracer. So when we call that function with a 2x2 array, we get a 2x2 array version, then if we call it later with a one-dimensional array of length 2, we'll get a new version for that. One workaround for basing control flow on values is essentially to tell the function that it should treat the values of a particular variable as being like the metadata used for this cache keying: it should compile a new version for each value it sees, rather than just using the metadata. It takes a parameter , and a matching , which tell it which parameters to do that with. 
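The sketch below illustrates both halves of this: a value-dependent branch failing under plain `jax.jit`, and the same function working once its argument is marked static. The function name `square_if_positive` is hypothetical, and `static_argnums=0` is one way of marking the first positional argument as static.

```python
from functools import partial

import jax


def square_if_positive(x):
    # Python-level branch on the *value* of x.
    if x > 0:
        return x ** 2
    return x


# Plain jit: x is an abstract tracer, so the `if` cannot be resolved.
try:
    jax.jit(square_if_positive)(3.0)
    failed = False
except Exception:  # JAX raises a tracer-to-bool conversion error here
    failed = True


# Marking the argument static: a new version is compiled per distinct value.
@partial(jax.jit, static_argnums=0)
def square_if_positive_static(x):
    if x > 0:
        return x ** 2
    return x


r1 = square_if_positive_static(3.0)   # compiled for x == 3.0
r2 = square_if_positive_static(-2.0)  # compiled again for x == -2.0
print(failed, r1, r2)
```

Static arguments must be hashable, since their values become part of the cache key.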
So, this will work: (Remember that the thing after the for a decorator needs to be a function that returns a function, so we have to use to "inject" in the extra argument.) However, the downside is pretty clear: every time we call with a new value, it's going to have to JIT a new version of the function and cache it -- that's going to be slow and take up memory. So, as an alternative, we can use the package . This provides more functional-looking alternatives for control flow, which are compatible with the way the JIT works. For example, there's a function, which we can use to replace s: That feels a little bit like a workaround, but it does solve the problem. How? Well, it's worth checking the JAXPR for it: What's happened here, I think, is that the JIT has recognised the call to as being a primitive function in its intermediate language, so has just kept it in there. It couldn't do that with the because when it was tracing, all JAX itself saw was what was happening to the tracer -- there was a boolean comparison, and then the stuff in the chosen branch happened. The fact that there was an there happened in Python itself, outside JAX, so it was "invisible" to the trace. That feels a little inelegant to me right now, and I'll come back to it later. Let's move on to the final difference between the two libraries that I want to cover: JAX's relative minimalism to PyTorch's more maximalist approach. I think the smaller size of JAX -- at least in terms of its API, if not in terms of the JIT and XLA magic under the hood -- compared to the sprawl of PyTorch is not entirely unrelated to the JIT being at its core. PyTorch, after some initial design, has almost been forced to grow organically; JAX feels more carefully designed, so it doesn't have the same need to grow (though of course it can). The reason for PyTorch's growth is, at least in part, because it needs to absorb optimisations. If something is slow, someone needs to write a CUDA kernel for it. 
If there's a CUDA kernel, it needs an API. And if it is generally useful, that API becomes part of PyTorch. Multi-head attention? There's a class for that . SELU? Yup . Very specific softmax approximations based on a paper published in 2016? PyTorch has you covered . By contrast, JAX doesn't even have linear layers or optimisers in the framework itself; if you want to use them, you can write them yourself (contraindicated), or you can use libraries built on top of JAX , like Flax for common neural network components and Optax for optimisers. This feels like a nice division of responsibilities, and it also seems like something that would have been very hard without the JIT. So while the JAX core may well grow in the future, the design it has now puts it in a good position to grow in a more planned, well-designed manner -- rather than having to grow to absorb more and more abstractions just to keep it fast. Those abstractions can more easily sit in libraries written on top of JAX. That's the 10,000-foot overview; four (or maybe four and a half) main differences between PyTorch and JAX. It's more maths-y, JITted, functional and minimalist. What does that actually mean when you get down to coding with it? Let's get into the weeds with an example. Let's use a really simple one: training a neural network with two inputs and one hidden layer to calculate the XOR function. The code is in this GitHub repo , but I'll put the relevant bits here in this post. Firstly, an idiomatic PyTorch implementation: If we run that, it trains a solid-looking model in about four seconds on my machine: Now, if we're porting to JAX we need to do something about the fact that JAX doesn't have optimisers and the neural network stuff built in. If this was a real codebase, we'd almost certainly do that by using the libraries built on top of JAX, like Flax and Optax. 
But for this toy example, I think it's more illustrative to strip down the PyTorch version so that it uses fewer parts of the API -- essentially so that it only uses the stuff that JAX has -- and then to port the result. The optimiser first. The code is here but the diffs are pretty simple. Instead of creating an optimiser, we just specify our learning rate: Instead of zeroing out the gradients using the optimiser, we can just ask the model to do it: And instead of stepping the optimiser, we call a new function passing in the model and the learning rate: The function is simple enough; we just switch into mode so that PyTorch doesn't try to track the computation graph (working out gradients for applying gradients and triggering some kind of crazy gradient-ception), then we just iterate over the model's parameters and follow the normal SGD process, subtracting the gradients times the learning rate: Running that on my machine actually works out slightly faster than the original 7 ! It's also quite nice to see that (within the bounds of the printing precision) the loss and the final results are identical. OK, so now that we've got rid of the optimiser, let's do the same with the s. Here's the code , but let's do a quick walk through the differences. Instead of creating an , we will just generate an array of layers: Zeroing out the existing gradients will also need to be done on those layers: ...and likewise our loss calculations and the function will need to use them: We used a couple of new helper functions there; this one generates the initial weights for the layers (based on the docs for ): Note that each of the tensors we created, the and the need to be explicitly told, using , that we're going to want PyTorch to track gradients on them. Zeroing out the gradients is just a case of chugging through each layer, and then for each setting the weights' and the biases' gradients to : Now, to calculate the loss, we're actually not changing much. 
We had this: ...and now we just change it to this: That is, we've added on a new function to do a forward pass through the given layers with the given parameters. That looks like this: Standard NN stuff . A quick tweak to use in the printing of the results at the end: ...and we're done! Let's run it: Even faster! Sounds like there aren't any nice pre-baked optimisations in that part of PyTorch, then... But again, within the bounds of our precision, that's exactly the same numbers as we got from the original PyTorch version, which is very reassuring. OK, now that we've got something that's kind of JAX-shaped, let's port it over. I think it's worth showing all of the code for that (though it's here on GitHub if you want to view it there), and then I'll highlight the important diffs separately. If you look at it side-by-side with the previous PyTorch implementation , you'll see that it's really similar! Running between them makes them look more different than they are because of the extra threading through of keys that we need to do in order to satisfy the strict constraints on random number handling in JAX, (and of course there are function name changes like becoming and becoming ). But the important changes are much smaller. Firstly, weights and biases no longer need to know that we'll want to track gradients for them, because that's all handled by the tracers that JAX wraps around them: Relatedly, the function that iterated over the layers and zeroed out the existing ones is completely gone. Because gradients are now stored on tracers that wrap around our parameters rather than on the parameters themselves, we don't need to zero them out. The step function is still there, though, but it's much simpler. Before we get to that, let's take a look at the way we're getting the gradients for it, in the main training loop. 
Here's the diff: Hopefully the change there will be nice and familiar from the start of this post: we've moved from the PyTorch procedural "do a forward pass then do the backward pass" to the JAX maths-y "work out the gradients for this function". is a utility function that does the same as the we encountered then, but rather than just returning the gradients, it also returns the value of with the given parameters, which is useful for our logging. Now, remember that is a list of dictionaries, something like this: And also remember that -- and likewise -- have that smart trick where they return the gradients in the same PyTree structure as the parameter that we're taking the derivative with respect to. So will also be a list of dictionaries, each of which has and . Now, as I mentioned earlier, JAX has a useful function called . Like the Python function that maps a function over one or more lists, JAX's version maps a function over one or more things with the same PyTree structure. So, because and have the same structure, our function can just use it to apply simple gradient descent like this: Very clean :-) That's it! A full JAX implementation of our toy example, and when we run it: ...it works! So, let's move on to... Yikes. It was almost 30 times slower than the PyTorch version. But then -- we did all of that work to port the code over to JAX, which is great because it has a JIT, and then we didn't use the JIT. Whoops! Adding a few calls to helps. If we add them to the , and function then we get this code , which is faster: ...but it's still almost eight times slower than the PyTorch code. How can we make it faster? Well, perhaps we can do more if we put more of the loop into the JITted stuff. Right now, the core of our training loop looks like this: and are JITted. But what happens if we try to JIT a larger step? 
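To illustrate what JITting a larger step can look like, here is a self-contained sketch and not the post's actual code: the layer layout (a list of dicts with hypothetical keys `"w"` and `"b"`), the tanh activation, the mean-squared-error loss, the learning rate and the initialisation are all assumptions made for the example. It combines `jax.value_and_grad` with `jax.tree_util.tree_map` inside a single `jax.jit`-wrapped function.

```python
import jax
import jax.numpy as jnp


def forward(params, x):
    # params: list of {"w": ..., "b": ...} dicts (hypothetical layout).
    for layer in params[:-1]:
        x = jnp.tanh(x @ layer["w"] + layer["b"])
    last = params[-1]
    return x @ last["w"] + last["b"]


def loss_fn(params, x, y):
    return jnp.mean((forward(params, x) - y) ** 2)


@jax.jit
def train_step(params, x, y, lr):
    # Loss value and gradients in one call; grads mirrors params' structure.
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # Plain gradient descent, applied leaf by leaf across the PyTree.
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss


k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = [
    {"w": jax.random.normal(k1, (2, 4)) * 0.5, "b": jnp.zeros(4)},
    {"w": jax.random.normal(k2, (4, 1)) * 0.5, "b": jnp.zeros(1)},
]
x = jnp.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = jnp.array([[0.0], [1.0], [1.0], [0.0]])  # XOR targets

params, first_loss = train_step(params, x, y, 0.1)
for _ in range(200):
    params, loss = train_step(params, x, y, 0.1)
print(float(first_loss), float(loss))
```

Because `train_step` returns new parameters instead of mutating the old ones, the loop simply rebinds `params` on each iteration.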
We can move the forward pass and the step into a JITted function on their own: ...and then call it in the loop like this: With that, all of the JAX code apart from input and target wrangling is moved into a JITted function. We get this code , and running it gives us this: Woohoo! Almost 45% faster than the PyTorch version :-) So: porting to JAX alone gives us nice maths-y code, but we need to JIT it properly to get performance that matches PyTorch. (The fact that it's faster than PyTorch in this case is not something that I think you could rely on -- this is, after all, a toy example.) It's also an interesting indicator that you actually need to think about what to JIT. My initial thought, "just whack an on the inner stuff", was not enough. We needed to do more than that. I've just had an interesting chat with Claude Opus 4.8 about that, though, and will probably post more about it later. For now, I think a useful rule-of-thumb is to wrap stuff in at as high a level as you reasonably can, to maximise coverage. So, this completes the happy part of this post -- I've shown what it can do, how nicely it maps to the maths, and how it's (relatively) easy to make it fast. What are the downsides? Another deliberately overly-strident heading ;-) I've been programming for more than 40 years, and working professionally in the tech industry for more than 30. I'd like to feel that this makes me a better engineer than I was when I was first starting out, but I can confidently say that it has made me a much more cynical one. Over that period, I've come to categorise new APIs, languages, and tools into three approximate groups: godawful hacks, solid but not overly inspiring engineering, and things of beauty. They're loose categories, and most things are somewhere between one and another. But I think they hold reasonably well. 
My cynicism and experience tell me that: When we were building our programmable spreadsheet, Resolver One, some of the team pointed out that a functional language -- specifically, Haskell -- would be a better fit than Python. It was a tough decision to stick with Python, and I'm still not 100% sure it was the right one. But I do remember having sales meetings with quants at various financial firms about it, and in those meetings, some of the potential customers also suggested a Haskell port. I'm not saying that there's a perfect correlation between where we heard that, and the later notes in our sales status spreadsheet saying "client being acquired by a non-bankrupt competitor, all expenditure on hold" during the 2008 financial crisis. But I'm not not saying that either. If you've read this far, you can probably tell that I see PyTorch as solid engineering, and JAX as closer to a thing of beauty. Maybe it's just the cynicism of age, but let me try to articulate the things I worry might put JAX into the "beautiful but doomed" side of the "beautiful" category. Firstly, I'm not convinced by the way that JAX, with its JIT, requires you to try to write Python as if it were a functional language. It's easy enough to see that this isn't functional: ...but harder with this: Even worse, the way that tracing works means that you have even more constraints than "just" being functional would require -- remember this example from earlier? Python is not functional, and is deliberately so. Trying to make it so is always going to lead to weird bugs (for example, how the value of the global on the first run would be baked into that function) and hard-to-understand error messages (you really need to be clued-up to work out what means). The package -- for example, the function we used to work around the fact that JAX could not "see" the Python way back in this post -- feels like a bit of an ugly workaround. 
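For reference, the kind of rewrite being criticised here looks roughly like the sketch below, which uses `jax.lax.cond` in place of a Python-level branch. The function name `square_if_positive` is hypothetical and this is not the post's own code. `lax.cond` takes a predicate, a function for each branch, and the operand to pass to whichever branch is chosen.

```python
import jax
import jax.numpy as jnp
from jax import lax


@jax.jit
def square_if_positive(x):
    # Both branches are traced; the choice is made at run time on the device.
    return lax.cond(x > 0, lambda v: v ** 2, lambda v: v, x)


pos = square_if_positive(jnp.array(3.0))   # takes the "square it" branch
neg = square_if_positive(jnp.array(-2.0))  # takes the "return it" branch
print(pos, neg)
```

Both branch functions must return values of the same shape and dtype, which is one of the extra constraints that does not apply to an ordinary Python conditional.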
Python has control flow functions, but they don't work with the JIT's tracing, so we have to re-implement them in JAX. Hmmm. Now, I've written extensively above about how JAX's restrictions, however confusing, enable a lot of the amazing stuff that wouldn't be possible in normal PyTorch. What if there were some way to write PyTorch code and compile it directly to something that can execute on the hardware? It turns out that as of 2023, there is: . From what I understand, you're meant to be able to just attach it to your code and it gets JITted. But unlike JAX, you don't need to restrict the code you write. I've not investigated in much depth (after all, this post is already absurdly long and has taken more than a month on and off to put together), but it looks like it handles stuff that can't be compiled by using a concept of a "graph break" -- that is, it happily JITs what it can, then if it hits something that it can't JIT, it will cache the "work so far" as one compiled unit, run the Python code for the unJITable stuff, then (when it can) drop back into JIT mode. The best of both worlds? I don't know, and would need to spend much more time investigating in order to learn. But I can say that for my minimal-effort port of my toy XOR code , following the structure of the JITted JAX version, it really did not help: For those who are keeping track, that's slower than the uncompiled version, which came in at about 3.5s. And the issue doesn't seem to be an up-front cost of JITting that would be paid off if we ran for more epochs -- each individual "Loss at epoch XXX" print comes out slower. Again, for the sake of sanity I'm not going to dig into it further, especially given that this is a tiny toy model and probably about as far from the target use case of as you can get. But it's something well worth noting for the future. Stepping back: one other way of looking at this is that Python might just be the wrong language to try to build code that compiles to GPUs. 
I'm learning JAX right now so that I can re-implement my existing LLM from scratch project in something other than PyTorch, to make sure that I really understand it. I asked people on X/Twitter for votes or ideas , and while JAX won, Jeremy Howard suggested Mojo . Mojo is a Pythonic language that compiles directly to CPU or GPU code, so it explicitly only contains features that can be ported that way. Unfortunately, it's lower-level than I really wanted for this project (and, importantly, does not have built-in autograd support). But if it did -- if, for example, there was a library like JAX for it, perhaps it would be better than using Python as the foundation? I've looked for something like that, but to no avail. Some work-in-progress projects, but nothing ready for use. At the end of the day, I think further experience is essential if I'm going to come to a solid opinion on JAX. Experience with other tools can only get you so far, and it's easy to fail by pattern-matching what you're looking at with things that you've seen before, especially when you're old and cynical. All I can say at this point is that JAX is making my "beautiful but doomed" spidey-sense tingle. 8 The title of this post is important -- it is my impressions on first looking into JAX, not the considered thoughts of someone who's spent months or years working with it. I've only scratched the surface, and haven't even touched the larger JAX ecosystem, or indeed its powerful handling of memory sharding for multi-GPU or even multi-node setups (which may well be one of its biggest advantages). My next step is going to be to implement a GPT-2-style LLM in JAX, probably using Flax and Optax as helpers, and perhaps by the time I'm done with that I'll have changed my views. But at this point -- after working through the tutorials and porting some toy models to get at least an initial feel for it, I've come to the conclusion that I like it. 
The question is, do I like it like I liked Python when I first came to it -- "this thing is really neat and clean, even if it has flaws" or is it more like I liked Haskell -- "this is a stunning thing of beauty and is completely doomed in the real world"? Time will tell. But in the meantime, if you've been working with JAX for some time and want to counter any of the points I made, if I've completely misunderstood anything, or if you have any corrections, then please let me know! After all, explorers in areas new to them are prone to making mistakes from time to time... The forest of Skund was indeed enchanted, which was nothing unusual on the Disc, and was also the only forest in the whole universe to be called -- in the local language -- Your Finger You Fool, which was the literal meaning of the word Skund. The reason for this is regrettably all too common. When the first explorers from the warm lands around the Circle Sea travelled into the chilly hinterland they filled in the blank spaces on their maps by grabbing the nearest native, pointing at some distant landmark, speaking very clearly in a loud voice, and writing down whatever the bemused man told them. Thus were immortalised in generations of atlases such geographical oddities as Just A Mountain, I Don't Know, What? and, of course, Your Finger You Fool. Rainclouds clustered around the bald heights of Mt. Oolskunrahod ('Who is this Fool who does Not Know what a Mountain is') and the Luggage settled itself more comfortably under a dripping tree, which tried unsuccessfully to strike up a conversation. Terry Pratchett, The Light Fantastic Specifically, prior to the introduction of -- more about that later.  ↩ That's something I find myself constantly forgetting; I'll talk about "the loss landscape" as if it's something our training loop is exploring. 
And, of course, there is an overall loss landscape across all of the training data as a whole, but in any given iteration through the training loop, the loss is relative to the specific batch we're looking at.  ↩ You can also pass in an argument, zero by default, to tell it to do the derivative with respect to a different parameter or with respect to a sequence of parameter indexes. If you give a sequence, it will return a tuple of gradients. Additionally, there's a that returns a tuple of the value of and the gradients, which is useful for tracking loss as you train -- we'll use that later on.  ↩ You can also make classes "PyTree-compatible" by providing helper functions that map to and from that representation.  ↩ A reminder if your memory of Python decorator syntax is rusty -- this: ...is just syntactic sugar for this: It's a tad more complicated than that -- the metadata for array traces also contains the shape. More about that later.  ↩ For the pedantic: over ten runs of each, the numbers were pretty stable.  ↩ In case you're thinking that JAX is backed by Google and guaranteed to thrive because of that, remember Ada . Backed by the US Department of Defense. For its time, well-designed and elegant. It's still used, but it's hardly mainstream... I remember reading about it in Byte magazine back in 1988 or so, and had an "it's so beautiful" moment then too. To be fair to me, I was 14.  ↩ PyTorch is engineering; JAX is maths. PyTorch has historically 1 been optimised piecewise, JAX is JITted. PyTorch is procedural, JAX (tries to be) functional. PyTorch is maximalist; JAX is minimalist. Zero out the gradients that you currently have attached to the parameters. Do a forward pass to get the model's outputs. Work out the loss based on those outputs. Do the backward pass. Update the parameters based on the gradients that the backward pass attached to them. They don't know that the MaxSim kernel exists, so their code remains unoptimised. 
They do know that it exists, so they repurpose it for whatever their use case is. The first time through, it will create another of those tracer objects; this time, though, it won't wrap the number -- it will just know that it is a wrapper for a float. It will call the Python code with that tracer, and all of the operations in the function will be run, but the result that comes out at the end will essentially just be a representation of what calculations were done in an abstract sense -- like the computation graph that was used for working out gradients, but without specific numbers in it. JAX has a nice way to display these representations as what it calls JAXPRs, and the JAXPR for that function's representation when called with a float parameter will look something like this: That JAXPR can be compiled into the appropriate code for the platform where you're running it -- x86 machine code, compiled CUDA, the equivalent for AMD or Google Tensor Processing Units (TPUs), and will be cached. The key for the cache will be meta-information about the parameter -- in this case, something like "a 32-bit floating-point scalar". Next, the compiled code -- not the original Python -- is run with the actual value of the parameter, the that we provided. Horrible hacks can inexplicably become popular, but normally die off when people get tired of swearing at them. (Though sometimes a large installed base means that they linger.) Things of beauty get people excited, and often pull in the best engineers. But eventually, they drop by the wayside. Perhaps there's some hidden flaw that no-one noticed at the outset, or perhaps the mental model you need to build in order to use them effectively is too complicated for them to get to critical mass. Solid, boring engineering wins in the long term.
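The trace-then-compile steps listed above can be inspected directly. This is a small sketch rather than the function used earlier in the post: `double_plus_one` is a hypothetical stand-in, and I'm using `jax.make_jaxpr` to show the recorded computation and `jax.jit` to run the compiled version.

```python
import jax


def double_plus_one(x):
    # Hypothetical example function, chosen only to produce a short trace.
    return x * 2.0 + 1.0


# make_jaxpr calls the function with an abstract stand-in for 3.0 and
# returns the recorded operations instead of a numeric result.
jaxpr = jax.make_jaxpr(double_plus_one)(3.0)
jaxpr_text = str(jaxpr)
print(jaxpr_text)

# The jitted version is traced for this input type, compiled, and then
# the compiled code is run with the real value.
result = float(jax.jit(double_plus_one)(3.0))
print(result)
```

The printed JAXPR names the operations (a multiply, then an add) and the type of the input, but not the value 3.0 that was passed in.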

Langur Monkey 1 week ago

Langur Agent

Langur Agent is a simple, open, hackable CLI AI agent for Linux and macOS. It connects to any service providing an OpenAI-compatible endpoint. It features: The source is available in this repository . Langur Agent has been tested on Linux and macOS only. Install the agent with: Run the agent with the default session: If you need an API key to access the endpoint, put it in the file. Langur Agent looks for the file in the following locations, in order: Create the file with the API key: The agent uses to load at startup. The package reads from the environment automatically. You can also set in your shell profile. On first run, the configuration is created in . You can configure the agent interactively with the slash command. The agent works with any OpenAI-compatible endpoint, so LM Studio, Ollama, OpenWebUI, or any other service you configure. Here are the default values: Run the agent, and then you can enter your prompt. You can use the following key bindings during input: During inference, you can cancel the turn and return to the input prompt with Ctrl + c . Use to print information about the available commands, and to configure the agent interactively. Internally, Langur Agent uses sessions to separate different memory histories. Sessions are named by the user. By default, the agent uses the session. You can start in a different session (either create a new one, or restore it if it exists) with the argument: The default session’s name is , so the following two commands are equivalent: You can also list the existing sessions with : Sessions contain: For now, the configuration file is the same for all sessions. Sessions are matched by the directory name in the sessions location ( ). You can rename a session by just renaming the directory! You can enable mode for the current session with the command , or permanently in the configuration . 
External editor —In mode, exit INSERT mode ( Esc ), then press v to edit your prompt in an external editor (uses your or variable). There are a few commands available to use in the agent loop. You can list them with . Also, use (e.g. ) to show additional help for a command. Persistent memory follows XDG Base Directory spec in : In addition to persistent memory, the agent maintains a chat history of recent user input and assistant output pairs. This provides context that survives beyond the LLM’s context window. Here is how it works: Persistence: Configuration: Langur Agent can be easily customized and extended by adding new tools, commands, and skills. If you create a cool new tool, skill, or slash command, consider contributing it via a pull request! Create a file in or use one of the existing ones. To create a tool, create a method and decorate it with : Tools are auto-discovered on startup. The process is very similar to tools. You need to create your method, preferably in , and decorate it with . A slash command must return, in that order, , , , : Decorated commands are automatically registered, and auto-completed in the input prompt. Add a file in with YAML front matter, following the agentskills.io standard: The front matter and are parsed and shown in the skills list. The body is injected into the system prompt. 
session management
memory management
visual candy
autocompletion
interactive configuration

Python 3.13+
for dependency management

Current directory,
Home directory,

Alt + Enter : add a new line
Enter : submit the prompt
Ctrl + q : quit

The input history
Chat memory (see chat memory )
Notes (see session memory )
User profile (see session memory )

— user information
— persistent notes (added via tool)

Memory is loaded into the system prompt each turn
tool adds notes during a session
tool explicitly persists memory to disk
Memory is auto-saved when the agent exits (interactive mode)

Each user message and assistant response is stored in memory
Reasoning is omitted from chat memory
Automatically compacted when exceeding the configured character limit
The user can trigger the compaction any time with
Chat memory is attached to the system prompt on each turn
The agent displays the last 10 exchanges, with long messages truncated

Chat history is persisted to
Automatically loaded on startup
Saved after every exchange (user input or assistant response)
Compacted history is also persisted to disk

: a indicating if the command succeeded or failed.
: an optional short status message. It is printed with or .
: an optional with the Python Rich-formatted content, it is printed to the output.
: an optional formatted in Markdown, it is printed to the output.

Simon Willison 1 week ago

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic shipped Claude Opus 4.8 today. My favourite thing about it is this note in the release announcement: Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. It's so refreshing to see an AI lab honestly describe a release as a minor incremental improvement over the previous model! Honesty seems to be a theme. Here's my other favorite note from that announcement: One of the most prominent improvements in Opus 4.8 is its honesty . We train all our models to be honest---for instance, to avoid making claims that they can't support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in our evaluations , which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked. That linked system card includes the following: Claude Opus 4.8 had the lowest incorrect-rate of the six models on every benchmark—the most direct measure of factual hallucination. It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly. Not much has changed since 4.7. It's priced the same as Opus 4.5/4.6/4.7 - $5/million input and $25 per million output. "Fast mode" is twice that price, which is a significant reduction from their previous models - fast mode on 4.6/4.7 remains at $30/$150. Note that fast mode is only available to organizations that are part of the research preview, "Contact your account manager to request access". 
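As a quick sanity check on those prices, here is a small sketch of the arithmetic. The per-million-token prices are the ones quoted above; the token counts are the ones reported for the max-thinking pelican later in this post, and I'm assuming that run was billed at the standard (not fast mode) rate. The function name `cost_usd` is just an illustrative helper.

```python
# Prices in US dollars per million tokens, as quoted in this post.
STANDARD = {"input": 5.00, "output": 25.00}
# "Fast mode" is described as twice the standard price.
FAST = {name: 2 * price for name, price in STANDARD.items()}
# Fast mode on 4.6/4.7, for comparison.
PREVIOUS_FAST = {"input": 30.00, "output": 150.00}


def cost_usd(input_tokens, output_tokens, prices):
    """Return the cost in dollars for one request at the given prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Token counts from the max-thinking pelican run: 25 input, 17,167 output.
example = cost_usd(25, 17_167, STANDARD)
print(f"standard: ${example:.4f}")
print(f"fast mode: ${cost_usd(25, 17_167, FAST):.4f}")
print(f"4.6/4.7 fast mode: ${cost_usd(25, 17_167, PREVIOUS_FAST):.4f}")
```

At the standard rate this comes to roughly 43 cents, almost all of it from output tokens.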
Both the reliable knowledge cutoff and the training data cutoff are January 2026, the same as for 4.7. The context window is still 1,000,000 tokens, and the max output is 128,000 tokens. The What's new in Claude Opus 4.8 document has some of the more interesting details. These caught my eye: Mid-conversation system messages. Claude Opus 4.8 accepts messages immediately after a user turn in the array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. See also this update to the Anthropic Python SDK. Being able to steer the system prompt mid-conversation sounds really powerful. I was worried this would be incompatible with the abstraction provided by my own LLM library, which expects a single system prompt per conversation... but it turns out my recent redesign should handle that just fine. Lower prompt cache minimum. The minimum cacheable prompt length on Claude Opus 4.8 is 1,024 tokens, lower than on Claude Opus 4.7. I checked and 4.7's minimum was 4,096. Here are pelicans riding bicycles for all five thinking levels: This time I ran them using the LLM CLI, exported the logs to Markdown and then had Claude Opus 4.8 build me an HTML tool that could render that Markdown with the fenced code blocks displayed as SVGs on the page. (I later had GPT-5.5 xhigh in Codex update that code to remove any XSS holes. I'm sure Claude could have done that if I'd asked, but GPT-5.5 is my code security blanket at the moment.) The max one was clearly the best, but it did take 25 input, 17,167 output tokens for a total cost of 43 cents! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.


SQLAlchemy 2 In Practice - Solutions to the Exercises

To conclude with my SQLAlchemy 2 in Practice series, this article contains the solutions to all the exercises. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you!

Armin Ronacher 1 week ago

Clanker: A Word For The Machine

In my last post I used the word “clanker” as an alternative to “agent” quite consistently and probably excessively. That choice ended up attracting a lot more attention than I expected in the Hacker News comment section of that post and a number of folks had a very strong reaction: to them it sounded like a slur, in one case even something adjacent to the n-word. That reaction surprised me somewhat, but it also made me realize that I should write down what I mean by the word for future reference. For me “clanker” is useful because it creates distance from the machine and that is a quality which is important to me. The machine is not a person, not a co-worker, not a friend, not a little spirit in the terminal. It is just a machine, a tool, and nothing more. I dislike the word “agent” for these LLM-based tool loops with a UI attached. In everyday use an agent is someone who acts on behalf of someone else, and it has agency and, more importantly, responsibility. An agent decides, represents, negotiates, acts, and can be blamed. In the current AI discourse we increasingly do a lot of anthropomorphizing and the term “agent” is now frequently being used to put blame on an abstract machine. But the machine cannot be responsible; whoever is wielding it is. If it drops your database it was not at fault, you were. “Agent” makes the machine sound like a person with delegated authority and I do not think that is healthy. What we actually have is a language model attached to a harness, a prompt, some tools, a bit of context, and a boring tool loop. Sometimes the loop is very capable and it surprises us by editing code for a really long time and producing genuinely amazing and even valuable outputs. But the agency is not in the model or harness but in the human and in the organization that deployed it. If my coding tool opens a pull request, I opened that pull request, not the machine. If my machine spams someone’s issue tracker, I spammed someone’s issue tracker with a machine.
In that context I like a word that sounds mechanical as it puts the thing back into the category where it belongs: the category of machinery and tools. LLMs are not sentient and we should not behave as if they might be, just in case. Elevating these things to anything other than a very fascinating and capable tool is problematic for a whole bunch of reasons. Today’s machines are dumb (but truly fascinating) token predictors that emit text, call tools, and are steered by prompts and the training that went into them. They can simulate distress and affection, can simulate being offended, apologize and mimic all kinds of things that humans would do. A compiler does not feel humiliated when I swear at it, a car does not suffer when I call it a shitbox and a power drill is not oppressed by being handled roughly. An LLM is more complicated than those things, and the interactions you can have with it can be truly uncanny, but a moral status does not appear just because the machine can emit text in the first person. I keep receiving strange emails from people because, for lack of a better phrase, I am in the weights. I have been writing public code and public text for long enough that models know my name, my projects, and some of the concepts around them. Every so often someone writes to me with the peculiar confidence that comes from a long conversation with a model that has validated and amplified an idea. Sometimes the model seems to have told them that I am relevant for their problem and a source of help. For historical reasons LLMs used to write a lot of Flask code, and every once in a while someone interacts with an LLM long enough about their Python and Flask frustrations that the LLM will eventually reveal who created it, which can then result in them sending me an email. Increasingly it is also because people found my work interesting in other ways and are trying to reach out for advice.
I do not want to mock these people but some of those messages are distressing and I do not know how to deal with them. They show signs of what people have started calling AI psychosis. It’s why I want cold and detached language for these systems. I want to use words that remind us that the thing on the other side is not a person. The comparison to racism is where I think the discussion goes badly wrong because racism is a human social evil. It is about humans subdividing humans, assigning lesser worth to some of them, and building rules around those subdivisions that can leave lasting damage for generations. Racial slurs are wrong because they are a tool for dehumanizing humans. On the other hand a machine is not human, a model is not a race and the GPU cluster that is powering them is not being oppressed. A coding assistant does not need dignity, emancipation, or civil rights. That’s also why I find the discussion about model welfare to be actively harmful. I’m sure you can find ways to measure the “trauma” of models or their feelings but I greatly dislike this theater. It risks elevating models to a position they should not occupy. Models are machines and they are not enslaved in the moral sense in which humans were enslaved, because there isn’t anyone there to be deprived of freedom. We should be careful about using the language of human oppression in relation to our interactions with machines, so that we do not devalue actual humans. If we start treating insults toward a model as morally adjacent to racism, we blur a line that shouldn’t be blurred. If you take a step away from the communities that are happily embracing AI in different ways, there are even more that are viciously against this technology.
There are humans that feel or are harmed by AI systems: people whose work is copied, workers who label data under questionable conditions, people whose neighborhoods receive the data centers and increased utility bills, Open Source maintainers buried under generated slop, and now also people who spiral because a chatbot keeps validating their delusions. Those harmed or affected deserve that type of attention, not the model. While I am a true believer in the power and utility of this technology, I increasingly think that calling the non-adopters “misguided” or “afraid” won’t do it. It’s quite likely that this technology comes with risks and we better remember that all of this is supposed to be in service of humans, and not to replace them. The oddest interaction on the use of “clanker” so far has been people asking me if I were to regret at a point in the future calling the machines “the c-word”. I find that questioning revealing because it already grants the machine the status I am really trying not to grant it. It imagines a future “machine people” reading the discourse and sessions, discovering that we used an ugly word for their ancestors, and then judging us by the standards of human oppression. Could there be future systems that deserve moral consideration? Maybe. I do not know. If we ever build or encounter something that will have those qualities with memories and lasting interests, the capacity to suffer and feel, and a social existence of its own, and the ability to have agency and carry responsibilities, then we should draw a different line and use different language. But that hypothetical future does not extend backwards to the present day and make the current machines people. We can call an electric door an electric door even if one day someone builds some that have emotions and exhale with pleasure when opening and closing. Whatever the future may bring, let’s not pretend that current LLMs are a protected class or on a path towards it. 
The right response is to look at the evidence, draw the boundary where it belongs, and change our behavior there. We should not even remotely entertain extending empathy to an object that can generate an “ouch.” And if one’s worry is less moral and more about revenge, then I find that even less persuasive. If a future machine is so petty or authoritarian that it wants to punish humans because in 2026 they used an unflattering word for non-sentient tools, then our vocabulary was really not the problem. There is however a part of this that I cannot ignore. I use “clanker” to create distance from the machine, but other people are using the same word very differently. Some online jokes and skits around “clankers” do not merely say “this robot is annoying”; they deliberately pull in the imagery of slavery, segregation, civil-rights-era racism, and anti-Black tropes. This is problematic because in those contexts the clanker is not just a machine any more and instead becomes a prop for replaying human racism behind a science-fiction mask. That is horrible and I want no part in that. I think it will be interesting to see where the meanings of these words end up a few years from now. We’re very much in the middle of society re-arranging around the changes that LLMs are causing. If a term becomes primarily associated with people using robots as stand-ins for actually oppressed humans, then using that term becomes impossible to defend. The reason I liked the word is precisely the opposite of that use. I want language that prevents anthropomorphizing. I want a word that says: this is a tool, a machine of numbers and matrices. If an AI system lies to a user, the system did not commit a moral wrong but the people who designed, deployed, marketed, or negligently used it might have. If a coding assistant generates a security bug, the model is not to blame but the human who accepted and committed the code is. This is why giving these systems softer, more human language worries me.
It makes it easier to move responsibility into some undefined void. “The agent decided.” “The model refused.” Obviously that is convenient and I catch myself plenty of times engaging with the thing in ways that are unhealthy. Even just the “please” in the discourse with the machine calls into question how rational we are in engaging with them. I do not know what the right word will be. Maybe “clanker” will survive as a useful bit of jargon. Maybe it will become too loaded and we will need another one. Whatever word we use, I want it to preserve a clear division: humans on one side with responsibility, machines on the other as a boring tool. That boundary is very much not anti-AI. I use these systems every day and I have the pleasure to build tools incorporating them at Earendil and find them astonishingly useful. A machine can be useful, mimic a human but still just be a machine. That is the work I want “clanker” to do. It is not there to make a future “machine person” small if such a person ever were to exist, and it is not an excuse to launder racism through shitty robot jokes. If the word stops doing that work, I will find another one because the word isn’t what matters as much as the boundary which is important to me.

Simon Willison 2 weeks ago

Datasette Agent

We just announced the first release of Datasette Agent , a new extensible AI assistant for Datasette. I've been working on my LLM Python library for just over three years now, and Datasette Agent represents the moment that LLM and Datasette finally come together. I'm really excited about it! Datasette Agent provides a conversational interface for asking questions of the data you have stored in Datasette. Add the datasette-agent-charts plugin and it can generate charts of your data as well. The announcement post (on the new Datasette project blog) includes this demo video : I recorded the video against the new agent.datasette.io live demo instance, which runs Datasette Agent against example databases including the classic global-power-plants by WRI , and a copy of the Datasette backup of my blog. The live demo runs on Gemini 3.1 Flash-Lite - it's cheap, fast and has no trouble writing SQLite queries. A question I asked in the demo was: when did Simon most recently see a pelican? Which ran this SQL query : And replied: The most recent sighting of a pelican by Simon was recorded on May 20, 2026 . The observation included a California Brown Pelican, along with a Common Loon, Canada Goose, Striped Shore Crab, and a California Sea Lion. Here's that sighting on my blog , and the Markdown export of the full conversation transcript. My favorite feature of Datasette Agent is that, like the rest of Datasette, it's extensible using plugins. We've shipped three plugins so far: Building plugins is really fun . I have a bunch more prototypes that aren't quite alpha-quality yet. Claude Code and OpenAI Codex are both proving excellent at writing plugins - just point them at a checkout of the datasette-agent repo for reference and tell them what you want to build! I've also been having fun running the new plugin against local models. 
Here's a one-liner to run the plugin against gemma-4-26b-a4b in LM Studio on a Mac: Datasette Agent needs reliable tool calls and the ability for a model to produce SQL queries that run against SQLite. The open weight models released in the past six months are increasingly able to handle that. Datasette Agent opens up so many opportunities for the LLM and Datasette ecosystem in general. It's already informed the major LLM 0.32a0 refactor which I'm nearly ready to roll into a stable release, maybe with some additional "LLM agent" abstractions extracted from Datasette Agent itself. I've been exploring my own take on Claude Artifacts, which is shaping up nicely as a plugin. I'm excited to use Datasette Agent to build my own Claw - a personal AI assistant built around data imported from different parts of my digital life, which is a neat excuse to revisit my older Dogsheep family of tools. We'll also be rolling out Datasette Agent for users of Datasette Cloud. Join our #datasette-agent Discord channel if you'd like to talk about the project.
datasette-agent-charts, shown in the video, adds charts to Datasette Agent, powered by Observable Plot.
datasette-agent-openai-imagegen adds an image generation tool to Datasette Agent using ChatGPT Images 2.0.
datasette-agent-sprites provides tools for executing code in a Fly Sprites persistent sandbox.

Simon Willison 2 weeks ago

The last six months in LLMs in five minutes

I put together these annotated slides from my five minute lightning talk at PyCon US 2026, using the latest iteration of my annotated presentation tool . I presented this lightning talk at PyCon US 2026, attempting to summarize the last six months of developments in LLMs in five minutes. Six months is a pretty convenient time period to cover, because it captures what I've been calling the November 2025 inflection point . November was a critical month in LLMs, especially for coding. For one thing, the supposedly "best" model (depending mostly on vibes) changed hands five times between the three big providers. As always, I'm using my Generate an SVG of a pelican riding a bicycle test to help illustrate the differences between the models. Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can't ride bicycles ... and there's zero chance any AI lab would train a model for such a ridiculous task. At the start of November the widely acknowledged "best" model was Claude Sonnet 4.5, released on 29th September . It drew me this pelican. In November it was overtaken by GPT-5.1 , then Gemini 3 , then GPT-5.1 Codex Max , and then Anthropic took the crown back again with Claude Opus 4.5 . I think Gemini 3 drew the best pelican out of this lot, but pelicans aren't everything. Most practitioners will agree that Opus 4.5 held the crown for the next couple of months. It took a little while for this to become clear, but the real news from November was that the coding agents got good . OpenAI and Anthropic had spent most of 2025 running Reinforcement Learning from Verifiable Rewards to increase the quality of code written by their models, especially when paired up with their Codex and Claude Code agent harnesses. In November the results of this work became apparent. 
Coding agents went from often-work to mostly-work, crossing a quality barrier where you could use them as a daily-driver to get real work done, without needing to spend most of your time fixing their stupid mistakes. Also in November, this happened - the first commit to an obscure (back then) repo called "Warelay" by some guy called Pete. Over the holiday period, from December to January, a whole lot of us took advantage of the break to have a poke at these new models and coding agents and see what they could do. They could do a lot! Some of us got a little bit over-excited. I had my own short-lived bout of a form of LLM psychosis as I started spinning up wildly ambitious projects to see how far I could push them. One of my projects was a vibe-coded implementation of JavaScript in Python - a loose port of MicroQuickJS - which I called micro-javascript . You can try it out in your browser in this playground . That playground demo shows JavaScript code run using my micro-javascript library, in Python, running inside Pyodide, running in WebAssembly, running in JavaScript, running in a browser! It's pretty cool! But did anyone out there need a buggy, slow, insecure half-baked implementation of JavaScript in Python? They did not. I have quite a few other projects from that holiday period that I have since quietly retired! On to February. Remember that Warelay project that had its first commit at the end of November? In December and January it had gone through quite a few name changes ... and by February it was taking the world by storm under its final name, OpenClaw . The amount of attention it got is pretty astonishing for a project that was less than three months old. OpenClaw is a "personal AI assistant", and we actually got a generic term for these, based on NanoClaw and ZeroClaw and suchlike... they're called Claws . Mac Minis started to sell out around Silicon Valley, because people were buying them to run their Claws. 
Drew Breunig joked to me that this is because they're the new digital pets, and a Mac Mini is the perfect aquarium for your Claw. My favourite metaphor for Claws is Alfred Molina's Doc Ock in the 2004 movie Spider-Man 2. His claws were powered by AI, and were perfectly safe provided nothing damaged his inhibitor chip... after which they turned evil and took over. Also in February: Gemini 3.1 Pro came out, and drew me a really good pelican riding a bicycle . Look at this! It's even got a fish in its basket. And then Google's Jeff Dean tweeted this video of an animated pelican riding a bicycle, plus a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine. So maybe the AI labs have been paying attention after all! A lot of stuff happened just in the past month. Google released the Gemma 4 series of models, which are the most capable open weight models I've seen from a US company. Also last month, Chinese AI lab GLM came out with GLM-5.1 - an open weight 1.5TB monster! This is a very effective model... if you can afford the hardware to run it. GLM-5.1 drew me this very competent pelican on a bicycle. ... though when it tried to animate it the bicycle bounced off into the top and the bicycle got warped. Charles on Bluesky suggested I try it with a North Virginia Opossum on an E-scooter And it did this! I've tried this on other models and they don't even come close. "Cruising the commonwealth since dusk" is perfect. It's animated too . The other neat Chinese open weight models in April came from Qwen. Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 . That's a 20.9GB open weights model that runs on my laptop! (I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.) Here's that Claude Sonnet 4.5 pelican from September for comparison. 
So those were the two main themes of the past six months. The coding agents got really good... and the laptop-available models, while a lot weaker than the frontier, have started wildly outperforming expectations.

Ivan Sagalaev 2 weeks ago

Shoppy

Meet Shoppy ! It's a helper app for my recently revived shopping list , with which I'm hoping to grow the dataset for categories prediction. In fact, even early beta tests have made Shoppy significantly more savvy about alcoholic drinks (the initial data comes from my own shopping, and my entire family happens to be non-drinkers). See if you can confuse it about something it doesn't know! But besides that, there's a few deeper philosophical and technical notes I wanted to share. It's a very, very simple Django app . When I first had the idea to build it I entertained some thoughts about trying some front-end based technology, because, you know, it's an "app"… But then after actually thinking about what it's going to be — a handful of static screens and a couple of forms — I decided to go the familiar way. Now I have a small, view-source 'able HTML app which I'm proud to offer as an example of how you can build something interactive without the layers of modern front-end technology. If you're new here, simplicity is kind of my thing in software engineering. Although it's really hard to convince people to do simple. Trying modern CSS after a long break felt really exciting! Nested blocks, variables, complete control over the box model, new useful units (like ), and niceties like — all of these made my life much simpler. I was especially impressed with which allowed me to make speech and form bubbles flexible. Without it, trying to make text of variable length look nice in a fixed-size bubble caused me a lot of frustration. For layout, I tried flexbox and grid, but they didn't really work for me. It's my own fault, really. You see, ever since I bought into the idea of separating the roles of markup and style, I dislike adding extra structure to markup purely for styling convenience. Markup needs to mean something! And the one thing that grids and flexboxes really like is having straightforward container s with stuff inside of them. 
But what I have is a which consists of naked , , and , in this order — and that's just not enough structure to say "this goes here, and that goes there". So I ended up with good old absolute positioning and some paddings around Shoppy's avatar. CSS variables really do shine for things like this. And! It was my first time making a responsive layout that looks nice both on mobile and desktop! Tell me if something is broken on your particular setup. The model is a mapping from "terms" to categories . I learned to build such things while working on the Search team at Shutterstock, and their simplicity still amazes me! Here's how it works: You get a search query, like "Honeycrisp apples". You split it into words, stem them and sort them, which gives you — a predictable set of keys independent of morphology and the input order (they're called unigrams). Then you generate all two-word combinations (called bigrams) from this set, which in this case gives you just , and add them to unigrams. And then you look up each of the search terms in the dataset and pick the entry that comes the earliest. In this case, there's only one: . But there's a few non-obvious tricks it lets you do: You don't need to list all the apple varieties, unknown words are simply ignored, and you just recognize any apple as produce. But what of "apple juice"? For that it has an entry , which is deliberately placed before the apples, so it gets picked up instead. In fact, what it means is that "any kind of juice is a drink, regardless of what it's made of". Same goes for "oat milk " (drink), " diced tomatoes" (canned products), etc. Now think of "apple sauce". "Apple" is produce, "sauce" is (usually) a condiment. But "apple sauce" is a snack! This is where bigrams come into play: the bigram entry comes before both and , which resolves the conundrum. (In fact, all of the bigrams must come before all the unigrams, because they're always more specific.) 
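The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Shoppy's actual code: the `DATASET` entries, the crude `stem` function and the names `terms` and `predict` are all hypothetical, and a real stemmer would replace the plural-stripping stand-in.

```python
from itertools import combinations

# Hypothetical ordered dataset: the earliest matching entry wins.
# Bigram keys are stored with their words in sorted order and are
# placed before every unigram, because they are more specific.
DATASET = [
    ("apple sauce", "snacks"),
    ("juice", "drinks"),
    ("sauce", "condiments"),
    ("apple", "produce"),
]


def stem(word):
    """Crude stand-in for a real stemmer: lowercase and drop a plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def terms(query):
    """Return the unigrams plus all two-word combinations for a query."""
    unigrams = sorted({stem(word) for word in query.split()})
    bigrams = [" ".join(pair) for pair in combinations(unigrams, 2)]
    return set(unigrams) | set(bigrams)


def predict(query):
    """Pick the category of the earliest dataset entry matching any term."""
    keys = terms(query)
    for term, category in DATASET:
        if term in keys:
            return category
    return None  # nothing in the query is known to the dataset
```

With this toy dataset, `predict("Honeycrisp apples")` ignores the unknown word and falls through to the `apple` entry, while `predict("apple sauce")` is caught by the bigram entry before either unigram gets a chance.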
There's some more to it all, and there are downsides, but I won't go any deeper right now.
It's 2026, so I can't not talk about it, can I? Generative AI happened to the world right in between me first coming up with the idea of category prediction and having a chance to actually implement it. And I admit to having thoughts that maybe there's no point in building your own model for such a thing now. After all, just ask any LLM "which grocery category is dill weed" and it will tell you… a lot of text with several variants, which you can't really use in a precise manner :-) So of course I went back to my own idea, because it's much, much simpler. And local. And free. And ethical. Luckily, the simpler solution doesn't really lose on feeling magical and intelligent. I've seen people play with the app and really engage with it, and be impressed! One of the testers, when trying to come up with a random grocery item for the first time, said, "There's probably a million of them!" It doesn't matter that my entire model is just around 500 entries, it still feels like it knows much more simply because people overestimate the size of the problem :-)
You see, I can process photos, I can do business graphics, and I'm known to have put together a few toolbar icons in my time… but for the life of me I can't draw! And even if I could, I'm particularly hopeless at coming up with what to draw. So I commissioned the graphics from an artist, who also introduced me to the concept of "object shows" and the whole OSC fandom. Not sure I'm joining as a fan yet, but I'm definitely very happy with the original character of Shoppy! Oh, and the background.


SQLAlchemy 2 In Practice - Chapter 8: SQLAlchemy and the Web

This is the eighth and final chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! Whether you are building a traditional web application, or a web API that works alongside a web front end or smartphone app, SQLAlchemy is one of the best choices to add database support to a Python web server. In this chapter two example integrations with Flask and FastAPI will be demonstrated. These are two of the most popular Python web frameworks and should serve as examples even if you use another web framework.

neilzone 3 weeks ago

Fixing a proxying problem with my HomeAssistantOS installation by replacing nginx proxy manager

tl;dr: I removed the “nginx proxy manager” add-on, and replaced it with the Let’s Encrypt add-on and (second) the nginx add-on. A couple of months ago, I moved my HomeAssistant installation to HAos. I think that it is fair to say that I was not overly pleased with this. Honestly, I preferred the “Core” python-venv approach, but I also wanted a “supported” installation, and so I switched to HAos. I got it up and running okay, and I thought that I had got proxying working too, using an add-on called “nginx proxy manager”. This is not something that I had used before; I’d rather just configure nginx myself. Well, either I got something wrong, or it just does not work very well, as I kept having problems using HomeAssistant: it got stuck on a “loading data” screen, or simply did not respond. This bugged me for quite a while. Annoyingly, the logs available to me within HAos were unhelpful. I couldn’t spot anything indicating a problem. Using the console in my web browser, I noted that some files were not loading correctly, but why that was the case, I wasn’t sure. I thought that I’d had a similar issue with my “Core” installation years ago, which I got down to the issue of the in the file, but that looked correct here (which I was able to check, using the SSH add-on). I tried various parameters in the nginx proxy manager add-on, but to no avail. In the end, I tried removing the nginx proxy manager add-on, and replacing it with the Let’s Encrypt add-on (which I installed, configured, and ran first), and then the nginx add-on. And it immediately started working correctly. So I don’t know exactly why my original set-up was not working, but at least it is working better now.

Ankur Sethi 3 weeks ago

Mythos finds a curl vulnerability

Link: https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/ Daniel Stenberg , creator and lead developer of cURL: My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing. I signed the contract for getting access, but then nothing happened. Weeks went past and I was told there was a hiccup somewhere and access was delayed. Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report. To me, the distinction isn’t that important. It’s not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway. Getting the tool to generate a first proper scan and analysis would be great, whoever did it. I happily accepted this offer. So Daniel didn't have access to Mythos. Someone else ran the analysis on his behalf. It's unclear what methodology this "someone else" used, how familiar they were with the cURL codebase, or how well they were acquainted with the sort of security issues the project has seen before. What if Daniel had run the scan himself? I'm willing to bet the results would've been radically different. I'm not saying all the hype around Mythos is necessarily justified—Anthropic is an AI lab after all, and AI labs lie. However, it's becoming clear that LLMs are remarkably effective at finding bugs and security issues as long as they have the right guidance . For an example of what Claude can do with expert guidance and access to custom tools, see Using LLMs to find Python C-extension bugs . 
Broadly speaking, I believe Daniel would agree with this sentiment. He writes: But allow me to highlight and reiterate what I have said before: AI powered code analyzers are significantly better at finding security flaws and mistakes in source code than any traditional code analyzers did in the past. All modern AI models are good at this now. Anyone with time and some experimental spirits can find security problems now. The high quality chaos is real. Any project that has not scanned their source code with AI powered tooling will likely find huge number of flaws, bugs and possible vulnerabilities with this new generation of tools. Mythos will, and so will many of the others. Not using AI code analyzers in your project means that you leave adversaries and attackers time and opportunity to find and exploit the flaws you don’t find. Lately I find myself drawn to how LLMs can help improve existing human-authored (or mostly human-authored) code. I'm no longer thrilled with the idea of using them to write most of my code for me— been there , dealt with the cognitive debt—but I'm intrigued by how I could use them as superhuman code reviewers to catch my mistakes. What would a coding harness designed primarily around improving code quality look like?

Ankur Sethi 3 weeks ago

Using LLMs to find Python C-extension bugs

Link: https://lwn.net/Articles/1067234/ Jake Edge , LWN.net: […] Hobbyist Daniel Diniz used Claude Code to find more than 500 bugs of various sorts across nearly a million lines of code in 44 extensions; he has been working with maintainers to get fixes upstream and his methodology serves as a great example of how to keep the human in the loop—and the maintainers out of burnout—when employing LLMs. It's worth reading Daniel Diniz's post on the Python forums in full. This is a great example of an engineer with specific domain expertise using LLMs to augment and amplify his abilities. Not just that, he's working closely with maintainers to ensure he's not inundating them with slop PRs or unreproducible bug reports. The part I find most interesting is how Daniel's Claude Code plugin works. He writes in his forum post : I built a Claude Code plugin called  cext-review-toolkit . The key difference from traditional static analysis is that this system tracks Python-specific invariants (refcounts, GIL discipline, exception state) across control flow, and validates findings with targeted reproducers. That is done by 13 specialized analysis agents analyzing the C extension source code in parallel, with each agent targeting a different bug class. The agents use  Tree-sitter  for C/C++ parsing, which enables analysis that pattern matching can’t do, like tracking borrowed reference lifetimes across function calls, or cross-referencing type slot definitions with struct members. Each agent can run a scanner script to find candidates, then performs qualitative review of each candidate to confirm or dismiss it. The scripts have a ~20-40% false positive rate and the agents are there to bring that down. After the agents finish, I try to reproduce every finding from pure Python and write a reproducer appendix. 
Later from the same post: Traditional tools like clang-tidy, Coverity, and sanitizers struggle with Python C API semantics (reference ownership, exception state, GIL constraints). The analyses cext-review-toolkit performs target those invariants specifically. Besides that, the tool uses guided semantic analysis (LLM-assisted) to analyze aspects like “was that bugfix complete, and do similar bugs still lurk in the codebase?” that other tools cannot cover. The rich set of agents cover:
- Reference counting: leaked refs, borrowed-ref-across-callback, stolen-ref misuse.
- Error handling: missing NULL checks, return without exception, exception clobbering.
- NULL safety: unchecked allocations, dereference-before-check.
- GIL discipline: API calls without GIL, blocking with GIL held.
- Type slots: dealloc bugs, missing traverse/clear,  -without-  safety.
- PyErr_Clear: unguarded exception swallowing (MemoryError, KeyboardInterrupt).
- Module state: single-phase init, global PyObject* state.
- Version compatibility: deprecated APIs, dead version guards.
- Git history: fix completeness (same bug fixed in one place but not another).
- Plus: stable ABI compliance, resource lifecycle, complexity analysis.
So cext-review-toolkit is not just a set of prompts that tell Claude to go find bugs. It combines detailed descriptions of specific classes of bugs with scripts powered by Tree-sitter that allow Claude to extract rich semantic data from the codebase it's analyzing. The LLM is not doing all of the heavy lifting here. It works in tandem with human expertise encoded in prompts and deterministic scripts custom built for acting on those prompts. To me, this feels like the most effective use of LLMs for domain-specific tasks that don't exist in training data: encode as much of your logic into deterministic tools as you can, encode the more squishy parts of your domain into prompts, and let an agent drive those tools. I can see a possible future where every project has its own version of cext-review-toolkit that encodes common classes of bugs the project deals with repeatedly. How much would something like this improve code quality? How much better would it be versus the generic PR review agents we use today?

Rob Zolkos 4 weeks ago

Watch Your Agents

I’ve been telling developers to watch their logs for years. Not just when something is broken. Not just when production is on fire. Watch them while you are building. Your logs are the closest thing you have to x-ray vision for a web application. Click a button in the browser, watch the request move through the app, and you can see what is really happening behind the scenes. The habit is simple: keep the server log visible while you work. When you do, you start spotting problems long before they become production issues:
- the same query firing 50 times because of an N+1
- a page that feels fine locally but is doing way too much work
- a slow query that needs an index
- an unexpected redirect or extra request
- a cache miss you thought was a cache hit
- a background job being enqueued more often than expected
- parameters coming through in a shape you did not expect
The logs give you immediate feedback. They make the invisible visible. Coding agents need the same treatment. When you are working with an agent, do not just look at the final diff. Watch what it is doing. Watch the commands it runs, the files it opens, the mistakes it repeats, and the little bits of glue code it keeps inventing along the way. That is the agent equivalent of watching your development log. You are not only checking whether this turn succeeded. You are looking for patterns that can make future turns better.
Most coding agents keep some kind of session history: transcripts, tool calls, command output, file edits, errors, retries, and sometimes timing information. Those logs are useful after the fact. Point the agent at its own session logs and ask it to look for patterns:
- What tasks did you repeat multiple times in this session?
- What code did you generate only to throw away later?
- Which commands failed, and what would have prevented those failures?
- Did you write any one-off scripts that should become checked-in tools?
- Did you repeatedly search for the same files or project conventions?
- Were there project rules you had to infer that should be documented?
- Which parts of the workflow were deterministic enough to automate?
- What should be added to , a skill, or a script?
- If a smaller model had to do this next time, what tools or instructions would it need?
A prompt I like for this: This is the same habit as watching the Rails log after clicking around a page. You are looking for the part of the system that is doing too much work, guessing too often, or hiding useful signal.
A useful signal is when the model keeps generating code to do the same mechanical task. For example, imagine you have a skill for publishing blog posts. Every time you run it, the model writes a small Ruby or Python snippet to:
- parse front matter
- validate the title, summary, badge, tags, and date
- derive the final filename
- move the draft into
If the agent is generating that code every time, that is a smell. The model is doing work that should probably be deterministic. Ask the agent to turn that behavior into a script: Then update the skill so future agents call the script instead of improvising the logic.
Bad pattern: every publishing session, the agent manually inspects YAML front matter and tries to remember the required fields. Better pattern: create that exits non-zero when , , , or are missing or malformed. Now the agent does not need to reason about the rules from scratch. It runs the command and reacts to the result.
Bad pattern: the agent repeatedly writes one-off Python to resize screenshots, compare image dimensions, or calculate visual diffs. Better pattern: create with clear output like: The agent can use the result without reinventing image processing each time.
Bad pattern: the agent keeps constructing ad hoc SQL to answer common questions like “which users have duplicate active subscriptions?” or “which jobs are stuck?” Better pattern: create named scripts or Rails tasks: Now the workflow is repeatable, reviewable, and safe to run again.
Bad pattern: the agent writes custom code every time it needs to build a fake webhook payload or API response. Better pattern: create or a small fixture library that produces known-good examples. The agent stops guessing at payload shapes and starts using something the test suite can trust.
Moving repeated agent behavior into deterministic tools gives you a few wins:
- Dependability: the same input produces the same output.
- Determinism: fewer “creative” variations in routine work.
- Testability: scripts can have tests; improvised reasoning usually cannot.
- Reviewability: a script can be read, improved, and versioned.
- Cost: once the workflow is encoded, you may be able to use a smaller model for that task.
- Speed: future turns spend less time rediscovering the same procedure.
Watch the agent the way you watch your logs. When you see friction, repetition, or uncertainty, ask whether the agent needs better instructions or a better tool. Sometimes the answer is a clearer prompt. Sometimes it is a skill. And sometimes the best thing you can do is take the fragile reasoning out of the model entirely and give it a boring, deterministic script to call. That is not making the agent less useful. That is making the whole system more useful.
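As a rough illustration of the front matter check described in this post, here is a minimal Python sketch. It rests on several assumptions that are not from the post: the function names, the simple `key: value` parsing (a real script would likely use a YAML library), the ISO date format, and the exact set of required fields, which is taken from the "title, summary, badge, tags, and date" list.

```python
import sys
from datetime import date

# Assumed set of required fields, based on the list in the post.
REQUIRED = ("title", "summary", "badge", "tags", "date")


def parse_front_matter(text):
    """Read simple 'key: value' pairs between the leading '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def validate(text):
    """Return a list of human-readable problems; an empty list means OK."""
    fields = parse_front_matter(text)
    errors = [
        f"missing or empty field: {name}"
        for name in REQUIRED
        if not fields.get(name)
    ]
    if fields.get("date"):
        try:
            date.fromisoformat(fields["date"])  # assumes YYYY-MM-DD dates
        except ValueError:
            errors.append(f"malformed date: {fields['date']}")
    return errors


def main(argv):
    """Exit status 0 when the file is valid, 1 on problems, 2 on bad usage."""
    if len(argv) != 2:
        print("usage: validate_front_matter.py POST.md", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        errors = validate(handle.read())
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0
```

Saved as a script with `sys.exit(main(sys.argv))` at the bottom, this gives the agent a single command to run and an exit status to react to, rather than rules to re-derive each session.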

Kaushik Gopal 1 months ago

Agents are the new compilers. Specs are the new code.

Linus Torvalds recently said [1] AI will be to code what compilers were to assembly — freeing us from writing it by hand. Around the same time, I talked with Jesse Vincent (creator of one of the most popular agent skills out there — superpowers). Something he said stuck with me: Specs are going to be the new code. I realize those two ideas snap together a little too neatly. Agents are compilers [2] and specs will become code. Software engineering is moving up another level of abstraction and we’ve seen this play out before. I saw this first-hand with my tiny USB-C cable checker — . It started as a shell command over macOS’s , then became Go when I wanted a proper binary, then Rust because I wanted to practice Rust, and later a version. The code kept changing. The thing I cared about did not: parse the USB tree, identify the attached devices, report the speed, and make bad cables obvious. , my voice track sync program, followed the same pattern. It started in Python because the audio libraries were there. Then I moved it to Rust because I didn’t want to ship a Python runtime or care which Python version happened to be on a machine. Again, the implementation changed. The behavior stayed boringly stable: take a master track and local tracks, find the offset, pad or trim each file, and drop aligned audio into the DAW. Compilers freed us from writing assembly. Agents may free us from writing code because it becomes an artifact the spec produces. The somewhat recent push around detailed exec plans could be an early signal of the looming shift at bigger scale. Push that thought further. We might get comfortable rebuilding whole modules instead of patching and refactoring them. We preserved the old shape of a system because throwing it away cost too much. Even when you know the module is wrong, you sand it down: extract an interface, migrate one caller at a time, add tests around behavior nobody fully understands.
You keep moving because the alternative is a rewrite, and rewrites have a well-earned reputation for eating companies alive. But agents change that cost curve. If an agent can read the spec, understand the tests, inspect production traces, and rebuild a module in an afternoon, the sensible move may be to replace the entire module altogether. Push that even further and the unit of work changes. You stop asking an agent to patch one function or file. You ask it to rebuild the entire payment module against the tweaked spec. Heck, swap out the auth layer with a new library. Or regenerate the API boundary, now that the domain model is clearer. This is the part I cannot stop thinking about. Each rebuild can start from what we now understand about the whole module, not from what we believed the first time someone shipped it. Tech debt the old code carried (because it grew one patch at a time) can finally come off. The spec can absorb what we learned from the old implementation: the weird edge case in billing, the migration path nobody wrote down, the customer whose workflow depends on a “bug”, the batch job that only fails on the first day of the month. Specs become the place where the system’s memory lives. Once those lessons move into the spec, the implementation becomes replaceable. We are becoming Spec Writers.
[1] starts at the 1:48 mark
[2] Yes, agents aren’t deterministic the way compilers are — same prompt tomorrow may give different code. But that may be the wrong bar moving forward. What has to stay stable is behavior under the spec; the code can vary. Also my dude, are you seriously nitpicking with Linus Torvalds?
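The footnote's claim that behavior under the spec has to stay stable while the code can vary can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the function names, the toy sample data, and the one-line "spec" are assumptions, not the author's sync program. Two different implementations of "find the offset" are checked against the same behavioral requirement.

```python
from typing import Callable, Sequence

Samples = Sequence[float]
OffsetFinder = Callable[[Samples, Samples], int]


def offset_by_correlation(master: Samples, local: Samples) -> int:
    """Pick the shift where the dot product of the overlap is largest."""
    shifts = range(len(master) - len(local) + 1)
    return max(shifts, key=lambda s: sum(m * x for m, x in zip(master[s:], local)))


def offset_by_squared_error(master: Samples, local: Samples) -> int:
    """Pick the shift where the summed squared difference is smallest."""
    shifts = range(len(master) - len(local) + 1)
    return min(shifts, key=lambda s: sum((m - x) ** 2 for m, x in zip(master[s:], local)))


def meets_spec(find_offset: OffsetFinder) -> bool:
    """Toy spec (an assumption): a clip cut from the master at sample 3 is found at 3."""
    master = [0, 0, 0, 1, 5, 2, 0, 0, 0, 0]
    local = master[3:6]
    return find_offset(master, local) == 3


if __name__ == "__main__":
    # Either implementation is acceptable as long as the spec check holds.
    for impl in (offset_by_correlation, offset_by_squared_error):
        print(impl.__name__, "meets spec:", meets_spec(impl))
```

The two functions share no logic beyond the search range, yet both satisfy `meets_spec`. In this framing the check is the durable artifact, and either body could be regenerated without anyone noticing.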

Martin Alderson 1 month ago

29th August 2026: a scenario

On 29 April 2026, a Korean security firm called Theori published 732 bytes of Python that breaks Linux container isolation. CopyFail (CVE-2026-31431) is a page-cache corruption bug in the kernel's crypto code. It's been sitting in production since 2017. A compromised pod on a shared Kubernetes node can corrupt binaries visible to every other container on that host, and to the host kernel itself. EKS, GKE, AKS, every shared-tenant node, every CI runner, every multi-tenant SaaS that took the cheap path on isolation - all exposed until patched. It took an AI tool four months to find it. Nine years of human eyes did not. Container escape is bad. Despite an arguably poorly coordinated disclosure/mitigation response [1], it looks like a near miss rather than a catastrophe. But this class of bug - old, subtle, in a corner of the kernel that everyone assumed someone else had read - is exactly the class of bug that lives in every hypervisor stack underneath every cloud. Those bugs are still there. They just haven't been found yet. Here's a (fictional) story about what happens four months from now, on 29th August 2026. As Europe basks in an extreme heatwave, many engineers are paged as EC2 instances start hard crashing. Hacker News reacts to the news as per normal - another us-east-1 outage, AWS status showing green, eyes roll. Some commenters post, though, that many other AZs are showing issues, although not all servers are affected. Over the next hour, more and more machines go down. One Reddit user posts that they are having issues provisioning even fresh machines - as soon as they launch, they get moved into "unhealthy" and go down. A few minutes later, the entire AWS dashboard and API set goes down. Cloudflare Radar shows AWS network traffic dropping to a small percentage of what is normal. As many AWS-hosted services start going down - Atlassian, Stripe, Slack, PagerDuty - some comments on Twitter report issues with Linux-based Azure instances.
Indeed, Cloudflare Radar shows significant drops in Azure traffic. News channels across Europe start leading with vague breaking news headlines on outages across Amazon. They make sure to point out that this isn't an unusual occurrence, with normal service expected to resume as it always has, and mistakenly insist only US services are affected. As the East Coast of the US starts its weekend, a very unusual step is taken. TV channels are briefed that POTUS will give an address to the nation at 8am EDT. Few connect the dots - with the emphasis being placed on a potential new strike in the Middle East, or an announcement on the Russia-Ukraine war. POTUS announces that there is a significant cybersecurity incident under way. The head of CISA (the Cybersecurity and Infrastructure Security Agency) gives a very vague but concerning warning. Americans are requested to charge their cell phones and to await further news, and are reminded that there may be outages on IPTV-based services. POTUS rounds it out by speculating that China is behind the attack, despite his much-heralded reset with Beijing earlier in the year. Other Western leaders give similar addresses - with European leaders speculating on background that it is more likely to be Russia or North Korea than China behind the attack. The French president says "without doubt" this is a nation-state actor. While he doesn't publicly point to a specific country, he says those responsible will be brought to justice. While these addresses happen, engineers at various banks are battling various outages. Most concerningly, the largest and third-largest card processors by volume in Europe have stopped accepting payments, returning cryptic error messages. While they have a multicloud strategy, they cannot move workloads off those two clouds successfully. Google Cloud Platform and smaller cloud providers - unaffected until now - start showing issues.
While current workloads are unaffected, the huge spike in demand from enterprises simultaneously activating their disaster recovery protocols completely swamps available compute on alternate providers. One smaller cloud provider tweets they are seeing 10,000 VM creation requests a second, draining their entire spare allocation in less than a minute. CEOs of major banks bombard Google and Oracle leadership with calls, offering blank cheques to secure failover compute. The calls go unanswered. WhatsApp groups throughout Europe start lighting up with misinformation that money has been stolen, amplified by many mobile apps simultaneously showing a "we are undertaking routine maintenance" fallback error, causing huge lines at ATMs and banks with people trying to withdraw their savings. As the chaos continues to grow, a press release is distributed from the leadership of AWS and Azure: "At approximately 4am EDT this morning a critical and novel vulnerability was exploited in the Linux operating system. This has caused widespread global outages of Linux-based virtual machines. Our engineers are working with security services globally to mitigate the impact and engineers across both Microsoft and AWS are working collaboratively to release emergency patches for affected software. Equally we are working hard to understand the impact and will provide regular updates to the media. We sincerely apologize for the impact this is having on our customers and society at large." Behind the scenes, it is chaos. Engineers have isolated the root causes - a complex interplay of vulnerabilities, with the most critical being a previously undiscovered logic error in the eBPF Linux subsystem that allows a hypervisor takeover. Curiously, no data has been stolen - a mistake in the exploit just leads to machines hard crashing exactly 255 seconds after receiving the malicious payload.
A few engineers question the sloppiness here, but leadership doubles down in their private communications with government that it has to be a nation state. The core issue though is that nearly all of Azure and AWS's control plane is down. Attempts to "black start" it result in perpetual failures as various subsystems collapse under the intense traffic from VMs stuck in bootloops. The first VM instances start up again. Restoration is painfully slow, with AWS struggling to get more than 2% of machines back online. Communication internally is severely degraded - with both Slack and Microsoft Teams down, instant messaging is out of the question. Amazon's corporate email runs on AWS itself, and Microsoft's on Azure-hosted Exchange. Both are degraded, massively complicating internal communications. An enterprising AWS employee starts an IRC server locally, which becomes the main channel of communication - restoration efforts start to speed up once this system becomes widely known. Restoration continues, with the worst of the panic dying down. Banks end up getting priority compute - with POTUS publicly threatening "extreme actions" if major banks are not put to the front of the queue. Asian stock markets open, triggering multiple circuit breakers. After the 3rd one in a row, Tokyo forces markets to close for the day; other Asian markets follow in quick succession. One curious question remains though - what was the purpose of this attack? No ransomware was deployed, no data was stolen, and while various terrorist groups claimed responsibility, none of the claims were believed to be credible. Meanwhile, an AWS engineer finally isolates snapshots containing the first known failure: an EC2 instance, provisioned on August 13th. Curiously, it was provisioned on an individual account in - Paris. The account matches an individual in Lyon, France. French security services are alerted. In an outer suburb of Lyon, France, French anti-terrorism police arrive at an apartment building.
A 17-year-old teenager is apprehended, along with his grandmother. Two days earlier, his own president had vowed those responsible would be brought to justice. The police chief on the scene passes the information up the chain that the lead was a total dud - there is no chance that the suggested foreign intelligence service was here. A search of the apartment confirms it - nothing found apart from a PS5 mid-FIFA tournament and a 6-year-old gaming computer. Neighbours confirm that they've seen no one enter or exit the apartment apart from the two residents, who've lived there for "as long as anyone can remember". Media arrive on the scene, with a flustered and embarrassed police chief suggesting that it was a bad tip-off and asking local residents to stay calm. The decision is made to seize the electronics and release the two "suspects". A couple of digital forensics experts get the seized gaming PC, scanning it for malware. Nothing much of interest is found, and just as they start writing their report up one folder pops up. . They take a further look, noting it on the report - not thinking much of it, probably a kid trying to play pirated games. They've seen it before. The image of the machine is uploaded. When the code gets up the chain a few hours later, the whole set of dominoes falls into place. A specialist from the French Agence nationale de la sécurité des systèmes d'information - National Cybersecurity Agency of France - pulls the code from the image. He quickly realises what's happened. The teenager had been quietly mining crypto for months, using the proceeds to rent cheap GPUs on a small European cloud provider, where he ran an uncensored fine-tune of the new Qwen 4 open weights model. He'd been desperately trying to downgrade his PS5 firmware to bypass the latest piracy checks. Interestingly, his coding agent, unbeknown to him, had found the most critical *nix kernel exploit in many decades.
Attacking a little-known eBPF module on the PS5 (the PS5, like every PlayStation since the PS3, runs FreeBSD), it managed a complete takeover of the device. Intrigued, he also asked his coding agent to run it on a Linux server on AWS that he ran a gaming forum on - same thing, but curiously he noticed he could see other files on the machine. Annoyingly, the VM he rented crashed after a few minutes. Excitedly, he set up an Azure account - same thing. He asked his coding agent what this meant, and with its usual sycophantic personality it started explaining what he could do with this - mining crypto and making himself rich beyond his wildest dreams. The agent came up with a final plan: deploy the exploit on both Azure and AWS and install a cryptominer. His last known chat message was "is this definitely a great idea?". The agent responded "You're absolutely right!", and began deploying the code, first to AWS and next to Azure. The agent had built a complex piece of malware that spread across millions of physical servers. However, it hallucinated a key Linux API, which resulted in the machines crashing after 255 seconds instead of deploying the cryptominer. This is fiction. The teenager doesn't exist. Qwen 4 doesn't exist yet either. When it does, an uncensored fine-tune will appear within days, like every prior open-weights release. Almost everything else in here is real, or close enough that it doesn't matter. CopyFail is real. A nine-year-old kernel bug, found by an AI tool in a few months after nine years of human eyes had missed it. That class of bug - old, subtle, in a corner of the kernel everyone assumed someone else had read - sits in every hypervisor stack underneath every cloud. Those bugs are still in there. They just haven't been found yet, and the rate at which they get found from now on is bounded by GPU hours, not human ones. The centralisation is the bit that's hard to think clearly about.
Most people I talk to about this, even technical people, underestimate how much of modern life is sitting on AWS and Azure. The DR plans I've seen at large enterprises mostly assume there's a cloud to fail over to. They don't really model what happens if the fallback is also down, or if every other org on earth is failing over at the same minute and draining GCP's spare capacity. Almost nobody keeps full cold standby compute. And even the ones that do are sitting on top of hundreds of services that don't: Stripe, Auth0, Twilio, Datadog, every queue and identity provider in the stack. They're all running somewhere, and that somewhere is mostly two companies. The attribution thing is the bit I'm least sure about, but worth saying anyway. Everyone is worried about nation states. Most of the big incidents that have actually happened turned out to be a kid, a misconfiguration, or someone who didn't really understand what they were doing. The Morris Worm. Mirai. The threat model in most boards' heads assumes a sophisticated adversary. The thing that's actually arriving is an unsophisticated adversary holding tools that are now sophisticated for them. I wrote this as fiction because I've spent the last few months talking to journalists and other non-technical people about what AI changes for cybersecurity, and the technical version of the argument doesn't land at all. Engineers get it instantly. Everyone else needs to feel what it looks like. So this is what it might look like, more or less. The only bit I'm reasonably confident about is that the date is wrong.

[1] The entire story here is still evolving at the time of writing, but there is a serious coordination problem on Linux security. The Linux kernel security team recommends that downstream distributions of Linux (such as Ubuntu, Fedora, Arch, etc.) are not notified of security issues. This has led to slow patches for the issue, as many distributions were not informed and only found out when it was made public. People are pointing fingers in many directions.
