
I Will Never Respect A Website

If you like this piece and want to support my independent reporting and analysis, why not subscribe to my premium newsletter? It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA, Anthropic and OpenAI’s finances, and the AI bubble writ large. I recently put out the timely and important Hater’s Guide To The SaaSpocalypse, another on How AI Isn't Too Big To Fail, and a deep (17,500-word) Hater’s Guide To OpenAI. Subscribing to premium is both great value and makes it possible to write these large, deeply-researched free pieces every week.

Soundtrack: Muse — Stockholm Syndrome

I think the most enlightening thing about AI is that it shows you how even the most mediocre text inspires some sort of emotion. Soulless LinkedIn slop makes you feel frustration with a person for their lack of authenticity, but you can still imagine how they forced it out of their heads. You still connect with them, even if it’s in a bad way.

AI copy is dead. It is inert. The reason you can spot it is that it sounds hollow. I don’t care if a website says stuff because I typed something into it, just as I don’t care if it responds in a way that sounds human, because it all feels like nothing to me.

I am not here to give a website respect. I will not be impressed by a website, nor will I grant a website any extra credit when it can’t do the right thing every time. The computer is meant to work for me. If the computer doesn’t do what I want, I change the kind of computer I use. LLMs will always hallucinate, their outputs are not trustworthy as a result, they cannot be deterministic, and any chance of a mistake of any kind is unforgivable. I don’t care how the website made you feel: it’s a machine that doesn’t always work, and that’s not a very good machine.

I feel nothing when I see an LLM’s output. Tell me thank you or whatever, I don’t care. You’re a website. Oh, you can spit out code? Amazing. Still a website.

Perhaps you’ve found value in LLMs. Congratulations! You should feel no compulsion to convince me, nor should you feel any pride in using a particular website. And if you feel you’re being judged for using AI, perhaps you should ask why you feel so vilified. Did the industry do something to somehow warrant judgment? Is there something weird or embarrassing about the product, such as it famously having a propensity to get things wrong? Perhaps it loses billions of dollars? Oh, it’s damaging to the environment too? And people are telling outright lies about it and constantly saying it’ll replace people’s jobs? And the CEOs are all greedy, oafish sociopaths?

Did you try being cloying, judgmental, condescending, and aggressive to those who don’t like AI? Oh, that didn’t work? I can’t imagine why.

Sounds embarrassing! You must really like that website.

ChatGPT is a website. Claude is a website. While I guess Claude Code runs in a terminal window, that just means it’s an app, which I put in exactly the same mental box as I do a website.

Yet everything you read or hear or see about AI does everything it can to make you think that AI is something other than a website or an app. People who “discover the power of AI” immediately stop discussing it in the same terms as Microsoft Word, Google, or any other app or website.
It’s never just about what AI can do today, but always about some theoretical “AGI” or vague shit about “AI agents” that are some sort of indeterminate level of “valuable” without anyone being able to describe why. Truly useful technology isn’t described in oblique or hyperbolic terms. For example, last week, IBM’s Dave McCann described using a series of “AI agents” to Business Insider.

Sounds like a website to me. Sounds like a website using an LLM to summarize stuff to me. Why are we making all this effort to talk about what a website does?

My friend, this isn’t a “series of agents.” It’s an LLM that looks at stuff and spits out an answer. Chatbots have done this kind of thing forever. These aren’t “agents.” “Agents” makes it sound like there’s some sort of futuristic autonomous presence rather than a chatbot that’s looking at documents using technology that’s guaranteed to hallucinate incorrect information.

Here’s a fun exercise: replace the word “agent” with “app,” and replace “AI” with “application.” In fact, let’s try that with the next quote: a variety of functions including searching for stuff, looking at stuff, generating stuff, transcribing a meeting, and searching for stuff.

Wow! Who gives a fuck. Every “AI agent” story is either about code generation, summarizing some sort of information source, or generating something based on an information source that you may or may not be able to trust.

“Agent” is an intentional act of deception, and even “modern” agents like OpenClaw and its respective ripoffs ultimately boil down to “I can send you a reminder” or “I can transcribe a text you send me.” Yet everybody seems to want to believe these things are “valuable” or “useful” without ever explaining why. A page of OpenClaw integrations claiming to share “real projects, real automations [and] real magic” includes such incredible, magical use cases as “reads my X bookmarks and discusses them with me,” “check incoming mail and remove spam,” “researches people before meetings and creates briefing docs,” “schedule reminders,” “tracking who visits a website” (summarizing information), and “using voice notes to tell OpenClaw what to do,” which includes “distilling market research” (searching for stuff) and “tightening a proposal” (generating stuff after looking at it).

I’d have no quarrel with any of this if it wasn’t literally described as magical and innovative. This is exactly the shit that software has always done — automations, shortcuts, reminders, and document work. Boring, potentially useful stuff done in an inefficient way requiring a Mac Mini and hundreds of dollars a day of API calls.

Even Stephen Fry’s effusive review of the iPad from 2010, in referring to it as a “magical object,” still referred to it as “class,” “a different order of experience,” remarking on its speed, responsiveness, and “smooth glide,” and noting that it’s so simple. Even Fry, a writer beloved for his effervescence and sophisticated lexicon, was still able to point at the things he liked (such as the design and simplicity) in clear terms. Even in couching it in terms of the future, Fry is still able to cogently explain why he’s excited about the present.

Conversely, articles about Large Language Models and their associated products often describe them in one of three ways:

As if their ability to try to do some of a task allows them to do the entire task.
As if their ability to do tasks is somehow impressive or a justification for their cost.
As an excuse for why they cannot do more, hinged on something happening in the future.

This simply doesn’t happen outside of bubbles.
The original CNET review of the iPhone — a technology I’d argue literally changed the way that human beings live their lives — still described it in terms that mirrored the reality we live in. I’d argue that technologies like cloud storage, contactless payments, streaming music and video, and digital photography have transformed our societies in ways that were obvious from the very beginning. Nobody sat around cajoling us to accept that we’d need to sunset our Nokia 3210s and get used to touchscreens, because it was blatantly obvious, upon using the first iPhone, that it was better. Nobody ostracized you for not being sufficiently excited about iPhone apps.

Git, launched in 2005, is arguably one of the single most transformational technologies in tech history, changing how software engineers built all kinds of software. And I’d argue that GitHub, which came a few years later, was equally transformational. I can’t find a single example of somebody being shamed for not being sufficiently excited, other than people arguing over whether Git was the superior version control software, or saying that GitHub, a cloud-based repository for code and collaboration, was obvious in its utility. Those that liked it didn’t feel particularly defensive. Even articles about GitHub’s growth spoke entirely in terms rooted in the present.

I realize this was before the hyper-polarized world of post-Musk Twitter, one where venture capital and the tech industry in general was a fraction of the size, but it’s really weird how different it feels when you read how the stuff that actually mattered was covered. I must repeat that this was a very different world with very different incentives. Today’s tech industry is a series of giant group chats across various social networks and physical locations, with a much larger startup community (Y Combinator’s last batch had 199 people — the first had 8) influenced heavily by the whims of investors and the various cults of personality in the valley. While social pressure absolutely existed, the speed at which it could manifest and mutate was minute in comparison to the rabid dogs of Twitter or the current state of Hacker News. There were fewer VCs, too.

In any case, no previous real or imagined tech revolution — outside of cryptocurrency, an industry with obvious corruption and financial incentives — has ever inspired such eager defensiveness, tribalism or outright aggression toward dissenters, nor such ridiculous attempts to obfuscate the truth about a product. We’ve never had a cult of personality around a specific technology at this scale.

There is something that AI does to people — in the way it both functions and the way that people react to it — that inspires them to act defensively, weirdly, tribally. I think it starts with LLMs themselves, and the feeling they create within a user.

We all love prompts. We love to be asked questions about ourselves. We feel important when somebody takes interest in what we’re doing, and even more so when they remember things about it and seem to be paying attention. LLMs are built to completely focus themselves on us, and do so while affirming every single interaction.

Human beings also naturally crave order and structure, which means we’ve created frameworks in our heads about what authoritative-sounding or -looking information looks like, and the language that engenders trust in it.
We trust Wikipedia both because it’s an incredibly well-maintained library of information riddled with citations and because it tonally and structurally resembles an authoritative source. Large Language Models have been explicitly trained (on much of the internet, including Wikipedia) to deliver information in a structured manner that makes us trust it like we would another source, massaged with language we’d expect from a trusted friend or endlessly-patient teacher. All of this is done with the intention of making you forget that you’re using a website. And that deception is what starts to make people act strangely.

The fact that an LLM can maybe do something is enough to make people try it, along with the constant pressure from social media, peers and the mainstream media. Some people — such as myself — have used LLMs to do things, seen that making them do said things isn’t going to happen very easily, and walked away, because I am not going to use a website that doesn’t do what it says.

As I’ve previously said, technology is a tool to do stuff. Some technology requires you to “get used to it” — iPhones and iPads were both novel (and weird) in their time, as was learning to use the Moonlander ZSK — but basically no example involves you tolerating the inherent failings of the underlying product under the auspices of it “one day being better.” Nowhere else in the world of technology does someone gaslight you into believing that the problems don’t exist or will magically disappear. It’s not like the iPhone only occasionally allowed you to successfully take a photo, with reliable photography something you’d have to wait until the iPhone 3GS to enjoy. While the picture quality improved over time, every generation of iPhone did the same basic things successfully, reliably, and consistently.

I also think that the challenge of making an LLM do something useful is addictive and transformative. When people say they’ve “learned to use AI,” often they mean that they’ve worked out ways to fudge their prompts, navigate its failures, mitigate its hallucinations, and connect it to various different APIs and systems of record in such a way that it now, on a prompt, does something, and because they’re the ones that built this messy little process, they feel superior — because the model has repeatedly told them that they were smart for doing it and celebrated with them when they “succeeded.”

The term “AI agent” exists as both a marketing term and a way to ingratiate the user. Saying “yeah, I used a chatbot to do some stuff” sounds boring, like you’re talking to an app or a website, but “using an AI agent” makes you sound like a futuristic cyber-warrior, even though you’re doing exactly the same thing. LLMs are excellent digital busyboxes for those who want to come up with a way to work differently rather than actually doing work.

In WIRED’s article about journalists using AI, Alex Heath boasts that he “feels like he’s cheating in a way that feels amazing.”

The linguistics of “transmitting an idea to an AI agent” misrepresent what is a deeply boring and soulless experience. Alex speaks into a microphone, his words are transcribed, then an LLM burps out a draft. A bunch of different services connect to Claude Cowork and a text document (that’s what the “custom set of instructions” is) that says how to write like him, and then it writes like him, and then he talks to it, and then sometimes he writes bits of the story himself. This is also most decidedly not automation.
Heath still must sit and prompt a model again and again. He must still maintain connections to various services and make sure the associated documents in Notion are correct. He must make sure that Granola actually gets the transcriptions from his interviews. He must (I would hope) still check both the AI transcription and the output from the model to make sure quotes are accurate. He must make sure his calendar reflects accurate information. He must make sure that Claude still follows his “voice and writing style” — if you can call it that, given the amount of distance between him and the product.

Well, Alex, you’re not telling anybody anything. Your ideas and words come out of a Large Language Model that has convinced you that you’re writing them.

In any case, Heath’s process is a great example of what makes people think they’re “using powerful AI.” Large Language Models are extremely adept at convincing human beings to do most of the work and then credit “AI” with the outcomes. Alex’s process sounds convoluted and, if I’m honest, a lot more work than the old way of doing things. It’s like writing a blog using a machine from Pee-wee’s Playhouse. I couldn’t eat breakfast that way every morning. I bet it would get old pretty quick.

This is the reality of the Large Language Model era. LLMs are not “artificial intelligence” at all. They do not think, they do not have knowledge, they are conjuring up their own training data (or reflecting post-training instructions from those developing them, or documents instructing them to act a certain way), and any time you try and make them do something more complicated, they begin to fall apart, and/or become exponentially more expensive.

You’ll notice that most AI boosters have some sort of bizarre, overly-complicated way of explaining how they use AI. They spin up “multiple agents” (chatbots) that each have their own “skills document” (a text document) and connect “harnesses” (Python scripts, text files that tell it what to do, a search engine, an API) that “let it run agentic workflows” (query various tools to get an outcome).

The so-called “agentic AI” that is supposedly powerful and autonomous is actually incredibly demanding of its human users — you must set it up in so many different ways and connect it to so many different services, and check that every “agent” (different chatbot) is instructed in exactly the right way, and that none of these agents cause any problems (they will) with each other. Oh, don’t forget to set certain ones to “high thinking” for certain tasks and make sure that other tasks that are “easier” are given to cheaper models, and make sure that those models are prompted as necessary so they don’t burn tokens.

But the process of setting up all those agents is so satisfying, and when they actually succeed in doing something — even if it took fucking forever and costs a bunch and is incredibly inefficient — you feel like a god! And because you can “spin up multiple agents,” each one ready and waiting for you to give them commands (and ready to affirm each and every one of them), you feel powerful, like you’re commanding an army that also requires you to monitor whatever it does.

The reason that LLMs have become so interesting for software engineers is that this is already how they lived. Writing software is often a case of taping together different systems and creating little scripts and automations that make them all work, and the satisfaction of building functional software is incredible, even at the early stages.
Large Language Models perform an impression of automating that process, but for the most part force you, the user, to do the shit that matters, even if that means “be responsible for the code that it puts out.” Heath’s process does not appear to take less time than his previous one — he’s just moved stuff around a bit and found a website to tell him he’s smart for doing so.

They are Language Models interpreting language without any knowledge or thoughts or feelings or ability to learn, and each time they read something they interpret meaning based on their training data, which means they can (and will!) make mistakes, and when they’re, say, talking to another chatbot to tell it what to do next, that little mistake might build a fundamental flaw into the software, or just break the process entirely. And Large Language Models — using the media — exist to try and convince you that these mistakes are acceptable.

Anthropic recently launched its Claude For Finance tool, which claims to “automate financial modeling” with “pre-built agents” (chatbots) but really appears to just be able to create questionably-useful models via Excel spreadsheets and do “financial research” based on connecting to documents in your various systems, I imagine with a specific system prompt. Anthropic also proudly announced that it had scored 55.3% on the Finance Agent Test.

I hate to repeat myself, but I will not respect a website, and I will not tolerate something being “55% good” at something if its alleged use case is that it’s an artificial intelligence.

Yet that’s the other remarkable thing about the LLM era — there are people who are extremely tolerant of potential failures because they believe they’re either A) smart enough to catch them or B) smart enough to build systems that do so for them, with a little sprinkle of “humans make mistakes too,” conflating “an LLM that doesn’t know anything fucking up by definition” with “a human being with experiences and the capacity for adaptation making a mistake.”

I truly have no beef with people using LLMs to speed up Python scripts to do fun little automations or to dig through big datasets, but please don’t try and convince me they’re being futuristic by doing so. If you want to learn Python, I recommend reading Al Sweigart’s Automate The Boring Stuff. Anyone who sneers at you and says you are being “left behind” because you’re not using AI should be forced to show you what it is they’ve created or done, and the specific system they used to do so. They should have to show you how much work it took to prepare the system, and why it’s superior to just doing it themselves.

Andrej Karpathy also had a recent (and very long) tweet about “the growing gap in understanding of AI capability,” involving more word salad than a fucking Sweetgreen. Wondering what those “staggering improvements” are? The one tangible (and theoretical!) example Karpathy gives is an example of how hard people work to overstate the capabilities of LLMs. “Coherently restructuring” a codebase might happen when you feed it to an LLM (while also costing a shit-ton of tokens, but putting that aside), or it might not understand at all because Claude Opus is acting funny that day, or it might sort-of fix it but mess something subtle up that breaks things in the future.
This is an LLM doing exactly what an LLM does — it looks at a block of text, sees whether it matches up with what a user said, sees how that matches with its training data, and then either tells you things to do or generates new code, much like it would do if you had a paragraph of text you needed to fact-check. Perhaps it would get some of the facts right if connected to the right system. Perhaps it might make a subtle error. Perhaps it might get everything wrong.

This is the core problem with the “checkmate, boosters — AI can write code!” argument. AI can write code. We knew that already. It gets “better” as measured by benchmarks that don’t really compare to real-world success, and even with the supposedly meteoric improvements over the last few months, nobody can actually explain what the result of it being better is, nor does it appear to extend to any domain outside of coding. You’ll also notice that Karpathy’s language is as ingratiating to true believers as it is vague. Other domains are left unexplained other than references to “research” and “math.” I’m in a research-heavy business, and I have tried the most powerful LLMs and highest-priced RAG/post-RAG research tools, and every time I find them bereft of any unique analysis or suggestions.

I don’t dispute that LLMs are useful for generating code, nor do I question whether or not they’re being used by software developers at scale. I just think that they would be used dramatically less if there weren’t an industrial-scale publicity campaign run through the media and the majority of corporate America both incentivizing and forcing developers to use them. Similarly, I’m not sure anybody would’ve been anywhere near as excited if OpenAI and Anthropic hadn’t intentionally sold them a product that was impossible to support long-term.

This entire industry has been sold on a lie, and as capacity becomes an issue, even true believers are turning on the AI labs.

About a year ago, I warned you that Anthropic and OpenAI had begun the Subprime AI Crisis, where both companies created “priority processing tiers” for enterprise customers (read: AI startups like Replit and Cursor), dramatically increasing the cost of running their services to the point that both had to dramatically change their features as a result. A few weeks later, I wrote another piece about how Anthropic was allowing its subscribers to burn thousands of dollars’ worth of tokens on its $100 and $200-a-month subscriptions, and asked a pointed question at the end.

I was right to ask: a few weeks ago (as I wrote in The Subprime AI Crisis Is Here), Anthropic added “peak hours” to its rate limits, and users found across the board that they were burning through their limits, in some cases in only a few prompts. Anthropic’s response was, after saying it was looking into why rate limits were being hit so fast, to say that users were ineffectively utilizing the 1-million-token context window and failing to adjust Claude’s “thinking effort level” based on whatever task it is they were doing. Anthropic’s customers were (and remain) furious, as you can see in the replies of its thread on the r/Anthropic subreddit.

To make matters worse, it appears that — deliberately or otherwise — Anthropic has been degrading the performance of both Claude Opus 4.6 and Claude Code itself, with developers, including AMD Senior AI Director Stella Laurenzo, documenting the problem at length (per VentureBeat). Think that Anthropic cares?
Think again: another developer found that Claude Opus 4.6 was “thinking 67% less than it used to,” though Anthropic didn’t even bother to respond. In fact, Anthropic has done very little to explain what’s actually happening, other than to say that it doesn’t degrade its models to better serve demand.

To be clear, this is far from the only time that I’ve seen people complain about these models “getting dumber” — users on basically every AI subreddit will say, at some point, that models randomly can’t do things they used to be able to, with nobody really having an answer other than “yeah dude, same.” Back in September 2025, developer Theo Browne complained that Claude had got dumber, but Anthropic near-immediately responded to say that the degraded responses were a result of bugs that “intermittently degraded responses from Claude.”

Which begs the question: is Anthropic accidentally making its models worse? Because it’s obvious it’s happening, it’s obvious they know something is happening, and its response, at least so far, has been to say that either users need to tweak their settings or nothing is wrong at all. Yet these complaints have happened for years, and have reached a crescendo with the latest ones, which involve, in some cases, Claude Code burning way more tokens for absolutely no reason, hitting rate limits earlier than expected, or wasting actual dollars spent on API calls.

Some suggest that the problems are a result of capacity issues over at Anthropic, which have led to a stunning (at least for software used by millions of people) amount of downtime, per the Wall Street Journal. This naturally led to boosters (and, for that matter, the Wall Street Journal) immediately saying that this was a sign of the “insatiable demand for AI compute.”

Before I go any further: if anyone has been taking $2.75-per-hour-per-GPU for any kind of Blackwell GPU, they are losing money. Shit, I think they’re losing money at $4.08. While these are examples from on-demand pricing (versus paid-up years-long contracts like Anthropic buys), if they’re indicative of wider pricing on Blackwell, this is an economic catastrophe.

In any case, Anthropic’s compute constraints are a convenient excuse to start fucking over its customers at scale. Rate limits that were initially believed to be a “bug” are now the standard operating limits of using Anthropic’s services, and its models are absolutely, fundamentally worse than they were even a month ago.

It’s January 14, 2026, and you just read The Atlantic’s breathless hype-slop about Claude Code, believing that it was “bigger than the ChatGPT moment,” that it was an “inflection point for AI progress,” and that it could build whatever software you imagined. While you’re not exactly sure what it is you’re meant to be excited about, your boss has been going on and on about how “those who don’t use AI will be left behind,” and your boss allows you to pay $200 for a year’s access to Claude Pro.

You, as a customer, no longer have access to the product you purchased. Your rate limits are entirely different, service uptime is measurably worse, and model performance has, for some reason, taken a massive dip. You hit your rate limits in minutes rather than hours. Prompts that previously allowed you a healthy back-and-forth over a project are now either impractical or impossible.
Your boss now has you vibe-coding barely-functional apps as a means of “integrating you with the development stack,” but every time you feed it a screenshot of what’s going wrong with the app, you seem to hit your rate limits again. You ask your boss if he’ll upgrade you to the $100-a-month subscription, and he says that “you’ve got to make do, times are tough.” You sit at your desk trying to work out what the fuck to do for the next four hours, as you do not know how to code, and what little you’ve been able to do is now impossible.

This is the reality for a lot of AI subscribers, though in many cases they’ll simply subscribe to OpenAI Codex or another service that hasn’t brought the hammer down on their rate limits. …for now, at least.

The con of the Large Language Model era is that any subscription you pay for is massively subsidized, and that any product you use can and will see its service degraded as these companies desperately try to either ease their capacity issues or lower their burn rate. Yet it’s unclear whether “more capacity” means that things will be cheaper, or better, or just a way of Anthropic scaling an increasingly-shittier experience.

To explain: when an AI lab like Anthropic or OpenAI “hits capacity limits,” it doesn’t mean that they start turning away business or stop accepting subscribers, but that current (and new) subscribers will face randomized downtime and model issues, along with increasingly-punishing rate limits. Neither company is facing a financial shortfall as a result of being unable to provide their services (rather, they’re facing financial shortfalls because they’re providing their services to customers). And yet, the only people paying the price for these “capacity limits” are the customers.

This is because AI labs must, when planning capacity, make arbitrary guesses about how large the company will get, and in the event that they acquire too much capacity, they’ll find themselves in financial dire straits, as Anthropic CEO Dario Amodei told Dwarkesh Patel back in February.

What happens if you don’t buy enough compute? Well, you find yourself having to buy it last-minute, which costs more money, which further erodes your margins, per The Information. In other words, compute capacity is a knife-catching game. Ordering compute in advance lets you lock in a better rate, but having to buy compute at the last minute spikes those prices, eating any potential margin that might have been saved as a result of serving that extra demand. Order too little compute and you’ll find yourself unable to run stable and reliable services, spiking your costs as you rush to find more capacity. Order too much capacity and you’ll have too little revenue to pay for it.

It’s important to note that the “demand” in question here isn’t revenue waiting in the wings, but customers that are already paying you who want to do more with the product they paid for. More capacity allows you to potentially onboard new customers, but they too face the same problems as your capacity fills.

This also begs the question: how much capacity is “enough”? It’s clear that current capacity issues are a result of the inference (the creation of outputs) demands of Anthropic’s users. What does adding more capacity do, other than potentially bringing that under control?

This also suggests that Anthropic’s (and, by extension, OpenAI’s) business model is fundamentally flawed.
At its current infrastructure scale, Anthropic cannot satisfactorily serve its current paying customer base, and even with this questionably-stable farce of a product, Anthropic still expects to burn $14 billion. While adding more capacity might potentially allow new customers to subscribe, said new customers would also add more strain on capacity, which would likely mean that nobody’s service improves but Anthropic still makes money.

It ultimately comes down to the definition of the word “demand.” Let me explain.

Data center development is very slow. Only 5GW of capacity is under construction worldwide (and “construction” can mean anything from a single steel beam to a near-complete building). As a result, both Anthropic and OpenAI are planning and paying for capacity years in advance based on “demand.” “Demand” in this case doesn’t just mean “people who want to pay for services,” but “the amount of compute that the people who pay us now and may pay us in the future will need for whatever it is they do.”

The amount of compute that a user may use varies wildly based on the model they choose and the task in question — a source at Microsoft told me in the middle of last year that a single user could take up as many as 12 GPUs with a coding task using OpenAI’s o4-mini — which means that, in a very real sense, these guys are guessing and hoping for the best.

It also means that their natural choice will be to fuck over their current users to ease their capacity issues, especially when those users are paying on a monthly or — ideally — annual basis. OpenAI and Anthropic need to show continued revenue growth, which means that they must have capacity available for new customers, which means that old customers will always be the first to be punished. We’re already seeing this with OpenAI’s new $100-a-month subscription, a kind of middle ground between its $20 and $200-a-month ChatGPT subscriptions that appears to have immediately reduced rate limits for $20-a-month subscribers.

To obfuscate the changes further, OpenAI also launched a bonus rate limit period through May 31, 2026, telling users that they will have “10x or 20x higher rate limits than Plus” on its pricing page, while also featuring a tiny little note that’s very easy for somebody to miss.

This is a fundamentally insane and deceptive way to run a business, and I believe things will only get worse as capacity issues continue. Not only must Anthropic and OpenAI find a way to make their unsustainable and unprofitable services burn less money, but they must also constantly dance with metering out whatever capacity they have to their customers, because the more extra capacity they buy, the more money they lose.

However you feel about what LLMs can do, it’s impossible to ignore the incredible abuse and deception happening to just about every customer of an AI service. As I’ve said for years, AI companies are inherently unsustainable due to the unreliable and inconsistent outputs of Large Language Models and the incredible costs of providing the services. It’s also clear, at this point, that Anthropic and OpenAI have both offered subscriptions that were impossible to provide at scale at the price and availability at which they were offered leading up to 2026, and that they did so with the intention of growing their revenue to acquire more customers, equity investment and attention.

As a result, customers of AI services have built workflows and habits based on an act of deceit.
While some will say “this is just what tech companies do, they get you in when it’s cheap then jack up the price,” doing so is an act of cowardice and allegiance with the rich and powerful. To be clear, Anthropic and OpenAI need to do this. They’ve always needed to do this. In fact, the ethical thing to do would’ve been to charge for and restrict the services in line with their actual costs, so that users could have reliable and consistent access to the services in question.

As of now, anyone that purchases any kind of AI subscription is subject to the whims of both the AI labs and their ability to successfully manage their capacity, which may or may not involve making the product that a user pays for worse. The “demand” for AI as it stands is an act of fiction, as much of that demand was conjured up using products that were either cheaper or more available. Every one of those effusive, breathless hype-screeds about Claude Code from January or February 2026 is discussing a product that no longer exists. On June 1, 2026, any article or post about Codex’s efficacy must be rewritten, as rate limits will be halved.

While for legal reasons I’ll stop short of the most obvious word, Anthropic and OpenAI are running — intentionally or otherwise — deeply deceitful businesses where their customers cannot realistically judge the quality or availability of the service long-term. These companies are also clearly aware that their services are deeply unpopular and capacity-constrained, yet they aggressively court and market toward new customers, guaranteeing further service degradations and potential issues with models. This applies even to API customers, who face exactly the same downtime and model quality issues, all with the indignity of paying on a per-million-token basis, even when Claude Opus 4.6 decides to crap itself while refactoring something, runs token-intensive “agents” to fix simple bugs, or fails to abide by a user’s guidelines. This is not a dignified way to use software, nor is it an ethical way to sell it.

How can you plan around this technology? Every month some new bullshit pops up. While incremental model gains may seem like a boon, how do you actually say “OK, let’s plan ahead” for a technology that CHANGES, for better or for worse, at random intervals? You’re constantly reevaluating model choices and harnesses and prompts and all kinds of other bullshit that also breaks in random ways because “that’s how large language models work.” Is that fun? Is that exciting? Do you like this? It seems exhausting to me, and nobody seems to be able to explain what’s good about it.

How, exactly, does this change? Right now, I’d guess that OpenAI has access to around 2GW of capacity (as of the end of 2025), and Anthropic around 1GW, based on discussions with sources. OpenAI is already building out around 10GW of capacity with Oracle, as well as locking in deals with CoreWeave ($22.4 billion), Amazon Web Services ($138 billion), Microsoft Azure ($250 billion), and Cerebras (“750MW”). Meanwhile, Anthropic is now bringing on “multiple gigawatts of Google’s next-generation TPU capacity” on top of deals with Microsoft, Hut8, CoreWeave and Amazon Web Services. Both of these companies are making extremely large bets that their growth will continue at an astonishing, near-impossible rate.
If OpenAI has reached “$2 billion a month” (which I doubt can pay for its compute) with around 2GW of capacity, this means that it has pre-ordered compute assuming it will make $10 billion or $20 billion a month in a few short years, which fits with The Information’s reporting that OpenAI projects it will make $113 billion in revenue in 2028. And if it doesn’t make that much revenue — and also doesn’t get funding or debt to support it — OpenAI will run out of money, much as Anthropic will if that capacity gets built and it doesn’t make tens of billions of dollars a month to pay for it.

I see no scenario where costs come down, or where rate limits are eased. In fact, I think that as capacity limits get hit, both Anthropic and OpenAI will degrade the experience for the user (either through model degradation or rate limit decay) as much as they can. I imagine that at some point enterprise customers will be able to pay for an even higher priority tier, and that Anthropic’s “Teams” subscription (which allows you to use the same subsidized subscriptions as everyone else) will be killed off, forcing anyone in an organization paying for Claude Code (and eventually Codex) onto the API, as has already happened for Anthropic’s enterprise users.

Anyone integrating generative AI is part of a very large and randomized beta test. The product you pay for today will be materially different in its quality and availability in mere months. I told you this would happen in September 2024. I have been trying to warn you this would happen, and I will repeat myself: these companies are losing so much more money than you can imagine, and they are going to twist the knife and take as many liberties with their users and the media as they can on the way down.

It is fundamentally insane that we are treating these companies as real businesses, either in their economics or in the consistency of the product they offer. These are unethical products sold in deceptive ways, both in their functionality and availability, and to defend them is to help assist in a society-wide con with very few winners.

And even if you like this, mark my words — your current way of life is unsustainable, and these companies have already made it clear they will make the service worse, without warning, if they even acknowledge that they’ve done so at all. The thing you pay for is not sustainable at its current price, and they have no way to fix that problem.

Do you not see you are being had? Do you not see that you are being used? Do any of you think this is good? Does any of this actually feel like progress?

I think it’s miserable, joyless and corrosive to the human soul, at least in the way that so many people talk about AI. It isn’t even intelligent. It’s just more software that is built to make you defend it, to support it, to do the work it can’t, so you can present the work as your own but also give it all the credit.

And to be clear, these companies absolutely fucking loathe you. They’ll make your service worse at a moment’s notice and then tell you nothing is wrong.

Anyone using a subscription to OpenAI or Anthropic’s services needs to wake up and realize that their way of life is going away — that rate limits will make current workflows impossible, that prices will increase, and that the product they’re selling even today is not one that makes any economic sense. Every single LLM product is being sold under false pretenses about what’s actually sustainable and possible long-term.
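As a back-of-envelope recap of the capacity math above (the linear revenue-per-gigawatt scaling is my simplifying assumption, not a sourced figure):

\[
\frac{\$2\text{B/month}}{2\,\text{GW}} \approx \$1\text{B/month per GW}, \qquad 10\,\text{GW} \times \$1\text{B/month per GW} \approx \$10\text{B/month} \approx \$120\text{B/year}
\]

That is roughly the scale of The Information's reported $113 billion projection for 2028: the compute orders only pencil out if revenue grows five- to ten-fold from today's claimed run rate.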
With AI, you’re not just the product, you’re a beta tester that pays for the privilege. And you’re a mark for untrustworthy con men selling software using deceptive and dangerous rhetoric.

I will be abundantly clear for legal reasons that it is illegal to throw a Molotov cocktail at anyone, and morally objectionable to do so. I explicitly and fundamentally object to the recent acts of violence against Sam Altman. It is also morally repugnant for Sam Altman to somehow suggest that the careful, thoughtful, determined, and eagerly fair work of Ronan Farrow and Andrew Marantz is in any way responsible for these acts of violence. Doing so is a deliberate attempt to chill the air around criticism of AI and its associated companies. Altman has since walked back the comments, claiming he “wishes he hadn’t used” a non-specific amount of the words in question. These words remain on his blog, which suggests that Altman doesn’t regret them enough to remove them.

I do, however, agree with Mr. Altman that the rhetoric around AI does need to change. Both he and Mr. Amodei need to immediately stop overstating the capabilities of Large Language Models. Mr. Altman and Mr. Amodei should not discuss being “scared” of their models, or being “uncomfortable” that men such as they are in control (unless they wish to shut down their services), or that they “don’t know if models are conscious.”

They should immediately stop misleading people through company documentation claiming that models are “blackmailing” people or, as Anthropic did in its Mythos system card, suggesting a model has “broken containment and sent a message” when it A) was instructed to do so and B) did not actually break out of any container. They must stop discussing threats to jobs without actual meaningful data that is significantly more sound than “jobs that might be affected someday, but for now we’ve got a chatbot.” Mr. Amodei should immediately cease any and all discussions of AI potentially or otherwise eliminating 50% of white-collar jobs, Mr. Altman should cease predicting when superintelligence might arrive, and Mr. Amodei should actively reject and denounce any suggestions of AI “creating a white collar bloodbath.”

Those that defend AI labs will claim that these are “difficult conversations that need to be had,” when in actuality they engage in dangerous and frightening rhetoric as a means of boosting a company’s valuation and garnering attention. If either of these men truly believed these things were true, they would do something about it other than saying “you should be scared of us and the things we’re making, and I’m the only one brave enough to say anything.”

These conversations are also nonsensical and misleading when you compare them to what Large Language Models can do, and this rhetoric is a blatant attempt to scare people into paying for software today based on what it absolutely cannot and will not do in the future. It is an attempt to obfuscate the actual efficacy of a technology as a means of deceiving investors, the media and the general public.

Both Altman and Amodei engage in the language of AI doomerism as a means of generating attention, revenue and investment capital, actively selling their software and future investment potential based on their ownership of a technology that they say (disingenuously) is potentially going to take everybody’s jobs.
Based on reports from his Instagram, the man who threw the Molotov cocktail at Sam Altman’s house was at least partially inspired by If Anyone Builds It, Everyone Dies, a doomer porn fantasy written by a pair of overly-verbose dunces spreading fearful language about the power of AI, inspired by the fearmongering of Altman himself. Altman suggested in 2023 that one of the authors might deserve the Nobel Peace Prize. I only see one side engaged in dangerous rhetoric, and it’s the one that has the most to gain from spreading it.

I need to be clear that this act of violence is not something I endorse in any way. I am also glad that nobody was hurt. I also think we need to be clear about the circumstances — and the rhetoric — that led somebody to do this, and why the AI industry needs to be well aware that the society they’re continually threatening with job loss is one full of people that are very, very close to the edge. This is not about anybody being “deserving” of anything, but a frank evaluation of cause and effect.

People feel like they’re being fucking tortured every time they load social media. Their money doesn’t go as far. Their financial situation has never been worse. Every time they read something, it’s a story about ICE patrols or a near-nuclear war in Iran, or that gas is more expensive, or that there are worrying things happening in private credit. Nobody can afford a house, and layoffs are constant.

One group, however, appears to exist in an alternative world where anything they want is possible. They can raise as much money as they want. They can build as big a building as they want anywhere in the world. Everything they do is taken so seriously that the government will call a meeting about it. Every single media outlet talks about everything they do. Your boss forces you to use it. Every piece of software forces you to at least acknowledge that it uses it too. Everyone is talking about it with complete certainty, despite it not being completely clear why.

As many people writhe in continual agony and fear, AI promises — but never quite delivers — some sort of vague utopia at the highest cost known to man. And these companies are, in no uncertain terms, coming for your job. That’s what they want to do. They all say it. They use deceptively-worded studies that talk about “AI-exposed” careers to scare and mislead people into believing LLMs are coming for their jobs, all while spreading vague proclamations about how said job loss is imminent but also always 12 months away. Altman even says that jobs that will vanish weren’t real work to begin with, much as former OpenAI CTO Mira Murati said that some creative jobs shouldn’t have existed in the first place.

These people, who sell a product with no benefit comparable on any level to its ruinous, trillion-dollar cost, are able to get anything they want at a time when those who work hard are given a kick in the fucking teeth, sneered at for not “using AI” that doesn’t actually seem to make their lives easier, and then told that their labor doesn’t constitute “real work.” At a time when nobody living a normal life feels like they have enough, the AI industry always seems to get more. There’s not enough money for free college or housing or healthcare or daycare, but there’s always more money for AI compute. Regular people face the harshest credit market in generations, but private credit and specifically data centers can always get more money and more land.

AI can never fail — it can only be failed.
If it doesn’t work, you simply don’t know how to “use AI” properly, and you will be “at a huge disadvantage” despite the sales pitch being “this is intelligent software that just does stuff.” AI companies can get as much attention as they need, their failings explained away, their meager successes celebrated like the ball dropping on New Year’s Eve, their half-assed, sub-War of the Worlds “Mythos” horseshit treated like they’ve opened the gates of Hell.

Regular people feel ignored and like they’re not taken seriously, and the people being given the most money and attention are the ones loudly saying “we’re richer than anyone has ever been, we intend to spend more than anyone has ever spent, and we intend to take your job.” Why are they surprised that somebody mentally unstable took them seriously? Did they not think that people would be angry? Constantly talking about how your company will make an indeterminate amount of people jobless, while also being able to raise over $162 billion in the space of two years and take up as much space on Earth as you please, is something that could send somebody over the edge.

Every day the news reminds you that everything sucks and is more expensive, unless you’re in AI, where you’ll be given as much money as you want and told you’re the most special person alive. I can imagine it tearing at a person’s soul as the world beats them down.

What they did was a disgraceful act of violence. Unstable people in various stages of torment act in erratic and dangerous ways. The suspect in the Molotov cocktail incident apparently had a manifesto in which he listed the names and addresses of both Altman and multiple other AI executives, and, per CNBC, discussed the threat of AI to humanity as a justification for his actions. I am genuinely happy to hear that this person was apprehended without anyone being hurt.

These actions are morally wrong, and they are also the direct result of the AI industry’s deceptive and manipulative scare campaign, one promoted by men like Altman and Amodei, as well as doomer fanfiction writers like Yudkowsky and, of course, Daniel Kokotajlo of AI 2027 — both of whom have had their work validated and propagated via the New York Times. On the subject of “dangerous rhetoric,” I think we need to reckon with the fact that the mainstream media has helped spread harmful propaganda, and that a lack of scrutiny of said propaganda is causing genuine harm.

I also do not hear any attempts by Mr. Altman to deal with the actual, documented threat of AI psychosis, and the people that have been twisted by Large Language Models to take their own lives and those of others. These are acts of violence that could have been stopped had ChatGPT and similar applications not been anthropomorphized by design, and trained to be “friendly.”

These dangerous acts of violence were not inspired by Ronan Farrow publishing a piece about Sam Altman. They were caused by a years-long publicity campaign that has, since the beginning, been about how scary the technology is and how much money its owners make. I separately believe that these executives and their cohort are intentionally scaring people as a means of growing their companies, and that these continual statements of “we’re making something to take your job and we need more money and space to do it” could be construed as a threat by somebody that’s already on edge.

I agree that the dangerous rhetoric around AI must stop.
Dario Amodei and Sam Altman must immediately cease their manipulative and disingenuous scare-tactics, and begin describing Large Language Models in terms that match their actual abilities, all while dispensing with any further attempts to extrapolate their future capabilities. Enough with the fluff. Enough with the bullshit. Stop talking about AGI. Start talking about this like regular old software, because that’s all that ChatGPT is.

In the end, if Altman wants to engage with “good-faith criticism,” he should start acting in good faith. That starts with taking ownership of his role in a global disinformation campaign. It starts with recognizing how the AI industry has sold itself based on spreading mythology with the intent of creating unrest and fear. And it starts with Altman and his ilk accepting any kind of responsibility for their actions.

I’m not holding my breath.

Neil Madden

Mythos and its impact on security

I’m sure by now you’ve all read the news about Anthropic’s new “Mythos” model and its apparently “dangerous” capabilities in finding security vulnerabilities. I’m sure everyone reading this also has opinions about that. Well, here are a few of mine.

Firstly, it’s tempting to dismiss the announcement as pure marketing hype. Anthropic are rumoured to be approaching IPO, so obviously a lot of hype is expected, and we’ve seen the “dangerous” card played before with GPT-2. Throughout the history of computer security, both new tools and security researchers themselves have often been branded as dangerous or irresponsible. If you have sympathy for this viewpoint, then I hate to break it to you: Mythos is not inserting vulnerabilities into software; they were there all along. (Vibe-coding notwithstanding).

That’s not to say that Mythos doesn’t represent a potentially interesting breakthrough. (Although apparently many existing small models are able to reproduce its findings, at least in part). And that’s not to say that releasing Mythos would not have some risk: potentially quite a large risk in some cases, and its ability to synthesise actual exploits is concerning. All security tools that find vulnerabilities come with a risk, but they also come with an upside: letting defenders find vulnerabilities too (ideally, first).

Anthropic quote costs of around $10,000–20,000 for each vulnerability they found. You can quibble around those costs, and I’m sure they’ll come down over time, but at the moment I think it’s fair to say that this won’t be run over every single software project out there. If it’s going to be used by bad actors, then it’ll probably still be somewhat targeted at high-impact systems. I’m sure we’ll see some new zero-day exploits of edge devices and probably an uptick in ransomware attacks, but it’s not like edge devices don’t regularly get exploited anyway. (Spoiler: many security products are shockingly poorly designed and implemented).

But on the plus side, I can see Mythos and similar models being an excellent add-on to your annual pentest engagement. At those costs, you’re not going to run it on every build pipeline, and there’s probably going to be a certain amount of expertise required to get the most from it on a limited budget. As with all new tools, eventually the findings will plateau. There’s only so many times you can run the same tool over the same source code and come up with new findings. That’s not to say that there won’t still be vulnerabilities (there almost certainly will), but just that the tool will not be able to find them.

As a former AI researcher myself (before modern ML exploded), I find this aspect of the Mythos write-up quite interesting. Most security tools suffer from problems with false positives, and LLMs are of course famous for that: they are “bullshit machines”. Putting it in slightly less pejorative terms, I would call them abduction machines: they generate plausible hypotheses to explain some set of observations. (Training an LLM is induction, but what they do at runtime is closer to abduction). In the case of a chatbot, the “observations” are the token context window, and the hypotheses are the plausible next-token completions. In the case of vulnerability hunting, the observations are the source code and a prompt asking to look for a vulnerability, and the hypotheses are the generated potential vulnerabilities.
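To put the abduction framing in slightly more formal terms (my notation, not Madden's): an LLM defines a conditional distribution over next tokens, and at runtime it samples continuations that are plausible given the observations sitting in the context window:

\[
t_{n+1} \sim p_\theta(\,\cdot \mid t_1, \ldots, t_n)
\]

For vulnerability hunting, the observations O are the source code plus the prompt, and each candidate finding H is effectively scored by the model's p_\theta(H | O): plausibility given the context, not verified truth. An oracle is whatever external check tells you whether H is actually real, which is exactly the signal the model cannot supply about itself.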
Despite knowing how this works, it is still kind of magic to me that the latter emerges from the former (plausible vulnerabilities from merely predicting the most likely next tokens given the context). Broadly speaking, the better the model, the more likely those hypotheses are to be accurate. But they are still wrong an awful lot of the time, and false positives are the death of productivity. We’ve all seen reports of open source projects being overwhelmed by “slop” AI-generated vulnerability reports. But recently, that seems to have changed, and a larger quantity of high-quality reports are being submitted to many high-profile projects. What changed?

I think the clue is front and centre in Anthropic’s write-up: use of AddressSanitizer (ASan) as an oracle to weed out false positives. I think this is a crucial dividing line that separates successful from unsuccessful uses of AI. This is why “agentic” (grr) AI is relatively successful at software development. The models aren’t inherently much better at writing code than any other task, but there already exists a large body of automated “bullshit correctors”: type checkers, linters, automated test suites, etc. (Many of which use techniques from earlier waves of “symbolic” AI research, just saying…) These oracles provide a clear signal about whether a hypothesis generated by the LLM is bullshit or not. (I would hypothesise that LLMs are likely to produce better code in languages with more sophisticated type systems).

Hence why we see quite a lot of progress and marketing for AI systems in such use-cases, despite those markets being relatively small compared to the AI companies’ massive valuations and funding. I’m guessing investors are not going to be pleased to have stumped up billions for a slice of a dev tools company? But software development does seem somewhat unique in this regard.

Getting back to vulnerability hunting and oracles. This is the same situation that fuzzers face: a fuzzer is generally only really good at finding vulnerabilities when there is a good oracle to decide if you’ve found one or not (PDF link). Like Mythos, fuzzers are very good at finding crashes and (via oracles like ASan) memory safety issues, but they are not going to find subtle violations of user expectations. Mythos is clearly more than just a fuzzer, though: it’s also looking at the source code and doing somewhat sophisticated “analysis” of potential weaknesses. But I think the problem of needing an oracle will remain. Without an oracle, I’m sure that Mythos would still find genuine vulnerabilities, but they would be overwhelmed by slop false positives, which would drown out the signal in the noise.

For me then, I think this is the most interesting open question for LLM-based vulnerability finders: which classes of bugs can we write (or train) good oracles for? I think potentially quite a lot, but definitely not everything. I think humans will still have an edge in finding complex bugs for a long time to come.

Obviously Anthropic’s take is that you should use AI tools to find all the bugs first. You could dismiss that as an obvious attempt to cash in, and it even has shades of a protection racket. But I do think that Mythos and models like it are probably worth using, as an add-on to a human penetration test or similar.
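To make the oracle idea concrete, here is a minimal sketch in C of the division of labour this implies. Everything below is my own toy example (the function, the bug, and the build line are illustrative, not taken from the Mythos write-up): a model or fuzzer proposes that a parser overflows a heap buffer, and AddressSanitizer, not the model, gets the final say on whether the report is real.

```c
/*
 * Hypothetical target: a length-prefixed field parser that trusts an
 * attacker-controlled length byte. An LLM's "hypothesis" that this
 * overflows is just plausible text until an oracle confirms it.
 *
 * Build with libFuzzer plus ASan as the oracle:
 *   clang -g -fsanitize=address,fuzzer poc.c -o poc && ./poc
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void parse_field(const uint8_t *data, size_t len) {
    if (len < 1) return;
    uint8_t field_len = data[0];   /* attacker-controlled length */
    uint8_t *buf = malloc(16);
    if (!buf) return;
    if (len > 1) {
        /* BUG: field_len can exceed both the 16-byte buffer and the
         * remaining input. When the fuzzer finds such an input, ASan
         * aborts with a heap-buffer-overflow report: the hypothesis
         * is confirmed rather than left as a plausible false positive. */
        memcpy(buf, data + 1, field_len);
    }
    free(buf);
}

/* libFuzzer entry point: the fuzzer generates candidate inputs, and
 * ASan acts as the ground-truth oracle separating crashes from slop. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_field(data, size);
    return 0;
}
```

The useful property is that the verdict is binary and trustworthy: either ASan produces a heap-buffer-overflow report or it doesn't. That is precisely the kind of ground truth that doesn't exist for, say, authorization logic bugs, which is why the oracle question matters so much.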
Tools like Mythos continue to find a frightening array of vulnerabilities because there is a terrifying quantity of them out there, and we keep adding more. The situation is not really going to be improved by throwing more AI at the problem. If anything, the rise of vibe-coding is likely to be accelerating this trend. As I’ve covered here before, even apparent experts write total garbage security services when assisted by an LLM. If you want secure software, you need to slow down and think carefully and deeply, not rush ever faster to churn out more and more junk software.

In the short term, we can just keep doing the things we know how to do: thinking about security earlier in the design process, incorporating basic security tools and testing into build pipelines, ensuring you can patch CVEs quickly (but not too quickly), etc.

Longer-term, we know how to solve some classes of vulnerabilities altogether. For example, we know that memory-safe programming languages eliminate whole swathes of potential issues, including many of the sort that Mythos is good at discovering. We’ve known this for decades but still write lots of software in unsafe languages. Numerous reports and government proclamations are slowly shifting that, but we still have a very long way to go. Capability-based security would solve many other classes of vulnerabilities, whether in CPUs, operating systems, or in supply chains.

These are not easy fixes, and they would require a massive investment over many years. Profit-driven companies are not going to pursue them without regulatory pressure, and are largely going in the opposite direction. Such fundamental changes would not solve everything: there will still be vulnerabilities, but fewer and of lower severity if we do it right. Strong security mechanisms can provide a multiplicative reduction in risk. But if we really do want more secure software, and a foundation to our digital society that we can actually trust, then I don’t see an alternative. Finding and fixing individual vulnerabilities will never deliver that, however good the tools get.

Stratechery Yesterday

OpenAI’s Memos, Frontier, Amazon and Anthropic

Breaking down OpenAI's internal memo about taking on Anthropic in the enterprise.

Stratechery 2 days ago

Mythos, Muse, and the Opportunity Cost of Compute

In January 2025, Doug O’Laughlin at Fabricated Knowledge declared that o1 and reasoning models marked the end of Aggregation Theory:

I believe that there is no practical limit to the improvements of models other than economics, and I think that will be the real constraint in the future. It is reasonable that if we spent infinite dollars on a model, it would be improved. The problem is whether infinite dollars would make sense for a business. That is going to be the key question for 2025. How do the economics of AI make this work? One of the core assumptions about the internet has just been broken. Marginal costs now exist again, meaning that most hyperscalers will become increasingly capital-intensive. The era of Aggregation Theory is behind us, and AI is again making technology expensive. This relation of increased cost from increased consumption is anti-internet era thinking. And this will be the big problem that will be reckoned with this year. Hyperscalers’ business models are mainly underpinned by the marginal cost being zero. So, as long as you set up the infrastructure and fill an internet-scale product with users, you can make money. This era will soon be over, and the future will be much weirder and more compute-intensive. Looking back on the 2010s, we will probably consider them a naive time in the long arc of technology. One of our fundamental assumptions about this period is unraveling. This will be the single most significant change in the technology landscape going forward.

Aggregation Theory was, if I may say so myself, the single best way to understand the 2010s, particularly consumer tech. It explained the dynamics undergirding Google and Facebook’s dominance, as well as the App Store and Amazon’s e-commerce business; it was also a useful (albeit incomplete) framework to understand an entire host of consumer services like Uber, Airbnb, and Netflix. It’s worth pointing out, however, that some of the critical insights undergirding Aggregation Theory are much older, and are embedded in the fundamental nature of tech itself. They are, as O’Laughlin notes, rooted in the concept of zero marginal costs.

Marginal costs are how much it costs to make one more unit of a good. Consider a widget-making factory:

- You need land for the factory
- You need machines for the factory
- You need electricity to operate the machines
- You need humans to operate the machines
- You need the raw material for the widgets

Land and machines are clearly fixed costs; you have to have both to get started, and you are paying for both whether or not you make one more widget. Raw material, on the other hand, is clearly a marginal cost: if you make one more widget, you need one more widget’s worth of raw material. When it comes to physical goods, electricity and humans are also marginal costs: you need more or fewer of them depending on whether you make more or fewer widgets.

Where marginal costs matter is that they provide a price floor. Companies will operate unprofitably because profit and loss is an accounting concept that incorporates depreciation, i.e. your fixed costs. For example, imagine that a company spent $1,000 on a factory to make widgets that have a marginal cost of $10: as long as the price of widgets is >$10, the company will make them even if they don’t earn enough money to cover their depreciation costs (i.e. they operate at a loss), because at least they are still making a marginal profit on each widget (what the company may not do is invest in any more fixed costs, and, eventually, it will probably go bankrupt from interest on the debt that likely financed those fixed costs).

I explain all of this precisely because it’s almost completely immaterial to tech.
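To put numbers on that price-floor logic, here is a toy version of the widget example above (my own sketch; the sale price, volume, and five-year depreciation schedule are assumptions for illustration):

```python
# Numbers from the widget example above; price and volume are assumptions.
fixed_cost = 1_000        # the factory (a fixed, sunk cost)
marginal_cost = 10        # per-widget raw material etc.
price = 11                # assumed: just above the $10 price floor
units = 150               # assumed annual volume
depreciation = fixed_cost / 5   # assume a five-year write-off: $200/year

contribution = (price - marginal_cost) * units   # $150 of marginal profit
accounting_result = contribution - depreciation  # -$50: an accounting loss

# Production still makes sense: every widget sold contributes $1 toward
# the fixed costs, so stopping would only make the loss bigger. What the
# firm can't justify is investing in *more* fixed costs.
```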
Why immaterial? First, there generally are no raw material costs, because the outputs are digital. Second, because there are no raw material costs, and because the fixed costs are so large, electricity and humans are generally treated as fixed costs, not marginal costs: of course you will run your servers all of the time and at full capacity, because every scrap of additional revenue you can generate is worth it.

AI very much fits in this paradigm: the output is digital, and while AI chips use a lot of electricity, the cost is a fraction of the cost of the chips themselves, which is to say that no one with AI chips is making marginal cost calculations in terms of utilizing them. They’re going to be used! Rather, the decision that matters is what they will be used for.

Consider Microsoft: last quarter the company missed the Street’s Azure growth expectations not because there wasn’t demand, but because the company decided to use its capacity for its own products. CFO Amy Hood said on the company’s earnings call:

I think it’s probably better to think about the Azure guidance that we give as an allocated capacity guide about what we can deliver in Azure revenue. Because as we spend the capital and put GPUs specifically, it applies to CPUs, the GPUs more specifically, we’re really making long-term decisions. And the first thing we’re doing is solving for the increased usage in sales and the accelerating pace of M365 Copilot as well as GitHub Copilot, our first-party apps. Then we make sure we’re investing in the long-term nature of R&D and product innovation. And much of the acceleration that I think you’ve seen from us and products over the past a bit is coming because we are allocating GPUs and capacity to many of the talented AI people we’ve been hiring over the past years. Then, when you end up, is that, you end up with the remainder going towards serving the Azure capacity that continues to grow in terms of demand. And a way to think about it, because I think, I get asked this question sometimes, is if I had taken the GPUs that just came online in Q1 and Q2 in terms of GPUs and allocated them all to Azure, the KPI would have been over 40. And I think the most important thing to realize is that this is about investing in all the layers of the stack that benefit customers. And I think that’s hopefully helpful in terms of thinking about capital growth, it shows in every piece, it shows in revenue growth across the business and shows as OpEx growth as we invest in our people.

The cost that Microsoft is contending with here is not marginal cost, but rather opportunity cost: compute spent in one area cannot be used in another area. In the case of these earnings, Microsoft was admitting that it could have made its Azure number if it wanted to, but chose to prioritize its own workloads because, as CEO Satya Nadella noted later in the call, those have higher gross margin profiles and higher lifetime value.

It’s opportunity costs, not marginal costs, that are the challenge facing hyperscalers. How much compute should go to customers, and which ones? How much should be reserved for internal workloads? Microsoft needs to balance Azure — both for its enterprise customers and OpenAI — and its software business; Amazon needs to balance its e-commerce business, AWS, and its strategic investments in both Anthropic and OpenAI. Google has to balance GCP, its own strategic investment in Anthropic, and its consumer businesses.

Last week Anthropic announced Mythos, its most advanced model.
And, in somewhat typical Anthropic fashion, it did so by focusing on its dangers; from the introductory post for Project Glasswing, the company’s initiative for leveraging Mythos to address security:

We formed Project Glasswing because of capabilities we’ve observed in a new frontier model trained by Anthropic that we believe could reshape cybersecurity. Claude Mythos Preview is a general-purpose, unreleased frontier model that reveals a stark fact: AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities. Mythos Preview has already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. The fallout—for economies, public safety, and national security—could be severe. Project Glasswing is an urgent attempt to put these capabilities to work for defensive purposes.

In an Update last week I analogized Anthropic’s “disaster-porn-as-marketing-tool” approach to The Boy Who Cried Wolf; what’s important about that analogy is not just that the boy raised false alarms, but also that, in the end, the wolf did come. To that end, I wrote two weeks ago about the myriad of security issues that underpin all software, and my optimism that AI would solve these issues in the long run, even if it made things much worse in the short run. In other words, it’s actually not important whether or not Mythos represents a major security threat: if this model doesn’t, a future model will; to that end, I do support leveraging Mythos to proactively find and fix bugs before bad actors can find and exploit them.

At the same time, it’s also worth noting that there are other reasons for Anthropic not to make Mythos widely available, limiting access to a finite number of companies with a high capacity and willingness to pay.

The first is those opportunity costs: Anthropic is already short on compute serving its current models; X was overrun with complaints and debates this weekend about Anthropic allegedly dumbing down Claude over the last month or so. Making Mythos more widely available — particularly to subscription plans that don’t pay per usage — would make the situation much worse. In other words, Anthropic isn’t facing a marginal cost problem, but an opportunity cost problem: where to allocate its compute.

Of course this could become a margin problem: I suspect that Anthropic is going to overcome its conservatism in terms of compute by acquiring more compute from hyperscalers and neoclouds, and paying dearly for the privilege. The key to handling those costs will be to charge more for Claude going forward; that, by extension, means maintaining pricing power, which leads to a second benefit of not releasing Mythos broadly.

Anthropic certainly faces competition from OpenAI; for both frontier labs, however, the real competition in the long run is open source models. Right now those primarily come from China, and a key ingredient in fast-following frontier models is distillation; from Anthropic’s blog:

We have identified industrial-scale campaigns by three AI laboratories—DeepSeek, Moonshot, and MiniMax—to illicitly extract Claude’s capabilities to improve their own models.
These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions. These labs used a technique called “distillation,” which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.

I absolutely believe this is a real problem, and wrote as much when DeepSeek R1 was released last year. I also think it’s in the interest of everyone other than the frontier labs to pretend that it isn’t; open source models are not subject to the frontier labs’ markup or compute constraints, which is exactly why it benefits most companies to have them available, whether or not they are distilled. Of course that doesn’t mean they are free to run: you still need to provide the compute. Notice, however, how that makes stopping distillation even more of a priority for the frontier labs: first, they want to protect their margins. Second, however, their biggest cost is opportunity cost: the customers they can’t serve because they don’t have enough compute. The extent to which they can make compute less useful for their potential customers — by stopping open source models from distilling their models — is the extent to which they can acquire that compute for themselves at more favorable rates.

Mythos wasn’t the only new model announced last week: Meta released the first fruit of their new frontier lab as well. From the company’s blog post:

Today, we’re excited to introduce Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration. Muse Spark is the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts. To support further scaling, we are making strategic investments across the entire stack — from research and model training to infrastructure, including the Hyperion data center… Muse Spark offers competitive performance in multimodal perception, reasoning, health, and agentic tasks. We continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows.

Muse Spark isn’t state of the art, but it’s in the game, and overall a positive first impression from Meta Superintelligence Labs. What is most notable to me, however, is the extent to which the last nine months of AI have made clear that CEO Mark Zuckerberg made the right call to embark on that “ground-up overhaul of [Meta’s] AI efforts”.

The trigger for O’Laughlin’s post that I opened this Article with was reasoning, where models using more tokens led to better answers; since then agents have exponentially increased token demand, as they can use LLMs continuously without a human in the loop. This is a huge driver in sky-rocketing demand for Claude, as well as OpenAI’s Codex. Moreover, this use case is so potentially profitable that not only is Anthropic’s revenue sky-rocketing, but OpenAI is pivoting its focus to enterprise.
Indeed, you can make the argument that one of OpenAI’s biggest challenges is the fact that it has such a popular consumer product in ChatGPT. I, with my Aggregation Theory lens, have long maintained that that userbase was a big advantage for OpenAI, but that assumed that the company could effectively monetize it, which is why I have argued so vociferously for an advertising model. OpenAI has big projections for exactly that, but until that materializes, that big consumer base is a big opportunity cost in terms of OpenAI’s focus and compute. The company has, to its credit and in the face of widespread skepticism, made significant investments in more compute, but the temptation to allocate more and more compute to agentic use cases that enterprises will pay for, even at the expense of the consumer business, will be very large.

This puts Meta in a unique position relative to everyone else in the industry: unlike any of the hyperscalers or the frontier labs, Meta does not have an enterprise or cloud business to worry about. That means that serving the consumer market comes with no opportunity costs — and even if it did, those costs would be much smaller, given that Meta already has an at-scale advertising business to monetize usage. In other words, Meta may actually face less competition in winning the consumer space than it might have seemed a few months ago, simply because that is their primary focus — and because they have their own model, which means they don’t need to worry about not having access to the frontier labs (much of this analysis applies to Google, of course).

This, by the same token, is why Meta should open source Muse, just like they did Llama. The entities that will be most hurt by widespread availability of a frontier model are other frontier labs, who will see their pricing power reduced and face increased competition for compute. This will make it even harder for them to bear the opportunity cost of pursuing the consumer market, leaving it for Meta.

So is “the era of Aggregation Theory…behind us”? On one hand, the insight that the way to create and maintain value will come from owning the customer is almost certainly going to continue to be the case. On the consumer side, owning customers leads to advertising, which provides the revenue to provide services to customers. On the enterprise side — which, I would note, has never been an arena where Aggregation Theory was meant to be applied — I think it’s likely that both Anthropic and OpenAI continue to move up the stack and deliver features that compete with software providers directly (an approach that is also in line with not making leading-edge models publicly available).

On the other hand, O’Laughlin’s observation that we are and will continue to be compute constrained is an important one: companies will not be able to assume they can serve everyone, because serving one set of customers imposes the opportunity cost of not serving another. This won’t, at least in theory, last forever: at some point AI will be “good enough” for enough use cases that there will be enough compute capacity to take advantage of the fact that there really aren’t meaningful marginal costs entailed in serving AI; that theoretical future, however, feels further away than ever.

OpenAI is betting that this compute constraint — and the deals they have made to overcome it — will matter more than Anthropic’s current momentum with end users.
From Bloomberg:

OpenAI told investors this week that its early push to dramatically increase computing resources gives it a key advantage over Anthropic PBC at a moment when its longtime rival is gaining ground and mulling a potential public offering. The ChatGPT maker said it has outpaced Anthropic by “rapidly and consistently” adding computing capacity to support wider adoption of its software, according to a note the company sent to some of its investors after Anthropic announced a more powerful AI model called Mythos. The ambitious infrastructure build-out, criticized by some as too costly, has enabled OpenAI to better keep pace with rising demand for AI products, the memo states.

I’m less certain that this will be dispositive. When it comes to AI, distribution and transaction costs are still free — the two preconditions for Aggregators — which means that the winners should be those with the most compelling products. Those products will win the most users, providing the money necessary to source the compute to serve them; consider Anthropic’s deal to secure a meaningful portion of TPU supply, which, given the capacity constraints at TSMC, is ultimately an example of taking supply from Google. I suspect that Anthropic can take more, including already-built hyperscaler and neocloud capacity. Yes, that compute will be more expensive, but if demand is high enough the necessary cash flow will be there.

In other words, my bet is that owning demand will ultimately trump owning supply, suggesting that the underlying principles of Aggregation Theory live on. To put it another way, I think that OpenAI will need to win with better products, not just more compute; then again, if more compute is the key to better products, then does supply matter most? Regardless, they’ll certainly be focused on delivering both to the enterprise customers who are driving Anthropic’s astonishing growth. The real cost may be the consumer market they currently dominate, given that Meta has nothing to lose and everything to gain.
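As a toy illustration of the allocation problem this piece keeps returning to (my own sketch, with invented capacity and margin figures; only the greedy highest-margin-first logic is the point):

```python
# Hypothetical numbers: fixed GPU capacity, competing workloads with
# different gross margins per GPU-hour. Marginal cost is ~zero either
# way, so the only question is what each hour *displaces*.
capacity_hours = 1_000_000

workloads = {                  # margin per GPU-hour, illustrative only
    "first-party Copilots": 3.0,
    "internal R&D": 2.5,
    "Azure customers": 1.5,
}
demand = {                     # hours each workload could absorb
    "first-party Copilots": 400_000,
    "internal R&D": 300_000,
    "Azure customers": 600_000,
}

# Fill capacity from the highest-margin workload down.
remaining = capacity_hours
for name, margin in sorted(workloads.items(), key=lambda kv: -kv[1]):
    allocated = min(demand[name], remaining)
    remaining -= allocated
    print(f"{name}: {allocated:,} hours at ${margin}/hour")

# Azure gets only the leftover 300,000 hours: its "cost" is the
# higher-margin work those hours would otherwise have done.
```

The marginal cost of running any of these workloads is effectively zero; the cost of serving the cloud customers is the higher-margin work those GPU-hours could have done instead.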


One Developer, Two Dozen Agents, Zero Alignment

Why we need collaborative AI engineering


A little tool to visualise MoE expert routing

I've been curious for a while about what's actually happening inside Mixture of Experts models when they generate tokens. Nearly every frontier model these days (Qwen 3.5, DeepSeek, Kimi, and almost certainly Opus and GPT-5.x) is a MoE - but it's hard to get an intuition for what "expert routing" actually looks like in practice. So I built a small tool to visualise it: moe-viz.martinalderson.com

You can pick between a few different prompts, watch the generation animate out, and see exactly which experts fire at each layer for each token. The top panel shows routing as the token is generated, the bottom panel builds up a cumulative heatmap across the whole generation. I built this by modifying the llama.cpp codebase to output more profiling data, with Claude Code's help. So it may have serious mistakes, but it was a really fun weekend project.

The thing that really surprised me: for any given (albeit short) prompt, ~25% of experts never activate at all. But it's always a different 25% - run a different prompt and a different set of experts goes dormant. That's a much more interesting result than I expected.

Interestingly, Gemma 26B-A4 runs really well with the "CPU MoE" feature - 4B params is not a lot to run on a fairly fast CPU, and having the KV cache on the GPU really helps. I think there's a lot of performance improvement that could be done with MoE inference locally as well - e.g. caching certain experts on GPU vs CPU.

If you're interested in learning more about LLM inference internals, I'd certainly recommend pointing your favourite coding agent at the llama.cpp codebase and getting it to explain the various parts - it really helped me learn a lot.
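For what it's worth, the cumulative view is easy to reproduce if you have per-token routing logs. Here's a minimal sketch of the aggregation (my own reconstruction, not the tool's actual code, and the log format is invented for illustration):

```python
import numpy as np

# Invented log format: one record per (layer, token) with the expert
# ids the router selected, e.g. parsed from instrumented llama.cpp output.
routing_log = [
    {"layer": 0, "token": 0, "experts": [3, 17]},
    {"layer": 0, "token": 1, "experts": [3, 42]},
    # ... one entry per layer for every generated token
]

n_layers, n_experts = 32, 64   # assumed model shape, for illustration

# Cumulative heatmap: how often each expert fired in each layer.
heatmap = np.zeros((n_layers, n_experts), dtype=int)
for rec in routing_log:
    for e in rec["experts"]:
        heatmap[rec["layer"], e] += 1

# The stat from the post: the share of (layer, expert) slots that
# never activated at any point in the generation.
dormant = (heatmap == 0).mean()
print(f"{dormant:.0%} of experts never activated for this prompt")
```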


From Painfully Explicit to Implicit in Lean

Note: AI was used to edit this post. As proof of human thought and input, I am also publishing the original draft, which was written in full before asking AI to edit the post with me. This post is aimed at Lean language beginners who are interested in writing proofs in Lean, but still feel lost when reading Lean code. A very simplified mental model of Lean is that at the core there are two systems:


Deep Dish Swift 2026 recap

I gave the opening keynote at Deep Dish Swift 2026 today! It was so fun. The event is very iOS-oriented, drew nearly 300 people in the Chicago area, and it was cool meeting so many folks who I hadn’t met in person before. Before the event, I actually spoke on the podcast for the event, called Slices! I love all of this event’s pizza puns so much. The vibe throughout the podcast and the conference was fun, indie, and just cool.

My talk, speaking of pizza puns… started with a pizza pun, where I led with, “I thought this conference was about eating deep dish swiftly…” and my slides were literally just about pizza. The talk itself was actually about AI, the state of the industry, and how we need to share our knowledge and support others… but again, the visuals were all pizza. This was honestly one of the hardest-to-write talks I’ve ever done. Doing the technical AI part of the talk while making my slides plausibly look like a real talk about pizza was far more challenging than I expected it to be. But I’m happy with the outcome! I’ve never had so many people approach me just to ask about my talk-writing process before, with one attendee saying, “I’ve never seen a metaphor so gracefully navigated,” which is the most flattering thing I’ve ever heard someone say to me while I’m wearing a shirt covered in pepperoni.

Stratechery 5 days ago

2026.15: Myth and Mythos

Welcome back to This Week in Stratechery! As a reminder, every Friday we send out this overview of content in the Stratechery bundle; highlighted links are free for everyone. Additionally, you have complete control over what we send to you. If you don’t want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings. On that note, here were a few of our favorites this week. This week’s Sharp Tech video is on why OpenAI’s enterprise pivot makes sense.

Anthropic Anthropic Anthropic. In the current AI era, it feels like a new company is crowned the winner every few months, and right now Anthropic is wearing the crown. However, a point I make on Sharp Tech is that Anthropic’s exponential growth includes the part of the curve everyone misses: the company has been on this once-barely-visible trajectory for nearly two years now. Now the company has what is undoubtedly the most powerful model in the world — so powerful, in fact, that Anthropic says it can’t release it publicly. There’s reason for cynicism, given Anthropic’s history, but the part of the “Boy Who Cried Wolf” myth everyone forgets is that the wolf did come in the end. — Ben Thompson

The New York Times and Another Paradigm Shift. If you’re interested in media, this week’s Stratechery Interview with New York Times CEO Meredith Kopit Levien is a fantastic listen. The Times has nailed the internet era better than any media company in the world, and they’ve succeeded by making deliberate choices — a paywall before it was cool, a clear point of view, integrated business and editorial strategies — to differentiate themselves from a sea of commoditized content in an era of aggregators and content abundance. That playbook worked wonders for the Times in the previous generation of the internet, and I enjoyed hearing Levien’s thoughts on updating it for an era dominated by AI and video. — Andrew Sharp

The New Yorker Explains Sam Altman. This week’s Sharp Text hit a few different beats, including thoughts on the Strait of Hormuz and a fun bit of E-ZPass history, but I opened with a take on the sprawling Sam Altman profile from the New Yorker. The 16,000-word profile is certainly an exhaustive recital of questions that have been asked about Altman for more than a decade, but better topics went unexplored. It’s frustrating — and representative of too much tech coverage — that so much effort went into what’s effectively a well-written Wikipedia entry, anchored by a predetermined conclusion, and ignoring more dramatic questions than whether Sam Altman is a good person. — AS

OpenAI Buys TBPN, Tech and the Token Tsunami — OpenAI’s purchase of TBPN makes no sense, which may be par for the course for OpenAI. Then, AI is breaking stuff, starting with tech services.

Anthropic’s New TPU Deal, Anthropic’s Computing Crunch, The Anthropic-Google Alliance — Anthropic needs compute, and Google has the most: it’s a natural partnership, particularly for Google.

Anthropic’s New Model, The Mythos Wolf, Glasswing and Alignment — Anthropic says its new model is too dangerous to release; there are reasons to be skeptical, but to the extent Anthropic is right, that raises even deeper concerns.

An Interview with New York Times CEO Meredith Kopit Levien About Betting on Humans With Expertise — An interview with New York Times Company CEO Meredith Kopit Levien about human expertise as a moat against Aggregators and AI.
Hormuz, Rushmore and a Sam Altman Story That Missed the Story — On the New Yorker’s profile of Sam Altman, the future in the Middle East, and the power of E-ZPass history.

OpenAI Buys TBPN
Mythos, Altman, New York Times
VLIW: The “Impossible” Computer
Gas Turbine Blades and their Heat-Defying Single-Crystal Superalloys
A Ceasefire and Reports of PRC Pressure; Another Politburo Investigation; Mythos, DeepSeek, and a Token Crunch
An Exclusive Hornets-Suns Report and Mail on LeBron, Wemby, the Pistons, ABS in the NBA, Bulls Fandom for Kids
Malone to Carolina and Karnisovas Out in Chicago, Cooper and Kon Battling to the Finish, A Jokic-Wemby Classic in Denver
Mythos and Project Glasswing, The Year of Anthropic Continues Apace, Q&A on the NYT, Altman, De-globalization


Premium: The Hater's Guide to OpenAI

Soundtrack: The Dillinger Escape Plan — Setting Fire To Sleeping Giants

In what The New Yorker’s Andrew Marantz and Ronan Farrow called a “tense call” after his brief ouster from OpenAI in 2023, Sam Altman seemed unable to reckon with a “pattern of deception” across his time at the company:

No, he cannot. Sam Altman is a deeply untrustworthy individual who, like OpenAI, lives on the fringes of truth, using a compliant media to launder statements that are, for legal reasons, difficult to call “lies” but certainly resemble them. For example, back in November 2025, Altman told venture capitalist Brad Gerstner that OpenAI was doing “well more” than $13 billion in annual revenue when the company would do — and this is assuming you believe CNBC’s source — $13.1 billion for the entire year. I guarantee you that, if pressed, Altman would say that OpenAI was doing “well more than” $13 billion of annualized revenue at the time, which was likely true based on OpenAI’s stylized math, which works out as so (per The Information):

This means that, per CNBC’s reporting, OpenAI barely scratched $10 billion in revenue in 2025, and that every single story about OpenAI’s revenue other than my own reporting (which came directly from Azure) massively overinflates its sales. The Information’s piece about OpenAI hitting $4.3 billion in revenue in the first half of 2025 should really say “$3.44 billion,” but even then, my own reporting suggests that OpenAI likely made a mere $2.27 billion in the first half of last year, meaning that even that $10 billion number is questionable.

It’s also genuinely insane to me that more people aren’t concerned about OpenAI, not as a creator of software, but as a business entity continually misleading its partners, the media, and the general public. To put it far more bluntly, the media has failed to hold OpenAI accountable, enabling a company built on deception, rationalizing and normalizing ridiculous and impossible ideas just because Sam Altman said them.

Let me give you a very obvious example. About a month ago, per CNBC, “...OpenAI reset spending expectations, telling investors its compute target was around $600 billion by 2030.” This is, on its face, a completely fucking insane thing to say, even if OpenAI was a profitable company. Microsoft, a company with hundreds of billions of dollars of annual revenue, has about $42 billion in quarterly operating expenses.

OpenAI cannot afford to pay these agreements. At all. Hell, I don’t think any company can! And instead of saying that, or acknowledging the problem, CNBC simply repeats the statement of “$600 billion in compute spend,” laundering Altman and OpenAI’s reputation as it did (with many of the same writers and TV hosts) with Sam Bankman-Fried. CNBC claimed mere months before the collapse of FTX that it had grown revenue by 1,000% “during the crypto craze,” with its chief executive having “...survived the market wreckage and still expanded his empire.” You might say “how could we possibly know?” and the answer is “read CNBC’s own reporting that said that Bankman-Fried intentionally kept FTX in the Bahamas,” which said that Bankman-Fried had intentionally reduced his stake in Canadian finance firm Voyager (which eventually collapsed on similar terms to FTX) to avoid regulatory disclosures around (Bankman-Fried’s investment vehicle) Alameda’s finances.
This piece was written by a reporter who has helped launder the reputation of Stargate Abilene, claiming it was “online” despite only a fraction of its capacity actually existing. The same goes for OpenAI’s $300 billion deal with Oracle, which OpenAI cannot afford and Oracle does not have the capacity to serve. These deals do not make any logical sense, the money does not exist, and the utter ridiculousness of reporting them as objective truths rather than ludicrous overpromises allowed Oracle’s stock to pump and OpenAI to continue pretending it could actually ever have hundreds of billions of dollars to spend.

OpenAI now claims it makes $2 billion a month, but even then I have serious questions about how much of that is real money, considering the proliferation of discounted subscriptions (such as the ones that pop up when you cancel and offer you three months of discounted access to ChatGPT Plus) and free compute deals, such as the $2,500 given to Ramp customers, millions of tokens in exchange for sharing your data, the $100,000 token grants given to AI policy researchers, and the OpenAI For Startups program that appears to offer thousands (or even tens of thousands) of dollars of tokens to startups. While I don’t have proof, I would bet that OpenAI likely includes these free tokens in its revenues and then counts them as part of its billions of dollars of sales and marketing spend.

I also think that revenue growth is a little too convenient, accelerating only to match Anthropic, which recently “hit” $30 billion in annualized revenue under suspicious circumstances. I can only imagine OpenAI will soon announce that it’s actually hit $35 billion in annualized revenue, or perhaps $40 billion, and if that happens, you know that OpenAI is just making shit up.

Regardless, even if OpenAI is actually making $2 billion a month in revenue, it’s likely losing anywhere from $4 billion to $10 billion to make that revenue. Per my own reporting from last year, OpenAI spent $8.67 billion on inference to make $4.329 billion in revenue, and that’s not including training costs that I was unable to dig up — and those numbers were before OpenAI spent tens of millions of dollars in inference costs propping up its doomed Sora video generation product, or launched its Codex coding environment. In simpler terms, OpenAI’s costs have likely accelerated dramatically with its supposed revenue growth.

And all of this is happening before OpenAI has to spend the majority of its capital. Oracle has, per my sources in Abilene, only managed to successfully build and generate revenue from two buildings out of the eight that are meant to be done by the end of the year, which means that OpenAI is only paying a small fraction of the final costs of one Stargate data center. Its $138 billion deal with Amazon Web Services is only in its early stages, and as I explained a few months ago in the Hater’s Guide To Microsoft, Redmond’s Remaining Performance Obligations that it expects to turn into revenue in the next 12 months have remained flat for multiple quarters, meaning that OpenAI’s supposed purchase of “an incremental $250 billion in Azure compute” is yet to commence. In practice, this means that OpenAI’s expenses are likely to massively increase in the coming months.
And while the “$122 billion” funding round it raised — with $35 billion of it contingent on either AGI or going public (Amazon), and $60 billion of it paid in tranches by SoftBank and NVIDIA — may seem like a lot, keep in mind that OpenAI had received $22.5 billion from SoftBank on December 31 2025, a little under four months ago. This suggests that OpenAI is either running out of capital, or has significant up-front commitments it needs to fulfil, requiring massive amounts of cash to be sent to Amazon, Microsoft, CoreWeave (which it pays on net-360 terms) and Oracle. And if I’m honest, I think the entire goal of the funding round was to plug OpenAI’s leaky finances long enough to take it public, against the advice of CFO Sarah Friar.

One under-discussed part of Farrow and Marantz’s piece was a quote about OpenAI’s overall finances, emphasis mine:

As I wrote earlier in the week, OpenAI CFO Sarah Friar does not believe, per The Information, that OpenAI is ready to go public, and is concerned about both revenue growth slowing and OpenAI’s ability to pay its bills. To make matters worse, Friar also no longer reports to Altman — and god is it strange that the CFO doesn’t report to the CEO! — and it’s actually unclear who she reports to at all, as the person she currently reports to, Fiji Simo, has taken an indefinitely long medical leave of absence. Friar has also, per The Information, been left out of conversations around financial planning for data center capacity.

These are the big, flashing warning signs of a company with serious financial and accounting issues, run by Sam Altman, a CEO with a vastly-documented pattern of lies and deceit. Altman is sidelining his CFO, rushing the company to go public so that his investors can cash out and the larger con of OpenAI can be dumped onto public investors. And beneath the surface, the raw economics of OpenAI do not make sense.

You’ll notice I haven’t talked much about OpenAI’s products yet, and that’s because I do not believe they can exist without the venture capital funding them and the customers that buy them. These products only have market share as long as other parties continue to build capacity or throw money into the furnace. To explain:

While OpenAI is not systemically necessary, the continued enabling and normalization of its egregious and impossible promises has created an existential threat to multiple parties named above. Its continued existence requires more money than anybody has ever raised for a company — private or public — and in the event it’s allowed to go public, I believe that both retail investors and large equity investors like SoftBank will be left holding the bag.

OpenAI has a fundamental lack of focus as a business, despite how many articles have claimed over the last year that it’s working on a “SuperApp” and has some sort of renewed plan to take on whoever it is that OpenAI perceives as the competition in any given calendar month. Everything OpenAI does is a reaction to somebody else. Its Atlas browser was a response to Perplexity’s Comet browser, its first (of multiple!) Code Reds in 2025 was a reaction to Google’s Gemini 3, and its rapid deployment of its Codex model and platform was to compete with Anthropic’s Claude Code. I’ve read about this company and the surrounding industry for hours a day for several years, and I can’t think of a single product that OpenAI has launched first.
Even its video-generating social network app Sora was beaten to market by five days by Meta’s putrid and irrelevant “Vibes.” Actually, that’s not true. OpenAI did have one original idea in 2025 — the launch of GPT-5, a much-anticipated new model launch that included a “model router” to make it “more efficient,” except it turned out that it boofed on benchmarks and that the model router actually made it (as I reported last year) more expensive, which led to the router being retired in December 2025.

I tend to be pretty light-hearted in what I write, but please take me seriously when I say I have genuine concerns about the dangers posed by OpenAI. I believe that OpenAI is an incredibly risky entity, not due to the power of its models or its underlying assets, but due to Sam Altman’s ability to con people and find others that will con in his stead. Those responsible for rooting out con artists — regulators, investors, and the media — have not simply failed, but actively assisted Altman in this con. Here are the crucial elements of the con:

Sam Altman is a dull, mediocre man who loves money and power. He appears to be superficially charming, but his actual skill is ingratiating himself with others and having them owe him favors, or otherwise feel somehow indebted to him. He remembers people’s names and where he met them, and is very good at emailing people, writing checks, or finding reasons for somebody else to write a check. He is not technical — he can barely code and misunderstands basic machine learning (to quote Futurism) — but is very good at making the noises that people want to hear, be they big scary statements that confirm their biases or massive promises of unlimited revenue that don’t really make any rational sense.

While OpenAI might have started on noble terms, it has since morphed into a massive con led by the Valley’s most-notable con artist. I realize that those who like AI might find this offensive, but what else do you call somebody who makes promises they can’t keep ($300 billion to Oracle, $200 billion of revenue by 2030), spreads nonsensical financials (promises to spend $600 billion in compute), makes announcements of deals that don’t exist (see: NVIDIA’s $100 billion funding and the entire Stargate project), and speaks in hyperbolic terms to pump the value of his stock (such as basically every time he talks about Superintelligence)?

Altman has taken advantage of a tech and business media that wants to see him win, a market divorced from true fundamentals, desperate venture capitalists at the end of their rope, hyperscalers that have run out of hypergrowth ideas, and multiple large companies like Oracle and SoftBank that are run by people who can’t do maths. OpenAI is a pseudo-company that can only exist with infinite resources, its software sold on lies, its infrastructure built and paid for by other parties, and its entire existence fueled by compounding layers of leverage and risk.

OpenAI has never made sense, and was only rationalized through a network of co-conspirators. OpenAI has never had a path to profitability, and never had a product that was worthy of the actual cost of selling it. The ascension of this company has only been possible as part of an exploitation of ignorance and desperation, and its collapse will be dangerous for the entire tech industry. Today I’ll explain in great detail the sheer scale of Sam Altman’s con, how it was exacted, the danger it poses to its associated parties, and how it might eventually collapse.
This is the Hater’s Guide To OpenAI, or Sam Altman, Freed.

OpenAI’s ChatGPT subscriptions are, like every LLM product, deeply unprofitable, which means that OpenAI needs constant funding to keep providing them. I have found users of OpenAI Codex who have been able to burn between $1,000 and $2,000 in the space of a week on a $200-a-month subscription, and OpenAI just reset rate limits for the second time in a month. This isn’t a real business.

OpenAI’s API customers (the ones paying for access to its models) are, for the most part, venture-backed startups providing services like Cursor and Perplexity that are powered by these models. These startups are all incredibly unprofitable, requiring them to raise hundreds of millions of dollars every few months (as is the case with Harvey, Lovable, and many other big-name AI firms), which means that a large chunk of OpenAI’s revenue — some estimate around 27% — is dependent on customers that stop existing the moment that venture capital slows down.

OpenAI’s infrastructure partners like CoreWeave and Oracle are taking on anywhere from a few billion to over a hundred billion dollars’ worth of debt to build data centers for OpenAI, putting both companies in material jeopardy in the event of OpenAI’s failure to pay or overall collapse. 67% of CoreWeave’s 2025 revenue came from Microsoft renting capacity to rent to OpenAI, and OpenAI accounts for $22 billion (32%) of CoreWeave’s $66.8 billion revenue backlog, which requires it to build more capacity to fill. Oracle took on $38 billion in debt in 2025, and is in the process of raising another $50 billion as it lays off thousands of people, with said debt’s only purpose being building data center capacity for OpenAI.

OpenAI’s lead investor SoftBank is putting itself in dire straits to fund the company, with over $60 billion invested so far, existentially tying SoftBank’s overall financial health to both OpenAI’s stock price and SoftBank’s ability to continue paying (or refinancing) its loans. SoftBank took on a year-long $15 billion bridge loan in 2025, had to sell its entire stake in NVIDIA, and expanded its ARM-stock-backed margin loan to over $11 billion to give OpenAI $30 billion in 2025, and then took on another $40 billion bridge loan a few weeks ago to fund the $30 billion it promised for OpenAI’s latest funding round.

The con itself has involved:

- Creating a halo of uncertainty around the actual efficacy of LLMs, to the point that a cult of personality grew around a technology whose actual outcomes were obfuscated so thoroughly that it could be sold based on what it might do rather than what it actually does.
- Creating a halo of “genius” around Altman himself, aided by constant and vague threats of human destruction, with the suggestion that only Altman could solve them.
- Normalizing the idea that it’s both necessary and important to let a company burn billions of dollars.
- Normalizing the idea that it’s okay for a company to have perpetual losses, and perpetuating the idea that these losses are necessary for innovation to continue at large.


Has Mythos just broken the deal that kept the internet safe?

For nearly 20 years the deal has been simple: you click a link, arbitrary code runs on your device, and a stack of sandboxes keeps that code from doing anything nasty. Browser sandboxes for untrusted JavaScript, VM sandboxes for multi-tenant cloud, ad iframes so banner creatives can't take over your phone or laptop - the modern internet is built on the assumption that those sandboxes hold. Anthropic just shipped a research preview that generates working exploits for one of them 72.4% of the time, up from under 1% a few months ago. That deal might be breaking.

From what I've read, Mythos is a very large model. Rumours have pointed to it being similar in size to the short-lived (and very underwhelming) GPT-4.5. As such, I'm with a lot of commentators in thinking that a primary reason this hasn't been rolled out further is compute. Anthropic is probably the most compute-starved major AI lab right now, and I strongly suspect they do not have the compute to roll this out more broadly even if they wanted to. From leaked pricing, it's expensive as well - at $125/MTok output (5x more than Opus, which is itself the most expensive model out there).

One thing that has really been overlooked with all the focus on frontier-scale models is how quickly improvements in the huge models are being reproduced in far smaller models. I've spent a lot of time with the Gemma 4 open-weights model, and it is incredibly impressive for a model that is ~50x smaller than the frontier models. So I have no doubt that whatever capabilities Mythos has will relatively quickly be available in smaller, and thus easier-to-serve, models. And even if Mythos' huge size somehow is intrinsic to its abilities (I very much doubt this, given current progress in scaling smaller models), it's only a matter of time before newer chips [1] are able to serve it en masse. It's important to look to where the puck is going.

As I've written before, LLMs in my opinion pose an extremely serious cybersecurity risk. Fundamentally, we are seeing a radical change in how easy it is to find (and thus exploit) serious flaws and bugs in software for nefarious purposes.

To back up a step, it's important to understand how modern cybersecurity is currently achieved. One of the most important concepts is that of a sandbox. Nearly every electronic device you touch day to day has one (or many) layers of these to protect the system. In short, a sandbox is a so-called 'virtualised' environment where software can execute on the system, but with limited permissions, segregated from other software, with a very strong boundary that stops the software 'breaking out' of the sandbox.

If you're reading this on a modern smartphone, you have at least three layers of sandboxing between this page and your phone's operating system. First, your browser has (at least) two levels of sandboxing. One is for the JavaScript execution environment (which runs the interactive code on websites). This is then contained by the browser sandbox, which limits what the site as a whole can do. Finally, iOS or Android then has an app sandbox which limits what the browser as a whole can do. This defence in depth is absolutely fundamental to modern information security, especially in allowing users to browse "untrusted" websites with any level of safety. For a malicious website to gain control over your device, it needs to chain together multiple vulnerabilities, all at the same time. In reality this is extremely hard to do (and these kinds of chains fetch millions of dollars on the grey market).
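As a back-of-the-envelope illustration of why chaining matters, here is a toy model of my own that assumes each layer falls independently (real exploit chains are not independent, so treat it as intuition only):

```python
# Toy model: probability an attacker pops each sandbox layer in one
# attempt. The numbers are invented; the point is the multiplication.
layers = {
    "JS engine": 0.01,
    "browser sandbox": 0.01,
    "OS app sandbox": 0.01,
}

chain = 1.0
for p in layers.values():
    chain *= p

print(f"full-chain success: {chain:.6f}")   # 0.000001 -- one in a million

# If a capable model lifts the innermost layer to 0.724 (the Mythos
# JS-shell figure), the full chain jumps ~72x even with the outer
# layers unchanged -- which is why eroding any one layer matters.
print(f"with 72.4% on layer one: {0.724 * 0.01 * 0.01:.6f}")
```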
Guess what? According to Anthropic, Mythos Preview successfully generates a working exploit for Firefox's JS shell in 72.4% of trials. Opus 4.6 managed this in under 1% of trials in a previous evaluation.

Worth flagging a couple of caveats. The JS shell here is Firefox's standalone SpiderMonkey - so this is escaping the innermost sandbox layer, not the full browser chain (the renderer process and OS app sandbox still sit on top). And it's Anthropic's own benchmark, not an independent one. But even hedging both of those, the trajectory is what matters - we're going from "effectively zero" to "72.4% of the time" in one model generation, on a real-world target rather than a toy CTF.

This is pretty terrifying if you understand the implications. If an LLM can find exploits in sandboxes - which are some of the most well-secured pieces of software on the planet - then suddenly every website you aimlessly browse through could contain malicious code which can 'escape' the sandbox and theoretically take control of your device - and all the data on your phone could be sent to someone nasty.

These attacks are so dangerous because the internet is built around sandboxes being safe. For example, each banner ad your browser loads is loaded in a separate sandboxed environment. This means ads can run a huge amount of (mostly) untested code, with everyone relying on the browser sandbox to protect them. If that sandbox falls, then suddenly a malicious ad campaign can take over millions of devices in hours.

Equally, sandboxes (and virtualisation) are fundamental to allowing cloud computing to operate at scale. Most server workloads these days are not running directly on the physical server they sit on. Instead, AWS et al. take the physical hardware and "slice" it up into so-called "virtual" servers, selling each slice to different customers. This allows many more applications to run on a single server - and enables some pretty nice profit margins for the companies involved. This operates on roughly the same model as your phone, with various layers to protect customers from accessing each other's data and (more importantly) from accessing the control plane of AWS.

So, we have a very, very big problem if these sandboxes fail, and all fingers point towards this being the case this year. I should tone down the disaster porn slightly - there have been many sandbox escapes before that haven't caused chaos - but I have a strong feeling that this is going to be difficult. And to be clear, when just AWS us-east-1 goes down (which it has done many, many times), it is front-page news globally and tends to cause significant disruption to day-to-day life. That is just one of AWS's regions - if a malicious actor was able to take control of the AWS control plane, it's likely they'd be able to take all regions simultaneously, and recovery would likely be infinitely harder with a bad actor in charge, as opposed to the internal problems that caused previous outages - which were themselves extremely difficult to recover from in a timely way.

Given all this, it's understandable that Anthropic are being cautious about releasing this into the wild. The issue, though, is that the cat is out of the bag. Even if Anthropic pulled a Miles Dyson and lowered their model code into a pit of molten lava, someone else is going to scale an RL model and release it. The incentives are far, far too high, and the prisoner's dilemma strikes again.
The current status quo seems to be that these next-generation models will be released to a select group of cybersecurity professionals and related organisations, so they can fix things as much as possible to give defenders a head start. Perhaps this is the best that can be done, but it looks to me like a repeat of the famous "security through obscurity" approach, which has become a meme in itself in the information security world. It also seems far-fetched to me that the organisations who do have access are going to find even most of the critical problems in a limited time window.

And that brings me to my final point. While Anthropic are providing $100m of credit and $4m of 'direct cash donations' to open source projects, it's not all open source projects. There is a lot of open source software that everyone relies on without realising. While the obvious projects like the Linux kernel are getting this "access" ahead of time, there are literally millions of pieces of open source software (never mind commercial software) that are essential to the operation of a substantial minority of systems. I'm not quite sure where the plan leaves those.

Perhaps this is just another round in the cat-and-mouse cycle that reaches a mostly stable equilibrium, and at worst we have some short-term disruption. But if I step back and look at how fast the industry has moved over the past few years - I'm not so sure. And one thing I think is certain: it looks like we do now have the fabled superhuman ability in at least one domain. I don't think it's the last.

[1] Albeit at the cost of adding yet more pressure onto the compute crunch the AI industry is experiencing. ↩︎

Giles's blog 6 days ago

Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud

Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that gap, and had a list of potential changes to the training setup, and to the model itself. Which of them would help?

I found a list of solid-looking interventions, and in my last post I came to the conclusion that the improvements in loss I had seen with all of them -- with two possible exceptions -- seemed unlikely to be in the noise. What would happen if I tried to put them into a new model?

Let's start by looking at the results that we have for the interventions so far -- this is the table I've been using as I go through them, but I've updated it to contain the loss figures for each model to six decimal places instead of three, and made each model name link to the associated post. I've also corrected the loss for the model, which was mistakenly using the training loss at the end of the run rather than the loss on the test set [1].

As I've mentioned before, simply moving to training in the cloud improved things markedly, getting loss down from 3.944 to 3.691526; I suspect this was due to having a closer-to-optimal batch size (more about that in my next post). What to do about the other interventions, though? It seemed clear that two of them were not helping: weight tying, and the one using the figure for weight decay that I'd (I suspect incorrectly) derived from a paper by Cerebras Research. The "no-AMP" run (which would be better described as "full-fat float32") had a small positive effect, but was so costly in terms of both time and money that it wasn't worthwhile. So we had five interventions to try:

How would they stack up? It seemed pretty unlikely that their independent contributions would just sum up neatly so that we got a total improvement of 0.013209 + 0.022141 + 0.048586 + 0.050244 + 0.089609 = 0.223789 (though that would certainly be nice!). One question to consider was how independent they were. For any set of interventions, you can imagine them being independent and adding up nicely, or pulling in different directions so that the combined effect is worse than the sum, or pulling in the same direction so that they amplify each other.

My intuition was that gradient clipping and removing dropout were pretty independent, at least conceptually. They might affect other interventions indirectly (e.g. via changing the training run's use of the random number generator) but they'd be unlikely to have a direct effect. QKV bias I was less sure about, but it seemed -- again, just intuitively -- at least reasonably independent of the others, with one important exception (which I'll get into below).

By contrast, weight decay and the learning rate interact quite strongly, at least in standard gradient descent, and I'd tested them in isolation. The result for changing the weight decay to 0.01 was based on a fixed learning rate of 0.0004, and the result for scheduling the learning rate was based on a weight decay of 0.1. That felt like an issue, and definitely needed some thought. Additionally, there were some issues with which interventions might not have had a real effect, and instead just been the result of the use of randomness.
While my analysis of how that might have affected things was somewhat limited by the number of test runs I could afford to do, it did show up two plausible issues:

- Adding gradient clipping looked like it might have been within the training run noise.
- Adding QKV bias would have had a large effect on the model's initial weights. All of the others would have started with essentially the same weights (apart from weight tying, though even that would have had the same values for the initial weights apart from the tied ones). But adding the bias would have completely changed them, and its effect size was comfortably within the range of differences you might expect from that.

After some thought, I came up with a plan. If I were doing this properly and scientifically, I suppose I'd try every combination of interventions, but that would be ruinously expensive 2, so a sensible minimal set of training runs felt like this:

- Start a training run with all of the interventions apart from QKV bias.
- In parallel (Lambda instance availability permitting), run another one with all of the interventions, including QKV bias.

When those completed, I'd find the test set loss for both models. I'd choose the best run, and then do another run with those settings, but with weight decay switched back to the original value of 0.1. I chose to revert weight decay rather than the learning rate changes because this was the one I was least sure about -- the updated "GPT-2" value of 0.01 is very unusual by today's standards, and I'd come to it via a rather circuitous route -- see the post for more details. The best of the three runs would be the winning combination of interventions. Again, this was not an exhaustive plan 3. But it seemed to make sense. Let's see how it turned out.

Just to recap, the first run had these interventions against the baseline:

- Gradient clipping at 3.5.
- Weight decay changed from 0.1 to 0.01.
- Dropout removed.
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.

It did not have QKV bias. You can see the config here.
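To make that recipe concrete, here's a rough sketch of what those four interventions look like in a PyTorch training loop. To be clear, this is illustrative rather than my actual code -- model, train_loader and loss_fn are assumed to exist, and total_steps is a placeholder -- but the numbers (the 3.5 clip norm, the 0.01 weight decay, and the 0.0014 peak learning rate with a 5% warmup and cosine decay to 0.00014) match the run as described:

```python
import math
import torch

# Assumed to exist elsewhere: model, train_loader, loss_fn.
total_steps = 10_000                     # placeholder; in practice, len(train_loader)
PEAK_LR, MIN_LR = 1.4e-3, 1.4e-4         # LR raised from 4e-4 to 1.4e-3, decaying to 1.4e-4
WARMUP_STEPS = int(0.05 * total_steps)   # warmup over 5% of the run

# Weight decay changed from 0.1 to the GPT-2 value of 0.01
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)

# Dropout removed: zero out every dropout module's probability
for module in model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.p = 0.0

def lr_at(step: int) -> float:
    """Linear warmup for the first 5% of steps, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

for step, (inputs, targets) in enumerate(train_loader):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)        # scheduled learning rate
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Gradient clipping at a max norm of 3.5
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.5)
    optimizer.step()
```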
Here's the loss chart over the course of the training run. As normal with learning rate scheduling, I also charted that to make sure it was doing the right thing (you can see that it was). And I also tracked the gradient norms -- you can see that there was some clipping happening near the start of the run. The run finished with a slightly lower final train loss than normal, and it took 3h10m, which is faster than usual, but about the same as the other train we did without dropout -- that makes sense, as the process of zeroing out random activations isn't free. I downloaded the model -- here it is -- ran the smoke test, and got its loss on the test set. Not bad at all -- the best result we've had so far, albeit not quite up to the standard of the original GPT-2 weights.

Now the next one, with QKV bias. This one had these interventions:

- Gradient clipping at 3.5.
- Weight decay changed from 0.1 to 0.01.
- Dropout removed.
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.
- QKV bias switched on.

You can see the config here. The loss chart, the learning rate, and the gradient norms (note that we had more clipping, about halfway through) all looked as expected, but the final train loss was slightly higher, which is normally an indicator that the test loss will be higher too -- we'd have to see. Time to download the model -- here it is -- run the smoke test, and then the moment of truth: what was its loss on the test set? As I suspected from the training loss at the end, slightly worse than the run without QKV bias. So that meant we should do the next run, with a weight decay of 0.1 and no QKV bias.

Given the above results, this one had these interventions vs the baseline:

- Gradient clipping at 3.5.
- Dropout removed.
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.

Weight decay was back to the baseline value of 0.1, rather than the value of 0.01 used in the previous two runs, and QKV bias was switched back off. You can see the config here. The loss chart is much choppier than the previous two runs; that initially surprised me, as the higher weight decay means that we're regularising the model more than we were with those, which I thought would "calm things down". But on reflection, I had it backward. Hand-waving a bit, a more regularised model fits the details of the data it has seen less closely, weighting the typical stuff more heavily than the outliers. That means that when something a bit more out-of-distribution appears, it might not yet have learned how to integrate it into its model of the world. Well, it sounds plausible, anyway :-)

On to the learning rate (just to double-check), and it's fine. And again, the gradient norms, which similarly to the loss chart show more occasions where gradients spiked and had to be clipped -- even towards the end of the training run this time. Once again, although the final train loss is not definitive, it tends to be indicative of the test loss; it's in between the last two runs, so we'd expect the test loss to be likewise in between theirs. Time to download the model -- here it is -- and on to the smoke test. Hmm. At least vaguely coherent, though I'm not 100% convinced. It looks like ads for personal injury lawyers have crept into FineWeb somehow... Still, it's time for the test loss (drumroll): as predicted from the train loss, it's in between the two runs above.

Let's put these three runs into the results table. As a reminder:

- The first run was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01, dropout removed, and the learning rate intervention, but no QKV bias.
- The second was the same, but with QKV bias.
- The third was gradient clipping at 3.5, dropout removed, and the learning rate intervention, but no QKV bias, and no change to weight decay.

You can see that adding on QKV bias actually made the model worse than the learning-rate-only intervention. That pushes me slightly away from the "it's all about the initial weights" direction; perhaps instead the bias adds some kind of stability that the learning rate scheduling also provides, and they fight against each other? Unfortunately I think the only way to pick it apart would be to do a full set of runs, switching each intervention on and off independently, and that would be too costly.

The fact that the weight decay change from 0.1 to 0.01 actually did help when combined with the learning rate change and scheduling was a bit of a surprise; because they're coupled when we think about standard gradient descent, I was expecting them to be too intertwined for my tests of them in isolation to have been valid. Quite pleased that it didn't work out that way, though, because sweeping across values for different parameters is much easier than it would be if they were connected. However, at this point it occurs to me that it might be because we're using the AdamW optimiser. As I understand it, its big difference versus Adam is that it decouples weight decay. I don't have a solid mental model of what that means exactly (I'll read up and post about it eventually), but it certainly seems pertinent here.
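Roughly, though, the difference is visible in pseudo-update form. This is a simplified sketch of my understanding (single parameter, bias correction ignored), not anything from my codebase: Adam folds weight decay into the gradient as L2 regularisation, so the decay gets rescaled by the adaptive statistics, while AdamW applies it directly to the weights, outside that rescaling.

```python
# Simplified single-parameter updates, ignoring bias correction, to show
# where the weight decay term enters in each optimiser.

def adam_step(p, grad, m, v, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam with "L2 regularisation": the decay is folded into the gradient,
    # so it gets rescaled by the adaptive term sqrt(v) like everything else.
    grad = grad + wd * p
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    p = p - lr * m / (v ** 0.5 + eps)
    return p, m, v

def adamw_step(p, grad, m, v, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    # AdamW: the decay is decoupled -- applied directly to the weight,
    # outside the adaptive rescaling, so it behaves the same regardless
    # of how large recent gradients have been.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    p = p - lr * m / (v ** 0.5 + eps) - lr * wd * p
    return p, m, v
```

If that's right, it would at least be consistent with the isolated tests transferring cleanly -- the decay term never touches the adaptive gradient statistics, so it's less entangled with the learning rate than it would be in standard gradient descent.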
Anyway, I have to say, I'm both pleased with and disappointed by these results. Pleased because we got a result by putting interventions together that was better than any of them in isolation, but disappointed that the end result wasn't even better. The difference between the baseline's loss, at 3.691526, and original GPT-2 small's, at 3.5, was 0.191526. Our best result, for the first combined run, was 3.577761 -- an improvement of 0.113765, or about 60% of the way there.

That said, by sheer chance, while trying out the different sizes of cloud machines, I'd got from a loss of 3.944 training locally down to the baseline's value of 3.691526 -- I suspect due to the fact that training in the cloud meant that I could use batch sizes of 96. So a different way of looking at it is that we should include that in the calculations too. From 3.944 to 3.5, the gap with GPT-2 small was 0.444, and we went from 3.944 to 3.577761, an improvement of 0.366239. That means that we managed to get 82% of the improvement we needed. On the other hand, it also means that 0.252474 of my improvement came from a happy accident, while all of my careful work on interventions only got me 0.113765. :-(

Anyway, I think that for now, I'll have to rest happy with that as a result -- and next time around, let's see if we can get to the same level of improvement locally, using gradient accumulation.

1. Luckily the difference was small enough that it doesn't change any of the conclusions I'd made about it. ↩

2. Because there are five interventions, and each can be on or off, it's equivalent to a 5-digit binary number. So that's 2⁵ = 32 trains, less the five I'd already done and the baseline, for a total of 32 − 6 = 26. At US$50-odd for a train, that's definitely a no-go. ↩

3. I did also consider changing the random seed at the start of the code to 67 rather than 42, given that it seemed to provide better initial weights when I was exploring the effects of random noise on the training. I even started the first two training runs with that in place. However, on reflection I realised that it would be one step too far away from scientific rigour: I'm not trying to be 100% rigorous in these posts, but diligently testing all of the interventions against one seed and then YOLOing in a different one for the final training runs didn't seem defensible. ↩


What does it mean to create with AI?

For some weird reason, I always had some kind of slight "mental hesitation" with the meaning of data encoding versus decoding. Which one goes in what direction? To be honest, I have the same kind of weirdness with other concepts: daylight saving time, for instance (are we gaining or losing an hour? I can never tell, sometimes even for many days after a change). So I wanted to create a diagram to illustrate the dichotomy between encoding and decoding, for a course I'm creating on software engineering. One way to "create with AI", then, would be to ask one: "Can you please create a diagram to illustrate the difference between data encoding and decoding".
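For what it's worth, a tiny Python example pins the direction down: encoding goes from the in-memory representation (text, objects) to bytes for storage or transmission, and decoding goes back the other way.

```python
text = "café"

data = text.encode("utf-8")        # encoding: text -> bytes (for disk or the wire)
print(data)                        # b'caf\xc3\xa9'

restored = data.decode("utf-8")    # decoding: bytes -> text (back to meaning)
print(restored == text)            # True
```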

Simon Willison 1 week ago

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

Meta announced Muse Spark today, their first model release since Llama 4 almost exactly a year ago. It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on meta.ai (Facebook or Instagram login required).

Meta's self-reported benchmarks show it competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows". The model is exposed as two different modes on meta.ai - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro.

I prefer to run my pelican test via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly. Here's the pelican I got for "Instant", and this one for "Thinking". Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model output an SVG directly (with code comments) whereas the Thinking model wrapped it in a thin HTML shell with some unused JavaScript libraries. Which got me curious...

Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style. But what else can it do? I asked it:

what tools do you have access to? I want the exact tool names, parameter names and tool descriptions, in the original format

It spat out detailed descriptions of 16 different tools. You can see the full list I got back here - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks. Here are highlights derived from that response:

- Browse and search. Can run a web search through an undisclosed search engine, load the full page from one of those search results, and run pattern matches against the returned page content.
- Meta content search. Can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters.
- "Catalog search" - can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector.
- Image generation - generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images.
- container.python_execution - yes! It's Code Interpreter, my favourite feature of both ChatGPT and Claude. Execute Python code in a remote sandbox environment: Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc., with files persisting in the sandbox. Python 3.9 is EOL these days but the library collection looks useful. I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from January 2021).
- container.create_web_artifact - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives, with a "kind" setting selecting websites/apps or vector graphics.
- container.download_meta_1p_media is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox.
- container.file_search - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar?
- Tools for editing files in the container. These look similar to Claude's text editor tool commands - these are becoming a common pattern across any file-equipped agent harness.
- container.visual_grounding - see below, this one is fun.
- subagents.spawn_agent - the sub-agent as a tool pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response."
- third_party.link_third_party_account - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook.

Let's talk about that visual grounding one. Here's the description in full:

Description: Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects.

Parameters:

- object_names (array, required)
- image_path (string, default "")
- format_type ("bbox"|"point"|"count"|null, default "bbox")
- title (string|null)

So it can take an image path and optional object names, and return results in point, bbox, or count format. Sounds to me like Meta's Segment Anything.

I decided to take it for a spin.

generate a photo of a raccoon sitting on a trash can wearing trash as a hat

Here you go — king of the alley, in full regalia: Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too?

My guess is that image generation in Meta AI is powered by their Emu model, or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call. The tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried:

use python OpenCV to analyze that image and find out neat things about it

Annoyingly it did not show me the code it ran, but it did create this image. Pretty neat! We can both generate images and then use them with Python tools.

Next, the visual grounding tool. Knowing the names of the tools makes it easy to refer to them in prompts:

Now try the container.visual_grounding tool

The response (full response here):

Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision: Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect?

Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly point mode, but how about the others? I prompted:

yes draw bounding boxes, can it do masks too?

Yes — bounding boxes work great. Masks? Not directly; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks. Here's the bbox output. What you're seeing:

- raccoon: big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal
- coffee cup: (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space
- banana peel: (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side
- newspaper: (305,223)-(548,315) — left ear coverage
- trash can lid: (107,800)-(1000,998) — spans almost the full image width at the bottom
- face: (305,283)-(750,537) — tight crop around the mask
- eyes: both ~40×37 pixel boxes, perfectly level at y=379

Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers.

About masks: since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options:

- GrabCut — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask
- SAM-style color threshold — use the bbox plus k-means to isolate the coffee cup or banana peel

Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG.

Here's the HTML for that bbox visualization. Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can see that all here. The count mode is particularly interesting: Meta AI has the ability to count a raccoon's whiskers baked into its default set of tools. Which means... it can count pelicans too! Here's that overlay exported as HTML.

Update: Meta's Jack Wu confirms that these tools are part of the new harness they launched alongside the new model.

On Twitter Alexandr Wang said:

this is step one. bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with plans to open-source future versions.

I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency:

[...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.

So are Meta back in the frontier model game? Artificial Analysis think so - they scored Muse Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively.

I'm waiting for API access - while the tool collection on meta.ai is quite strong, the real test of a model like this is still what we can build on top of it.
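Based purely on that self-reported parameter list, I'd expect a call to the tool to look something like this. To be clear, the wire format here is my guess - only the parameter names, defaults and the three format types come from the tool description, and the 0-1000 coordinate space is inferred from the results below:

```python
# Hypothetical shape of a container.visual_grounding call, reconstructed
# from the self-reported parameter list - not an official Meta API.
request = {
    "tool": "container.visual_grounding",
    "arguments": {
        "object_names": ["raccoon", "coffee cup", "banana peel"],  # required
        "image_path": "/mnt/data/raccoon.png",  # hypothetical sandbox path
        "format_type": "bbox",                  # "bbox" | "point" | "count"
        "title": "trash hat inventory",
    },
}

# A plausible bbox-mode response, matching the 0-1000 coordinate space
# seen in the raccoon results below.
response = {
    "detections": [
        {"label": "raccoon", "bbox": [270, 240, 892, 845]},
        {"label": "coffee cup", "bbox": [424, 115, 610, 273]},
        {"label": "banana peel", "bbox": [472, 215, 667, 291]},
    ]
}
```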

0 views

AI Is Really Weird

If you like this piece and want to support my independent reporting and analysis, why not subscribe to my premium newsletter? It's $70 a year, or $7 a month, and in return you get a weekly newsletter that's usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA, Anthropic and OpenAI's finances, and the AI bubble writ large. I just put out a massive Hater's Guide To The SaaSpocalypse, as well as last week's deep dive into How AI Isn't Too Big To Fail. Subscribing helps directly support my free work, and premium subscribers don't see this ad in their inbox.

I can't get over how weird the AI bubble has become. Hyperscalers are planning to spend over $600 billion on data center construction and GPUs predominantly bought from NVIDIA, the largest company on the stock market, all to power generative AI, a technology that's so powerful that none of them will discuss how much it's making them, or what it is we're all meant to be so excited about.

To make matters weirder, Microsoft, a company that spent $37.5 billion in capital expenditures in its last quarter on AI, recently updated the terms and conditions of its LLM-powered "Copilot" service to say that it was "for entertainment purposes only," discussing a product that apparently has 15 million users as part of enterprise Microsoft 365 subscriptions, and is sold to both local and national governments overseas, including the US federal government. That's so weird! What're you doing, Microsoft? What do you mean it's for entertainment purposes? You're building massive data centers to drive this!

Well, okay, you're building them at some point. As I discussed a few weeks ago, despite everybody talking about the hundreds of gigawatts of data centers being built "to power AI," only 5GW are actually "under construction," with "under construction" meaning anything from "we've got some scaffolding up" to "we're about to hand over the keys to the customer."

But isn't it weird we're even building those data centers to begin with? Why? What is it that AI does that makes it so essential — or, rather, entertaining — that we keep funding and building these things? Every day we hear about "the power of AI," and we're beaten over the head with scary propaganda saying "AI will take our jobs," but nobody can really explain — outside of outright falsehoods about "AI replacing all software engineers" — what it is that makes any of this worthy of taking up any oxygen, let alone essential, or a justification for so many billions of dollars of investment.

Instead of providing an actual answer of some sort, AI boosters respond by saying it's "just like the dot com bubble" — another weird thing to do considering 168,000 people lost their jobs as the NASDAQ dropped by 80% in two years, only 16% of the world even used the internet, and those that did in America had an average internet speed of 50 kilobits per second (and only 52% of them had access in 2000 anyway). Conversely, to quote myself: And with that incredibly easy access, only 3% of households pay for AI. Boosters will again use this talking point to say that "we're in the early days," but that's only true if you think that "early days" means "people aren't really using it yet." Yet the "early days" argument is inherently deceptive.
While the Large Language Model hype cycle might have only begun in 2022, the entirety of the media and markets has focused its attention on AI, along with hundreds of billions of dollars of venture capital and nearly a trillion dollars of hyperscale capex investment. AI progress isn't hampered by a lack of access, talent, resources, novel approaches, or industry buy-in, but by a single-minded focus on Large Language Models, a technology that has been so obviously limited from the very beginning that Gary Marcus was able to call it in 2022.

Saying it's "the early days" also doesn't really make sense when faced with the rotten and incredibly unprofitable economics of AI. The early days of the internet were not unprofitable because of the underlying technology of serving websites, but because of the incredibly shitty businesses that people were building. Pets.com spent $400 per customer in customer acquisition costs, millions of dollars on advertising, and had hundreds of employees for a business with a little over $600,000 in quarterly revenue — and as a result, nothing about its failure was about "the early days of the internet" at all, as was the case with Kozmo, or any number of other dot com flameouts.

Similarly, internet infrastructure companies like Winstar collapsed because they tried to grow too fast and signed stupid deals, not because of flaws in the underlying technology. For example, in 1998, Lucent Technologies signed its largest deal — a $2 billion "equipment and finance agreement" — with telecommunications company Winstar, which promised to bring in "$100 million in new business over the next five years" and build a giant wireless broadband network, along with expanding Winstar's optical networking. Eager math-heads in the audience will see the issue with lending $2 billion against $100 million of new business over five years, just as eager news-heads will laugh at WIRED magazine saying in 1999 that Winstar's "small white dish antennas…[heralded] a new era and new mind-set in telecommunications." Winstar died two years later because its business was built to grow at a rate that its underlying product couldn't support. In the end, microwave internet (high-speed internet delivered via radio waves) has become an $8 billion-a-year industry, despite everybody's excitement.

In any case, anybody who tells you that we're in "the early days of AI" has either been conned or is in the process of conning you, as they're using the phrase to deflect from issues of efficacy or underlying economic weakness. In fact, that's a great place to go next. Probably the weirdest thing about this entire era is how nobody wants to talk about the fact that AI isn't actually doing very much, and that AI agents are just chatbots plugged into an API. Per Redpoint Ventures' Reflections on the State of the Software and AI Market, "the agent maturity curve is still early, but the TAM implications are enormous," with agents able to "...run discretely for minutes, [and] execute end-to-end tasks with some oversight." What tasks, exactly? Who knows! Truly, nobody seems able to say. To paraphrase Steven Levy at WIRED, 2025 was meant to be the year of AI agents, but turned out to be the year of talking about AI agents. Agents were/are meant to be autonomous pieces of software that go off and do distinct tasks. In reality, it's kind of hard to say what those tasks are.
"AI agent" now refers to literally anything anybody wants it to, but ultimately means "chatbot that has access to some systems."

The New York Times' Ezra Klein recently talked to the entity currently inhabiting former journalist and Anthropic co-founder Jack Clark about "how fast AI agents would rip through the economy," but despite speaking for over an hour, the closest we got was "it wrote up a predator-prey simulation (a complex-sounding but extremely-common kind of webgame that Anthropic likely ingested through its training material)" and "chatbots that talk to each other about tasks," and if you think I'm kidding, this is how he described it:

Anyway, this is all bad, because multiple papers have now shown that, and I quote, agents are "...incapable of carrying out computational and agentic tasks beyond a certain complexity," with Futurism adding that said complexity was pretty low. The word "agent" is meant to make you think of powerful autonomous systems that carry out complex and minute tasks, when in reality it's…a chatbot. It's always a fucking chatbot. It might be a chatbot with API access or a chatbot that generates a plan that another chatbot looks at and says something about, but it's still chatbots talking to chatbots. When you strip away the puffery, nobody seems to actually talk about what AI does.

Let's take a look at CNBC's piece on Goldman Sachs' supposed contract with Anthropic to build "autonomous systems for time-intensive, high-volume back-office work":

…okay, but like, what does it do? Right, brilliant. Great. Love it. What tasks? What is the thing you're paying for? Okay, great, we have two things it might do in the future, and that's "employee surveillance" (?) and making pitchbooks.

The upshot is that, with the help of the agents in development, clients will be onboarded faster and issues with trade reconciliation or other accounting matters will be solved faster, Argenti said.

Onboarding? Chatbot. "Issues with trade reconciliation"? Chatbot connected to a knowledge base, like we've had for years but worse and more expensive. Oh, and "other accounting matters" will be solved faster, always with the future tense with these guys.

How about Anthropic and outsourcing body shop giant Infosys' "AI agents for telecommunications and other regulated industries"? Let's go through the list of tasks and say what they mean, my comments in bold:

How about OpenAI's "Frontier" platform for businesses to "build, deploy and manage AI agents that do real work"? Shared context? Chatbot. Onboarding? Chatbot. Hands-on learning with feedback? Chatbot. Clear permissions and boundaries? Chatbot setting. Let's check out the diagram! Uhuh. Great. What real-world tasks? Uhhh. Reason over data? Chatbot. "Complex tasks"? No idea, it doesn't say. "Working with files"? It doesn't say how it works with files, but I'd bet it can analyze, summarize and create charts based on them that may or may not have errors in them, and based on my experience of trying to get these things to make charts (as a test, I'd never use them in my actual work), it doesn't seem to be able to do even that. "Evaluation and optimization loops"? Unclear, because we have no idea what the tasks are. What are the agents planning, acting, or executing on? Again, no idea.

Yet the media continues to perpetuate the myth of some sort of present or future "agentic AI" that will destroy all employment.
A few weeks ago, CNBC mindlessly repeated that ServiceNow CEO Bill McDermott believed that agents would send college grad unemployment over 30%. NowAssist, ServiceNow's AI platform, is capable of — you guessed it! — summarization, conversational exchanges, content creation, code generation and search: a fucking chatbot just like the other chatbots.

A few weeks ago, The New York Times wrote about how "AI agents are fun, useful, but [not to] give them your credit card," saying that they can "do more than just chat…they can edit files, send emails, book trips and cause trouble":

Sure sounds like you connected a chatbot to your email there, Mr. Heyneman.

Let's go through these: yes, you can string together chatbots with various APIs and have the chatbot be able to activate certain systems. You could also do the same with a button you bought on Etsy connected to your computer via USB if you really wanted to. The ability to connect something to something else does not mean that anything useful happens at the end, and LLMs are extremely bad at the kind of deterministic actions that define the modern knowledge economy, especially when choosing to do them based on their interpretation of human language. AI agents do not, as sold, actually exist. Every "AI agent" you read about is a chatbot talking to another chatbot connected to an API and a system of record, and the reason that you haven't heard about their incredible achievements is because AI agents are, for the most part, fundamentally broken.

Even OpenClaw, which CNBC confusingly called a "ChatGPT moment," is just a series of chatbots with the added functionality of requiring root access to your computer and access to your files and emails. Let's see how CNBC described it back in February:

Hmmm, interesting. I wonder if they say what that means:

Reading this, you might be fooled into believing that OpenClaw can actually do any of this stuff correctly, and you'd be wrong! OpenClaw is doing the same chatbot bullshit, just in a much-more-expensive and much-more-convoluted way, requiring either a well-secured private space or an expensive Mac Mini to run multiple AI services and do, well, a bunch of shit very poorly.

The same goes for things like Perplexity's "Computer," which it describes as "an independent digital worker that completes tasks and workflows for you," which means, I shit you not, that it can search, generate stuff (words, code, images), and integrate with Gmail, Outlook, Github, Slack, and Notion, places where it can also drop stuff it's generated. Yes, all of this is dressed up with fancy terms like "persistent memory across sessions" (a document the chatbot reads and information it can access) and "authenticated integrations" (connections via API that basically any software can have). But in reality, it's just a further compute-intensive way of trying to fit a square peg in a round hole, by which I mean having a hallucination-prone chatbot do actual work.

The only reason Jensen Huang is talking about OpenClaw is that there's nothing else for Jensen Huang to talk about:

That's wild, man. That's completely wild. What're you talking about? What can NemoClaw or OpenClaw or whatever-the-fuck actually do? What is the actual output? That's so fucking weird!

I can already hear the haters in my head screaming "but Ed, coding models!" and I'm kind of sick of talking about them, because nobody can actually tell me what I'm meant to be amazed or surprised by.
To be clear, LLMs can absolutely write code, and can absolutely create software, but neither of those means that the code is good, stable or secure, or that the same can be said of the software they create. They do not have ideas, nor do they create unique concepts — everything they create is based on training data fed to them that was first scraped from Stack Overflow, Github and whatever code repositories Anthropic, OpenAI, and Google have been able to get their hands on.

It's unclear what the actual economic or productivity effects are, other than an abundance of new code that's making running companies harder. Per The New York Times:

As I wrote a few weeks ago, LLMs are good at writing a lot of code, not good code, and the more people you allow to use them, the more code you're going to generate, which means the more time you're either going to need to review that code, or the more vulnerabilities you're going to create as a result. Worse still, hyperscalers like Meta and Amazon are allowing non-technical people to ship code themselves, which is creating a crisis throughout the tech industry.

Worse still, LLMs allow shitty software engineers who would otherwise be isolated by their incompetence to feign enough intelligence to get by, leading to them actively lowering the quality of code being shipped. Per the Times:

The Times also notes that because LLM coding works better on a device rather than a web interface, "...engineers are downloading their entire company's code to their laptops, creating a security risk if the laptop goes missing."

Speaking frankly, it appears that LLMs can write code, and create some software, but without any guarantee that said code will compile, run, be secure, performant, or easy to read and maintain. For experienced and ethical software engineers, LLMs can likely provide some speedup, though not in a way that appears to be documented in any academic sense, other than research suggesting it makes them slower.

And I think it's fair to ask what any of this actually means. What's the advantage of having an LLM write all of your code? Are you shipping faster? Is the code better? Are there many more features being shipped? What is the actual thing you can point at that has materially changed for the better? Software engineers don't seem happier, nor do they seem to be paid more, nor do they seem to be being replaced by AI, nor do we have any examples of truly vibe-coded software companies shipping incredible, beloved products. In fact, I can't think of a new piece of software I've used in the last few years that actually impressed me outside of Flighty. Where's the beef? What am I meant to be looking at? What're you shipping that's so impressive? Why should I give a shit? Isn't it weird that we're even having this conversation? Shouldn't it be obvious by now?

This week, economist Paul Kedrosky told me on the latest episode of my show Better Offline that AI is "...nowhere to be seen yet in any really meaningful productivity data anywhere," and only appears in the non-residential fixed investments side of America's GDP, at (and I quote again) "...levels we last saw with the railroad build out or with rural electrification." That's so fucking weird!
NVIDIA is the largest company on the US stock market and has sold hundreds of billions of dollars of GPUs in the last few years, many of them to the Magnificent Seven, who are building massive data centers and reopening nuclear power plants to power them, and every single one of them is losing money doing so, with revenues so putrid they refuse to talk about them!

And all that to make…what, Gemini? To power ChatGPT and Claude? What does any of this actually do that makes any of those costs actually matter? And as I've discussed above, what, literally, does this software do that makes any of this worth it?

Ask the average AI booster — or even a member of the media — and they'll say something about "lots of code being written by AI," or "novel discoveries" (unrelated to LLMs), or "LLMs finding new materials (based on an economics paper with faked data)," or "people doing research," or, of course, "that these are the fastest-growing companies of all time." That "growth" is only possible because all of the companies in question heavily subsidize their products, spending $3 to $15 for every dollar of revenue. Even then, only OpenAI and Anthropic seem to be able to make "billions of dollars of revenue," a statement that I put in quotes because however many billions there might be is up for discussion.

Back in November 2025, I reported that OpenAI had made — based on its revenue share with Microsoft — $4.329 billion between January and September 2025, despite The Information reporting that it had made $4.3 billion in the first half of the year based on disclosures to shareholders. While a few outlets wrote it up, my reporting has been outright ignored by the rest of the media. No other outlet reached out to me or otherwise acknowledged the work, and every outlet has continued to repeat that OpenAI "made $13 billion in 2025," despite that being very unlikely given that it would have required the company to have made over $8 billion in a single quarter. While I understand why — I'm an independent, after all — these numbers directly contradict existing reporting, which, if I were a reporter, would give me a great deal of concern about the validity of my reporting and the sources that had provided it.

Similarly, when Anthropic's CFO said in a sworn affidavit that it had only made $5 billion in its entire existence, nobody seemed particularly bothered, despite reports saying it had made $4.5 billion in 2025, and multiple "annualized revenue" reports — including Anthropic's own — that added up to over $6.6 billion.

Though I cannot say for certain, both of these situations suggest that Anthropic and OpenAI are misleading their investors, the media and the general public. If I were a reporter who had written about Anthropic or OpenAI's revenues previously, I would be concerned that I had published something that wasn't true, and even if I were certain that I was correct, I would have to consider the existence of information that ran counter to my own. I would be concerned that Anthropic or OpenAI had lied to me, or that they were lying to someone else, and work diligently to try and find out what happened. I would, at the very least, publish that there was conflicting information. The S-1 will give us the truth, I guess.

Let's talk for a moment about margins, because they're very important to measuring the health of a business. Back in February, in my Hater's Guide To Anthropic, I raised concerns that Dario Amodei was using a different way to calculate margins than other companies do.
Amodei told the FT in December 2024 that he didn't think profitability was based on how much you spent versus how much you made:

He then did the same thing in an interview with John Collison in August 2025:

Almost exactly six months later, in a February 13, 2026 appearance on the Dwarkesh Podcast, Dario would once again try and discuss profitability in terms other than "making more money than you've spent":

The above quote has been used repeatedly to suggest that Anthropic has 50% gross margins and is "profitable," which is extremely weird in and of itself as that's not what Dario Amodei said at all. Based on The Information's reporting from earlier in the year, Anthropic's "gross margin" was 38%. Yet things have become even more confusing thanks to reporting from Eric Newcomer, who (in reporting on an investor presentation by Coatue from January) revealed that Anthropic's gross margin was "45% in the quarter ended Sep-25," with the crucial note that — and I quote — "Non-GAAP gross margins [are] calculated by Anthropic management…[are] unaudited, company-provided, and may not be comparable to other companies."

This means that however Anthropic calculates its margins, it isn't doing so based on Generally Accepted Accounting Principles, which means that the real margins probably suck ass, because Anthropic loses billions of dollars a year, just like OpenAI. Yet one seemingly-innocent line in there gives me even more pause: "Model payback improving significantly as revenue scales faster than R&D training costs." This directly matches Dario Amodei's bizarre idea that "...If you consider each model to be a company, the model that was trained in 2023 was profitable. You paid $100 million, and then it made $200 million of revenue." Yes, I know it's a "stylized fact" or whatever, but that's what he said, and I think that their IPO might have a rude surprise in the form of a non-EBITDA margin calculation that makes even the most-ardent booster see red.

This week, The Wall Street Journal published a piece about OpenAI and Anthropic's finances that included one of the most-offensive lines in tech media history:

Two thoughts: As I said a few months ago about training costs:

The Journal also adds that both Anthropic and OpenAI are showing investors two versions of their earnings — one with training costs, and one without — without adding the commentary that this is extremely deceptive or, at the very least, extremely unusual. The more I think about it, the more frustrated I get. Having two sets of earnings is extremely dodgy! Especially when the difference between them is billions of dollars. This should be immediately concerning to every financial journalist, the reddest of red flags, the biggest sign that something weird is happening…

…but because this is the AI industry, the Journal runs propaganda instead:

That "fast-growing" part is only possible because both Anthropic and OpenAI subsidize the compute of their subscribers, allowing them to burn $3 to $15 for every dollar of subscription revenue. And no, this is nothing like Uber or Amazon, that's a silly comparison, click that link and read what I said and then never bring it up again.

I realize my suspicion around Anthropic's growth has become something of a meme at this point, but I'm sorry, something is up here. Let's line it all up: Anthropic was making $9 billion in annualized revenue at the end of 2025, or approximately $750 million in a 30-day period.
Per Newcomer, as of December 2025, this is how Anthropic's revenue breaks down:

Per The Information, Anthropic also sells its models through Microsoft, Google and Amazon, and for whatever reason reports all of the revenue from their sales as its own, then takes out whatever cut it gives them as a sales and marketing expense:

The Information also adds that "...about 50% of Anthropic's gross profits on selling its AI via Amazon has gone to Amazon," and that "...Google typically takes a cut of somewhere between 20% and 30% of net revenue, after subtracting infrastructure costs." The problem here is that we don't know the actual amounts of revenue that come from Amazon or Google (or Microsoft, for that matter, which started selling Anthropic's models late last year), which makes it difficult to parse how much of a cut they're getting. Nevertheless, something is up with Anthropic's revenue story.

Let's humour Anthropic for a second and say that what it's saying is completely true: it went from making $750 million in monthly revenue in January to $2.5 billion in monthly revenue in April 2026. That's remarkable growth, made even more remarkable by the fact that — based on its December breakdown — most of it appears to have come from API sales. That leap from $750 million to $1.16 billion between December and February feels, while ridiculous, not entirely impossible, but the further ratchet up to $2.5 billion is fucking weird! But let's try and work it out.

On February 5 2026, Anthropic launched Opus 4.6, followed by Claude Sonnet 4.6 on February 17 2026. Based on OpenRouter token burn rates, Opus 4.5 was burning around 370 billion tokens a week. Immediately on release, Opus 4.6 started burning way, way more tokens — 524 billion in its first week, then 643 billion, then 634 billion, then 771 billion, then 822 billion, then 976 billion, eventually going over a trillion tokens burned in the final week of March. In the weeks approaching its successor's launch, Sonnet 4.5 burned between 500 billion and 770 billion tokens a week. A week after launch, 4.6 burned 636 billion tokens, then 680 billion, then 890 billion, and, by about a month in, it had burned over a trillion tokens in a single week. Reports across Reddit suggest that these new models burn far more tokens than their predecessors, with questionable levels of improvement.

The sudden burst in token burn across OpenRouter doesn't suggest a bunch of people suddenly decided to connect to Anthropic and other services' models, but that the models themselves had started to burn nearly twice the number of tokens to do the same tasks. At this point, I estimate Anthropic's revenue split to be more in the region of 75% API and 25% subscriptions, based on its supposed $2.5 billion in annualized revenue (out of $14 billion, so a little under 18%) in February coming from "Claude Code" (read: subscribers to Claude, there's no separate "Claude Code" subscription). If that's the case, I truly have no idea how it could've possibly accelerated so aggressively, and as I've mentioned before, there is no way to reconcile having made $5 billion in lifetime revenue as of March 9, 2026, having $14 billion in annualized revenue on February 12 2026, and having $4.5 billion in revenue for the year 2025.
Things get more confusing when you hear how Anthropic calculates its annualized revenues, per The Information:

So, Anthropic is annualizing based on the last four weeks of API revenue times 13, a number that's extremely easy to manipulate using, say, launches of new products. In simpler terms, Anthropic is cherry-picking four-week windows of API spend — ones that are pumped by big announcements and new model releases — and annualizing them.

The one million token context window is a big deal, too, having been raised from 200,000 tokens in previous models. With Opus and Sonnet 4.6, Anthropic lets users use up to one million tokens of context, which means that both models can now carry a very, very large conversation history, one that includes every single output, file, or, well, anything that was generated as a result of using the model via the API. This leads to context bloat that absolutely rinses your token budget.

To explain: the context window is the information that the model can consider at once. With 4.6, Anthropic by default allows you to load in up to one million tokens' worth of information, and every single prompt or action you take re-sends everything currently in that window unless you actively "trim" it through context editing.

Let's say you're trying to work out a billing bug in a codebase via whatever interface you're using to code with LLMs. You load in a 350,000 token codebase, a system prompt (IE: "you are a talented software engineer," here's an example), a few support tickets, and a bunch of word-heavy logs to try and fix it. On your first turn (question), you ask it to find the bug, and you send all of that information through. It spits out an answer, and then you ask it how to fix the bug…but "asking it to fix the bug" also re-sends everything, including the codebase, tickets and logs. As a result, you're burning hundreds of thousands of tokens with every single prompt. Although this is a simplified example, it's the case across basically any coding product, such as Claude Code or Cursor. While Cursor uses codebase indexing to selectively fetch pieces of the codebase without constantly loading it into the context window, one developer using Claude inside of Cursor watched a single tool call burn 800,000 tokens by pulling an entire database into the context window, and I imagine others have run into similar problems. To be clear, Anthropic charges at a per-million-token rate of $5 per million input and $25 per million output, which means that those casually YOLOing entire codebases into context are burning shit tons of cash (or, in the case of subscribers, hitting their rate limits faster).
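To put rough numbers on that (illustrative figures, mine and not Anthropic's, using the $5/$25 per-million rates above): re-sending a 400,000-token context on every turn costs about $2 a turn on input alone, so a 50-turn session is $100 before you've paid for a single output token.

```python
INPUT_RATE = 5 / 1_000_000    # dollars per input token
OUTPUT_RATE = 25 / 1_000_000  # dollars per output token

context_tokens = 400_000  # codebase + system prompt + tickets + logs, re-sent every turn
output_tokens = 2_000     # a typical answer -- illustrative, not measured
turns = 50

print(f"input:  ${turns * context_tokens * INPUT_RATE:,.2f}")   # $100.00
print(f"output: ${turns * output_tokens * OUTPUT_RATE:,.2f}")   # $2.50
```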
If Anthropic actually made $2.5 billion in a month — we'll find out when it files its S-1! — that revenue likely came not from genuine growth or a surge of adoption, but from its existing products suddenly costing a shit ton more because of how they're engineered. The other possibility is the nebulous form of "enterprise deals" that Anthropic allegedly has, and the theory that they somehow clustered in this three-month-long period, but that just feels too convenient. If 70% of Anthropic's revenue is truly from API calls, this would suggest: I don't see much evidence of Anthropic creating custom integrations that actually matter, or — and fuck have I looked! — any real examples of businesses "doing stuff with Claude" other than making announcements about vague partnerships.

There's also one other option: that Silicon Valley is effectively subsidizing Anthropic through an industry-wide token-burning psychosis. And based on some recent news, there's a chance that's the case. As I discussed a few weeks ago, Silicon Valley has a "tokenmaxxing" problem, where engineers are encouraged to burn as many tokens as possible, at times by their peers, and at others by their companies. The most egregious — and honestly, worrying! — version of this came from The Information's recent story about Meta employees competing on an internal leaderboard to see who can burn the most tokens, deliberately increasing the size of their prompts and the number of concurrent sessions (along with unfettered and dangerous OpenClaw usage) to do so:

The Information reports that the dashboard, called "Claudeonomics" (despite said dashboard covering other models from OpenAI, Google, and xAI), has sparked competition within Meta, with users burning a remarkable 60 trillion tokens in the space of a month, and one individual averaging around 281 billion tokens, usage that The Information remarks could cost millions of dollars. Meta's company-mandated psychosis also gives achievements for particular things like using multiple models or high utilization of the cache. Here's one very worrying anecdote:

One poster on Twitter says that there are people at Meta running loops burning tokens to rise up the leaderboards, and that Meta's managers also measure lines of code as a success metric.

The Information says that, considering Anthropic's current pricing for its models, those 60 trillion tokens could amount to as much as $900 million in the space of a month, though adds that this assumes that every token being burned was on Claude Opus 4.6 (at $15 per 1 million tokens). I personally think this maths is a bit fucked, because it assumes that A) everybody is only using Claude Opus, B) none of that token burn runs through the cache (which it obviously does, and the cache charges 50%, as pointed out by OpenCode co-founder Dax Radd), and C) Meta is entirely using the API (versus paying for a $200-a-month Claude Max subscription for each user).

Digging in further, it appears that a few years ago Meta created an internal coding tool called CodeCompose, though a source at Meta tells me that developers use VSCode and an assistant called Devmate connected to models from Anthropic, OpenAI and xAI. One engineer on Reddit — albeit an anonymous one! — had some commentary on the subject:

If we assume that Meta is an enterprise customer paying API rates for its tokens, it's reasonable to assume — at even a low $5-per-million average — that it's spending $300 million or more a month on API calls. As Radd also added, there's likely a discount involved. He suggested 20%, which I agree with.
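You can sanity-check the spread yourself. This is a sketch of the range, not The Information's actual model; the blended rate and discount are the assumptions named above:

```python
tokens = 60_000_000_000_000  # 60 trillion tokens in a month, per The Information

opus_only = tokens / 1_000_000 * 15  # everything at Opus 4.6 rates: $900M
blended = tokens / 1_000_000 * 5     # a low $5-per-million blended average: $300M
discounted = blended * 0.8           # with the ~20% enterprise discount Radd suggested: $240M

for label, cost in [("Opus-only", opus_only), ("blended", blended), ("discounted", discounted)]:
    print(f"{label}: ${cost / 1_000_000:,.0f}M a month")
```

Same 60 trillion tokens, and the implied bill swings by a factor of nearly four depending on assumptions that nobody outside Meta and Anthropic can check.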
TokenMaxxing runs contrary to basically all good business and software practices, encouraging waste for the sake of waste, and resulting in few measurable productivity benefits or, in the case of Meta, anything user-facing that actually seems to have improved.

Venture capitalist Nick Davidov mentioned yesterday that sources at Google Cloud "started seeing billions of tokens per minute from Meta, which might now be as big as a quarter of all the token spend in Anthropic." While I can't verify this information (and Davidov famously deleted his photos using Claude Cowork while attempting to reorganize his wife's desktop), if that's the case, Meta is a load-bearing pillar of Anthropic's revenue — and, just as importantly, a large chunk of Anthropic's revenue flows through Google Cloud, which means A) that Anthropic's revenue truly hinges on Google selling its models, and B) that said revenue is heavily inflated by the fact that Anthropic books revenue without cutting out Google's 20%+ revenue share.

In any case, TokenMaxxing is not real demand, but an economic form of AI psychosis. There is no rational reason to tell somebody to deliberately burn more resources without a defined output or outcome other than increasing how much of the resource is being used. I have confirmed with a source at Meta that there is no actual metric or tracking of any return on investment involved in its token burn, meaning that TokenMaxxing's only purpose is to burn more tokens to go higher on a leaderboard, and it is creating bad habits across a company that already has decaying products and leadership.

To make matters worse, TokenMaxxing also teaches people to use Large Language Models poorly. While I think LLMs are massively overrated, with their outcomes and potential massively overstated, anyone I know who actually uses them for coding generally has habits built around making sure token burn isn't too ridiculous, ways to do things faster without LLMs, and ways to be intentional about which models they use for particular tasks. TokenMaxxing literally encourages the opposite: use whatever you want, however you want, spending as much money as possible, because the only thing that matters is burning more tokens.

Furthermore, TokenMaxxing is exactly the kind of revenue that disappears first. Zuckerberg has reorganized his AI team four or five times already, and massively shifted Meta's focus multiple times in the last five years, proving that at the very least he'll move on a whim depending on external forces. After laying off tens of thousands of people in the last few years, Meta has shown it's fully capable of dumping entire business lines or groups at a moment's notice, and while moving on from AI might be embarrassing, embarrassment would require Mark Zuckerberg to experience shame, or indeed any emotion other than anger.

This is the kind of revenue that a business needs to treat with extreme caution, and if Meta is truly spending $300 million or more a month on tokens, Anthropic's annualized revenues are aggressively and irresponsibly inflated to the point that they can't be taken seriously, especially if said revenue travels through Google Cloud, which takes another 20% off the top at the very least.
Though the term is pretty new, the practice of encouraging your engineers to use AI as much as humanly possible is an industry-wide phenomenon, especially across hyperscalers like Amazon, Microsoft and Google, all of whom have, until recently, directly pushed their workers to use models with few restraints. Shopify and other large companies are encouraging their workers to reflexively rely on AI, with performance reviews that include stats around your token burn and other nebulous "AI metrics" that don't seem to connect to actual productivity. I'm also hearing — though I've yet to be able to confirm it — that Anthropic and other model providers are forcing enterprise clients to start using the API directly rather than paying for monthly subscriptions.

Combined with mandates to "use as much AI as possible," this naturally increases the cost of having software engineers, which — and I say this not wanting anyone to lose their jobs — does the literal opposite of replacing workers with AI. Instead, organizations are arbitrarily raising the cost of doing business without any real reason. Because we're still in the AI hype cycle, this kind of wasteful spending is both tolerated and encouraged, and the second that financial conditions worsen or stock prices drop due to increasing operating expenses, these same companies will cut back on API spend, which will overwhelmingly crush Anthropic's glowing revenues.

I think it's also worth asking at this point what it is we're actually fucking doing.

We're building — theoretically — hundreds of gigawatts of data centers, feeding hundreds of billions of dollars to NVIDIA to buy GPUs, all to build capacity for demand that doesn't appear to exist, with only around $65 billion of revenue (not profit) for the entire generative AI industry in 2025, and much of that flowing from two companies (Anthropic and OpenAI) making money by offering their models to unprofitable AI startups that cannot survive without endless venture capital, which is also the case for both AI labs. Said data centers make up 90% of NVIDIA's revenue, which means that 8% or so of the S&P 500's value comes from a company that makes money selling hardware to people who immediately lose money on installing it. That's very weird! Even if you're an AI booster, surely you want to know the truth, right?

The most prominent companies in the AI industry — Anthropic and OpenAI — burn billions of dollars a year, have margins that get worse over time, and have absolutely no path to profitability, yet the majority of the media act as if this is a problem that they will fix, going so far as to make up rationalizations as to how they'll fix it, focusing on big revenue numbers that wilt under scrutiny. That's extremely weird, and only made weirder by members of the media who seem to think it's their job to defend AI companies' bizarre and brittle businesses. It's weird that the media's default approach to AI has, for the most part, been to accept everything that the companies say, no matter how nonsensical it might be.

I mean, come on! It's fucking weird that OpenAI plans to burn $121 billion in the next two years on compute for training its models, and that the media's response is to say that somehow it will break even in 2030, even though there's no actual explanation anywhere as to how that might happen other than vague statements about "efficiency." That's weird! It's really, really weird!
It's also weird that we're still having a debate about "the power of AI" and "what agents might do in the future" based on fantastical thoughts about "agents on the internet" that do not exist, cannot exist, and will never exist, and it's fucking weird that executives and members of the media keep acting as if that's the case. It's also weird that people discussing agents don't seem to want to discuss that OpenAI's Operator Agent does not work, that AI browsers are fundamentally broken, or that agentic AI doesn't do the things people claim it does. In fact, that's one of the weirdest parts of the whole AI bubble: the possibility of something existing is enough for the media to cover it as if it exists, and a product saying that it will do something is enough for the media to believe it does it. It's weird that somebody saying they will spend money is enough to make the media believe that something is actually happening, even if the company in question — say, Anthropic — literally can't afford to pay for it.

It's also weird how many outright lies are taking place, and how little the media seems to want to talk about them. Stargate was a lie! The whole time it was a lie! That time that Sam Altman and Masayoshi Son and Larry Ellison stood up at the White House and talked about a $500 billion infrastructure project was a lie! They never formed the entity! That's so weird! Hey, while I have you, isn't it weird that OpenAI spent hundreds of millions of dollars to buy tech podcast TBPN "to help with comms and marketing"? It's even weirder considering that TBPN was already a booster for OpenAI!

It's also weird that a lot of AI data center projects don't seem to actually exist, such as Nscale's project to make "one of the most powerful AI computing centres ever," which is literally a pile of scaffolding, and that despite that announcement the company was able to raise $2 billion in funding.

It's also weird that we're all having to pretend that any of this matters. The revenues are terrible, Large Language Models are yet to provide any meaningful productivity improvements, and the only reason that they've been able to get as far as they have is a compliant media and a venture capital environment borne of a lack of anything else to invest in. Coding LLMs are popular only because of their massive subsidies and corporate encouragement, and in the end they will be seen as a useful-yet-incremental, way-too-expensive way to make the easy things easier and the hard things harder, all while filling codebases with masses of bloated code that nobody meant to write. If everybody was forced to pay their actual costs for LLM coding, I do not believe for a second that we'd have anywhere near the amount of mewling, submissive and desperate press around these models.

The AI bubble has every big, flashing warning sign you could ask for. Every company loses money. Seemingly every AI data center is behind schedule, and the vast majority of them aren't even under construction. OpenAI's CFO does not believe that it's ready to go public in 2026, and Sam Altman's reaction has been to have her report to somebody other than him, the CEO. Both OpenAI and Anthropic's margins are worse than they projected. Every AI startup has to raise hundreds of millions of dollars, and their products are so weak that they can only make millions of dollars of revenue after subsidizing the underlying cost of goods to the point of mass unprofitability.
And it's really weird that the mainstream media holds the diametrically opposed view — that all of this is totally permissible under the auspices of hypergrowth, that these companies will simply grow larger, that they will somehow become profitable in a way that nobody can actually describe, and that demand for AI data centers will exist despite there being no signs of that happening.

I get it. Living in my world is weird in and of itself. If you think like I do, you have to see every announcement by Anthropic or OpenAI as suspicious — which should be the default position of every journalist, but I digress — and any promise of spending billions of dollars as impossible without infinite resources. At the end of this era, I think we're all going to have to have a conversation about the innate credulity of the business and tech media, and how often that was co-opted to help the rich get richer.

Until then, can we at least admit how weird this all is? Take the enterprise pitch for "AI agents," line by line:

"Telecommunications: AI agents will help carriers modernize network operations, simplify customer lifecycle management, and improve service delivery—bringing intelligent automation to one of the most operationally complex and regulated industries in the world."

Meaningless. Automation of what?

"Financial services: AI agents will help firms detect and assess risk faster, automate compliance reporting, and deliver more personalized customer interactions, such as tailoring financial advice based on a client's full account history and market conditions."

Chatbot! "More-personalized interactions" are a chatbot with a connection to a knowledge system, as is any kind of "tailored financial advice." Compliance reporting? Summarizing or pulling documents from places, much like any LLM can do, other than the fact that it'll likely get shit wrong, which is bad for compliance.

"Manufacturing and engineering: Claude will help accelerate product design and simulation, reducing R&D timelines and enabling engineers to test more iterations before production."

I assume this refers to people using Claude Code to do coding, which is what it does.

"Software development: Teams will use Claude Code to write, test, and debug code, helping developers move faster from design to production."

Claude Code.

"Enterprise operations: Claude Cowork will help teams automate routine work like document summarization, status reporting, and review cycles."

Literally a chatbot that deleted every single one of a guy's photos when he asked it to organize his wife's desktop.

"Gather information" — search tool, part of chatbots for years. "Write reports" — generative AI's most basic feature, with no details on quality. "Edit files" — to do what, exactly? Chatbot feature. "Send and receive messages through email and text" — generating and reading text, connected to an email account. "Delegate work" — what work? No need to get specific!

Are you fucking kidding me? If you simply remove billions of dollars in costs, OpenAI is profitable! Why do you think these companies are going to break even anytime soon? You have absolutely no basis for doing so other than leaks from the company!

0 views
Martin Fowler 1 week ago

Feedback Flywheel

Rahul Garg finishes his series on reducing the friction in AI-Assisted Development. He proposes a structured feedback practice that harvests learnings from AI sessions and feeds them back into the team's shared artifacts, turning individual experience into collective improvement.

0 views
Stratechery 1 week ago

Anthropic’s New Model, The Mythos Wolf, Glasswing and Alignment

Anthropic says its new model is too dangerous to release; there are reasons to be skeptical, but to the extent Anthropic is right, that raises even deeper concerns.

0 views
Brain Baking 1 week ago

A Commentary On GenAI Inspected Through Different Lenses

The number of concerning reports related to generative AI is rising at an alarming rate, yet all we do is make ourselves more dependent on the brand-new technology. Why? It's not just that we're lazy—we are!—there are many more variables involved. As part of my quest to try and understand what the heck is going on, and what is becoming of one of my prime professional fields (software engineering), I read and read and read. And then I read and read and read. And then I became disappointed and depressed. I see colleagues jumping the gun, others being more prudent. I see industry discovering there's yet another buck to be made. I see students forgoing learning at all. I wanted to try to form my own judgement of genAI in its modern form by looking at it from four different viewpoints: that of the software engineer, that of the teacher, that of the creativity researcher, and that of the concerned civilian living in this capitalist world. References can be found at the end of this article.

Does anyone remember Dan North's Programming is not a craft post from 2011? I do, and I often think about it. With the advent of genAI, North's post might be even more polarising. Well, congrats to you, you've won the lottery: here's a tool that can immediately add customer value. If you don't care about the inner code quality, you can have genAI generate (slop) code faster than you can think. If you love the impact of software itself, you'll love Claude Code et al. Are you perhaps an enterprise software engineer? In that case you'll be able to scaffold and generate CRUD crap even faster, hooray!

But wait a minute. You obviously won't take true ownership of this code: you'll want to impress your clients with the results, but keep the lid closed at all times. The less ownership and feeling of responsibility, the easier it becomes to completely let go of all the brakes and just accept any future changes without code reviewing at all. People who are now claiming they will keep themselves in the loop as an architectural reviewer are lying to themselves: after the nth time pressing the green button, and as the technology further evolves, you'll eventually wind up accepting the slop anyway. Verification burnout will pop up next: because it's not your own code you're attempting to so carefully review, it actually takes more instead of less effort, increasing your stress level instead of reducing it!

Does the code quality really matter if all clients see is the end product? As a gamer, I just want the game to run smoothly; I don't care about the spaghetti. Or do I? I do, implicitly—the more spaghetti, the less smoothly it'll run. The more holes, the more soft locks and crashes. So programming might or might not be a craft, but as Cal Newport and Robert M. Pirsig say: the concept of Quality is important!

Maybe it's time to become a goose farmer instead. The only thing left for you to do is to move to a depressing quality-control position instead of crafting something yourself. No more "I built this", but "I managed its orchestration". Depending on how you view this, it's either a promotion or a demotion. I tend to agree with the latter. Why? Because we humans are Homo Faber, the ones who like to control their fate and environment with the use of tools. Yes, genAI certainly is a tool, but it's a tool that takes away all other tools. Instead of kneading dough by hand, feeling it, knowing when to ferment and when to bake, we're forced to oversee the industrial Wonder Bread production process.
Instead of manipulating leather to create a pair of shoes, we're being employed by Nike to watch shoes being made by machines. This somehow reminds me of David Graeber's bullshit jobs, where useless paper-pushing is prevalent, yet here it gets called a "revolution" when it comes to a professional purpose. I beg to differ. Humans want to make things. They want to be proud of the things they made. The fact that the open source community rejects this slop code is a telling sign: if you're programming in the open, your peers who also think highly of software development will keep you in check. But when it's "for (enterprise) work", we don't care: generate away, I'm not the true owner anyway.

If programming is a craft, then the recently leaked Claude Code CLI source code will be a big joke to you: constructs are endlessly repeated, and spaghetti is topped up with more spaghetti. Code that is being generated doesn't even seem to be made to be (re)read. How, then, are we expecting to maintain it, or guarantee its security? By letting the agent maintain it and guarantee its security, I can hear you say? What is there left to say?

I've already asserted that genAI tools are worse than Stack Overflow. Sure, mindless copy-pasting existed long before this AI storm, but not on this scale. GenAI is able to provide a working solution to an assignment faster than I can come up with the assignment itself. Suddenly, all our traditional evaluation systems and grading workflows became useless: scoring high on a checklist is just a matter of pasting the requirements into Claude. We try to adapt by requiring oral defences, having students explain what they did and why, and asking them to walk us through a small imagined change. The result is a spectacular fall in grades from previous years: they are just not able (1) to explain the code they did not write but generated, and (2) to make small adjustments, as they skipped the hard part: the learning and understanding. Yet in the hallways, I hear lots of students bragging to each other about how they let ChatGPT do their homework. Congrats. We'll see each other again in September for your second try.

We often forget something else very important: peer pressure. About a year ago, on the train, I overheard a few girls on their way to a university lecture chatting about their homework. One of them complained: "I put in all that hard work, but all the others are just using ChatGPT to do it. Next time, I'm not doing all that, I'm also just using AI, that's not fair!" I should have gotten up to congratulate her: the only one actively learning is the one putting in the hard work! There is no shortcut to becoming proficient. There is only hard work. Sure, the more you prompt your way through your curriculum, the more proficient you'll become with the tool, but ask yourself: did you learn what you wanted to learn, or did you learn to prompt?

When I was an undergraduate, I used to fill A4 pages with summaries of courses to help me study. Just before the exams, I could quickly glance over these pages to remember the core concepts. Some students sold their summaries to others. Now, genAI can generate summaries for you. But smart students know this only fools themselves: the purpose of the summaries is to make them, to study and gradually fill the pages, not to acquire a summary. The journey is the destination. When my summaries were done, I could just as well throw them away: they were just a tool to help with the hard work.
Yet it's next to impossible to explain this to a student who only sees how easy it is to jump to an outcome by leveraging AI. Maybe legislation will help here? (Not really; see below.) In case all this is not clear: students are becoming dumber, yet the programming projects they hand in are becoming better than ever.

As the inventor of the framework presented in The Creative Programmer, I thought it would be interesting to take a look at the seven domains and how genAI fits into them. In The Creative Programmer, I present seven distinct but heavily intertwined themes that define the way we are creative when we solve a programming problem:

Technical Knowledge—if we don't have any knowledge, we won't have the creative ability to combine pieces of it. Guess what: genAI is actively deskilling us. The more you generate, the less you actively learn, harming your creative ability to solve problems. Creativity requires a rich mental toolbox to draw from. By prompting, you're not exactly filling that toolbox.

Communication—I see both a good and a bad thing here: if your colleagues aren't immediately available, rubber-ducking with an AI agent might help identify the problem. On the other hand, it's also awfully easy to stay locked inside that comfortable genAI chatbox. Why ask anyone when it tells you exactly what you want to hear?

Constraints—if you manage to constrain yourself (ha!) to only asking AI for ten possible ways to approach a problem you don't know how to approach, without having it solve the problem for you, this might help you learn how to approach certain heavily constrained environments. Unfortunately, it's very easy to just have it generate the solution as well, rendering a possible learning path useless.

Critical Thinking—the more we use genAI, the less critical we are, and the more likely we are to accept whatever comes out of it. Validating the source material outside of that chatbox suddenly requires a lot of willpower. I've even heard of people changing their entire preferred technology stack to something more popular because genAI is better at it. That's very sad.

Curiosity—judge for yourself. What does reliance on genAI tell you about your curiosity to discover other things?

Creative State of Mind—without Cal Newport's "Deep Work", there won't be an "aha!" moment. The "90% perspiration, 10% inspiration" ratio is suddenly turned on its head: Claude is the one sweating for us, even at night, while all we do is press the green button and write "LGTM!". Maybe we should take the time to read Newport's new book Slow Productivity.

Creative Techniques—genAI itself as a technique might belong in this section, but the question is: are we the ones wielding the tool, or is the tool wielding us?

I might be overly focusing on the negative here, and have to recognise the possible advantages of having genAI as a tool available in our creative toolbox—but only when we learn to wield it properly and with moderation, which is not exactly what we are doing lately, is it? In an interesting systematic literature review (2025), with lots of references to other academic material if that's what you're looking for, Holzner et al. conclude:

[…] human-GenAI collaboration shows small but consistent gains in creative output across tasks and contexts. However, collaboration with GenAI reduces the diversity of ideas, indicating a risk of creative outputs that could become more homogeneous.

More same-ness; exactly what we need when it comes to creativity, right? The more we use genAI, the more creatively we will be able to prompt, but the less creative we will be in actually applying a solution to the problem. We no longer create: we generate. We know that genAI will do everything in its power to keep you locked within that chat box. Its tendency to tell you what you want to hear, agree with your statements, and serve you whatever you ask for creates biases and dependencies. It's not unlike a drug that slowly but surely diminishes your critical thinking, and thus your creativity.

This is where the true nature of humans is revealed: when it comes to earning something for themselves, ethics suddenly becomes a very malleable subject. On morality, ethics, and privacy, everyone agrees that genAI is what Ron Gilbert calls a train wreck. This bears no further explanation from me: Microsoft slurped all GitHub repositories dry without taking any licenses into account, the book that I painstakingly produced over almost two years was ingested into OpenAI's systems in about two seconds, … Yet at the same time, everyone also consistently ignores all these topics in favour of their own self-interest. Why, I wonder?

Everyone knows they should eat less meat. Yet almost nobody does. Everyone knows Microsoft (and probably other big tech companies) power genocide, yet the adoption rate of Windows as an OS is still 95%. Why? Everyone knows the climate is going to shit, yet we happily turn the other way and take the plane on a weekend trip to sip some wine and do some shopping in Italy. As Greta Thunberg said: knowing is not enough. For genAI, similar patterns emerge. We know it's bad for us, yet we happily close our eyes and use it anyway. Why, I wonder? The power of a drug, the pull, the ease with which something can be done without breaking much of a sweat? Here's a possible answer I suggested before: because humans are inherently lazy. As long as Belgian supermarkets keep on stocking apples from both New Zealand and Belgium, most people won't care and will just pick up whatever.
As long as we keep handing out company cars and building infrastructure geared towards car drivers, most people will keep driving to work instead of biking. A possible answer to the problem, then, might be governmental legislation to protect people living in a society from making the wrong choices. And I'm 100% sure that will work! Yet legislation is always either (1) happening way too late, or (2) minimised or manipulated by the people who wield the power, because they have bought out key politicians to prevent laws like this from happening. Hence my depression. In the case of genAI, a technology that evolves at lightning speed and is taking the world by storm, legislation will be way too late. To prove my point: in an attempt to modernise, many Belgian governmental agencies have already "embraced" the technology, and made many blunders in doing so. The EU is currently evaluating the options. Meanwhile, the San Francisco bros are laughing.

Prompt engineering is the most degenerative thing that ever happened to engineering. It's a capitalist's way to minimise the cost of the human. Yet I don't see genAI disappearing any time soon. Companies and decision-makers smelled the green and won't let go. I don't understand how capitalism works, but I know it's been growing in power ever since we centralised cane sugar plantations with the help of slavery. GenAI is evidently yet another product of capitalism. The companies I've worked for wanted more and more profit each year: even though they were sometimes satisfied with last year's profit, the target for the next year was always increased, no matter what. GenAI is already responsible for thousands of layoffs in an attempt to push profit up even more aggressively. To what end, I wonder? Why? To our own detriment. It seems that our cognition is for sale, and the sale has already been made. You know what they say: no returns accepted.

Peer pressure to use genAI on the job is already prevalent, as it "gets things done faster", and so quite logically also brings in money faster. Let's worry about durability and maintenance later, shall we? Also, I've seen colleagues fall into the trap of obsessive agent babysitting. Whether at work, on the lunch break, or in the very late evenings: you've got to keep those agents spinning! Squeeze the maximum out of your tokens, because they squeeze the maximum out of you. There goes our work-life balance, courtesy of the tools that are supposed to take over our work so we can focus more on the life part.

So as long as I remain in a position to choose whether I put in the work myself for my (hobby) programming projects, I will. As long as I am in a position to bike instead of drive, to be a vegetarian instead of a meat-eater, or, in short, to be a concerned civilian, I will. And so should you. Even though that won't stop this devolution from happening at all. Sure, I will occasionally consult Gemini et al. with a specific question regarding a broken config file that has me scratching my head. But I treat these queries as specialised internet searches, not as a way to evade the hard work completely. I've become Albert Camus's pessimist. I'm genuinely afraid of how our kids will turn out if we don't act quickly to save our youth. Yet I won't stop being an activist.
Reading List

I'd rather link to personal blog posts than academic publications here, as we're dealing with something that impacts us on a personal level, and by the time the relevant 2026 studies are published, the landscape will have changed yet again. The following folks expressed their experience and opinions on genAI:

Nolan Lawson; How I use AI agents to write code. A clear conflicted state: it's okay to generate away at work, but "I also don't use AI for my open-source work, because it just feels… ick. The code is 'mine' in some sense, but ultimately, I don't feel true ownership over it, because I didn't write it".
John Allsopp; The Structure of Engineering Revolutions
Dave Gauer; A programmer's loss of social identity
Cory Zue; Software got weird
Doug Belshaw; Claude's Constitution and the trap of corporate AI ethics
Tom Hall; Towards a Slow Code Manifesto
Rishi Baldawa; The Reviewer Isn't the Bottleneck
Information/superhighway.net; On The Need For Understanding
Antoine Leblanc; Chatbot psychosis (Mastodon). "this is the main reason why i believe that chatbot addiction / chatbot psychosis is a LOT more widespread than we realise: people with a clear understanding of the ethical issues try claude once, it does a thing correctly enough, they get one-shot, and they start posting like if sephiroth was on linked-in, ethical concerns be damned. it keeps happening." Exactly.
Sean Boots; Generative AI vegetarianism
Simon Willison; Perhaps not Boring Technology after all
Sophie from Localghost; Stop Generating, Start Thinking
Michael Harley; AI Stance
Lauren Woolsey; AI Sucks And You Shouldn't Use It
Ron Gilbert; My Dinner With AI
Matthew Lamont; Generative AI is an Evil Technology
Arne Brasseur; The AI Divide (Mastodon)
Zach Manson; CoPilot Edited an Ad Into My PR
Michael Taggart; I Used AI. It Worked. I Hated It.
Bob Nystrom; The Value of Things. GenAI can have utility but not meaning.
Jonny; Dismantling Claude Code source (Mastodon). Another train wreck, as expected.
Cal Newport; In Defense of Thinking
Hamilton Greene; Why I'm moving from F# to C#
Senator Bernie Sanders vs. Claude (YouTube)
Joel Chrono; Not having to work would be nice (but not like this)

By Wouter Groeneveld on 8 April 2026.

0 views
Armin Ronacher 1 week ago

Mario and Earendil

Today I'm very happy to share that Mario Zechner is joining Earendil. First things first: I think you should read Mario's post. This is his news more than it is ours, and he tells his side of it better than I could. What I want to do here is add a more personal note about why this matters so much to me, how the last months led us here, and why I am so excited to have him on board.

Last year changed the way many of us thought about software. It certainly changed the way I did. I spent much of 2025 building, probing, and questioning how to build software, and, in many more ways, questioning what I want to do. If you are a regular reader of this blog, you were along for the ride. I wrote a lot, experimented a lot, and tried to get a better sense for what these systems can actually do and what kinds of companies make sense to build around them. There was, and continues to be, a lot of excitement in the air, but also a lot of noise. It has become clear to me that it's not a question of whether AI systems can be useful, but what kind of software and human-machine interactions we want to bring into the world with them.

That is one of the reasons I have been so drawn to Mario's work and approach. Pi is, in my opinion, one of the most thoughtful coding agents and agent infrastructure libraries in this space. Not because it is trying to be the loudest or the fastest, but because it is clearly built by someone who cares deeply about software quality, taste, extensibility, and design. In a moment where much of the industry is racing to ship ever more quickly, often at the cost of coherence and craft, Mario kept insisting on making something solid. That matters to me a great deal. I have known Mario for a long time, and one of the things I admire most about him is that he does not confuse velocity with progress. He has a strong sense for what good tools should feel like. He cares about details. He cares about whether something is well made. And he cares about building in a way that can last. Mario has been running Pi in a rather unusual way: he exerts back-pressure on the issue tracker and the pull requests through OSS vacations and other means.

The last year has also made something else clearer to me: these systems are not only exciting, they are also capable of producing a great deal of damage. Sometimes that damage is obvious; sometimes it looks like low-grade degradation everywhere at once. More slop, more noise, more disingenuous emails in my inbox. There is a version of this future that makes people more distracted, more alienated, and less careful with one another. That is not a future I want to help build.

At Earendil, Colin and I have been trying to think very carefully about what a different path might look like. That is a big part of what led us to Lefos. Lefos is our attempt to build a machine entity that is more thoughtful and more deliberate by design. Not an agent whose main purpose is to make everything a little more efficient so that we can produce even more forgettable output, but one that can help people communicate with more care, more clarity, and joy. Good software should not aim to optimize every minute of your life, but should create room for better and more joyful experiences, better relationships, and better ways of relating to one another. Especially in communication and software engineering, I think we should be aiming for more thought rather than more throughput. We should want tools that help people be more considerate, more present, and more human.
If all we do is use these systems to accelerate the production of slop, we will have missed the opportunity entirely. This is also why Mario joining Earendil feels so meaningful to me. Pi and Lefos come from different starting points, and there was a year of collaboration at a distance, but they are animated by a similar instinct: that quality matters, that design matters, and that trust is earned through care rather than captured through hype.

I am very happy that Pi is coming along for the ride. Colin and I care a lot about it, and we want to be good stewards of it. It has already played an important role in our own work over the last months, and I continue to believe it is one of the best foundations for building capable agents. We will have more to say soon about how we think about Pi's future and its relationship to Lefos, but the short version is simple: we want Pi to continue to exist as a high-quality, open, extensible piece of software, and we want to invest in making that future real. As for our thoughts on Pi's license, read more here, and see our company post here.

0 views
Giles's blog 1 week ago

Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?

Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot more than the original GPT-2's 3.500 on the same dataset.

I wanted to see whether I could train a model on my own hardware (or on something that didn't cost too much to rent in the cloud) that got closer to the original model's performance. So over the last few months, I've done a bunch of further training runs, each one testing a specific intervention -- a stand-alone change that I expected to change the loss, either for better or for worse. Specifically:

- I trained a baseline model on an 8x A100 40 GiB per GPU machine on Lambda (which was better than my original locally-trained model, I believe due to the larger batch size that the larger machine made possible).
- I tried adding gradient clipping to see if that would help by limiting the effects of loss spikes.
- I tried removing dropout, given that these days people tend not to use it (because we're doing single-epoch training runs).
- I tried adding bias to the attention weight matrices -- something that was popular back in the GPT-2 era, and was used by the original weights, but which my code did not use.
- Instead of just using the learning rate of 0.0004 that was used in the code from the book, I looked into what values people use these days, and learned how to schedule it over the course of the training run.
- Similarly, I learned more about weight decay and tried some alternative values.
- Then I tried making my model more like the original GPT-2 one by introducing weight tying to see if that would help.
- Finally, I decided to try training in "full-fat" float32 instead of using PyTorch's AMP and TF32 matrix multiplication performance enhancements.

At the end of all of that, I had a table showing the effect of each intervention in terms of loss on the test set, sorted from least-effective to most-effective, with the baseline in there too. Winners and losers were reasonably clear: weight tying and the weight decay number I derived from a paper by Cerebras Research (probably without understanding it properly) were negatives; full-fat float32, gradient clipping, attention biases, the GPT-2 weight decay parameter, removing dropout, and scheduling (and updating) the learning rate were positives.

So, for an optimal train, we'd just use the effective interventions, right? Well, not quite. Full-fat float32 I decided wasn't worth the effort, as it meant that the train took more than twice as long and (because it required a larger machine) cost more than three times as much. The others did look like solid changes, but there was one concern: the effect of each intervention is actually pretty small. For example, gradient clipping reduced the loss by 0.014, from 3.692 to 3.678. That's a 0.3% improvement. Even the best intervention, scheduling the learning rate, only improved things by 2%. Could it be that some or all of these improvements were not real, but just a result of the random nature of training deep neural networks? Could the differences just be in the noise? They seemed small enough for that to be possible.

I've trained seven more models over the last few days to try to get a feel for how big an effect noise has on this kind of training run. The results appear to show that variations in the initial weights matter quite a lot, but randomness in the training loop (given the same initial weights) actually has a fairly minimal impact. That surprised me a bit! Let's go through the details.

When I did the original baseline training run -- creating the model that was the comparison point for all of the interventions -- I wanted to minimise the amount of random-number-induced difference between the training runs in this interventions series. I did this by setting the random seed at the start; a sketch of that code is below. At the time I wrote it, this seemed pretty complete -- the seed is set on Python's own random number generator, on PyTorch's, and on the separate ones it uses for CUDA. However, in a separate project, where I was fine-tuning a Qwen model as a classifier, I'd found that this wasn't enough: in order to get full reproducibility, I'd had to lock things down a bit more with some additional code (also sketched below).

So: was my random number seed code enough for this case? Or would I get a different model if I ran the same code a second time? That was easy enough to check: I spun up a machine, and just ran the "baseline" train again. 3 hours 24 minutes later, it was done -- and interestingly, it had exactly the same final train loss as the original baseline train. Here's the model.
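The seed-setting code in question looks something like the following sketch (a reconstruction from the description above, not necessarily the post's exact code):

```python
import random
import torch

SEED = 42

# Seed Python's own PRNG, PyTorch's CPU PRNG, and the per-device CUDA PRNGs.
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
```

And the extra lockdown needed for full reproducibility in the Qwen fine-tuning project would be something along these lines (again a sketch; the exact flags needed depend on the PyTorch version and the ops in use):

```python
import os
import torch

# cuBLAS needs a fixed workspace size to give deterministic matrix multiplies.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Force deterministic kernels, and stop cuDNN from auto-tuning algorithms
# (the auto-tuner can pick different, non-deterministic kernels per run).
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```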
I ran my normal smoke test, asking it to complete "Every effort moves you", and that was OK -- the model was generating reasonably coherent text. Then I ran the eval to find its loss on the test set: exactly the same as the original baseline! That was certainly promising.

Now, the use of three decimal places for the output from the loss eval is just a formatting thing, so I bumped it up to six decimal places and ran it for the new model, then against the original baseline model. Again, exactly the same.

Finally, more out of idle interest than anything else, I decided to see if the models' weights were at least different -- and they weren't. That is, quite frankly, amazing to me. I was expecting pretty close results, but what we're seeing here is that two separate models, trained on the same data, but on different machines more than a month apart, have weights that are bit-wise identical. No random noise at all. That's actually really reassuring! It makes me much more comfortable that we're standing on a stable foundation here.

Now it was time to see what effect changing that random seed would have. Let's think about what the random seed does. When we call something like random.seed(42), we're initialising Python's pseudo-random number generator so that it will start at a particular point -- after we've called it, it will generate the same sequence of "random" numbers each time it's asked for a new one. So the effect of the seed-setting code is to initialise three separate pseudo-random number generators into a known deterministic state, so they'll all generate the same sequence in every run.

So, the first thing to do was to see what happened if we changed that number. I decided to do two training runs, each with exactly the same code as the baseline, but with different random seeds. Firstly, I changed it from 42 to 22 [1]. That training run completed; here's the model. Time for the evals: the smoke test generated coherent text, and the loss test came out at 3.673453 compared to the baseline's 3.691526 -- an improvement of 0.018 over the run with a seed of 42. That's more than the 0.014 improvement we got from gradient clipping (and indeed, the 0.013 from full-fat float32 training), and quite close to the 0.023 improvement from adding attention weight bias.

Time for another training run with yet another seed. Another 3h24m later, here's the model. The smoke test was fine, and the test set loss was a further improvement! That's 0.038 better than our original baseline, which beats adding attention weight bias (though it's worse than the weight decay update).

Now, three data points is rather a small number for any kind of statistical analysis, but just out of interest, let's do the basics (GeeksForGeeks has a good refresher here if you're a bit rusty). Our mean is ~3.672857, our variance [2] is ~0.00024, and if we take the square root of that, we get the standard deviation (SD): ~0.0154919.

So, if we assume a normal distribution, what would that say about our results? Here's the results table again. If we assume that the results are on a normal distribution:

- We would expect ~68.2% of results to be within one SD of the mean -- that is, between 3.6573651 and 3.6883489. Interestingly, our actual baseline result is outside that range! But it does include both the gradient clipping and the QKV bias results.
- We would additionally expect ~95.4% of the results to be within two SDs, which is 3.6418732 to 3.7038408. That includes our baseline and our weight decay result (though not our experiment removing dropout -- the six-DP loss number for that is 3.641282).
- Finally, we'd expect ~99.7% of results to be within three SDs, which is a range from 3.6263813 to 3.7193327. That covers all of our positive results apart from scheduling the learning rate!

That seemed a bit saddening -- were all of the results apart from scheduling the learning rate within the noise? Well, as I said, three data points is too small a number to take those results without a fistful of salt. I was thinking of perhaps trying another few random seeds to see what would happen, and perhaps to tighten those numbers up a bit, but then something occurred to me -- randomness was being used in two different ways in the training run, and perhaps we could separate them? Where do we use the random numbers?
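As a sanity check, those basic statistics are easy to reproduce. Note that this sketch uses the population variance (dividing by n; see footnote [2]) and infers the third run's loss from the "0.038 better than baseline" figure, so the final digits come out slightly different from the numbers above:

```python
import math

# Test-set losses for the three runs with different starting seeds. The
# third value is inferred from "0.038 better than our original baseline",
# so it's approximate.
losses = [3.691526, 3.673453, 3.691526 - 0.038]

mean = sum(losses) / len(losses)
# Population variance: divide by n rather than n - 1 (see footnote [2]).
variance = sum((x - mean) ** 2 for x in losses) / len(losses)
sd = math.sqrt(variance)

print(f"mean ~ {mean:.6f}, variance ~ {variance:.6f}, SD ~ {sd:.6f}")
print(f"1-SD range: {mean - sd:.7f} to {mean + sd:.7f}")
print(f"3-SD range: {mean - 3 * sd:.7f} to {mean + 3 * sd:.7f}")
```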
Well, immediately after we set the seeds, we create our uninitialised model for training. One of the random number generators -- Python's, PyTorch's, or one of the CUDA ones -- will be used to generate the initial weights that we're going to start training. That means that for the same model setup, we'll always start with exactly the same weights. But if the model settings change such that we initialise different things in a different order, then we'll have different weights.

After we've done that, we go into the training loop. That can have randomness in it; although the AdamW optimiser itself is deterministic, we are (in all but one of these training runs) using dropout, which drops a random bunch of activations at various points -- 10% of them with our config. And it seems entirely possible that each of the interventions could change the order of execution of different steps in non-obvious ways, which would lead to dropout being applied in different ways in different runs.

So, the question was: what kinds of randomness -- in terms of the initial weights, or in terms of the training run -- did each intervention potentially change vs the baseline? Disregarding the full-fat float32 run:

- Gradient clipping: randomness only affected the training run -- the weights it started with would have been exactly the same as the baseline model's.
- Removing dropout: although this is a parameter on the model, I don't think it changes the initial weights. But in the training run, it certainly does affect randomness, by removing its use of the random number generator.
- Adding bias to the attention weights: this will change both the initial weights -- because we have those bias weights, things will be initialised differently -- and, as a result, the training run, as the random number generator will have been sampled a different number of times prior to the run.
- Changing and scheduling the learning rate certainly should not change the initial weights, but it might conceivably have a non-obvious effect on training.
- Likewise weight decay: no effect I can see on the initial weights, but it could well change training dynamics.
- Weight-tying: when I added it to the code, I tried to do so in such a way that the other weights would be unaffected -- I created exactly the same weights as I would without weight tying, then threw away the output head and replaced it with a reference to the input embedding weights. So I think that in theory, this one won't have changed the other model weights (apart from ignoring the initialised-but-thrown-away output head), but it could well have changed the training run.

Given that, I wanted to get two measures of how sensitive to noise each phase of the training run was: the initialisation of weights at the start, and the training run itself. I decided to start by nailing down exactly what the training run started with. We already had a baseline training run with a specific state of the random number generator at the start; in our "real" baseline, we seeded with 42 at the start, and then initialised our weights. After that, the random number generator would have reached some specific state based on its initial seed and how many numbers had been generated so far.

Now, in theory, we could get the RNG into that specific state by seeding it with some number A at that point. We don't know what A is, of course. But it seems vanishingly unlikely that it would be something we'd come up with -- specifically, we can be pretty sure that A ≠ 23 and A ≠ 67. So, I put the old initial seed of 42 back in, but re-seeded after the model had been initialised (a sketch of the pattern is below).

Firstly, with a re-seed value of 23: I let that run, and got this model; the normal evals looked fine. Next, I did another training run, the same as the previous one, but with 67 instead of 23 for the re-seed. That one ran too, producing this model, which eval'ed much the same [3]. Let's bring those together:

- Our normal baseline: weights initialised with seed 42, and training run starts with a "seed" of our imaginary A value from above: 3.691526
- The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356
- The second run above: weights initialised with seed 42, and training run starts with a seed of 67: 3.680505

That's a mean of ~3.684462, with a variance of ~0.0000752 and a standard deviation of ~0.008672. Those are tiny compared to the numbers from the two trains we did with the change of the seed prior to the model initialisation. That actually surprised me a bit; we're using dropout in all of these training runs, and it's dropping a random 10% of activations in every forward training pass. With our different training-run starting seeds, they should be getting very different dropout patterns. Hand-wavingly, perhaps over the three million or so sequences we're training on, it averages out? Still a little counterintuitive, though.

Anyway, let's take a look at the intervention results again, this time highlighting the ones that we believe will be starting with the same weights. Using the "99.7% should be within three SDs" heuristic, we get a range of 3.658446 - 3.710478. Of the intervention runs with (I believe) stable weights, only the no-AMP and the gradient clipping ones are within that range. That made me feel quite positive.
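The overall shape of the two-seed experiment (pin one seed, vary the other) looks roughly like the sketch below, with a trivial stand-in model in place of the post's actual GPT-style training code:

```python
import random
import torch
from torch import nn

def seed_everything(seed: int) -> None:
    # Seed Python's, PyTorch's, and CUDA's PRNGs, as earlier.
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def run_experiment(weights_seed: int, training_seed: int | None) -> nn.Module:
    seed_everything(weights_seed)
    model = nn.Linear(8, 8)  # stand-in: weight initialisation consumes the RNG here
    if training_seed is not None:
        # Re-seeding here pins the training-loop randomness (dropout etc.),
        # regardless of how many numbers the weight init consumed.
        seed_everything(training_seed)
    # ...the real training loop would run here, drawing on the RNG for dropout...
    return model

# Same weights, varying training randomness: (42, None), (42, 23), (42, 67).
# Varying weights, pinned training randomness: runs of the form (new_seed, 23).
baseline = run_experiment(42, None)
reseeded = run_experiment(42, 23)
```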
If my beliefs are correct about which runs have the same weights, then noise in the training runs seems unlikely to be causing the differences -- that is, perhaps the results from the interventions for those same-weight training runs are real signal and not just noise.

What would happen if, instead of pinning the seed for generating the weights and varying the starting seed for the training run, we varied the weight seed and pinned the training one? We'd already done a training run with a seed of 42 before generating the weights and a re-seed to 23 after that. So I decided to see what would happen if I varied the pre-weights initialisation seed, keeping the re-seed of 23 for the training run. I let the first one train, getting this model, and ran the evals. Next, one with 67 as the weights initialisation seed: that trained too, getting this model, and its evals looked similar [4]. OK, so here we have:

- Mean: ~3.673215
- Variance: ~0.000145
- SD: ~0.012062

Compared to the SD we got when we varied just the initial seed, 0.0154919, it's not too far off. Using the 3-SD rule, we get a range of 3.637030 - 3.709400, and looking at the table again, this time with the ones that we don't expect to have the same weights highlighted, we can see that the QKV bias is well within that range (as are all of the interventions apart from the two negative-effect ones and scheduling the learning rate).

Right, what does all of that tell us? This post obviously isn't even trying to be statistically rigorous. The number of training runs I've done and the amount of data is way too small for that. However, training runs are expensive (Lambda have raised their prices again, so these cost more than US$50 each!), so there's a limit to how much I can do. But even with the limited amount of data, something seems pretty clear:

- Varying the random seed at the start, prior to initialising weights, and not constraining the starting point for the training runs, gave a mean of 3.672857, with an SD of 0.0154919.
- Keeping the same seed for the model weights (so that they all started with the same weights), and varying the seed for the training run, gave a mean of 3.684462, with an SD of 0.008672.
- Varying the seed for the model weights (so that they all started with different weights), and keeping the training run seed pinned, gave a mean of 3.673215 and an SD of 0.012062.

"One of these things is not like the others." Keeping the model weights stable and only allowing variation in randomness across the training run itself meant that almost all of the differences between training runs disappeared. Could this be a result of the small number of samples? I guess conceivably it might, but it seems vanishingly unlikely. So I feel reasonably confident in saying that the bulk of the variation in results that we can chalk up to random noise in these training runs comes from variations in the model weights' initialisation.

Additionally, the first training run in this post -- the re-run of the baseline model with no changes -- gave exactly the same numbers as the original baseline run. So we can be confident that all of the models with no changes to the weight initialisation started with the same weights. Of course, I could be wrong about which models really did have the same weights, but given that they were running the same code with the same seed, I'm pretty much sure. That makes me fairly confident that the intervention runs that had the same initial weights gave a real signal about whether or not the intervention in question actually helped. The only exception is gradient clipping, which fell within the three-SD range for the same-weights tests -- but it's essentially free, adding just 100 seconds to a three-hour training run.

That's a really interesting result! As I said earlier, given that dropout is making us ignore a random 10% of activations during the training run, I would have thought that changing which random 10% were being ignored would have a much larger effect. And that's not even considering other sources of random noise in the training run. I was less surprised that model weight initialisation was important, though.
It's pretty obvious that your starting position in the loss landscape is going to affect where you end up at the end of the training run. Still, we now have a reasonable level of trust that our interventions gave a real signal, so I think we have everything in place to see how they stack together, and do a best-effort training run. Can we approach the original GPT-2 small weights' performance on our test set loss? It should be fun to find out :-)

[1] Numbers chosen based on a misremembering of this XKCD. For some reason (perhaps because it rhymes) I thought that the old-timey funny number thing was "22 skidoo" rather than "23 skidoo". ↩

[2] On working through this later: with n samples from a dataset, it is (as I understand it) best to use n − 1 as the denominator here (Bessel's correction) for the "sample variance". If we had every possible value, then it would be correct to use n. However, while this changes a few details in the analysis, I don't think it changes the final conclusion of the post meaningfully (it would just bump up the SDs by 22% or so), so I've left it as-is. ↩

[3] I found it interesting that this model does the "you and I" hypercorrection that so many people do when trying to write formally! Based on the (correct) correction of "me and you move back home" to "you and I move back home", I think as a result of excessive pattern-matching. ↩

[4] Another grammatical error based on pattern-matching -- it would make sense that the possessive form of "it" in English was "it's", just like the possessive form of "John" is "John's". ↩

0 views