Posts in AI (20 found)

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS. Youmin Chen, Jiwu Shu, Yanyan Shen, Linpeng Huang, and Hong Mei. SOSP'25.

I love this paper, because it grinds one of my axes: efficient pipeline parallelism on general-purpose CPUs. In many hardware designs, pipeline parallelism is the dominant form of parallelism, whereas data parallelism takes the cake on CPUs and GPUs. It has always seemed to me that there are applications where pipeline parallelism should be great on multi-core CPUs, and here is an example.

Fig. 1 illustrates the design space for key-value stores. Source: https://dl.acm.org/doi/10.1145/3731569.3764794

One axis is preemptive vs. non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how work is assigned to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides those stages among a set of threads. The work discussed here uses a non-preemptive, thread-per-stage architecture.

A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. Here are two examples. First, as we've seen in other networking papers, a well-designed system can leverage DDIO to let the NIC write network packets into the LLC, where they are then consumed by software. Second, key-value stores frequently have hot tuples, and there are advantages to caching these (example here).
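To make the thread-per-stage idea concrete, here is a minimal toy sketch (illustrative only, not the paper's implementation; all names are hypothetical): two stages, each running on its own thread, passing requests between stages through queues, so that each stage only ever touches its own small working set.

```python
# Toy thread-per-stage pipeline (illustrative, NOT the paper's code):
# stage 1 parses requests, stage 2 executes lookups; each stage runs on its
# own thread, which is what keeps the per-stage cache footprint small.
import queue
import threading

store = {"k1": "v1", "k2": "v2"}          # toy key-value data
parse_q, exec_q, done_q = queue.Queue(), queue.Queue(), queue.Queue()

def parse_stage():
    while True:
        raw = parse_q.get()
        if raw is None:                    # shutdown sentinel
            exec_q.put(None)               # propagate shutdown downstream
            return
        exec_q.put(raw.strip().lower())    # "parse" the request

def exec_stage():
    while True:
        key = exec_q.get()
        if key is None:
            return
        done_q.put((key, store.get(key)))  # execute the lookup

threads = [threading.Thread(target=parse_stage),
           threading.Thread(target=exec_stage)]
for t in threads:
    t.start()
for raw in ["K1 ", " k2", "k3"]:
    parse_q.put(raw)
parse_q.put(None)                          # drain and stop both stages
for t in threads:
    t.join()

results = dict(done_q.get() for _ in range(3))
assert results == {"k1": "v1", "k2": "v2", "k3": None}
```

A real TPS system would pin each stage thread to a core and use lock-free queues rather than `queue.Queue`, but the data flow is the same shape.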
It is hard to effectively cache data in a TPR/TPQ model, because each request runs the entire key-value store request code path. For example, a CPU core may have enough cache capacity to hold network buffers or hot tuples, but not both.

The key disadvantage of a TPS architecture is load balancing: one stage can become the bottleneck, leaving CPU cores idle. The authors propose dynamic reconfiguration of the pipeline based on workload changes. Another challenge with pipelining is implementing efficient communication between cores, because the data associated with each request flows down the pipeline with the request itself.

Fig. 3 shows the pipeline proposed in this paper. Source: https://dl.acm.org/doi/10.1145/3731569.3764794

The NIC writes request packets into the network buffer (stored in the LLC). The cache-resident layer reads data from this buffer and handles requests involving commonly used keys by accessing the hot index and hot data caches (also in the LLC). The memory-resident layer handles cold keys and values, which are stored in DRAM. One set of threads (pinned to CPU cores) implements the cache-resident layer, and a different set of threads (pinned to other CPU cores) implements the memory-resident layer. An auto-tuner continually monitors the system and adjusts the number of threads assigned to each layer. Section 3.5 describes the synchronization required to implement this adjustment.

The NIC writes request packets into a single queue, and the cache-resident threads cooperatively read requests from it, with each thread in the pool claiming a disjoint subset of the requests. Next, threads check whether the key associated with a request is hot (and thus cached in the LLC). Time is divided into epochs. During a given epoch, the set of cached items does not change, which enables fast lookups without costly synchronization between threads.
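The epoch-based lookup can be sketched as readers consulting an immutable snapshot of the hot set, which a background thread atomically replaces at epoch boundaries. This is a minimal illustration of that idea under a snapshot-swap assumption; the class and method names are hypothetical, not the paper's API.

```python
# Sketch of epoch-based hot-set publication (illustrative, not the paper's
# code): readers do lock-free lookups against an immutable snapshot, and a
# background thread atomically swaps in the next epoch's snapshot.
import threading

class EpochHotCache:
    def __init__(self, hot_items: dict):
        # The current snapshot is an immutable-by-convention dict; replacing
        # the attribute is a single atomic reference store in CPython.
        self._snapshot = dict(hot_items)
        self._lock = threading.Lock()  # serializes writers only

    def lookup(self, key):
        # Readers grab the current snapshot once; it never mutates after
        # publication, so no reader-side synchronization is needed.
        snap = self._snapshot
        return snap.get(key)

    def advance_epoch(self, next_hot_items: dict):
        # Background thread installs the next epoch's hot set atomically.
        with self._lock:
            self._snapshot = dict(next_hot_items)

cache = EpochHotCache({"k1": "v1"})
assert cache.lookup("k1") == "v1"
cache.advance_epoch({"k2": "v2"})          # epoch boundary: hot set changes
assert cache.lookup("k1") is None and cache.lookup("k2") == "v2"
```

The point of the pattern is that lookups on the hot path pay no synchronization cost; all coordination is pushed onto the infrequent epoch switch.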
A background thread gathers statistics to determine the set of items to be cached in the next epoch, and it can atomically switch to the next epoch when the time comes. The number of hot keys is kept small enough that they are highly likely to stay resident in the LLC. Requests that miss in the cache-resident layer are passed on to the memory-resident layer for further processing (via the CR-MR queue).

Typically, the LLC is treated as a global resource shared by all cores, but this particular use case requires that most of the LLC be dedicated to the cache-resident layer. This is accomplished with the help of the PQOS utility from Intel, which uses "Intel(R) Resource Director Technology" to control which ways of the LLC are assigned to each layer.

The memory-resident layer operates on batches of requests. Because the requests are not hot, it is highly likely that each request will require DRAM accesses for index lookups (keys) and data lookups (values). Software prefetching is used to hide DRAM latency during index lookups. When servicing requests, data values are copied directly into the outgoing network buffer.

The CR-MR queue is used to communicate between the two layers. Each (CR thread, MR thread) pair has a dedicated lock-free queue. Enqueue operations use a round-robin policy to pick the destination MR thread, and dequeue operations must potentially scan the queues corresponding to all possible senders. Multiple requests can be stored per message, to amortize control overhead.

Fig. 7 has throughput results for synthetic workloads (A, B, and C have different ratios of put/get operations); uTPS-T is this work. Source: https://dl.acm.org/doi/10.1145/3731569.3764794

Dangling Pointers

The pipelining here is coarse-grained, and the design is only optimized for the LLC. I wonder if a more fine-grained pipeline would allow hot data to be stored in L2 caches.
For example, the set of hot keys could be sharded among N cores, with each core holding a different shard in its L2 cache. It seems redundant that this design requires software to determine the set of hot keys, when the hardware cache circuitry already has support to do something like this.
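The L2-sharding speculation above amounts to hash-partitioning the hot key set so each core's shard is small enough to stay resident in its private L2. A toy sketch (purely illustrative, not from the paper; the hash and names are made up):

```python
# Sketch of the L2-sharding idea: hash-partition the hot key set among N
# cores so each core's shard fits in its private L2 cache. (Purely
# illustrative; not the paper's design.)

def owner_core(key: str, n_cores: int) -> int:
    # A stable hash keeps each key pinned to one core across requests, so
    # that core's L2 naturally retains its shard of the hot set. (Python's
    # built-in hash() is salted per process, so we use a toy stable hash.)
    return sum(key.encode()) % n_cores

hot_keys = [f"user:{i}" for i in range(16)]
shards = {c: [k for k in hot_keys if owner_core(k, 4) == c] for c in range(4)}
# Every hot key is owned by exactly one core.
assert sorted(k for ks in shards.values() for k in ks) == sorted(hot_keys)
```

Routing a request to the owning core is exactly the kind of inter-core communication cost the paper's coarse-grained design avoids, which may be why the authors stopped at the LLC.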


Has the cost of building software just dropped 90%?

Agentic coding tools are dramatically reducing software development costs. Here's why 2026 is going to catch a lot of people off guard.

Higashi 3 days ago

AI should only run as fast as we can catch up

Recently I spoke with two of my friends, both of whom have been having fun playing with AI. Last month, I met with Eric, a fearless PM at a medium-size startup who recently got into vibe coding with Gemini. After getting familiar with Gemini, Eric was genuinely amazed by how quickly AI turns a prompt into a playable web application. It served a great purpose as a first prototype for communicating ideas to designers and engineers. But Eric really wanted to skip those steps and ship it directly to prod. What he couldn't really see was that Gemini had actually built a single-page HTML file that merely looks like a working app. Sadly, one cannot build a reliable enterprise product out of that. And there is really no effective way for Eric to catch up on these technical details and outpace the engineering team himself. Last week, I had coffee with Daniel, a senior staff engineer who recently grew fond of AI coding and found it to be a true force multiplier. Daniel was skeptical of AI at first, but he hasn't written a single line of code in months. What he does is precisely prompt the AI to create new components in an existing framework (involving Kafka, Postgres, AuthN/Z, and k8s infra) while adhering to certain preexisting paradigms. He spot-checks the correctness of the AI's work and quickly spins up local deployments to verify it's indeed working. Then he pushes the changes through the code review process and lands those features. All without writing a single line of code, and it's production-ready just as if he had written it himself. To Daniel, building and shipping things fast and at scale is simpler than ever. After speaking with Eric and Daniel, I suddenly felt that there is an overarching theme around the use of AI that we can extrapolate from these stories. And after pondering for a weekend, I think I can attempt to describe it now: it's the problem of reliable engineering - how can we make AI work reliably.
With the AI superpower, one can task it with doing all kinds of crazy things on the internet by typing just a few lines of prompt. AI thinks and learns faster than us; this is undeniable now. However, to make AI's work actually useful (not only working, but reliable and trustworthy), we also need to catch up with what the AI does as quickly as possible. It's almost like we need to send the AI off to learn and think as fast as possible, but we also need to catch up as soon as possible to make it all relevant. And the speed at which we catch up is critical to whether AI can help us do these tasks effectively. In Daniel's case, he can spot-check and basically just skim through the AI's work and know for sure it's doing the right thing, with a few simple test steps to verify, hence his results are more reliable. Whereas Eric would basically need to learn software development from the bottom up to comprehend what the AI has done, and that really doesn't give him the edge to outpace engineering teams and ship features reliably by himself. To generalize the problem: I think all the tasks we do can be broken down into two parts, learning/creation and verification - basically doing the task, and checking whether the task was done right. Interestingly, this gives us a good perspective on our relationship with AI when performing such tasks. Effort-wise, if verification « learning/creation, one can very effectively check the AI's work and be confident about its reliability. If verification ~= learning/creation, one spends an equal amount of time checking the AI's work. It's not a big win; maybe AI becomes a good automation script that cuts down some boilerplate. If verification » learning/creation, one cannot be sure about the AI's work that easily, and we are in vibe-land. A very good example of the first category is image (and video) generation. Drawing/rendering a realistic-looking image is an insanely hard task. Have you ever tried to make a slide look nicer?
It would take me literally hours to center the text boxes to make it look "good". However, you really just need to take one look at the output of Nano Banana and you can tell whether it's a good render or a bad one based on how you feel. The verification is literally instantaneous and effortless, because it's all encoded as feeling, or vibes, in your brain. "Does this look right?" can probably be answered in the span of milliseconds by your visual cortex. There is also no special knowledge required - human beings have been evaluating visual images since birth; it's hardwired into our instincts. This significant cost asymmetry goes a long way toward explaining why AI image generation exploded. If we look for similar scenarios, we can probably identify other "killer" use cases of AI as well. However, if we go down to the bottom of the spectrum, where verification becomes more intense - requiring domain knowledge, technical expertise, and industry know-how to tell whether the AI is producing slop - we enter a dark age of piling up verification debt. More things are being created, but we lag behind in checking whether any of it actually works to our satisfaction. If an organization keeps vibe-coding without catching up on verification, those tasks can quickly end up as "debts" that need to be verified. When verification becomes the bottleneck, dangerous things can happen if we still want to move fast - we risk running unverified code and incurring unexpected side effects that are yet to be validated. This applies to other fields too - imagine asking AI to craft a new vaccine and not wanting to wait for the FDA before using it. I've come across a few blog posts that talk about verification debt already. I think it's genuinely a good problem for technical leaders to keep in mind in this era. AI can only reliably run as fast as we can check its work. It's almost like a complexity-theory claim.
But I believe it needs to be the case, to ensure we can harvest the exponential warp speed of AI while remaining robust and competent: these technologies ultimately serve human beings, and we humans need technology to be reliable and accountable, as we are already flaky enough ;) This brings up the topic of Verification Engineering. I believe this can be the big thing after Context Engineering (which was the big thing after Prompt Engineering). By cleverly rearranging tasks and using nice abstractions and frameworks, we can make verifying AI-performed tasks easier and use AI to ship more solid products to the world. No more slop. I believe whoever figures out ways to effectively verify more complex tasks using human brains can gain the most benefit from the AI boom. I can think of a few ideas to kick off verification engineering: Maybe we need to discard traditional programming languages and start programming in abstract graph-like dataflow representations, where one can easily tell whether a thing is done right or wrong regardless of its language or implementation details. Maybe our future is like the one depicted in Severance - we look at computer screens with wiggly numbers, and whatever "feels right" is the right thing to do. We can harvest these effortless, low-latency "feelings" that nature gives us to make AI do more powerful work. How to craft more technically precise prompts that guide AI to do things surgically, rather than vibing it. How to train more capable technical stakeholders who can effectively verify and approve what AI has done. How to find more tasks that are relatively easy to verify but rather hard to create. How to push the theoretical boundaries of what we can succinctly verify (complexity theory strikes again).

Ivan Sagalaev 4 days ago

Have you accepted AI yet?

Is this platform still massively against AI or has it moved more towards acceptance? — Armin Ronacher

If you leave a bad argument from a prominent person unchallenged, it starts looking like accepted common wisdom. Armin Ronacher has enough clout, and his stated position sounds wrong enough to me, that I can't just let it slide. Here's my argument, for those who still value what I have to say. I'm not directing this at Armin himself, as the tone of his first message should tell you everything about his readiness to have his position challenged. Despite the "have the peons already accepted their fate" tone, some people did try to have an earnest discussion. You can read the whole thread, but here are a few of Armin's replies that caught my eye:

Armin, any chance I can convince you to use the term "LLMs" instead of "AI" when you want to talk about LLMs? Or maybe "generative AI" if you think LLM is not flashy enough? AI is an umbrella term that covers a lot of things, some good, some bad.

@[email protected] I don’t think it really matters. LLM are a subset of AI. I’m not convinced why being more precise here would matter?

Several people did explain why language matters, yet Armin insisted:

LLMs can't be generalized as AI

@[email protected] LLMs are part of AI, even if you disagree with them for some reason.

I'm sure that, as a programmer, Armin understands that "LLMs are part of AI" does not imply they can be generalized as AI. He just wants to paint the entire argument as coming from "disagreement".

@[email protected] There are specific concerns and there are abstract fears . It's impossible to work with the latter, it's possible to do a lot with the former. As an example I have a lot of concern about how society is going to deal with AI and that's also something that I'm trying to understand and work in the right way with. But that is a lot more nuance and complex than a policing the use of the word AI which does very little to navigate those complexities.
(Emphasis mine.) This was the theme throughout the thread: people were arguing that the current LLM situation is its own separate topic, while Armin kept dismissing it as "word policing", "abstract fears" and "watering down the discourse". He never substantiated how he's "trying to understand and work in the right way with" societal problems. The people might have a point here. See, nobody was "massively against" AI when it was called ML and used for image recognition and translating text. Like any technology, it was also abused and heavily criticized by ethicists. But it wasn't until OpenAI launched its polished product that kick-started the next renaissance insane bubble we're in now that it started affecting many more people. The adverse effects are widespread and profound: informational pollution, further surrender of privacy, "vibed" code nobody knows how to fix, untold wealth produced for surveillance kings, genocidal sociopaths and fascism enablers, environmental harm and, of course, a giant financial bubble, to name a random few. None of these are abstract; they are measurably, and sometimes painfully, affecting people right now. So it's hardly a surprise they want to talk about that instead of discussing quantization and context length. You can ignore politics only for so long… What gets me personally very easily irritated is the tone of inevitability. Being concerned is bad for your health, so you'd better accept the inevitable reality and only think within a comfortable, fenced area. Fuck that. But to be fair, Armin does say he has "a lot of concern about how society is going to deal with AI", so maybe he's just looking for a place where he could talk specifically about technology? That should be allowed, right? Well, Simon Willison, for one, does just that, without any backlash, and I'm sure there are others.

Allen Pike 4 days ago

Why is ChatGPT for Mac So… Bad?

Last week I wrote an exploration of Ben Thompson’s recent question, “Why is the ChatGPT Mac app so good?” A lot of people on the internet, it turns out, do not agree with this premise! Many folks have been having problems with ⌘C not copying text. Hacker News sees the app as “not good at all”, to the point that my post about it being better than the alternatives was flagged off the site. X doesn’t like it either. Beyond the bugs I mentioned in last week’s post, I’ve recently been plagued with a ChatGPT Mac bug of my own, where every time I start a new chat, it pre-fills the text field with the first input I used the last time I started a new chat on Mac. All of this led me to an informative post by one of OpenAI’s Mac developers, Stephan Casas: nearly everyone who works on the ChatGPT macOS app has been stretched thin, and hard at work building Atlas. i’m thankful that our users appreciate our decision to develop a native app just as much as i’m thankful for the heightened expectations they hold because we did so Apparently he merged a fix this week for the copy-paste bug that has been plaguing many folks, which is promising. Something implied in last week’s article that’s worth saying explicitly: although many good Mac apps are native, being native is neither necessary nor sufficient for being a great app. While OpenAI is investing more in desktop apps than any other model lab, they have much to do before they can transcend “better than the alternatives” and achieve “great.”

Stratechery 5 days ago

2025.49: Conflicts, Consternation, and Code Red

Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we’re sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone . Additionally, you have complete control over what we send to you. If you don’t want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings . On that note, here were a few of our favorites this week. This week’s Stratechery video is on Robotaxis and Suburbia . What the Times Missed in Its David Sacks Story. On Sharp Text this week, I wrote about the commotion that ensued in tech and media after the New York Times profiled Trump Crypto and AI Czar, David Sacks, including an OpenAI-style outpouring of Sacks support, why the piece failed on its own terms, and an entirely different story that went unexplored. While the Times  focused on the private interests that may benefit under Sacks’ watch, there are better questions about the public’s interest in leaning on someone like Sacks , and why the government might need Silicon Valley expertise as it confronts a variety of tech questions that have enormous implications for the future of the Western world.  — Andrew Sharp Atlassian’s History and the Near Future.  My favorite part of every Stratechery Interview is Ben’s “how did you get here?” question to first-time interview guests, and  this week’s interview with Atlassian CEO Mike Cannon-Brookes  is a terrific entry in the series. Come for the story of how a Qantas Frequent Flyer program eventually led to a $40 billion software business in Sydney, and stay for Cannon-Brookes on how his company is adapting to the AI era, as well as his take on “correct, but chronologically challenged” snake oil salesmen. 
Finally, as a rabid F1 fan, I’d be remiss if I didn’t recommend the end, where Cannon-Brookes expounds on Atlassian’s role sponsoring and helping to transform the once-moribund Williams team (a story that can also be marketed to enterprises the world over). — AS

Code Red at OpenAI. I have, for three years now — i.e. ever since ChatGPT took the world by storm in November 2022 — been convinced that we were witnessing the birth of the next great consumer tech company. Today, however, there are very legitimate reasons to be concerned that OpenAI is going to eventually succumb to the Google behemoth, just as Yahoo, Microsoft, Blackberry, and countless others have; I still want to believe that OpenAI can be an Aggregator, but they don’t have the business model to match, and that may be fatal. I summarized all of these feelings in this week’s episode of Sharp Tech, which covered both this week’s Article about OpenAI and Nvidia angst, and Tuesday’s Update about the bear case for OpenAI. — Ben Thompson

Google, Nvidia, and OpenAI — OpenAI and Nvidia are both under threat from Google; I like OpenAI’s chances best, but they need an advertising model to beat Google as an Aggregator.
OpenAI Code Red, AWS and Google Cloud Networking — OpenAI is declaring code red and doubling down on ChatGPT, highlighting the company’s bear case. Then, AWS makes it easier to run AI workloads on other clouds.
AWS re:Invent, Agents for AWS, Nova Forge — AWS re:Invent sought to present AI solutions in the spirit of AWS’ original impact on startups; the real targets may be the startups from that era, not the current one.
An Interview with Atlassian CEO Mike Cannon-Brookes About Atlassian and AI — An interview with Atlassian founder and CEO Mike Cannon-Brookes about building Atlassian and why he is optimistic about AI.
The Forest the New York Times Missed Among the David Sacks Trees — The New York Times failed to support its David Sacks headline, and ignored better questions about how the U.S. devises modern tech policy.
Google Looms
Alan Dye Leaves Apple
Let’s Break Down the 45nm Process Node
A Quiet Chinese Mobile Giant in Africa
Trump, Takaichi and a Game of Telephone; Japan Jawboning Continues; An Internet Governance Study Session; China Making Trade ‘Impossible’
Wolves and Cavs Concerns, The NBA Cup in Year 3, Questions on the Magic, Suns, Thunder and Raptors
The Game of the Week, A Giannis Inc. Emergency Board Meeting, Chris Paul Gets Cut at 2 a.m. in Atlanta
OpenAI Declares a ‘Code Red,’ Alan Dye Leaves Apple for Meta, Questions on Trainium 3, Substack, and F1

annie's blog 5 days ago

Fish bowl

Our very brains, our human nature, our desire for comfort, our habits, our social structures, all of it, pushes us into being fish bowl swimmers. Tiny people moving in tiny circles. Staying in the circumscribed ruts of our comfort. Ignoring a whole big world of what's different and new and interesting just beyond. That's the problem: stuff out there might be new, and interesting, but it's also different. The newness — which is really not new, at all, it's just new to us, so — the differentness, of another mindset or culture, language or belief system, method or opinion or morality or lifestyle, sends our inward threat-o-meter into overdrive. We interpret new and different as scary and difficult , because in terms of our emotions and our mental somersaulting, it is. We don't know how to act. We don't know how to evaluate. We don't know what is safe. We don't know where we fit in. We don't know how our safe, comfortable fish bowl living is affected by this new, different, expanded puddle. Sameness makes us comfortable. And comfort is the height, the very pinnacle, the crowning achievement in our pursuit of happiness. What I mean is that we've mistaken comfort for happiness. All the ways we could pursue happiness, all the freedom and technology and abilities we have to pursue meaning and joy and interaction and challenge and exploration and improvement and aliveness … All of that, at our fingertips, and being comfortable tends to top the list of what we actually want, what we're willing to put effort towards. This seems pathetic. It is pathetic. But also: We're working hard all the time in ways we often don't acknowledge. We have infinite options but finite agency. We have endless information access and very little processing power. We get fucking worn out. It's a lot of work to make a string of decent choices for 10 or 12 hours at a time. 
It's a lot of effort, some days (most days), to do what is required of us to feel like decent human beings, and the idea of putting in more effort, expending more energy, is exhausting. So we value comfort highly. We're tired. We're exhausted by constant inputs, invisible demands, and the burden of infinite options. Of course we don't leap out of our comfort zones when the opportunity arises: we've already been out of it for so long, on high alert. Our brains are efficiency machines. By valuing comfort so highly, and by equating comfort with sameness, we have programmed our brains to ignore the unfamiliar. Ever wondered why you can feel bored when you have constant stimulation? This is why. We carefully allocate our energy to the highest priorities. Things that aren't familiar don't help. So we ignore them. Of course, we can't always ignore stuff that is different. Sometimes it is right there, glaringly obvious, annoyingly immune to our discomfort, and we are forced to see it, acknowledge it, encounter it, at least mentally. But don't worry! We have defenses! Oh baby, do we have defenses. If we can't keep these alien objects from encroaching upon our consciousness, we can, at least, quickly evaluate the threat they pose and deal with them appropriately. Threat is precisely how we see things that are different. Comfort is bolstered, even built, by the familiar. All things unfamiliar are threats to our comfort. So we're quick to see other groups, philosophies, lifestyles, belief systems, family structures, choices, etc., as weird and wrong. We want to believe they are wrong, because we want to believe that pursuing our own comfort is right. We want to believe we have our priorities in check. Our very desire for comfort creeps into our logical reasoning, so deeply does the desire go. 
So insidiously does it carry out its programmed mission: to keep us from being uncomfortable, our brains will subvert objectivity and keep us from seeing the fallacies in our own thinking, keep us from recognizing that we are, at heart, selfish and misguided creatures whose greatest delight is sitting around and feeling pretty good about ourselves. If needed, then, we will happily sacrifice the validity and value of every thing, person, or choice that is different from what we know and define as normal. We will, for the sake of our own rightness, define all different things as wrong. We don't even hesitate. Hesitation is a sign that you might be starting to see the truth of your own motivation. If you start hesitating before defining, before casting judgment, before categorizing and labeling, look out: your comfort is at stake. Your brain is scurrying, be sure of it, to come up with great reasons for you to resist this awful urge to be fair. Fair. Fair? Fair! Fair has no place in the pursuit of comfort. Equality is not a factor here. If we value all people equally, we must admit that our own comfort is not the highest priority. We must admit that others, too, have valid needs, valid ideas, that the fact of their differentness is not adequate reason for us to deny them the same respect and autonomy we demand for ourselves. We can't have that. That sort of thinking gets us in trouble. That sort of thinking demolishes the layer upon layer of defensive triggers and traps that we have laid, so carefully, over the entire course of our lives. We are aware, so very aware, of how it could all fall apart. We know the reasons are thin. We know, deep down, the very idea of a fish bowl is absurd. We live in an ocean, and it's big, and it's full of creatures, and we're terrified. We want to believe we can limit what is around us. We want a fish bowl so we can feel like the biggest fish in it. It is the only way we know to feel safe. 
But there is another way: to see, first, that the fish bowl is an illusion of our own making, with imaginary walls upheld by discriminatory defense systems. If we can begin to see that the walls are not even real, we can see a way out. Maybe we can stop putting so much work into keeping them in place. It's scary. It is being alive. The threat only exists when we think we have something of our own, something utterly more important than all else, to protect and defend. But we don't. We are swimming in this together, all of us. There is no safer ocean, only this one.


Premium: The Ways The AI Bubble Might Burst

[Editor's Note: this piece previously said "Blackstone" instead of "Blackrock," which has now been fixed.] I've been struggling to think about what to write this week, if only because I've written so much recently and because, if I'm honest, things aren't really making a lot of sense. NVIDIA claims to have shipped six million Blackwell GPUs in the last four quarters — as I went into in my last premium piece — working out to somewhere between 10GW and 12GW of power (based on the power draw of B100 and B200 GPUs and GB200 and GB300 racks), which...does not make sense based on the amount of actual data center capacity brought online. Similarly, Anthropic claims to be approaching $10 billion in annualized revenue — so around $833 million in a month — which would make it competitive with OpenAI's projected $13 billion in revenue, though I should add that based on my reporting extrapolating OpenAI's revenues from Microsoft's revenue share , I estimate the company will miss that projection by several billion dollars, especially now that Google's Gemini 3 launch has put OpenAI on a " Code Red, " shortly after an internal memo revealed that Gemini 3 could “create some temporary economic headwinds for [OpenAI]." Which leads me to another question: why? Gemini 3 is "better," in the same way that every single new AI model is some indeterminate level of "better." Nano Banana Pro is, to Simon Willison, " the best available image generation model. " But I can't find a clear, definitive answer as to why A) this is "so much better," B) why everybody is freaking out about Gemini 3, and C) why this would have created "headwinds" for OpenAI, headwinds so severe that it has had to rush out a model called Garlic "as soon as possible" according to The Information : Right, sure, cool, another model. Again, why is Gemini 3 so much better and making OpenAI worried about "economic headwinds"? 
Could this simply be a convenient excuse to cover over, as Alex Heath reported a few weeks ago , ChatGPT's slowing download and usage growth ? Experts I've talked to arrived at two conclusions: I don't know about garlic or shallotpeat or whatever , but one has to wonder at some point what it is that OpenAI is doing all day : So, OpenAI's big plan is to improve ChatGPT , make the image generation better , make people like the models better , improve rankings , make it faster, and make it answer more stuff. I think it's fair to ask: what the fuck has OpenAI been doing this whole time if it isn't "make the model better" and "make people like ChatGPT more"? I guess the company shoved Sora 2 out the door — which is already off the top 30 free Android apps in the US and at 17 on the US free iPhone apps rankings as of writing this sentence after everybody freaked out about it hitting number one . All that attention, and for what? Indeed, signs seem to be pointing towards reduced demand for these services. As The Information reported a few days ago ... Microsoft, of course, disputed this, and said... Well, I don't think Microsoft has any problems selling compute to OpenAI — which paid it $8.67 billion just for inference between January and September — as I doubt there is any "sales team" having to sell compute to OpenAI. But I also want to be clear that Microsoft added a word: "aggregate." The Information never used that word, and indeed nobody seems to have bothered to ask what "aggregate" means. I do, however, know that Microsoft has had trouble selling stuff. As I reported a few months ago, in August 2025 Redmond only had 8 million active paying licenses for Microsoft 365 Copilot out of the more-than-440 million people paying for Microsoft 365 . In fact, here's a rundown of how well AI is going for Microsoft: Yet things are getting weird. Remember that OpenAI-NVIDIA deal? 
The supposedly "sealed" one where NVIDIA would invest $100 billion in OpenAI , with each tranche of $10 billion gated behind a gigawatt of compute? The one that never really seemed to have any fundament to it, but people reported as closed anyway? Well, per NVIDIA's most-recent 10-Q (emphasis mine): A letter of intent "with an opportunity" means jack diddly squat. My evidence? NVIDIA's follow-up mention of its investment in Anthropic: This deal, as ever, was reported as effectively done , with NVIDIA investing $10 billion and Microsoft $5 billion, saying the word "will" as if the money had been wired, despite the "closing conditions" and the words "up to" suggesting NVIDIA hasn't really agreed how much it will really invest. A few weeks later, the Financial Times would report that Anthropic is trying to go public   as early as 2026 and that Microsoft and NVIDIA's money would "form part of a funding round expected to value the group between $300bn and $350bn." For some reason, Anthropic is hailed as some sort of "efficient" competitor to OpenAI, at least based on what both The Information and Wall Street Journal have said, yet it appears to be raising and burning just as much as OpenAI . Why did a company that's allegedly “reducing costs” have to raise $13 billion in September 2025 after raising $3.5 billion in March 2025 , and after raising $4 billion in November 2024 ? Am I really meant to read stories about Anthropic hitting break even in 2028 with a straight face? Especially as other stories say Anthropic will be cash flow positive “ as soon as 2027 .” And if this company is so efficient and so good with money , why does it need another $15 billion, likely only a few months after it raised $13 billion? 
Though I doubt the $15 billion round closes this year, if it does, it would mean that Anthropic would have raised $31.5 billion in 2025 — which is, assuming the remaining $22.5 billion comes from SoftBank, not far from the $40.8 billion OpenAI would have raised this year. In the event that SoftBank doesn't fund that money in 2025, Anthropic will have raised a little under $2 billion less ($16.5 billion) than OpenAI ($18.3 billion, consisting of $10 billion in June   split between $7.5 billion from SoftBank and $2.5 billion from other investors, and an $8.3 billion round in August ) this year. I think it's likely that Anthropic is just as disastrous a business as OpenAI, and I'm genuinely surprised that nobody has done the simple maths here, though at this point I think we're in the era of "not thinking too hard because when you do so everything feels crazy.” Which is why I'm about to think harder than ever! I feel like I'm asked multiple times a day both how and when the bubble will burst, and the truth is that it could be weeks or months or another year , because so little of this is based on actual, real stuff. While our markets are supported by NVIDIA's eternal growth engine, said growth engine isn't supported by revenues or real growth or really much of anything beyond vibes. As a result, it's hard to say exactly what the catalyst might be, or indeed what the bubble bursting might look like. Today, I'm going to sit down and give you the scenarios — the systemic shocks — that would potentially start the unravelling of this era, as well as explain what a bubble bursting might actually look like, both for private and public companies. This is the spiritual successor to August's AI Bubble 2027 , except I'm going to have a little more fun and write out a few scenarios that range from likely to possible , and try and give you an enjoyable romp through the potential apocalypses waiting for us in 2026. 
The two conclusions experts arrived at:

- Gemini 3 is good/better at the stuff tested on benchmarks compared to what OpenAI has.
- OpenAI's growth and usage was decelerating before this happened, and this just allows OpenAI to point to something.

The rundown of how well AI is going for Microsoft:

- Its chips effort is falling behind , with its "Maya" AI chip delayed to 2026, and according to The Information, "when it finally goes into mass production next year, it’s expected to fall well short of the performance of Nvidia’s flagship Blackwell chip."
- According to The Information in late October 2025 , "more customers have been using Microsoft’s suite of AI copilots, but many of them aren’t paying for it."
- In October , Australia's Competition and Consumer Commission sued Microsoft for "allegedly misleading 2.7 million Australians over Microsoft 365 subscriptions," by making it seem like they had to pay extra and integrate Copilot into their subscription rather than buy the, and I quote, "undisclosed third option, the Microsoft 365 Personal or Family Classic plans, which allowed subscribers to retain the features of their existing plan, without Copilot, at the previous lower price." This is what a company does when it can't sell shit. Google did the same thing with its workspace accounts earlier in the year . This should be illegal!
- According to The Information in September 2025 , Microsoft had to "partly" replace OpenAI's models with Anthropic's for some of its Copilot software. Microsoft has, at this point, sunk over ten billion dollars into OpenAI, and part of its return for doing so was exclusively being able to use its models. Cool!
- According to The Information in September 2025 , Microsoft has had to push discounts for Office 365 Copilot as customers had "found Copilot adoption slow due to high cost and unproven ROI." In late 2024 , customers had paused purchasing further Copilot assistants due to performance and cost issues.

Kaushik Gopal 5 days ago

Combating AI coding atrophy with Rust

It’s no secret that I’ve fully embraced AI for my coding. A valid concern ( and one I’ve been thinking about deeply ) is the atrophying of the part of my brain that helps me code. To push back on that, I’ve been learning Rust on the side for the last few months. I am absolutely loving it. Kotlin remains my go-to language. It’s the language I know like the back of my hand. If someone sends me a swath of Kotlin code, whether handwritten or AI generated, I can quickly grok it and form a strong opinion on how to improve it. But Kotlin is a high-level language that runs on a JVM. There are structural limits to the performance you can eke out of it, and for most of my career 1 I’ve worked with garbage-collected languages. For a change, I wanted a systems-level language, one without the training wheels of a garbage collector. I also wanted a language with a different core philosophy, something that would force me to think in new ways. I picked up Go casually but it didn’t feel like a big enough departure from the languages I already knew. It just felt more useful to ask AI to generate Go code than to learn it myself. With Rust, I could get code translated, but then I’d stare at the generated code and realize I was missing some core concepts and fundamentals. I loved that! The first time I hit a lifetime error, I had no mental model for it. That confusion was exactly what I was looking for. Coming from a GC world, memory management is an afterthought — if it requires any thought at all. Rust really pushes you to think through the ownership and lifespan of your data, every step of the way. In a bizarre way, AI made this gap obvious. It showed me where I didn’t understand things and pointed me toward something worth learning. Here’s some software that’s either built entirely in Rust or uses it in fundamental ways: Many of the most important tools I use daily are built with Rust. Can’t hurt to know the language they’re written in. Rust is quite similar to Kotlin in many ways. 
Both use strict static typing with advanced type inference. Both support null safety and provide compile-time guarantees. The compile-time strictness and higher-level constructs made it fairly easy for me to pick up the basics. Syntactically, it feels very familiar. I started by rewriting a couple of small CLI tools I used to keep in Bash or Go. Even in these tiny programs, the borrow checker forced me to be clear about who owns what and when data goes away. It can be quite the mental workout at times, which is perfect for keeping that atrophy from setting in. After that, I started to graduate to slightly larger programs and small services. There are two main resources I keep coming back to: There are times when the book or course mentions a concept and I want to go deeper. Typically, I’d spend time googling, searching Stack Overflow, finding references, diving into code snippets, and trying to clear up small nuances. But that’s changed dramatically with AI. One of my early aha moments with AI was how easy it made ramping up on code. The same is true for learning a new language like Rust. For example, what’s the difference 2 between these two: Another thing I loved doing is asking AI: what are some idiomatic ways people use these concepts? Here’s a prompt I gave Gemini while learning: Here’s an abbreviated response (the full response was incredibly useful): It’s easy to be doom and gloom about AI in coding — the “we’ll all forget how to program” anxiety is real. But I hope this offers a more hopeful perspective. If you’re an experienced developer worried about skill atrophy, learn a language that forces you to think differently. AI can help you cross that gap faster. Use it as a tutor, not just a code generator. I did a little C/C++ in high school, but nowhere close to proficiency.  ↩︎ Think mutable var to a “shared reference” vs. immutable var to an “exclusive reference”.  
↩︎

Software that’s either built entirely in Rust or uses it in fundamental ways:

- fd (my tool of choice for finding files)
- ripgrep (my tool of choice for searching files)
- Fish shell (my shell of choice, recently rewritten in Rust)
- Zed (my text/code editor of choice)
- Firefox (my browser of choice)
- Android?! That’s right: Rust now powers some of the internals of the OS, including the recent Quick Share feature.

The two resources I keep coming back to:

- The official Rust book, fondly referred to as “ The Book ”. There’s also a convenient YouTube series following the book .
- Google’s Comprehensive Rust course, presumably created to ramp up their Android team. It even has a dedicated Android chapter .

This worked beautifully for me.
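The distinction footnote 2 alludes to can be made concrete with a minimal sketch (my own illustration, not from the post; function names are mine): a mutable binding to a shared reference (`&T`) versus an immutable binding to an exclusive reference (`&mut T`).

```rust
// A *mutable binding* to a *shared* reference (&T): the binding can be
// repointed, but the pointee cannot be modified through it.
fn rebind_shared<'a>(first: &'a i32, second: &'a i32) -> (i32, i32) {
    let mut shared: &i32 = first;
    let before = *shared; // can read through it...
    // *shared += 1;      // ...but not write: `&i32` is read-only
    shared = second;      // the binding itself can be repointed
    (before, *shared)
}

// An *immutable binding* to an *exclusive* reference (&mut T): the
// binding is fixed, but the pointee can be mutated through it.
fn mutate_exclusive(target: &mut i32) -> i32 {
    let excl: &mut i32 = target;
    *excl += 1;           // can write through it...
    // excl = target;     // ...but cannot repoint the immutable binding
    *excl
}

fn main() {
    let (a, b) = (10, 20);
    println!("{:?}", rebind_shared(&a, &b)); // (10, 20)
    let mut c = 30;
    println!("{}", mutate_exclusive(&mut c)); // 31
}
```

Uncommenting either forbidden line produces exactly the kind of borrow-checker error that, coming from a GC language, has no obvious mental model at first.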

Chris Coyier 5 days ago

The Jeopardy Phenomenon

There’s the thing where, if you’re reading an article in the newspaper about stuff you don’t know a ton about, it all seems well and good. Then you read another article in the same paper about something you know intimately (your job, your neighborhood, your hobby, etc.), and there is a good chance you’ll be like hey! that’s not quite right! I think of that as the Jeopardy Phenomenon. On the TV game show Jeopardy, if you don’t know the answer to a question, it can feel very much like jeez, this quiz show is really hard ! But then if a category or question comes up around a topic you know a bit about, the question (or “answer” in reverse Jeopardy parlance) can feel very basic and simple. Like if the “answer” is about popular fantasy card games, the “question” is not going to be Android Netrunner, it’s going to be Magic: The Gathering. (And you’ll roll your eyes a little bit, because it’s like duh .) I think AI has the Jeopardy Phenomenon too. If you use it to generate code that is outside your expertise, you are likely to think it’s all well and good, especially if it seems to work at first pop. But if you’re intimately familiar with the technology or the code around the code it’s generating, there is a good chance you’ll be like hey! that’s not quite right!

Sean Goedecke 5 days ago

AI detection tools cannot prove that text is AI-generated

The runaway success of generative AI has spawned a billion-dollar sub-industry of “AI detection tools”: tools that purport to tell you if a piece of text was written by a human being or generated by an AI tool like ChatGPT. How could that possibly work? I think these tools are both impressive and useful, and will likely get better. However, I am very worried about the general public overestimating how reliable they are. AI detection tools cannot prove that text is AI-generated. My initial reaction when I heard about these tools was “there’s no way that could ever work”. I think that initial reaction is broadly correct, because the core idea of AI detection tools - that there is an intrinsic difference between human-generated writing and AI-generated writing - is just fundamentally mistaken 0 . Large language models learn from huge training sets of human-written text. They learn to generate text that is as close as possible to the text in their training data. It’s this data that determines the basic “voice” of an AI model, not anything about the fact that it’s an AI model. A model trained on Shakespeare will sound like Shakespeare, and so on. You could train a thousand different models on a thousand different training sets without finding a common “model voice” or signature that all of them share. We can thus say (almost a priori ) that AI detection tools cannot prove that text is AI-generated. Anything generated by a language model is by definition the kind of thing that could have been generated by a human. But of course it’s possible to tell when something was written by AI! When I read Twitter replies, the obviously-LLM-generated ones stick out like a sore thumb. I wrote about this in Why does AI slop feel so bad to read? . How can this be possible, when it’s impossible to prove that something was written by AI? 
Part of the answer here might just be that current-generation AI models have a really annoying “house style”, and any humans writing in the same style are annoying in the same way . When I read the first sentence of a blog post and think “oh, this is AI slop, no need to keep reading”, I don’t actually care whether it’s AI or not. If it’s a human, they’re still writing in the style of AI slop, and I still don’t want to read the rest of the post. However, I think there’s more going on here. Claude does kind of sound like ChatGPT a lot of the time, even though they’re different models trained in different ways on (at least partially) different data. I think the optimistic case for AI detection tooling goes something like this: I find this fairly compelling, so long as you’re okay with a 90% success rate . A 90% success rate can be surprisingly bad if the base rate is low, as illustrated by the classic Bayes’ theorem example . If 10% of essays in a class are AI-written, and your detector is 90% accurate, then only half of the essays it flags will be truly AI-written. If an AI detection tool thinks a piece of writing is AI, you should treat that as “kind of suspicious” instead of conclusive proof. There are a few different approaches to building AI detection tools. The naive approach - which I couldn’t find any actual production examples of - would be to train a simple text classifier on a body of human-written and AI-written text. Apparently this doesn’t work particularly well. The Ghostbuster paper tried this and decided that it was easier to train a classifier on the logits themselves: they pass each candidate document through a bunch of simple LLMs, record how much each LLM “agreed” with the text, then train their classifier on that data. DNA-GPT takes an even simpler approach: they truncate a candidate document, regenerate the last half via frontier LLMs, and then compare that with the actual last half. 
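The base-rate arithmetic in that Bayes example is easy to check with a few lines (a standalone sketch; the function name is mine):

```rust
// Positive predictive value: of the essays a detector flags, what
// fraction are actually AI-written? Straight Bayes' theorem.
fn positive_predictive_value(prevalence: f64, sensitivity: f64, specificity: f64) -> f64 {
    let true_positives = prevalence * sensitivity;
    let false_positives = (1.0 - prevalence) * (1.0 - specificity);
    true_positives / (true_positives + false_positives)
}

fn main() {
    // 10% of essays are AI-written; the detector is "90% accurate" in
    // both directions (catches 90% of AI essays, wrongly flags 10% of
    // human ones). 0.09 / (0.09 + 0.09) = 0.5.
    let ppv = positive_predictive_value(0.10, 0.90, 0.90);
    println!("{:.0}% of flagged essays are truly AI-written", ppv * 100.0);
    // prints "50% of flagged essays are truly AI-written"
}
```

Note how fast the picture changes with the base rate: at 50% prevalence the same detector's flags are right 90% of the time, which is why "90% accurate" alone tells you very little.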
The most impressive paper I’ve seen is the EditLens paper by Pangram Labs. EditLens trains a model on text that was edited by AI to various extents, not generated from scratch, so the model can learn to predict the granular degree of AI involvement in a particular text. This plausibly gets you a much better classifier than a strict “AI or not” classifier model, because each example teaches the model a numeric value instead of a single bit of information. One obvious point: all of these tools use AI themselves . There is simply no way to detect the presence of AI writing without either training your own model or running inference via existing frontier models. This is bad news for the most militantly anti-AI people, who would prefer not to use AI for any reason, even to catch other people using AI. It also means that - as I said earlier and will say again - AI detection tools cannot prove that text is AI-generated . Even the best detection tools can only say that it’s extremely likely. Interestingly, there’s a sub-sub-industry of “humanizing” tools that aim to convert your AI-generated text into text that will be judged by AI detection tools as “human”. Some free AI detection tools are actually sales funnels for these humanizing tools, and will thus deliberately produce a lot of false-positives so users will pay for the humanizing service. For instance, I ran one of my blog posts 1 through JustDone , which assessed them as 90% AI generated and offered to fix it up for the low, low price of $40 per month. These tools don’t say this outright, but of course the “humanizing” process involves passing your writing through a LLM that’s either prompted or fine-tuned to produce less-LLM-sounding content. I find this pretty ironic. There are probably a bunch of students who have been convinced by one of these tools to make their human-written essay LLM-generated, out of (justified) paranoia that a false-positive would get them in real trouble with their school or university. 
It is to almost everyone’s advantage to pretend that these tools are better than they are. The companies that make up the billion-dollar AI detection tool industry obviously want to pretend like they’re selling a perfectly reliable tool. University and school administrators want to pretend like they’ve got the problem under control. People on the internet like dunking on people by posting a screenshot that “proves” they’re copying their messages from ChatGPT. Even the AI labs themselves would like to pretend that AI detection is easy and reliable, since it would relieve them of some of the responsibility they bear for effectively wrecking the education system. OpenAI actually released their own AI detection tool in January 2023, before retiring it six months later due to “its low rate of accuracy”. The real people who suffer from this mirage are the people who are trying to write, but now have to deal with being mistakenly judged for passing AI writing off as their own. I know students who are second-guessing how they write in order to sound “less like AI”, or who are recording their keystrokes or taking photos of drafts in order to have some kind of evidence that they can use against false positives. If you are in a position where you’re required to judge if people are using AI to write their articles or essays, I would urge you to be realistic about the capabilities of AI detection tooling. They can make educated guesses about whether text was written by AI, but that’s all they are: educated guesses. That goes double if you’re using a detection tool that also offers a “humanizing” service, since those tools are incentivized to produce false positives. AI detection tools cannot prove that text is AI-generated. People sometimes talk about watermarking : when a provider like OpenAI deliberately trains their model to output text in some cryptographic way that leaves a very-hard-to-fake fingerprint. 
For instance, maybe it could always output text where the frequency of “e”s divided by the frequency of “l”s approximates pi. That would be very hard for humans to copy! I suspect there’s some kind of watermarking going on already (OpenAI models output weird space characters, which might trip up people naively copy-pasting their content) but I’m not going to talk about it in this post, because (a) sophisticated watermarking harms model capability so I don’t think anyone’s doing it, and (b) unsophisticated watermarking is easily avoided.

The optimistic case mentioned earlier, spelled out:

- RLHF and instruction/safety tuning pushes all strong LLMs towards the same kind of tone and style
- That tone and style can be automatically detected by training a classifier model
- Sure, it’s possible for technically-sophisticated users to use abliterated LLMs or less-safety-tuned open models, but 99% of users will just be using ChatGPT or Claude (particularly if they’re lazy enough to cheat on their essays in the first place)
- Thus a fairly simple “ChatGPT/Claude/Gemini prose style detector” can get you 90% of the way towards detecting most people using LLMs to write their essays

I write every one of these posts with my own human fingers. ↩
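Taken literally, the tongue-in-cheek e/l-ratio watermark would amount to a toy check like this (entirely my own sketch; no real watermarking system works this way):

```rust
use std::f64::consts::PI;

// Toy "detector" for the whimsical scheme: does the frequency of 'e'
// divided by the frequency of 'l' approximate pi?
fn e_to_l_ratio(text: &str) -> Option<f64> {
    let e = text.chars().filter(|c| c.eq_ignore_ascii_case(&'e')).count();
    let l = text.chars().filter(|c| c.eq_ignore_ascii_case(&'l')).count();
    if l == 0 { None } else { Some(e as f64 / l as f64) }
}

fn looks_watermarked(text: &str, tolerance: f64) -> bool {
    matches!(e_to_l_ratio(text), Some(r) if (r - PI).abs() < tolerance)
}

fn main() {
    let sample = "hello little angel"; // 3 e's, 5 l's -> ratio 0.6
    println!("ratio {:?}, watermarked: {}",
             e_to_l_ratio(sample), looks_watermarked(sample, 0.1));
}
```

It also makes plain why an unsophisticated scheme is trivially defeated: pad any text with a few extra “e”s or “l”s and the ratio moves wherever you want.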

Wreflection 5 days ago

Software Applications Face A New Intermediary

In 2009, Amazon’s leadership team noticed an unmistakable trend in their monthly business reviews (MBR). Mobile traffic to Amazon’s websites was increasing rapidly and, if the trend continued, would one day overtake desktop traffic ( it eventually did in 2016 ).

Source: wreflection.com using data from Statcounter

The year prior, Apple had opened the iPhone’s App Store to third-party developers, and Amazon had launched two apps on the App Store: a shopping app that let customers browse and buy physical products from Amazon’s e-commerce store, and a Kindle app where users could read ebooks but could not buy them. Apple’s 30% commission on in-app digital purchases meant Amazon couldn’t afford to sell ebooks or digital music/video through its iOS apps ( see footnote for more context on Amazon’s digital business economics 1 ). The traffic trend exposed a strategic risk. Mobile would become the primary shopping channel, and if Apple decided to expand its take rate to physical goods (say 1-2% or so like credit card interchange fees ), Amazon’s entire retail business would face the App Store tax, not just the digital business.

In response 2 , Amazon started a new personal device project in 2010. From Fast Company :

The project, code-named “Tyto” for a genus of owl, got rolling in 2010. Customers often come to Amazon via iPhones or Android devices. Not controlling the hardware can create problems. For instance, you can’t buy e-books through the Kindle app on your iPhone because Apple takes 30% of app-driven sales—a cut that would hurt Amazon’s already razor-thin margin.

This device ( Fire Phone ) flopped, but the strategic concern that motivated it was valid. Mobile Operating Systems (OS) had inserted themselves as the new layer between users and applications, dictated what apps could and could not do 3 , and imposed significant transaction fees . Today, AI is emerging as another new layer.
source: wreflection.com

The AI Agent Layer

Players from both up and down the stack want to own this new layer.

Application AI Agent. ChatGPT, Gemini, Perplexity etc. are building browsing and shopping AI agents into their apps that sit between users and merchants. OpenAI announced its Instant Checkout feature in September 2025, initially with Etsy and Shopify, followed quickly by Walmart and Target 4 . Amazon has notably stayed out of this partnership. But when Perplexity launched its AI browser Comet that lets users browse shopping websites using Perplexity’s AI agent, Amazon blocked Comet’s access and sent a cease-and-desist letter. While I am skeptical of current AI chatbot shopping experiences 5 , the public dispute between the two shows that neither wants to cede control of the user interaction.

Operating System AI Agent. Apple Intelligence and Google’s Gemini Android integration represent attempts to make AI a native OS capability. For example, Apple’s App Intents API is trying to create structured ways for the OS AI to interact with applications. The App Intents framework provides functionality to deeply integrate your app’s actions and content with system experiences. With Apple Intelligence, Siri will suggest your app’s actions to help people discover your app’s features and gains the ability to take actions in and across apps. The vision is that when you ask Siri to “order dinner,” 6 the OS AI compares across multiple apps (such as DoorDash or Uber Eats), shows you the options (including which ones have discounts), handles the back-and-forth, and completes the transaction. The user may never open some apps directly 7 . Eventually, the OS AI layer, in addition to coordination and orchestration, might even generate a custom UI across different apps if required.
Source: wreflection.com; generated using ChatGPT

Applications from the foundation model companies have the technological advantages. Their models are more capable. They have built dedicated AI infrastructure, from physical resources like compute and power access to operational systems that serve models with low latency. They have accumulated deep AI expertise and routinely publish world-class research, creating a flywheel that attracts other top AI talent. Apple Intelligence, by contrast, has struggled. John Gruber, an influential writer and podcaster, captured the sentiment in his critical piece “ Something Is Rotten in the State of Cupertino .” Google has competitive models with Gemini and can even outline a compelling vision for the OS AI layer, but organizational dysfunction has prevented Android from shipping a cohesive one. This dysfunction reminds me of that old joke about Google promising to make messaging simple again with yet another chat application (Google Talk, Google Hangouts, Android Messages, Google+ Messenger, Huddle, Google Allo, Duo, Google Voice, Google Chat, Google Meet; did I miss any?).

Structurally, though, Operating Systems are better positioned. They have system-level access to all apps on the device, so they can operate across multiple apps to finish a task, something a ChatGPT cannot do. They have access to personal data across messages, photos, calendar, and location that any individual application cannot match. They have a distribution advantage since they can ship to current devices via software updates and can come pre-installed on future ones. So Amazon may be able to block Perplexity from intermediating users, but will they block Siri when Apple does the same? I suppose, in a way, this is the “ Everything App/Super App ” that Chinese companies pioneered finally reaching Western markets, realized through the operating system rather than a single application.
Speaking of China, a new entrant from Beijing suggests that the pursuit of OS AI will not be an American-only affair. Earlier this week, ByteDance unveiled a demo of their Doubao Phone Assistant, a Graphical User Interface (GUI) based OS AI that takes a different approach. Instead of relying on system-level hooks, it uses multimodal understanding of screen content to enable cross-app device control via simulated tapping, swiping, and typing. This direct interpretation means it can theoretically operate inside any app, including those the device has never seen before and with or without explicit developer cooperation. Apple Intelligence lacks this capability because it, as of now, requires apps to implement in advance the App Intents API mentioned earlier. The demo showed price comparison across multiple retailers, car control integration with Tesla, and other agentic tasks. From a market perspective, the parallels to the automotive industry are striking. Legacy US and European automakers once dismissed Chinese electric vehicles (EV) as cheap knockoffs 8 . But now Chinese EVs are capturing market share across Europe by competing both on price and quality/performance/features 9 . Tariffs and regulatory barriers have kept Chinese EVs out of the US market for now 10 . Similarly, Chinese personal device manufacturers are currently dismissed as copycats 11 . But with their novel and ambitious OS AI efforts, proven open-source AI models such as DeepSeek, and lower cost structure, they could soon compete both on capability and cost. If that happens, the US will likely have to impose restrictions on these Chinese AI devices just as it has on Chinese cars and telecom equipment. But can the US influence the rest of the world to adopt American OS AI if ByteDance/Doubao or something like it proves superior at lower costs? Your guess is as good as mine. 
Every product that lives on the application layer faces a similar question to the one Amazon’s leadership faced fifteen years ago as they see a similar chart in their business reviews.

Source: Adobe Digital Insights, July 2024–July 2025

The Verge, a tech news website, notes that CEOs of companies in the application layer such as Uber, DoorDash, Airbnb, and Lyft all believe their accumulated advantages (supply networks, operational know-how, brand loyalty etc.) will protect them from AI disintermediation (quotes slightly edited for better readability/context):

Brian Chesky (Airbnb CEO): There’s this AI maximalist view that there’s going to be like one or two AI models and one or two applications that rule them all and you use this one app and this one model for everything in the world. If you take that to its logical conclusion, you also start to go to this place where almost one company rules everything, and I think there’s numerous problems with the AI maximalist view that it’s one company to rule them all. I don’t think companies just want to be data layers, and so these platforms or these new interfaces are only as good as the companies that participate, and the companies will only participate if they can have a relationship with their own customer.

Dara Khosrowshahi (Uber CEO): As it relates to AI and these agents, we want to work with these players. We come from a place of strength because of the unique inventory and the fragmentation of these markets that we’re organizing. People spend so much time trying to figure out what the economics might be when the first thing is to try it out. Figure out the experience first and then the economics. Once you optimize the experience, we can measure. Are you an incremental consumer for Uber or are you totally cannibalistic? If it’s cannibalistic, then I’m going to charge a lot of money. If it is incremental, then I would pay some take rate. Is it a 5, 10, or 20 percent take rate? It depends on the incrementality.
Ania Smith (Taskrabbit CEO): Siri, in this case, wants to be able to provide these types of services, and they can’t really do it without Taskrabbit because only Taskrabbit actually has a network of thousands of Taskers. We cultivated that network. We know who they are and we understand their skills. They’ve all been background checked. Siri, Apple, or whoever is not going to do that.

Amazon’s lawsuit against Perplexity suggests the company doesn’t share this view. Fifteen years ago, facing similar concerns, Amazon tried to build its own device. That device failed, but Amazon’s e-commerce business continued to thrive by accepting its position on the application layer and optimizing within those constraints. They refused to sell digital content through iOS apps and continued to invest in the customer experiences that the App Store didn’t tax. Just as Amazon’s leadership once saw personal devices as a way to escape the App Store tax, some AI companies (notably OpenAI) believe they too must make a personal computing device to own the user relationship. Besides the obvious belief that AI-native companies like themselves can design AI personal devices better than today’s device makers can, they don’t want to live at the mercy of Operating Systems. They don’t want to ask Cupertino for permission. 12 They want to own their own destiny.

Under the wholesale model, Amazon paid publishers a fixed price for ebooks and set its own retail price, often selling $9.99 bestsellers at a loss to drive Kindle adoption. When Apple launched iBooks in 2010, it adopted the agency model (where publishers set prices and Apple takes 30%), then tried to force Amazon to do the same.
The US Department of Justice (DOJ) later successfully sued Apple and five publishers for price-fixing conspiracy. While publications such as Fast Company attribute the Fire Phone project to concerns over App Store fees, Amazon has never publicly confirmed this reasoning. In the book Working Backwards, former Amazon executives Colin Bryar and Bill Carr recount how Amazon’s earlier personal device initiative (Kindle) originated from similar strategic concerns. The catalyst was a 2003 meeting between Jeff Bezos and Steve Jobs, where Jobs demonstrated iTunes for Windows and suggested Amazon would become “the last place to buy CDs.” This prompted Bezos to build an Amazon-controlled digital reading platform before Apple or others could dominate ebooks. Apple’s policies prohibited developers (until May 2025) from directing users to alternative/cheaper payment options not just in their apps but across their websites, support docs, emails/newsletters, and any other communication with customers. This “partner with non-market leaders” strategy for entering markets mirrors Apple’s iPhone launch with second-placed carriers (AT&T instead of Verizon in the US, SoftBank instead of NTT DoCoMo in Japan). The theory is that the second- and third-placed players are willing to cede control for market access, while the market leader typically tries to dictate terms. I think the current AI chatbot shopping experiences lack a visual UX that is on par with or better than the native shopping app. The technology might get there eventually with generative UI, but until then they will remain a small fraction of how people shop. The irony of yet another food delivery, restaurant reservation, or travel booking AI agent example is not lost on me. This applies most directly to transactional apps such as food delivery or ride hailing, where the value is in the outcome rather than the interface.
Games, media, or productivity apps will likely remain experience destinations where the app itself delivers value. Reuters, 2024. Chinese EVs made up 10% of EV sales in Europe in 2023; UBS, a bank, predicts China’s share of all cars sold in Europe could hit 20% by 2030. Economist, 2024. Biden Administration tariffs on Chinese automotive imports, 2024. Sheel Mohnot, X/Twitter, 2025. In 2015, apparently Apple CEO Tim Cook summoned Uber CEO Travis Kalanick to Apple HQ in Cupertino after finding out that Uber secretly tracked iPhones even after Uber’s app had been deleted. Apple directed Uber to either stop or lose access to the App Store.

Source: wreflection.com using data from Statcounter

The year prior, Apple had opened the iPhone’s App Store to third-party developers, and Amazon had launched two apps on the App Store: a shopping app that let customers browse and buy physical products from Amazon’s e-commerce store, and a Kindle app where users could read ebooks but could not buy them. Apple’s 30% commission on in-app digital purchases meant Amazon couldn’t afford to sell ebooks or digital music/video through its iOS apps (see footnote 1 for more context on Amazon’s digital business economics). The traffic trend exposed a strategic risk. Mobile would become the primary shopping channel, and if Apple decided to expand its take rate to physical goods (say, 1-2% or so, like credit card interchange fees), Amazon’s entire retail business would face the App Store tax, not just the digital business. In response, 2 Amazon started a new personal device project in 2010. From Fast Company: The project, code-named “Tyto” for a genus of owl, got rolling in 2010. Customers often come to Amazon via iPhones or Android devices. Not controlling the hardware can create problems. For instance, you can’t buy e-books through the Kindle app on your iPhone because Apple takes 30% of app-driven sales—a cut that would hurt Amazon’s already razor-thin margin.
This device (Fire Phone) flopped, but the strategic concern that motivated it was valid. Mobile Operating Systems (OS) had inserted themselves as the new layer between users and applications, dictated what apps could and could not do, 3 and imposed significant transaction fees. Today, AI is emerging as another new layer.

Source: wreflection.com

The AI Agent Layer

Players from both up and down the stack want to own this new layer.

Application AI Agent. ChatGPT, Gemini, Perplexity, etc. are building browsing and shopping AI agents into their apps that sit between users and merchants. OpenAI announced its Instant Checkout feature in September 2025, initially with Etsy and Shopify, followed quickly by Walmart and Target. 4 Amazon has notably stayed out of this partnership. But when Perplexity launched its AI browser Comet, which lets users browse shopping websites using Perplexity’s AI agent, Amazon blocked Comet’s access and sent a cease-and-desist letter. While I am skeptical of current AI chatbot shopping experiences, 5 the public dispute between the two shows that neither wants to cede control of the user interaction.

Operating System AI Agent. Apple Intelligence and Google’s Gemini integration into Android represent attempts to make AI a native OS capability. For example, Apple’s App Intents API is trying to create structured ways for the OS AI to interact with applications. From Apple’s documentation:

The App Intents framework provides functionality to deeply integrate your app’s actions and content with system experiences. With Apple Intelligence, Siri will suggest your app’s actions to help people discover your app’s features and gains the ability to take actions in and across apps.
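To make the App Intents idea concrete, here is a minimal, hypothetical sketch of what a food-delivery app might declare to the OS. The intent name, parameter, and dialog text are invented for illustration and are not from Apple’s documentation or any shipping app; the AppIntent protocol, the @Parameter property wrapper, and the shape of perform() follow Apple’s framework.

```swift
import AppIntents

// A hypothetical intent a delivery app could expose so that
// Siri / Apple Intelligence can act on the user's behalf.
struct OrderDinnerIntent: AppIntent {
    static var title: LocalizedStringResource = "Order Dinner"

    // The OS AI can fill this parameter from the user's request
    // ("order dinner from Thai Palace") or ask a follow-up question.
    @Parameter(title: "Restaurant")
    var restaurant: String

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // A real app would call its ordering backend here.
        return .result(dialog: "Starting an order from \(restaurant).")
    }
}
```

Because actions are declared structurally like this rather than hidden behind the app’s UI, the OS can discover, rank, and invoke them directly, which is what lets Siri act in and across apps without simulating taps the way Doubao’s screen-reading approach does.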
The vision is that when you ask Siri to “order dinner,” 6 the OS AI compares across multiple apps (such as DoorDash or Uber Eats), shows you the options (including which ones have discounts), handles the back-and-forth, and completes the transaction. The user may never open some apps directly. 7 Eventually, the OS AI layer, in addition to coordination and orchestration, might even generate a custom UI across different apps if required.

Source: wreflection.com; generated using ChatGPT

The application AI agents from foundation model companies have the technological advantages. Their models are more capable. They have built dedicated AI infrastructure, from physical resources like compute and power access to operational systems that serve models with low latency. They have accumulated deep AI expertise and routinely publish world-class research, creating a flywheel that attracts other top AI talent. The operating systems, meanwhile, have the structural advantages. They have system-level access to all apps on the device, so they can operate across multiple apps to finish a task, something a ChatGPT cannot do. They have access to personal data across messages, photos, calendar, and location that any individual application cannot match. And they have a distribution advantage, since they can ship to current devices via software updates and come pre-installed on future ones.

Martin Fowler 6 days ago

Fragments Dec 4

Rob Bowley summarizes a study from Carnegie Mellon looking at the impact of AI on a bunch of open-source software projects. Like any such study, we shouldn’t take its results as definitive, but there seems to be enough there to make it a handy data point. The key point is that the AI code probably reduced the quality of the code base - at least if static code analysis can be trusted to determine quality. And perhaps some worrying second-order effects: This study shows more than 800 popular GitHub projects with code quality degrading after adopting AI tools. It’s hard not to see a form of context collapse playing out in real time. If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends, producing even worse code over time. ❄                ❄                ❄                ❄                ❄ Rob’s post is typical of much of the thoughtful writing on AI. We can see its short-term benefits, but worry about its long-term impact. But on a much deeper note is this lovely story from Jim Highsmith. Jim has turned 0x50, and has spent the last decade fighting Parkinson’s disease. To help him battle it he has two AI-assisted allies: Between my neural implants and Byron’s digital guidance, I now collaborate with two adaptive systems: one for motion, one for thought. Neither replaces me. Both extend me. If you read anything on AI this week, make it be this. It offers a positive harbinger for our future and opens my mind to a whole different perspective on the role of AI in it. ❄                ❄                ❄                ❄                ❄ Anthropic recently announced that it disrupted a Chinese state-sponsored operation abusing Claude Code. Jim Gumbley looks at the core lesson to learn from this: that we have to understand the serious risk of AI jailbreaking. New AI tools are able to analyze your attack surface at the next level of granularity.
As a business leader, that means you now have two options: wait for someone else to run AI-assisted vulnerability detection against your attack surface, or run it yourself first. ❄                ❄                ❄                ❄                ❄ There are plenty of claims that AI Vibe Coding can replace software developers, something that folks like me (perhaps with a bias) think unlikely. Gergely Orosz shared this tidbit: Talked with an exec at a tech company who is obsessed with AI and has been for 3 years. Not a developer but company makes software. Uses AI for everything, vibe codes ideas. Here’s the kicker: Has a team of several devs to implement his vibe coded prototypes to sg workable I’d love to hear more about this (and similar stories). ❄                ❄                ❄                ❄                ❄ Nick Radcliffe writes about a month of using AI: I spent a solid month “pair programming” with Claude Code, trying to suspend disbelief and adopt a this-will-be-productive mindset. More specifically, I got Claude to write well over 99% of the code produced during the month. I found the experience infuriating, unpleasant, and stressful before even worrying about its energy impact. Ideally, I would prefer not to do it again for at least a year or two. The only problem with that is that it “worked”. He stresses that his approach is the “polar opposite” of Vibe Coding. The post is long, and rambles a bit, but is worthwhile because he talks in detail about his workflow and how he uses the tool. Such posts are important so we can learn the nitty-gritty of how our programming habits are changing. ❄                ❄                ❄                ❄                ❄ Along similar lines is a post by Brian Chambers on his workflow, which he calls Issue-Driven Development (and yes, I’m also sick of the “something-driven” phraseology).
As with much of the better stuff I’ve heard about AI assisted work, it’s all about carefully managing the context window, ensuring the AI is focused on the right things and not distracted by textual squirrels.

Stratechery 6 days ago

An Interview with Atlassian CEO Mike Cannon-Brookes About Atlassian and AI

Good morning, This week’s Stratechery Interview is with Atlassian founder and CEO Mike Cannon-Brookes. Cannon-Brookes and Scott Farquhar — whom I interviewed in 2017 — founded Atlassian in 2002; their first product was Jira, a project and issue-tracking tool, followed by Confluence, a team collaboration platform. Atlassian, thanks in part to their location in Australia, pioneered several critical innovations, including downloadable software and a self-serve business model; over the ensuing two decades Atlassian has moved to the cloud and greatly expanded their offering, and is now leaning into AI. In this interview we discuss that entire journey, including Cannon-Brookes’ desire to not have a job, how the absence of venture capital shaped the company, and how the company’s go-to-market approach has evolved. We then dive into AI, including why Cannon-Brookes believes that there will be more developers doing more, and why Atlassian’s position in the enterprise lets them create compelling offerings. Finally we discuss Atlassian’s sponsorship of Williams, the F1 race team, and why Cannon-Brookes thinks they can both help Williams win and also accrue big benefits for Atlassian. To repeat a disclosure I have long made in my Ethics Statement, I did, in the earliest years of Stratechery, take on consulting work for a limited number of companies, including Atlassian. And, for what it’s worth, I’m also a huge F1 fan! Go Max. As a reminder, all Stratechery content, including interviews, is available as a podcast; click the link at the top of this email to add Stratechery to your podcast player. On to the Interview: This interview is lightly edited for content and clarity. Mike Cannon-Brookes, welcome to Stratechery. MCB: Thank you for having me, Ben. So this is admittedly a new experience for me, I’ve already interviewed the founder of Atlassian, but it wasn’t you. I’m of course referring to Scott [Farquhar]. That was eight years ago, actually, before I even had podcasts.
It was very brief, but hey, like I said, new experiences. MCB: That’s true. That’s true. And you wrote a consulting paper for us in 2014! I was going to disclose, yes, in the very brief period where I did consulting work, you flew me down to Sydney for a week, I had a chance to learn a lot about Atlassian. And on a personal note, that consulting contract helped me a lot, that was when I was just starting. It’s funny how small the numbers seem in retrospect, but maybe that’s why I’ve shied away from writing about you too much over the years, because it meant a lot to me. So I appreciate it and there’s my disclosure for the interview. MCB: Thank you. It’s a good piece of work. Don’t forget, ironically, we started as a consulting and services business and then decided that software was a better business model, so I think you did the same thing. You went the scalability route instead of the consulting work via Sydney. Absolutely. I’m not doing anything that doesn’t scale anymore, but I did love visiting Sydney, so it was great. MCB: Still, we pulled out the old consulting paper you wrote for us in 2014. Why are we going to win, why are we going to lose, everything else, it was classic Ben work. Was it good? MCB: It’s pretty good! It’s interesting, I’d probably be embarrassed if I read it today. Anyhow, the good news is that since it’s the first time I’m interviewing you, I do get to do my favorite segment, which is learning more about you. Where did you grow up, but also, where were you born? I know they were different places. Then, how’d you get interested in technology and what’s your version of the Atlassian origin story? MCB: Sure, I feel like I’ve heard this question 1,000 times! Where to start? My dad was in banking, he joined the glorious institution that is Citibank today, from England. Parents are both from Cambridge and bounced around the world a lot as part of that job. 
Took the, “Hey, we need someone to go to this country”, and he was like, “I’ll take that”. So I was born in America, during a period when we lived in New York. To be honest, I lived there for three months before I moved to Taiwan. Really? Whoa. I didn’t know that. MCB: Yeah, in 1980 when it was very different than what it is today. Yeah. Were you saving that to drop that on me? I had no idea. I thought you went straight from America to Australia. MCB: I only just thought about it about 30 seconds ago, actually. No, I went to Taiwan for a few years, lived in Hong Kong for a few years, went to Australia for a few years. So how I got into technology is actually related, because my parents were moving around so much, the logic was, being English, that they would send us to English boarding schools and that would be a stable thing while they were moving once we got old enough. So at the mighty age of seven, I was put on Qantas and sent to England and back four times a year to go to boarding school in England for about five, six years. Because of that boarding school, I have one of the lowest frequent flyer numbers in Australia; they introduced the frequent flyer program at the end of year one or end of year two. I get given this catalog by my parents saying how you’ve earned all these points, “What do you want to buy?”, and it’s like, “I don’t know, trips, winery things, booze”, I’m flicking through this catalog and I’m like, “There’s literally nothing in this catalog” of gear that I wanted, and at the back is this computer, so I was like, “I guess I’ll get that”. The only thing that was potentially age appropriate. MCB: That was the only thing in the catalog, I didn’t want a toaster, I didn’t want wine, so that became my first computer, the mighty Amstrad PC20. Four colors, no hard drive.
Eventually, I bought an external floppy drive, so you could put in two, and I did buy magazines and type in programs and write games and stuff from magazines and play with it; I played a lot of video games basically back in that era. I was into computers peripherally all through high school, came back to Australia at 12, my parents had settled here by then and weren’t moving, and so I came back here, did all high school and university here. In high school, I was always going to be an architect, that was my dream the entire way through, but come the end of grade 12, I applied for a bunch of scholarships for university, ended up getting one and so I thought, “Oh, well, maybe I’ll take that”, and it was in a course called BIT. Basically, half computer science, half finance and economics, but it was 15 grand a year, tax-free, so I was like, “Well, I’ll do that for a while and go back to the architecture thing”. Of course, famously in that scholarship, I met my first business partner of my first startup, met my second business partner of the second startup, they went in radically different directions in terms of outcome, but it was just 30 kids right at the right time, did the dot-com era thing. Now, ironically, as a part of that scholarship, you had to spend six months in three industrial placements, so the origin story of Atlassian comes from then a little bit, because those industrial placements were so boring. Scott spent six months installing Windows at a large corporate and he was crazy freaking smart and it was like, “Hey, go from computer to computer and upgrade to Windows 98”, or whatever it was. It was like, “Guys, this is our life, this is going to be horrible”. I worked for Nortel Bay Networks, which was, at the time, a massive competitor to Cisco and then completely disappeared, so a good tech lesson in and of itself. I basically cataloged a room full of networking gear and routers; it was mind-numbingly boring.
So towards the end of the university course, I famously sent an email to a few people saying, “Look, I don’t really want to get a real job, why don’t we start a company and we’ll try some stuff?”. And this was after the dot-com era? This was the early 2000s? MCB: This was after the dot-com era, yeah. So I lived through the dot-com era actually as a journalist and writer, an analyst in technology. I worked for a company called Internet.com, which became Jupiter Media and Jupiter Research and that was great, that was an amazing era for me. We ran events, newsletters, what would’ve been podcasts, didn’t have them back then. And we ran events (Mobile Monday, I think one of them was called) and it was all about WAP and— Well, the real secret is you’re not the only one. There are some founders that are very successful, that they’re like, “Look, I just want to pontificate about technology”. MCB: A little bit like you, I remember getting in a lot of trouble from some of the startups, because some company would launch and I wrote basically 500 words on, “This thing’s never going to work, this is a disaster of an idea”, and they would ring up and yell at my boss and he was awesome, he’d be like, “Dude, just keep writing what you think”, and it didn’t make you very popular as a journalist type. Anyway, emailed some people, tried to start a business, we didn’t actually know what we were going to do. Atlassian has, I always tell people, a terrible origin story. You should not copy us. You just didn’t want to be installing Windows or upgrading software. MCB: We literally did not want to get a real job. And Scott replied and said, “Yeah, sure, I’m in for trying that”. He was one of the smartest kids in our class and his nickname is Skip, because he was the president of our student association and always a leader type and Eagle Scout and everything else, so we’re like, “Yeah, okay, let’s do that, we’re good mates” — and that started Atlassian.
We picked the name in about five minutes, which if you consulted any branding company, would not have been chosen. Ironically, originally, we were going to do customer service and consulting, that was what the gig was. Hence the name, because Atlas was a Greek titan whose job was to stand on top of the Atlas Mountains and hold up the sky, that’s what he was supposed to be doing. He was a bad guy, so his punishment was to hold the sky up, and we thought that was an act of legendary service, and so we were going to provide legendary service by holding up the sky for customers. As I said, we did the service thing for about six months, decided that this is a terrible business. People paying us $350 US to answer their questions didn’t scale and happened at crazy hours of the morning and night and everything else. So in the meantime, we wrote the first version of what became Jira. We actually wrote three pieces of software, one was a knowledge basey type tool, one was a mail archiving tool for groups, so you could see each other’s email as a shared archive. And were you seeing this and you were building tools for yourself, for your consulting business? MCB: Literally, yes, exactly. So all three were tools that we needed for ourselves. People would email us and I couldn’t see Scott’s email and he couldn’t see mine at the time and it was like this is silly, and we built Jira to handle questions and issues and problems that we were having ourselves, and that became a teeny bit popular. There was this glimmer that someone else cared, so we poured all the effort into that. What was that? What was the glimmer? Because this is when Agile is taking over software development and at least the legend is Jira and Agile go hand in hand, is that a correct characterization? MCB: A little bit, but this is actually pre-Agile. So Jira comes out before Agile is even a thing. I think it was about two or three years before we had any version of marketing or feature sets that involved Agile.
This was just a web-based, at the time, a bug tracker. So the interesting evolution part of the company obviously is it started as a bug tracker for software developers, it became an issue tracker for technology teams and now it’s like a business workflow for tens of millions of people every day across the world, most of whom have nothing to do with technology, so it’s gone on its own evolution. Would anything have been different if this was the plan from the beginning, or did it have to be this organic, “We’re figuring it out as we go along as we’re running away from Windows installations”, sort of story? MCB: I think, look, obviously, if we could choose to follow in our own footsteps, the Back to the Future skeptic in me would say it’s gone pretty well, so I’d follow every single footstep I took. (laughing) Yep, totally. MCB: And that would’ve become the plan. But look, we had two hunches really, which both turned out to be radically correct. Now, I would say we were following waves or whatever else, but one was that the Internet would change software distribution, which sounds ridiculous now and when I talk to graduates nowadays, I have to put them in the right time and place and say, “Look, when we started, software was distributed on a CD”, BEA WebLogic was the bee’s knees and you used to have to get it on a CD if you were lucky. If not, someone would come and install it for you and that’s how software was distributed. We made that CD into a ZIP file and put it on the Internet for people to download. You didn’t access it like a SaaS application, you literally download it from our website. Right. It’s funny that when you first say that, it’s like, “Oh, it’s completely transformative”, well, but you were an on-premises software story. But actually, no, there’s several steps to getting to SaaS, one of which is just downloading software. 
MCB: And we had people call us before they would download to check that we were real and stuff and I’m like, “Why don’t you just download the damn ZIP file?”, and that also dates us, because, well, maybe I’ll get to the business model part, but the second innovation was that we thought open source would change software costs. So we had this big hunch, we were both writing a bunch of open source code at the time. Open source was a massive movement, especially in the Java space. Embarrassingly, I actually wrote a book with some mates called Open Source Java Programming. It’s still on Amazon and we sold a few thousand copies, I think, but I swore I’d never write a book again, it was a very painful experience. Thank you, you’re validating my life decisions. MCB: Yeah. Open source did bring the cost of building software down radically. We were writing a very small layer, 5% of the code at best, on top of masses of amazing open source libraries, and we contributed to those libraries, but we could deliver an amazing experience for a very low cost. We learned a lot about pricing and packaging. So what was the implication of that hunch though? Just that the market for developers would grow, which would subsequently mean there was more software? MCB: A little bit, that was the implication of the hunch. Largely for us, it was that the cost was going down. Pre-open source, you had to write everything, so if Jira was back then, I don’t know, a million lines of code, if you added all the open source libraries together, it was 25, 30, 40 million lines of code. It was so big that it was so expensive, because you had to write all of that. To think of Windows, they wrote everything, the networking stack, there were no libraries, there was no open source involved in the original versions, it was all written by Microsoft. So the cost of that was very high, then you had to charge a lot of money.
So we thought, look, if we could take all these amazing open source libraries, contribute back to them — we were a great open source citizen — and build a piece of proprietary software on top of them that solved customers’ problems, we could deliver that really cheaply. In fact, we sold the original versions of Jira, they were $800, unlimited users, unlimited use with no lifespan. So it was just 800 bucks, one-time fee forever, and we learned a lot about pricing and packaging firstly, but secondly, it was very simple. Our goal in the early days, we had to sell one copy a week to stay alive, that was it. Some weeks, we’d sell two copies. $1,600 US would roll in and we’d be like, “Cool, we got a week off to survive”, and then one copy a week became two and two became five and five became ten, and now it’s hundreds of thousands. Well, wasn’t the thing that you just didn’t want to have a job? So I love this part of the story, because when I started Stratechery, I had a job at Microsoft that made, I think, $104,000 or something like that. I’m like, “I just want to make that, because I don’t want to work for a corporation, so if I could just get to there, it’ll be great”. MCB: We had exactly the same sets of goals. We had a few things we wanted: to make somewhere that we wanted to go to work. I wanted to get up every day and think, “I want to go to work”, and weirdly, almost 24 years later, I love coming to work, so a tick achieved. We wanted to make it so we didn’t have to wear a suit, neither of us really like wearing suits at all — in fact, it’s a bit of an allergic reaction often and so tick, don’t turn up to work in a suit every day. And thirdly, most of our friends, so this is right where IBM bought PwC ironically, so out of the 30-odd kids in our class, maybe 10 went to IBM as consultants and 10 went to PwC and then they all end up going to the same shop and their grad salary there was $47,600.
So our goal for year one was to end the year making at least a grad salary and convince ourselves we’re not crazy kind of thing, and we smashed that goal, so that was good, but that was there. The Internet, the distribution part is important, knowing your favorite topics. Tell me about that along with the business model, because again, this goes back so far, I don’t think people appreciate that this entire idea of self-serve or bottoms-up selling, this is really where it all started. MCB: Yes. And look, a few things. Firstly, if you come from Australia, we’re an exporting nation. “We’re built on the sheep’s back”, is a phrase, Australia’s built on the sheep’s back. What that really means is because we were this colony originally, then a country on the far side of the world, anything we did to make money largely had to leave the country and go somewhere else. Originally, that was a struggle, to find a product that could do that. “Built on a sheep’s back” is because wool was the first product that could do that: you could put it on a wooden boat, because it wasn’t very heavy, and you could ship it a long distance, because it kept really well, so we could make sheep’s wool and make money as a country by shipping it back to Europe and it could survive the journey, and so the country was built on the sheep’s back. We are a massive exporting nation. Trump brings in his tariffs, we’re the only country with a negative rate of return, we have a positive trade relationship with America and we’re like, “Wait a second, why did we get taxed?”, so obviously, whether it’s rocks or technology, we build and export everything that we do as a country. So our mentality was like, “Well, if we’re going to make money, it’s going to be overseas”, that was the first thing, is, “Okay, it’s going to be somewhere else, it’s not going to be Australians buying our software”, and so the Internet allowed us to do this.
We put up a shopfront, an early website, and people could come to our website, download our software, and then we just needed a way to get paid for it. The problem was, in order to do that past the trust barriers of the Internet, we had to have a very low price and we had to have a fully installable offering. So we spent so much time on making it installable, documentation, “How would you get yourself up and running and try it?” — the software, as we put it, had to sell itself. Our software had to be bought, not sold. We didn’t have any salespeople, we couldn’t travel to your office in Sweden or London and help you out with it. For $800, we couldn’t have done that, and secondly, it didn’t make any sense. So the evolution was, “Okay, the only possible path that we can go down is we have to figure out how to get people to do this”. Now it turns out once you have figured out how to do that, it’s an incredibly powerful motor, because you have lots of people coming, you have a very cheap piece of software for its relative performance, and you get people using it in all these big businesses all over the place. I would say 50% of the customers I go meet nowadays (I probably meet a handful of customers, a couple a day on an average kind of thing), many of those have been a customer for 20 years, 22 years, 23 years. How many companies have had customers for 23 years? I’m like, that’s crazy, we’re only 24 years old.

That’s awesome.

MCB: And so they downloaded it very early, and they didn’t download it as a whole company all at once, but all of them are customers.

Just one guy who’s like, “I need a way to track my issues”.

MCB: Exactly. It was some guy in a backroom who needed to track it. I know the Cisco origin story, that was literally a guy, he’s still there, he’s been there 22, 23 years, he’s awesome. And they started with just, “I just needed a way to manage my issues for 10 people”, and now it’s hundreds of thousands of people, seats that we have there; it’s kind of grown over time.
How did we know that business model was working? Again, it dates us a lot. This didn’t mean we didn’t answer questions; we were big on customer service and helping people, and email was the way to do that. A bit of IRC back then, we had a channel you could log into and we’d help you. But for the first customers, we used to walk into the office in the morning and we had a fax machine with literally rolls of paper. So if you wanted to pay for this distributed software (this shows how old we are: there were no SSL keys, I heard you complaining about that the other day, totally agree about that era), you had to download a PDF off our website, which was pretty modern in that it was a PDF, fill in your credit card details, and fax it to us. That is how you paid when we started. So we would walk in in the morning and there’d be these rolls of paper on the ground, and you’d be like, “Ah, sweet, someone bought something”, you know what I mean? It became a weird dopamine drug for us.

The very first company was American Airlines…

MCB: About six months in, we came in one morning and there was a fax on the ground with $800 and a credit card number written on it, and we had never talked to American Airlines. They had never emailed us, they had never asked for customer service, they’d never gone on IRC, they had never talked to us in any way, shape or form.

Man, this thing could work, we just made $800 out of the air.

MCB: I mean, there was a lot of pre-work to get them there, but obviously that was kind of different. Then secondarily, as you wrote (I’m just trying to finish a very long answer here), we started Confluence in 2004, and those two became the dual engines, and both of those I think were probably major moments. I often say Confluence is the bigger moment, actually. The business model was kind of established; this was two years into the business.
We made, I think, 800 grand in year one, $1.6 million in year two, maybe $5 million in year three, and $12 million in year four, if I remember the revenue numbers. So the thing was working really well.

You’re the company that’s the Microsoft heir in some respects, in that you took venture eventually but didn’t really need to, just pure bottoms-up. You and Scott were able to keep a huge portion of the company because of that. It’s an amazing story that is, I think, under-told in some respects.

MCB: Yeah, well, we actually did. I mean, we did and didn’t. So the venture story is one of my favorites because it describes how we think from first principles. Firstly, the first institutional capital to go on the balance sheet (I guess you could argue our initial, I don’t know, 10 grand each was some money) was in the IPO. So in 2015, when we went public, that was the first capital that went into the business, ever. We took two rounds of funding, one in 2010 and one in 2013, but both of those went to employees: the first to the founders and the second to a large number of employees who had bought in, so in both of those rounds the firms bought ordinary stock.

Secondary shares basically, yeah.

MCB: They bought ordinary stock, there were no preferences, there was no anything, that was kind of the way it was. And we love the Accel guys that invested; it’s kind of funny because their model was wildly wrong, we now have their original spreadsheets and stuff. We’re 15 years in, we know them really, really well. They wanted us to grow: I think we had to grow at 30% for two years, 20% the year after, something like that, to double or triple their money, and at the time they put in $60 mil US. That was the largest investment I think Accel had ever made in anything in the software, digital kind of world, and it was this massive bet.
It was a one-page term sheet for ordinary stock, so credit to those two partners, who took a massive risk on us and, we know, had to fight their GC and everybody else to do this unusual funding round. I think we did 50% growth the first year, and our CAGR since then is probably 40%.

Yeah, it worked out pretty well.

MCB: They did very well. I think their 2-3x was more like a 300x or something.

You mentioned the Confluence moment. Why was that a big deal? Usually the story is you have one product and you need to focus, and you’re two years old, launching a completely new product. Is that the aspect you’re referring to?

MCB: Yes, I think it comes down to being bootstrapped. Look, we spent nine years convinced we were going to die every day, there was just such a mentality that this thing was all going to fall over and we’d better work harder and keep going. The Confluence moment was important because, I remember, I don’t know exactly, but sometime around then we understood venture capital. Firstly, on the venture capital side, because they do relate to each other, there was no VC available in 2001 and 2002 in Australia. It was a nuclear winter, and we were two idiots with no credibility.

Right. You could barely get funded in San Francisco, you’re not going to get funding in Sydney.

MCB: No, because in 2001, you weren’t even finding San Francisco funding because the whole dot-com boom had just happened, no one was getting funded anyway. We’re in Australia and we have no credibility, so we didn’t even bother. Literally, 2010, when we went to the Accel thing and talked to five VCs, was the first time we’d ever pitched the business. It was just not a thing. People don’t understand: we used to say we were customer-funded. When people would ask the also-awkward question of, “Where’s your funding come from?”, we were like, “We’re customer-funded”, and they’d go, “Oh, okay”.

Lifestyle business!
MCB: But we did understand venture capital; we were massive readers. I have a library full of technical books, books about technology and the industry and history and stuff from that magic era of airport bookstores. We read every issue of Red Herring and the Industry Standard and Wired magazine, I have just this huge library, so, voracious readers. One thing we understood about venture capital is that they put portfolio theory on their side — and I’m a big fan of venture capital, I should say, I’m the chair of Australia’s biggest VC fund and that’s my other mate that I met in university, Niki Scevak. But we wanted portfolio theory on our side. We’d done finance and economics; we had one product, and that was highly risky if you’re bootstrapped. So there was a little bit of the thinking that actually, if we have two products, our chances of total failure are less, one of them can fail and we’ll be okay, and so we started a second product. Yes, arguably it was hard, but our first one was going all right, it was making, I don’t know, five million bucks a year, and we had a handful of really awesome backpacker programmers. The early people were like a whole band of misfits that somehow made this thing work, and we were having a lot of fun, we were working really hard, and so we made another internal tool that became Confluence. Being adjacent but very different, selling to different audiences, but having a lot — if you bought one, there was a good reason to have the other one, no matter which way you started — it became a really good symbiotic loop of these two engines that powered us for a very long time. So it was more a case of reducing our risk, actually, than anything else.

Wasn’t it risky to be splitting your resources, or did that not even occur to you?

MCB: I don’t think it occurred to us, no.
It was more about splitting our risk, and we were doing pretty well, but it changed the business, because we moved from being the Jira company to a software company, and I say that’s probably the most under-understood moment, because we had to learn not how to market Jira, but how to market software; not how to build Jira, but how to build software. So now we have 20, 25 apps in 5 different categories that sell to all sorts of different teams across a business, but we had to become a software company. Microsoft, I don’t know if the analogy’s really that fair to them, to be honest, or fair to us; it seems to massively over-glamorize what we’ve achieved relative to what they’ve achieved, which is amazing, and I’m a huge fan of Microsoft. They need to understand how to sell, in their case, Minecraft, SQL Server, Azure, AI; you have to understand the building, the creation of technology, the selling of technology, the marketing of technology at a generic level, and it really helped us generify the business. I think if we’d gone too much longer, everybody would’ve been on the Jira team and it would’ve been too hard to start a second thing; instead, we’ve always been a multi-product company.

You just mentioned selling a lot. When did you finally realize or transition away from just being self-serve to, “We’ve got to grow beyond this”? Was it almost like a pivot that came too late because your identity was so wrapped up in the, “We’re the self-serve company”?

MCB: Look, it’s never been a pivot, I get asked this by investors all the time. I would say our go-to-market model and our process has kept evolving pretty much every year or two for 20 years, and I say evolving because we’re very aware of the strengths of the model that we came up with and we’re very aware of what it takes to power that, and we’ve been very careful, when we’ve evolved, changed, added to it, not to destroy the original one. So nowadays, we have two amazing business models, which we call high-touch and low-touch.
So we have the low-touch model, which is literally the same thing as it’s always been: hundreds of thousands of people show up every week, they try our software, we want them to have a great experience trying the software, we want to spread it as widely as possible and into as many enterprises as we can, and some of those will stick, some of those will get working, and we measure aggressively the rates of return and dollars and flows and funnels and everything else. There’s a whole team whose job is to make sure that that’s working, at now-massive scale, right. But at the same time, what happened is that as customers got more and more Atlassian software deployed, they wanted a different relationship with us, they wanted a bigger relationship. In those days, as soon as they were spending 20 grand, we were like, “Oh man, maybe we should talk to these people”; nowadays it’s more like around 50 to 100 grand when we’ll talk to you. So the lines kept moving for different reasons, and we actually have online sales and inside sales in between, before the sort of classical someone-gets-on-an-airplane-and-travels-to-you sales. So it’s just kept evolving. We talk about the IPO a lot; it’s our 10-year anniversary coming up this month, I’m off to New York next week to ring the bell and celebrate 10 years. When we went public, as an example, we had less than 10 companies paying a million dollars a year; now we’re well north of 500, in 10 years. So that doesn’t come without an amazing enterprise sales team and teams that go out and help customers and customer success and all the trappings of a really top-flight enterprise sales organization, because for most of those customers (again, I think it’s north of 85% of the Fortune 500 that are deep Atlassian customers) we’ve become a strategic partner to these businesses. If we go down, rockets don’t take off, banks shut down; it’s of real critical importance to most of these customers.
How big is your business outside of directly working with developer teams? As I recall, part of the consulting thing was you wanting to do Jira for sales or Jira for all these different sorts of functions. Where and how did that evolve?

MCB: So it’s been a continuum for a long time. Nowadays, less than half of our users are in technology teams, and probably a third of those are developers. So developers are a portion of our audience, and it’s a very important point of wording. When I talk about this, all the engineers are like, “Hey, you don’t care about us anymore”, and I’m like, “No, that’s not true”, that business is a great business, it’s just that the rest of our business has grown massively around it. There are not enough developers in the world for our business. Our fundamental value has always been, and it’s one of those things that took us a decade to realize, firstly, that we don’t solve technology problems, we never have. We’ve never had anything that’s like, “I care what code you write, which language the code is in, what the code does”. We solve collaboration and people problems, we always have solved people problems; even Agile was a people problem. It’s not a technology problem, actually, it’s a people problem. It’s, “How do we organize a group of people to build a piece of technology that best meets the customer’s needs and goes off track as little as possible?”, and that is a collaborative people problem; we’ve always solved people problems. Our value actually came because there are a lot of tools for technology teams, and we never wanted to be in the dev tools business. That’s a road of bones, it’s very hard to build sustainable competitive advantage in dev tools; the history shows this.
There’s just a different company every few years. Developers’ tastes are fickle, our developers’ tastes are fickle, and this is not me sledging developers at all. We have a massive R&D arm and that group changes languages every couple of years, they change how they build software every couple of years, they’re constantly moving on, they change our analytics tools and everything else, because they are tool builders and toolmakers. That makes sense, but that’s a hard place to build a business. Interestingly topical today, so we’ll see. But the easier place to build a business in the long term was the level above that, which is the collaboration problems, which started as, “How do we get engineers, designers, product managers, business analysts to all be on the same page about what it is that they’re building, and have a repeatable process for that?”. It turned out that as the world has become technology-driven, as we say, our customers are technology-driven organizations. If you’re a large organization for whom technology is your key distinct advantage (it doesn’t matter whether you’re making chips and databases, whether you’re making rockets or cars, or whether you’re in financial services or insurance or healthcare; I would argue that for most of the businesses that are great, technology is their key competitive advantage), then you should be our customer, that is it. And what we help you do is we help your technology teams and your business teams collaborate across that boundary, because that’s actually the hardest boundary. Building great technology is one set of problems; making it work for your customers usually means, in different industries, different amounts of working with all sorts of business people, and that’s what Jira did from the very start. Now that’s what our whole portfolio, in service management, in strategy and leadership teams, is about: doing that at different scales and different amounts in different places.
Does it bug you when you get complaints on the Internet of, “Jira’s so complicated”, “Hard to use”, blah, blah, blah? And is your answer that the problem space you’re working in is not the single developer trying to track an issue, it’s trying to herd a bunch of cats and get them going in the same direction, and muddling through that is a lot more difficult than it seems?

MCB: It bothers me anytime people don’t like our software, sure. We’ve worked for the last 20 years to make it better every day. We’ll probably work for the next 20 years to make it better every day, and people will still probably be dissatisfied, and that is our fundamental core design challenge. There are a few reasons they say that. Firstly, the on-premise business model and the cloud shift are really important, because with the cloud shift, we update the software; with the on-premise business model, we don’t, so you would often be on older versions, customers would upgrade once a year or every two years or something, and so we couldn’t control that. Secondly, the challenge of Jira is that at our core, we solve a whole lot of what we call structured and unstructured workflows. Confluence is an unstructured workflow; Jira’s a very structured workflow. You have a set of steps, you have permissioning and restrictions, you have fields, you have what’s happening in this process. The auditor will do something and pass it to the internal accounting team, the accounting team will do this and pass it to legal, legal will do this and pass it to these people. You’re defining a workflow and you’re having information flow back and forth, and a Jira work item is, as we call it, a human reference to work. That’s the best description of what Jira is: work in the knowledge work era is this very ephemeral concept. Back to your development example: is the code the software? Is the idea the software?
Are the designs in Figma the software? These are all parts of what it is, this virtual thing that we’ve built. What we track is the human reference to that, so someone can say it’s the new admin console. Cool: here’s the design for the admin console, there’s the spec for the admin console, there’s the code for the admin console, here’s where it’s been tested, here’s where it’s deployed. Did customers like it? We need a reference to this thing that is otherwise spread across hundreds of systems and virtualized. Once you’re building a workflow system, companies, ours included, love process, we love workflows, we love control, and that control usually comes with more data. “Hey, don’t fill in these three fields, fill in these 50 fields”, and they’re all required for some reason, and our job is to say to customers, “Do you really need 50 fields?”, because you’re creating a user experience-

You’re ruining it for us!

MCB: Your users are going to have to fill in all 50 fields, and it feels like that’s going to take you a while. We have customers — I went back and checked, and I think almost every single person you’ve interviewed on your podcast is a customer of ours. I don’t know if it’s 100%, but it’s definitely north of 95% of the last 20 guests.

Stratechery is a customer of yours, so there you go.

MCB: Oh, really? Well, there you go. Thank you.

One of my engineers adores Jira, so I get the opposite angle from what I asked about.

MCB: That’s right. So look, it’s a challenge for sure, but at the same time, man, the value we’ve created, the business value, the number of customers that run on it; it’s ironic, we talk about the AI era and all these other things. Literally, no chips go out of any of the chip companies you love talking about without running through our software, every single one of them, soup to nuts.

So at what point did you realize that AI was going to impact you in a major way? Was there an “aha” moment, or has it just been in the air?
Or was there a specific time you realized, “Look, this is going to completely change what we do”?

MCB: Again, I’ve realized I’ve become the old man in the room. We’ve done machine learning for a long time in lots of ways because of our online business model, so I’d say we’ve done AI for a long time. Obviously, LLMs are what people refer to nowadays by AI, along with agents and these words that have corrupted the entire thing; the meaning of a word in technology changes when it comes to mean something else. The launches of the various versions of ChatGPT were very instructive, obviously; they were a moment for everybody. On the optimism side, and I would say we’re massive AI optimists, it is the best thing that’s happened to our business in 25 years.

Why? Because people might look at you from the outside and say you’re still characterized as — even though your business expanded far beyond developers — “Oh, you have a lot of developers”. I’m skipping over the transition to the cloud just because we’re running out of time, but it’s an interesting story; you did announce you are finally ending the on-premises software, and I’m curious whether it was a sentimental moment to come to that decision. But people might look at you from the outside and say, “Oh, there’s a company that’s going to have a problem with AI, AI is going to replace developers, it’s the decreased seats. What are they going to do?”

MCB: There’s a few ways to take that.

I’m trying to put it on a tee for you. I think I know what you want to say.

MCB: There’s a few ways to look at it. Firstly, I think AI is a good example where people are very concrete about the negatives while the positives are upside. I think it’s a huge force multiplier personally, for human creativity, problem solving, all sorts of things; it’s a massive positive for society. That doesn’t mean there aren’t any negatives, but the net effect is really high.
And we spend a lot of time, you hear it in the media, talking about the job loss, the efficiency gains, whichever way you want to put it. Well, that’s because it’s really concrete in a spreadsheet: “I can do this process with half as many people”, “Wow, look at that, that’s great”. What’s never written in the spreadsheet is all the new processes that get created, all the new ways of doing things, the quality of the output being twice as high. If software costs half as much to write, I could do it with half as many people, but core competitive forces in the economy, I would argue, mean I will need the same number of people; I will just need to do a better job of making higher-quality technology. So our view on AI overall is that it’s an accelerant, not a replacement, for everything we do, and just the next era of technology change, which is really positive. We’ve loved technology: we love the cloud, we love all the tech changes we’ve been through, mobile. Look, as a business, we are in the game of knowledge work. We solve human problems, workflows, business processes, this is what we do. These largely revolve around text, or if it’s video nowadays, that can be reduced to text in various ways. LLMs allow us to understand that text in a massively deeper way than we ever have been able to, and the problems we solve aren’t going away. In 20 years’ time, there’ll be groups of people trying to solve some sort of problem as a team and working on a project, so these things aren’t going to go. They’re going to need to talk to each other and collaborate on what work’s going on and how it’s working, so the textual aspect of it has been amazing. The features we’ve been able to ship, we never could have built five years ago, it was literally impossible, so the ability to solve customer problems is so much higher than it ever has been. Secondly, our software is incredibly valuable at the core of these workflows, but it’s also incredibly promiscuous.
What I mean by that is we have always been very highly interlinked with everything else. If it’s a sales team, there are links to Salesforce and customer records, there are links to internal systems, there are links to maybe features that need to be built, there are links to some content and documents. So with any Jira, Confluence, or Loom, you don’t record a Loom unless you’re talking about something, you don’t have a Jira issue without pointing to all sorts of different resources, whether that’s GitHub or Figma, whether it’s Salesforce or Workday. That gives us really unique knowledge, which we’ve turned into the Teamwork Graph. That actually started pre-AI; the irony is the Teamwork Graph is about six years old.

Well, it started with Confluence. This is the whole thing where you look backwards, and to your point, it’s not that you had just been the Jira company; because from the very beginning, Confluence was different but adjacent, you had to build the links and stuff together, and that continued as you built all these different tools, because everyone wants to be this point of integration. And I wanted you to tell me about Rovo and this idea of being able to search across all your documents. Who gets permission to do that? It’s someone that’s already there, and you made the critical decision to be there back in 2004 or whatever it was.

MCB: That’s true. Certainly back in 2004, and then in, I think, 2019, the Teamwork Graph starts, which is trying to take all of those links and turn them into a graph. The connectivity, two things linked to this Figma thing, five things linked to this customer record — okay, cool, that means something, so we built this Graph. To be honest, it was a bit of a technology lark. We have a lot of these projects that are really cool and we were like, “We’ll be able to use this somehow and it’s going to grow”, and now it’s a hundred billion objects and connections connecting all of the company’s knowledge.
It becomes the organizational memory nowadays, and context, and all these things; nobody knew in 2019 that that’s what it was going to be, it just seemed we needed it for various process connections. That turns out to be the best resource to point AI at in various forms, because it’s got permissions and compliance and all of the enterprise stuff built in, which is incredibly difficult. You still have to be good at the AI parts to get the knowledge, the context for any area, so the Teamwork Graph is our data layer. It’s not only the best kind of enterprise search engine for your content in a 10 Blue Links kind of way of thinking; if you’re chatting with your content, you still need all your organizational knowledge. I actually, obviously, found your article. I was like, “Hey, what has Ben Thompson written about us in the last year?”, and I asked Rovo in chat and it came back to me with, he wrote this, that and the other, and pulled out some snippets. I’m like, “Tell me more, do you think we’ve hit that?”. I literally got a report written by Rovo on your article as to whether it had been accurate: “Go look at the last 10 years with deep research and web search and come back and tell me, was he right or wrong?”, and it gave me a really interesting analysis of whether you were right and wrong. Like most AI things, it’s like 90% correct, it’s pretty good. It solved a lot of the first problem, and I would not have done that work otherwise. I would have read it quickly, and I wasn’t going to put an analyst on it internally to do this work, but I could send something to do work I never would’ve done.

Who’s your competitor for this spot, for this Rovo position where you have all this context, and you can actually search your company in a way that just wasn’t possible previously?

MCB: Who are the competitors, you say?
Yeah, because everyone is claiming they’re in this spot, “We can be the central place that you go and we have visibility everywhere”, so why is Atlassian the one that’s going to win that space?

MCB: A few reasons why we will, though I think “we have a great chance to be a great player” is maybe the easiest way to say it. Everybody loves this absolute-win position; we don’t believe that in enterprise technology you usually get these absolute wins, it’s not quite the same as in the consumer world. We have a lot of business processes and workflows, millions every day, that run through us. Those are human collaboration workflows, so they are core. The auditing team hands off to the accounting team, hands off to the tax team, whatever it is, sales workflows, marketing workflows, and they span lots of our applications and many others. If you’re going to go and introduce agents, these autonomous AI-driven software programs, whatever you want to call an agent, you’re going to put them into existing processes to make those processes either more efficient or more accurate. When the human picks up a task, it’s got all the information they need because something’s gone out to find it. That is an incredibly powerful position, which is why we support our agents and everybody else’s. You can assign a Jira work item to a Cursor agent in terms of code, you can assign it to a Salesforce agent. Whatever your agent technology of choice, I don’t think you’re going to have one agent platform, I think you’re probably going to have multiples; there are going to be a handful of organizational knowledge graphs that are powerful enough to solve these problems across multiple tools, and we have access to all those tools. We already know the information to some level, and that becomes a very unique advantage.

Do you see this as a way to expand even further how much of a company you cover?
You started with developers, then you expanded to adjacent teams, and you talk about how that’s now just a fraction of your user base. Do you own entire companies, or could you get there? It’s like, “Okay, we still have these teams over here that are not on Jira, but Rovo’s so good that we need to bring everyone in”?

MCB: Look, again, it would be great. I think it is unrealistic.

And you should say “Absolutely”, right?

MCB: If [Salesforce CEO Marc] Benioff was here, he’d be like, “Absolutely, we’ll own the world”, we love him, that’s the way he is, but I don’t think about it as owning a customer. Our mentality has always been, I always use the subway analogy, versus we have some competitors, for example, that want to be the control tower. Their whole thing is, “We’ll be the control tower, just give us control and we’ll go and control everybody else, we’ll move the planes around”. I think in enterprise IT, that’s an unrealistic view; every CIO has been sold this for decades, and it doesn’t happen because the world changes too quickly. Our philosophy and our commitment to customers has always been that we will be a great citizen on all sides, we will interact with all of the applications you need, the old ones and the new ones, and we will be a valuable point of exchange in your business workflows and processes, whether those are structured like in Jira, or unstructured like in Loom or Trello or something else. The reason for that is you have lots of systems. We want to be a valuable station on your subway network; we don’t want to be at the end of one of the lines, we want to be one of the handful of hub stations that are about moving trains around, and that is the best way to get your knowledge moving in your organization, it’s the best way to deal with your processes. Therefore, we need to have amazing AI capabilities.
We have a massive investment in R&D, we have thousands of people working on AI tooling at the moment, and we have a huge creation bent, which is one of the reasons I think — we’ve talked a bit about the data advantage we have, I think we have a huge design advantage, and I actually think design is one of the hardest parts of building great AI experiences because it’s real fundamental design for the first time. You had a great line, you did a podcast a couple of weeks ago that I’ll put a link to, but you mentioned basically, the customer should not need to understand the difference between deterministic and probabilistic in the context of design, that’s what you’re driving at here. MCB: They should not need to understand that, they should need to understand when outcomes, outputs may be wrong or may be creative. Again, you talk a lot about the fact that hallucination is the other side of creativity, right, you can’t have one without the other. Hallucinations are a miracle. We have computers making stuff up! MCB: Our job is to explain to a customer when that happens, so it’s like this might be something you want to do, and that requires a lot of design. We have a feature in Jira called Work Breakdown which is super popular, where I can take a Jira issue and say, “Make me a bunch of sub-issues, this task has to be broken into a set of steps”. I don’t believe in the magic button theory of AI, that I’ll just hit a button and it’ll do all the things, I believe deeply in the value from AI will come from human-AI collaboration in a loop. It’s me and the AI working back and forth. You talk about yourself and Daman quite a lot , and it’s you, Daman and ChatGPT working together, but it’s not like you ask one thing and it’s done. It’s an interaction, it’s a collaboration back and forth, and that’s going to happen everywhere. 
In Work Breakdown, what it does is it says, “Hey, based on these types of documents I’ve gone to find from your whole graph in Google Docs and Confluence, whatever, I think this piece breaks down into these, is that correct?”, and it goes, “No, actually, that one doesn’t make any difference, these two are really good, you forgot about this document”, “Cool, let me go do that for you again”, and come back and say, “Is it these?”, “That’s closer”, and then you’re like, “That’s good enough, it’s 90% of what I need”, and then I go add the two that I need myself. That is a huge productivity boost but it’s not magically correct, and it requires a lot of design to tell people, “These are not the answers, these are possible answers, help us refine them and get better at it so that you get the 90% upside and the 10% downside is managed”. Are all these people pursuing these full agents that act on their own, are they just totally misguided? MCB: No, because I think, well, agents will take — there’s a snake oil sales thing going on as there always is in any bubble, and the snake oil sales is not wrong, it’s just chronologically challenged. (laughing) That’s so good. MCB: Well, customers are struggling. When I talk to customers every day, they’re like, “Is everyone else using these things to just magically transform their business with this simple, it took them five minutes and it’s replaced entire armies of people?”, and I’m like, “No, nobody’s doing that”. What they’re actually doing is taking business processes that are really important to their business and saying, “Okay, can I make this step better? This is highly error-prone. It’s compliance in a large organization, how do I make this part of the process better?”, and we’re like, “Oh, we can totally do that”, and they will replace small bits of lots of processes so that in Ship of Theseus style, five years from now, the process will look radically different. 
Occasionally, they are replacing entire processes, but this is the 1% case, what they’re actually doing is they have whole machines that are running and they’re trying to fix this cog and fix that cog, and that’s super valuable for them. That’s not a downside, that’s really, really valuable. And often, it’s work they didn’t want to do, work that wasn’t getting done, it wasn’t done at a high quality, so we got to remember that, I say this quite a lot, people shouldn’t be afraid of AI taking their job, I fundamentally believe this, they should be afraid of someone who’s really good at AI taking their job. That’s actually what’s going to happen, is someone is going to come along, in a sales sense, they’re really good at using all these AI tools to give better customer outcomes or handle more customers at one time. Is this why you’re hiring so many young people? MCB: Yes, I guess so. Yes, they’re more AI-native, they come out understanding these tools and technologies. I find the biggest irony in universities is all these people who “cheat” their way through every assignment, I use cheat in quote marks, using ChatGPT to handle these assignments, and then they’re worried AI is going to take all these jobs. I’m like, “Wait, you literally took your own job of writing the assignment, but you’ve also trained yourself on how to use these tools to get the outcome required” — now one might argue the university degree should be different, but just like when Google came along and you could look up any fact, knowing facts became far less important than the ability to look it up. I still think AI, it doesn’t create anything, maybe slightly controversial, but I argue it synthesizes information, it’s really good at processing huge amounts of information, giving it back to you, changing its form, bringing it back. Humans are still the only source of fundamental knowledge creation. 
I point out one of the flaws in the one person billion dollar company argument, and this will happen but it’ll be an anomaly. That company doesn’t get created without that one person, so there’s not AI creating companies magically. It’s like can a company eternally buy back its stock? No, because at some point, someone is going to own the final share? MCB: That’s right and I think this is missed, right? This is where we say it’s about unlocking creativity and what we do for our customers is put Rovo and these amazing data capabilities that we have alongside all the enterprise compliance and data residency, and there’s a massive amount of making this work in the enterprise with trust and probity and security. It’s very difficult. And great design to say, “What do you hire us to do? How do you get these technology and business teams to work together? What workflows do you have in your projects and your service teams, and how can we make those workflows better with more data and make your teams more informed?” That will end up with us having more share of employees in a business that use our stuff every day. Awesome. You made two big acquisitions recently, the DX acquisition , I think, makes a ton of sense to me measuring engineering productivity, particularly in the area of AI. What actual ROI are we getting on this? MCB: And how much money am I spending? Because I’m spending suddenly a lot of money, right? This is not cheap at all, I have huge bills. Internally, we use Rovo Dev , we use Claude Code, we use GitHub Copilot, we use Cursor, we have them available to all. We have a huge R&D — again, I think we’re still number one on the NASDAQ for R&D spending as proportion of revenue. You can take that as a good thing in the AI era or a bad thing, everyone gets to choose their own view on that, but we’ve always been incredibly high on R&D spending since day one. 
The bills that we pay though are very high, so DX is simply saying, “Okay, cool, how do I measure what I’m getting for that? Should I pay twice as much money because these bills are worthwhile, or is there a lot of it that’s actually just it’s really fun and it’s not actually leading to productivity gains?”. This is going to be a hard problem because there’s a lot of money on the line at the moment that people are paying for these tools, which is not without value, but measuring exactly what the value is is really, really hard, and that team’s done a phenomenal job. And we now have an Atlassian office in Salt Lake City, Utah, where I already spend a lot of time. Totally by coincidence, but it’s really nice. So that purchase, love it, makes a ton of sense. In perfect alignment with you. How does The Browser Company fit in? MCB: A lot of ways. So I have believed for a long time that browsers are broken. We’ve built browsers for an era of software that we don’t live in today. And I don’t, in my browser, have a bunch of tabs that represent webpages, I don’t have that. I have a bunch of tasks, I have a bunch of applications, I have a bunch of documents, and the browser was fundamentally never built to do that. That’s what Arc, first product from The Browser Company — if you don’t use Arc every single day, you should be, it’ll increase your productivity instantly because it’s built for knowledge workers and the way that they have to actually work every day and how they manage all of these tabs and tasks and flows versus serving the New York Times or whatever. That is a browser built for knowledge workers, and there’s a lot more we can do in that era as software changes. Secondly, obviously AI has come along, and we now have chats and applications as an extra part of the browser experience, so I think we can change how enterprises use browsers, security being a big issue.
I think AI in the browser is a really important thing, but I suspect it’s not in the basic way of just combining Chrome and ChatGPT, that’s not how it’s going to play out. I suspect it requires a massive amount of design, which The Browser Company is phenomenal at, and it requires changing how people use their day-to-day applications. From our point of view, and I’ve been an Arc fan since day one, [The Browser Company CEO] Josh [Miller] and I have known each other a long time, there’s a knowledge worker angle and there’s obviously a business angle to it in a huge way that our customers are knowledge workers. We can change the way they do their work in a meaningful way of productivity, that is exactly what we have been trying to do in a lot of different ways. The browser itself, being chromium-based, Edge being chromium-based, Chrome being chromium-based, the rendering of webpages is not the problem, it is the fundamental user experience of, “How do I take all of my SaaS applications, my agents, my chats, my tabs, my knowledge, and put it all together in ways that make my day quicker?” — that is what we are trying to do fundamentally at the start. The context that we have is incredibly important for that. And the browser has, if you think about it, my personal memory. We used to call it the browser history. Great, it shows what I’ve seen, it does not have my organizational memory, which we have a great example of in the Teamwork Graph. So if I can put these things together, I can make a much more productive browsing experience for customers fundamentally in that world. I think we have an amazing shot of doing that and of changing how knowledge workers use SaaS. We’re not trying to make a browser, as I’ve said, for my kids, we’re not trying to make a browser for my parents, we’re not trying to make a browser for shopping or for anything else. 
We’re trying to make a browser for people who spend all day living in Salesforce and Jira and Google Docs and Confluence and Figma and GitHub, and that is their life. The laptop warrior that sits in that experience, I believe we can use AI and design to make that a far better experience and build an amazing product. They’re well on the way to doing that, we can supercharge doing it. You look skeptical. No, I’m looking at the clock, I skipped over a huge section. Your whole shift to the cloud, all those sorts of things. However, there is one thing I wanted to get to: you are wearing an Atlassian Williams Racing hat , I am a big F1 fan, I was very excited about you doing this . How did that come about? How was the first year? Was this another hunch this is going to work out? I mean, Williams is looking like a pretty good bet. MCB: Yes, our world’s largest sports bet. Look, how did it come about? So how do I make a short answer? F1 is changing, I think, in a massive way. I know now being incredibly deep in the business of it, the fundamental change is that hardware is becoming less important and software is becoming more important, this is a trend that we are used to. JV, James Vowles , the Team Principal, was the first person that approached us a long while ago now to help them, and for a teeny, teeny sticker in the corner, to help them get more productive as a team. What people don’t realize about F1 is these are large organizations, right? There’s 1100 people that work for Atlassian Williams Racing. And Williams was really pared down and skinny, he was brought back in with new owners to actually rebuild the entire thing? MCB: Yes, they were in deep trouble. But in rebuilding it, he is a software engineer, software developer by trade, by history kind of thing. He’s a technically-minded person. He downloaded Jira himself in 2004 to install it, so he knows us quite well. 
So we were brought on for our ability to help them with their teamwork and their collaboration, they really needed a technical upgrade to a whole lot of their systems. Turns out they need us in almost every part of their business because the service workflow’s important. We’re now in the garage, we’re using tons of AI to try to make them better, so there’s a lot of things we can build to hopefully help them win, and it’s a mission you can fall in love with. Here is one of the most storied brands in Formula 1 that’s fallen on tough times, every sportsperson loves a recovery story. And I was sold early on the recovery story, I’m like, “Fuck it, let’s go help, let’s make this happen. Let’s get back to being a championship team”. So we fell in love with the mission, and JV is super compelling, he’s got a one-decade goal, and they’re very goal-driven, and we love that, but they needed a lot of help, so that’s what they asked us for help with initially. The more we looked at it, the more we learned about Formula 1, yes, it’s becoming a software-driven sport. So as an example, Atlassian Williams, I believe, have twice as many software developers as the next team on the grid. Because it’s cost-capped, you got to choose, “Do I hire a software developer or an aerodynamicist?” — it’s a very clear cost cap, you’re choosing where to put your resources. As virtualization and everything get better, it’s less, “How well can I draw a curve?” and more, “How much can I help 1100 people work together, and how can we build great software”, which really is the core of the car, right? So that then comes to us, tiny sticker, probably a founder-ish moment where I’m like, “How much is the sticker on the top?”, and they didn’t have a sticker on the top and I’m like, well, “What would that get us?” So we ran the numbers on that and the reason is twofold. You talked about our GTM, our go-to-market transformation, we have an ability to build various things.
Firstly, branding is obviously massive, top three teams get 10 times the branding as the bottom three teams. So if you’re going to make a sports bet, you pay for a long period of time with the bottom three team, you help make them a top three team, and your sport bet pays out really well just on a sheer TV time and etc — the number of staff, parents, and other things, have said to staff members, “Hey, that company you work for, it’s really great, I saw them on the TV on the weekend”, and the staff member will say, “Dude, I’ve worked there for 12 years, why do you suddenly know about it?”, “Oh, I saw them driving. Carlos [Sainz Jr.] is great”, or something. And he is! So obviously, there’s a huge marketing and branding angle that’s about their position being better. The really interesting part of what we’re doing there is we have customers all around the world, we have customers in 200-odd countries, and we can’t go and visit all of our biggest customers in a meaningful way. We certainly can’t take them to some of our best and most exciting customers, right? There are electric car companies that use our stuff that we’d love to take many customers to a factory, or rockets, or whoever, I can’t take many customers into some of your favorite chip companies and say, “Look how they use our stuff”, I can maybe get one or two customers a year into that customer and show them how they use our things. With Formula 1, what we’re building is a mobile EBC, so an executive briefing center. Formula 1 goes around the world. It goes to Melbourne, it goes to Singapore, it goes to Japan, it goes to England, it goes to various parts of Northern Europe, it goes to various parts of America and you’re like, “Hey, where are our customers?” — roughly distributed like that. It comes to town, we can invite a whole lot of customers into a great experience, we can tell them a lot about Atlassian software, we can also invite them into one of our best customers. 
They can sit in the garage, and I can tell them how our service collection is helping power the assets, that when that wing’s broken, it gets known here, and they start making a new one back in the factory in Oxford, and this one gets shipped around the world and another one will get moved. And, “Here, I can show you the asset management and the service that goes along with it, I can show you how the garage is getting more efficient because of us, I can show you how we’re helping them win races”. We don’t drive cars, we help them be more productive as a team, and I can do that in an exciting environment. They can drink a great latte or a champagne or whatever they want, and I can explain to them how we are transforming this business in a meaningful way with our tools no matter which way they want to look at it, which is the most powerful customer story that you can go and tell a couple-hundred customers a year in their city. We come to their city, right? I was in Montreal, I took a whole bunch of Canadian customers over the three days, they were like, “This changes my view of Atlassian”, and I’m like, “That’s exactly our goal”, that is at the enterprise end of enterprise sales though, right? But that’s the ironic thing, it’s as far away from where you started as you could be. MCB: Well, they didn’t get there. I met two Canadian banks we had in Montreal as an example, both of whom had been customers for over 20 years, they started spending $800 or maybe $4800 as we moved our pricing to around five grand — now they spend a million, two million dollars a year, and they could be spending ten. We have the ability to give massive business value across a far larger swath of their business. And I can say, “What do you use from our system of work today? What could you use? Let me show you how Williams uses that piece of the system of work”, which is just a very visceral and exciting customer example to show them how they’re winning.
And it helps, again, culturally, super aligned. They’re an awesome group of people trying really hard to win in the most ridiculously competitive sport and the highs are highs, the lows are low. Any sporting fan, you’re well familiar with various different sports that we have in common, but this is technology built by a large business team that has to win a sport. That doesn’t happen anywhere else in the sporting world, I would claim. Giannis [Antetokounmpo] doesn’t make his own shoes and have a team of people making better shoes and a better basketball so he can win, that doesn’t happen in other sports. It’s all about the people on the floor in an NBA game as to who wins, and that’s great, don’t get me wrong, I love basketball. The work in Formula 1 is done by 1000 people back in Oxford. It’s a Constructors’ Championship. MCB: The Constructors’ Championship I do think should be more important, especially given the current exact week we’re in, which is an amazing week for Atlassian Williams Racing, second podium. You talk about that bet, I told JV at the start of the year what I thought. He’s like, “What do you think our five-year future is?”, and I said, “Look, I think, number one, we’ll get one podium this year, 2025; 2026, we’ll win a race; and by 2030, we will have won a championship, that is my OKRs [Objectives and Key Results]”, and he said, “Oh, wow, okay, yeah I think so”. It lines up, I know the team OKRs and other things. And we won two podiums this year, so I was wrong, and I think we have a great chance for 2026, and we are working hard to make the team better and the single-best customer example we have of every piece of software that we sell. Mike, I’d love to talk again. It was great talking to you again. And, hey, good luck. And I’m a Williams fan, so I’ll be cheering for you this weekend. MCB: Oh, yeah. Well, I’m not sure this weekend, but 2026, 2027- Okay. I’m kind of kissing up, I am dying for Max [Verstappen] to win is the honest truth.
I need the McLarens to run into each other. But other than that, Williams is my second love. MCB: Do you think McLaren will issue team orders to switch them if Oscar is in second and Lando’s in fourth? Yes. And I don’t know what’s going to happen if that happens, and this will be fascinating. MCB: We will have to see. It’s going to be a huge week. But that’s what makes the sport exciting, right? The whole thing is amazing. Talk to you later. MCB: All right. Thanks, man. This Daily Update Interview is also available as a podcast on Stratechery.

Karan Sharma 6 days ago

Fixing a CIBIL Score Disaster with AI

About a month ago, I downloaded my CIBIL report expecting a routine check. Instead, I found loans from lenders I had never interacted with, written-off accounts, overdues from fintechs I had never installed, and even two-wheeler loan enquiries. I don’t even ride a bike. My credit score had collapsed to under 680.

I stared at the report trying to understand how this could happen. Buried in my profile section was the problem: my date of birth was wrong. Not a typo, but a completely different year. Because of this mismatch, CIBIL’s system had paired my PAN and mobile number with someone else’s DOB, effectively merging two individuals’ credit histories into one report. The accounts mapped to me (the full list appears at the end of this post) ranged from written-off loans to accounts 90+ days overdue to others still active. On paper, I looked like a serial defaulter.

I opened ChatGPT and uploaded the entire PDF with a simple prompt: identify everything wrong in this report. Within minutes, it had mapped every suspicious account, flagged which ones didn’t match my history, highlighted the incorrect DOB, and explained why CIBIL systems mis-map accounts when demographic data is inconsistent. More usefully, it drafted formal dispute letters citing relevant RBI regulations and prepared lender-specific escalations with the right legal language. It felt like having a credit compliance team on demand.

With the AI-drafted communications as a starting point, I sent disputes to CIBIL and direct emails to each lender. The key was being specific: every email included the CIBIL report control number, the exact account identifiers from the report, and references to specific RBI regulations. For example, when writing to Poonawalla Fincorp about a co-lending arrangement with Kissht, the email quoted the CIBIL report details (control number, account number, and delinquency trail, reproduced at the end of this post) and stated:

“I reiterate that I have never applied for, signed, or availed any facility from Poonawalla/Kissht. This appears to be erroneous mapping / data contamination.”
The emails also cited the relevant regulations explicitly:

“Under Section 45-A(2) of the Credit Information Companies (Regulation) Act 2005 and Para 7.2.2 & 8.1.3 of the RBI Master Directions on Credit Information Companies (2021), please verify this record against your origination/KYC systems. If the record is not verifiable or was created with misused/incorrect KYC, immediately instruct TransUnion CIBIL to delete/correct the entry.”

This kind of precise, regulation-backed language gets results. Vague complaints are likely to get ignored or deprioritized. Specific complaints with control numbers, account IDs, and regulatory citations get escalated to teams that can actually fix things. For co-lending cases (common with fintechs like Kissht, Ring, etc.), I learned to CC both parties and explicitly request a “consolidated correction” so the entry gets fully removed rather than bouncing between two institutions.

When initial responses were slow, I sent reminders that referenced the original complaint number and the 30-day statutory deadline:

“This is a reminder regarding my complaint Ref No. [REDACTED]. The acknowledgement stated that the issue would be resolved by [DATE], yet I have not received any confirmation. Failure to resolve within the statutory period will leave me with no option but to escalate to the RBI Integrated Ombudsman.”

CIBIL started closing disputes. One by one, accounts were removed. Eight fraudulent accounts were purged in the first wave. But there was a catch: even after the fraudulent accounts were removed, my DOB was still wrong. CIBIL kept closing my DOB correction disputes without actually fixing the underlying data. Their responses were templated and generic, treating it like a lender issue when DOB is actually a CIBIL demographic field that they control directly. This required escalating to the Nodal Officer with a sharper tone:

“My Date of Birth correction dispute has been closed twice, yet my DOB remains incorrect in every new CIBIL report. This is a CIBIL demographic field — it is not lender-controlled and should have been corrected immediately once KYC was submitted. Because of this incorrect DOB, my profile was wrongly merged with another individual’s data. Although many wrong accounts have been removed, the root cause remains uncorrected — the wrong DOB is still mapped, and therefore the risk of future wrongful linkages still exists.”

Only after escalating to the Nodal Officer did the DOB finally get corrected. Once that happened, the system stopped associating the other person’s accounts with my profile. It was an algorithmic identity collision, and fixing the DOB resolved it. My latest CIBIL report shows the correct date of birth, zero fraudulent loans, no written-off or overdue accounts, and a score back in a healthy range. Only my actual accounts remain.

Credit bureaus are not infallible. A single incorrect demographic detail (in my case, a mismatched DOB) can cause wrong loan mappings, score drops, false delinquencies, and a complete distortion of your financial identity. The resolution required documentation, persistence with escalations, and an understanding of RBI regulations. AI made the last part significantly easier. Instead of spending hours researching dispute procedures and drafting formal letters, I could focus on gathering the right documents and following up with the right people. If you haven’t checked your CIBIL report recently, it’s worth verifying that your basic details are correct: DOB, PAN, address, mobile, email. One wrong field can create problems that take weeks to untangle.
The fraudulent accounts mapped to my profile:

- Aditya Birla Capital: Short-term personal loan marked as doubtful/substandard
- Clix Capital: A loan marked written-off (₹50,000+)
- Poonawalla Fincorp: Personal loans with delayed payments
- Ring / Kissht: Unsecured digital loans
- InCred: Personal loan I never took
- Dhani Loans: BNPL-style loan with unrecognized activity
- Axio (Capital Float): Old consumer loan
- KrazyBee: Various short-term loans
- Transactree: Small-ticket personal loan
- Multiple enquiries from HDFC, ICICI, IDFC First, Shriram Finance, and others

CIBIL report details quoted in the Poonawalla/Kissht dispute email:

- Control Number: [REDACTED]
- Downloaded on: [DATE]
- Where your name appears: “POONAFIN – Personal Loan – Account No. [REDACTED]”
- Delinquency trail in history: DPD values 35 / 62 / 93 / 124 during [MONTHS]
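The anatomy of those dispute emails can be sketched as a small template. This is an illustrative sketch only — the field names and wording below are not the exact letters I sent, just the pattern of specificity (control number, account identifier, regulatory citation) that got complaints escalated:

```python
# Illustrative sketch of the dispute-email pattern: every email carries the
# report control number, the exact account identifier, and a regulatory
# citation. Field names and wording are examples, not the actual letters.

DISPUTE_TEMPLATE = """\
Subject: Dispute of erroneous account - CIBIL Control No. {control_number}

To {lender},

My CIBIL report (Control Number: {control_number}, downloaded on
{downloaded_on}) shows the following account that I never applied for,
signed, or availed:

    {account_line}

Under Section 45-A(2) of the Credit Information Companies (Regulation)
Act 2005 and the RBI Master Directions on Credit Information Companies
(2021), please verify this record against your origination/KYC systems.
If the record is not verifiable, immediately instruct TransUnion CIBIL
to delete/correct the entry.
"""

def draft_dispute_email(lender: str, control_number: str,
                        downloaded_on: str, account_line: str) -> str:
    """Fill the template with the specifics that get a complaint escalated."""
    return DISPUTE_TEMPLATE.format(
        lender=lender,
        control_number=control_number,
        downloaded_on=downloaded_on,
        account_line=account_line,
    )

email = draft_dispute_email(
    lender="Poonawalla Fincorp (co-lending partner: Kissht)",
    control_number="[REDACTED]",
    downloaded_on="[DATE]",
    account_line="POONAFIN - Personal Loan - Account No. [REDACTED]",
)
print(email)
```

For co-lending cases, the same draft would go to both parties with a request for a consolidated correction.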

Ahead of AI 1 week ago

A Technical Tour of the DeepSeek Models from V3 to V3.2

Similar to DeepSeek V3, the team released their new flagship model over a major US holiday weekend. Given DeepSeek V3.2’s really good performance (at GPT-5 and Gemini 3.0 Pro level), and the fact that it’s also available as an open-weight model, it’s definitely worth a closer look.

Figure 1: Benchmark comparison between DeepSeek V3.2 and proprietary flagship models. This is an annotated figure from the DeepSeek V3.2 report.

I covered the predecessor, DeepSeek V3, at the very beginning of my The Big LLM Architecture Comparison article, which I kept extending over the months as new architectures got released. Originally, as I had just gotten back from Thanksgiving holidays with my family, I planned to “just” extend that article with another section on the new DeepSeek V3.2 release, but I then realized that there’s just too much interesting information to cover, so I decided to make this a longer, standalone article. There’s a lot of interesting ground to cover and a lot to learn from their technical reports, so let’s get started!

While DeepSeek V3 wasn’t popular immediately upon its release in December 2024, the DeepSeek R1 reasoning model (based on the identical architecture, using DeepSeek V3 as a base model) helped DeepSeek become one of the most popular open-weight models and a legit alternative to proprietary models such as the ones by OpenAI, Google, xAI, and Anthropic.

Figure 2: DeepSeek V3/R1 architecture from December 2024. We will revisit and discuss architectural details in a later section.

So, what’s new since V3/R1? I am sure that the DeepSeek team has been super busy this year. However, there hasn’t been a major release in the last 10-11 months since DeepSeek R1. Personally, I think ~1 year between major LLM releases is reasonable since each is A LOT of work. However, I saw on various social media platforms that people were pronouncing the team “dead” (as a one-hit wonder).
I am sure the DeepSeek team has also been busy navigating the switch from NVIDIA to Huawei chips. By the way, I am not affiliated with them, nor have I spoken with them; everything here is based on public information. As far as I know, they are back to using NVIDIA chips. Finally, it’s also not that they haven’t released anything: a couple of smaller releases trickled in this year, for instance, DeepSeek V3.1 and V3.2-Exp.

Figure 3: DeepSeek releases since last year. The main models are shown in red.

As I predicted back in September, the DeepSeek V3.2-Exp release was intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model. V3.2-Exp and V3.2 use a non-standard sparse attention variant that requires custom code, but more on this mechanism later. (I was tempted to cover it in my previous Beyond Standard LLMs article, but Kimi Linear was released around then, and I prioritized it for that article’s section on new attention variants.)

Before discussing further model details, it might be worthwhile to discuss the overall model types. Originally, DeepSeek V3 was released as a base model, and DeepSeek R1 added additional post-training to develop a dedicated reasoning model. This procedure is summarized in the figure below.

Figure 4: Overview of the DeepSeek R1 training pipeline. This figure is from my more detailed Understanding Reasoning LLMs article.

You can read more about the training pipeline in the figure above in my Understanding Reasoning LLMs article. What’s worth noting here is that DeepSeek V3 is a base model, and DeepSeek R1 is a dedicated reasoning model. In parallel with DeepSeek, other teams have also released many really strong open-weight reasoning models. One of the strongest open-weight models this year was Qwen3. Originally, it was released as a hybrid reasoning model, which means that users were able to toggle between reasoning and non-reasoning modes within the same model.
(In the case of Qwen3, this toggling was enabled via the tokenizer by adding/omitting tags.) Since then, LLM teams have released (and in some cases gone back and forth between) both dedicated reasoning models and Instruct/Reasoning hybrid models, as shown in the timeline below.

Figure 5: The timeline of some of the reasoning and hybrid models released this year.

For instance, Qwen3 started out as a hybrid model, but the Qwen team then later released separate instruct and reasoning models as they were easier to develop and yielded better performance in each respective use case. Some models like OpenAI's gpt-oss only come in a hybrid variant where users can choose the reasoning effort via a system prompt (I suspect this is handled similarly in GPT-5 and GPT-5.1). And in the case of DeepSeek, it looks like they moved in the opposite direction, from a dedicated reasoning model (R1) to a hybrid model (V3.1 and V3.2). However, I suspect that R1 was mainly a research project to develop reasoning methods and the best reasoning model at the time. The V3.2 release may be more about developing the best overall model for different use cases. (Here, R1 was more like a testbed or prototype model.) And I also suspect that, while the DeepSeek team developed V3.1 and V3.2 with reasoning capabilities, they might still be working on a dedicated R2 model.

3. From DeepSeek V3 to V3.1

Before discussing the new DeepSeek V3.2 release in more detail, I thought it would be helpful to start with an overview of the main changes going from V3 to V3.1. I already discussed DeepSeek V3 and R1 in great detail in several other articles. To summarize the main points, DeepSeek V3 is a base model that uses two noteworthy architectural aspects: Mixture-of-Experts (MoE) and Multi-Head Latent Attention (MLA). I think you are probably already familiar with MoE at this point, so I am skipping the introduction here.
However, if you want to read more, I recommend the short overview in my The Big LLM Architecture Comparison article for more context. The other noteworthy highlight is the use of MLA. MLA, which is used in DeepSeek V2, V3, and R1, offers a memory-saving strategy that pairs particularly well with KV caching. The idea behind MLA is to compress the key and value tensors into a lower-dimensional space before storing them in the KV cache. At inference time, these compressed tensors are projected back to their original size before being used, as shown in the figure below. This adds an extra matrix multiplication but reduces memory usage. (As a side note, the queries are also compressed, but only during training, not inference.)

Figure 6: Multi-Head Latent Attention (MLA) in DeepSeek V3/R1. (The compressed space of the query vector is not shown for simplicity.)

The figure above illustrates the main idea behind MLA, where the keys and values are first projected into a latent vector, which can then be stored in the KV cache to reduce memory requirements. This requires a later up-projection back into the original key-value space, but overall it improves efficiency (as an analogy, you can think of the down- and up-projections in LoRA). Note that the query is also projected into a separate compressed space, similar to what's shown for the keys and values. However, I omitted it in the figure above for simplicity. By the way, as mentioned earlier, MLA is not new in DeepSeek V3, as its DeepSeek V2 predecessor also used (and even introduced) it.

DeepSeek R1 uses the same architecture as DeepSeek V3 above. The difference is the training recipe. I.e., using DeepSeek V3 as the base model, DeepSeek R1 was focused on the Reinforcement Learning with Verifiable Rewards (RLVR) method to improve the reasoning capabilities of the model.
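Before moving on to the training side, the MLA down- and up-projection idea described above can be sketched in a few lines of plain Python. The dimensions and weight values below are made-up toy values, not DeepSeek's actual sizes or learned matrices; it just shows why only the small latent vector needs to be cached.

```python
# Toy sketch of the MLA idea: compress a token's hidden state into a small
# latent vector for the KV cache, then up-project at attention time.
# All sizes and weights are illustrative, not the real model's.

def matmul(x, W):
    # x: vector of length d_in; W: d_in x d_out matrix (list of rows)
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

d_model, d_latent = 8, 2  # hypothetical sizes; real MLA dims are much larger

W_down = [[0.1] * d_latent for _ in range(d_model)]  # d_model -> d_latent
W_up_k = [[0.1] * d_model for _ in range(d_latent)]  # d_latent -> d_model

h = [1.0] * d_model          # hidden state of one token
c_kv = matmul(h, W_down)     # compressed latent, this is what gets cached
k = matmul(c_kv, W_up_k)     # up-projected key, recomputed at attention time

# Only the 2-dim latent is stored instead of the full 8-dim key (and value),
# which is where the KV-cache memory saving comes from.
print(len(c_kv), len(k))  # prints: 2 8
```

The extra `matmul` for the up-projection is the added compute the article mentions, traded for the smaller cache entry.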
The core idea in RLVR is to have the model learn from responses that can be verified symbolically or programmatically, such as math and code (but this can, of course, also be extended beyond these two domains).

Figure 7: An example of a verifiable task.

The GRPO algorithm, which is short for Group Relative Policy Optimization, is essentially a simpler variant of the Proximal Policy Optimization (PPO) algorithm that is popular in Reinforcement Learning with Human Feedback (RLHF), which is used for LLM alignment.

Figure 8: Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by removing the reward model, relying instead on verifiable rewards from symbolic tools such as calculators or compilers.

I covered RLVR training with the GRPO algorithm in more detail (including the math behind it) in my The State of Reinforcement Learning for LLM Reasoning article if you are interested in additional information. As the DeepSeek team stated themselves, DeepSeek R1-0528 is basically a "minor version upgrade." The architecture remains the same as in DeepSeek V3/R1, and the improvements are on the training side to bring it up to par with OpenAI o3 and Gemini 2.5 Pro at the time. Unfortunately, the DeepSeek team didn't release any specific information describing how this was achieved; however, they stated that it partly comes from optimizations in their post-training pipeline. Also, based on what's been shared, I think it's likely that the hosted version of the model uses more computational resources at inference time (longer reasoning). DeepSeek V3.1 is a hybrid model with both general chat (instruct) and reasoning capabilities.
I.e., instead of developing two separate models, there is now one model in which users can switch modes via the chat prompt template (similar to the initial Qwen3 model). DeepSeek V3.1 is based on DeepSeek V3.1-Base, which is in turn based on DeepSeek V3. They all share the same architecture. DeepSeek V3.2-Exp (Sep 2025) is where it gets more interesting. Originally, the DeepSeek V3.2-Exp didn’t top the benchmarks, which is why there wasn’t as much excitement around this model upon release. However, as I speculated back in September, this was likely an early, experimental release to get the infrastructure (especially the inference and deployment tools) ready for a larger release, since there are a few architectural changes in DeepSeek V3.2-Exp. The bigger release is DeepSeek V3.2 (not V4), but more on that later. So, what’s new in DeepSeek V3.2-Exp? First, DeepSeek V3.2-Exp was trained based on DeepSeek V3.1-Terminus as a base model. What’s DeepSeek V3.1-Terminus? It’s just a small improvement over the DeepSeek V3.1 checkpoint mentioned in the previous section. The technical report states that: DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. As the paragraph above states, the main innovation here is the DeepSeek Sparse Attention (DSA) mechanism that they add to DeepSeek V3.1-Terminus before doing further training on that checkpoint. This DSA consists of (1) a lightning indexer and (2) a token-selector, and the goal is to selectively reduce the context to improve efficiency. To explain how it works, let’s start with sliding-window attention. 
Sliding window attention is a technique (recently used by Gemma 3 and Olmo 3) that limits the attention window to a fixed size, as illustrated in the figure below.

Figure 9: In sliding window attention, the current query token doesn't attend to all previous tokens but just a subset.

DSA is based on the same idea as sliding window attention: only a subset of past tokens can be attended to. However, instead of selecting the attendable tokens via a fixed-width sliding window, DSA has an indexer and token selector to decide which past tokens can be attended to. In other words, the tokens that can be attended to look more random, as illustrated in the figure below.

Figure 10: In DSA, the current token can attend to a select number of tokens in the past (instead of all tokens like in regular causal attention).

However, while I said "random" above, the pattern of which past tokens are selected is not actually random but learned. In practice, DSA uses its so-called lightning indexer to compute relevance scores for each new query token based on all previous tokens. For this computation, the lightning indexer uses the compressed token representations in DeepSeek's Multi-Head Latent Attention (MLA) and computes each token's similarity to the other tokens. The similarity score is basically a scaled dot product between query and key vectors passed through a ReLU function. If you are interested in the mathematical details, the equation (taken from the paper) for this lightning indexer similarity score is shown below:

I_{t,s} = Σ_j w_{t,j} · ReLU(q_{t,j} · k_s)

Here, w is a learned per-head weighting coefficient that determines how much each indexer head should contribute to the final similarity score. The q refers to the query, and the k refers to the key vector.
And below is a list of the different subscripts:

t: the position of the current query token;
s: the position of a previous token in the sequence (0 ≤ s < t);
j: the index over the different indexer heads (Figure 10 above only showed one head for simplicity), so q_{t,j} means "the query vector for the current token t in indexer head j".

You may notice that the head index j appears only on the queries, not the keys. That's because the model only needs to decide which past tokens each new query should consider. The keys are already compressed and stored in the KV cache, so the indexer does not need to score or compress them again over the different heads. The ReLU function here, since it is max(0, x), zeroes out negative dot products, which could theoretically enable sparsity, but since there is a summation over the different heads, it's unlikely that the indexer score is actually 0. The sparsity rather comes from the separate token selector.

The separate token selector keeps only a small number of high-scoring tokens (for example, the top-k positions) and constructs a sparse attention mask that masks out the other tokens that are not contained in the selected subset. (The k in top-k, not to be confused with the k that is used for the keys in the equation above, is a hyperparameter that is set to 2048 in the model code that the DeepSeek team shared.) The figure below illustrates the whole process in a flowchart.

Figure 11: A visual summary of DeepSeek V3.2's Sparse Attention mechanism.

To sum it up, the indexer and token selector result in each token attending to a few past tokens that the model has learned to consider most relevant, rather than all tokens or a fixed local window. The goal here was not to improve the performance over DeepSeek V3.1-Terminus but to reduce the performance degradation (due to the sparse attention mechanism) while benefiting from improved efficiency.
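Putting the indexer score and the token selector together, a toy version of the whole selection step could look like the sketch below. The vectors, per-head weights, and the tiny k are made-up values purely for illustration; the real lightning indexer operates on MLA's compressed representations and uses k = 2048.

```python
# Toy sketch of the lightning indexer + top-k token selector:
# score = sum over indexer heads j of w_j * ReLU(q_j . k_s).

def relu(x):
    return max(0.0, x)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def indexer_scores(queries_per_head, head_weights, past_keys):
    # queries_per_head: one query vector per indexer head (for token t)
    # head_weights: learned per-head weights w_{t,j} (toy constants here)
    # past_keys: one (compressed) key vector per past position s
    scores = []
    for k_s in past_keys:
        score = sum(w * relu(dot(q, k_s))
                    for q, w in zip(queries_per_head, head_weights))
        scores.append(score)
    return scores

def select_top_k(scores, k):
    # token selector: keep the k highest-scoring past positions
    order = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    return sorted(order[:k])

past_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 2.0]]  # two indexer heads
weights = [0.7, 0.3]

scores = indexer_scores(queries, weights, past_keys)
mask_positions = select_top_k(scores, k=2)  # the real model uses k=2048
print(mask_positions)  # prints: [0, 3]
```

Note how the selected set ([0, 3] here) need not be contiguous, unlike a sliding window; attention is then computed only over these positions.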
Overall, DSA reduces the computational complexity of the attention mechanism from a quadratic O(L^2), where L is the sequence length, to a linear O(Lk), where k (≪ L) is the number of selected tokens. Having discussed DeepSeek V3.2-Exp, we are getting closer to the main topic of this article: DeepSeek V3.2. However, there is one more puzzle piece to discuss first.

On November 27, 2025 (Thanksgiving in the US), and just 4 days before the DeepSeek V3.2 release, the DeepSeek team released DeepSeekMath V2, based on DeepSeek V3.2-Exp-Base. This model was specifically developed for math and achieved gold-level scores in several math competitions. Essentially, we can think of it as a proof (of concept) model for DeepSeek V3.2, introducing one more technique. The key aspect here is that reasoning models (like DeepSeek R1 and others) are trained with an external verifier, and the model learns, by itself, to write explanations before arriving at the final answer. However, the explanations may be incorrect. As the DeepSeek team succinctly states about the shortcomings of regular RLVR:

[...] correct answers don't guarantee correct reasoning. [...] a model can arrive at the correct answer through flawed logic or fortunate errors.

The other limitation of the DeepSeek R1 RLVR approach they aim to address is that:

[...] many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable.

So, to improve upon these two shortcomings, they train two models in this paper: (1) an LLM-based verifier for theorem proving, and (2) the main model, a proof generator, which uses the LLM-based verifier as a reward model (instead of a symbolic verifier). In addition to this self-verification via an LLM as described above, they also use self-refinement (covered in the upcoming Chapter 5 of my Build a Reasoning Model (From Scratch) book) to have the LLM iteratively improve its own answers.
Having an LLM score for the intermediate steps is not new. There is a whole line of research on so-called process reward models, which have focused on this. Examples include Solving Math Word Problems With Process- and Outcome-based Feedback (2022) or Let’s Verify Step by Step (2023) , but there are many more. The challenges with process reward models are that it’s not easy to check whether intermediate rewards are correct, and it can also lead to reward hacking. In the DeepSeek R1 paper in Jan 2025, they didn’t use process reward models as they found that: its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments. In this paper, they successfully revisit this in the form of self-verification. The motivation is that, even if no reference solution exists, humans can self-correct when reading proofs and identifying issues. So, in order to develop a better model for writing mathematical proofs (LLM 1 in the figure below), they developed a proof verifier (LLM 2) in the figure below, which can be used as an LLM-as-a-judge to score the prover (LLM 1) outputs. Figure 12: The general math proof generator (LLM 1) and verifier (LLM 2) setup. The verifier LLM (LLM 2) takes in a rubric to score the generated proof, where the score is “1 for complete and rigorous proofs with all logical steps clearly justified;” “0.5 for proofs with sound overall logic but minor errors or omitted details;” “and 0 for fundamentally flawed proofs containing fatal logical errors or critical gaps.” For the proof verifier model, they start with DeepSeek V3.2-Exp-SFT, a model they created based on DeepSeek V3.2-Exp by supervised fine-tuning on reasoning data (both math and code). 
They then further train the model with reinforcement learning using a format reward (a check whether the solution is in the expected format) and a score reward based on how close the predicted score is to the actual score (annotated by human math experts). The goal of the proof verifier (LLM 2) is to check the generated proofs (LLM 1), but who checks the proof verifier? To make the proof verifier more robust and prevent it from hallucinating issues, they developed a third LLM, a meta-verifier. Figure 13: The meta-verifier (LLM 3) checks whether the verifier (LLM 2) is verifying the generator (LLM 1) correctly. The meta-verifier (LLM 3) is also developed with reinforcement learning, similar to LLM 2. While the use of a meta-verifier is not required, the DeepSeek team reported that: the average quality score of the verifier’s proof analyses – as evaluated by the meta-verifier – improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction. This is actually quite an interesting setup. If you are familiar with generative adversarial networks (GANs), you may see the analogy here. For instance, the proof verifier (think of it as a GAN discriminator) improves the proof generator, and the proof generator generates better proofs, further pushing the proof verifier. The meta score is used during training of the verifier (LLM 2) and the generator (LLM 1). It is not used at inference time in the self‑refinement loop, which we will discuss in the next section. In the previous section, we talked about self-verification, i.e., analyzing the quality of the solution. The purpose of this is to implement self-refinement, which means that the LLM can act upon the feedback and revise its answer. Traditionally, in self-refinement, which is an established and popular inference-scaling technique, we would use the same LLM for generating the solution and verifying it, before refining it. 
In other words, in the previous Figures 12 and 13, LLM 1 and LLM 2 would be the same LLM. So, a traditional self-refinement process would look as follows:

Figure 14: A classic self-refinement iteration where we use the same LLM for generating the initial response (Output 1), the evaluation (Eval), and the refined answer (Output 2).

However, the DeepSeek team observed a crucial issue with using the same LLM for both the generation and verification in practice: when prompted to both generate and analyze its own proof in one shot, the generator tends to claim correctness even when the external verifier easily identifies flaws. In other words, while the generator can refine proofs based on external feedback, it fails to evaluate its own work with the same rigor as the dedicated verifier. As a logical consequence, one would assume they use a separate proof generator (LLM 1) and proof verifier (LLM 2). So, the self-refinement loop used here becomes similar to the one shown in the figure below. Note that we omit LLM 3, which is only used during the development of the verifier (LLM 2).

Figure 15: Self-refinement with a separate verifier LLM (LLM 2).

However, in practice, and different from Figure 15, the DeepSeek team uses the same generator and verifier LLM as in a classic self-refinement loop in Figure 14: "All experiments used a single model, our final proof generator, which performs both proof generation and verification." In other words, the separate verifier is essential during training, to improve the generator, but it is not used (or needed) later during inference once the generator is strong enough. And the key difference from naive single-model self-refinement is that the final prover has been trained under the guidance of a stronger verifier and meta-verifier, so it has learned to apply those rubrics to its own outputs.
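The self-refinement loop described above can be sketched as follows. The `generate`, `verify`, and `refine` functions are hypothetical stand-ins for LLM calls; the iteration budget of 8 matches the paper, but the toy scoring logic is entirely my own.

```python
# Sketch of a self-refinement loop: generate an answer, score it against a
# rubric, and revise until the score is high enough or the budget runs out.

def self_refine(generate, verify, refine, max_iters=8, threshold=1.0):
    answer = generate()
    for _ in range(max_iters - 1):
        score = verify(answer)
        if score >= threshold:
            break
        answer = refine(answer, score)
    return answer

# Dummy stand-ins: each revision "improves" the answer by one quality unit,
# and the rubric-style score (in [0, 1]) reaches 1.0 at quality 2.
final = self_refine(
    generate=lambda: 0,
    verify=lambda a: min(1.0, 0.5 * a),
    refine=lambda a, score: a + 1,
)
print(final)  # prints: 2
```

In the single-model setup, the same LLM would play all three roles; in the training setup from Figures 12 and 13, `verify` would be the separately trained verifier.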
Using this 2-in-1 DeepSeekMath V2 verifier during inference is also beneficial in terms of resources and cost, as it adds less complexity and compute overhead than running a second LLM for proof verification. Coming back to the general self-refinement concept shown in Figures 14 and 15, both figures show self-refinement with 2 iterations (the initial one and a refined answer). Of course, we can add more iterations to this process. It's a classic inference-scaling trade-off: the more iterations we add, the more expensive it becomes to generate the answer, but the higher the overall accuracy. In the paper, the DeepSeek team used up to 8 iterations, and it looks like the accuracy hadn't saturated yet.

Figure 16: Additional self-refinement iterations improve accuracy. Annotated figure from the DeepSeekMath V2 paper. The Best@32 accuracy majority voting method is also known as "self-consistency" and covered in Chapter 4 of my Build a Reasoning Model (From Scratch) book.

The reason why we spent so much time on DeepSeekMath V2 in the previous section is that a) it's a very interesting proof of concept that pushes the idea of Reinforcement Learning with Verifiable Rewards (RLVR) further with self-verification and self-refinement techniques, and b) the self-verification and self-refinement techniques are used in DeepSeek V3.2 as well. But before we get to this part, let's start with a general overview of DeepSeek V3.2. This model is a big deal because it performs really well compared to current flagship models.

Figure 17: Benchmark comparison between DeepSeek V3.2 and proprietary flagship models. This is an annotated figure from the DeepSeek V3.2 report.

Similar to several other DeepSeek models, V3.2 comes with a nice technical report, which I will discuss in the next sections. The main motivation for this model is, of course, to improve overall model performance. For instance, like DeepSeekMath V2, it achieves gold-level performance on math benchmarks.
However, the model is also trained with tool use in mind and performs well on other tasks, for instance, code and agentic tasks. At the same time, the DeepSeek team writes about computational efficiency as a big, motivating factor. That's why they use the Multi-Head Latent Attention (MLA) mechanism from V2 and V3 together with the DeepSeek Sparse Attention (DSA) mechanism, which they added in V3.2. In fact, the paper says that "DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp," which we discussed in an earlier section.

Figure 18: The DeepSeek V3.2 architecture.

As I mentioned earlier, the DeepSeek V3.2-Exp release was likely intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model.

Figure 19: Inference cost savings thanks to DeepSeek Sparse Attention (DSA). Annotated figure from the DeepSeek V3.2 report.

Interestingly, as the screenshot from the paper above shows, the DeepSeek team reverted to using NVIDIA chips (after they allegedly experimented with model training on chips from Huawei). Since the architecture is the same as that of DeepSeek V3.2-Exp, the interesting details lie in the training methods, which we will discuss in the next sections. Overall, the DeepSeek team adopts the Reinforcement Learning with Verifiable Rewards (RLVR) procedure using the Group Relative Policy Optimization (GRPO) algorithm, similar to DeepSeek R1. However, there are some interesting updates to discuss. Originally, DeepSeek R1 used a format reward (to make sure the answer is properly formatted), a language consistency reward (so that the model doesn't alternate between different languages when writing its response), and the main verifier reward (whether the answer to a math or code problem is correct or not). For DeepSeek V3.2, they changed the rewards:

For reasoning and agent tasks, we employ rule-based outcome reward, length penalty, and language consistency reward.
For general tasks, we employ a generative reward model where each prompt has its own rubrics for evaluation.

For instance, they removed the format reward but added a length penalty for agentic tasks. Then, for general tasks where there is no symbolic verifier (math) or code interpreter to verify the answer, they use a reward model (another LLM trained to output a reward score). So, the pipeline is no longer purely verifier-based RLVR like in DeepSeek R1, but a hybrid of RLVR (for verifiable domains) and more standard LLM-as-a-judge reward modeling for everything else. For the math domain, they state that they additionally "incorporated the dataset and reward method from DeepSeekMath-V2," which we discussed earlier in this article.

Regarding GRPO itself, the learning algorithm inside the RLVR pipeline, they made a few changes since the original version in the DeepSeek R1 paper, too. Over the last few months, dozens of papers have proposed modifications to GRPO to improve its stability and efficiency. I wrote about two popular ones, DAPO and Dr. GRPO, earlier this year in my The State of Reinforcement Learning for LLM Reasoning article. Without getting into the mathematical details of GRPO, in short, DAPO modifies GRPO with asymmetric clipping, dynamic sampling, token-level loss, and explicit length-based reward shaping. Dr. GRPO changes the GRPO objective itself to remove the length and std normalizations. The recent Olmo 3 paper also adopted similar changes, which I am quoting below:

- Zero Gradient Signal Filtering: We remove groups of instances whose rewards are all identical (that is, a batch with zero standard deviation in their advantage) to avoid training on samples that provide zero gradient, similar to DAPO (Yu et al., 2025). [DAPO]
- Active Sampling: We maintain a consistent batch size in spite of zero gradient filtering with a novel, more efficient version of dynamic sampling (Yu et al., 2025). See OlmoRL Infra for details. [DAPO]
- Token-level loss: We use a token-level loss to normalize the loss by the total number of tokens across the batch (Yu et al., 2025), rather than per-sample to avoid a length bias. [DAPO]
- No KL Loss: We remove the KL loss as a common practice (GLM-4.5 Team et al., 2025; Yu et al., 2025; Liu et al., 2025b) as it allows less restricted policy updates, and removing it does not lead to over-optimization or destabilized training. [DAPO and Dr. GRPO]
- Clip Higher: We set the upper-bound clipping term in the loss to a slightly higher value than the lower bound to enable larger updates on tokens, as proposed by Yu et al. (2025). [DAPO]
- Truncated Importance Sampling: To adjust for differences between log probabilities from the inference and training engines, we multiply the loss by the truncated importance sampling ratio, following Yao et al. (2025).
- No standard deviation normalization: When calculating advantage, we do not normalize by the standard deviation of the group, following Liu et al. (2025b). This removes a difficulty bias, where questions with low standard deviation in their rewards (for example, too hard or too easy) have their advantages significantly increased by the normalization term. [Dr. GRPO]

The GRPO modifications in DeepSeek V3.2 are a bit less aggressive, which I summarize below in a similar style to the Olmo 3 list:

- Domain-specific KL strengths (including zero for math): Instead of always dropping KL like DAPO and Dr. GRPO do for math-style RL, DeepSeek V3.2 keeps a KL term in the objective but tunes its weight per domain. However, they also note that very weak or even zero KL often works best for mathematics. (But instead of removing it completely, it becomes a hyperparameter.)
- Unbiased KL estimate: As mentioned above, DeepSeek V3.2 doesn't remove the KL penalty. In addition to treating it as a tuning knob, they propose a fix to how the KL penalty is estimated in GRPO by reweighting the KL term with the same importance ratio used for the main loss, so the KL gradient actually matches the fact that samples come from the old policy rather than the current one.
- Off-policy sequence masking: When they reuse rollout data (a rollout is simply jargon for the full sequence the model generates) across many gradient steps, DeepSeek V3.2 measures how far the current policy has drifted from the rollout policy on each full answer and simply drops those sequences that both have negative advantage and are "too off-policy". This prevents the model from learning from overly off-policy or stale data.
- Keep routing for MoE models: For the Mixture-of-Experts backbone, they log which experts were activated during rollout and force the same routing pattern during training, so gradient updates go to the experts that actually produced the sampled answers.
- Keep sampling mask for top-p / top-k: When rollouts use top-p or top-k sampling, DeepSeek V3.2 stores the selection mask and reapplies it when computing the GRPO loss and KL, so the action space at training time matches what was actually available during sampling.
- Keep original GRPO advantage normalization: Dr. GRPO shows that GRPO's length and per-group standard-deviation normalization terms bias optimization toward overly long incorrect answers and over-weight very easy or very hard questions. Dr. GRPO fixes this by removing both terms and going back to an unbiased PPO-style objective. In contrast, DAPO moves to a token-level loss that also changes how long vs. short answers are weighted. DeepSeek V3.2, however, keeps the original GRPO normalization and instead focuses on other fixes, such as those above.

So, overall, DeepSeek V3.2 stays closer to the original GRPO algorithm than some other recent models but adds some logical tweaks.
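To illustrate the advantage-normalization point from above, here is a toy comparison of GRPO's original group advantage (with per-group standard-deviation normalization, which DeepSeek V3.2 keeps) and the Dr. GRPO variant (mean-centering only). The rewards are made-up values for one "too easy" group where most answers are correct.

```python
# GRPO-style vs. Dr.-GRPO-style group advantages for one prompt's rollouts.

from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    # original GRPO: center by the group mean, divide by the group std
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    # Dr. GRPO variant: mean-centering only, no std normalization
    mu = mean(rewards)
    return [r - mu for r in rewards]

# An easy group: 3 of 4 answers correct, so the group std is small and the
# original normalization inflates the advantage magnitudes.
rewards = [1.0, 1.0, 1.0, 0.0]
print([round(a, 2) for a in grpo_advantages(rewards)])     # [0.58, 0.58, 0.58, -1.73]
print([round(a, 2) for a in dr_grpo_advantages(rewards)])  # [0.25, 0.25, 0.25, -0.75]
```

The std-normalized version scales this low-variance group up, which is exactly the difficulty bias Dr. GRPO removes; per the paper, DeepSeek V3.2 accepts this and addresses stability through the other tweaks instead.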
DeepSeek V3.2 also comes in an extreme, extended-thinking variant called DeepSeek V3.2-Speciale, which was trained only on reasoning data during the RL stage (more akin to DeepSeek R1). Besides training only on reasoning data, they also reduced the length penalty during RL, allowing the model to output longer responses. Generating longer responses is a form of inference scaling, where responses become more expensive due to the increased length, in return for better results.

Figure 20: The "extended-thinking" Speciale model achieves higher accuracy but also generates more tokens.

In this article, I didn't cover all the nitty-gritty details of the DeepSeek V3.2 training approach, but I hope the comparison with previous DeepSeek models helps clarify the main points and innovations. In short, the interesting takeaways are:

- DeepSeek V3.2 uses a similar architecture to all its predecessors since DeepSeek V3;
- The main architectural tweak is the sparse attention mechanism from DeepSeek V3.2-Exp, which they added to improve efficiency;
- To improve math performance, they adopted the self-verification approach from DeepSeekMath V2;
- There are several improvements to the training pipeline, for example, the GRPO stability updates. (Note that the paper goes into several other aspects around distillation, long-context training, and the integration of tool use similar to gpt-oss, which we did not cover in this article.)

Irrespective of the relative market share of DeepSeek models compared to other smaller open-weight models or proprietary models like GPT-5.1 or Gemini 3.0 Pro, one thing is for sure: DeepSeek releases are always interesting, and there's always a lot to learn from the technical reports that come with the open-weight model checkpoints. I hope you found this overview useful! This magazine is a personal passion project, and your support helps keep it alive.
If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I’m confident you’ll get a lot out of these; they explain how LLMs work in a depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I’d really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you!
I am sure that the DeepSeek team has been super busy this year. However, there hasn’t been a major release in the 10-11 months since DeepSeek R1. Personally, I think it’s reasonable to go ~1 year between major LLM releases since it’s A LOT of work. However, I saw on various social media platforms that people were pronouncing the team “dead” (as a one-hit wonder). I am sure the DeepSeek team has also been busy navigating the switch from NVIDIA to Huawei chips. By the way, I am not affiliated with them, nor have I spoken with them; everything here is based on public information . As far as I know, they are back to using NVIDIA chips. Finally, it’s also not that they haven’t released anything. There have been a couple of smaller releases that trickled in this year, for instance, DeepSeek V3.1 and V3.2-Exp. Figure 3: DeepSeek releases since last year. The main models are shown in red. As I predicted back in September, the DeepSeek V3.2-Exp release was intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model. V3.2-Exp and V3.2 use a non-standard sparse attention variant that requires custom code, but more on this mechanism later. (I was tempted to cover it in my previous Beyond Standard LLMs article, but Kimi Linear was released around then, which I prioritized for that article’s section on new attention variants.) 2. Hybrid Versus Dedicated Reasoning Models Before discussing further model details, it might be worthwhile to discuss the overall model types. Originally, DeepSeek V3 was released as a base model, and DeepSeek R1 added additional post-training to develop a dedicated reasoning model. This procedure is summarized in the figure below. Figure 4: Overview of the DeepSeek R1 training pipeline. This figure is from my more detailed Understanding Reasoning LLMs article, where you can read more about the training pipeline.
What’s worth noting here is that DeepSeek V3 is a base model, and DeepSeek R1 is a dedicated reasoning model. In parallel with DeepSeek, other teams have also released many really strong open-weight reasoning models. One of the strongest open-weight models this year was Qwen3. Originally, it was released as a hybrid reasoning model, which means that users were able to toggle between reasoning and non-reasoning modes within the same model. (In the case of Qwen3, this toggling was enabled via the tokenizer by adding/omitting tags.) Since then, LLM teams have released (and in some cases gone back and forth between) both dedicated reasoning models and Instruct/Reasoning hybrid models, as shown in the timeline below. Figure 5: The timeline of some of the reasoning and hybrid models released this year. For instance, Qwen3 started out as a hybrid model, but the Qwen team then later released separate instruct and reasoning models as they were easier to develop and yielded better performance in each respective use case. Some models like OpenAI’s gpt-oss only come in a hybrid variant where users can choose the reasoning effort via a system prompt (I suspect this is handled similarly in GPT-5 and GPT-5.1). And in the case of DeepSeek, it looks like they moved in the opposite direction from a dedicated reasoning model (R1) to a hybrid model (V3.1 and V3.2). However, I suspect that R1 was mainly a research project to develop reasoning methods and the best reasoning model at the time. The V3.2 release may be more about developing the best overall model for different use cases. (Here, R1 was more like a testbed or prototype model.) And I also suspect that, while the DeepSeek team developed V3.1 and V3.2 with reasoning capabilities, they might still be working on a dedicated R2 model. 3. From DeepSeek V3 to V3.1 Before discussing the new DeepSeek V3.2 release in more detail, I thought it would be helpful to start with an overview of the main changes going from V3 to V3.1.
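Before moving on, the tag-based reasoning toggle mentioned above for Qwen3 can be sketched in a few lines. This is a minimal, made-up illustration of the mechanism; the token strings below are hypothetical placeholders, not the actual Qwen3 chat template:

```python
# Toy hybrid-reasoning prompt builder. The special tokens are
# hypothetical placeholders; real templates are model-specific.
def build_prompt(user_msg: str, thinking: bool) -> str:
    prompt = f"<|user|>{user_msg}<|assistant|>"
    if not thinking:
        # Pre-filling an empty think block signals the model to skip
        # the reasoning trace and answer directly.
        prompt += "<think>\n\n</think>"
    return prompt

print(build_prompt("What is 2+2?", thinking=True))
print(build_prompt("What is 2+2?", thinking=False))
```

The point is that hybrid models need no architectural switch at all: the same weights behave as an instruct or reasoning model depending on what the template injects.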
3.1 DeepSeek V3 Overview and Multi-Head Latent Attention (MLA) I already discussed DeepSeek V3 and R1 in great detail in several other articles. To summarize the main points, DeepSeek V3 is a base model that uses two noteworthy architecture aspects: Mixture-of-Experts (MoE) and Multi-Head Latent Attention (MLA). I think you are probably well familiar with MoE at this point, so I am skipping the introduction here. However, if you want to read more, I recommend the short overview in my The Big Architecture Comparison article for more context. The other noteworthy highlight is the use of MLA. MLA, which is used in DeepSeek V2, V3, and R1 , offers a memory-saving strategy that pairs particularly well with KV caching. The idea in MLA is that it compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache. At inference time, these compressed tensors are projected back to their original size before being used, as shown in the figure below. This adds an extra matrix multiplication but reduces memory usage. (As a side note, the queries are also compressed, but only during training, not inference.) Figure 6: Multi-Head Latent Attention (MLA) in DeepSeek V3/R1. (The compressed space of the query vector is not shown for simplicity.) The figure above illustrates the main idea behind MLA, where the keys and values are first projected into a latent vector, which can then be stored in the KV cache to reduce memory requirements. This requires a later up-projection back into the original key-value space, but overall it improves efficiency (as an analogy, you can think of the down- and up-projections in LoRA). Note that the query is also projected into a separate compressed space, similar to what’s shown for the keys and values. However, I omitted it in the figure above for simplicity. By the way, as mentioned earlier, MLA is not new in DeepSeek V3, as its DeepSeek V2 predecessor also used (and even introduced) it. 
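The down- and up-projection idea behind MLA can be illustrated with a toy, pure-Python sketch. All sizes and weights below are made up (real models use much larger dimensions, and the learned projections come from training); the point is only to show what gets stored in the KV cache:

```python
import random

random.seed(0)

D_MODEL, D_LATENT = 8, 2  # toy sizes; real models are far larger

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Made-up stand-ins for the learned projection matrices.
W_down = rand_matrix(D_LATENT, D_MODEL)   # token -> shared KV latent
W_up_k = rand_matrix(D_MODEL, D_LATENT)   # latent -> key
W_up_v = rand_matrix(D_MODEL, D_LATENT)   # latent -> value

token = [random.uniform(-1, 1) for _ in range(D_MODEL)]

# Only the small latent vector is stored in the KV cache...
latent = matvec(W_down, token)
kv_cache = [latent]

# ...and keys/values are re-materialized from it at attention time,
# which costs an extra matmul but shrinks cache memory.
k = matvec(W_up_k, kv_cache[-1])
v = matvec(W_up_v, kv_cache[-1])

# The cache holds D_LATENT floats per token instead of 2 * D_MODEL.
print(len(latent), len(k), len(v))  # 2 8 8
```

With these toy sizes the cache shrinks from 16 floats per token (key plus value) to 2, which is the LoRA-like down/up-projection trade-off described above.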
3.2 DeepSeek R1 Overview and Reinforcement Learning with Verifiable Rewards (RLVR) DeepSeek R1 uses the same architecture as DeepSeek V3 above. The difference is the training recipe. I.e., using DeepSeek V3 as the base model, DeepSeek R1 was focused on the Reinforcement Learning with Verifiable Rewards (RLVR) method to improve the reasoning capabilities of the model. The core idea in RLVR is to have the model learn from responses that can be verified symbolically or programmatically, such as math and code (but this can, of course, also be extended beyond these two domains). Figure 7: An example of a verifiable task. The GRPO algorithm, which is short for Group Relative Policy Optimization, is essentially a simpler variant of the Proximal Policy Optimization (PPO) algorithm that is popular in Reinforcement Learning with Human Feedback (RLHF), which is used for LLM alignment. Figure 8: Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by removing the reward model, relying instead on verifiable rewards from symbolic tools such as calculators or compilers. I covered the RLVR training with their GRPO algorithm in more detail (including the math behind it) in my The State of Reinforcement Learning for LLM Reasoning article if you are interested in additional information. 3.3 DeepSeek R1-0528 Version Upgrade As the DeepSeek team stated themselves, DeepSeek R1-0528 is basically a “minor version upgrade.” The architecture remains the same as in DeepSeek V3/R1, and the improvements are on the training side to bring it up to par with OpenAI o3 and Gemini 2.5 Pro at the time. Unfortunately, the DeepSeek team didn’t release any specific information describing how this was achieved; however, they stated that it partly comes from optimizations in their post-training pipeline.
Also, based on what’s been shared, I think it’s likely that the hosted version of the model uses more computational resources at inference time (longer reasoning). 3.4 DeepSeek V3.1 Hybrid Reasoning DeepSeek V3.1 is a hybrid model with both general chat (instruct) and reasoning capabilities. I.e., instead of developing two separate models, there is now one model in which users can switch modes via the chat prompt template (similar to the initial Qwen3 model). DeepSeek V3.1 is based on DeepSeek V3.1-Base, which is in turn based on DeepSeek V3. They all share the same architecture. 4. DeepSeek V3.2-Exp and Sparse Attention DeepSeek V3.2-Exp (Sep 2025) is where it gets more interesting. Originally, the DeepSeek V3.2-Exp didn’t top the benchmarks, which is why there wasn’t as much excitement around this model upon release. However, as I speculated back in September, this was likely an early, experimental release to get the infrastructure (especially the inference and deployment tools) ready for a larger release, since there are a few architectural changes in DeepSeek V3.2-Exp. The bigger release is DeepSeek V3.2 (not V4), but more on that later. So, what’s new in DeepSeek V3.2-Exp? First, DeepSeek V3.2-Exp was trained based on DeepSeek V3.1-Terminus as a base model. What’s DeepSeek V3.1-Terminus? It’s just a small improvement over the DeepSeek V3.1 checkpoint mentioned in the previous section. The technical report states that: DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. 
As the paragraph above states, the main innovation here is the DeepSeek Sparse Attention (DSA) mechanism that they add to DeepSeek V3.1-Terminus before doing further training on that checkpoint. This DSA consists of (1) a lightning indexer and (2) a token-selector, and the goal is to selectively reduce the context to improve efficiency. To explain how it works, let’s start with sliding-window attention, a technique (recently used by Gemma 3 and Olmo 3) that limits the attention window to a fixed size, as illustrated in the figure below. Figure 9: In sliding window attention, the current query token doesn’t attend to all previous tokens but just a subset. DSA is based on the same idea as sliding-window attention: only a subset of past tokens can be attended to. However, instead of selecting the tokens that can be attended to via a fixed-width sliding window, DSA has an indexer and token selector to decide which past tokens can be attended to. In other words, the tokens that can be attended to are more random, as illustrated in the figure below. Figure 10: In DSA, the current token can attend to a select number of tokens in the past (instead of all tokens like in regular causal attention). However, while I said “random” above, the pattern of which past tokens are selected is not actually random but learned. In practice, DSA uses its so-called lightning indexer to compute relevance scores for each new query token based on all previous tokens. For this computation, the lightning indexer uses the compressed token representations in DeepSeek’s Multi-Head Latent Attention (MLA) and computes the token similarity towards other tokens. The similarity score is basically a scaled dot product between query and key vectors passed through a ReLU function.
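The indexer-plus-selector idea just described can be sketched in plain Python. Everything below uses toy dimensions and made-up scores (this is not DeepSeek's implementation), but it shows how a learned top-k selection differs from a fixed sliding window:

```python
# Toy DSA-style selection: an indexer scores each past token s for the
# current query token t, and only the top-k positions are attended to.
def relu(x):
    return max(0.0, x)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def indexer_score(q_heads, w_heads, k_s):
    # Per indexer head: weight * ReLU(query . key), summed over heads.
    return sum(w * relu(dot(q, k_s)) for q, w in zip(q_heads, w_heads))

def topk_positions(t, scores, k):
    # Keep the k highest-scoring past positions (a sliding window
    # would instead keep the k most *recent* positions).
    past = list(range(t + 1))
    return sorted(sorted(past, key=lambda s: scores[s], reverse=True)[:k])

def sliding_window_positions(t, window):
    return list(range(max(0, t - window + 1), t + 1))

# Two indexer heads with 2-dim toy vectors (all values made up).
q_heads = [[1.0, 0.0], [0.0, 1.0]]
w_heads = [0.5, 2.0]
print(indexer_score(q_heads, w_heads, [2.0, -1.0]))  # 1.0

t = 7
scores = [0.1, 0.9, 0.0, 0.4, 0.8, 0.2, 0.3, 0.7]
print(sliding_window_positions(t, window=3))  # [5, 6, 7]
print(topk_positions(t, scores, k=3))         # [1, 4, 7]
```

Note how the top-k selection can reach far back (position 1) when the indexer deems it relevant, which a fixed window cannot.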
If you are interested in the mathematical details, the equation (taken from the paper) for this lightning indexer similarity score is shown below:

I_{t,s} = \sum_{j} w_{t,j} \cdot \mathrm{ReLU}(q_{t,j} \cdot k_{s})

Here, w is a learned per-head weighting coefficient that determines how much each indexer head should contribute to the final similarity score. The q refers to the query, and the k refers to the key vector. And below is a list of the different subscripts: t : position of the current query token; s : position of a previous token in the sequence (0 ≤ s < t); j : the index over the different indexer heads (Figure 10 above only showed one head for simplicity), so q_{t,j} means “query vector for the current token t in indexer head j”. Figure 11: A visual summary of DeepSeek V3.2’s Sparse Attention mechanism. To sum it up, the indexer and token selector result in each token attending to a few past tokens that the model has learned to consider most relevant, rather than all tokens or a fixed local window. The goal here was not to improve the performance over DeepSeek V3.1-Terminus but to reduce the performance degradation (due to the sparse attention mechanism) while benefiting from improved efficiency. Overall, DSA reduces the computational complexity of the attention mechanism from quadratic O(L²), where L is the sequence length, to linear O(Lk), where k (≪ L) is the number of selected tokens. 5. DeepSeekMath V2 with Self-Verification and Self-Refinement Having discussed DeepSeek V3.2-Exp, we are getting closer to the main topic of this article: DeepSeek V3.2. However, there is one more puzzle piece to discuss first. On November 27, 2025 (Thanksgiving in the US), and just 4 days before the DeepSeek V3.2 release, the DeepSeek team released DeepSeekMath V2 , based on DeepSeek V3.2-Exp-Base. This model was specifically developed for math and achieved gold-level scores in several math competitions. Essentially, we can think of it as a proof (of concept) model for DeepSeek V3.2, introducing one more technique.
The key aspect here is that reasoning models (like DeepSeek R1 and others) are trained with an external verifier, and the model learns, by itself, to write explanations before arriving at the final answer. However, the explanations may be incorrect. As the DeepSeek team succinctly states about the shortcomings of regular RLVR: [...] correct answers don’t guarantee correct reasoning. [...] a model can arrive at the correct answer through flawed logic or fortunate errors. The other limitation of the DeepSeek R1 RLVR approach they aim to address is that: [...] many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. So, to improve upon these two shortcomings mentioned above, in this paper, they train two models: An LLM-based verifier for theorem proving. The main model, a proof-generator, which uses the LLM-based verifier as a reward model (instead of a symbolic verifier). Figure 12: The general math proof generator (LLM 1) and verifier (LLM 2) setup. The verifier LLM (LLM 2) takes in a rubric to score the generated proof, where the score is “1 for complete and rigorous proofs with all logical steps clearly justified;” “0.5 for proofs with sound overall logic but minor errors or omitted details;” “and 0 for fundamentally flawed proofs containing fatal logical errors or critical gaps.” Figure 13: The meta-verifier (LLM 3) checks whether the verifier (LLM 2) is verifying the generator (LLM 1) correctly. The meta-verifier (LLM 3) is also developed with reinforcement learning, similar to LLM 2. While the use of a meta-verifier is not required, the DeepSeek team reported that: the average quality score of the verifier’s proof analyses – as evaluated by the meta-verifier – improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction. This is actually quite an interesting setup.
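The generator-verifier interplay of Figure 12 can be sketched with stub functions. Everything below is hypothetical scaffolding (the stubs pretend each revision fixes one issue) rather than DeepSeek's code; the rubric values are the ones quoted above, and iterating generate, verify, refine in this way is the self-refinement idea the article turns to next:

```python
# Rubric from the DeepSeekMath V2 description: 1 for rigorous proofs,
# 0.5 for minor issues, 0 for fundamentally flawed proofs.
RUBRIC = {"rigorous": 1.0, "minor_issues": 0.5, "flawed": 0.0}

# Stub verifier: scores a proof by its (pretend) number of open issues.
def verify(proof):
    if proof["issues"] == 0:
        return RUBRIC["rigorous"], "ok"
    if proof["issues"] == 1:
        return RUBRIC["minor_issues"], "minor gap in one step"
    return RUBRIC["flawed"], "fatal logical error"

# Stub refinement: pretend each revision resolves one issue.
def refine(proof, feedback):
    return {"text": proof["text"] + " (revised)", "issues": proof["issues"] - 1}

proof = {"text": "draft proof", "issues": 2}
rounds = 0
score, feedback = verify(proof)
while score < 1.0:  # iterate until the rubric says "rigorous"
    proof = refine(proof, feedback)
    score, feedback = verify(proof)
    rounds += 1

print(rounds, score)  # 2 1.0
```

During RL training, the verifier's rubric score plays the role that the symbolic pass/fail check plays in plain RLVR: it becomes the reward signal for the proof generator.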
If you are familiar with generative adversarial networks (GANs), you may see the analogy here: the proof verifier (think of it as a GAN discriminator) improves the proof generator, and the proof generator generates better proofs, further pushing the proof verifier. The meta score is used during training of the verifier (LLM 2) and the generator (LLM 1). It is not used at inference time in the self‑refinement loop, which we will discuss in the next section. 5.2 Self-Refinement In the previous section, we talked about self-verification, i.e., analyzing the quality of the solution. The purpose of this is to implement self-refinement, which means that the LLM can act upon the feedback and revise its answer. Traditionally, in self-refinement, which is an established and popular inference-scaling technique, we would use the same LLM for generating the solution and verifying it, before refining it. In other words, in the previous figures 12 and 13, LLM 1 and LLM 2 would be the same LLM. So, a traditional self-refinement process would look as follows: Figure 14: A classic self-refinement iteration where we use the same LLM for generating the initial response (Output 1), the evaluation (Eval), and the refined answer (Output 2). However, the DeepSeek team observed a crucial issue with using the same LLM for both the generation and verification in practice: when prompted to both generate and analyze its own proof in one shot, the generator tends to claim correctness even when the external verifier easily identifies flaws. In other words, while the generator can refine proofs based on external feedback, it fails to evaluate its own work with the same rigor as the dedicated verifier. As a logical consequence, one would assume they use a separate proof generator (LLM 1) and proof verifier (LLM 2). So, the self-refinement loop used here becomes similar to the one shown in the figure below.
Note that we omit LLM 3, which is only used during the development of the verifier (LLM 2). Figure 15: Self-refinement with a separate verifier LLM (LLM 2). However, in practice, and different from Figure 15, the DeepSeek team uses the same generator and verifier LLM as in a classic self-refinement loop in Figure 14: “All experiments used a single model, our final proof generator, which performs both proof generation and verification.” In other words, the separate verifier is essential for training, to improve the generator, but it is not used (or needed) later during inference once the generator is strong enough. And the key difference from naive single‑model self‑refinement is that the final prover has been trained under the guidance of a stronger verifier and meta‑verifier, so it has learned to apply those rubrics to its own outputs. Using this 2-in-1 DeepSeekMath V2 verifier during inference is also beneficial in terms of resources and cost, as it adds less complexity and compute overhead than running a second LLM for proof verification. Coming back to the general self-refinement concept shown in Figures 14 and 15, both figures show self-refinement with 2 iterations (the initial one and a refined answer). Of course, we can add more iterations to this process. It’s a classic inference-scaling trade-off: the more iterations we add, the more expensive it becomes to generate the answer, but the higher the overall accuracy. In the paper, the DeepSeek team used up to 8 iterations, and it looks like the accuracy hadn’t saturated yet. Figure 16: Additional self-refinement iterations improve accuracy. Annotated figure from the DeepSeekMath V2 paper . The Best@32 accuracy majority voting method is also known as “self-consistency” and is covered in Chapter 4 of my Build a Reasoning Model (From Scratch) book . 6.
DeepSeek V3.2 (Dec 1, 2025) The reason why we spent so much time on DeepSeekMath V2 in the previous section is that a) it’s a very interesting proof of concept that pushes the idea of Reinforcement Learning with Verifiable Rewards (RLVR) further with self-verification and self-refinement techniques, and b) the self-verification and self-refinement techniques are used in DeepSeek V3.2 as well. But before we get to this part, let’s start with a general overview of DeepSeek V3.2. This model is a big deal because it performs really well compared to current flagship models. Figure 17: Benchmark comparison between DeepSeek V3.2 and proprietary flagship models. This is an annotated figure from the DeepSeek V3.2 report . Similar to several other DeepSeek models, V3.2 comes with a nice technical report , which I will discuss in the next sections. 6.1 DeepSeek V3.2 Architecture The main motivation for this model is, of course, to improve overall model performance. For instance, like DeepSeekMath V2, it achieves gold-level performance on math benchmarks. However, the model is also trained with tool-use in mind and performs well on other tasks, for instance, code and agentic tasks. At the same time, the DeepSeek team writes about computational efficiency as a big, motivating factor. That’s why they use the Multi-Head Latent Attention (MLA) mechanism from V2 and V3 together with the DeepSeek Sparse Attention (DSA) mechanism, which they added in V3.2-Exp. In fact, the paper says that “DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp,” which we discussed in an earlier section. Figure 18: The DeepSeek V3.2 architecture. As I mentioned earlier, the DeepSeek V3.2-Exp release was likely intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model. Figure 19: Inference cost savings thanks to DeepSeek Sparse Attention (DSA). Annotated figure from the DeepSeek V3.2 report .
Interestingly, as the screenshot from the paper above shows, the DeepSeek team reverted to using NVIDIA chips (after they allegedly experimented with model training on chips from Huawei). Since the architecture is the same as that of DeepSeek V3.2-Exp, the interesting details lie in the training methods, which we will discuss in the next sections. 6.2 Reinforcement Learning Updates Overall, the DeepSeek team adopts the Reinforcement Learning with Verifiable Rewards (RLVR) procedure using the Group Relative Policy Optimization (GRPO) algorithm similar to DeepSeek R1. However, there are some interesting updates to discuss. Originally, DeepSeek R1 used a format reward (to make sure the answer is properly formatted); a language consistency reward (so that the model doesn’t alternate between different languages when writing its response); and the main verifier reward (whether the answer, in a math or code problem, is correct or not).

Zero Gradient Signal Filtering: We remove groups of instances whose rewards are all identical (that is, a batch with zero standard deviation in their advantage) to avoid training on samples that provide zero gradient, similar to DAPO (Yu et al., 2025). [DAPO]

Active Sampling: We maintain a consistent batch size in spite of zero gradient filtering with a novel, more efficient version of dynamic sampling (Yu et al., 2025). See OlmoRL Infra for details. [DAPO]

Token-level loss: We use a token-level loss to normalize the loss by the total number of tokens across the batch (Yu et al., 2025), rather than per-sample to avoid a length bias. [DAPO]

No KL Loss: We remove the KL loss as a common practice (GLM-4.5 Team et al., 2025; Yu et al., 2025; Liu et al., 2025b) as it allows less restricted policy updates, and removing it does not lead to over-optimization or destabilized training. [DAPO and Dr.
GRPO]

Clip Higher: We set the upper-bound clipping term in the loss to a slightly higher value than the lower bound to enable larger updates on tokens, as proposed by Yu et al. (2025). [DAPO]

Truncated Importance Sampling: To adjust for differences between log probabilities from the inference and training engines, we multiply the loss by the truncated importance sampling ratio, following Yao et al. (2025).

No standard deviation normalization: When calculating advantage, we do not normalize by the standard deviation of the group, following Liu et al. (2025b). This removes a difficulty bias, where questions with low standard deviation in their rewards (for example, too hard or too easy) have their advantages significantly increased by the normalization term. [Dr. GRPO]

Domain‑specific KL strengths (including zero for math): Instead of always dropping KL like DAPO and Dr. GRPO do for math‑style RL, DeepSeek V3.2 keeps a KL term in the objective but tunes its weight per domain. However, they also note that very weak or even zero KL often works best for mathematics. (But instead of removing it completely, it becomes a hyperparameter.)

Unbiased KL estimate: As mentioned above, DeepSeek V3.2 doesn’t remove the KL penalty. And in addition to treating it as a tuning knob, they propose a fix to how the KL penalty is estimated in GRPO by reweighting the KL term with the same importance ratio used for the main loss, so the KL gradient actually matches the fact that samples come from the old policy rather than the current one.

Off‑policy sequence masking: When they reuse rollout data (rollout is simply jargon for the full sequence the model generates) across many gradient steps, DeepSeek V3.2 measures how far the current policy has drifted from the rollout policy on each full answer and simply drops those sequences that both have negative advantage and are “too off‑policy”. So, this prevents the model from learning from overly off‑policy or stale data.
Keep routing for MoE models: For the Mixture‑of‑Experts backbone, they log which experts were activated during rollout and force the same routing pattern during training, so gradient updates go to those experts that produced the sampled answers.

Keep sampling mask for top‑p / top‑k: When rollouts use top‑p or top‑k sampling, DeepSeek V3.2 stores the selection mask and reapplies it when computing the GRPO loss and KL, so the action space at training time matches what was actually available during sampling.

Keep original GRPO advantage normalization: Dr. GRPO shows that GRPO’s length and per‑group standard‑deviation normalization terms bias optimization toward overly long incorrect answers and over‑weight very easy or very hard questions. Dr. GRPO fixes this by removing both terms and going back to an unbiased PPO‑style objective. In contrast, DAPO moves to a token‑level loss that also changes how long vs short answers are weighted. DeepSeek V3.2, however, keeps the original GRPO normalization and instead focuses on other fixes, such as those above.

Figure 20: The “extended-thinking” Speciale model achieves higher accuracy but also generates more tokens.

7. Conclusion In this article, I didn’t cover all the nitty-gritty details of the DeepSeek V3.2 training approach, but I hope the comparison with previous DeepSeek models helps clarify the main points and innovations.
In short, the interesting takeaways are: DeepSeek V3.2 uses a similar architecture to all its predecessors since DeepSeek V3; The main architecture tweak is that they added the sparse attention mechanism from DeepSeek V3.2-Exp to improve efficiency; To improve math performance, they adopted the self-verification approach from DeepSeekMath V2; There are several improvements to the training pipeline, for example, GRPO stability updates (note the paper goes into several other aspects around distillation, long-context training, integration of tool-use similar to gpt-oss, which we did not cover in this article).
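To make the GRPO-related takeaways above a bit more concrete, here is a toy, pure-Python sketch of three of the RL mechanics discussed in section 6.2: group-relative advantages from verifiable rewards, zero-gradient group filtering, and truncated importance sampling. All numbers and the answer-extraction heuristic are made up for illustration, not taken from the paper:

```python
import statistics

# Hypothetical verifiable reward: 1.0 if the extracted final answer
# matches the reference, else 0.0 (real verifiers parse or execute).
def verifiable_reward(response: str, reference: str) -> float:
    return 1.0 if response.strip().split()[-1] == reference else 0.0

# GRPO-style group-relative advantage: center rewards by the group
# mean. Following the Dr. GRPO variant mentioned above, this version
# skips dividing by the group standard deviation, which would inflate
# advantages on too-easy or too-hard questions.
def group_advantages(rewards):
    mean = statistics.mean(rewards)
    return [r - mean for r in rewards]

# Zero-gradient filtering: groups where every response got the same
# reward have all-zero advantages, so they are dropped from the batch.
def filter_zero_gradient(reward_groups):
    return [g for g in reward_groups if max(g) != min(g)]

# Truncated importance sampling: cap the ratio between training-engine
# and inference-engine token probabilities (cap is a made-up value).
def truncated_is_ratio(p_train: float, p_infer: float, cap: float = 2.0) -> float:
    return min(p_train / p_infer, cap)

group = ["the answer is 42", "the answer is 41",
         "the answer is 42", "the answer is 7"]
rewards = [verifiable_reward(r, "42") for r in group]
print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))  # [0.5, -0.5, 0.5, -0.5]
print(filter_zero_gradient([[1.0, 1.0], [0.0, 0.0], rewards]))
print(truncated_is_ratio(0.9, 0.1))  # 2.0 (uncapped ratio would be ~9)
```

None of this is DeepSeek's actual training code, but it shows why an all-correct or all-wrong group carries no learning signal, and why the importance-ratio cap limits how hard any one stale sample can pull on the policy.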

Stratechery 1 week ago

AWS re:Invent, Agents for AWS, Nova Forge

AWS re:Invent sought to present AI solutions in the spirit of AWS' original impact on startups; the real targets may be the startups from that era, not the current one.

Harper Reed 1 week ago

Getting Claude Code to do my emails

Over the last week or so, I have been using Claude Code to help me with some email, and scheduling. It started cuz the holidays are overwhelming, and I felt like I was constantly behind. My inbox was overflowing with everything I had deemed important, and I hadn’t been able to make a dent. It was stressful. It still is! Maybe a storm?, Ricoh GRiiix, 11/2025 I had just seen the zo.computer launch (neat project!) and it reminded me that Pipedream has this wild MCP server that you can use to connect to literally anything Pipedream supports. This means I could use it to do my emails! Problem solved. Problem created. More problems created! WHY ARE WE COUNTING PROBLEMS! I love email. I really do. Typically I am really great at managing my email. However, when the email gets overwhelming - I end up ignoring it. Which means it gets worse. Which means everyone loses. I just need a way to start doing my email that is pretty low pressure. Using Claude Code has made it MUCH easier. Basically, I just have Claude Code check my email, and then it pops out a message like “your brother emailed asking about thanksgiving plans” and I say “cool. Tell him we will be there, and will bring turkey juice or whatever you call stuffing” and then Claude Code will write an email that is approximately what I said but in the style it found from my past emails. I then specifically save it as a draft that I review heavily before sending. I trust these agents to write code way way more than I trust them to write an email to a friend, stranger or business partner. It is pretty close, but not quite close enough to go yolo mode and let it send. Maybe soon? Or maybe a different email address? Who knows. It works remarkably well. However, there are some gotchas. A while back I was prototyping an email triaging agent with some code I had written, and it was working well but not great. A friend connected me to a person who was looking for someone to write a book about AI.
The agent was like “this person wants to talk about writing a book about ai, they want a skeptical and academic perspective about AI’s impact” and I was like “I LOVE THIS. But this isn’t for me. I have some friends that would be good.” However, there was a bug, and the agent was drafting replies before getting feedback from me. And it ended up sending the previous draft that said very formally: “I would love to do this.” The person excitedly replied and was like “let’s do this.” That is when I found out that I fucked up. So I replied back and was like “well. My agents said yes for me before talking. Here is what I meant to say” and that person rightfully replied “ F U ” This is why I am cautiously doing it again. Haha. This time I am way more careful, and the agent cannot send email only draft it. I am finding that I am editing every email - but only a little, and maybe less and less. Feels like a year ago with codegen. Thus far it works pretty well. It allows me to get through a lot of email that I normally would ignore, giving me space to focus on the emails I really want to reply to. Basically it cleans up the cruft (vendors, services, etc) and allows me to hang out with my friends. Perfect AI usage. I recommend it. I didn’t want to start fresh every time so I built a super simple Claude Code “directory.” It has most of the things that you will need to handle triaging your inbox. There are a few important parts: This is highly personal (how do you manage YOUR inbox), and very important. The first thing I did was have Claude go through the past couple hundred emails I have sent, and develop a vibe for how I write emails. After a bit of back and forth, we have this: Then I went through and did a bunch of emails via Claude Code. It did ok. But I was able to coach it, and once it was in a good place I had it make a skill based on what we discovered together. Having Claude build its own skills is clutch. You really need to iterate to make it happen. 
I am still mad about MCP as a concept, but not mad enough not to use it. The goal is to simply give Claude Code a suite of tools that allows it to do its job well. The Pipedream MCP is very straight forward. You essentially add it to your Claude Code (works in other clients too), and then it will pop you through their auth. Once auth’d you add various services. Those services are then exposed as MCP tools. These are clutch. You can then use Claude to wire together some workflows. I currently use their Gmail, Google Calendar, and Contacts connections. I created 3 simple MCP servers that I wanted to exist: a super straight forward todo tracker on the CLI. The plan is to build out support for various backends. Currently it is just local. It works as a cli app or as an mcp server: You can try it out here: a log for agent actions. I want my agents to log what they have been up to! You can try it out here: a misspelled crm that acts as my personal agent backend. This allows me to have a reasonable understanding of where I am from a comms, etc standpoint You can try it out here: The robust skill, the claude.md and the MCP tools make this a pretty easy and helpful system for triaging email. It is not perfect, but it does work nicely. I do recommend playing around with this. I would maybe be cautious about blindly trusting it. Lol. I made a simple plugin that should do this for you. All you gotta do is install it! Installation Now: Whether we like it or not, it appears that agentic email will be a thing. It is early enough that we will start to see people like myself building bespoke and custom experiences that largely do what a product will do. Somehow Google will launch a version directly in gmail, that somehow doesn’t work. My guess is that the best versions will be like Mimestream or superhuman that are primarily agentic. I hope it isn’t primarily chat - but we shall see. I do recommend playing with this.
Especially if you have a lot of email that you need to take care of. I think of it as clearing brush. You don’t want to fuck up the flowers (all of you. you are my flowers), but you don’t mind cutting down the weeds (all of them! You can see the random emails about business loans lurking in the corners..)

My emails are going through Pipedream, and Anthropic. This is not ideal. It is obviously a privacy concern. I can’t wait to run these things locally, and maybe have an MCP server that interacts directly with Google Suite.

[Photo: Not even once, Leica Q, 11/2017]

Giving your agents access to things that affect other people is scary and should be done with caution. It works pretty well for me, but I did totally fuck up a few situations trying this out. My inbox is pristine. DO NOT SEND ME EMAILS! IT IS BEAUTIFUL!

My basic loop:

- I boot up a Claude Code session hooked to the proper MCP servers.
- I ask it to check my email.
- It tells me what email is in my inbox that needs addressing (unread, then read email).
- It either offers to start fixing emails, or just starts writing drafts intelligently (checks my calendar, searches for context, etc.).
- It says “We are done!”
- I go to my email client, and check the drafts. Edit and send most of them cuz they are great. Reject a few.
- Rinse. Repeat.

Repos:

- harperreed/toki
- harperreed/chronicle
- harperreed/pagen

Giles's blog, 1 week ago

Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090

Having worked through the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware? The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series , I did some naive scaling of numbers I'd got when fine-tuning LLMs and came to the conclusion that it would be impossible in a reasonable time. But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC. Additionally, Andrej Karpathy recently announced nanochat , "the best ChatGPT that $100 can buy". He mentions on the main page that he's trained a model called , with 32 Transformer layers, which has 1.9B parameters, for about $800. His smaller 20-layer model, with 561M parameters, he says should be trainable in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the $100 total price. What's even more interesting about nanochat is that it's built with PyTorch; initially I'd got the impression that it was based on his pure C/CUDA , which I would imagine would give a huge speedup. But no -- he's using the same stack as I have been in this series! Karpathy's models are both larger than 163M parameters, so it definitely sounded like this might be doable. 
Obviously, I'm nowhere near as experienced an AI developer as he is, and he's using a larger machine (8 GPUs, and each of them has > 3x more VRAM than mine), but he's also including the time to train a tokeniser and instruction fine-tune into that four hours -- and his smaller model is more than three times larger than mine. So that should all help.

This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project. But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2-small-sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs. Here's the full story.

For this project, I want to use the exact same model code as Raschka presented in the LLM from scratch book -- my copy here. There have been a number of architectural improvements to LLMs since GPT-2, but for now it's best to keep things simple. But there are still some settings to decide on. The config dictionary for the models we've been using has these parameters:

There's also the aspect of weight-tying -- the original GPT-2 reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformer layer into vocab space to get the logits. There's nothing in the code we've been working with to enforce that, though -- when we do our small train in the book, we're using independent weights for each of those steps. The only time it is "enforced" is when we download the pretrained weights from OpenAI, where we put the same values into both the embedding matrix and the final output head. Given that Raschka says it's in general better to avoid weight-tying, and that actually doing it would be harder than not doing it, it seems a no-brainer not to do it. So, what does that mean about our model?
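As a sanity check on the parameter count, the figure can be reproduced with back-of-envelope arithmetic. This is a sketch assuming the book's defaults (no qkv biases, a bias-free untied output head, learned positional embeddings):

```python
# Back-of-envelope parameter count for a GPT-2-small-shaped model with
# qkv_bias=False and an untied, bias-free output head.
V, ctx_len, d, n_layers, d_ff = 50_257, 1_024, 768, 12, 4 * 768

tok_emb = V * d                 # token embedding matrix
pos_emb = ctx_len * d           # learned positional embeddings
per_layer = (
    3 * d * d                   # W_q, W_k, W_v (no biases)
    + d * d + d                 # attention output projection (with bias)
    + d * d_ff + d_ff           # MLP up-projection
    + d_ff * d + d              # MLP down-projection
    + 2 * 2 * d                 # two LayerNorms (scale + shift each)
)
final_ln = 2 * d
out_head = d * V                # untied output head, no bias

total = tok_emb + pos_emb + n_layers * per_layer + final_ln + out_head
print(f"{total:,}")             # 163,009,536
```

The total only comes out at exactly 163,009,536 with the qkv biases off, which matches the book's default config.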
That matches what we got when working through the book; 163M parameters. Can we train it? It seems like every AI project starts with the question "what data can we use?" The original report on GPT-2, " Language Models are Unsupervised Multitask Learners ", is frustratingly lacking in details. However, it does say that they trained it on "8 million documents for a total of 40 GB of text". Now, according to OpenAI , it's reasonable to assume roughly four characters per token for typical English text. So 40 GB of text is ~10 billion tokens. That data was essentially gathered by scraping pages linked from Reddit that had more than three upvotes there, so was reasonably high quality. Can we get something similar? Conveniently, Hugging Face host a big dataset called FineWeb , and that has a 10 billion token "sample" dataset, randomly selected from the full 18.5 trillion tokens. So the sample feels like it's order-of-magnitude right. And while reading more about Karpathy's nanochat, I spotted that it uses FineWeb-Edu , which is a version of FineWeb that contains "only the most educational web pages". I wrote a script to download both of those , and kicked it off. It took about 20 minutes for each one (slow wifi in my study, I was getting < 5MB/s); FineWeb's 10B sample took up about 29 GiB, and FineWeb-Edu's about 27 GiB. Time to take a look at them. The Hugging Face function loads up all of the files you provide, and you can tell it how to split them up into train/validation/test sets. This command just loads up the whole FineWeb one and says "treat it all as the train split", which is good enough for now: Yikes. It took 1 minute, 53 seconds to generate the train split. However, that appears to be a one-off cost -- when I accessed it again later using the same code in a different Python session, it just did the second "Loading dataset shards" portion, taking three seconds, not the generation of the split. Presumably it caches it. 
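The load command being described is roughly the following -- a sketch, where the parquet glob is an assumption about where the download script put the files:

```python
from datasets import load_dataset

# Load every downloaded FineWeb parquet file and treat it all as one
# "train" split (the data_files path is an assumption, not my exact layout).
fineweb = load_dataset(
    "parquet",
    data_files={"train": "data/fineweb-10BT/*.parquet"},
    split="train",
)
```

The first call pays the one-off split-generation cost; later sessions hit the on-disk cache.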
Anyway, let's see what's in it: Great, so we have 14,868,862 rows, each of which has various bits of information. Checking the first one's text: Well, for FineWeb, that doesn't look particularly "fine", but I guess it's better than the stuff that Karpathy talked about in his recent interview with Dwarkesh Patel : When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet Let's take a look at FineWeb-Edu. That looks a lot better! Now let's take a look at the document lengths in terms of tokens. There's a column, but I don't know which tokeniser that's for, so to be safe we'll calculate it ourselves. How long would it take to tokenise every row in FineWeb 10B to check? Let's tokenise the first 10,000 of the 14,868,862 that we have, and see how long that would take -- then we can work out the estimated time for the whole thing. 2,160 seconds or about 36 minutes. Yikes! After a bit of digging, though, I found that tokenisers can handle batches (poorly documented, but it's there in the source ): Also, we can map a function over an entire HF dataset, and that can be made to run with multiple processes. So, we can combine the two: Just over three minutes, not too bad! (The reason the command count above jumps from 47 to 53 was that in the first run I didn't have the in there -- one of the rows in the dataset had in it, and the tokenizer rejected it. I'm going to play fast and loose and ignore that for now.) Now let's see how it added it: Cool! We've added a column with the number of GPT-2 tokens for each row, and we can extract what amounts to a list of those values. Let's plot them as a histogram. 
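For reference, the batched, multi-process map described above might look roughly like this. It's a sketch: the tokeniser class, the new column name, and the process count are assumptions, not the exact code:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(batch):
    # One batched encode call is far cheaper than encoding row by row.
    ids = tokenizer(batch["text"])["input_ids"]
    return {"n_gpt2_tokens": [len(seq) for seq in ids]}

dataset = dataset.map(count_tokens, batched=True, num_proc=8)
```

`batched=True` hands the function lists of rows instead of single rows, and `num_proc` fans the work out over multiple worker processes.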
Trying to do it directly -- that is, just doing ...seems to make MatPlotLib very unhappy, and my interpreter crashed with an OOM -- I think it might be trying to load all of the dataset -- text, IDs, etc -- into RAM in one go. So I started a fresh one and did the stuff to load it and annotate it with token lengths again -- weirdly, this time the mapping only took 10 seconds or so! That was strange, I'll need to look into that. Perhaps the earlier command added the column to the files on disk? To work around the memory issue, I converted the column from the dataset to an actual list: That took ten or twenty seconds. Let's then try the plot again (full code this time): That took about 11s to run, and the result is this: That's really promising! The bulk of them are less than our 1,024 token sequence length. 1 If we present each row in the dataset as a stand-alone training sample, cropping them when necessary, perhaps we won't lose too much data? Let's see. First step, how many tokens are there in total? Nice, about 10B, as expected. How many tokens would we have if we cropped them to the default GPT-2 context length of 1,024? Ouch, 7.3B. That's quite a reduction: So we're losing 29% of our tokens by that cropping. That's from curtailing just 16% of the sequences: That's not great. I feel that we have two options here: At this point in the experiment, I'm going to keep both options open. I'm inclined towards the latter (I believe it's closer to what the real GPT-2 train did), but I'm not sure. Anyway, we're scoping things out here, so let's move on. After looking at the data, I've thought a bit more about this. I'd previously been thinking in terms of training across all of the tokens in the dataset; we'd work our way through the 10B tokens, and then we'd be done. 
But when training a model, you do multiple epochs, normally -- you run through the dataset once, updating your gradients as you go, then run through it again likewise, and eventually you stop when your validation loss starts rising. I think that because I'd read that LLMs are normally trained on just one epoch these days, I'd kind of internalised that we only need to do one. But it wasn't the case in 2019 when GPT-2 came out. They had less data -- just 10B tokens or so, compared to insanely huge datasets like the full FineWeb (not the 10B one we've been looking at -- the 18.5T full one), so they would have trained it for some number of epochs. How many? That's another case where the GPT-2 paper is annoyingly light. This report says in the "Replicating GPT-2" section that OpenAI trained it for 800k iterations with a batch size of 512. Plugging in a sequence length of 1024, that gives us this many tokens: Over 419B tokens! Now, if we believe that their dataset was 10B tokens, then we can work out how many epochs that came to: The same report says that they -- as in, the report authors -- make that "around a total of 60 epochs through the training set" -- I believe that the training set they're talking about could well be slightly shorter than the original GPT-2 one -- the GPT-2 authors didn't release their own, which is called "WebText", so the report's author is using a different one that tries to replicate it, OpenWebText . That sounds expensive; even without knowing how many tokens per second we can train for, 40-odd epochs of 10B tokens each sounds like it would take a long time. Are there any other comparison points that might tell us how long to train for? Well, there's a "Chinchilla heuristic" that I've heard of, which says that you should train on about 20 tokens per model parameter. 
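That token count, and the implied number of epochs over a ~10B-token dataset, are easy to reproduce (pure arithmetic from the report's 800k iterations at batch size 512 and sequence length 1,024):

```python
# GPT-2's reported training volume, per the replication report.
iterations, batch_size, seq_len = 800_000, 512, 1_024
total_tokens = iterations * batch_size * seq_len
print(f"{total_tokens:,}")              # 419,430,400,000 -- over 419B
print(total_tokens / 10_000_000_000)    # ~41.9 epochs over a 10B-token set
```

That lands at "40-odd" epochs against a 10B-token corpus; the report's "around 60 epochs" figure comes from measuring against its somewhat smaller OpenWebText replica instead.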
I spent some time reading into where that comes from; originally it's in " Training Compute-Optimal Large Language Models " from Google DeepMind, and it's an interesting paper, and is surprisingly easy to read, with a few bits of maths that get a bit hairy (but aren't required to get a good-enough feel for what they're saying). I recommend you take a look. It was written in 2022, and the authors felt that people were scaling up models a lot, but weren't increasing the number of tokens that they used for training enough. So, they trained a huge number of models, trying to answer the question: "given a particular budget in training FLOPs, what is the optimal balance of training tokens versus parameters to make sure you're using those FLOPs most efficiently?". They were arguing against the method taken in a particular paper, where another team had trained a model (called Gopher) on significantly fewer tokens than they thought optimal. The number of FLOPs used to train a model is linear with both the number of parameters and the number of tokens you train it on, so if you get 2x the number of FLOPs that you had before, you can either train the same model on twice as many tokens, or you can double its size. Which is better? Their conclusion was that you should actually scale both parameters and tokens up by the same amount -- that is, in the 2x case you'd want to have 2 times both the parameters and tokens, which would double your FLOPs and get you better performance. As you can probably see, by doing this they indirectly worked out an optimal number of tokens to train a particular size of model for. They don't state the "20x" heuristic themselves, but it's pretty clear in table 3 in the paper, where they give a number of model sizes and the optimal number of tokens for each. 
Now, this number is not the number of tokens you need to train for to get the best model you can for a particular number of parameters; a model of a given size can always be trained more and will (hopefully) get better. But it tells you when you've trained on enough tokens that you could get better results by training a larger model than you have right now. They're implicitly assuming that models can get as large as you want, which of course is not the case -- in reality, you're going to be targeting a particular model size, the size that can fit on your training hardware (or more likely with production models, the size that can fit on your planned inference hardware). But interestingly, looking at the README.md for Karpathy's nanochat project, he trained his 1.9B "d32" model on 38B tokens -- exactly 20x. And if you look at the script in the same repo, he explicitly says that he's training for 20x parameters for the smaller model: If Andrej Karpathy thinks that training for Chinchilla-optimality is the right way to go, then who am I to disagree? ;-) More seriously, perhaps the better quality of the dataset makes this a reasonable thing to do. From the GPT-2 paper, their description of how they got the data: ...we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. That's a clever trick, but I believe that FineWeb is much more carefully filtered and improved than the WebText dataset they got from that. Back in 2019, they had to do everything from scratch -- find appropriate ways to get data, filter it, and so on. Now we can just download stuff from Hugging Face. 
So maybe Chinchilla-optimal is enough. Anyway, we have 163,009,536 parameters, so on that basis, let's train for: ...tokens. (I'll just use 3.2B from now on, but that's the actual number I mean.) That's pretty cool! We have more than that number of tokens already in our FineWeb 10B sample, so we can do a single-epoch training run.

So the question is -- is that even doable on my hardware? It all hinges on how many tokens per second we can train at. A good way to check this is to write a throwaway "trainer". We can use that to work out what our maximum batch size is on the RTX 3090's 24 GiB of VRAM, then run a bunch of batches through -- a forward and backward pass for each -- and see how many tokens per second we get. This won't estimate how much time we'll spend validating the model, of course. But my gut is telling me that we should spend no more than 5% of our training time running validations, so we can later on do a similar test in eval mode -- forward pass only, no gradient tracking -- and use that to work out how many tokens should be in the validation set.

So, let's estimate training speed. This code gets an estimate of tokens/second at different batch sizes. Hopefully it's clear enough to not need an in-depth explanation. An outline: Here's what it prints out: So we can see that it gets faster as we increase the batch size, which makes sense because we're handling sequences in parallel, but it does flatten off a bit, which makes sense because there's a limit to how much parallelism we can do, even on a GPU. Let's see how that fits in with the different training sizes we looked at above: OK. We're definitely not going to be able to train this thing the GPT-2 way! I expected that to be the case, but now we have a solid proof of that. But the three-day Chinchilla-optimal train actually sounds doable! I'm heading to London to visit family soon, so won't be using my home PC.
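The arithmetic for that target is a one-liner -- 20 tokens per parameter (the exact figure, 3,260,190,720, also appears later in the post):

```python
# Chinchilla-style heuristic: ~20 training tokens per model parameter.
params = 163_009_536
chinchilla_tokens = 20 * params
print(f"{chinchilla_tokens:,}")   # 3,260,190,720 -- the "3.2B"
```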
With a bit of help from Tailscale I'll be able to log into it from my laptop, though, so I can potentially nurse a run through. Can we make it any faster? Now, when doing the fine-tuning work, I found that you could generally speed things up by doing everything in 16-bit rather than 32-bit. Intuitively that makes sense -- lower-precision numbers, fewer bits, means less work for the GPU doing the various multiplications and additions that are involved in our train. Working with ChatGPT, I found a couple of ways to take advantage of that. Firstly, using TF32. The normal float32 format uses 8 bits for the exponent, and 23 for the mantissa. If you haven't looked into how floats are represented in memory (or if you've forgotten), that means that, using m to mean the mantissa and x the exponent, the numbers are represented in memory as TF32 is messier; it has the same exponent size -- and thus the same range -- as float32, but it essentially ignores the lower 13 bits of the mantissa. So it takes up the same amount of memory, but is lower-precision, which means that calculations can be faster. Most importantly, cards like the RTX 3090 have dedicated "tensor cores" -- as opposed to the normal CUDA cores that do normal matrix multiplications -- and they operate in TF32. Unsurprisingly, "TF32" is "tensor float 32-bit". The PyTorch allows you to tell it what precision to use for matrix multiplications; the default is , which means "use float32 all of the time", so you're stuck using just the CUDA cores. If, instead, you set it to , then it will use TF32 if the hardware supports it and it has the appropriate kernels available. So that will let us use the tensor cores. I added this to the code above just above the loop over the different batch sizes: Let it run, and: That's a 22% speedup! Of course, the precision of the training isn't as good. 
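The switch itself is a single call -- this is the standard PyTorch API, shown here as a sketch rather than the post's exact snippet:

```python
import torch

# "high" allows TF32 tensor-core matmuls where the hardware supports them;
# the default, "highest", sticks to full float32 on the CUDA cores.
torch.set_float32_matmul_precision("high")
```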
But given that many modern models are trained at 16-bit (I've seen suggestions that some are even trained as low as 4-bit) then that shouldn't matter. Let's see whether we can train in 16-bit instead. PyTorch has a smart mode where you can tell it "use 16-bit where it makes sense, otherwise use 32-bit" -- AMP, which stands for "Automatic Mixed Precision". There's a great recipe for how to use it in the docs , so let's use that. We need to create a object to handle scaling parameters from 16-bit to 32-bit as needed -- we can re-use that across all batch sizes so we can create it just before the loop: ...then we need to replace this core part of our training loop: ...with some code to use AMP and that scaler -- basically we use a context manager to switch it on when we're doing the forward pass and work out the loss, and then use the scaler to manage the backward pass and the optimiser's step: Running that gives us these results: Wow! With that we can train on 3.2B tokens in about 160,000 seconds, which is 44 hours. That's definitely doable. Now, what happens if we remove the ...so that we're using AMP, but not the tensor cores? It's basically the same. 300tps slower at the start, down to 70 at the end. Still, it looks better to keep the "high" precision in place, rather than the "highest". Right. We have the beginnings of a training loop that should be able to let us run a Chinchilla-optimal train on a GPT-2 small sized model in 44 hours, and I have the time to do it. And it looks like a batch size of six is what we can fit into the RTX 3090's 24 GiB of VRAM. What else are we going to need to build something to do this? If I want to do a long training run, then stuff might go wrong -- it might crash for some reason. So we're going to need to save checkpoints as we go and be able to restart training from those checkpoints. In those, we're going to need to save the model and the optimiser's state, plus some kind of info about how far through the dataset we are. 
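The AMP change described above -- a scaler plus an autocast context around the forward pass -- follows the standard PyTorch recipe. As a sketch, with `model`, `optimizer`, and the batch tensors standing in for the real training-loop variables:

```python
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in batches:
    optimizer.zero_grad()
    # Forward pass and loss computed in mixed precision...
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), targets.flatten()
        )
    # ...then scale the loss so float16 gradients don't underflow,
    # and let the scaler manage the optimiser step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The scaler multiplies the loss up before the backward pass so that small gradients survive float16, then unscales before the optimiser step.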
We should keep training and validation losses too, so that we can easily chart and recover our progress, and according to this forum post we're going to need to save the scaler (which makes me think that it actually has state in it, so we probably should have used a fresh scaler for each batch size in the above -- let's hope that doesn't prove to be a problem [note from later: it wasn't]). I wrote a script to create a model, train it for a bit, and then dump out all of that apart from the metadata (which I reckon is going to be less than 1kB). I wanted to use the safetensors format for all of it, but unfortunately I couldn't get it to work for the optimiser or the scaler, so had to use for those (which I don't like because it uses pickle , which introduces serious problems if you ever want to move files from machine to machine, as the Python and library versions need to match perfectly). Ah well. Here's what the test checkpoint looks like: That's huge! And it's almost all the optimiser. From what I read, that stores two numbers per parameter, so it makes sense that it's double the size of the model weights. And at 32-bit, 4 bytes per param, then 670MiB for the model is sane. Timing-wise, it takes about a second to save, the same to load, so that's fine. So that sounds reasonable in terms of timing, and disk space is pretty high, but not so huge that it can't be managed with careful planning -- don't checkpoint so much that we run out of disk during the train (I have a 2TiB disk, but it's far from empty). It's probably worth double-checking that it works, though! Because my checkpoint test already did some training, I changed it so that it does this: Looks sane! The numbers for loss are the same before and after, so I think it's vanishingly implausible that the checkpoint we restored is different from the one we saved. And the continued training seems to be working -- at least, loss is going down -- so that sounds reasonable too. 
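The mixed save described above can be sketched like this -- safetensors for the weights, `torch.save` for the optimiser and scaler state (the paths and variable names are assumptions):

```python
import torch
from safetensors.torch import save_file

# Model weights in safetensors; optimiser and scaler state via torch.save,
# which uses pickle -- hence the version-matching caveat in the text.
save_file(model.state_dict(), "checkpoint/model.safetensors")
torch.save(optimizer.state_dict(), "checkpoint/optimizer.pt")
torch.save(scaler.state_dict(), "checkpoint/scaler.pt")
```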
OK, so, again, the time taken to checkpoint is negligible, but the disk space isn't. I reckon we can comfortably do 100 checkpoints over the train. That's roughly one every half-hour over 44 hours. We're going to want to do a validation run each time we checkpoint, so let's think about that next. How big should our validation set be? Let's say we only want to spend 5m per checkpoint period doing validation. How many batches can we get through in that time? I wrote a simple script to run a model (after a few hundred training steps) in eval mode on different numbers of iterations to see how long each one took. It used the same trick as the training loop above in order to use mixed precision, and I ran it with instead of the that I've used in the past (ChatGPT tells me it's a little faster). I also put in some calls to around the loop that I was timing, which should apparently help make sure that the numbers are precise. The code is here if you'd like to take a look. After some fiddling with the min/max numbers at the top: OK, so let's call it 3200. That's 3200 * 6 * 1024 tokens = 19,660,800 tokens. That's about 0.006144 of our training set. Pretty low, but we're talking about such a large training set that I think we're OK. And practically we can't do more -- we're already talking about 5 mins every half-hour, so we're bumping up our train time by 88 * 5 = 440 minutes, which is seven hours. Now let's start thinking about the datasets. We can split the HF thing into train and validation sets. I'm thinking it might be useful to load all of our training and validation data into RAM for the train loop. 3.2B tokens with four bytes per token should be about 13 GiB, after all, and I have 64 GiB RAM on the machine. ...but wait, int64 is the default for PyTorch for long ints -- that's what our token lists are in the original, and it's twice the size, so we're talking 26 GiB. I believe that PyTorch expects that format for the cross entropy loss. 
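The validation-budget arithmetic works out like this (batch size 6 and 1,024-token sequences from the earlier measurements; the 88 checkpoint periods follow from one validation every half-hour over a roughly 44-hour run):

```python
# How many tokens a 3,200-iteration validation pass covers, and what the
# per-checkpoint validation cost adds to the total run time.
val_iters, batch_size, seq_len = 3_200, 6, 1_024
val_tokens = val_iters * batch_size * seq_len
print(f"{val_tokens:,}")                  # 19,660,800
print(val_tokens / 3_200_000_000)         # 0.006144 of a 3.2B-token train set

periods = 88                              # ~2 per hour over ~44 hours
print(periods * 5 / 60)                   # ~7.3 extra hours at 5 min each
```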
That's not the end of the world, though -- we can store the data as int32 in RAM (with 50,257 as our vocab size we could even use uint16 if we wanted to) and then we'll need to make them the right type just before using them. We can do that when splatting them onto the GPU, e.g.

First thought, can we store them as a Python list? Turns out they're not all that memory-efficient, though: How about PyTorch tensors? Promising! (Though ChatGPT pointed out when reviewing a draft of this post that I was using the default rather than an type here. Still, it's the same size.) Let's measure memory usage in a new interpreter. Yup, 12,801,474,560, so about 12 GiB. Can we save it? OK, let's try reloading it in a fresh session: Nice.

So, I think we can write a quick script that splits our incoming dataset into say 99/1% train and validation, grabs the first 3.2B tokens from the training set, glomming them together into one big tensor with EOSes between them, and saves them, and then does likewise for the first 19,660,800 tokens from the validation set. We'll use FineWeb, with the possibility of switching to FineWeb-Edu later on. Doing it that way means that we're actually using the second of the two options I considered earlier: Treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. I thought it would be harder than concatenating/padding rows, but it actually turns out to be simple enough. Let's give it a go.

Here's the code. I wanted to have a round number of 6-sequence batches of 1,024 tokens each, so the number of training tokens worked out at ...rather than the strict Chinchilla-optimal 3,260,190,720, but that's no biggie. Running it takes 5m55s, and then: Looks about the right size -- 19M * 4 for val, 3.2B * 4 for train. Cool! Let's finally write our training script.
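The int64-versus-int32 sizing above is quick to check (using a nominal 3.2B tokens; the measured tensor came out at 12,801,474,560 bytes, which is 4 bytes per token ID):

```python
# RAM cost of holding ~3.2B token IDs at different integer widths.
tokens = 3_200_000_000
for dtype, nbytes in [("int64", 8), ("int32", 4)]:
    print(f"{dtype}: {tokens * nbytes / 1e9:.1f} GB")
# int64: 25.6 GB
# int32: 12.8 GB
```

When a batch is actually used, it can be widened back for the cross-entropy loss with something like `batch.to(device, dtype=torch.long)` (variable names assumed).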
You can see the full training script here -- note that this is the final version from the repo, so isn't exactly what I'm running at this point in the post. The checkpointing code is (sensibly enough) in a separate file, . It took two days to run, and... Both train and validation losses fall nicely! Training loss is a bit choppy, but that's because I erroneously only plotted the most recent iteration's training loss rather than an average over all iterations between the last and current validation run; the validation loss is correct because I did average all of the validation numbers. (The version of the code linked above fixes that error.) The best epoch for val loss is not the last one, but it was close. Looking at the last 5 iterations, their val losses were:

It's time to do some evals. Firstly, let's try the smoke test that we do in the book. What does our model think should come after the text "Every effort moves you"? With uninitialised weights we get gibberish, as expected. But with our best checkpoint we get this: Nice! The multiple mentions of protein are actually the kind of repetition that small models tend to do, so that's not bad news. Let's try with the last iteration's checkpoint: Also very nice, perhaps better! I think that both of those are qualitatively as good as the result we got when we loaded the pre-trained weights from OpenAI, which was: That's very reassuring.

But is there something a bit more quantitative that we can do? Firstly, can we compare it to anything in the GPT-2 paper? In figure 4 they give their perplexity against their train and test sets for the different model sizes; for the small one it's a bit over 16. Let's assume that they're basing that on natural logarithms, so they mean that they have a loss of ln 16. That's about 2.77, which is much lower than our best loss of 3.9401. However, that is across different datasets, so while it makes me suspect that their model is better than ours, we can't really say for sure either way.
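The loss-to-perplexity conversions used in this section are just exponentials of the cross-entropy loss (and a log to go back the other way); a quick check:

```python
import math

# The paper's ~16 perplexity read back as a natural-log loss:
print(round(math.log(16), 2))       # 2.77
# Our best validation loss, and the original weights' loss, as perplexities:
print(round(math.exp(3.9401), 1))   # 51.4
print(round(math.exp(3.50), 1))     # 33.1
```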
The cool thing is, though, that we have their model -- so we can actually run it against our dataset. I wrote a script called , and running it gives us this: Still better than ours :-( I considered doing the same thing against Qwen to see whether that was also better, but with a different tokeniser we couldn't really treat it as comparable. Loss and perplexity are both over next-token predictions, and if the meaning of "token" changes, then the numbers will change. 2

OK, so we have a model, but it's not as good as the original GPT-2 small. Our loss on our validation set is roughly 3.94, while the original weights get about 3.50. Expressing that in terms of perplexity gives our own model about 51.4, while the original has 33.1. That's actually still higher than the 16 that they had in the paper, which is interesting -- presumably it's related to the fact that they're validating over their own WebText test set rather than ours; they're both samples of web content, but there must be differences.

At this point, my guess is that this shows that all of that extra training that the OpenAI team did beyond the Chinchilla-optimal number of tokens did have a real benefit -- and that's not surprising. Remember that the Chinchilla paper is about the best way to spend a FLOPs budget. They're not saying that you can't drive down loss by continuing to train your model further -- of course you can. They're saying that when you pass the optimal number of tokens, you should increase the model parameters and the tokens by the same ratio, and by doing that you'll get the best balance.

But still, a Chinchilla-optimal model of 163M parameters might still be useful. What happens if we instruction fine-tune it like we did the original model in Chapter 7 of the book?
In that post and its followup, we used some training samples using the "Alpaca" one-shot question-answering format: ...to get a model; we then gave it a test set of questions in the same format, and used the Llama 3 7B model to judge the results on a scale of 0 to 100. We then averaged the results and got a plausible-looking indicator of how useful the model was, as compared to the more narrowly technical loss number.

One problem with that is that we ran those tests on the OpenAI weights for the medium-sized 355M-parameter GPT-2 model. If we don't want to be comparing apples to oranges, we'll need to re-run it on their weights for the small model. Let's see how we do. First, let's run it for five epochs just to see when/if it starts overfitting: OK, so two epochs looks like the right amount, just as it was with the medium model. So we can train for that (because I'm using the original code I wrote when working through the chapter, I didn't checkpoint during training -- but it takes less than a minute to run the whole thing, so no biggie). Here's the loss chart:

Validation loss at the end is 0.733, noticeably above the 0.649 that I got with the medium-sized model. And the sample outputs shown at the end aren't as good, either. With the medium-sized model, I got these: ...but with the small model (remember, this is with OpenAI's original weights) I get this: Definitely worse, especially the last one! Let's see what Llama 3 thinks of it, again using the code from the book: The medium model got an average of 50, so the OpenAI small model is definitely much worse, as the examples suggested. Makes sense.

Let's see how our own base model performs when fine-tuned on the same data. After a bit of fiddling I found that validation loss settled down at the end of epoch 10: (It's hard to see from the chart, but validation loss was actually very slowly dropping even after epoch 5.)
It's interesting that our own model took longer to train here, but it does make sense in terms of it being that little bit dumber. The samples it printed out at the end are also interesting: The simile is pretty good -- better, I think, than the one from the OpenAI original weights -- but the storm clouds one is dreadful. It's fascinating that they both chose the same wrong answer for "Pride and Prejudice" -- my guess is that it's because the training set contained this question: ...so both models picked up on Robert Frost being a useful author to reference in answers. Anyway, what does Llama 3 think of the output? Yup, it's dumber than the original weights -- but, at least to my mind, closer to the original weights' score than you might have thought based on that loss/perplexity number alone. But, on the other hand, I'm not convinced that Llama 3 7B is smart enough to be doing a good job. In the stuff the eval script printed out, we have this: This is clearly completely wrong: the mention of cumulonimbus comes from the dataset response, not the model response. Llama 3 7B is tripping up over what came from where, which is pretty normal for a small model. Of course, it's possible that the answers from the OpenAI GPT-2 small weights were also rated higher than they deserve -- or, indeed, that there were right answers that were incorrectly judged wrong. Conceivably it averages out. But there's no reason to assume it would, so it's essentially noise and is making the results less useful. Let's try using a much smarter LLM as a judge and run both models' responses through it -- the just-released OpenAI GPT-5.1 model. The code is here . Running that against our own model's answers: ...and against the model fine-tuned from the small OpenAI weights: ...and, of course, it didn't make the mistake of confusing the dataset response with the model's in any of the cases printed out. 
ChatGPT 5.1 in the chat interface is very smart, so I expect these results are much closer to a reasonable ground truth. Out of interest, what does it make of the model based on the GPT-2 medium weights that we trained as part of the book? That's as compared to an average of about 50 from Llama 3 7B. It seems like GPT 5.1 is a tougher judge than the small local model -- and my guess is that that's because it's more accurate. 3 Anyway, the ranking remains the same; after fine-tuning on the same Alpaca dataset, GPT-2 medium > GPT-2 small > our model. But it's still a relatively close-run thing between our model and GPT-2 small. Can we close the gap without vast amounts of extra training? The results so far were from using 3.2B tokens of the FineWeb 10B corpus. Now, as I noted at the start of this post, Andrej Karpathy's nanochat project uses FineWeb-Edu, a separate corpus designed to be really informative. Indeed, back at the start when we were looking at the two datasets, the first row in the Edu dataset was about Jane Austen, so maybe we would wind up with a model that at least got that question right! That's going to take another two days to train, but that's no big deal. We first need to change our script that generates the train/validation splits to regenerate them using the Edu dataset; we'll move the old ones to one side, though -- it will be interesting to see what loss we get on the non-edu validation data with the new model. (Note to self: work out some way to split out different datasets and training runs for future experiments like this. The setup I had in my recent post on RNNs worked quite well. Throughout the remainder of this post I'm juggling directories of checkpoints and datasets, and I'm sure I got it right, but it was an error-prone process.) That being done, it's time to move the checkpoints we already have to one side, and to kick off the train! 
Here's what we have after two days on that -- oops, I forgot to add the code to average training loss across all of the batches, so again it's a bit spiky. But we got to a final eval loss of about 3.693 this time. Of course, that's on its own validation set, so it's not comparable with the numbers from before; loss is specific to a particular dataset. Let's see what it makes of the original run's validation set. Juggle some directories around (my messy file structure means that there is just one "datasets" directory and one "checkpoints" one, so I'm moving them around to make sure I'm using the right combination): We get 4.16! That's truly terrible -- worse than both the original base model that we trained on FineWeb's non-edu dataset and the OpenAI GPT-2 small weights. Let's see what we get from the closer-to-real-world instruction fine-tuning test. Five epochs turns out to be best: I won't bother running it past Llama 3 7B, as that's proven unhelpful, so we'll go straight to GPT-5.1. Gosh! So it's judged slightly worse than our weights based on FineWeb. That does surprise me a bit. I was definitely expecting the Edu version of the dataset to give us a better model. So: OpenAI medium > OpenAI small > our FineWeb base model > our FineWeb-Edu base model. That last pairing is the surprising one. Handwaving wildly, perhaps the more "regular" nature of the Edu dataset meant that the model saw less variation in its training set, and that actually made it learn less? I think there's one more experiment I want to do before bringing this (very lengthy) post to a close. We've shown that Chinchilla-optimal training produces worse results than OpenAI's original -- and, we think, longer -- train. What would happen if we continued training for another two days? As I have it easily to hand, I want to use the FineWeb-Edu model for this. I want to start with the best checkpoint (which happens to be the last one), and train it on another 3.2B tokens from FineWeb-Edu. 
Let's see what we get. Getting a dataset is going to be a bit messy, as our existing script to generate the safetensors datasets just grabs tokens from the original dataset until it gets 534,200 batches of 6 sequences, each of 1024 tokens (3,282,124,800 total). Might as well hack it (and note that this is something worth improving for any later experiments). I'll just loop round the code to do that twice, throwing away the first set of 3.2B tokens. I was pretty sure that the ordering of the datasets I'm getting is fixed, but perhaps not -- it spent time regenerating the train/val split at the start of the script, so there's no guarantee we have different data this time. That feels like a note-to-self about data pipeline hygiene -- if the train/val split is randomised by the infra I'm using, I should persist the raw data in case I need to use more data than I thought I would need to. Still, for this experiment, we can play relatively fast and loose. After all, GPT-2 small -- the original OpenAI weights -- was trained on multiple epochs, so it saw tokens multiple times. What we're trying to see here is what happens if you train for longer; a more scientific experiment can happen later (if at all...). Anyway, we have 3.2B tokens that should at least be reasonably different from the original 3.2B. Right, let's clean up some disk space so that we have enough for the new train (deleted some old optimiser checkpoints, keeping the metadata and the weights). Now, we create a new checkpoints directory, and we can copy the last/best checkpoint from the original FineWeb-Edu train there. Hack the in there to zero, create and symlinks, and then we can "restart" from that checkpoint. Due to the way the restart-from-checkpoint code works in the training script, it will start with an offset of 1 into the dataset, so we're dropping one of about 530,000 iterations, but that's not exactly the end of the world. 
There are some interesting spikes on validation loss in there -- in particular the one at around iteration 300,000, where it goes up from 3.6 or so to 7.5 for two validation periods (which, remember, happen every ~30 minutes, or every 7,020 iterations). My guess is that we got some kind of gradient spike prior to those, which led to a bad update to the parameters. However, it looks like the loss recovered really quickly after it, so while gradient clipping (that is, limiting the size of the gradients so that one-off spikes don't cause massive updates) might have prevented them, I don't think it would have improved matters much -- we might have "lost" an hour or so of training, but out of a 44-hour train (48 hours including breaks for validation), it's not the end of the world. But, looking at the raw numbers, after our second two days of training on a fresh sample from FineWeb-Edu 10B, we've managed to get the loss on our validation set down from 3.693 to... drumroll... 3.661. And that's on the "best" measurement, which was an hour before the end. The last validation number was 3.663. By spending twice the time, we've managed to get our loss down by 0.032, which is a touch less than 1%. Even measured in terms of perplexity (which, being an exponential, is more sensitive to this kind of change), we've gone from 40.2 to 38.9, which is hardly show-stopping. Let's see how this one measures up against the non-edu FineWeb validation dataset that we originally used to calibrate our first training run. Run it, and: ...we get 4.13 -- as opposed to the 4.16 from the last model, trained on half as much data. Well, maybe it's a much better base model for instruction fine-tuning? Let's give that a go, again with the Alpaca training set from the book. 8 epochs turns out to be the right number: Certainly better than the 15.18 that we got on our Chinchilla-optimal FineWeb-Edu model, and a bit better than the 16.14 we got on the Chinchilla-optimal FineWeb one. 
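On the gradient-clipping point: the idea is simple enough to show in a dependency-free sketch. In PyTorch the real thing is the built-in `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`, called between the backward pass and the optimiser step; below is just the arithmetic it performs, on a flat list of gradient values.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm --
    the idea behind PyTorch's torch.nn.utils.clip_grad_norm_."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)

# A spiky gradient with L2 norm 5.0, clipped down to norm 1.0;
# the direction is preserved, only the magnitude is capped:
print(clip_grad_norm([3.0, 4.0], 1.0))  # ~[0.6, 0.8]
```

A one-off spike like the one in the chart would be shrunk to `max_norm` before the optimiser ever saw it, while well-behaved gradients pass through untouched.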
So by training for double the time on twice the data, we've definitely got a better model. It's just not that much better. I think that's more -- significantly more -- than enough experimentation for one blog post, so let's do some analysis. I want to sanity-check the number of FLOPs spent on this train, just to make sure that I hadn't messed up. Feel free to skip this if you want to jump straight to the conclusion :-) In appendix F, the Chinchilla paper mentions a common approximation for how many FLOPs, C , you spend training a model with N parameters over D tokens: C ≈ 6ND. So based on that, each of those training runs cost us (using the exact numbers for N and D ) this many FLOPs: They also give a more carefully-worked-out calculation; it doesn't look all that difficult -- it's just a case of plugging in the numbers from our architecture and pulling out a result 4 -- but the numbers they get from that are generally within 10% of the simpler calculation, so we may as well stick with the above. 5 Now, in terms of how many FLOPs we actually spent... well, manufacturers' datasheets for hardware are based on carefully-selected benchmarks and won't really be comparable to the code we were running (especially given that it's my crappy code sitting on top of a huge stack of PyTorch, CUDA kernels, CUDA itself, and so on), but we can do a Fermi estimate . From Wikipedia, the RTX 3090 has 35.58 TFLOPS of FP32 performance. Way back earlier in this post, when I was measuring how many tokens per second I could get locally, the first experiment capped out at 12,599 tokens/second with FP32. showed the GPU usage at 100%, so let's say (again, this is very approximate) that we were getting about 35.58 TFLOPS and that that enabled 12,599 tokens/second. We wound up training at about 19,921 tokens/second after adding in mixed precision and using the tensor cores. 
So, hand-wavingly, we can say that we were getting Now, we trained for 44 hours (48 including validation), so the total number of training FLOPs should have been the number of seconds in that times the total FLOPS 6 of 56.27 × 10^12. That's pleasingly close to the 3.19 × 10^18 above! I can easily imagine that the stack we're using could somewhat-more-than-halve performance from the theoretical optimum, or that we're running at 50% of the GPU's theoretical capacity, or some combination of the two. We're in the same order of magnitude, and for a Fermi approximation, that's what matters. Now, looking at figure 3 in the Chinchilla paper, their IsoFLOP curves (each one showing the loss they got on their training set for models of different sizes, with every point on a curve using the same number of FLOPs), we can see that on the top one, which is for training runs of 6 × 10^18 FLOPs, the lowest point is pretty much bang-on the 168M point on the X axis. So that is at least reassuring that we did do a proper Chinchilla-optimal train here. (Their loss on that chart shows about 3, but they're using a different dataset, so I don't think it's comparable.) Apart from the obvious answer of "skill issue", let's see if there are any obvious reasons why the base model I've trained (and retrained) in this post is worse than the original OpenAI GPT-2 small. Let's review the results first: The first row is not super-interesting; it's the second and third that matter. OpenAI is clearly winning by quite some margin! Earlier on I assumed that the difference was that they trained on more data, but let's be a bit more systematic here. What specific differences do we have from the original train? Again, the amount of data in the paper is frustratingly limited, but: Right at the start, I estimated that the WebText dataset they trained on was about 10B tokens. We've trained on 3.2B tokens for two of our models, and 6.4B for the extended-train one. That could well have an effect. 
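Those back-of-envelope numbers are easy to reproduce. The sketch below uses the rounded parameter count of 163M (the exact N gives the 3.19 × 10^18 figure quoted above) together with the throughput measurements from earlier in the post:

```python
# Chinchilla appendix F approximation for training cost: C ~ 6 * N * D
N = 163e6          # parameters (rounded; the post's exact count differs slightly)
D = 3.2e9          # training tokens
C = 6 * N * D
print(f"C ~ {C:.2e} FLOPs")                  # ~3.1e18

# Fermi estimate of effective throughput: if 12,599 tok/s saturated the
# RTX 3090's 35.58 TFLOPS of FP32, then 19,921 tok/s (with AMP/TF32) implies:
flops_fp32 = 35.58e12
effective = flops_fp32 * 19_921 / 12_599
print(f"effective ~ {effective:.2e} FLOPS")  # ~5.6e13, i.e. the 56.27e12 above

# 44 hours of pure training at that rate:
spent = effective * 44 * 3600
print(f"spent ~ {spent:.2e} FLOPs")          # ~9e18, same order of magnitude as C
```

The roughly 3x gap between `spent` and `C` is exactly the "stack overhead plus under-utilisation" slack discussed above; for a Fermi estimate, landing in the same order of magnitude is the point.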
There's more information in their larger dataset, both in terms of raw facts like "Jane Austen wrote Pride and Prejudice", and in terms of information about the structure of language. On the other hand, their dataset is, as they say, composed of the contents of web pages that were linked from Reddit posts with more than three upvotes. FineWeb (and even more so FineWeb-Edu) is a much more curated dataset, so you would expect it to have more facts, and better structure -- less of the slop and junk that Andrej Karpathy talked about in his interview with Dwarkesh Patel. So I'm not sure that this is it, but it's worth keeping in mind. Again, we don't know how many epochs they trained for, but the report I linked to right at the start of this post estimated that they trained for 60, while I calculated based on their numbers that it would be 41 epochs with WebText. It certainly makes sense that grinding along, epoch after epoch, will get your loss down, at least on the training set! And there's also a phenomenon with certain kinds of neural networks where, if you keep training past the point where you're overfitting (that is, validation loss starts rising while training loss continues to fall), suddenly the model can have an "aha" moment and start generalising again . 8 It's not quite comparable, because it was continued training on more data rather than a second epoch, but we were able to eke out an extra reduction of 0.032 in loss by training our FineWeb-Edu model for twice as long. If we'd trained it for 40 times as long, then we presumably would have managed to grind it down even further. I have no idea how much further we could get it, but I'd guess that the returns would be worse than linear (that is, each extra two days gets you less loss reduction than the previous one) -- so we can bound the loss reduction at a maximum of 39 × 0.032 = 1.248. So... maybe? It would be a dull experiment to run, though, taking 78 days. 
If I want to do that, it would be better to find a way to do it quickly, so that I can get a better feedback loop going. The reason this post has taken so long has in part been because each training run has taken so long (as well as trips to London and other life stuff). The original GPT-2 model from OpenAI had bias on the W_q, W_k and W_v projections -- that is, they were normal biased NN linear layers rather than simple matrices, so they did a projection into their respective spaces followed by a translation. In the book, Raschka says that this is not normally done these days, which is why I didn't do it for this base model train. But perhaps it actually is valuable with this architecture or size? Modern models presumably differ in multiple ways, and perhaps the bias would have been useful for this old design. Likewise, weight-tying -- the original GPT-2 re-used its embedding matrix to do the final projection from embedding space to vocab space, rather than having a separate one. That seems intuitively clever but not necessarily "right", given that it gives the model less flexibility in what it can output from the last layer. But perhaps with this size and architecture, it's the right thing to do? Contrariwise, having made those two changes to GPT-2 because I believed that modern models don't work that way, there was one "modern" change that I didn't make. In his post on the architectural changes since GPT-2, Raschka mentioned that dropout is normally not used nowadays. This looked to me like it was due to the move to single-epoch training. But single-epoch training was exactly what we were doing in this post! Perhaps I was holding myself back by keeping dropout in place. I don't have a good intuition as to what the right level for it is at the moment. 
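On the weight-tying point above, here's a framework-free toy of what it means: with a tied head, the model scores each vocabulary item by dotting the final hidden state with that token's own embedding row, rather than with a separately-learned unembedding matrix. (In PyTorch implementations this is typically done by assigning the embedding layer's weight tensor to the output linear layer, so the two share storage.)

```python
# Toy weight tying: a 3-token vocab with 2-d embeddings, no frameworks.
# The same matrix E is used both to embed token ids and to "unembed"
# hidden states back into vocab-sized logits.

E = [[1.0, 0.0],   # embedding row for token 0
     [0.0, 1.0],   # token 1
     [1.0, 1.0]]   # token 2

def embed(token_id):
    return E[token_id]

def unembed(h):
    # Tied head: logits = h . E^T, i.e. dot h with each token's embedding row.
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in E]

h = [0.2, 0.9]        # a hidden state coming out of the final block
print(unembed(h))     # ~[0.2, 0.9, 1.1] -- token 2 scores highest
```

The constraint is visible even in this toy: the logits are forced to live in the geometry of the embedding table, which is exactly the "less flexibility in what it can output" trade-off mentioned above.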
My code blindly uses the optimiser setup from the book: I have at best a vague understanding of how those hyperparameters work, at least when using an optimiser like this (LR for simple gradient descent isn't too hard to understand, although it's hard to work out an intuition for what the right value might be in any given case). Additionally, in the Chinchilla paper, they talk about using a cosine function to vary the learning rate, which is something I'm completely unfamiliar with. I gained about a day in training time by using AMP and the TF32 tensor cores; however, I lost precision. I don't know for sure, but I suspect that the original weights were trained with pure full-fat FP32. Perhaps reducing precision lost something? I know that modern models are often trained at lower precisions, but perhaps that's balanced out by something else? This is the one that I think is least likely, but it's worth mentioning. The post that I linked to estimating the size of the training run for GPT-2 small mentioned that they used a batch size of 512, which (of course) is completely impossible on consumer hardware like mine. Indeed, I think you'd be lucky to get 512 onto a single 8-GPU node -- we're talking serious cluster training scale here. Larger batches lead to more stable gradient updates. So maybe that helped for OpenAI when they did their train? I suspect it did, but I'm pretty much certain that it's not a large part of the difference. (Counterpoint: Gemini thinks that this might actually be a big part of the problem! It recommends using gradient accumulation -- that is, not stepping the optimiser every iteration, but instead letting gradients build up over several -- as a way of getting a larger effective batch size.) While it doesn't look like we had any issues with gradient spikes on the original FineWeb and FineWeb-Edu trains, they definitely did kick in on the extended Edu train. The code to clip them is easy enough, and I think it's likely that the original GPT-2 trains would have had it. 
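The gradient-accumulation idea is simple enough to sketch without any framework: scale each micro-batch's gradient by 1/accum_steps, add it to a running total, and only apply the optimiser update every accum_steps micro-batches. Below is a toy version using plain SGD on the loss (w - target)^2; all names are made up for illustration, and in PyTorch the same pattern is just "call backward() every iteration, call optimizer.step() and zero_grad() every accum_steps iterations".

```python
# Gradient accumulation: average gradients over several micro-batches and
# step once, giving the effective batch size of all of them combined.

def grad(w, target):
    return 2 * (w - target)            # d/dw of the toy loss (w - target)^2

def train(micro_batches, accum_steps, lr=0.1):
    w, acc = 0.0, 0.0
    for i, target in enumerate(micro_batches, start=1):
        acc += grad(w, target) / accum_steps   # accumulate the scaled gradient
        if i % accum_steps == 0:               # step only every accum_steps
            w -= lr * acc
            acc = 0.0
    return w

# Four micro-batches with accum_steps=2 behave like two big-batch updates:
print(train([1.0, 3.0, 1.0, 3.0], accum_steps=2))
```

The VRAM cost is only one micro-batch at a time, which is what makes a "512-sized" effective batch at least conceivable on a single consumer GPU.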
I doubt this was a major part of the difference, but it probably would have helped, at least a bit. Anyway, I think that's it in terms of differences that I can see between my train and OpenAI's (as always, comments welcome -- let me know if you spot any others!), so it's time to (finally) wrap this post up. At the start of this (ridiculously long) post, I asked the question: can we train a GPT-2 style base model at home on a single RTX 3090? The answer is a resounding "yes we can", which is great! Training base models: not just for the GPU-rich. If you have a couple of days and a decent graphics card, you can train a Chinchilla-optimal GPT-2 pretty easily. But the model itself isn't quite as good as the original GPT-2 small one, and I have some ideas about why that might be. Testing any of those would take quite a long time, given that each training run takes two days. Now, my next planned step was to see whether I could work out how to move this up to the cloud and train the same model on an 8x A100 or similar machine on Lambda Labs. This still sounds like an excellent plan! With his project, Karpathy trains a larger model on more tokens in four hours; if we could get the experiment time down to one hour (plausible if training time is linear in both tokens and parameters) then it would be much easier to check out those hypotheses above. 9 So, I think that's still the right way to go: after training a base model at home for free (if you ignore the electricity costs -- and it's cold enough in Lisbon right now that the heat from the PC was probably saving me money on my home heating bill -- and the cost of having bought the RTX 3090 in the first place), the next step is to see how cheaply we can train it in the cloud. Stay tuned :-) It's useful here, but it does make me wonder how good FineWeb would be for training a base model with a longer context length.  
↩ There are ways to get comparable numbers even with a different tokeniser, using a bits-per-byte or nats-per-byte measure. Let's say we're using the normal cross-entropy loss with the natural logarithm; that means that loss is expressed in nats. So you add up all of the per-token losses and divide by the number of bytes across all of the inputs you've seen, and that gives you nats-per-byte. Likewise, if you used log_2 for cross-entropy, you'd get bits-per-byte. The latter is used in the Chinchilla paper (e.g. table A5) as a way to compare their model with the Gopher model. I did consider digging into this a bit, but I think it's a bit of a side quest for now.  ↩ Those evals cost me $0.09 in API credits, which is actually a little more than I was expecting -- there were some responses which took quite a while to come back, though, and I believe that the GPT 5.1 model spends time thinking when it seems appropriate, so perhaps I spent a bit on thinking tokens.  ↩ Apart from a reference to a "dense layer", which I'm unsure about -- I believe it's the linear feed-forward layer after the attention calculations, though, as that doesn't appear elsewhere, and the calculation looks right. I also noticed that they don't have any terms in there for things like normalisation, which seems odd for such a carefully-worked-out formula; I assume they are small enough to vanish into the noise.  ↩ If you want a more careful calculation of the numbers -- and indeed a really nice explanation of some of the details of the Chinchilla paper -- I recommend this blog post from Tomek Korbak .  ↩ I hate that we appear to have settled on FLOPs with a lower-case "s" for "floating-point operations" when "FLOPS" (and equivalently MFLOPS, GFLOPS, TFLOPS) with an upper-case "S" already meant "floating-point operations per second", because the difference in capitalisation should really not change the units. But here we are.  
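The bits-per-byte conversion from the footnote above is only a couple of lines; here's a hypothetical helper, assuming you've summed the per-token cross-entropy losses (in nats) and the byte lengths over the whole validation set:

```python
import math

def bits_per_byte(total_loss_nats, total_bytes):
    """Tokeniser-independent metric: summed per-token cross-entropy (nats)
    per byte of underlying text, converted from nats to bits."""
    return (total_loss_nats / total_bytes) / math.log(2)

# e.g. 100 tokens at an average loss of 3.5 nats, over 450 bytes of text:
print(bits_per_byte(100 * 3.5, 450))   # ~1.12 bits/byte
```

Because the denominator is bytes of raw text rather than tokens, two models with different tokenisers can be compared on the same corpus.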
↩ I estimated the OpenAI weights' loss on their own dataset by taking the perplexity number for the small model from figure 4, which is about 16.5, and then taking its natural log.  ↩ The authors of the paper call it "grokking", which is a great name, but is so overloaded in the context of LLMs (even if you disregard xAI's Grok ) that I'm slightly loath to use it here. This phenomenon also looks somewhat more limited in scope than I thought -- I'd been under the impression that it happens a lot with LLMs, but it looks like it's more a thing that happens with small models trained on very structured datasets.  ↩ It would also be interesting to see how easy it is to offload the optimiser to the CPU: in my old fine-tuning experiments I found that freed up a ton of VRAM, so we could benefit from that and maybe get the batch size up to something closer to the 512 that OpenAI apparently trained with.  ↩

- . This is determined by the tokenizer, and I want to use the GPT-2 one, so it will need to be .
- . GPT-2 has a 1,024-token context length, so I'll stick with that.
- , , -- these define which of the different GPT-2 model classes we're training, and I want to stick to the smallest one, so they will be , and respectively.
- One of the most surprising things to me in the "architectural improvements" post linked above was that dropout is no longer used so much. However, this appears to be tied in to the one-epoch training that has taken off since GPT-2, so I think it would be best to stick to here.
- . From what Raschka says in the book, this doesn't add on much value, even though the original GPT-2 used it, so let's set it to .

- Crop all of the input sequences -- that is, each row in the dataset -- so that each one is no more than our 1,024 sequence length. Then we can pad them out with end-of-sequence tokens (as is the standard) so that they're all 1,024. This will lose us quite a lot of tokens, but has the big benefit of being easy. 
- Treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. Doing it this way would mean we'd use all of our training data. But it would be more complicated, especially if we hit memory constraints.

We load enough GPT-2 tokens from FineWeb for batches of sequences each, every one of those sequences being long (plus one extra token for the targets we're comparing them to). Note that we're not bothering to separate them with anything for this test. We then loop over batch sizes from to . Then we create our model and put it on the CUDA device. We do this for each batch size rather than creating one and then using it for all of them so that they're all starting from the same point -- the should make sure that they're identical. For each batch size, we create input and output batches as tensors -- note that we're not putting these on CUDA yet; I wanted to do that in the training loop to mirror what a real training loop will have to do. When we're training with 3.2B tokens, having them all on CUDA would be a waste of VRAM, so we'll be pushing a batch there for each iteration. We do a stripped-down training loop -- for each batch, put the inputs and outputs onto CUDA, then do a forward pass, work out the loss, do a backward pass, and an optimiser step. We do the same iterations per batch size. Finally, we print out the number of tokens we trained on for this batch size, how long it took, and the number of tokens per second.

- Chinchilla heuristic, 20x parameters -- 3.2B tokens: 247,850 seconds, which is just less than three days
- Estimated GPT-2 train, 419B tokens: 32,452,947 seconds, which is just over a year

1. Create a model, optimiser and scaler.
2. Train the model for a bit.
3. Work out the loss.
4. Save a checkpoint.
5. Create a new model, optimiser, and scaler, and then restore the checkpoint into them.
6. Work out the loss.
7. Train for a bit more to check that the optimiser and scaler still work. 
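That save/restore/compare round-trip can be sketched framework-free. In the real script the three state dicts would come from `model.state_dict()`, the optimiser and the AMP scaler (and you'd use `torch.save`/`torch.load` rather than pickle), but the round-trip logic being tested is the same; all the state values below are stand-ins:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, model_state, optim_state, scaler_state):
    """Bundle the three state dicts into one file, like torch.save would."""
    with open(path, "wb") as f:
        pickle.dump({"model": model_state,
                     "optim": optim_state,
                     "scaler": scaler_state}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Fake "state dicts" standing in for model/optimiser/scaler state:
model_state = {"w": [0.1, 0.2]}
optim_state = {"step": 42, "m": [0.0, 0.0]}
scaler_state = {"scale": 65536.0}

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
save_checkpoint(path, model_state, optim_state, scaler_state)

restored = load_checkpoint(path)
assert restored["model"] == model_state    # same weights, so same loss after restore
assert restored["optim"]["step"] == 42     # training resumes where it left off
print("checkpoint round-trip OK")
```

The point of also restoring the optimiser and scaler state is that Adam's moment estimates and the AMP loss scale are part of the training trajectory; restoring only the weights would subtly change how training continues.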
On our own validation set from FineWeb, we have OpenAI > our FineWeb train > our FineWeb-Edu extended train > our FineWeb-Edu train. On the answers judged by GPT-5.1 after instruction fine-tuning, we have OpenAI > our FineWeb-Edu extended train > our FineWeb train > our FineWeb-Edu train.
