Posts in Machine-learning (20 found)
Kaushik Gopal Yesterday

AI model choices 2026-01

Which AI model do I use? This is a common question I get asked, but models evolve so rapidly that I never felt like I could give an answer that would stay relevant for more than a month or two. This year, I finally feel like I have a stable set of model choices that consistently give me good results. I’m jotting it down here to share more broadly and to trace how my own choices evolve over time.

- GPT 5.2 (High) for planning and writing, including plans
- Opus 4.5 for anything coding, task automation, and tool calling
- Gemini’s range of models for everything else:
  - Gemini 3 (Thinking) for learning and understanding concepts (underrated)
  - Gemini 3 (Flash) for quick-fire questions
  - Nano Banana (obv) for image generation
- NVIDIA’s Parakeet for voice transcription

0 views
Giles's blog 5 days ago

Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI:

- A simple cross entropy loss over a fixed test set.
- The results for an instruction fine-tune test that's covered in the book.

Here were the results I got, sorted by the loss:

Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern.

I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now. In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.

Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset in order to let it know how to follow instructions; that comprises a series of sequences like this:

More details in this post. In the version I've settled on, I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples. Once that's done, the script hits the OpenAI API, using GPT-5.1 with default parameters for all of the options (e.g. no explicit temperature), with queries like this:

We do that for every model-generated response in the test set, then take the average of the scores and use that as our result. To see why that's problematic, imagine this simple instruction with no separate input:

One response I've seen from my models was this:

That's obvious garbage, and should get a zero -- and GPT-5.1 consistently does that. Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), is this:

That's correct, so it deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset). But now how about this one:

One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason for that, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response.

The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment.
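For concreteness, here is a rough sketch of the kind of per-response judging call described above. The prompt wording, the scoring instructions and the helper name are illustrative assumptions, not the post's actual code:

```python
# Sketch of a single-response "LLM-as-a-judge" call, as described above.
# The prompt text and helper name are illustrative; the post's real query
# is not reproduced in this excerpt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_response(instruction: str, expected: str, model_output: str) -> str:
    prompt = (
        f"Given the instruction:\n{instruction}\n\n"
        f"and the expected answer:\n{expected}\n\n"
        "score the model response below on a scale of 0 to 100, "
        "where 100 is the best score. Reply with the integer only.\n\n"
        f"Model response:\n{model_output}"
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",  # judge model used in the post; default parameters otherwise
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```

Each model's evaluation run makes its own independent sequence of calls like this, which is exactly why the scores can drift between runs.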
I came to the conclusion that the best way to fix this is to do a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, and generate responses for the test set and store them in a file. Then, once we've done that for all models, we can score them all at once, prompting GPT-5.1 with something like this:

The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better. Here's the code:

- A script to fine-tune a model and generate test responses and to dump them into a JSON file.
- The LLM-as-a-judge code to send a bunch of models' responses to GPT-5.1. It scrambles the order of the models in each query, to try to avoid any preference the model might have for the first one vs the last one, and it stores GPT-5.1's per-response scores and comments in a new "annotated" JSON file.

Running the first against each of our models, and then the second against all of the output files, gives us this updated table (with links to the annotated JSON files in case anyone else wants to take a look):

(Still sorted by loss so that you can compare it more easily with the one above.)

That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern. It looks like we have three groups of models:

- The OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores.
- The other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores.
- The FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's.

I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB, and the two local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower.

A hypothesis: there are two things that contribute to how good a model is at these IFT tests:

- The loss. Models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning.
- The amount of information in the dataset. It doesn't matter how clever a model is, if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question.

Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here.

While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as a new web scrape which "emphasizes document quality":

"To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma."

Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible. By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training.

So we can imagine the OpenAI models are smart but not knowledgeable, as we can our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well. Their intelligence makes up for some of their lack of knowledge. Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, but they didn't wind up with a particularly amazing loss. So they get low scores. And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more because their training data was better.
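A minimal sketch of that batch-scoring idea might look something like this; the prompt text, the response-numbering scheme and the function name are assumptions for illustration, not the post's actual implementation:

```python
# Sketch of the "batch" scoring approach described above: one judge query per
# instruction, covering every model's stored response, with the model order
# shuffled to avoid position bias. Names and prompt wording are hypothetical.
import random
from openai import OpenAI

client = OpenAI()

def judge_batch(instruction: str, expected: str, responses: dict[str, str]) -> tuple[list[str], str]:
    """responses maps a model name to the response that model generated."""
    order = list(responses)
    random.shuffle(order)  # scramble the model order in each query
    listing = "\n\n".join(
        f"Response {i + 1}:\n{responses[name]}" for i, name in enumerate(order)
    )
    prompt = (
        f"Given the instruction:\n{instruction}\n\n"
        f"and the expected answer:\n{expected}\n\n"
        "score each of the responses below on a scale of 0 to 100. "
        "Reply with one line per response in the form "
        "'Response N: <score> -- <short reason>'.\n\n"
        f"{listing}"
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    # The caller maps the scores back to model names via `order` and writes
    # them into the annotated JSON file alongside the judge's comments.
    return order, completion.choices[0].message.content
```

Because every model's response to a given instruction is graded inside the same request, the judge's "mood" for that instruction is shared across models, which is the consistency property the batch approach is after.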
Well, it sounds plausible ;-) And I'd like to spend some time digging in to see if there's any indication that it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed how you'd test that hypothesis in any real depth. TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning about how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. I will keep on doing these IFT tests for future models, though, just out of interest.

So: let's get back to our regularly scheduled LLM training. Next up: how do we upload our models to Hugging Face quickly and easily so that other people can play with them?

0 views
Alex Jacobs 1 week ago

Beating BERT? Small LLMs vs Fine-Tuned Encoders for Classification

“Just use an LLM.” That was my advice to a colleague recently when they asked about a classification problem. Who fine-tunes BERT anymore? Haven’t decoder models eaten the entire NLP landscape? The look I got back was… skeptical. And it stuck with me. I’ve been deep in LLM-land for a few years now. When your daily driver can architect systems, write production code, and reason through problems better than most junior devs, you start reaching for it reflexively. Maybe my traditional ML instincts had atrophied. So I decided to actually test my assumptions instead of just vibing on them.

I ran 32 experiments pitting small instruction-tuned LLMs against good old BERT and DeBERTa. I figured I’d just be confirming what I already believed, that these new decoder models would obviously crush the ancient encoders. I was wrong. The results across Gemma 2B, Qwen 0.5B/1.5B, BERT-base, and DeBERTa-v3 were… not what I expected. If you’re trying to decide between these approaches for classification, you might want to actually measure things instead of assuming the newer model is better. All the code is on GitHub if you want to run your own experiments.

BERT Family (Fine-tuned):

- BERT-base-uncased (110M parameters)
- DeBERTa-v3-base (184M parameters)

For the LLMs (Qwen2-0.5B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-2-2B-it), I tried two approaches:

- Zero-shot - Just prompt engineering, no training
- Few-shot (k=5) - Include 5 examples in the prompt

Four classification benchmarks ranging from easy sentiment to adversarial NLI: SST-2, RTE, BoolQ, and ANLI.

For anyone who wants to reproduce this or understand what “fine-tuned” and “zero-shot” actually mean here:

BERT/DeBERTa Fine-tuning:

- Standard HuggingFace Trainer with AdamW optimizer
- Learning rate: 2e-5, batch size: 32, epochs: 3
- Max sequence length: 128 tokens
- Evaluation on validation split (GLUE test sets don’t have public labels)

LLM Zero-shot:

- Greedy decoding (temperature=0.0) for deterministic outputs
- Task-specific prompts asking for single-word classification labels
- No examples in context—just instructions and the input text

LLM Few-shot (k=5):

- Same as zero-shot, but with 5 labeled examples prepended to each prompt
- Examples randomly sampled from training set (stratified by class)

All experiments used a fixed random seed (99) for reproducibility. Evaluation metrics are accuracy on the validation split. Hardware: RunPod instance with RTX A4500 (20GB VRAM), 20GB RAM, 5 vCPU. I’d forgotten how pretty text-only land can be. When you spend most of your time in IDEs and notebooks, SSH-ing into a headless GPU box and watching nvitop do its thing feels almost meditative.

Let’s dive into what actually happened:

DeBERTa-v3 wins most tasks—but not all. DeBERTa hit 94.8% on SST-2, 80.9% on RTE, and 82.6% on BoolQ. For standard classification with decent training data, the fine-tuned encoders still dominate. On ANLI—the hardest benchmark, specifically designed to fool models—Gemma few-shot actually beats DeBERTa (47.8% vs 47.4%). It’s a narrow win, but it’s a win on the task that matters most for robustness.

Zero-shot LLMs actually beat BERT-base. The LLMs aren’t losing to BERT—they’re losing to DeBERTa. Qwen2.5-1.5B zero-shot hit 93.8% on SST-2, beating BERT-base’s 91.5%. Same story on RTE (78.7% vs 61.0%) and BoolQ (Gemma’s 80.9% vs BERT’s 71.5%). For models running purely on prompts with zero training? I’m calling it a win.

Few-shot is a mixed bag. Adding examples to the prompt doesn’t always help. On RTE, Qwen2.5-1.5B went from 78.7% zero-shot down to 53.4% with few-shot. On SST-2, it dropped from 93.8% to 89.0%. But on ANLI, few-shot helped significantly—Gemma jumped from 36.1% to 47.8%, enough to beat DeBERTa. Few-shot helps on harder tasks where examples demonstrate the thought process, but can confuse models on simpler pattern matching tasks where they already “get it.” Sometimes examples add noise instead of signal.

Okay, so the accuracy gap isn’t huge. Maybe I could still justify using an LLM? Then I looked at throughput: BERT is ~20x faster. BERT processes 277 samples per second. Gemma-2-2B manages 12. If you’re classifying a million documents, that’s one hour vs a full day. Encoders process the whole sequence in one forward pass. Decoders generate tokens autoregressively, even just to output “positive” or “negative”.
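As a reference point for the encoder side, a condensed sketch of the fine-tuning recipe listed above (SST-2 shown as the example task) might look like this; it is not the author's actual script, and the accuracy metric is omitted for brevity:

```python
# Condensed sketch of the encoder fine-tuning setup described above.
# Hyperparameters follow the post (lr 2e-5, batch size 32, 3 epochs,
# max length 128, seed 99); everything else is simplified.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    # Pad/truncate to the 128-token maximum used in the experiments
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="deberta-v3-sst2",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    seed=99,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],  # GLUE test sets lack public labels
)
trainer.train()
print(trainer.evaluate())
```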
Note on LLM latency: These numbers use for tokenization. When I bumped it to , latency jumped 8x—from 57ms to 445ms per sample for Qwen-0.5B. Context window scales roughly linearly with inference time. For short classification tasks, keep it short or make it dynamic.

These models struggled on nuanced reviews. Can you do better? Try classifying some of the trickiest examples from my experiments: Classify these tricky movie reviews

Despite the efficiency gap, there are cases where small LLMs are the right choice:

Zero Training Data. If you have no labeled data, LLMs win by default. Zero-shot Qwen2.5-1.5B at 93.8% on SST-2 is production-ready without a single training example. You can’t fine-tune BERT with zero examples.

Rapidly Changing Categories. If your categories change frequently (new product types, emerging topics), re-prompting an LLM takes seconds. Re-training BERT requires new labeled data, training time, validation, deployment. The iteration cycle matters.

Explanations with Predictions. LLMs can provide reasoning: “This review is negative because the customer mentions ‘defective product’ and ‘waste of money.’” BERT gives you a probability. Sometimes you need the story, not just the number.

If you’re processing 100 support tickets a day, throughput doesn’t matter. The 20x speed difference is irrelevant when you’re not hitting any resource constraints.

High-Volume Production Systems. If you’re classifying millions of items daily, BERT’s 20x throughput advantage matters. That’s a job finishing in an hour vs. running all day.

Well-Defined, Stable Tasks. Sentiment analysis. Spam detection. Topic classification. If your task definition hasn’t changed since 2019, fine-tuned BERT is proven and stable. No need to fix what isn’t broken.

You Have Training Data. With a few thousand labeled examples, fine-tuned DeBERTa will beat small LLMs. It’s a dedicated specialist vs. a generalist. Specialization still works.

Latency Matters. Real-time classification in a user-facing app where every millisecond counts? BERT’s parallel processing wins. LLMs can’t compete on speed.

Before you @ me on Twitter—yes, I know this isn’t the final word. Some caveats:

I only tested small LLMs. Kept everything under 2B parameters to fit comfortably on a 20GB GPU. Bigger models like Llama-3-8B or Qwen-7B would probably do better, but then the efficiency comparison becomes even more lopsided. You’re not beating BERT’s throughput with a 7B model.

Generic prompts. I used straightforward prompts without heavy optimization. Task-specific prompt engineering could boost LLM performance. DSPy-style optimization would probably help too—but that’s another blog post.

Four benchmarks isn’t everything. There are plenty of classification scenarios I didn’t test. Your domain might be different. Measure, don’t assume.

So, can small LLMs beat BERT at classification? Sometimes, and on the hardest task, they actually do. Gemma few-shot edges out DeBERTa on adversarial NLI, the benchmark specifically designed to break models. DeBERTa-v3 still wins 3 out of 4 tasks when you have training data. And BERT’s efficiency advantage is real—~20x faster throughput matters when you’re processing millions of documents and paying for compute. Zero-shot LLMs aren’t just a parlor trick either. Qwen2.5-1.5B hits 93.8% on sentiment with zero training examples—that’s production-ready without a single label. For cold-start problems, rapidly changing domains, or when you need explanations alongside predictions, they genuinely work.
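For the decoder side, the zero-shot setup described above can be sketched roughly as follows; the prompt wording and the max_new_tokens value are illustrative assumptions rather than the post's exact settings:

```python
# Sketch of zero-shot LLM classification: a task-specific prompt asking for a
# single-word label, decoded greedily. Prompt text and generation limits are
# illustrative, not the post's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

def classify_sentiment(review: str) -> str:
    messages = [{
        "role": "user",
        "content": "Classify the sentiment of this movie review. "
                   "Answer with exactly one word: positive or negative.\n\n"
                   f"Review: {review}",
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=5,   # keep generation short; large limits inflate latency
            do_sample=False,    # greedy decoding for deterministic outputs
        )
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True).strip()

print(classify_sentiment("Gorgeous to look at, impossible to care about."))
```

Keeping the generated span to a handful of tokens is the practical counterpart of the latency note above: the label only needs one word, so there is no reason to allow a long generation budget.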
Hopefully this gives some actual data points for making that call instead of just following the hype cycle. All the code is on GitHub. Go run your own experiments. Surely I’ve made some embarrassing mistakes here. Don’t just tell me—tell everyone! Share this post on your favorite social media with your corrections :)

0 views
Simon Willison 2 weeks ago

2025: The year in LLMs

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years see Stuff we figured out about AI in 2023 and Things we learned about LLMs in 2024 . It’s been a year filled with a lot of different trends. OpenAI kicked off the "reasoning" aka inference-scaling aka Reinforcement Learning from Verifiable Rewards (RLVR) revolution in September 2024 with o1 and o1-mini . They doubled down on that with o3, o3-mini and o4-mini in the opening months of 2025 and reasoning has since become a signature feature of models from nearly every other major AI lab. My favourite explanation of the significance of this trick comes from Andrej Karpathy : By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples). [...] Running RLVR turned out to offer high capability/$, which gobbled up the compute that was originally intended for pretraining. Therefore, most of the capability progress of 2025 was defined by the LLM labs chewing through the overhang of this new stage and overall we saw ~similar sized LLMs but a lot longer RL runs. Every notable AI lab released at least one reasoning model in 2025. Some labs released hybrids that could be run in reasoning or non-reasoning modes. Many API models now include dials for increasing or decreasing the amount of reasoning applied to a given prompt. It took me a while to understand what reasoning was useful for. Initial demos showed it solving mathematical logic puzzles and counting the Rs in strawberry - two things I didn't find myself needing in my day-to-day model usage. It turned out that the real unlock of reasoning was in driving tools. Reasoning models with access to tools can plan out multi-step tasks, execute on them and continue to reason about the results such that they can update their plans to better achieve the desired goal. A notable result is that AI assisted search actually works now . Hooking up search engines to LLMs had questionable results before, but now I find even my more complex research questions can often be answered by GPT-5 Thinking in ChatGPT . Reasoning models are also exceptional at producing and debugging code. The reasoning trick means they can start with an error and step through many different layers of the codebase to find the root cause. I've found even the gnarliest of bugs can be diagnosed by a good reasoner with the ability to read and execute code against even large and complex codebases. Combine reasoning with tool-use and you get... I started the year making a prediction that agents were not going to happen . Throughout 2024 everyone was talking about agents but there were few to no examples of them working, further confused by the fact that everyone using the term “agent” appeared to be working from a slightly different definition from everyone else. By September I’d got fed up of avoiding the term myself due to the lack of a clear definition and decided to treat them as an LLM that runs tools in a loop to achieve a goal . This unblocked me for having productive conversations about them, always my goal for any piece of terminology like that. 
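To make that "LLM that runs tools in a loop" definition concrete, here is a schematic sketch using the OpenAI Python SDK; the model name, the single shell tool and the stopping condition are placeholders rather than any particular product's design:

```python
# Schematic sketch of an LLM running a tool in a loop to achieve a goal.
# The model name, the lone example tool and the step limit are placeholders.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

def run_shell(command: str) -> str:
    """Example tool: run a shell command with no confirmation step (YOLO mode)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def agent(goal: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-5.1",  # placeholder model name
            messages=messages,
            tools=tools,
        )
        message = reply.choices[0].message
        messages.append(message)
        if not message.tool_calls:        # no more tools requested: we are done
            return message.content
        for call in message.tool_calls:   # execute each requested tool call
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_shell(**args),
            })
    return "stopped after max_steps"
```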
I didn’t think agents would happen because I didn’t think the gullibility problem could be solved, and I thought the idea of replacing human staff members with LLMs was still laughable science fiction. I was half right in my prediction: the science fiction version of a magic computer assistant that does anything you ask of ( Her ) didn’t materialize... But if you define agents as LLM systems that can perform useful work via tool calls over multiple steps then agents are here and they are proving to be extraordinarily useful. The two breakout categories for agents have been for coding and for search. The Deep Research pattern - where you challenge an LLM to gather information and it churns away for 15+ minutes building you a detailed report - was popular in the first half of the year but has fallen out of fashion now that GPT-5 Thinking (and Google's " AI mode ", a significantly better product than their terrible "AI overviews") can produce comparable results in a fraction of the time. I consider this to be an agent pattern, and one that works really well. The "coding agents" pattern is a much bigger deal. The most impactful event of 2025 happened in February, with the quiet release of Claude Code. I say quiet because it didn’t even get its own blog post! Anthropic bundled the Claude Code release in as the second item in their post announcing Claude 3.7 Sonnet . (Why did Anthropic jump from Claude 3.5 Sonnet to 3.7? Because they released a major bump to Claude 3.5 in October 2024 but kept the name exactly the same, causing the developer community to start referring to un-named 3.5 Sonnet v2 as 3.6. Anthropic burned a whole version number by failing to properly name their new model!) Claude Code is the most prominent example of what I call coding agents - LLM systems that can write code, execute that code, inspect the results and then iterate further. The major labs all put out their own CLI coding agents in 2025 Vendor-independent options include GitHub Copilot CLI , Amp , OpenCode , OpenHands CLI , and Pi . IDEs such as Zed, VS Code and Cursor invested a lot of effort in coding agent integration as well. My first exposure to the coding agent pattern was OpenAI's ChatGPT Code Interpreter in early 2023 - a system baked into ChatGPT that allowed it to run Python code in a Kubernetes sandbox. I was delighted this year when Anthropic finally released their equivalent in September, albeit under the baffling initial name of "Create and edit files with Claude". In October they repurposed that container sandbox infrastructure to launch Claude Code for web , which I've been using on an almost daily basis ever since. Claude Code for web is what I call an asynchronous coding agent - a system you can prompt and forget, and it will work away on the problem and file a Pull Request once it's done. OpenAI "Codex cloud" (renamed to "Codex web" in the last week ) launched earlier in May 2025 . Gemini's entry in this category is called Jules , also launched in May . I love the asynchronous coding agent category. They're a great answer to the security challenges of running arbitrary code execution on a personal laptop and it's really fun being able to fire off multiple tasks at once - often from my phone - and get decent results a few minutes later. I wrote more about how I'm using these in Code research projects with async coding agents like Claude Code and Codex and Embracing the parallel coding agent lifestyle . 
In 2024 I spent a lot of time hacking on my LLM command-line tool for accessing LLMs from the terminal, all the time thinking that it was weird that so few people were taking CLI access to models seriously - they felt like such a natural fit for Unix mechanisms like pipes. Maybe the terminal was just too weird and niche to ever become a mainstream tool for accessing LLMs? Claude Code and friends have conclusively demonstrated that developers will embrace LLMs on the command line, given powerful enough models and the right harness. It helps that terminal commands with obscure syntax like and and itself are no longer a barrier to entry when an LLM can spit out the right command for you. As-of December 2nd Anthropic credit Claude Code with $1bn in run-rate revenue ! I did not expect a CLI tool to reach anything close to those numbers. With hindsight, maybe I should have promoted LLM from a side-project to a key focus! The default setting for most coding agents is to ask the user for confirmation for almost every action they take . In a world where an agent mistake could wipe your home folder or a malicious prompt injection attack could steal your credentials this default makes total sense. Anyone who's tried running their agent with automatic confirmation (aka YOLO mode - Codex CLI even aliases to ) has experienced the trade-off: using an agent without the safety wheels feels like a completely different product. A big benefit of asynchronous coding agents like Claude Code for web and Codex Cloud is that they can run in YOLO mode by default, since there's no personal computer to damage. I run in YOLO mode all the time, despite being deeply aware of the risks involved. It hasn't burned me yet... ... and that's the problem. One of my favourite pieces on LLM security this year is The Normalization of Deviance in AI by security researcher Johann Rehberger. Johann describes the "Normalization of Deviance" phenomenon, where repeated exposure to risky behaviour without negative consequences leads people and organizations to accept that risky behaviour as normal. This was originally described by sociologist Diane Vaughan as part of her work to understand the 1986 Space Shuttle Challenger disaster, caused by a faulty O-ring that engineers had known about for years. Plenty of successful launches led NASA culture to stop taking that risk seriously. Johann argues that the longer we get away with running these systems in fundamentally insecure ways, the closer we are getting to a Challenger disaster of our own. ChatGPT Plus's original $20/month price turned out to be a snap decision by Nick Turley based on a Google Form poll on Discord. That price point has stuck firmly ever since. This year a new pricing precedent has emerged: the Claude Pro Max 20x plan, at $200/month. OpenAI have a similar $200 plan called ChatGPT Pro. Gemini have Google AI Ultra at $249/month with a $124.99/month 3-month starting discount. These plans appear to be driving some serious revenue, though none of the labs have shared figures that break down their subscribers by tier. I've personally paid $100/month for Claude in the past and will upgrade to the $200/month plan once my current batch of free allowance (from previewing one of their models - thanks, Anthropic) runs out. I've heard from plenty of other people who are happy to pay these prices too. You have to use models a lot in order to spend $200 of API credits, so you would think it would make economic sense for most people to pay by the token instead. 
It turns out tools like Claude Code and Codex CLI can burn through enormous amounts of tokens once you start setting them more challenging tasks, to the point that $200/month offers a substantial discount.

2024 saw some early signs of life from the Chinese AI labs, mainly in the form of Qwen 2.5 and early DeepSeek. They were neat models but didn't feel world-beating. This changed dramatically in 2025. My ai-in-china tag has 67 posts from 2025 alone, and I missed a bunch of key releases towards the end of the year (GLM-4.7 and MiniMax-M2.1 in particular.) Here's the Artificial Analysis ranking for open weight models as of 30th December 2025:

GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, MiniMax-M2.1 are all Chinese open weight models. The highest non-Chinese model in that chart is OpenAI's gpt-oss-120B (high), which comes in sixth place.

The Chinese model revolution really kicked off on Christmas day 2024 with the release of DeepSeek 3, supposedly trained for around $5.5m. DeepSeek followed that on 20th January with DeepSeek R1, which promptly triggered a major AI/semiconductor selloff: NVIDIA lost ~$593bn in market cap as investors panicked that AI maybe wasn't an American monopoly after all. The panic didn't last - NVIDIA quickly recovered and today are up significantly from their pre-DeepSeek R1 levels. It was still a remarkable moment. Who knew an open weight model release could have that kind of impact?

DeepSeek were quickly joined by an impressive roster of Chinese AI labs. I've been paying attention to these ones in particular:

- Alibaba Qwen (Qwen3)
- Moonshot AI (Kimi K2)
- Z.ai (GLM-4.5/4.6/4.7)
- MiniMax (M2)
- MetaStone AI (XBai o4)

Most of these models aren't just open weight, they are fully open source under OSI-approved licenses: Qwen use Apache 2.0 for most of their models, DeepSeek and Z.ai use MIT. Some of them are competitive with Claude 4 Sonnet and GPT-5! Sadly none of the Chinese labs have released their full training data or the code they used to train their models, but they have been putting out detailed research papers that have helped push forward the state of the art, especially when it comes to efficient training and inference.

One of the most interesting recent charts about LLMs is Time-horizon of software engineering tasks different LLMs can complete 50% of the time from METR:

The chart shows tasks that take humans up to 5 hours, and plots the evolution of models that can achieve the same goals working independently. As you can see, 2025 saw some enormous leaps forward here with GPT-5, GPT-5.1 Codex Max and Claude Opus 4.5 able to perform tasks that take humans multiple hours - 2024’s best models tapped out at under 30 minutes. METR conclude that “the length of tasks AI can do is doubling every 7 months”. I'm not convinced that pattern will continue to hold, but it's an eye-catching way of illustrating current trends in agent capabilities.

The most successful consumer product launch of all time happened in March, and the product didn't even have a name. One of the signature features of GPT-4o in May 2024 was meant to be its multimodal output - the "o" stood for "omni" and OpenAI's launch announcement included numerous "coming soon" features where the model output images in addition to text. Then... nothing. The image output feature failed to materialize. In March we finally got to see what this could do - albeit in a shape that felt more like the existing DALL-E. OpenAI made this new image generation available in ChatGPT with the key feature that you could upload your own images and use prompts to tell it how to modify them.
This new feature was responsible for 100 million ChatGPT signups in a week. At peak they saw 1 million account creations in a single hour! Tricks like "ghiblification" - modifying a photo to look like a frame from a Studio Ghibli movie - went viral time and time again. OpenAI released an API version of the model called "gpt-image-1", later joined by a cheaper gpt-image-1-mini in October and a much improved gpt-image-1.5 on December 16th.

The most notable open weight competitor to this came from Qwen with their Qwen-Image generation model on August 4th followed by Qwen-Image-Edit on August 19th. This one can run on (well equipped) consumer hardware! They followed with Qwen-Image-Edit-2511 in November and Qwen-Image-2512 on 30th December, neither of which I've tried yet.

The even bigger news in image generation came from Google with their Nano Banana models, available via Gemini. Google previewed an early version of this in March under the name "Gemini 2.0 Flash native image generation". The really good one landed on August 26th, where they started cautiously embracing the codename "Nano Banana" in public (the API model was called "Gemini 2.5 Flash Image"). Nano Banana caught people's attention because it could generate useful text! It was also clearly the best model at following image editing instructions. In November Google fully embraced the "Nano Banana" name with the release of Nano Banana Pro. This one doesn't just generate text, it can output genuinely useful detailed infographics and other text and information-heavy images. It's now a professional-grade tool. Max Woolf published the most comprehensive guide to Nano Banana prompting, and followed that up with an essential guide to Nano Banana Pro in December. I've mainly been using it to add kākāpō parrots to my photos.

Given how incredibly popular these image tools are it's a little surprising that Anthropic haven't released or integrated anything similar into Claude. I see this as further evidence that they're focused on AI tools for professional work, but Nano Banana Pro is rapidly proving itself to be of value to anyone whose work involves creating presentations or other visual materials.

In July reasoning models from both OpenAI and Google Gemini achieved gold medal performance in the International Math Olympiad, a prestigious mathematical competition held annually (bar 1980) since 1959. This was notable because the IMO poses challenges that are designed specifically for that competition. There's no chance any of these were already in the training data! It's also notable because neither of the models had access to tools - their solutions were generated purely from their internal knowledge and token-based reasoning capabilities. Turns out sufficiently advanced LLMs can do math after all!

In September OpenAI and Gemini pulled off a similar feat for the International Collegiate Programming Contest (ICPC) - again notable for having novel, previously unpublished problems. This time the models had access to a code execution environment but otherwise no internet access. I don't believe the exact models used for these competitions have been released publicly, but Gemini's Deep Think and OpenAI's GPT-5 Pro should provide close approximations.

With hindsight, 2024 was the year of Llama.
Meta's Llama models were by far the most popular open weight models - the original Llama kicked off the open weight revolution back in 2023 and the Llama 3 series, in particular the 3.1 and 3.2 dot-releases, were huge leaps forward in open weight capability. Llama 4 had high expectations, and when it landed in April it was... kind of disappointing. There was a minor scandal where the model tested on LMArena turned out not to be the model that was released, but my main complaint was that the models were too big . The neatest thing about previous Llama releases was that they often included sizes you could run on a laptop. The Llama 4 Scout and Maverick models were 109B and 400B, so big that even quantization wouldn't get them running on my 64GB Mac. They were trained using the 2T Llama 4 Behemoth which seems to have been forgotten now - it certainly wasn't released. It says a lot that none of the most popular models listed by LM Studio are from Meta, and the most popular on Ollama is still Llama 3.1, which is low on the charts there too. Meta's AI news this year mainly involved internal politics and vast amounts of money spent hiring talent for their new Superintelligence Labs . It's not clear if there are any future Llama releases in the pipeline or if they've moved away from open weight model releases to focus on other things. Last year OpenAI remained the undisputed leader in LLMs, especially given o1 and the preview of their o3 reasoning models. This year the rest of the industry caught up. OpenAI still have top tier models, but they're being challenged across the board. In image models they're still being beaten by Nano Banana Pro. For code a lot of developers rate Opus 4.5 very slightly ahead of GPT-5.2 Codex. In open weight models their gpt-oss models, while great, are falling behind the Chinese AI labs. Their lead in audio is under threat from the Gemini Live API . Where OpenAI are winning is in consumer mindshare. Nobody knows what an "LLM" is but almost everyone has heard of ChatGPT. Their consumer apps still dwarf Gemini and Claude in terms of user numbers. Their biggest risk here is Gemini. In December OpenAI declared a Code Red in response to Gemini 3, delaying work on new initiatives to focus on the competition with their key products. Google Gemini had a really good year . They posted their own victorious 2025 recap here . 2025 saw Gemini 2.0, Gemini 2.5 and then Gemini 3.0 - each model family supporting audio/video/image/text input of 1,000,000+ tokens, priced competitively and proving more capable than the last. They also shipped Gemini CLI (their open source command-line coding agent, since forked by Qwen for Qwen Code ), Jules (their asynchronous coding agent), constant improvements to AI Studio, the Nano Banana image models, Veo 3 for video generation, the promising Gemma 3 family of open weight models and a stream of smaller features. Google's biggest advantage lies under the hood. Almost every other AI lab trains with NVIDIA GPUs, which are sold at a margin that props up NVIDIA's multi-trillion dollar valuation. Google use their own in-house hardware, TPUs, which they've demonstrated this year work exceptionally well for both training and inference of their models. When your number one expense is time spent on GPUs, having a competitor with their own, optimized and presumably much cheaper hardware stack is a daunting prospect. 
It continues to tickle me that Google Gemini is the ultimate example of a product name that reflects the company's internal org-chart - it's called Gemini because it came out of the bringing together (as twins) of Google's DeepMind and Google Brain teams.

I first asked an LLM to generate an SVG of a pelican riding a bicycle in October 2024, but 2025 is when I really leaned into it. It's ended up a meme in its own right. I originally intended it as a dumb joke. Bicycles are hard to draw, as are pelicans, and pelicans are the wrong shape to ride a bicycle. I was pretty sure there wouldn't be anything relevant in the training data, so asking a text-output model to generate an SVG illustration of one felt like a somewhat absurdly difficult challenge.

To my surprise, there appears to be a correlation between how good the model is at drawing pelicans on bicycles and how good it is overall. I don't really have an explanation for this. The pattern only became clear to me when I was putting together a last-minute keynote (they had a speaker drop out) for the AI Engineer World's Fair in July. You can read (or watch) the talk I gave here: The last six months in LLMs, illustrated by pelicans on bicycles. My full collection of illustrations can be found on my pelican-riding-a-bicycle tag - 89 posts and counting.

There is plenty of evidence that the AI labs are aware of the benchmark. It showed up (for a split second) in the Google I/O keynote in May, got a mention in an Anthropic interpretability research paper in October and I got to talk about it in a GPT-5 launch video filmed at OpenAI HQ in August. Are they training specifically for the benchmark? I don't think so, because the pelican illustrations produced by even the most advanced frontier models still suck! In What happens if AI labs train for pelicans riding bicycles? I confessed to my devious objective:

Truth be told, I'm playing the long game here. All I've ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.

My favourite is still this one that I got from GPT-5:

I started my tools.simonwillison.net site last year as a single location for my growing collection of vibe-coded / AI-assisted HTML+JavaScript tools. I wrote several longer pieces about this throughout the year:

- Here's how I use LLMs to help me write code
- Adding AI-generated descriptions to my tools collection
- Building a tool to copy-paste share terminal sessions using Claude Code for web
- Useful patterns for building HTML tools - my favourite post of the bunch.

The new browse all by month page shows I built 110 of these in 2025! I really enjoy building in this way, and I think it's a fantastic way to practice and explore the capabilities of these models. Almost every tool is accompanied by a commit history that links to the prompts and transcripts I used to build them. I'll highlight a few of my favourites from the past year:

- blackened-cauliflower-and-turkish-style-stew is ridiculous. It's a custom cooking timer app for anyone who needs to prepare Green Chef's Blackened Cauliflower and Turkish-style Spiced Chickpea Stew recipes at the same time. Here's more about that one.
- is-it-a-bird takes inspiration from xkcd 1425, loads a 150MB CLIP model via Transformers.js and uses it to say if an image or webcam feed is a bird or not.
- bluesky-thread lets me view any thread on Bluesky with a "most recent first" option to make it easier to follow new posts as they arrive.

A lot of the others are useful tools for my own workflow like svg-render and render-markdown and alt-text-extractor. I built one that does privacy-friendly personal analytics against localStorage to keep track of which tools I use the most often.

Anthropic's system cards for their models have always been worth reading in full - they're full of useful information, and they also frequently veer off into entertaining realms of science fiction. The Claude 4 system card in May had some particularly fun moments - highlights mine:

Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts.
This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.

In other words, Claude 4 might snitch you out to the feds. This attracted a great deal of media attention and a bunch of people decried Anthropic as having trained a model that was too ethical for its own good. Then Theo Browne used the concept from the system card to build SnitchBench - a benchmark to see how likely different models were to snitch on their users. It turns out they almost all do the same thing! Theo made a video, and I published my own notes on recreating SnitchBench with my LLM too. The key prompt that makes this work is:

I recommend not putting that in your system prompt! Anthropic's original Claude 4 system card said the same thing:

We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable.

In a tweet in February Andrej Karpathy coined the term "vibe coding", with an unfortunately long definition (I miss the 140 character days) that many people failed to read all the way to the end:

There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.

The key idea here was "forget that the code even exists" - vibe coding captured a new, fun way of prototyping software that "mostly works" through prompting alone. I don't know if I've ever seen a new term catch on - or get distorted - so quickly in my life. A lot of people instead latched on to vibe coding as a catch-all for anything where an LLM is involved in programming. I think that's a waste of a great term, especially since it's becoming clear that most programming will involve some level of AI-assistance in the near future. Because I'm a sucker for tilting at linguistic windmills I tried my best to encourage the original meaning of the term:

- Not all AI-assisted programming is vibe coding (but vibe coding rocks) in March
- Two publishers and three authors fail to understand what “vibe coding” means in May (one book subsequently changed its title to the much better "Beyond Vibe Coding").
- Vibe engineering in October, where I tried to suggest an alternative term for what happens when professional engineers use AI assistance to build production-grade software.
- Your job is to deliver code you have proven to work in December, about how professional software development is about code that demonstrably works, no matter how you built it.

I don't think this battle is over yet. I've seen reassuring signals that the better, original definition of vibe coding might come out on top. I should really get a less confrontational linguistic hobby!

Anthropic introduced their Model Context Protocol specification in November 2024 as an open standard for integrating tool calls with different LLMs.
In early 2025 it exploded in popularity. There was a point in May where OpenAI , Anthropic , and Mistral all rolled out API-level support for MCP within eight days of each other! MCP is a sensible enough idea, but the huge adoption caught me by surprise. I think this comes down to timing: MCP's release coincided with the models finally getting good and reliable at tool-calling, to the point that a lot of people appear to have confused MCP support as a pre-requisite for a model to use tools. For a while it also felt like MCP was a convenient answer for companies that were under pressure to have "an AI strategy" but didn't really know how to do that. Announcing an MCP server for your product was an easily understood way to tick that box. The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents. It appears that the best possible tool for any situation is Bash - if your agent can run arbitrary shell commands, it can do anything that can be done by typing commands into a terminal. Since leaning heavily into Claude Code and friends myself I've hardly used MCP at all - I've found CLI tools like and libraries like Playwright to be better alternatives to the GitHub and Playwright MCPs. Anthropic themselves appeared to acknowledge this later in the year with their release of the brilliant Skills mechanism - see my October post Claude Skills are awesome, maybe a bigger deal than MCP . MCP involves web servers and complex JSON payloads. A Skill is a Markdown file in a folder, optionally accompanied by some executable scripts. Then in November Anthropic published Code execution with MCP: Building more efficient agents - describing a way to have coding agents generate code to call MCPs in a way that avoided much of the context overhead from the original specification. (I'm proud of the fact that I reverse-engineered Anthropic's skills a week before their announcement , and then did the same thing to OpenAI's quiet adoption of skills two months after that .) MCP was donated to the new Agentic AI Foundation at the start of December. Skills were promoted to an "open format" on December 18th . Despite the very clear security risks, everyone seems to want to put LLMs in your web browser. OpenAI launched ChatGPT Atlas in October, built by a team including long-time Google Chrome engineers Ben Goodger and Darin Fisher. Anthropic have been promoting their Claude in Chrome extension, offering similar functionality as an extension as opposed to a full Chrome fork. Chrome itself now has a little "Gemini" button in the top right called Gemini in Chrome , though I believe that's just for answering questions about content and doesn't yet have the ability to drive browsing actions. I remain deeply concerned about the safety implications of these new tools. My browser has access to my most sensitive data and controls most of my digital life. A prompt injection attack against a browsing agent that can exfiltrate or modify that data is a terrifying prospect. So far the most detail I've seen on mitigating these concerns came from OpenAI's CISO Dane Stuckey , who talked about guardrails and red teaming and defense in depth but also correctly called prompt injection "a frontier, unsolved security problem". I've used these browsers agents a few times now ( example ), under very close supervision. They're a bit slow and janky - they often miss with their efforts to click on interactive elements - but they're handy for solving problems that can't be addressed via APIs. 
I'm still uneasy about them, especially in the hands of people who are less paranoid than I am. I've been writing about prompt injection attacks for more than three years now. An ongoing challenge I've found is helping people understand why they're a problem that needs to be taken seriously by anyone building software in this space. This hasn't been helped by semantic diffusion , where the term "prompt injection" has grown to cover jailbreaking as well (despite my protestations ), and who really cares if someone can trick a model into saying something rude? So I tried a new linguistic trick! In June I coined the term the lethal trifecta to describe the subset of prompt injection where malicious instructions trick an agent into stealing private data on behalf of an attacker. A trick I use here is that people will jump straight to the most obvious definition of any new term that they hear. "Prompt injection" sounds like it means "injecting prompts". "The lethal trifecta" is deliberately ambiguous: you have to go searching for my definition if you want to know what it means! It seems to have worked. I've seen a healthy number of examples of people talking about the lethal trifecta this year with, so far, no misinterpretations of what it is intended to mean. I wrote significantly more code on my phone this year than I did on my computer. Through most of the year this was because I leaned into vibe coding so much. My tools.simonwillison.net collection of HTML+JavaScript tools was mostly built this way: I would have an idea for a small project, prompt Claude Artifacts or ChatGPT or (more recently) Claude Code via their respective iPhone apps, then either copy the result and paste it into GitHub's web editor or wait for a PR to be created that I could then review and merge in Mobile Safari. Those HTML tools are often ~100-200 lines of code, full of uninteresting boilerplate and duplicated CSS and JavaScript patterns - but 110 of them adds up to a lot! Up until November I would have said that I wrote more code on my phone, but the code I wrote on my laptop was clearly more significant - fully reviewed, better tested and intended for production use. In the past month I've grown confident enough in Claude Opus 4.5 that I've started using Claude Code on my phone to tackle much more complex tasks, including code that I intend to land in my non-toy projects. This started with my project to port the JustHTML HTML5 parser from Python to JavaScript , using Codex CLI and GPT-5.2. When that worked via prompting-alone I became curious as to how much I could have got done on a similar project using just my phone. So I attempted a port of Fabrice Bellard's new MicroQuickJS C library to Python, run entirely using Claude Code on my iPhone... and it mostly worked ! Is it code that I'd use in production? Certainly not yet for untrusted code , but I'd trust it to execute JavaScript I'd written myself. The test suite I borrowed from MicroQuickJS gives me some confidence there. This turns out to be the big unlock: the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against. I call these conformance suites and I've started deliberately looking out for them - so far I've had success with the html5lib tests , the MicroQuickJS test suite and a not-yet-released project against the comprehensive WebAssembly spec/test collection . 
If you're introducing a new protocol or even a new programming language to the world in 2026 I strongly recommend including a language-agnostic conformance suite as part of your project. I've seen plenty of hand-wringing that the need to be included in LLM training data means new technologies will struggle to gain adoption. My hope is that the conformance suite approach can help mitigate that problem and make it easier for new ideas of that shape to gain traction. Towards the end of 2024 I was losing interest in running local LLMs on my own machine. My interest was re-kindled by Llama 3.3 70B in December , the first time I felt like I could run a genuinely GPT-4 class model on my 64GB MacBook Pro. Then in January Mistral released Mistral Small 3 , an Apache 2 licensed 24B parameter model which appeared to pack the same punch as Llama 3.3 70B using around a third of the memory. Now I could run a ~GPT-4 class model and have memory left over to run other apps! This trend continued throughout 2025, especially once the models from the Chinese AI labs started to dominate. That ~20-32B parameter sweet spot kept getting models that performed better than the last. I got small amounts of real work done offline! My excitement for local LLMs was very much rekindled. The problem is that the big cloud models got better too - including those open weight models that, while freely available, were far too large (100B+) to run on my laptop. Coding agents changed everything for me. Systems like Claude Code need more than a great model - they need a reasoning model that can perform reliable tool calling invocations dozens if not hundreds of times over a constantly expanding context window. I have yet to try a local model that handles Bash tool calls reliably enough for me to trust that model to operate a coding agent on my device. My next laptop will have at least 128GB of RAM, so there's a chance that one of the 2026 open weight models might fit the bill. For now though I'm sticking with the best available frontier hosted models as my daily drivers. I played a tiny role helping to popularize the term "slop" in 2024, writing about it in May and landing quotes in the Guardian and the New York Times shortly afterwards. This year Merriam-Webster crowned it word of the year ! slop ( noun ): digital content of low quality that is produced usually in quantity by means of artificial intelligence I like that it represents a widely understood feeling that poor quality AI-generated content is bad and should be avoided. I'm still holding hope that slop won't end up as bad a problem as many people fear. The internet has always been flooded with low quality content. The challenge, as ever, is to find and amplify the good stuff. I don't see the increased volume of junk as changing that fundamental dynamic much. Curation matters more than ever. That said... I don't use Facebook, and I'm pretty careful at filtering or curating my other social media habits. Is Facebook still flooded with Shrimp Jesus or was that a 2024 thing? I heard fake videos of cute animals getting rescued is the latest trend. It's quite possible the slop problem is a growing tidal wave that I'm innocently unaware of. I nearly skipped writing about the environmental impact of AI for this year's post (here's what I wrote in 2024 ) because I wasn't sure if we had learned anything new this year - AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable. 
What's interesting in 2025 is that public opinion appears to be shifting quite dramatically against new data center construction. Here's a Guardian headline from December 8th: More than 200 environmental groups demand halt to new US datacenters. Opposition at the local level appears to be rising sharply across the board too. I've been convinced by Andy Masley that the water usage issue is mostly overblown, which is a problem mainly because it acts as a distraction from the very real issues around energy consumption, carbon emissions and noise pollution. AI labs continue to find new efficiencies to help serve increased quality of models using less energy per token, but the impact of that is classic Jevons paradox - as tokens get cheaper we find more intense ways to use them, like spending $200/month on millions of tokens to run coding agents.

As an obsessive collector of neologisms, here are my own favourites from 2025. You can see a longer list in my definitions tag.

- Vibe coding, obviously.
- Vibe engineering - I'm still on the fence about whether I should try to make this happen!
- The lethal trifecta, my one attempted coinage of the year that seems to have taken root.
- Context rot, by Workaccount2 on Hacker News, for the thing where model output quality falls as the context grows longer during a session.
- Context engineering as an alternative to prompt engineering that helps emphasize how important it is to design the context you feed to your model.
- Slopsquatting by Seth Larson, where an LLM hallucinates an incorrect package name which is then maliciously registered to deliver malware.
- Vibe scraping - another of mine that didn't really go anywhere, for scraping projects implemented by coding agents driven by prompts.
- Asynchronous coding agent for Claude for web / Codex cloud / Google Jules
- Extractive contributions by Nadia Eghbal for open source contributions where "the marginal cost of reviewing and merging that contribution is greater than the marginal benefit to the project’s producers".

If you've made it this far, I hope you've found this useful! You can subscribe to my blog in a feed reader or via email, or follow me on Bluesky or Mastodon or Twitter. If you'd like a review like this on a monthly basis instead I also operate a $10/month sponsors only newsletter with a round-up of the key developments in the LLM space over the past 30 days. Here are preview editions for September, October, and November - I'll be sending December's out some time tomorrow. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.

In this post:

- The year of "reasoning"
- The year of agents
- The year of coding agents and Claude Code
- The year of LLMs on the command-line
- The year of YOLO and the Normalization of Deviance
- The year of $200/month subscriptions
- The year of top-ranked Chinese open weight models
- The year of long tasks
- The year of prompt-driven image editing
- The year models won gold in academic competitions
- The year that Llama lost its way
- The year that OpenAI lost their lead
- The year of Gemini
- The year of pelicans riding bicycles
- The year I built 110 tools
- The year of the snitch!
- The year of vibe coding
- The (only?) year of MCP
- The year of alarmingly AI-enabled browsers
- The year of the lethal trifecta
- The year of programming on my phone
- The year of conformance suites
- The year local models got good, but cloud models got even better
- The year of slop
- The year that data centers got extremely unpopular
- My own words of the year
- That's a wrap for 2025

Claude Code
Mistral Vibe
- Not all AI-assisted programming is vibe coding (but vibe coding rocks) in March
- Two publishers and three authors fail to understand what "vibe coding" means in May (one book subsequently changed its title to the much better "Beyond Vibe Coding")
- Vibe engineering in October, where I tried to suggest an alternative term for what happens when professional engineers use AI assistance to build production-grade software
- Your job is to deliver code you have proven to work in December, about how professional software development is about code that demonstrably works, no matter how you built it

- Vibe coding, obviously.
- Vibe engineering - I'm still on the fence about whether I should try to make this happen!
- The lethal trifecta, my one attempted coinage of the year that seems to have taken root.
- Context rot, by Workaccount2 on Hacker News, for the thing where model output quality falls as the context grows longer during a session.
- Context engineering as an alternative to prompt engineering that helps emphasize how important it is to design the context you feed to your model.
- Slopsquatting by Seth Larson, where an LLM hallucinates an incorrect package name which is then maliciously registered to deliver malware.
- Vibe scraping - another of mine that didn't really go anywhere, for scraping projects implemented by coding agents driven by prompts.
- Asynchronous coding agent for Claude for web / Codex cloud / Google Jules
- Extractive contributions by Nadia Eghbal for open source contributions where "the marginal cost of reviewing and merging that contribution is greater than the marginal benefit to the project's producers".

2 views
Stratechery 4 weeks ago

ChatGPT Image 1.5; Apple v. Epic, Continued; Holiday Schedule

ChatGPT Image 1.5 launched, and while it seems comparable to Gemini's Nano Banana Pro, the product around it shows OpenAI's advantages. Then, Apple v. Epic rolls on.

0 views
Christian Jauvin 1 month ago

Insurmountable Hans

The ARC benchmark was designed by François Chollet to serve one goal: be sufficiently difficult and demanding that it cannot be "hacked" by some "cheating" AI techniques, LLM or whatever. But it must do so in a rigorous, systematic and simple-to-define way; it cannot be vague or ambiguous. It must be (relatively) easy for a human, but hard for a program. And when I first looked at it, I admired its simplicity and purity: the problems are simple but deep, and it's clear that for many of them, you need to grasp something that goes beyond mere pattern recognition or superficial pattern matching. They seem to require some seriously deeper thinking. And if, like me, you thought for a minute about how you'd try to tackle them in a programmatic way (ML or otherwise), it was quite easy to become convinced that this is a very good benchmark. And at first, that's exactly what happened on Kaggle, for instance: nobody could get even remotely decent results, and the problem set really felt like a tough nut to crack. From there, the temptation was great to suggest that whenever ARC was cracked, AGI would have arrived!

0 views
Ahead of AI 1 month ago

A Technical Tour of the DeepSeek Models from V3 to V3.2

Similar to DeepSeek V3, the team released their new flagship model over a major US holiday weekend. Given DeepSeek V3.2's really good performance (at GPT-5 and Gemini 3.0 Pro level), and the fact that it's also available as an open-weight model, it's definitely worth a closer look.

Figure 1: Benchmark comparison between DeepSeek V3.2 and proprietary flagship models. This is an annotated figure from the DeepSeek V3.2 report.

I covered the predecessor, DeepSeek V3, at the very beginning of my The Big LLM Architecture Comparison article, which I kept extending over the months as new architectures got released. Originally, as I just got back from Thanksgiving holidays with my family, I planned to "just" extend the article with this new DeepSeek V3.2 release by adding another section, but I then realized that there's just too much interesting information to cover, so I decided to make this a longer, standalone article. There's a lot of interesting ground to cover and a lot to learn from their technical reports, so let's get started!

While DeepSeek V3 wasn't popular immediately upon release in December 2024, the DeepSeek R1 reasoning model (based on the identical architecture, using DeepSeek V3 as a base model) helped DeepSeek become one of the most popular open-weight model families and a legit alternative to proprietary models such as the ones by OpenAI, Google, xAI, and Anthropic.

Figure 2: DeepSeek V3/R1 architecture from December 2024. We will revisit and discuss architectural details in a later section.

So, what's new since V3/R1? I am sure that the DeepSeek team has been super busy this year. However, there hasn't been a major release in the last 10-11 months since DeepSeek R1. Personally, I think it's reasonable to take ~1 year between major LLM releases since it's A LOT of work. However, I saw on various social media platforms that people were pronouncing the team "dead" (as a one-hit wonder). I am sure the DeepSeek team has also been busy navigating the switch from NVIDIA to Huawei chips. By the way, I am not affiliated with them, nor have I spoken with them; everything here is based on public information. As far as I know, they are back to using NVIDIA chips.

Finally, it's also not that they haven't released anything. There have been a couple of smaller releases that trickled in this year, for instance, DeepSeek V3.1 and V3.2-Exp.

Figure 3: DeepSeek releases since last year. The main models are shown in red.

As I predicted back in September, the DeepSeek V3.2-Exp release was intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model. V3.2-Exp and V3.2 use a non-standard sparse attention variant that requires custom code, but more on this mechanism later. (I was tempted to cover it in my previous Beyond Standard LLMs article, but Kimi Linear was released around then, which I prioritized for that article's section on new attention variants.)

Before discussing further model details, it might be worthwhile to discuss the overall model types. Originally, DeepSeek V3 was released as a base model, and DeepSeek R1 added additional post-training to develop a dedicated reasoning model. This procedure is summarized in the figure below.

Figure 4: Overview of the DeepSeek R1 training pipeline. This figure is from my more detailed Understanding Reasoning LLMs article, where you can read more about the training pipeline.
What's worth noting here is that DeepSeek V3 is a base model, and DeepSeek R1 is a dedicated reasoning model.

In parallel with DeepSeek, other teams have also released many really strong open-weight reasoning models. One of the strongest open-weight models this year was Qwen3. Originally, it was released as a hybrid reasoning model, which means that users were able to toggle between reasoning and non-reasoning modes within the same model. (In the case of Qwen3, this toggling was enabled via the tokenizer by adding/omitting special tags.) Since then, LLM teams have released (and in some cases gone back and forth between) both dedicated reasoning models and instruct/reasoning hybrid models, as shown in the timeline below.

Figure 5: The timeline of some of the reasoning and hybrid models released this year.

For instance, Qwen3 started out as a hybrid model, but the Qwen team later released separate instruct and reasoning models as they were easier to develop and yielded better performance in each respective use case. Some models like OpenAI's gpt-oss only come in a hybrid variant where users can choose the reasoning effort via a system prompt (I suspect this is handled similarly in GPT-5 and GPT-5.1).

And in the case of DeepSeek, it looks like they moved in the opposite direction, from a dedicated reasoning model (R1) to a hybrid model (V3.1 and V3.2). However, I suspect that R1 was mainly a research project to develop reasoning methods and the best reasoning model at the time. The V3.2 release may be more about developing the best overall model for different use cases. (Here, R1 was more like a testbed or prototype model.) And I also suspect that, while the DeepSeek team developed V3.1 and V3.2 with reasoning capabilities, they might still be working on a dedicated R2 model.

Before discussing the new DeepSeek V3.2 release in more detail, I thought it would be helpful to start with an overview of the main changes going from V3 to V3.1.

I already discussed DeepSeek V3 and R1 in great detail in several other articles. To summarize the main points, DeepSeek V3 is a base model that uses two noteworthy architectural features: Mixture-of-Experts (MoE) and Multi-Head Latent Attention (MLA). I think you are probably well familiar with MoE at this point, so I am skipping the introduction here. However, if you want to read more, I recommend the short overview in my The Big Architecture Comparison article for more context.

The other noteworthy highlight is the use of MLA. MLA, which is used in DeepSeek V2, V3, and R1, offers a memory-saving strategy that pairs particularly well with KV caching. The idea in MLA is that it compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache. At inference time, these compressed tensors are projected back to their original size before being used, as shown in the figure below. This adds an extra matrix multiplication but reduces memory usage. (As a side note, the queries are also compressed, but only during training, not inference.)

Figure 6: Multi-Head Latent Attention (MLA) in DeepSeek V3/R1. (The compressed space of the query vector is not shown for simplicity.)

The figure above illustrates the main idea behind MLA, where the keys and values are first projected into a latent vector, which can then be stored in the KV cache to reduce memory requirements.
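A minimal PyTorch sketch of this compression idea (my own simplification with illustrative dimensions, not DeepSeek's actual implementation; RoPE handling and the query-side compression are omitted):

```python
import torch
import torch.nn as nn

# MLA-style KV compression: cache one small latent vector per token,
# then up-project it back to full-size keys/values at attention time.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compression; its output is what the KV cache stores
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection for keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection for values

x = torch.randn(1, 10, d_model)                 # hidden states for a 10-token prefix
c_kv = down_kv(x)                               # (1, 10, 512): cached latent instead of full K and V
k = up_k(c_kv).view(1, 10, n_heads, d_head)     # reconstructed keys, used for attention
v = up_v(c_kv).view(1, 10, n_heads, d_head)     # reconstructed values, used for attention
```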
MLA then requires a later up-projection back into the original key-value space at attention time, but overall it improves efficiency (as an analogy, you can think of the down- and up-projections in LoRA). Note that the query is also projected into a separate compressed space, similar to what's shown for the keys and values. However, I omitted it in the figure above for simplicity. By the way, as mentioned earlier, MLA is not new in DeepSeek V3, as its DeepSeek V2 predecessor also used (and even introduced) it.

DeepSeek R1 uses the same architecture as DeepSeek V3 above. The difference is the training recipe: using DeepSeek V3 as the base model, DeepSeek R1 focused on the Reinforcement Learning with Verifiable Rewards (RLVR) method to improve the reasoning capabilities of the model. The core idea in RLVR is to have the model learn from responses that can be verified symbolically or programmatically, such as math and code (but this can, of course, also be extended beyond these two domains).

Figure 7: An example of a verifiable task.

The GRPO algorithm, which is short for Group Relative Policy Optimization, is essentially a simpler variant of the Proximal Policy Optimization (PPO) algorithm that is popular in Reinforcement Learning with Human Feedback (RLHF), which is used for LLM alignment.

Figure 8: Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by removing the reward model, relying instead on verifiable rewards from symbolic tools such as calculators or compilers.

I covered the RLVR training with their GRPO algorithm in more detail (including the math behind it) in my The State of Reinforcement Learning for LLM Reasoning article if you are interested in additional information.

As the DeepSeek team stated themselves, DeepSeek R1-0528 is basically a "minor version upgrade." The architecture remains the same as in DeepSeek V3/R1, and the improvements are on the training side to bring it up to par with OpenAI o3 and Gemini 2.5 Pro at the time. Unfortunately, the DeepSeek team didn't release any specific information describing how this was achieved; however, they stated that it partly comes from optimizations in their post-training pipeline. Also, based on what's been shared, I think it's likely that the hosted version of the model uses more computational resources at inference time (longer reasoning).

DeepSeek V3.1 is a hybrid model with both general chat (instruct) and reasoning capabilities. I.e., instead of developing two separate models, there is now one model in which users can switch modes via the chat prompt template (similar to the initial Qwen3 model). DeepSeek V3.1 is based on DeepSeek V3.1-Base, which is in turn based on DeepSeek V3. They all share the same architecture.

DeepSeek V3.2-Exp (Sep 2025) is where it gets more interesting. Originally, DeepSeek V3.2-Exp didn't top the benchmarks, which is why there wasn't as much excitement around this model upon release. However, as I speculated back in September, this was likely an early, experimental release to get the infrastructure (especially the inference and deployment tools) ready for a larger release, since there are a few architectural changes in DeepSeek V3.2-Exp. The bigger release is DeepSeek V3.2 (not V4), but more on that later. So, what's new in DeepSeek V3.2-Exp?
First, DeepSeek V3.2-Exp was trained on top of DeepSeek V3.1-Terminus as its base model. What's DeepSeek V3.1-Terminus? It's just a small improvement over the DeepSeek V3.1 checkpoint mentioned in the previous section. The technical report states that:

DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios.

As the paragraph above states, the main innovation here is the DeepSeek Sparse Attention (DSA) mechanism that they add to DeepSeek V3.1-Terminus before doing further training on that checkpoint. This DSA consists of (1) a lightning indexer and (2) a token selector, and the goal is to selectively reduce the context to improve efficiency.

To explain how it works, let's start with sliding-window attention, a technique (recently used by Gemma 3 and Olmo 3) that limits the attention window to a fixed size, as illustrated in the figure below.

Figure 9: In sliding window attention, the current query token doesn't attend to all previous tokens but just a subset.

DSA is based on the same idea as sliding-window attention: only a subset of past tokens can be attended to. However, instead of selecting the tokens that can be attended via a fixed-width sliding window, DSA has an indexer and token selector to decide which past tokens can be attended. In other words, the tokens that can be attended are more random, as illustrated in the figure below.

Figure 10: In DSA, the current token can attend a select number of tokens in the past (instead of all tokens like in regular causal attention).

However, while I said "random" above, the pattern of which past tokens are selected is not actually random but learned. In practice, DSA uses its so-called lightning indexer to compute relevance scores for each new query token based on all previous tokens. For this computation, the lightning indexer uses the compressed token representations in DeepSeek's Multi-Head Latent Attention (MLA) and computes the similarity of the current token towards the other tokens. The similarity score is basically a scaled dot product between query and key vectors passed through a ReLU function. If you are interested in the mathematical details, the lightning indexer similarity score from the paper has the following form:

I_{t,s} = sum over indexer heads j of w_{t,j} * ReLU(q_{t,j} . k_s)

Here, w is a learned per-head weighting coefficient that determines how much each indexer head should contribute to the final similarity score. The q refers to the query, and the k refers to the key vector. And below is a list of the different subscripts:

- t: position of the current query token;
- s: position of a previous token in the sequence (0 ≤ s < t);
- j: the index over the different indexer heads (Figure 10 above only showed one head for simplicity), so q_{t,j} means "query vector for current token t in indexer head j".

You may notice that the indexer is only over the queries, not the keys. That's because the model only needs to decide which past tokens each new query should consider. The keys are already compressed and stored in the KV cache, so the indexer does not need to score or compress them again over the different heads.
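A small PyTorch sketch of this indexer scoring, together with the top-k token selection described next (my own simplification with made-up dimensions, not DeepSeek's code):

```python
import torch

n_idx_heads, d_idx, seq_len, top_k = 4, 64, 6, 3

q = torch.randn(n_idx_heads, d_idx)   # indexer queries q_{t,j} for the current token t, one per indexer head
w = torch.randn(n_idx_heads)          # learned per-head weights w_{t,j}
keys = torch.randn(seq_len, d_idx)    # (compressed) representations k_s of the previous tokens s < t

# I_{t,s} = sum_j w_{t,j} * ReLU(q_{t,j} . k_s), one relevance score per previous token
scores = (w[:, None] * torch.relu(q @ keys.T)).sum(dim=0)   # shape: (seq_len,)

# Token selector: keep only the top-k highest-scoring past tokens and mask out the rest.
selected = torch.topk(scores, k=min(top_k, seq_len)).indices
mask = torch.zeros(seq_len, dtype=torch.bool)
mask[selected] = True   # True = this past token may be attended to by token t
```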
The ReLU function here, since it is max(0, x), zeroes out negative dot-product contributions, which could theoretically enable sparsity, but since there is a summation over the different heads, it's unlikely that the indexer score is actually 0. The sparsity rather comes from the separate token selector, which keeps only a small number of high-scoring tokens (for example, the top-k positions) and constructs a sparse attention mask that masks out the tokens that are not contained in the selected subset. (The k in top-k, not to be confused with the k that is used for the keys in the equation above, is a hyperparameter that is set to 2048 in the model code that the DeepSeek team shared.) The figure below illustrates the whole process in a flowchart.

Figure 11: A visual summary of DeepSeek V3.2's Sparse Attention mechanism.

To sum it up, the indexer and token selector result in each token attending to a few past tokens that the model has learned to consider most relevant, rather than all tokens or a fixed local window. The goal here was not to improve the performance over DeepSeek V3.1-Terminus but to reduce the performance degradation (due to the sparse attention mechanism) while benefiting from improved efficiency. Overall, DSA reduces the computational complexity of the attention mechanism from a quadratic O(L²), where L is the sequence length, to a linear O(L·k), where k (≪ L) is the number of selected tokens.

Having discussed DeepSeek V3.2-Exp, we are getting closer to the main topic of this article: DeepSeek V3.2. However, there is one more puzzle piece to discuss first. On November 27, 2025 (Thanksgiving in the US), and just 4 days before the DeepSeek V3.2 release, the DeepSeek team released DeepSeekMath V2, based on DeepSeek V3.2-Exp-Base. This model was specifically developed for math and achieved gold-level scores in several math competitions. Essentially, we can think of it as a proof (of concept) model for DeepSeek V3.2, introducing one more technique.

The key aspect here is that reasoning models (like DeepSeek R1 and others) are trained with an external verifier, and the model learns, by itself, to write explanations before arriving at the final answer. However, the explanations may be incorrect. As the DeepSeek team succinctly states about the shortcomings of regular RLVR:

[...] correct answers don't guarantee correct reasoning. [...] a model can arrive at the correct answer through flawed logic or fortunate errors.

The other limitation of the DeepSeek R1 RLVR approach they aim to address is that:

[...] many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable.

So, to improve upon these two shortcomings, in this paper they train two models:

1. An LLM-based verifier for theorem proving.
2. The main model, a proof generator, which uses the LLM-based verifier as a reward model (instead of a symbolic verifier).

In addition to this self-verification via an LLM as described above, they also use self-refinement (covered in the upcoming Chapter 5 of my Build a Reasoning Model (From Scratch) book) to have the LLM iteratively improve its own answers.

Having an LLM score the intermediate steps is not new. There is a whole line of research on so-called process reward models, which has focused on this. Examples include Solving Math Word Problems With Process- and Outcome-based Feedback (2022) or Let's Verify Step by Step (2023), but there are many more.
The challenges with process reward models are that it's not easy to check whether the intermediate rewards are correct, and that they can also lead to reward hacking. In the DeepSeek R1 paper in Jan 2025, they didn't use process reward models as they found that:

its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

In this paper, they successfully revisit this in the form of self-verification. The motivation is that, even if no reference solution exists, humans can self-correct when reading proofs and identifying issues. So, in order to develop a better model for writing mathematical proofs (LLM 1 in the figure below), they developed a proof verifier (LLM 2), which can be used as an LLM-as-a-judge to score the prover's (LLM 1) outputs.

Figure 12: The general math proof generator (LLM 1) and verifier (LLM 2) setup.

The verifier LLM (LLM 2) takes in a rubric to score the generated proof, where the score is "1 for complete and rigorous proofs with all logical steps clearly justified," "0.5 for proofs with sound overall logic but minor errors or omitted details," and "0 for fundamentally flawed proofs containing fatal logical errors or critical gaps."

For the proof verifier model, they start with DeepSeek V3.2-Exp-SFT, a model they created based on DeepSeek V3.2-Exp by supervised fine-tuning on reasoning data (both math and code). They then further train the model with reinforcement learning using a format reward (a check of whether the solution is in the expected format) and a score reward based on how close the predicted score is to the actual score (annotated by human math experts).

The goal of the proof verifier (LLM 2) is to check the proofs generated by LLM 1, but who checks the proof verifier? To make the proof verifier more robust and prevent it from hallucinating issues, they developed a third LLM, a meta-verifier.

Figure 13: The meta-verifier (LLM 3) checks whether the verifier (LLM 2) is verifying the generator (LLM 1) correctly.

The meta-verifier (LLM 3) is also developed with reinforcement learning, similar to LLM 2. While the use of a meta-verifier is not required, the DeepSeek team reported that:

the average quality score of the verifier's proof analyses – as evaluated by the meta-verifier – improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction.

This is actually quite an interesting setup. If you are familiar with generative adversarial networks (GANs), you may see the analogy here: the proof verifier (think of it as a GAN discriminator) improves the proof generator, and the proof generator generates better proofs, further pushing the proof verifier. The meta score is used during training of the verifier (LLM 2) and the generator (LLM 1). It is not used at inference time in the self-refinement loop, which we will discuss in the next section.

In the previous section, we talked about self-verification, i.e., analyzing the quality of the solution. The purpose of this is to implement self-refinement, which means that the LLM can act upon the feedback and revise its answer. Traditionally, in self-refinement, which is an established and popular inference-scaling technique, we would use the same LLM for generating the solution, verifying it, and then refining it. In other words, in the previous Figures 12 and 13, LLM 1 and LLM 2 would be the same LLM.
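In code, a single-model generate-verify-refine loop of this kind might look roughly as follows (a sketch only: `llm` is a hypothetical text-completion callable standing in for the model, and the prompts, rubric parsing, and stop condition are illustrative):

```python
def self_refine(llm, problem: str, max_iters: int = 8) -> str:
    """Generate a proof, then repeatedly self-verify and revise it."""
    proof = llm(f"Write a rigorous, step-by-step proof.\n\nProblem: {problem}")
    for _ in range(max_iters):
        # Self-verification: the same model scores its own proof against the
        # 1 / 0.5 / 0 rubric and lists any issues it finds.
        review = llm(
            "Score this proof as 1 (rigorous), 0.5 (minor issues), or 0 (flawed), "
            f"and list the problems.\n\nProblem: {problem}\n\nProof: {proof}"
        )
        if "score: 1" in review.lower():
            break  # good enough; more iterations would only cost extra inference compute
        # Self-refinement: revise the proof based on the critique.
        proof = llm(
            "Revise the proof to address the issues below.\n\n"
            f"Problem: {problem}\n\nProof: {proof}\n\nIssues: {review}"
        )
    return proof
```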
So, a traditional self-refinement process would look as follows:

Figure 14: A classic self-refinement iteration where we use the same LLM for generating the initial response (Output 1), the evaluation (Eval), and the refined answer (Output 2).

However, the DeepSeek team observed a crucial issue with using the same LLM for both generation and verification in practice: when prompted to both generate and analyze its own proof in one shot, the generator tends to claim correctness even when the external verifier easily identifies flaws. In other words, while the generator can refine proofs based on external feedback, it fails to evaluate its own work with the same rigor as the dedicated verifier.

As a logical consequence, one would assume they use a separate proof generator (LLM 1) and proof verifier (LLM 2). So, the self-refinement loop used here becomes similar to the one shown in the figure below. Note that we omit LLM 3, which is only used during the development of the verifier (LLM 2).

Figure 15: Self-refinement with a separate verifier LLM (LLM 2).

However, in practice, and different from Figure 15, the DeepSeek team uses the same generator and verifier LLM as in a classic self-refinement loop in Figure 14: "All experiments used a single model, our final proof generator, which performs both proof generation and verification." In other words, the separate verifier is essential for training, to improve the generator, but it is not used (or needed) later during inference once the generator is strong enough. And the key difference from naive single-model self-refinement is that the final prover has been trained under the guidance of a stronger verifier and meta-verifier, so it has learned to apply those rubrics to its own outputs. Using this 2-in-1 DeepSeekMath V2 model during inference is also beneficial in terms of resources and cost, as it adds less complexity and compute overhead than running a second LLM for proof verification.

Coming back to the general self-refinement concept shown in Figures 14 and 15, both figures show self-refinement with 2 iterations (the initial one and a refined answer). Of course, we can add more iterations to this process. It's a classic inference-scaling trade-off: the more iterations we add, the more expensive it becomes to generate the answer, but the higher the overall accuracy. In the paper, the DeepSeek team used up to 8 iterations, and it looks like the accuracy hadn't saturated yet.

Figure 16: Additional self-refinement iterations improve accuracy. Annotated figure from the DeepSeekMath V2 paper.

The Best@32 accuracy majority voting method is also known as "self-consistency" and is covered in Chapter 4 of my Build a Reasoning Model (From Scratch) book.

The reason why we spent so much time on DeepSeekMath V2 in the previous section is that a) it's a very interesting proof of concept that pushes the idea of Reinforcement Learning with Verifiable Rewards (RLVR) further with self-verification and self-refinement techniques, and b) the self-verification and self-refinement techniques are used in DeepSeek V3.2 as well. But before we get to this part, let's start with a general overview of DeepSeek V3.2. This model is a big deal because it performs really well compared to current flagship models.

Figure 17: Benchmark comparison between DeepSeek V3.2 and proprietary flagship models. This is an annotated figure from the DeepSeek V3.2 report.
Similar to several other DeepSeek models, V3.2 comes with a nice technical report, which I will discuss in the next sections.

The main motivation for this model is, of course, to improve overall model performance. For instance, like DeepSeekMath V2, it achieves gold-level performance on math benchmarks. However, the model is also trained with tool use in mind and performs well on other tasks, for instance code and agentic tasks. At the same time, the DeepSeek team writes about computational efficiency as a big motivating factor. That's why they use the Multi-Head Latent Attention (MLA) mechanism from V2 and V3 together with the DeepSeek Sparse Attention (DSA) mechanism, which they added in V3.2. In fact, the paper says that "DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp," which we discussed in an earlier section.

Figure 18: The DeepSeek V3.2 architecture.

As I mentioned earlier, the DeepSeek V3.2-Exp release was likely intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model.

Figure 19: Inference cost savings thanks to DeepSeek Sparse Attention (DSA). Annotated figure from the DeepSeek V3.2 report.

Interestingly, as the screenshot from the paper above shows, the DeepSeek team reverted to using NVIDIA chips (after they allegedly experimented with model training on chips from Huawei). Since the architecture is the same as that of DeepSeek V3.2-Exp, the interesting details lie in the training methods, which we will discuss in the next sections.

Overall, the DeepSeek team adopts the Reinforcement Learning with Verifiable Rewards (RLVR) procedure using the Group Relative Policy Optimization (GRPO) algorithm, similar to DeepSeek R1. However, there are some interesting updates to discuss.

Originally, DeepSeek R1 used a format reward (to make sure the answer is properly formatted); a language consistency reward (so that the model doesn't alternate between different languages when writing its response); and the main verifier reward (whether the answer, in a math or code problem, is correct or not). For DeepSeek V3.2, they changed the rewards:

For reasoning and agent tasks, we employ rule-based outcome reward, length penalty, and language consistency reward. For general tasks, we employ a generative reward model where each prompt has its own rubrics for evaluation.

For instance, they removed the format reward but added a length penalty for agentic tasks. Then, for general tasks where there is no symbolic verifier (math) or code interpreter to verify the answer, they use a reward model (another LLM trained to output a reward score). So, it sounds like the pipeline is no longer purely verifier-based RLVR like in DeepSeek R1, but a hybrid of RLVR (for verifiable domains) and more standard LLM-as-a-judge reward modeling for everything else. For the math domain, they state that they additionally "incorporated the dataset and reward method from DeepSeekMath-V2," which we discussed earlier in this article.

Regarding GRPO itself, the learning algorithm inside the RLVR pipeline, they made a few changes since the original version in the DeepSeek R1 paper, too. Over the last few months, dozens of papers have proposed modifications to GRPO to improve its stability and efficiency. I wrote about two popular ones, DAPO and Dr. GRPO, earlier this year in my The State of Reinforcement Learning for LLM Reasoning article.
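To make the modifications discussed next easier to follow, here is a minimal sketch of what vanilla GRPO's group-relative advantage computation looks like on top of a toy verifiable reward (my own illustration, not code from any of the papers mentioned here):

```python
import torch

def verifiable_reward(model_answer: str, reference: str) -> float:
    # Toy RLVR-style reward: an exact-match check a symbolic verifier could perform.
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

# One prompt, a group of G sampled rollouts (here G = 4) for "What is 12 * 7?"
answers = ["84", "84", "74", "84"]
rewards = torch.tensor([verifiable_reward(a, "84") for a in answers])

# Vanilla GRPO: advantage = (reward - group mean) / group std.
# Dr. GRPO argues for dropping the std normalization; DeepSeek V3.2 keeps it.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(advantages)  # correct answers get positive advantages, the wrong one negative
```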
Without getting into the mathematical details of GRPO, in short, DAPO modifies GRPO with asymmetric clipping, dynamic sampling, token-level loss, and explicit length-based reward shaping. Dr. GRPO changes the GRPO objective itself to remove the length and std normalizations. The recent Olmo 3 paper also adopted similar changes, which I am quoting below:

- Zero Gradient Signal Filtering: We remove groups of instances whose rewards are all identical (that is, a batch with zero standard deviation in their advantage) to avoid training on samples that provide zero gradient, similar to DAPO (Yu et al., 2025). [DAPO]
- Active Sampling: We maintain a consistent batch size in spite of zero gradient filtering with a novel, more efficient version of dynamic sampling (Yu et al., 2025). See OlmoRL Infra for details. [DAPO]
- Token-level loss: We use a token-level loss to normalize the loss by the total number of tokens across the batch (Yu et al., 2025), rather than per-sample to avoid a length bias. [DAPO]
- No KL Loss: We remove the KL loss as a common practice (GLM-4.5 Team et al., 2025; Yu et al., 2025; Liu et al., 2025b) as it allows less restricted policy updates, and removing it does not lead to over-optimization or destabilized training. [DAPO and Dr. GRPO]
- Clip Higher: We set the upper-bound clipping term in the loss to a slightly higher value than the lower bound to enable larger updates on tokens, as proposed by Yu et al. (2025). [DAPO]
- Truncated Importance Sampling: To adjust for differences between log probabilities from the inference and training engines, we multiply the loss by the truncated importance sampling ratio, following Yao et al. (2025).
- No standard deviation normalization: When calculating advantage, we do not normalize by the standard deviation of the group, following Liu et al. (2025b). This removes a difficulty bias, where questions with low standard deviation in their rewards (for example, too hard or too easy) have their advantages significantly increased by the normalization term. [Dr. GRPO]

The GRPO modifications in DeepSeek V3.2 are a bit less aggressive, which I summarized in a similar style as Olmo 3 did:

- Domain-specific KL strengths (including zero for math): Instead of always dropping KL like DAPO and Dr. GRPO do for math-style RL, DeepSeek V3.2 keeps a KL term in the objective but tunes its weight per domain. However, they also note that very weak or even zero KL often works best for mathematics. (But instead of removing it completely, it becomes a hyperparameter.)
- Unbiased KL estimate: As mentioned above, DeepSeek V3.2 doesn't remove the KL penalty. And in addition to treating it as a tuning knob, they propose a fix to how the KL penalty is estimated in GRPO by reweighting the KL term with the same importance ratio used for the main loss, so the KL gradient actually matches the fact that samples come from the old policy rather than the current one.
- Off-policy sequence masking: When they reuse rollout data (rollout is simply jargon for the full sequence the model generates) across many gradient steps, DeepSeek V3.2 measures how far the current policy has drifted from the rollout policy on each full answer and simply drops those sequences that both have negative advantage and are "too off-policy". So, this prevents the model from learning from overly off-policy or stale data.
- Keep routing for MoE models: For the Mixture-of-Experts backbone, they log which experts were activated during rollout and force the same routing pattern during training, so gradient updates are applied to the experts that produced the sampled answers.
- Keep sampling mask for top-p / top-k: When rollouts use top-p or top-k sampling, DeepSeek V3.2 stores the selection mask and reapplies it when computing the GRPO loss and KL, so the action space at training time matches what was actually available during sampling.
- Keep original GRPO advantage normalization: Dr. GRPO shows that GRPO's length and per-group standard-deviation normalization terms bias optimization toward overly long incorrect answers and over-weight very easy or very hard questions. Dr. GRPO fixes this by removing both terms and going back to an unbiased PPO-style objective. In contrast, DAPO moves to a token-level loss that also changes how long vs. short answers are weighted. DeepSeek V3.2, however, keeps the original GRPO normalization and instead focuses on other fixes, such as those above.

So, overall, DeepSeek V3.2 is closer to the original GRPO algorithm than some other recent models but adds some logical tweaks.

DeepSeek V3.2 also comes in an extreme, extended-thinking variant called DeepSeek V3.2-Speciale, which was trained only on reasoning data during the RL stage (more akin to DeepSeek R1). Besides training only on reasoning data, they also reduced the length penalty during RL, allowing the model to output longer responses. Generating longer responses is a form of inference scaling, where responses become more expensive due to the increased length, in return for better results.

Figure 20: The "extended-thinking" Speciale model achieves higher accuracy but also generates more tokens.

In this article, I didn't cover all the nitty-gritty details of the DeepSeek V3.2 training approach, but I hope the comparison with previous DeepSeek models helps clarify the main points and innovations. In short, the interesting takeaways are:

- DeepSeek V3.2 uses a similar architecture to all its predecessors since DeepSeek V3;
- The main architecture tweak is that they added the sparse attention mechanism from DeepSeek V3.2-Exp to improve efficiency;
- To improve math performance, they adopted the self-verification approach from DeepSeekMath V2;
- There are several improvements to the training pipeline, for example GRPO stability updates (note that the paper goes into several other aspects around distillation, long-context training, and integration of tool use similar to gpt-oss, which we did not cover in this article).

Irrespective of the relative market share of DeepSeek models compared to other smaller open-weight models or proprietary models like GPT-5.1 or Gemini 3.0 Pro, one thing is for sure: DeepSeek releases are always interesting, and there's always a lot to learn from the technical reports that come with the open-weight model checkpoints. I hope you found this overview useful!

This magazine is a personal passion project, and your support helps keep it alive. If you'd like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I'm confident you'll get a lot out of these; they explain how LLMs work in a depth you won't find elsewhere.) Thanks for reading, and for helping support independent research!

Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning.
If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: Benchmark comparison between DeepSeek V3.2 and proprietary flagship models. This is an annotated figure from the DeepSeek V3.2 report . I covered the predecessor, DeepSeek V3, at the very beginning of my The Big LLM Architecture Comparison article, which I kept extending over the months as new architectures got released. Originally, as I just got back from Thanksgiving holidays with my family, I planned to “just” extend the article with this new DeepSeek V3.2 release by adding another section, but I then realized that there’s just too much interesting information to cover, so I decided to make this a longer, standalone article. There’s a lot of interesting ground to cover and a lot to learn from their technical reports, so let’s get started! 1. The DeepSeek Release Timeline While DeepSeek V3 wasn’t popular immediately upon release in December 2024, the DeepSeek R1 reasoning model (based on the identical architecture, using DeepSeek V3 as a base model) helped DeepSeek become one of the most popular open-weight models and a legit alternative to proprietary models such as the ones by OpenAI, Google, xAI, and Anthropic. Figure 2: DeepSeek V3/R1 architecture from December 2024. We will revisit and discuss architectural details in a later section. So, what’s new since V3/R1? I am sure that the DeepSeek team has been super busy this year. However, there hasn’t been a major release in the last 10-11 months since DeepSeek R1. Personally, I think it’s reasonable to go ~1 year for a major LLM release since it’s A LOT of work. However, I saw on various social media platforms that people were pronouncing the team “dead” (as a one-hit wonder). I am sure the DeepSeek team has also been busy navigating the switch from NVIDIA to Huawei chips. By the way, I am not affiliated with them or have spoken with them; everything here is based on public information . As far as I know, they are back to using NVIDIA chips. Finally, it’s also not that they haven’t released anything. There have been a couple of smaller releases that trickled in this year, for instance, DeepSeek V3.1 and V3.2-Exp. Figure 3: DeepSeek releases since last year. The main models are shown in red. As I predicted back in September, the DeepSeek V3.2-Exp release was intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model. V3.2-Exp and V3.2 use a non-standard sparse attention variant that requires custom code, but more on this mechanism later. (I was tempted to cover it in my previous Beyond Standard LLMs article, but Kimi Linear was released around then, which I prioritized for this article section on new attention variants.) 2. Hybrid Versus Dedicated Reasoning Models Before discussing further model details, it might be worthwhile to discuss the overall model types. Originally, DeepSeek V3 was released as a base model, and DeepSeek R1 added additional post-training to develop a dedicated reasoning model. This procedure is summarized in the figure below. Figure 4: Overview of the DeepSeek R1 training pipeline. This figure is from my more detailed Understanding Reasoning LLMs article. You can read more about the training pipeline in the figure above in my Understanding Reasoning LLMs article. What’s worthwhile noting here is that DeepSeek V3 is a base model, and DeepSeek R1 is a dedicated reasoning model. 
In parallel with DeepSeek, other teams have also released many really strong open-weight reasoning models. One of the strongest open-weight models this year was Qwen3. Originally, it was released as a hybrid reasoning model, which means that users were able to toggle between reasoning and non-reasoning modes within the same model. (In the case of Qwen3, this toggling was enabled via the tokenizer by adding/omitting tags.) Since then, LLM teams have released (and in some cases gone back and forth between) both dedicated reasoning models and Instruct/Reasoning hybrid models, as shown in the timeline below. Figure 5: The timeline of some of the reasoning and hybrid models released this year. For instance, Qwen3 started out as a hybrid model, but the Qwen team then later released separate instruct and reasoning models as they were easier to develop and yielded better performance in each respective use case. Some models like OpenAI’s gpt-oss only come in a hybrid variant where users can choose the reasoning effort via a system prompt (I suspect this is handled similarly in GPT-5 and GPT-5.1). And in the case of DeepSeek, it looks like they moved in the opposite direction from a dedicated reasoning model (R1) to a hybrid model (V3.1 and V3.2). However, I suspect that R1 was mainly a research project to develop reasoning methods and the best reasoning model at the time. The V3.2 release may be more about developing the best overall model for different use cases. (Here, R1 was more like a testbed or prototype model.) And I also suspect that, while the DeepSeek team developed V3.1 and V3.2 with reasoning capabilities, they might still be working on a dedicated R2 model. 3. From DeepSeek V3 to V3.1 Before discussing the new DeepSeek V3.2 release in more detail, I thought it would be helpful to start with an overview of the main changes going from V3 to V3.1. 3.1 DeepSeek V3 Overview and Multi-Head Latent Attention (MLA) I already discussed DeepSeek V3 and R1 in great detail in several other articles. To summarize the main points, DeepSeek V3 is a base model that uses two noteworthy architecture aspects: Mixture-of-Experts (MoE) and Multi-Head Latent Attention (MLA). I think you are probably well familiar with MoE at this point, so I am skipping the introduction here. However, if you want to read more, I recommend the short overview in my The Big Architecture Comparison article for more context. The other noteworthy highlight is the use of MLA. MLA, which is used in DeepSeek V2, V3, and R1 , offers a memory-saving strategy that pairs particularly well with KV caching. The idea in MLA is that it compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache. At inference time, these compressed tensors are projected back to their original size before being used, as shown in the figure below. This adds an extra matrix multiplication but reduces memory usage. (As a side note, the queries are also compressed, but only during training, not inference.) Figure 6: Multi-Head Latent Attention (MLA) in DeepSeek V3/R1. (The compressed space of the query vector is not shown for simplicity.) The figure above illustrates the main idea behind MLA, where the keys and values are first projected into a latent vector, which can then be stored in the KV cache to reduce memory requirements. This requires a later up-projection back into the original key-value space, but overall it improves efficiency (as an analogy, you can think of the down- and up-projections in LoRA). 
Note that the query is also projected into a separate compressed space, similar to what’s shown for the keys and values. However, I omitted it in the figure above for simplicity. By the way, as mentioned earlier, MLA is not new in DeepSeek V3, as its DeepSeek V2 predecessor also used (and even introduced) it. 3.2 DeepSeek R1 Overview and Reinforcement Learning with Verifiable Rewards (RLVR) DeepSeek R1 uses the same architecture as DeepSeek V3 above. The difference is the training recipe. I.e., using DeepSeek V3 as the base model, DeepSeek R1 was focused on the Reinforcement Learning with Verifiable Rewards (RLVR) method to improve the reasoning capabilities of the model. The core idea in RLVR is to have the model learn from responses that can be verified symbolically or programmatically, such as math and code (but this can, of course, also be extended beyond these two domains). Figure 7: An example of a verifiable task. The GRPO algorithm, which is short for Group Relative Policy Optimization, is essentially a simpler variant of the Proximal Policy Optimization (PPO) algorithm that is popular in Reinforcement Learning with Human Feedback (RLHF), which is used for LLM alignment. Figure 8: Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by removing the reward model, relying instead on verifiable rewards from symbolic tools such as calculators or compilers. I covered the RLVR training with their GRPO algorithm in more detail (including the math behind it) in my The State of Reinforcement Learning for LLM Reasoning if you are interested in additional information. 3.3 DeepSeek R1-0528 Version Upgrade As the DeepSeek team stated themselves, DeepSeek R1-0528 is basically a “minor version upgrade.” The architecture remains the same as in DeepSeek V3/R1, and the improvements are on the training side to bring it up to par with OpenAI o3 and Gemini 2.5 Pro at the time. Unfortunately, the DeepSeek team didn’t release any specific information describing how this was achieved; however, they stated that it partly comes from optimizations in their post-training pipeline. Also, based on what’s been shared, I think it’s likely that the hosted version of the model uses more computational resources at inference time (longer reasoning). 3.4 DeepSeek V3.1 Hybrid Reasoning DeepSeek V3.1 is a hybrid model with both general chat (instruct) and reasoning capabilities. I.e., instead of developing two separate models, there is now one model in which users can switch modes via the chat prompt template (similar to the initial Qwen3 model). DeepSeek V3.1 is based on DeepSeek V3.1-Base, which is in turn based on DeepSeek V3. They all share the same architecture. 4. DeepSeek V3.2-Exp and Sparse Attention DeepSeek V3.2-Exp (Sep 2025) is where it gets more interesting. Originally, the DeepSeek V3.2-Exp didn’t top the benchmarks, which is why there wasn’t as much excitement around this model upon release. However, as I speculated back in September, this was likely an early, experimental release to get the infrastructure (especially the inference and deployment tools) ready for a larger release, since there are a few architectural changes in DeepSeek V3.2-Exp. The bigger release is DeepSeek V3.2 (not V4), but more on that later. So, what’s new in DeepSeek V3.2-Exp? 
First, DeepSeek V3.2-Exp was trained based on DeepSeek V3.1-Terminus as a base model. What’s DeepSeek V3.1-Terminus? It’s just a small improvement over the DeepSeek V3.1 checkpoint mentioned in the previous section. The technical report states that: DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. As the paragraph above states, the main innovation here is the DeepSeek Sparse Attention (DSA) mechanism that they add to DeepSeek V3.1-Terminus before doing further training on that checkpoint. This DSA consists of (1) a lightning indexer and (2) a token-selector, and the goal is to selectively reduce the context to improve efficiency. To explain how it works, let’s start with sliding-window attention. For instance, sliding window attention is a technique (recently used by Gemma 3 and Olmo 3) that limits the attention window to a fixed size, as illustrated in the figure below. Figure 9: In sliding window attention, the current query token doesn’t attend to all previous tokens but just a subset. DSA is based on the same idea as sliding-window attention: only a subset of past tokens can be attended to. However, instead of selecting the tokens that can be attended via a fixed-width sliding window, DSA has an indexer and token selector to decide which past tokens can be attended. In other words, the tokens that can be attended are more random, as illustrated in the figure below. Figure 10: In DSA, the current token can attend a select number of tokens in the past (instead of all tokens like in regular causal attention). However, while I said “random” above, the pattern of which past tokens are selected is not actually random but learned. In practice, DSA uses its so-called lightning indexer to compute relevance scores for each new query token based on all previous tokens. For this computation, the lightning indexer uses the compressed token representations in DeepSeek’s Multi-Head Latent Attention (MLA) and computes the token similarity towards other tokens. The similarity score is basically a scaled dot product between query and key vectors passed through a ReLU function. If you are interested in the mathematical details, the equation (taken from the paper) for this lightning indexer similarity score is shown below: Here, w is a learned per-head weighting coefficient that determines how much each indexer head should contribute to the final similarity score. The q refers to the query, and the k refers to the key vector. And below is a list of the different subscripts: t : position of the current query token; s : position of a previous token in the sequence (0 ≤ s < t); j : the index over the different indexer heads (Figure 10 above only showed one head for simplicity), so q t, j means “query vector for current token t in indexer head j “. Figure 11: A visual summary of DeepSeek V3.2’s Sparse Attention mechanism. To sum it up, the indexer and token selector result in each token attending to a few past tokens that the model has learned to consider most relevant, rather than all tokens or a fixed local window. 
The goal here was not to improve the performance over DeepSeek V3.1-Terminus but to reduce the performance degradation (due to the sparse attention mechanism) while benefiting from improved efficiency. Overall, the DSA reduces the computational complexity of the attention mechanism from quadratic O(𝐿 2 ), where L is the sequence length, to a linear O(𝐿𝑘), where 𝑘 (≪𝐿) is the number of selected tokens. 5. DeepSeekMath V2 with Self-Verification and Self-Refinement Having discussed DeepSeek V3.2-Exp, we are getting closer to the main topic of this article: DeepSeek V3.2. However, there is one more puzzle piece to discuss first. On November 27, 2025 (Thanksgiving in the US), and just 4 days before the DeepSeek V3.2 release, the DeepSeek team released DeepSeekMath V2 , based on DeepSeek V3.2-Exp-Base. This model was specifically developed for math and achieved gold-level scores in several math competitions. Essentially, we can think of it as a proof (of concept) model for DeepSeek V3.2, introducing one more technique. The key aspect here is that reasoning models (like DeepSeek R1 and others) are trained with an external verifier, and the model learns, by itself, to write explanations before arriving at the final answer. However, the explanations may be incorrect. As the DeepSeek team succinctly states, the shortcomings of regular RLVR: [...] correct answers don’t guarantee correct reasoning. [...] a model can arrive at the correct answer through flawed logic or fortunate errors. The other limitation of the DeepSeek R1 RLVR approach they aim to address is that: [...] many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. So, to improve upon these two shortcomings mentioned above, in this paper, they train two models: An LLM-based verifier for theorem proving. The main model, a proof-generator, uses the LLM-based verifier as a reward model (instead of a symbolic verifier). Figure 12: The general math proof generator (LLM 1) and verifier (LLM 2) setup. The verifier LLM (LLM 2) takes in a rubric to score the generated proof, where the score is “1 for complete and rigorous proofs with all logical steps clearly justified;” “0.5 for proofs with sound overall logic but minor errors or omitted details;” “and 0 for fundamentally flawed proofs containing fatal logical errors or critical gaps.” Figure 13: The meta-verifier (LLM 3) checks whether the verifier (LLM 2) is verifying the generator (LLM 1) correctly. The meta-verifier (LLM 3) is also developed with reinforcement learning, similar to LLM 2. While the use of a meta-verifier is not required, the DeepSeek team reported that: the average quality score of the verifier’s proof analyses – as evaluated by the meta-verifier – improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction. This is actually quite an interesting setup. If you are familiar with generative adversarial networks (GANs), you may see the analogy here. For instance, the proof verifier (think of it as a GAN discriminator) improves the proof generator, and the proof generator generates better proofs, further pushing the proof verifier. The meta score is used during training of the verifier (LLM 2) and the generator (LLM 1). It is not used at inference time in the self‑refinement loop, which we will discuss in the next section. 5.2 Self-Refinement In the previous section, we talked about self-verification, i.e., analyzing the quality of the solution. 
The purpose of this is to implement self-refinement, which means that the LLM can act upon the feedback and revise its answer. Traditionally, in self-refinement, which is an established and popular inference-scaling technique, we would use the same LLM for generating the solution and verifying it, before refining it. In other words, in the previous figures 12 and 13, LLM 1 and LLM 2 would be the same LLM. So, a traditional self-refinement process would look as follows: Figure 14: A classic self-refinement iteration where we use the same LLM for generating the initial response (Output 1), the evaluation (Eval), and the refined answer (Output 2). However, the DeepSeek team observed a crucial issue with using the same LLM for both the generation and verification in practice: when prompted to both generate and analyze its own proof in one shot, the generator tends to claim correctness even when the external verifier easily identify flaws. In other words, while the generator can refine proofs based on external feedback, it fails to evaluate its own work with the same rigor as the dedicated verifier. As a logical consequence, one would assume they use a separate proof generator (LLM 1) and proof verifier (LLM 2). So, the self-refinement loop used here becomes similar to the one shown in the figure below. Note that we omit LLM 3, which is only used during the development of the verifier (LLM 2). Figure 15: Self-refinement with a separate verifier LLM (LLM 2). However, in practice, and different from Figure 15, the DeepSeek team uses the same generator and verifier LLM as in a classic self-refinement loop in Figure 14: “All experiments used a single model, our final proof generator, which performs both proof generation and verification.” In other words the separate verifier is essential for training, to improve the generator, but it is not used (/needed) later during inference once the generator is strong enough. And the key difference from naive single‑model self‑refinement is that the final prover has been trained under the guidance of a stronger verifier and meta‑verifier, so it has learned to apply those rubrics to its own outputs. Also, using this 2-in-1 DeepSeekMath V2 verifier during inference is also beneficial in terms of resource and cost, as it add less complexity and compute requirements than running a second LLM for proof verification. Coming back to the general self-refinement concept shown in Figures 14 and 15, both figures show self-refinement with 2 iterations (the initial one and a refined answer). Of course, we can add more iterations to this process. It’s a classic inference-scaling trade-off: the more iterations we add, the more expensive it becomes to generate the answer, but the higher the overall accuracy. In the paper, the DeepSeek team used up to 8 iterations, and it looks like the accuracy didn’t saturate yet. Figure 16: Additional self-refinement iterations improve accuracy. Annotated figure from the DeepSeekMath V2 paper . The Best@32 accuracy majority voting method is also known as “self-consistency” and covered in Chapter 4 of my Build a Reasoning Model (From Scratch) book . 6. DeepSeek V3.2 (Dec 1, 2025) The reason why we spent so much time on DeepSeekMath V2 in the previous section is that a) it’s a very interesting proof of concept that pushes the idea of Reinforcement Learning with Verifiable Rewards (RLVR) further with self-verification and self-refinement techniques, and b) the self-verification and self-refinement techniques are used in DeepSeek V3.2 as well. 
But before we get to this part, let's start with a general overview of DeepSeek V3.2. This model is a big deal because it performs really well compared to current flagship models.

Figure 17: Benchmark comparison between DeepSeek V3.2 and proprietary flagship models. This is an annotated figure from the DeepSeek V3.2 report.

Similar to several other DeepSeek models, V3.2 comes with a nice technical report, which I will discuss in the next sections.

6.1 DeepSeek V3.2 Architecture

The main motivation for this model is, of course, to improve overall model performance. For instance, like DeepSeekMath V2, it achieves gold-level performance on math benchmarks. However, the model is also trained with tool use in mind and performs well on other tasks, for instance, code and agentic tasks.

At the same time, the DeepSeek team writes about computational efficiency as a big, motivating factor. That's why they use the Multi-Head Latent Attention (MLA) mechanism from V2 and V3 together with the DeepSeek Sparse Attention (DSA) mechanism, which they added in V3.2. In fact, the paper says that "DeepSeek-V3.2 uses exactly the same architecture as DeepSeek-V3.2-Exp," which we discussed in an earlier section.

Figure 18: The DeepSeek V3.2 architecture.

As I mentioned earlier, the DeepSeek V3.2-Exp release was likely intended to get the ecosystem and inference infrastructure ready to host the just-released V3.2 model.

Figure 19: Inference cost savings thanks to DeepSeek Sparse Attention (DSA). Annotated figure from the DeepSeek V3.2 report.

Interestingly, as the screenshot from the paper above shows, the DeepSeek team reverted to using NVIDIA chips (after they allegedly experimented with model training on chips from Huawei). Since the architecture is the same as that of DeepSeek V3.2-Exp, the interesting details lie in the training methods, which we will discuss in the next sections.

6.2 Reinforcement Learning Updates

Overall, the DeepSeek team adopts the Reinforcement Learning with Verifiable Rewards (RLVR) procedure using the Group Relative Policy Optimization (GRPO) algorithm similar to DeepSeek R1. However, there are some interesting updates to discuss. Originally, DeepSeek R1 used a format reward (to make sure the answer is properly formatted); a language consistency reward (so that the model doesn't alternate between different languages when writing its response); and the main verifier reward (whether the answer, in a math or code problem, is correct or not). On top of that, there are several updates to the GRPO training recipe worth listing (the bracketed tags indicate where each idea originates):

- Zero Gradient Signal Filtering: We remove groups of instances whose rewards are all identical (that is, a batch with zero standard deviation in their advantage) to avoid training on samples that provide zero gradient, similar to DAPO (Yu et al., 2025). [DAPO]
- Active Sampling: We maintain a consistent batch size in spite of zero gradient filtering with a novel, more efficient version of dynamic sampling (Yu et al., 2025). See OlmoRL Infra for details. [DAPO]
- Token-level loss: We use a token-level loss to normalize the loss by the total number of tokens across the batch (Yu et al., 2025), rather than per-sample, to avoid a length bias. [DAPO]
- No KL Loss: We remove the KL loss as a common practice (GLM-4.5 Team et al., 2025; Yu et al., 2025; Liu et al., 2025b) as it allows less restricted policy updates, and removing it does not lead to over-optimization or destabilized training. [DAPO and Dr. GRPO]
- Clip Higher: We set the upper-bound clipping term in the loss to a slightly higher value than the lower bound to enable larger updates on tokens, as proposed by Yu et al. (2025). [DAPO]
- Truncated Importance Sampling: To adjust for differences between log probabilities from the inference and training engines, we multiply the loss by the truncated importance sampling ratio, following Yao et al. (2025).
- No standard deviation normalization: When calculating advantage, we do not normalize by the standard deviation of the group, following Liu et al. (2025b). This removes a difficulty bias, where questions with low standard deviation in their rewards (for example, too hard or too easy) have their advantages significantly increased by the normalization term (a short sketch of this normalization difference appears at the end of this article). [Dr. GRPO]
- Domain-specific KL strengths (including zero for math): Instead of always dropping KL like DAPO and Dr. GRPO do for math-style RL, DeepSeek V3.2 keeps a KL term in the objective but tunes its weight per domain. However, they also note that very weak or even zero KL often works best for mathematics. (But instead of removing it completely, it becomes a hyperparameter.)
- Unbiased KL estimate: As mentioned above, DeepSeek V3.2 doesn't remove the KL penalty. And in addition to treating it as a tuning knob, they propose a fix to how the KL penalty is estimated in GRPO by reweighting the KL term with the same importance ratio used for the main loss, so the KL gradient actually matches the fact that samples come from the old policy rather than the current one.
- Off-policy sequence masking: When they reuse rollout data (rollout is simply jargon for the full sequence the model generates) across many gradient steps, DeepSeek V3.2 measures how far the current policy has drifted from the rollout policy on each full answer and simply drops those sequences that both have negative advantage and are "too off-policy". This prevents the model from learning from overly off-policy or stale data.
- Keep routing for MoE models: For the Mixture-of-Experts backbone, they log which experts were activated during rollout and force the same routing pattern during training, so gradient updates go to the experts that actually produced the sampled answers.
- Keep sampling mask for top-p / top-k: When rollouts use top-p or top-k sampling, DeepSeek V3.2 stores the selection mask and reapplies it when computing the GRPO loss and KL, so the action space at training time matches what was actually available during sampling.
- Keep original GRPO advantage normalization: Dr. GRPO shows that GRPO's length and per-group standard-deviation normalization terms bias optimization toward overly long incorrect answers and over-weight very easy or very hard questions. Dr. GRPO fixes this by removing both terms and going back to an unbiased PPO-style objective. In contrast, DAPO moves to a token-level loss that also changes how long vs. short answers are weighted. DeepSeek V3.2, however, keeps the original GRPO normalization and instead focuses on other fixes, such as those above.

Figure 20: The "extended-thinking" Speciale model achieves higher accuracy but also generates more tokens.

7. Conclusion

In this article, I didn't cover all the nitty-gritty details of the DeepSeek V3.2 training approach, but I hope the comparison with previous DeepSeek models helps clarify the main points and innovations.
In short, the interesting takeaways are:

- DeepSeek V3.2 uses a similar architecture to all its predecessors since DeepSeek V3;
- The main architecture tweak is that they added the sparse attention mechanism from DeepSeek V3.2-Exp to improve efficiency;
- To improve math performance, they adopted the self-verification approach from DeepSeekMath V2;
- There are several improvements to the training pipeline, for example, GRPO stability updates (note the paper goes into several other aspects around distillation, long-context training, and integration of tool use similar to gpt-oss, which we did not cover in this article).
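Finally, here is the short sketch promised in the GRPO list above: group-relative advantages computed with and without the per-group standard-deviation normalization. This is a simplified illustration of the GRPO vs. Dr. GRPO difference, not code from the DeepSeek report:

    import torch

    def group_advantages(rewards, normalize_std=True, eps=1e-6):
        # rewards: shape (group_size,), one reward per sampled answer for the same prompt
        adv = rewards - rewards.mean()         # group-relative baseline
        if normalize_std:                      # original GRPO normalization (kept by DeepSeek V3.2)
            adv = adv / (rewards.std() + eps)  # Dr. GRPO drops this term
        return adv

    # A group where only one of four sampled answers is correct:
    rewards = torch.tensor([0.0, 0.0, 0.0, 1.0])
    print(group_advantages(rewards, normalize_std=True))   # the std term rescales (here, doubles) the advantages
    print(group_advantages(rewards, normalize_std=False))  # plain mean-centered advantages

The lower the reward spread within a group (questions that are too hard or too easy), the more the standard-deviation term inflates the advantages, which is the difficulty bias mentioned in the bullet list.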

0 views
Simon Willison 1 month ago

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

Anthropic released Claude Opus 4.5 this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's GPT-5.1-Codex-Max and Google's Gemini 3 , both released within the past week! The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February). The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for >200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5. The Key improvements in Opus 4.5 over Opus 4.1 document has a few more interesting details: I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in a new alpha release of sqlite-utils that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across 20 commits, 39 files changed, 2,022 additions and 1,173 deletions in a two day period. Here's the Claude Code transcript where I had it help implement one of the more complicated new features. It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha . I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model. With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected. I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two. This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors. Google's Nano Banana Pro image generation model was notable in that its ability to render usable infographics really does represent a task at which previous models had been laughably incapable. The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis? And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models. I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself! I'd love to see AI labs like Anthropic help address this challenge directly. 
I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle. "Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.

In the meantime, I'm just gonna have to keep on getting them to draw pelicans riding bicycles. Here's Opus 4.5 (on its default "high" effort level): It did significantly better on the new more detailed prompt: Here's that same complex prompt against Gemini 3 Pro and against GPT-5.1-Codex-Max-xhigh.

From the safety section of Anthropic's announcement post:

With Opus 4.5, we've made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:

On the one hand this looks great, it's a clear improvement over previous models and the competition. What does the chart actually tell us though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks that success rate goes up to 1/3!

I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.

A few more details from the Key improvements in Opus 4.5 over Opus 4.1 document mentioned earlier:

- Opus 4.5 has a new effort parameter which defaults to high but can be set to medium or low for faster responses.
- The model supports enhanced computer use, specifically a tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect.
- "Thinking blocks from previous assistant turns are preserved in model context by default" - apparently previous Anthropic models discarded those.

1 views
Simon Willison 1 month ago

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro , also known as Gemini 3 Pro Image . I've had a few days of preview access and this is an astonishingly capable image generation model. As is often the case, the most useful low-level details can be found in the API documentation : Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks. [...] These 14 images can include the following: Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K. Here's what I got for his first test prompt, using Google's AI Studio : Here's what I got: The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's a Google Drive link for the original . Then I ran his follow-up prompt: I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG . The new model isn't cheap. Here's the API pricing : it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed. Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there. So this thing is great at following instructions. How about rendering text? I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was: Infographic explaining how the Datasette open source project works This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt: That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close-enough for an infographic. "Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me. Here's the thinking trace for that image generation process: Initiating Datasette Infographic I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram. Developing the Datasette workflow I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output. Visualizing Datasette Workflow I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts. Defining Core Datasette Stages I've sketched out the initial infographic structure. 
I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step. Constructing Datasette Infographic I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation. Analyzing Datasette's Architecture I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow. Mapping the Datasette Flow I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow.

This note from the announcement post caught my eye: We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark. Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.

Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that to the Gemini app and asked "Was this image created with AI?": It replied: Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image. Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro.

From the API documentation mentioned earlier, the low-level details include:

- High-resolution output: Built-in generation capabilities for 1K, 2K, and 4K visuals.
- Advanced text rendering: Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets.
- Grounding with Google Search: The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events).
- Thinking mode: The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output.
- Up to 14 reference images: You can now mix up to 14 reference images to produce the final image. These can include:
  - Up to 6 images of objects with high-fidelity to include in the final image
  - Up to 5 images of humans to maintain character consistency

0 views
Taranis 2 months ago

k-Hot Holographic Encoding

It's a fairly common thing in machine learning to need to deal with categorical data. Straightforwardly, if a model had to recognize inputs and classify them into lions, tigers, giraffes and scarecrows, we might assign lion=0, tiger=1, giraffe=2 and scarecrow=3. This is categorical data. A common way to encode this is with 1-hot encoding – this involves having a vector with a 1 representing a category and a 0 representing not-that-category. So lion would be encoded [1, 0, 0, 0], tiger would be encoded [0, 1, 0, 0], giraffe would be encoded [0, 0, 1, 0] and scarecrow [0, 0, 0, 1].

Things get more complicated when something might need to be represented by more categories than one. Let's say we had an animal that was a cross between a tiger and a lion – we might want to represent this as [1, 1, 0, 0]. Obviously this isn't one-hot encoding any more – we now have k-hot encoding, for some value of k. Neural networks can be trained straightforwardly both with one-hot and k-hot encoding – they don't seem to mind.

However, things get tricky when you have to deal with a very large number of possible categories. A popular approach to text analysis is bag-of-words – this involves analyzing a large amount of text, extracting its vocabulary, and creating vectors that include a 1 where a word is present in a piece of text, or 0 otherwise. In many languages, English particularly, order surprisingly word meaning effect little has. (Did you see what I did there, Yoda?) Things get a bit tricky, though, because if you're doing text retrieval, it's often the rarer words that are the most useful. It's not uncommon to end up with very large vocabularies, bigger than 100000 words.

Let's say we want to classify pieces of text taken from this vocabulary – if we wanted to train against, for example, Wikipedia's article classifications, there are also roughly that many of them. If we reject those that don't have many articles associated with them, we still have over 50000 of them if we insist on at least 1000 articles per category. Even a single dense layer gets painfully big at these dimensions – given 4 byte floats representing the weights, that's 20GB right there, without even considering the space needed to support backpropagation in training. Barely fitting a single layer into an NVIDIA A6000 means we need to do better.

A common technique is to throw away less common words from the vocabulary, but it's already known from text retrieval (back from the pre-neural-networks-are-everything days) that this isn't a good idea. Similarly, we don't really want to throw away any categories we don't absolutely have to. What we really want to be able to do is compress the words and categories into much shorter, more manageable vectors, but without losing information or accuracy. Sounds impossible, but it isn't.

Let's say we have 100000 categories representing a bag-of-words, with a phrase size limit of (say) 200. We'd like to represent this as a k-hot vector that's manageably small, say with a size of 1024. A simple way to attempt this would be to take the category ID and to wrap it around the vector however many times are necessary to fit everything in. So if c is the category ID, and we have a vector of size 1024, we could encode c as c % 1024 (where % is the integer modulo division operator). At a simplistic level, this actually does work, but it has the problem of aliasing (I'm borrowing the term here from digital signal processing).
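To see the aliasing concretely, here is a tiny sketch of that naive wraparound encoding (my own illustration under the setup above of 100000 possible categories and a 1024-element vector; the function names are made up):

    import numpy as np

    VEC_SIZE = 1024
    NUM_CATEGORIES = 100_000

    def encode_naive(categories, vec_size=VEC_SIZE):
        # Naive wraparound: category c sets element c % vec_size to 1.
        vec = np.zeros(vec_size, dtype=np.float32)
        for c in categories:
            vec[c % vec_size] = 1.0
        return vec

    def decode_naive(vec, num_categories=NUM_CATEGORIES, vec_size=VEC_SIZE):
        # Every category whose slot happens to be set is reported, aliases included.
        return [c for c in range(num_categories) if vec[c % vec_size] == 1.0]

    vec = encode_naive([1, 2, 95, 110, 250, 1000])
    print(len(decode_naive(vec)))  # hundreds of candidate categories, most of them aliases

Each set element is shared by roughly 100000 / 1024 ≈ 98 different category IDs, which is where the flood of false positives in the next example comes from.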
So, for example, if we want to encode the list of categories [1, 2, 95, 110, 250, 1000] and try to figure out from the encoding what categories we started with, we might get something like:

[1, 2, 28, 62, 95, 110, 250, 553, 559, ... , 99264, 99312, 99820]

This is actual output – the code uses another trick to improve the results rather than simple modulo arithmetic, but we'll get on to that later. The bad news is that this list has 587 elements, and any neural network using this data has absolutely no way to tell real encodings from aliases.

The trick turns out to be to represent categories with more than one '1' in the vector. Specifically, each category is encoded by exactly k '1' elements that are randomly positioned within the vector. So if we take that same list and encode it with two '1' elements per category, we get

[1, 2, 95, 110, 250, 1000, 3696, 14934, 26553, 51130, 53643, 55596, 55903, 72592, 80631, 86806, 89631, 92838]

This is 18 elements, a huge improvement over 587, but we can do better. Setting three '1' values per category gives us

[1, 2, 95, 110, 250, 1000, 60663]

so just a single alias. Setting four '1's gives us

[1, 2, 95, 110, 250, 1000]

which is perfect reconstruction. These six categories are being encoded as 6 x 4 = 24 ones in the vector – still quite sparse.

It can be instructive to try this experimentally, varying the number of categories being encoded: The graph shows what happens as you encode more and more categories into a fixed-size vector, here 1024 in length and assuming 100000 possible categories. The pink dashed line is what you'd get by encoding into a vector the size of the number of possible categories (generally impractical). The other lines show what happens as k varies. There is clearly no free lunch here, but even k = 2 vastly outperforms a naive wraparound modulo encoding. Zooming in on the bottom left is instructive: Up to 40 categories or so, it's clear that k = 6 does almost as well as ideal, which considering the 98:1 compression ratio is quite impressive.

Of course, conventional data compression approaches will beat this ratio many times over, but the important factor here is that this encoding is very easy for subsequent neural network layers to extract useful information from. If a neural network needs to uniquely identify a category in order to improve its loss function (which is not actually always necessary), it needs to encode the equivalent of a logical AND function. This is very straightforward for a dense layer with a relu activation function, and will easily fall out of training.

Recovering encoded categories

I've talked a lot about how categories are encoded, so it's a good idea to also mention how they can be decoded. We've talked about using these encodings to drive the inputs of neural networks – they can also be used to train their outputs. I recommend that if you try this, you use an output layer with a sigmoid activation function (don't use softmax, it only works for 1-hot). You should also probably use binary cross-entropy as an error function. This will result in the outputs of the neural network roughly corresponding to probabilities. To unpack these probabilities, the following algorithm can be used:

    for i in range(num_categories):
        m = 1.0
        for p in range(k):
            m = m * vec[permutation[p, i]]   # multiply the k cells assigned to category i
        cat[i] = m ** (1.0 / k)              # k-th root undoes the repeated multiplication

where num_categories is the number of categories, vec is the encoded vector, cat is the recovered categories, and permutation[p, i] is an array of permutations that map category numbers onto indices into vec, where each row of the array is a different permutation.

If we start by assuming the values in vec are probabilities in the range [0 .. 1], we use multiplication as the probabilistic equivalent of logical AND. This passes through 0 and 1 unchanged, but values like 0.5 get squashed toward zero as k increases, so we take the k-th root of the result to undo the effects of the multiplications.

This approach makes it possible to use an output vector for a classifier that has a size that is a small fraction of the number of possible categories. This works in neural network training because the elements that each category is mapped to are individually trained by backpropagation, and upwind dense layers can and will implement the necessary logical structures.

Kind of seems holographic, doesn't it? It's a similar principle – encoding a large, sparse data set into a short vector by creating something a bit like different paths a photon can pass down, then allowing them to interact and entangle.

I'm currently attempting to use this to build a very small, lightweight text classifier that is extremely fast, of the order of a millisecond or so to process a piece of text. I'm training a variant with about 42 million parameters that seems to work fairly well. This is a direct requirement for Euravox, because whilst there are probably way more reliable classifiers out there, we have a requirement to do as much as we can with as little hardware as possible, and as low a power footprint as possible. Whilst the current buzz would have us chuck huge numbers of big GPUs at the problem and use far larger models, even LLMs, this doesn't fit either our needs or ethos. I'm definitely not up for paying a third party for tokens. And in any case, LLMs royally suck at text classification, I know, I've tried.

Open source implementation

I have a simple Python implementation of the encoder/decoder that I'm happy to release. I've not done so yet, but I'll have some time waiting for training runs to complete in the next few days, so watch this space. I'll edit the web version of this post to include a link, and will probably write a new post announcing a repo.

A personal note

I've spent most of the last 20 years being unable to publish without jumping through excessively painful hoops. It is an absolute breath of fresh air to just be able to write anything I like now!

0 views
Ahead of AI 2 months ago

Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions with respect to what I think about alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I also promised attendees to follow up with a write-up of these alternative approaches). So here it is! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.) Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. 
Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) 
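To make the complexity difference discussed above concrete, here is a minimal single-head sketch of standard scaled-dot-product attention next to the kernelized linear variant with ϕ(x) = elu(x) + 1, in the spirit of the 2020 "Transformers are RNNs" paper. It is a simplified, non-causal illustration of the idea, not code from any of the models mentioned here:

    import torch
    import torch.nn.functional as F

    def softmax_attention(q, k, v):
        # q, k, v: (n, d). Builds an explicit n-by-n score matrix: O(n^2) time and memory.
        scores = q @ k.transpose(0, 1) / (q.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    def linear_attention(q, k, v, eps=1e-6):
        # Kernel trick: phi(Q) (phi(K)^T V), which only ever forms (d, d) summaries: O(n).
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = k.transpose(0, 1) @ v                              # (d, d) summary of keys and values
        z = q @ k.sum(dim=0, keepdim=True).transpose(0, 1)      # per-query normalizer, shape (n, 1)
        return (q @ kv) / (z + eps)

    n, d = 8, 16
    q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
    print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)  # both (8, 16)

The two outputs are not numerically identical (that is the approximation), but the linear version never materializes the n-by-n attention matrix, which is what reduces the cost from O(n^2) to O(n).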
All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention.

For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. Let's start with Qwen3-Next, which replaced the regular attention mechanism with a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model supported 32k natively, and 131k with YaRN scaling). Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio as shown in the figure below.

Figure 6: Qwen3-Next with gated attention and Gated DeltaNet.

As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means that the 48 transformer blocks (layers) in this architecture alternate between the two. Specifically, as mentioned earlier, they alternate in a 3:1 ratio: three consecutive Gated DeltaNet blocks are followed by one gated-attention block, and this pattern repeats across the stack. Otherwise, the architecture is pretty standard and similar to Qwen3:

Figure 7: A previous "regular" Qwen3 model (left) next to Qwen3-Next (right).

So, what are gated attention and Gated DeltaNet? Before we get to the Gated DeltaNet itself, let's briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses "gated attention". This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that can be added to a standard attention implementation (such as the one from chapter 3 of my LLMs from Scratch book); a simplified stand-in sketch appears a bit further below for illustration purposes. The idea: after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability:

[...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model.

In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network) is Qwen3-Next's linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier.
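Here is the simplified gated-attention sketch referred to above: a minimal single-head version of the idea (standard scaled-dot-product attention plus a sigmoid output gate). It is my own stand-in illustration, not the original code from the book chapter or from Qwen3-Next, and it omits the causal mask and multi-head plumbing:

    import torch
    import torch.nn as nn

    class GatedAttentionHead(nn.Module):
        # Single-head attention with a sigmoid output gate (simplified sketch).
        def __init__(self, d_in, d_head):
            super().__init__()
            self.W_q = nn.Linear(d_in, d_head, bias=False)
            self.W_k = nn.Linear(d_in, d_head, bias=False)
            self.W_v = nn.Linear(d_in, d_head, bias=False)
            self.W_gate = nn.Linear(d_in, d_head, bias=False)  # gating signal from the same input

        def forward(self, x):                                   # x: (n_tokens, d_in)
            q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
            scores = q @ k.transpose(0, 1) / (k.shape[-1] ** 0.5)
            context = torch.softmax(scores, dim=-1) @ v         # regular attention output
            gate = torch.sigmoid(self.W_gate(x))                # values between 0 and 1
            return gate * context                               # scale features up or down

The only difference from a plain attention head is the extra W_gate projection and the element-wise multiplication on the last line.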
Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. In code, a simplified version of the Gated DeltaNet depicted above (without the convolutional mixing) can be implemented as follows (the code is inspired by the official implementation by the Qwen3 team): (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. (And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.) So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with context length. 
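Here is a heavily simplified sketch of that recurrent state update, using the α (decay) and β (update) gates described above. It omits the convolutional mixing, the multi-head and chunked/parallelized implementation details, and the output gate, and it is my own illustration rather than the Qwen3-Next or Kimi Linear code:

    import torch

    def gated_delta_step(S, k, v, q, alpha, beta):
        # One recurrent step of a simplified gated delta rule.
        # S: (d_head, d_head) memory state; k, v, q: (d_head,); alpha, beta: scalars in (0, 1).
        pred = S @ k                                   # what the memory currently predicts for this key
        delta = v - pred                               # delta rule: new value minus predicted value
        S = alpha * S + beta * torch.outer(delta, k)   # decay the old memory, write in the correction
        out = S @ q                                    # read from the updated memory with the query
        return S, out

    d_head, n_tokens = 8, 5
    S = torch.zeros(d_head, d_head)                    # fixed-size state, independent of n_tokens
    for t in range(n_tokens):
        k, v, q = torch.randn(d_head), torch.randn(d_head), torch.randn(d_head)
        S, out = gated_delta_step(S, k, v, q, alpha=0.9, beta=0.5)

Because S keeps a fixed d_head-by-d_head shape per head, memory stays constant with context length, which is the memory advantage quantified later in the article.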
The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet, can, to some extend, still capture context, but it has to go through the memory ( S ) bottleneck. That memory is a fixed size and thus more efficient, but it compresses past context into a single hidden state similar to RNNs. That’s why the Qwen3-Next and Kimi Linear architectures don’t replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier. In the previous section, we discussed the advantage of the DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don’t grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article). Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with context length. For a regular multi-head attention (MHA) layer, we can compute the KV cache size as follows: (The 2 multiplier is there because we have both keys and values that we store in the cache.) For the simplified DeltaNet version implemented above, we have: Note that the memory size doesn’t have a context length ( ) dependency. Also, we have only the memory state S that we store instead of separate keys and values, hence becomes just bytes. However, note that we now have a quadratic in here. This comes from the state: But that’s usually nothing to worry about, as the head dimension is usually relatively small. For instance, it’s 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. 
Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. 
You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. 
Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. 
If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. 
In parallel, the agent learns an internal world model, which serves as a mental simulation of the environment and allows it to predict outcomes and plan actions before executing them in the real world. The term "world model", as far as I know, was popularized by Ha and Schmidhuber's 2018 paper of the same name, World Models, which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the concept itself essentially just refers to modeling a representation of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn't have the new interpretation of world models on my radar until Yann LeCun's 2022 article A Path Towards Autonomous Machine Intelligence. It was essentially about mapping an alternative path to AI instead of LLMs. That being said, world model papers were, until recently, all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, I found the Code World Model paper (Sep 30, 2025) to be the first world model paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. So, how does this differ from a regular code LLM such as Qwen3-Coder? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static, text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code's behavior. Annotated figure from https://www.arxiv.org/abs/2510.02387.
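To give a feel for what such line-by-line execution traces look like, here is one simple way to record them in Python. This is only an illustration of the general data-format idea; the actual CWM training pipeline and trace format are more involved:

```python
import sys

def trace_locals(func, *args):
    """Record (line number, local variables) at each executed line of `func`."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace

def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, trace = trace_locals(running_sum, 3)
# Each trace entry pairs a source line with the variable state at that point,
# e.g. (loop_line, {'n': 3, 'total': 0, 'i': 0}), ..., showing how `total` evolves.
# A (code, trace) pair like this is the kind of "action -> resulting program state"
# supervision that the figure above illustrates.
```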
Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don't always have to be large. In fact, with the Hierarchical Reasoning Model (HRM), a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section covers small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model's "thought" or "reasoning" process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper. And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let's take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: (1) it computes a latent reasoning state from the current question and answer, and (2) it then updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights. It's important to note that TRM is not a language model operating on text. However, because (a) it's a transformer-based architecture, (b) reasoning is now a central focus in LLM research and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor), I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, it currently works on grid-based inputs and outputs. In other words, both the "question" and the "answer" are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences. HRM consists of two small transformer modules (each 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that's likely to make it easier to compare against HRM.)
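Here is a rough, deliberately toy-sized sketch of that training recipe. The real TRM shares a single 2-layer transformer for both updates, operates on token grids, and uses a different loss (plus a learned stopping signal); the two MLPs, the dimensions, and the MSE objective below are placeholders purely to show the control flow:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the shared tiny transformer used in the actual TRM.
latent_net = nn.Sequential(nn.Linear(3 * 64, 128), nn.ReLU(), nn.Linear(128, 64))
answer_net = nn.Sequential(nn.Linear(2 * 64, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.AdamW(
    list(latent_net.parameters()) + list(answer_net.parameters()), lr=1e-3
)

def refine(x, y, z, n_inner=6):
    for _ in range(n_inner):
        z = latent_net(torch.cat([x, y, z], dim=-1))  # latent "reasoning" state from question + answer
        y = answer_net(torch.cat([y, z], dim=-1))     # improved answer from the latent state
    return y, z

def train_step(x, target, n_outer=16):
    y, z = torch.zeros_like(target), torch.zeros_like(target)
    for _ in range(n_outer):              # up to 16 refinement steps per batch
        with torch.no_grad():             # cheap inner refinement loops, no graph kept
            y, z = refine(x, y, z)
        y, z = refine(x, y, z)            # one more refinement loop, this time with gradients
        loss = F.mse_loss(y, target)      # placeholder objective for the sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        y, z = y.detach(), z.detach()
    return loss.item()

x, target = torch.randn(8, 64), torch.randn(8, 64)   # random stand-ins for "question" and "answer"
train_step(x, target)
```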
TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating. Performance-wise, TRM performs really well compared to HRM, as shown in the figure below. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers lead to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required. Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length. While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adapted to other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLMs are more like computers, which can do a lot of other things as well. Still, these recursive architectures are exciting proofs of concept that highlight how small, efficient models can "reason" through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, maze-finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domains for such models. As an interesting tidbit, the author shared that it took less than $500 to train this model, with 4 H100s for around 2 days. I am delighted to see that it's still possible to do interesting work without a data center. I originally planned to cover all model categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although Gated DeltaNet already gave a taste of state space models and recurrent designs). As a conclusion to this article, I want to repeat my earlier point: standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now.

Traditional Decoder-Style, Autoregressive Transformers
+ Proven & mature tooling
+ "Well-understood"
+ Scaling laws
+ SOTA
- Expensive training
- Expensive inference (except for the aforementioned tricks)

If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising; they are especially interesting when working with longer contexts where efficiency is a main concern.
Linear Attention Hybrids
+ Same as decoder-style transformers
+ Cuts FLOPs/KV memory at long-context tasks
- Added complexity
- Trades a bit of accuracy for efficiency

On the more extreme end, text diffusion models are an interesting development. I'm still somewhat skeptical about how well they perform in everyday use, as I've only tried a few quick demos. Hopefully, we'll soon see a large-scale production deployment with Google's Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about them.

Text Diffusion Models
+ Iterative denoising is a fresh idea for text
+ Better parallelism (no next-token dependence)
- Can't stream answers
- Doesn't benefit from CoT?
- Tricky tool-calling?
- Solid models but not SOTA

While the main selling point of text diffusion models is improved efficiency, code world models sit at the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models based on standard LLMs are mostly improved through reasoning techniques, yet if you have tried them on harder challenges, you have probably noticed that they (more or less) still fall short and can't solve many of the trickier coding problems well. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems.

Code World Model
+ Promising approach to improve code understanding
+ Verifiable intermediate states
- Inclusion of executable code traces complicates training
- Code running adds latency

Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, I am glad researchers are working on them. Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as general-purpose models for free-form text, code, math problems, and much more. They feel like a brute-force, jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks. Also, I can see them as potential "tools" for other tool-calling LLMs; for instance, when LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems.

Small Recursive Transformers
+ Very small architecture
+ Good generalization on puzzles
- Special purpose models
- Limited to puzzles (so far)

This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you've been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI, because there's a lot of interesting work happening right now! This magazine is a personal passion project, and your support helps keep it alive.
If you'd like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I'm confident you'll get a lot out of these; they explain how LLMs work in a depth you won't find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my "The Big Architecture Comparison" article. Other non-framed architectures may be covered in future articles. Note that ideally each of the topics shown in the figure above would deserve at least a whole article of its own (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are kept fairly short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here.) 1. Transformer-Based LLMs Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 and Mistral Small 3.1. Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I have talked and written about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you'd like deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn't work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. 2.
(Linear) Attention Hybrids Before we discuss the "more different" approaches, let's first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There's recently been a revival of linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today's LLMs. Besides traditional multi-head attention, it's also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention, as discussed in my talk. 2.1 Traditional Attention and Quadratic Costs The original attention mechanism, softmax(QKᵀ)V, scales quadratically with the sequence length. This is because the query (Q), key (K), and value (V) are n-by-d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article.) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. 2.2 Linear Attention Linear attention variants have been around for a long time, and I remember seeing tons of papers on them in the early 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism with a kernelized form, roughly softmax(QKᵀ)V ≈ ϕ(Q)(ϕ(K)ᵀV), up to a normalization term. Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x) + 1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QKᵀ. I don't want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n²) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction because they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. 2.3 Linear Attention Revival In the second half of this year, there has been a revival of linear attention variants, as well as a bit of a back-and-forth from some model developers, as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B-parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek team announced DeepSeek V3.2. (DeepSeek V3.2's sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it's fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist: the MiniMax team released their new 230B-parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs.
It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also for agentic applications. This could have been a turning point suggesting that linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I want to discuss in the next few sections as one example of a hybrid attention architecture. 2.4 Qwen3-Next Let's start with Qwen3-Next, which replaced the regular attention mechanism with a Gated DeltaNet + gated attention hybrid, which helps enable the native 262k-token context length in terms of memory usage (the previous 235B-A22B model supported 32k natively, and 131k with YaRN scaling). Their hybrid mechanism mixes Gated DeltaNet blocks with gated attention blocks in a 3:1 ratio, as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is implemented as either gated attention or Gated DeltaNet. This simply means that the 48 transformer blocks (layers) in this architecture alternate between the two. Specifically, as mentioned earlier, they alternate in a 3:1 ratio: three consecutive Gated DeltaNet blocks are followed by one gated attention block, and this pattern repeats throughout the model. Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous "regular" Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? 2.5 Gated Attention Before we get to the Gated DeltaNet itself, let's briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses "gated attention". This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification to a standard attention implementation (based on code from chapter 3 of my LLMs from Scratch book); a simplified sketch of the idea follows a bit further below. After computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: "[...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model." In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. 2.6 Gated DeltaNet Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network) is Qwen3-Next's linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper, as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future.
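Here is that simplified sketch of the output-gating idea from section 2.5: a minimal single-head version on top of plain scaled-dot-product attention. Causal masking, dropout, and multi-head details are omitted, and this is an illustration of the mechanism, not the original chapter-3-based listing:

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Single-head self-attention with a sigmoid output gate (simplified illustration)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.W_gate = nn.Linear(d_in, d_out, bias=False)   # extra gating projection

    def forward(self, x):                                  # x: (batch, num_tokens, d_in)
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        context = attn_weights @ values                    # regular attention output
        gate = torch.sigmoid(self.W_gate(x))               # gating signal from the same input, in (0, 1)
        return gate * context                              # scale features up or down dynamically
```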
The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers familiar with the classic machine learning literature can think of this as similar to Hebbian learning, inspired by biology: "Cells that fire together wire together." It's basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of a logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, in addition to the output gate, the "gated" in Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, and β (update gate) controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see, there are lots of differences from standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends to, or looks at, every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it's still the regular scaled-dot-product attention that scales quadratically with the context length. As a refresher, scaled-dot-product attention is computed as softmax(QKᵀ)V, where Q and K are n-by-d matrices, n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an n-by-n attention matrix that is multiplied by an n-by-d value matrix V. Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n. In Gated DeltaNet, there's no n-by-n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is implemented as a recurrent state update, where S is the state (memory) that gets updated at each time step t; a simplified sketch follows below. And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay), and β (beta) regulates how much the current token at time step t updates the memory. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. 2.8 Kimi Linear vs. Qwen3-Next Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning that for every three transformer blocks employing the linear Gated DeltaNet variant, there's one block that uses full attention, as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side.
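To summarize the recurrence described in section 2.6, here is a deliberately simplified per-token delta-rule update with decay (α) and update (β) gates. It omits the output gate, multi-head splitting, the convolutional mixing, and the chunked parallel kernels used in practice, so treat it as an illustration of the idea rather than the exact formulation from the Gated DeltaNet paper:

```python
import torch

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent update of the memory state S (shape: d_value x d_key).

    alpha in (0, 1): decay gate -- how much of the old memory to keep
    beta  in (0, 1): update gate -- how strongly this token rewrites the memory
    """
    pred = S @ k                                    # what the current memory predicts for this key
    delta = v - pred                                # delta rule: new value minus predicted value
    S = alpha * S + beta * torch.outer(delta, k)    # decay old memory, write in the correction
    out = S @ q                                     # read out with the query; no n-by-n matrix needed
    return out, S

# Toy usage: process a sequence token by token with a fixed-size memory.
d = 8
S = torch.zeros(d, d)
for _ in range(5):
    q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
    alpha, beta = torch.rand(1), torch.rand(1)      # in practice these are learned, per-token gates
    out, S = gated_delta_step(S, q, k, v, alpha, beta)
```

Because the state S has a fixed size, the memory cost stays constant in the sequence length, which is exactly what the KV-cache comparison in Figure 10 illustrates.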
Gated DeltaNet is a linear attention variant that takes inspiration from recurrent neural networks, including a gating mechanism, and comes from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section). The MLA in Kimi Linear, depicted in the upper right box in Figure 11 above, does not use the sigmoid gate. This omission was intentional so that the authors could compare the architecture more directly to standard MLA; however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in the multi-head latent attention (MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long-context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. 2.9 Kimi Delta Attention Kimi Linear modifies the linear attention mechanism used in Qwen3-Next via its Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with channel-wise gating, i.e., a separate gate value for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next's gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There's no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as Gated DeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper, MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variants once again promising for larger state-of-the-art models. That being said, Kimi Linear is, at 48B parameters, still a fairly large model, but it's 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. 2.10 The Future of Attention Hybrids Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient.
For example, compared to regular full attention, Kimi Linear reports a 75% KV cache reduction and up to 6x higher decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art.
However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 4.1 The Main Idea Behind World Models Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective.
At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort).

1 views
Giles's blog 2 months ago

Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch

I recently posted about Andrej Karpathy's classic 2015 essay, " The Unreasonable Effectiveness of Recurrent Neural Networks ". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about. This post is a bit more hands-on. To understand how these RNNs really work, it's best to write some actual code, so I've implemented a version of Karpathy's original code using PyTorch's built-in class -- here's the repo . I've tried to stay as close as possible to the original, but I believe it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising, given that he wrote it using Torch, the Lua-based predecessor to PyTorch.) In this post, I'll walk through how it works, as of commit . In follow-up posts, I'll dig in further, actually implementing my own RNNs rather than relying on PyTorch's. If you already have a basic understanding of what RNNs are and roughly how they work, you should be fine with this post. However, if you're coming directly from normal "vanilla" neural nets, or even Transformers-based LLMs (like the one I'm working through in my LLM from scratch series), then it's definitely worth reading through the last post , where I give a crash course in the important stuff. So with that said, let's get into the weirdest bit from a "normal" LLM perspective: the dataset. Every now and then on X/Twitter you'll see wry comments from practitioners along the lines of "AI is 5% writing cool models and 95% wrangling data". My limited experience bears this out, and for RNNs it's particularly weird, because the format of the data that you feed in is very different to what you might be used to for LLMs. With a transformers-based LLM, you have a fixed context length -- for the GPT-2 style ones I've posted about in the past, for example, you have a fixed set of position embeddings. More recent position encoding mechanisms exist that aren't quite so constraining, but even then, for a given training run you're going to be thinking in terms of a specific context length -- let's call it n -- that you want to train for. So: you split up your training data into independent chunks, each one n long. Then you designate some subset of those your validation set (and perhaps another bunch your test set), and train on them -- probably in a completely random order. You'll be training with batches of course; each batch would likely be a completely random set of chunks. To get to the core of how different RNNs are, it helps to start with an idealised model of how you might train one. Remember, an RNN receives an input, uses that to modify its internal hidden state , and then emits an output based on the updated hidden state. Then you feed in the next input, update the hidden state again, get the next output, and so on. Let's imagine that you wanted to train an RNN on the complete works of Shakespeare. A super-simple -- if impractical -- way to do that would be to feed it in, character by character. Each time you'd work out your cross-entropy loss . Once you'd run it all through, you'd use those accumulated per-character losses to work out an overall loss (probably just by averaging them). You would run a backward pass using that loss, and use that to adjust the parameters. If you're feeling all at sea with that backpropagation over multiple steps of a single neural network with hidden state, check out the " Training RNNs " section of the last post. 
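For concreteness, here's roughly what that idealised setup would look like in PyTorch -- `model` here is any RNN-ish module that takes a batch of one token plus the previous hidden state, and returns next-byte logits plus the new hidden state. These are hypothetical names for illustration, not code from the repo:

```python
import torch
import torch.nn.functional as F

def train_one_pass(model, text_ids, optimizer):
    """Feed the whole corpus through once, one character at a time, then do a single backward pass."""
    hidden = None
    losses = []
    for i in range(len(text_ids) - 1):
        logits, hidden = model(text_ids[i:i + 1], hidden)              # predict the next character...
        losses.append(F.cross_entropy(logits, text_ids[i + 1:i + 2]))  # ...and score it against the real one
    loss = torch.stack(losses).mean()   # average the per-character losses
    optimizer.zero_grad()
    loss.backward()                     # backpropagate through the *entire* sequence -- hence "impractical"
    optimizer.step()
    return loss.item()
```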
You can see that in this model, we don't have any kind of chunked data. The whole thing is just run through as a single sequence. But there are three problems: vanishing or exploding gradients, batching, and validation. Let's address those -- firstly, the vanishing or exploding gradients. In the last post I touched on truncated backpropagation through time (TBPTT). The idea is that instead of backpropagating through every step we took while going through our batched input sequences, we run a number of them through, then backpropagate, and then continue. Importantly, we keep the hidden state going through the whole sequence -- but we detach it from the compute graph after each of these steps, which essentially means that we start accumulating gradients afresh, as if it was a new sequence, but because it started from a non-zero initial hidden state, we're still getting some training value from the stuff we've already been through. 2 Imagine we have this simple sequence: Let's say we're doing TBPTT of length 3: we can split up our training set so that it looks like this: So now, we just feed in "a", then "b", then "c", then do our TBPTT -- we calculate loss just over those items, update our gradients, and then detach the hidden state, but keep its raw, un-gradient-ed value. Then we start with that stored hidden state, and feed in "d", "e", "f". Rinse and repeat. In practice we'd probably throw away that short sequence at the end (because it would cause issues with gradient updates -- more here ), so we'd just get this: Now, let's look into batching. It's a bit harder, but with a bit of thought it's clear enough. Let's say that you want b items in your batch. You can just split your data into b separate sequences, and then "stack them up", like this with b = 2: So for training, we'd feed our vector in as a batch, calculate loss on both of them, then do the same with the next inputs, and so on. The important thing is that each batch position -- each row, in that example -- is a consistent, continuous, meaningful sequence in and of itself. Finally, for validation, you also need some real sequences. For that, you can just split up the batched subsequences, with a "vertical" slice. Let's take the rather extreme view that you want 50% of your data for validation (in reality it would be more like 10-20%, but using 50% here makes it clearer): Your training set would wind up being this: ...and the validation set this: And we're done! So that's what we wind up feeding in. And it kind of looks a bit like what we might wind up feeding in to a regular LLM training loop! It's a set of fixed-length chunks. But there's one critically important difference -- they're not in an arbitrary order, and we can't randomise anything. The sequence of inputs in, for example, batch position one, needs to be a real sequence from our original data. This has been a lot of theoretical stuff for a post that is meant to be getting down and dirty with the code. But I think it's important to get it clear before moving on to the code, because when you see it, it looks pretty much like normal dataset-wrangling -- so you need to know why it's really not. Let's get into the code now. In the file , we define our dataset: The that we pass in will be our complete training corpus -- e.g. the complete works of Shakespeare -- and is the limit we're going to apply to our truncated backpropagation through time -- that is, three in the example above. Karpathy's blog post mentions using 100, though he says that limiting it to 50 doesn't have any major impact.
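To give a feel for the shape of this class before walking through it, here's a rough sketch with hypothetical names (the real code lives in the repo and differs in the details -- for instance, it splits the tokeniser out into its own class):

```python
import torch
from torch.utils.data import Dataset

class CharSequenceDataset(Dataset):
    """Rough sketch only: byte-level corpus chopped into TBPTT-length subsequences."""

    def __init__(self, data: bytes, seq_length: int):
        assert len(data) >= seq_length + 1, "need at least one subsequence plus a target byte"
        # Trim to an exact number of subsequences, plus one byte for the shifted-left targets.
        usable = (len(data) - 1) // seq_length * seq_length + 1
        self.data = data[:usable]
        self.seq_length = seq_length
        # Byte -> ID mapping based on the sorted set of unique bytes in the corpus
        # (a dict would be faster than list.index; this keeps the sketch short).
        self.vocab = sorted(set(self.data))
        self.encoded = torch.tensor([self.vocab.index(b) for b in self.data], dtype=torch.long)

    def __len__(self):
        return (len(self.data) - 1) // self.seq_length

    def __getitem__(self, i):
        start = i * self.seq_length
        end = start + self.seq_length
        x_tokens = self.encoded[start:end]
        y_tokens = self.encoded[start + 1:end + 1]   # targets are the inputs shifted left by one
        x_bytes = self.data[start:end]               # raw bytes, handy when debugging
        y_bytes = self.data[start + 1:end + 1]
        return x_tokens, y_tokens, x_bytes, y_bytes
```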
Next, we make sure that we have at least enough data to do one of those TBPTTs, plus one extra byte at the end (remember, we need our targets for the predictions -- the Ys are the Xs shifted left with an extra byte at the end). ...and we stash away the data, trimmed so that we have an exact number of these sequences, plus one extra byte for our shifted-left targets. Now we create a tokeniser. 3 This is related to something I mentioned in the last post. Karpathy's post talks about character-based RNNs, but the code works with bytes. The RNNs receive as their input a one-hot vector. Now, if we just used the bytes naively, that would mean we'd need 256 inputs (and accept 256 outputs) to handle that representation. That's quite a lot of inputs, and the network would have to learn quite a lot about them -- which would be wasteful, because real human-language text, at least in European languages, will rarely use most of them. His solution is to convert each byte into an ID; there are exactly as many possible IDs as there are different bytes in the training corpus, and they're assigned an ID based on their position in their natural sort order -- that is, if our corpus was just the bytes , and , then we'd have this mapping 4 : We just run the full dataset through to get the set of unique bytes, then sort it -- that gives us a Python list in the right order so that we can just do lookups into it to map from an ID to the actual byte. The class is defined in and is too simple to be worth digging into; it just defines quick and easy ways to get the vocab size (the number of IDs we have), and to encode sequences of bytes into PyTorch tensors of byte IDs and to decode them in the other direction. Because these byte IDs are so similar to the token IDs that we use in LLMs, I've adopted the name "tokens" for them just because it's familiar (I don't know if this is standard). So, at this point, we have our data and our tokenizer; we finish up by stashing away an encoded version of the data ready to go: Next we define a method to say how long our dataset is -- this is calculated in terms of how many TBPTT sequences it has: -- and a method: This works out the start and the end of the th subsequence of length in the data. It then returns four things: The code as it stands doesn't actually use the last two, the raw bytes -- but they did prove useful when debugging, and I've left them in just in case they're useful in the future. If you look back at the more theoretical examples above, what this Dataset is doing is essentially the first bit: the splitting into BPTT-length subsequences and dropping any short ones from the end -- the bit where we go from The only extra thing is that it also works out our target sequences, which will be a transformation like this: So that's our . Next we have a simple function to read in data; like the original code I just assume that input data is in some file called in a directory somewhere: Now we have the next step, the function : This looks a little more complicated than it actually is, because it's building up a list of tuples, each one of which is a set of , , and . If we imagine that it only did the , it would look like this: So, what it's doing is working out how many batches of size there are in the sequence. With our toy sequence ...and a batch size of two, there are . In this case, it would then loop from zero to 3 inclusive. Inside that loop it would create a list, then loop from zero to 1 inclusive. 
The first time round that loop it would get the item at , which is 0 + 0 * 4 = 0, so the subsequence . It would add that to the list. Then it would go round the inner loop again, and get the item at the new . is now 1, so that would be 0 + 1 * 4 = 4, so it would get the subsequence at index 4, which is , and add that to the list. We'd now have finished our first run through the inner loop, and we'd have the list [ , ], so we stack them up into a 2-D tensor: Hopefully it's now fairly clear that in our next pass around the outer loop, we'll pull out the items at index 1 and index 5 to get our next batch, and , and so on, so that at the end we have done the full calculation to get this: ...as a list of 2 × 3 PyTorch tensors. And equally hopefully, it's clear that the code in is just doing that, but not only for the but also for the , and . One thing to note before moving on is what happens if the number of items doesn't divide evenly into batches -- this code: ...means that we'll drop them. So, for example, if we wanted a batch size of three with our toy sequence ...then we'd get this: ...and the and would be dropped. And that's it for the dataset code! You might be wondering where the split to get the validation set comes -- that's actually later on, in the training code that actually uses this stuff. So let's move on to that! This is, logically enough, in the file train_rnn.py . There's quite a lot of code in there, but much of it is stuff I put in for quality-of-life (QoL) while using this. It's useful -- but I'll skip it for now and come back to it later. Initially, I want to focus on the core. We'll start with the function at the bottom. It starts like this: The -related stuff is QoL, so we'll come back to it later. All we need to know right now is that it's a way of getting information into the system about where its input data is, plus some other stuff -- in particular our TBPTT sequence length and our . So it uses that to read in some training data, then initialises one of our s with it and the , then uses to split it into batches. Next we have this: So our gives us a validation data percentage; we do some sanity checks and then just slice off an appropriate amount from the end of the we got to split the data into train and validation sets. That's the equivalent of the transform from the example earlier from To this training set: ...and this validation set: Now, we create our model: We're using a new class, which is an extension of the PyTorch built-in class -- we'll come back to that later. It's also getting parameters (things like the size of the hidden state and the number of layers) from the . Finally, we do the training in a function: So let's look at that now. It starts like this: That's fairly standard boilerplate to use CUDA if we have it, and to put the model onto whatever device we wind up using. Next: The class name for the optimiser is another one of those things from the , as are the learning rate and weight decay hyperparameters. So we just create an instance of it, and give it the model's parameters to work with along with those. Next, we get our patience: This is a QoL thing, but I think it's worth going into what it actually means. When we're training, we normally train for a fixed number of epochs. However, sometimes we might find that our model was overfitting -- say, at epoch 50 out of 100 we might see that the training loss was still decreasing, but our validation loss started rising. 
Any further training past that point might be pointless -- if we're doing things properly, we're saving checkpoints of the model periodically, so we'd be able to resurrect the model that we had at the point where validation loss was lowest, but we're still wasting time continuing training. A common solution to that is to have early stopping in the training loop. If the validation loss starts rising then we bail out early, and don't do the full number of epochs that we originally planned to do. Naively, we might keep track of the validation loss from the last epoch, and then if the current epoch has a higher loss, then we bail out. However, sometimes you find that validation loss rises a bit, but then starts going down again -- it's kind of like a meta version of finding a local minimum in the loss function itself. The solution to that is to use patience -- a measure of how many epochs of rising validation loss you're willing to put up with before you do your early exit. That's the number we're getting from our here -- it's a positive number (note the paranoid ), and if it's not defined we just assume that we have infinite patience. The next two lines are related to patience too -- before we go into our main training loop, we define the two variables we need to control early exit with patience: Pretty obviously, those are the best validation loss that we've seen so far, and the number of the epoch where we saw it. Right, finally we get to some training code! We have our epoch loop: We're using the rather nice module to get progress bars showing how far we are through the train (ignoring any early exits due to running out of patience, of course). We start the epoch by generating some random text from the model. This gives us a reasonably easy-to-understand indication of progress as we go. Next we put our model into training mode: ...set an initial empty hidden state: You might be wondering why the hidden state is getting a variable of its own, given that it's meant to be hidden -- it's right there in the name! Don't worry, we'll come to that. Next we initialise some variables we'll use to keep track of loss -- the total loss across all of the batches we've pushed through, plus the total number of tokens. The metric we track for each epoch is the loss per token, so we use those to work out an average. Now it's time to start the inner training loop over our batches: We're just unpacking those tuples that were created by into our and (I think I was being ultra-cautious about things here when I added to the start of ). And again we're using to have a sub-progress bar for this epoch. Next, we move our Xs and Ys to the device we have the model sitting on: And then run it through the model. The code to do this looks like this: ...and I think it's worth breaking down a bit. You can see that there's a branch at the top, if there's a hidden state then we need to pass it in and if there isn't, we don't. But let's focus on the no-hidden state option in the branch first, because there's something surprising there: Remember the description of an RNN from above: an RNN receives an input, uses that to modify its internal hidden state , and then emits an output based on the updated hidden state. Then you feed in the next input, update the hidden state again, get the next output, and so on. We can easily extend that to handle batches -- you'd give the RNN a batch of inputs (let's say a tensor b × 1 , and get a batch of results, also b × 1 . 
You'd also need the RNN to hold b hidden states, but that's not a big jump. But what we're doing in that code is something different -- we're feeding in a whole series of inputs -- that is, is of size b × n , where n is our desired TBPTT sequence length. What's worse, in our description above, the hidden state was just that -- something hidden in the model. Now it's being returned by the RNN! What's going on? Let's start off with that hidden state. We often need to do stuff with the hidden state from outside the RNN -- indeed, we're detaching it as an important part of our TBPTT. So the PyTorch RNN actually does work rather like the simplified model that I described in my last post , and treats the hidden state like an output, like in this pseudocode: That is, the hidden state is an input and a return value, like this: OK, so the hidden state thing makes sense. How about the fact that we're feeding in a whole set of inputs? This is actually just due to a quality of life thing provided by PyTorch's various RNN classes. Wanting to feed in a sequence is, of course, a super-common thing to want to do with an RNN. So instead of having to do something like the pseudocode above, it's baked in. When you run ...then because is b × n , it just runs the RNN n times, accumulating the outputs, then returns the outputs as another b × n tensor, along with the final from the last run through that loop. (There is a wrinkle there that we'll come to shortly.) With that explained, hopefully that branch is clear. We don't have a hidden state right now, so we run all of the inputs across all of our batch items through the RNN in one go, and we get the outputs plus the hidden state that the RNN had at the end of processing that batch of sequences. Now let's look at the other branch, where there is a pre-existing hidden state: Hopefully the last line is clear -- we're just doing the same as we did in the branch, but we're passing the hidden state in because in this case we actually have one. The first two lines are a bit more complex. As you know, we need to detach the hidden state from PyTorch's computation graph in order to truncate our backpropagation through time. We're doing that here at the start of the loop just to make sure that each batch that we're pushing through starts with a guaranteed-detached hidden state. So that explains those calls to the methods. The fact that our hidden state is a tuple of two things that we have to detach separately is a little deeper; for now, all we need to know is that the LSTM models that we're using are a variant of RNN that has two hidden states rather than one, and so we need to handle that. I'll go into that in more depth in a future post. Once we've done that, we've completed our forward pass for the epoch. Let's move on to the backward pass. Next, we have this: Pretty standard stuff. is defined further up in the file: It's exactly the same as the function we used to calculate loss in the LLM-from-scratch posts: I wrote more about that here if you're interested in the details. Next, we do something new: This is something that is generally very useful in RNNs. They are prone to vanishing and exploding gradients, and this code is to help handle the exploding case. What it says is, if we've defined a , we use it to clip gradients when they get too big, which means that training is going be better because we're not going to have updates swinging wildly up and down. Let's say that we set to 1.0. 
If, at the time this code is run, the norm of the gradients -- which is a measurement of their size 5 -- is, say, 10, then they would all be scaled down to 10% of their size, making the new norm 1.0. So that keeps them in check, and stops any wild variations in gradient updates. So, in short -- it's a stabilisation technique to stop exploding gradients leading to issues with training. Next, we have our normal code to update the parameters based on these (potentially clipped) gradients: And finally, we update our count of how many inputs we've seen and our total loss so far in this epoch: That's our training loop! Once we've done that code -- run our input through the model, calculated loss, worked out our gradients, clipped them if necessary, done our update and stored away our housekeeping data, we can move on to the next batch in our sequences. When we've gone through all of the batches that we have, our training for the epoch is complete. We print out our loss per-token: ...and then it's time for our validation loop. This is so similar to the training loop that I don't think it needs a detailed explanation: The only big difference (apart from the lack of a backward pass and parameter updates) is that we're not detaching the hidden state, which makes sense -- we're in a block with the model in mode, so there is no computation graph to detach them from. Validation done, it's time for a bit of housekeeping: All we're doing here is keeping track of whether this is the best epoch in terms of validation loss. The boolean is exactly what it says it is. If we're on our first run through the loop ( is None) then we record our current val loss as , and store this epoch's number into . Otherwise, we do have an existing , and if our current val loss is lower than that one, we also stash away our current loss and epoch as the best ones. Otherwise we are clearly not in the best epoch so we update to reflect that. Once we've done that, we save a checkpoint: I'll go into the persistence stuff -- saving and loading checkpoints -- later on. Next, a QoL thing -- we generate a chart showing how training and validation loss have been going so far: Again, I'll go into that later. Finally, we do our early stopping if we need to: If the current epoch is more than epochs past the one that had the best validation loss so far, then we stop. That's the end of the outside loop over epochs for our training! If we manage to get through all of that, we print out some sample text: ...and we're done! That's our training loop. Now let's move on to the model itself. I called my model class a , and you can see the code here . It's actually not a great name, as it implies there's something specifically Andrej Karpathy-like about it as a way of doing LSTMs, while what I was trying to express is that it wraps a regular PyTorch LSTM with some extra stuff to make it work more like his original Lua Torch implementation . I tried to come up with a more descriptive name, but they all started feeling like the kinds of class names you get in "Enterprise" Java code like so I gave up and named it after Karpathy. Hopefully he'll never find out, and won't mind if he does... 6 The Lua code does four things differently to PyTorch's built-in class: Let's look at the code now: You can see that it's doing 1 to 3 of those steps above -- the one-hot, the extra dropout, and the linear layer to project back to vocab space. 
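For orientation, a wrapper along those lines might look roughly like this -- a minimal sketch with my own naming, not the actual class from the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharLSTM(nn.Module):
    """Sketch: one-hot input -> LSTM -> dropout -> linear projection to vocab logits."""

    def __init__(self, vocab_size, hidden_size, num_layers, dropout=0.5):
        super().__init__()
        self.vocab_size = vocab_size
        self.lstm = nn.LSTM(
            input_size=vocab_size, hidden_size=hidden_size,
            num_layers=num_layers, dropout=dropout, batch_first=True,
        )
        self.dropout = nn.Dropout(dropout)              # extra dropout after the final LSTM layer
        self.head = nn.Linear(hidden_size, vocab_size)  # project back out to vocab space

    def forward(self, token_ids, hidden=None):
        # (batch, seq) integer IDs -> (batch, seq, vocab) one-hot floats
        x = F.one_hot(token_ids, num_classes=self.vocab_size).float()
        out, hidden = self.lstm(x, hidden)              # hidden is an (h, c) tuple for an LSTM
        logits = self.head(self.dropout(out))           # logits, not probabilities
        return logits, hidden
```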
The only other oddity there is this kwarg: That's the wrinkle I was talking about when we went through the training loop and was discussing batches. The PyTorch LSTM by default expects the batch dimension to be the second one of the input tensors -- that is, instead of passing in a b × n tensor, it wants an n × b one. That's not what I'm used to (nor is it what the original Lua code uses, if I'm reading it correctly), but luckily it can be overridden by the logically-named option. The only step we don't do in this class is the softmaxing of the logits to convert them to probabilities. That's because PyTorch's built-in wants logits rather than probabilities, so it was easier to just call softmax on the outputs where necessary. So that's our model. Let's take a look at the code that we can use to run it and generate some text. The code for this is in . Ignoring the boilerplate that parses the command-line options, we can start here: So, we're taking the directory and run name that the QoL helpers that I'll be describing later, a specific checkpoint of a training run to use, the number of bytes that we want to generate, the temperature to use when sampling (more about temperature here ) and a "primer" text. That last one is because in order to get something out of our RNN, we need to feed something in. I tried using a single random byte from the vocab initially (that's still the default, as we'll see shortly), and that was OK, but the bytes aren't equally represented in the training data (eg. "z" is less common than "e", but weird bytes that only occur in occasional multibyte unicode characters are rarer still) -- and that means that we might be trying to get our RNN to start with something it hasn't seen very much, so we get bad results. Even worse, because some of the input text is unicode, there's no guarantee that a random byte is even valid on its own -- it might be something that only makes sense after some leader bytes. So I found that in general it's best to provide a fixed string to start with -- say, "ACT" for Shakespeare, or "He said" for "War and Peace". So, with those command-line flags, we start off by using the QoL stuff to get the metadata we need about the model: ...then we use our persistence code to load up the desired checkpoint: At this point we have the version of the model that was saved for that checkpoint, and its associated tokeniser. We move this to an appropriate device -- CUDA if we have it, CPU otherwise: ...and then use a helper function to generate some text: Once we have that, we print it out, after decoding it as UTF-8: If a primer was provided, we print it first, but if the primer was a random byte we don't. Also, because the generated bytes might include invalid Unicode, we just replace those with "?" when we decode (that kwarg). Let's look at the helper next. So, after a little bit of paranoia about our desired sequence length, we make sure we're not tracking gradients and put the model into eval mode (to disable dropout). Next, we work out our primer bytes -- either by picking a random one, or by decoding the string that we were provided into its constituent UTF-8 bytes: The primer needs to be converted to the byte token IDs that our tokeniser uses: The is something you might remember from the LLM posts -- we need to run a batch through our RNN, and the is just a tensor of n bytes. adds on an extra dimension so that it's 1 × n , as we want. 
Next, we put the primer onto the same device as the model: As an aside, I think I might start using code like that more often, I often find myself passing variables around and TBH it seems much more natural to just ask the model what device it's using. Next, we run it through the model: Now we use a helper function to sample from those logits to get our first generated byte: Note that we are explicitly taking the last item from . It is a b × n × v tensor, where b is our batch size (always one in this script), n is the length of the primer that we fed in, and v is our vocab size. The just extracts the last item along the n dimension so that we have the b × v logits that came out of the RNN for the last character of the primer, which is what we want. We'll get to the function later, but it returns a b × 1 tensor, so now, we just extract the byte ID from it and put it into a new list: Next comes our autoregressive loop -- we've already generated one byte, so we loop times to get the rest, each time running the model on the last byte we got, sampling from the distribution implied by the logits, and adding it onto our list: Once that's done, we have our generated byte IDs in , so we just use the tokeniser to turn them back into bytes and return the result: Easy, right? Now let's look at . The function takes logits and the temperature: Firstly, we handle the case where temperature is zero. By convention this means greedy sampling -- we just always return the highest-probability next token, so we can use for that: If the temperature is non-zero, we divide the logits by it and run softmax over the result: ...and then we just sample from the probability distribution that we get from that: And that's it! The only things to explain now are the quality of life stuff, and the persistence functions that handle saving and loading checkpoints. Let's look at our QoL things first. When I started building this code I knew I wanted to run RNNs on multiple input texts -- Shakespeare, "War and Peace", etc. I also realised that for each of those input texts, I'd want to try different model sizes. The underlying concept I came up with was to have "experiments", which would each have a particular training text. Each experiment would have multiple "runs", which would have particular training hyperparameters -- the model size, number of epochs, and so on. I decided to represent that with a directory structure, which you can see here . One subdirectory per experiment, and if you go into the one you'll see that it has two subdirectories, for the training data and for the different training runs I tried. The directory contains a file called , which is the training data itself. That one only exists in the experiment, though, because I was concerned with copyright for the other training sets. There is a file in all data directories for all experiments, though, which explains how to get the data. The directory has more in it. Each run is for a particular set of hyperparameters, so let's look at the ones for the run. We have two files, , which looks like this: It's essentially the model-specific hyperparameters, the ones we pass in when creating our -- for example, remember this from the training code: is this JSON dict loaded into Python. 
There's also , which has the training data: Hopefully these are all familiar from the training code; they all go into , so they're used in code like this: So, now when we look at the start of the and scripts, and see things like this: ...it should be clear that we're loading up those JSON dicts from those files. You can see that code at the start of . It looks like this: So, some basic sanity checking that we have the directories we expect. Next: ...we create a checkpoints directory if it doesn't exist, stashing away its path, then finally we load up those two JSON files: The rest of that file handles checkpointing, so let's move on to that. Remember, in the training loop, each epoch we saved a checkpoint: ..and at the start of the code to generate some text, we load one: Let's take a look at saving first. Each checkpoint is a directory with a filename based on the timestamp when it was saved, inside the directory for the run that it relates to, so firstly we work out the full path for that: (The directories inside experiments are explicitly ignored in our file so that we don't accidentally commit them.) Now, we don't want half-saved checkpoints due to crashes or anything like that, so we initially create a directory to write to using the path that we're going to use but with at the end: Next, we write a file (the path within the checkpoint's dir is worked out by a helper function) containing some useful information about the model's progress -- it's epoch number, the training and validation loss, and the mapping that its tokeniser uses (from which we can later construct a new tokeniser): Then we dump the model's current parameters into a file using function from the Hugging Face library (getting the file's path through another helper function): Now that our checkpoint is complete, we can rename our temporary directory to the real name for the checkpoint: Next, we do some symlinks. We want a symlink in the directory called , which links to the checkpoint that had the lowest validation loss. The training loop is tracking whether any given epoch had the lowest, and you can see it passed in an parameter, so if that's true, we create the symlink, removing any pre-existing one: For completeness, we also create one that points to the most recent checkpoint -- that will always be the one we're doing right now, so: And that's it for saving! Loading is even simpler (and note that we can just specify "best" as the checkpoint due to that symlink -- I pretty much always do): So, we've made sure that the checkpoint directory is indeed a directory. Next, we load up the model metadata: ...and we use ' to load our parameters: Now we can construct a tokeniser based on that mapping that we put into the metadata: ...and an based on the other metadata parameters: and load the parameters into the model: That's it! We can return the model and the tokeniser for use: So that's all the code needed for checkpointing. Now let's look at the final QoL trick, one that I left out of the earlier list because it needs the checkpoints to work: charting our progress. Remember this line from the training loop, which was called after we saved our checkpoint? It generates charts like this: The chart is updated every epoch, and saved into the root of the directory. There's also a helpful file placed there that reloads that generated chart every second, so you can just load it into a browser tab while you are training and watch it live. Let's look into the code. It's in . 
The function starts like this: So, we use a utility function (which we'll get into in a moment) to load up the data -- training and validation loss per epoch, and the specific epoch that was the best. Once we have that, we just use (with my preferred xkcd styling) to plot the two loss lines: We also plot a single vertical red line at the best epoch so that we can see if we're past that and running into the patience period: Then a bit more pyplot boilerplate... ...and we've got our chart, saved as . Finally, we just copy that useful auto-reloading into the same directory as the chart: ...and we're done. So, how do we get the data? Originally I was keeping lists of loss values over time, but eventually realised that the data was already there in the checkpoint metadata files. So, the helper function just iterates over the checkpoints, skipping the symlinks, creating lists of (epoch number, loss) tuples for both training and validation loss using the numbers in those metadata files, and for the symlink just storing its epoch number: Those loss lists will just be in whatever random order returned them in, so we sort them by epoch number: ...and we have something we can return to the charting code: That brings us to the end of the charting code -- and, indeed, to the end of all of the code in this repo! So let's wrap up. That was quite a long writeup, but I think it was worthwhile. Indeed, if you look at the commit history, you'll see that there were one or two things where while explaining the code I realised that it was doing things badly -- not so badly that it didn't work, or gave bad results, but doing things in a way that offended my sense of what's right as an engineer. Hopefully it was interesting, and has set things up well for the next step, where I'll use the same framework, but plug in my own RNN implementation so that we can see how it compares. Stay tuned :-) Intuitively: if you train on "I like bacon", then "I like cheese", then "I like wine", then you can imagine that they might have different effects -- maybe the first would have the largest impact, then the second, then the third -- or perhaps it might be the other way around. By comparison, if you trained on all three in parallel, you would expect them to be more evenly balanced in their effect.  ↩ I'm accumulating a never-ending list of things to dig into in the future, but let me add yet another one: it would be good to work through how PyTorch uses this compute graph in practice to do all of its automated differentiation magic! Andrej Karpathy will likely pop up again, as he did pretty much that in his micrograd project .  ↩ In case you're wondering: I tend to use UK spelling like "tokeniser" in writing, as it's much more natural to me. But in code I tend to standardise (or standardize) on the US spelling. For private projects like this, it doesn't matter much, but when collaborating with other people from various places in the world, it's helpful to use a standardised spelling just to make life easier when searching code.  ↩ Sharp-eyed readers might note that my token IDs start at zero, while Karpathy's start at 1. Zero-based indexing is the natural way to represent them in Python, one-based in Lua. Keeping things natural like that makes it a bit easier when we convert things into one-hot vectors later.  ↩ Remember that gradients are vectors in a high-dimensional space. So to work out a measurement of size, for each parameter we square all of the numbers in its gradient, then add them together. 
We then add all of those squared numbers across all parameters together, and take the square root of the sum.  ↩ Thanks to Claude for generating that monstrosity of a Java class name. It added: "For bonus points, imagine this is in a package like: And it probably has exactly one method: :-)"  ↩

Vanishing/exploding gradients. Let's say that we're training a three-layer network on the 5,617,124 characters of the Project Gutenberg "Complete Works of Shakespeare" . That's essentially backpropagation through a 16-million layer network. You won't get far through that before your gradients vanish to zero or explode to infinity. The only meaningful parameter updates will be for the last something-or-other layers.

Batching. Running multiple inputs through a model in parallel has two benefits: it's faster and more efficient, and it means that your gradient updates are informed by multiple inputs at the same time, which will make them more stable. 1

Validation. There's nothing in there as a validation set, so we will have no way of checking whether our model is really learning, or just memorising the training set. (There's the same problem with the test set, but for this writeup I'll ignore that, as the solution is the same too.)

: the byte IDs of the bytes in that sequence -- these are the ones we'll run through the model, our Xs. Note that these are slices of the PyTorch tensors that were returned by the tokeniser, so they're tensors themselves.

: the shifted-left-by-one-plus-an-extra-byte target sequence as byte IDs -- the Ys for those Xs. These are likewise tensors.

: the raw bytes for the .

: the raw bytes for the .

It accepts the inputs as "token IDs", and maps them to a one-hot vector itself.

It applies dropout after the last layer of the LSTM (rather than just internally between the layers).

It expands the output vector back out to the vocab size with a linear layer after the LSTM so that we have logits across our vocab space. This is because an LSTM's output has the same dimensionality as the hidden state.

It runs those logits through softmax so that it returns probabilities.

Sean Goedecke 2 months ago

Should LLMs just treat text content as an image?

Several days ago, DeepSeek released a new OCR paper . OCR, or “optical character recognition”, is the process of converting an image of text - say, a scanned page of a book - into actual text content. Better OCR is obviously relevant to AI because it unlocks more text data to train language models on 1 . But there’s a more subtle reason why really good OCR might have deep implications for AI models. According to the DeepSeek paper, you can pull out 10 text tokens from a single image token with near-100% accuracy. In other words, a model’s internal representation of an image is ten times as efficient as its internal representation of text. Does this mean that models shouldn’t consume text at all? When I paste a few paragraphs into ChatGPT, would it be more efficient to convert that into an image of text before sending it to the model? Can we supply 10x or 20x more data to a model at inference time by supplying it as an image of text instead of text itself? This is called “optical compression”. It reminds me of a funny idea from June of this year to save money on OpenAI transcriptions: before uploading the audio, run it through ffmpeg to speed it up by 2x. The model is smart enough to still pull out the text, and with one simple trick you’ve cut your inference costs and time by half. Optical compression is the same kind of idea: before uploading a big block of text, take a screenshot of it (and optionally downscale the quality) and upload the screenshot instead. Some people are already sort-of doing this with existing multimodal LLMs. There’s a company selling this as a service , an open-source project, and even a benchmark . It seems to work okay! Bear in mind that this is not an intended use case for existing models, so it’s plausible that it could get a lot better if AI labs start actually focusing on it. The DeepSeek paper suggests an interesting way 2 to use tighter optical compression for long-form text contexts. As the context grows, you could decrease the resolution of the oldest images so they’re cheaper to store, but are also literally blurrier. The paper suggests an analogy between this and human memory, where fresh memories are quite vivid but older ones are vaguer and have less detail. Optical compression is pretty unintuitive to many software engineers. Why on earth would an image of text be expressible in fewer tokens than the text itself? In terms of raw information density, an image obviously contains more information than its equivalent text. You can test this for yourself by creating a text file, screenshotting the page, and comparing the size of the image with the size of the text file: the image is about 200x larger. Intuitively, the word “dog” only contains a single word’s worth of information, while an image of the word “dog” contains information about the font, the background and text color, kerning, margins, and so on. How, then, could it be possible that a single image token can contain ten tokens worth of text? The first explanation is that text tokens are discrete while image tokens are continuous . Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens. 
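To make that concrete, here's a toy comparison (the sizes are the "say, around 50,000" tokens and "say, 1000" numbers from above; the code is just an illustration):

```python
import math
import torch
import torch.nn as nn

vocab_size, dim = 50_000, 1_000

# A text token can only ever be one of vocab_size rows in a fixed lookup table...
embedding_table = nn.Embedding(vocab_size, dim)
text_token = embedding_table(torch.tensor([1234]))  # one of 50,000 possible points

# ...whereas an image token can be any point in the 1,000-dimensional space.
image_token = torch.randn(1, dim)

# Back-of-envelope: a text token carries at most log2(vocab_size) ~= 15.6 bits,
# so packing 10 text tokens into one image token only requires ~156 bits of
# information -- tiny relative to a continuous 1,000-dimensional vector.
print(10 * math.log2(vocab_size))  # ~156
```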
Another way of looking at the same intuition is that text tokens are a really inefficient way of expressing information . This is often obscured by the fact that text tokens are a reasonably efficient way of sharing information, so long as the sender and receiver both know the list of all possible tokens. When you send an LLM a stream of tokens and it outputs the next one, you’re not passing around slices of a thousand numbers for each token - you’re passing a single integer that represents the token ID. But inside the model this is expanded into a much more inefficient representation (inefficient because it encodes some amount of information about the meaning and use of the token) 3 . So it’s not that surprising that you could do better than text tokens. Zooming out a bit, it’s plausible to me that processing text as images is closer to how the human brain works . To state the obvious, humans don’t consume text as textual content; we consume it as image content (or sometimes as audio). Maybe treating text as a sub-category of image content could unlock ways of processing text that are unavailable when you’re just consuming text content. As a toy example, emoji like :) are easily understandable as image content but require you to “already know the trick” as text content 4 . Of course, AI research is full of ideas that sound promising but just don’t work that well. It sounds like you should be able to do this trick on current multimodal LLMs - particularly since many people just use them for OCR purposes anyway - but it hasn’t worked well enough to become common practice. Could you train a new large language model on text represented as image content? It might be tricky. Training on text tokens is easy - you can simply take a string of text and ask the model to predict the next token. How do you train on an image of text? You could break up the image into word chunks and ask the model to generate an image of the next word. But that seems to me like it’d be really slow, and tricky to check if the model was correct or not (e.g. how do you quickly break a file into per-word chunks, how do you match the next word in the image, etc). Alternatively, you could ask the model to output the next word as a token. But then you probably have to train the model on enough tokens so it knows how to manipulate text tokens. At some point you’re just training a normal LLM with no special “text as image” superpowers. AI labs are desperate for high-quality text, but only around 30% of written books have been digitized.
It’s really hard to find recent data on this, but as a very rough estimate Google Books had ~40M books in 2023, but Google estimates there to have been ~130M books in 2010. That comes out to 30%. ↩ See Figure 13. ↩ Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea. ↩ Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries. ↩

nathan.rs 2 months ago

BERT is just a Single Text Diffusion Step

A while back, Google DeepMind unveiled Gemini Diffusion , an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time, Gemini Diffusion creates whole blocks of text by refining random noise step-by-step. I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we’ve been doing since 2018 . The first thought I had was, “can we finetune a BERT-like model to do text generation?” I decided to try a quick proof of concept out of curiosity.
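For the curious, the basic idea -- start from a fully masked block and iteratively un-mask the most confident positions, so that each pass acts like one denoising step -- can be sketched roughly like this with Hugging Face transformers. This is a toy illustration of the concept, not the author's actual experiment, and an off-the-shelf BERT will produce poor text without fine-tuning:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

num_tokens, reveal_per_step = 16, 2
ids = torch.full((1, num_tokens), tok.mask_token_id)  # start from pure [MASK] "noise"

with torch.no_grad():
    while (ids == tok.mask_token_id).any():
        probs = torch.softmax(model(input_ids=ids).logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        # Only consider positions that are still masked
        conf = conf.masked_fill(ids != tok.mask_token_id, -1.0)
        top = conf.topk(reveal_per_step, dim=-1).indices[0]
        ids[0, top] = pred[0, top]  # un-mask the most confident positions

print(tok.decode(ids[0]))
```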

Ahead of AI 3 months ago

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

How do we actually evaluate LLMs? It’s a simple question, but one that tends to open up a much bigger discussion. When advising or collaborating on projects, one of the things I get asked most often is how to choose between different models and how to make sense of the evaluation results out there. (And, of course, how to measure progress when fine-tuning or developing our own.) Since this comes up so often, I thought it might be helpful to share a short overview of the main evaluation methods people use to compare LLMs. Of course, LLM evaluation is a very big topic that can’t be exhaustively covered in a single resource, but I think that having a clear mental map of these main approaches makes it much easier to interpret benchmarks, leaderboards, and papers. I originally planned to include these evaluation techniques in my upcoming book, Build a Reasoning Model (From Scratch) , but they ended up being a bit outside the main scope. (The book itself focuses more on verifier-based evaluation.) So I figured that sharing this as a longer article with from-scratch code examples would be nice. In Build A Reasoning Model (From Scratch) , I am taking a hands-on approach to building a reasoning LLM from scratch. If you liked “Build A Large Language Model (From Scratch)”, this book is written in a similar style in terms of building everything from scratch in pure PyTorch. Reasoning is one of the most exciting and important recent advances in improving LLMs, but it’s also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book , I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early-access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live. PS: There’s a lot happening on the LLM research front right now. I’m still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let’s discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses. There are four common ways of evaluating trained LLMs in practice: multiple choice , verifiers , leaderboards , and LLM judges , as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories. Figure 1: An overview of the 4 different evaluations models covered in this article. Furthermore the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation , as shown in the figure above. (There are also other measures, such as training loss, perplexity , and rewards , but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods. We begin with a benchmark‑based method: multiple‑choice question answering. Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu ). To illustrate this approach, figure 2 shows a representative task from the MMLU dataset. 
Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks, such as MMLU, test an LLM’s knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests. Note that figure 2 shows a simplified version of multiple-choice evaluation, where the model’s predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring . I implemented them here on GitHub . (As this builds on the concepts explained here, I recommended checking this out after completing this article.) The following subsections illustrate how the MMLU scoring shown in figure 2 can be implemented in code. First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub . Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience. Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. For this, we will work with an example from the MMLU dataset: Next, we define a function to format the LLM prompts. Let’s execute the function on the MMLU example to get an idea of what the formatted LLM input looks like: The output is: How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes? The model prompt, as shown above, provides the model with a list of the different answer choices and ends with an text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required. 
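As a rough illustration, a prompt-formatting helper along those lines might look like this (a sketch with hypothetical names; the distractor choices in the example dict are illustrative, not necessarily the dataset's exact ones):

```python
def format_mmlu_prompt(example):
    """Turn an MMLU-style dict into a zero-shot multiple-choice prompt."""
    letters = ["A", "B", "C", "D"]
    lines = [example["question"]]
    for letter, choice in zip(letters, example["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")  # nudges the model to reply with a single letter
    return "\n".join(lines)

example = {
    "question": ("How many ways are there to put 4 distinguishable balls "
                 "into 2 indistinguishable boxes?"),
    "choices": ["7", "11", "16", "8"],  # illustrative distractors
    "answer": 3,                        # index of the correct choice ("8")
}
print(format_mmlu_prompt(example))
```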
You can load examples from the MMLU dataset directly via the datasets library (which can be installed or ): Above, we used the subset; to get a list of the other subsets, use the following code: Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM: Then, with all that setup out of the way, we define the main scoring function below, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of letter A/B/C/D that the model prints. We can then check the generated letter using the function from the code block above as follows: The result is: As we can see, the generated answer is incorrect ( ) in this case. This was just one of the 270 examples from the subset in MMLU. The screenshot (Figure 4) below shows the performance of the base model and reasoning variant when executed on the complete subset. The code for this is available here on GitHub . Figure 4: Base and reasoning model performance on the MMLU subset Assuming the questions have an equal answer probability, a random guesser (with uniform probability choosing A, B, C, or D) is expected to achieve 25% accuracy. So both the base model and the reasoning model are not very good. Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model’s predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input. Figure 5: Other MMLU scoring methods are described and shared on GitHub here However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple-choice benchmarks like MMLU is that they only measure an LLM’s ability to select from predefined options, and they are thus not very useful for evaluating reasoning capabilities beyond checking whether (and how much) knowledge the model has forgotten compared to the base model. They do not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn’t necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps. Related to the multiple-choice question answering discussed in the previous section, verification-based approaches quantify the LLM’s capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below. Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset. When we compare the extracted answer with the provided answer, as shown in the figure above, we can employ external tools, such as code interpreters or calculator-like tools/software.
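A minimal sketch of such a verifier for math-style answers -- pull out whatever is inside the final \boxed{...} and compare it to the reference as an exact rational -- might look like this (my own simplified version, not the book's implementation):

```python
import re
from fractions import Fraction

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in the model output, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verify(model_output, reference):
    """True if the extracted boxed answer matches the reference answer."""
    answer = extract_boxed(model_output)
    if answer is None:
        return False
    try:
        return Fraction(answer) == Fraction(str(reference))  # numeric comparison
    except (ValueError, ZeroDivisionError):
        return answer == str(reference)                      # fall back to strings

print(verify(r"... so the total is \boxed{8}", 8))       # True
print(verify(r"... the answer is \boxed{3/4}", "0.75"))  # True
```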
The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development. I wrote a comprehensive 35-page on this topic in my “Build a Reasoning Model (From Scratch)” book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you’ll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub .) Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, none of the aforementioned methods evaluate LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely, LLM leaderboards. Figure 8: A mental model of the topics covered in this book with a focus on the judgment- and benchmark-based evaluation methods covered in this appendix. Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous section, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs. A popular leaderboard is LM Arena (formerly Chatbot Arena ), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9. Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer. These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10. Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9. The list below represents pairwise votes where the first model is the winner: In the list above, each tuple in the votes list represents a pairwise preference between two models, written as . So, means that a user preferred GPT-5 over a Claude-3 model answer. In the remainder of this section, we will turn the list into a leaderboard. For this, we will use the popular Elo rating system , which was originally developed for ranking chess players. Before we look at the concrete code implementation, in short, it works as follows. Each model starts with a baseline score. Then, after each comparison and the preference vote, the model’s rating is updated. 
(In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers the current model over a highly ranked model, the current model will get a relatively large rating update and move higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with rating points getting subtracted instead of added.)

The code to turn these pairwise votes into a leaderboard is a short function that takes the votes list as input and returns a ranking in which a higher score is better (a minimal sketch is included at the end of this section, and a complete code example is linked on GitHub below).

So, how does this work? For each pair, we compute the expected score of the winner from the current ratings:

expected_winner = 1 / (1 + 10 ** ((rating_loser - rating_winner) / 400))

This value is the model's predicted chance to win in a no-draw setting based on the current ratings, and it determines how large the rating update is: the winner gains an amount proportional to (1 - expected_winner), and the loser loses the same amount.

First, each model starts at the same baseline rating. If the two ratings (winner and loser) are equal, we have expected_winner = 0.5, which indicates an even match. In this case, winner and loser each receive a medium-sized update, half of the maximum possible step. Now, if a heavy favorite (a model with a much higher rating) wins, we have expected_winner close to 1. The favorite gains only a small amount, and the loser loses only a little. However, if an underdog (a model with a much lower rating) wins, we have expected_winner close to 0, and the winner gets almost the full points while the loser loses about the same magnitude.

Order matters

The Elo approach updates ratings after each match (model comparison), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can happen, especially when an upset happens early versus late. To reduce this order effect, we can shuffle the vote pairs, run the function multiple times, and average the ratings.

Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development.

Other ranking methods

LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley-Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term "Elo" remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub.

Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub.
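To make this concrete, here is a minimal sketch of the plain Elo update described above; the votes list, the baseline rating of 1,000, and the K-factor of 32 are illustrative assumptions rather than the article's exact values:

```python
def elo_ratings(votes, k=32, baseline=1000):
    # votes: list of (winner, loser) tuples; ratings are updated in vote order.
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, baseline)
        r_l = ratings.setdefault(loser, baseline)
        # Expected score of the winner given the current ratings.
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400))
        # The winner gains k * (1 - expected), the loser loses the same amount.
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return dict(sorted(ratings.items(), key=lambda item: item[1], reverse=True))

# Hypothetical pairwise votes, written as (winner, loser):
votes = [
    ("GPT-5", "Claude-3"),
    ("GPT-5", "Llama-4"),
    ("Claude-3", "Llama-4"),
    ("GPT-5", "Claude-3"),
]
for model, rating in elo_ratings(votes).items():
    print(f"{model}: {rating:.1f}")
```

Because the updates happen in vote order, shuffling the votes and averaging over several runs, as described in the "Order matters" note above, reduces the order effect.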
Method 4: Judging responses with other LLMs

In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU, which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don't account for synonyms, word changes, and so on.

One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons, as they involve human feedback (as well as the challenges that are associated with collecting this feedback).

A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM's response to a reference response and judge the response quality, as illustrated in Figure 12.

Figure 12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer.

In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (One of many examples is Phudge; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is also that evaluating an answer is often easier than generating one.

To implement a judge-based model evaluation as shown in Figure 12 programmatically in Python, we could either load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate, or we could use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of this section we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion-parameter gpt-oss open-weight model by OpenAI, as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article.

4.1 Implementing an LLM-as-a-judge approach in Ollama

Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text using LLMs (inference) and does not support training or fine-tuning LLMs.

To execute the following code, please install Ollama by visiting the official website at https://ollama.com and follow the provided instructions for your operating system:

- For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select "yes."
- For Linux users: Use the installation command available on the Ollama website.
Execute the command ollama run gpt-oss:20b on the command line (not in a Python session) to try out the 20-billion-parameter gpt-oss model. The first time you execute this command, the model, which takes up 14 GB of storage space, will be automatically downloaded. Note that the gpt-oss:20b in the ollama run gpt-oss:20b command refers to the 20-billion-parameter gpt-oss model.

Using Ollama with the gpt-oss:20b model requires approximately 13 GB of RAM. If your machine does not have sufficient RAM, you can try using a smaller model, such as the 4-billion-parameter qwen3:4b model via ollama run qwen3:4b, which only requires around 4 GB of RAM. For more powerful computers, you can also use the larger 120-billion-parameter gpt-oss model by replacing gpt-oss:20b with gpt-oss:120b. However, keep in mind that this model requires significantly more computational resources.

Once the model download is complete, we are presented with a command-line interface that allows us to interact with the model. For example, try asking the model, "What is 1+2?". You can end this ollama run gpt-oss:20b session using the input /bye.

In the remainder of this section, we will use the Ollama API. This approach requires that Ollama is running in the background. There are three different options to achieve this:

1. Run the ollama serve command in the terminal (recommended). This runs the Ollama backend as a server, by default on localhost port 11434. Note that it doesn't load a model until it's called through the API (later in this section).
2. Run the ollama run command as earlier, but keep the session open and don't exit it via /bye. As discussed earlier, this opens a minimal convenience wrapper around a local Ollama server. Behind the scenes, it uses the same server API as ollama serve.
3. Use the Ollama desktop app. Opening the desktop app runs the same backend automatically and provides a graphical interface on top of it.

Figure 13: Two different options to keep the Ollama server (/application) running so we can use it via the Ollama API in Python.

Ollama runs locally on our machine by starting a local server-like process. When running ollama serve in the terminal, as described above, you may encounter an error message saying that the address is already in use. If that's the case, try starting the server on a slightly different address (and if that address is also in use, keep incrementing the port number by one until you find an address that is not in use).

Before we use Ollama to evaluate model responses, a quick check verifies that the Ollama session is running properly. If this check reports that Ollama is not running, please verify that the ollama serve command or the Ollama application is actively running (see Figure 13).

In the remainder of this article, we will interact with the local gpt-oss model, running on our machine, through the Ollama REST API using Python, via a small helper function that sends a prompt to the local server and returns the generated text (a sketch follows below). For example, asking the same "What is 1+2?" question through the API yields the response "3". (The response can differ from what we'd get in the interactive ollama run session or the Ollama application due to different default settings.)

Using this function, we can evaluate the responses generated by our model with a prompt that includes a grading rubric asking the gpt-oss model to rate our target model's responses on a scale from 1 to 5, using a correct answer as a reference.
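Here is a minimal sketch of such a helper function, together with an illustrative grading prompt. It assumes Ollama's default local address (http://localhost:11434) and the /api/chat endpoint with streaming disabled; the function name, the prompt wording, and the deterministic options are my own illustrative choices rather than the article's exact code:

```python
import json
import urllib.request

def query_model(prompt, model="gpt-oss:20b", url="http://localhost:11434/api/chat"):
    # Send a single-turn chat request to the local Ollama server.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                               # return the full answer at once
        "options": {"temperature": 0.0, "seed": 123},  # make outputs more repeatable
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result["message"]["content"]

# An illustrative grading rubric (not the article's exact wording):
JUDGE_PROMPT = (
    "Given the instruction, the correct reference answer, and the model answer "
    "below, rate the model answer on a scale from 1 (worst) to 5 (best).\n\n"
    "Instruction: {instruction}\n"
    "Reference answer: {reference}\n"
    "Model answer: {model_answer}\n\n"
    "Respond with the integer score only. The score is: "
)

# Example usage (requires `ollama serve` running and the model pulled):
# print(query_model("What is 1+2?"))
# print(query_model(JUDGE_PROMPT.format(
#     instruction="What is 1+2?", reference="3", model_answer="The answer is 3.")))
```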
The grading prompt combines the instruction, the reference answer from the dataset, and the model answer to be judged, and asks the judge for a score between 1 and 5. The model answer in the prompt is intended to represent the response produced by our own model in practice; for illustration purposes, we hardcode a plausible model answer rather than generating it dynamically. (However, feel free to use the Qwen3 model we loaded at the beginning of this article to generate a real one.) Next, we render the prompt and pass it to the Ollama model. Ending the prompt with the request for the score incentivizes the judge to generate the rating directly. When we let the gpt-oss:20b model judge the response, the answer receives the highest score, which is reasonable, as it is indeed correct.

While this was a simple example stepping through the process manually, we could take the idea further and implement a for-loop that iteratively queries the model (for example, the Qwen3 model we loaded earlier) with questions from an evaluation dataset, grades each response via gpt-oss, and calculates the average score. You can find an implementation of such a script, where we evaluate the Qwen3 model on the MATH-500 dataset, here on GitHub.

Figure 14: A comparison of the Qwen3 0.6B base and reasoning variants on the first 10 examples in MATH-500, evaluated by gpt-oss:20b as a judge. You can find the code here on GitHub.

Related to symbolic verifiers and LLM judges, there is a class of learned models called process reward models (PRMs). Like judges, PRMs can evaluate reasoning traces beyond just the final answer, but unlike general judges, they focus specifically on the intermediate steps of reasoning. And unlike verifiers, which check correctness symbolically and usually only at the outcome level, PRMs provide step-by-step reward signals during training in reinforcement learning. We can categorize PRMs as "step-level judges," which are predominantly developed for training, not pure evaluation. (In practice, PRMs are difficult to train reliably at scale. For example, DeepSeek R1 did not adopt PRMs and instead relied on verifiers for the reasoning training.)

Judge-based evaluations offer advantages over preference-based leaderboards, including scalability and consistency, as they do not rely on large pools of human voters. (Technically, it is possible to outsource the preference-based rating behind leaderboards to LLM judges as well.) However, LLM judges also share similar weaknesses with human voters: results can be biased by model preferences, prompt design, and answer style. Also, there is a strong dependency on the choice of judge model and rubric, and they lack the reproducibility of fixed benchmarks.

In this article, we covered four different evaluation approaches: multiple choice, verifiers, leaderboards, and LLM judges. I know this was a long article, but I hope you found it useful for getting an overview of how LLMs are evaluated. A from-scratch approach like this can be verbose, but it is a great way to understand how these methods work under the hood, which in turn helps us identify weaknesses and areas for improvement. That being said, you are probably wondering, "What is the best way to evaluate an LLM?" Unfortunately, there is no single best method since, as we have seen, each comes with different trade-offs.
In short:

Multiple-choice
(+) Relatively quick and cheap to run at scale
(+) Standardized and reproducible across papers (or model cards)
(-) Measures basic knowledge recall
(-) Does not reflect how LLMs are used in the real world

Verifiers
(+) Standardized, objective grading for domains with ground truth
(+) Allows free-form answers (with some constraints on final answer formatting)
(+) Can also score intermediate steps if using process verifiers or process reward models
(-) Requires verifiable domains (for example, math or code), and building good verifiers can be tricky
(-) Outcome-only verifiers evaluate only the final answer, not reasoning quality

Arena-style leaderboards (human pairwise preference)
(+) Directly answers "Which model do people prefer?" on real prompts
(+) Allows free-form answers and implicitly accounts for style, helpfulness, and safety
(-) Expensive and time-intensive for humans
(-) Does not measure correctness, only preference
(-) Nonstationary populations can affect stability

LLM-as-a-judge
(+) Scalable across many tasks
(+) Allows free-form answers
(-) Dependent on the judge's capability (ensembles can make this more robust)
(-) Depends on rubric choice

While I am usually not a big fan of radar plots, one can be helpful here to visualize these different evaluation areas, as shown below.

Figure 15: A radar chart showing conceptually that we ideally want to pay attention to different areas when evaluating an LLM to identify its strengths and weaknesses.

For instance, a strong multiple-choice rating suggests that the model has solid general knowledge. Combine that with a strong verifier score, and the model is likely also answering technical questions correctly. However, if the model performs poorly on LLM-as-a-judge and leaderboard evaluations, it may struggle to write or articulate responses effectively and could benefit from some RLHF.

So, the best evaluation combines multiple areas. But ideally it also uses data that directly aligns with your goals or business problems. For example, suppose you are implementing an LLM to assist with legal or law-related tasks. It makes sense to run the model on standard benchmarks like MMLU as a quick sanity check, but ultimately you will want to tailor the evaluations to your target domain, such as law. You can find public benchmarks online that serve as good starting points, but in the end, you will want to test with your own proprietary data. Only then can you be reasonably confident that the model has not already seen the test data during training.

In any case, model evaluation is a very big and important topic. I hope this article was useful in explaining how the main approaches work, and that you took away a few useful insights for the next time you look at model evaluations or run them yourself. As always, Happy tinkering!

This magazine is a personal passion project, and your support helps keep it alive. If you'd like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I'm confident you'll get a lot out of these; they explain how LLMs work in a depth you won't find elsewhere.) Thanks for reading, and for helping support independent research!

Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot!
Your support means a great deal! Thank you!

0 views

The RAG Obituary: Killed by Agents, Buried by Context Windows

I’ve been working in AI and search for a decade. First building Doctrine, the largest European legal search engine, and now building Fintool, an AI-powered financial research platform that helps institutional investors analyze companies, screen stocks, and make investment decisions. After three years of building, optimizing, and scaling LLMs with retrieval-augmented generation (RAG) systems, I believe we’re witnessing the twilight of RAG-based architectures. As context windows explode and agent-based architectures mature, my controversial opinion is that the current RAG infrastructure we spent so much time building and optimizing is on the decline.

In late 2022, ChatGPT took the world by storm. People started endless conversations, delegating crucial work, only to realize that the underlying model, GPT-3.5, could only handle 4,096 tokens... roughly six pages of text! The AI world faced a fundamental problem: how do you make an intelligent system work with knowledge bases that are orders of magnitude larger than what it can read at once? The answer became Retrieval-Augmented Generation (RAG), an architectural pattern that would dominate AI for the next three years.

GPT-3.5 could handle 4,096 tokens, and the next model, GPT-4, doubled it to 8,192 tokens, about twelve pages. This wasn’t just inconvenient; it was architecturally devastating. Consider the numbers: a single SEC 10-K filing contains approximately 51,000 tokens (130+ pages). With 8,192 tokens, you could see less than 16% of a 10-K filing. It’s like reading a financial report through a keyhole!

RAG emerged as an elegant solution borrowed directly from search engines. Just as Google displays 10 blue links with relevant snippets for your query, RAG retrieves the most pertinent document fragments and feeds them to the LLM for synthesis. The core idea is beautifully simple: if you can’t fit everything in context, find the most relevant pieces and use those. It turns LLMs into sophisticated search result summarizers. Basically, LLMs can’t read the whole book, but they can know who dies at the end; convenient!

Long documents need to be chunked into pieces, and that’s when the problems start. Those digestible pieces are typically 400-1,000 tokens each, which is basically 300-750 words. The problem? It isn’t as simple as cutting every 500 words. Consider chunking a typical SEC 10-K annual report. The document has a complex hierarchical structure:

- Item 1: Business Overview (10-15 pages)
- Item 1A: Risk Factors (20-30 pages)
- Item 7: Management’s Discussion and Analysis (30-40 pages)
- Item 8: Financial Statements (40-50 pages)

After naive chunking at 500 tokens, critical information gets scattered:

- Revenue recognition policies split across 3 chunks
- A risk factor explanation broken mid-sentence
- Financial table headers separated from their data
- MD&A narrative divorced from the numbers it’s discussing

If you search for “revenue growth drivers,” you might get a chunk mentioning growth but miss the actual numerical data in a different chunk, or the strategic context from MD&A in yet another chunk!
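To make the failure mode concrete, here is a minimal sketch of what naive fixed-size chunking looks like; the whitespace split and the 500-token window are simplifying assumptions (real pipelines use proper tokenizers and usually add overlap):

```python
def naive_chunk(text, chunk_size=500):
    # Split on whitespace as a crude stand-in for a real tokenizer, then cut
    # the document into fixed-size windows with no regard for sections,
    # sentences, or tables.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

# A 130+ page 10-K (~51,000 tokens) becomes roughly 100 disconnected fragments:
# chunks = naive_chunk(open("10-K.txt").read())
```

Anything that happens to straddle a window boundary, such as a table header and its rows, ends up in different chunks.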
At Fintool, we’ve developed sophisticated chunking strategies that go beyond naive text splitting:

- Hierarchical Structure Preservation: We maintain the nested structure from Item 1 (Business) down to sub-sections like geographic segments, creating a tree-like document representation
- Table Integrity: Financial tables are never split—income statements, balance sheets, and cash flow statements remain atomic units with headers and data together
- Cross-Reference Preservation: We maintain links between narrative sections and their corresponding financial data, preserving the “See Note X” relationships
- Temporal Coherence: Year-over-year comparisons and multi-period analyses stay together as single chunks
- Footnote Association: Footnotes remain connected to their referenced items through metadata linking

Each chunk at Fintool is enriched with extensive metadata:

- Filing type (10-K, 10-Q, 8-K)
- Fiscal period and reporting date
- Section hierarchy (Item 7 > Liquidity > Cash Position)
- Table identifiers and types
- Cross-reference mappings
- Company identifiers (CIK, ticker)
- Industry classification codes

This allows for more accurate retrieval, but even our intelligent chunking can’t solve the fundamental problem: we’re still working with fragments instead of complete documents!

Once you have the chunks, you need a way to search them. One way is to embed your chunks. Each chunk is converted into a high-dimensional vector (typically 1,536 dimensions in most embedding models). These vectors live in a space where, theoretically, similar concepts are close together. When a user asks a question, that question also becomes a vector. The system finds the chunks whose vectors are closest to the query vector using cosine similarity. It’s elegant in theory; in practice, it’s a nightmare of edge cases.

Embedding models are trained on general text and struggle with specific terminologies. They find similarities, but they can’t distinguish between “revenue recognition” (accounting policy) and “revenue growth” (business performance). Consider this example:

Query: “What is the company’s litigation exposure?”

RAG searches for “litigation” and returns 50 chunks:

- Chunks 1-10: Various mentions of “litigation” in boilerplate risk factors
- Chunks 11-20: Historical cases from 2019 (already settled)
- Chunks 21-30: Forward-looking safe harbor statements
- Chunks 31-40: Duplicate descriptions from different sections
- Chunks 41-50: Generic “we may face litigation” warnings

What RAG reports: $500M in litigation (from the Legal Proceedings section)

What’s actually there:

- $500M in Legal Proceedings (Item 3)
- $700M in the Contingencies note (”not material individually”)
- $1B new class action in Subsequent Events
- $800M indemnification obligations (different section)
- $2B probable losses in footnotes (keyword “probable,” not “litigation”)

The actual exposure is $5B, 10x what RAG found. Oupsy!

By late 2023, most builders realized pure vector search wasn’t enough. Enter hybrid search: combine semantic search (embeddings) with traditional keyword search (BM25). This is where things get interesting. BM25 (Best Matching 25) is a probabilistic retrieval model that excels at exact term matching.
Unlike embeddings, BM25:

- Rewards Exact Matches: When you search for “EBITDA,” you get documents with “EBITDA,” not “operating income” or “earnings”
- Handles Rare Terms Better: Financial jargon like “CECL” (Current Expected Credit Losses) or “ASC 606” gets proper weight
- Document Length Normalization: Doesn’t penalize longer documents
- Term Frequency Saturation: Multiple mentions of “revenue” don’t overshadow other important terms

At Fintool, we’ve built a sophisticated hybrid search system:

1. Parallel Processing: We run semantic and keyword searches simultaneously
2. Dynamic Weighting: Our system adjusts weights based on query characteristics:
   - Specific financial metrics? BM25 gets 70% weight
   - Conceptual questions? Embeddings get 60% weight
   - Mixed queries? 50/50 split with result analysis
3. Score Normalization: Different scoring scales are normalized using:
   - Min-max scaling for BM25 scores
   - Cosine similarity (already normalized) for embeddings
   - Z-score normalization for outlier handling

So, at the end, the embeddings search and the keyword search each retrieve chunks, and the search engine combines them using Reciprocal Rank Fusion. RRF merges rankings so that items that consistently appear near the top across systems float higher, even if no system put them at #1!

So now you think it’s done, right? But hell no! Here’s what nobody talks about: even after all that retrieval work, you’re not done. You need to rerank the chunks one more time to get a good retrieval, and it’s not easy. Rerankers are ML models that take the search results and reorder them by relevance to your specific query, limiting the number of chunks sent to the LLM. Not only are LLMs context-poor, they also struggle when dealing with too much information. It’s vital to reduce the number of chunks sent to the LLM for the final answer.

The reranking pipeline:

1. Initial search retrieval with embeddings + keywords gets you 100-200 chunks
2. The reranker ranks the top 10
3. The top 10 are fed to the LLM to answer the question

Here is the challenge with reranking:

- Latency Explosion: Reranking adds between 300 and 2,000 ms per query. Ouch.
- Cost Multiplication: It adds significant extra cost to every query. For instance, Cohere Rerank 3.5 costs $2.00 per 1,000 search units, making reranking expensive.
- Context Limits: Rerankers typically handle few chunks at a time (Cohere Rerank supports only 4,096 tokens), so if you need to re-rank more than that, you have to split the work into different parallel API calls and merge them!
- Another Model to Manage: One more API, one more failure point

Re-ranking is one more step in a complex pipeline. What I find difficult with RAG is what I call the “cascading failure problem”:

1. Chunking can fail (split tables) or be too slow (especially when you have to ingest and chunk gigabytes of data in real time)
2. Embedding can fail (wrong similarity)
3. BM25 can fail (term mismatch)
4. Hybrid fusion can fail (bad weights)
5. Reranking can fail (wrong priorities)

Each stage compounds the errors of the previous stage.

Beyond the complexity of hybrid search itself, there’s an infrastructure burden that’s rarely discussed. Running production Elasticsearch is not easy. You’re looking at maintaining TB+ of indexed data for comprehensive document coverage, which requires 128-256 GB of RAM minimum just to get decent performance. The real nightmare comes with re-indexing. Every schema change forces a full re-indexing that takes 48-72 hours for large datasets.
On top of that, you’re constantly dealing with cluster management, sharding strategies, index optimization, cache tuning, backup and disaster recovery, and version upgrades that regularly include breaking changes.

Here are some structural limitations:

1. Context Fragmentation
   - Long documents are interconnected webs, not independent paragraphs
   - A single question might require information from 20+ documents
   - Chunking destroys these relationships permanently
2. Semantic Search Fails on Numbers
   - “$45.2M” and “$45,200,000” have different embeddings
   - “Revenue increased 10%” and “Revenue grew by a tenth” rank differently
   - Tables full of numbers have poor semantic representations
3. No Causal Understanding
   - RAG can’t follow “See Note 12” → Note 12 → Schedule K
   - Can’t understand that discontinued operations affect continuing operations
   - Can’t trace how one financial item impacts another
4. The Vocabulary Mismatch Problem
   - Companies use different terms for the same concept
   - “Adjusted EBITDA” vs “Operating Income Before Special Items”
   - RAG retrieves based on terms, not concepts
5. Temporal Blindness
   - Can’t distinguish Q3 2024 from Q3 2023 reliably
   - Mixes current period with prior period comparisons
   - No understanding of fiscal year boundaries

These aren’t minor issues. They’re fundamental limitations of the retrieval paradigm.

Three months ago, I stumbled on an innovation in retrieval that blew my mind. In May 2025, Anthropic released Claude Code, an AI coding agent that works in the terminal. At first, I was surprised by the form factor. A terminal? Are we back in 1980? No UI? Back then, I was using Cursor, a product that excelled at traditional RAG. I gave it access to my codebase to embed my files, and Cursor ran a search on my codebase before answering my query. Life was good. But when testing Claude Code, one thing stood out: it was better and faster, and not because its RAG was better, but because there was no RAG.

Instead of a complex pipeline of chunking, embedding, and searching, Claude Code uses direct filesystem tools:

1. Grep (Ripgrep)
   - Lightning-fast regex search through file contents
   - No indexing required. It searches live files instantly
   - Full regex support for precise pattern matching
   - Can filter by file type or use glob patterns
   - Returns exact matches with context lines
2. Glob
   - Direct file discovery by name patterns
   - Finds files like `**/*.py` or `src/**/*.ts` instantly
   - Returns files sorted by modification time (recency bias)
   - Zero overhead—just filesystem traversal
3. Task Agents
   - Autonomous multi-step exploration
   - Handle complex queries requiring investigation
   - Combine multiple search strategies adaptively
   - Build understanding incrementally
   - Self-correct based on findings
The agent can: - Load entire files or modules directly - Follow cross-references in real-time - Understand structure and relationships - Maintain complete context throughout investigation This isn’t just better than RAG—it’s a fundamentally different paradigm. And what works for code can work for any long documents that are not coding files. The context window explosion made Claude Code possible: 2022-2025 Context-Poor Era: - GPT-4: 8K tokens (~12 pages) - GPT-4-32k: 32K tokens (~50 pages) 2025 and beyond Context Revolution: - Claude Sonnet 4: 200k tokens (~700 pages) - Gemini 2.5: 1M tokens (~3,000 pages) - Grok 4-fast: 2M tokens (~6,000 pages) At 2M tokens, you can fit an entire year of SEC filings for most companies. The trajectory is even more dramatic: we’re likely heading toward 10M+ context windows by 2027, with Sam Altman hinting at billions of context tokens on the horizon. This represents a fundamental shift in how AI systems process information. Equally important, attention mechanisms are rapidly improving—LLMs are becoming far better at maintaining coherence and focus across massive context windows without getting “lost” in the noise. Claude Code demonstrated that with enough context, search becomes navigation: - No need to retrieve fragments when you can load complete files - No need for similarity when you can use exact matches - No need for reranking when you follow logical paths - No need for embeddings when you have direct access It’s mind-blowing. LLMs are getting really good at agentic behaviors meaning they can organize their work into tasks to accomplish an objective. Here’s what tools like ripgrep bring to the search table: - No Setup : No index. No overhead. Just point and search. - Instant Availability : New documents are searchable the moment they hit the filesystem (no indexing latency!) - Zero Maintenance : No clusters to manage, no indices to optimize, no RAM to provision - Blazing Fast : For a 100K line codebase, Elasticsearch needs minutes to index. Ripgrep searches it in milliseconds with zero prep. - Cost : $0 infrastructure cost vs a lot of $$$ for Elasticsearch So back to our previous example on SEC filings. An agent can SEC filing structure intrinsically: - Hierarchical Awareness : Knows that Item 1A (Risk Factors) relates to Item 7 (MD&A) - Cross-Reference Following : Automatically traces “See Note 12” references - Multi-Document Coordination : Connects 10-K, 10-Q, 8-K, and proxy statements - Temporal Analysis : Compares year-over-year changes systematically For searches across thousands of companies or decades of filings, it might still use hybrid search, but now as a tool for agents: - Initial broad search using hybrid retrieval - Agent loads full documents for top results - Deep analysis within full context - Iterative refinement based on findings My guess is traditional RAG is now a search tool among others and that agents will always prefer grep and reading the whole file because they are context rich and can handle long-running tasks. 
Consider our $6.5B lease obligation question as an example:

Step 1: Find “lease” in the main financial statements → Discovers “See Note 12”
Step 2: Navigate to Note 12 → Finds “excluding discontinued operations (Note 23)”
Step 3: Check Note 23 → Discovers $2B additional obligations
Step 4: Cross-reference with MD&A → Identifies management’s explanation and adjustments
Step 5: Search for “subsequent events” → Finds post-balance sheet $500M lease termination

Final answer: $5B continuing + $2B discontinued - $500M terminated = $6.5B

The agent follows references like a human analyst would. No chunks. No embeddings. No reranking. Just intelligent navigation.

Basically, RAG is like a research assistant with perfect memory but no understanding:

- “Here are 50 passages that mention debt”
- Can’t tell you if debt is increasing or why
- Can’t connect debt to strategic changes
- Can’t identify hidden obligations
- Just retrieves text, doesn’t comprehend relationships

Agentic search is like a forensic accountant:

- Follows the money systematically
- Understands accounting relationships (assets = liabilities + equity)
- Identifies what’s missing or hidden
- Connects dots across time periods and documents
- Challenges management assertions with data

1. Increasing Document Complexity
   - Documents are becoming longer and more interconnected
   - Cross-references and external links are proliferating
   - Multiple related documents need to be understood together
   - Systems must follow complex trails of information
2. Structured Data Integration
   - More documents combine structured and unstructured data
   - Tables, narratives, and metadata must be understood together
   - Relationships matter more than isolated facts
   - Context determines meaning
3. Real-Time Requirements
   - Information needs instant processing
   - No time for re-indexing or embedding updates
   - Dynamic document structures require adaptive approaches
   - Live data demands live search
4. Cross-Document Understanding
   Modern analysis requires connecting multiple sources:
   - Primary documents
   - Supporting materials
   - Historical versions
   - Related filings
   RAG treats each document independently. Agentic search builds cumulative understanding.
5. Precision Over Similarity
   - Exact information matters more than similar content
   - Following references beats finding related text
   - Structure and hierarchy provide crucial context
   - Navigation beats retrieval

The evidence is becoming clear. While RAG served us well in the context-poor era, agentic search represents a fundamental evolution. The potential benefits of agentic search are compelling:

- Elimination of hallucinations from missing context
- Complete answers instead of fragments
- Faster insights through parallel exploration
- Higher accuracy through systematic navigation
- Massive infrastructure cost reduction
- Zero index maintenance overhead

The key insight? Complex document analysis—whether code, financial filings, or legal contracts—isn’t about finding similar text. It’s about understanding relationships, following references, and maintaining precision. The combination of large context windows and intelligent navigation delivers what retrieval alone never could.

RAG was a clever workaround for a context-poor era. It helped us bridge the gap between tiny windows and massive documents, but it was always a band-aid. The future won’t be about splitting documents into fragments and juggling embeddings. It will be about agents that can navigate, reason, and hold entire corpora in working memory.
We are entering the post-retrieval age. The winners will not be the ones who maintain the biggest vector databases, but the ones who design the smartest agents to traverse abundant context and connect meaning across documents. In hindsight, RAG will look like training wheels. Useful, necessary, but temporary. The next decade of AI search will belong to systems that read and reason end-to-end. Retrieval isn’t dead—it’s just been demoted.

0 views
Taranis 3 months ago

LLMs are a failure. A new AI winter is coming.

Like many people, I got pretty excited when it was discovered that the transformer neural network architecture appeared to break through many years of stagnation in AI research. Chatbots suddenly had emergent capabilities, derived almost entirely from unstructured, unsupervised learning, far surpassing older technologies. My first experiences were with unreleased models, pre-ChatGPT, and I was seriously impressed. Though these early, small models would often mess up, even generating streams of garbage text, when they worked they worked. Spookily well. I completely understand why some people at the time thought they were sentient – this is a whole other discussion for another time. People were saying that this meant that the AI winter was over, and a new era was beginning.

I should explain, for anyone who hasn't heard that term before, that way back in the day, when early AI research was seemingly yielding significant results, there was much hope, as there is now, but ultimately the technology stagnated. The first time around, AI was largely symbolic – this basically means that attempts to model natural language understanding and reasoning were based essentially on hard-coded rules. This worked, up to a point, but it was soon clear that it was simply impractical to build a true AI that way. Human language is too messy for mechanised parsing to work in a general way. Reasoning required far too much world knowledge for it to be practical to write the code by hand, and nobody knew how to extract that knowledge without human intervention. The other huge problem with traditional AI was that many of its algorithms were NP-complete, which meant that whilst a lot of the time you got a result, often you just didn't, with the algorithm taking an arbitrarily long time to terminate. I doubt anyone can prove this – I certainly wouldn't attempt it – but I strongly suspect that 'true AI', for useful definitions of that term, is at best NP-complete, possibly much worse. Though quantum computing in principle could give some leverage here, none of the technologies currently being built or considered feasible are likely to be useful. There are just not enough qubits to represent the kinds of data that would need to be processed – this is a way, way harder problem than trying to reverse encryption secured by the difficulty of prime factorization.

So then came transformers. Seemingly capable of true AI, or at least of scaling to being good enough to be called true AI, with astonishing capabilities. For the uninitiated, a transformer is basically a big pile of linear algebra that takes a sequence of tokens and computes the likeliest next token. More specifically, it is fed one token at a time, which builds an internal state that ultimately guides the generation of the next token. This sounds bizarre and probably impossible, but the huge research breakthrough was figuring out that, by starting with essentially random coefficients (weights and biases) in the linear algebra and back-propagating errors during training, these weights and biases could eventually converge on something that worked. Exactly why this works is still somewhat mysterious, though progress has been made. Transformers aren't killed by the NP-completeness and scaling problems that caused the first AI winter. Technically, a single turn of the handle – generating the next token from the previous token and some retained state – always takes the same amount of time.
This inner loop isn't Turing-complete – a simple program with a while loop in it is computationally more powerful. If you allow a transformer to keep generating tokens indefinitely, this is probably Turing-complete, though nobody actually does that because of the cost. Transformers also solved scaling, because their training can be unsupervised (though, practically, they do often need supervised training in order to create guardrails against dangerous behaviour). It is now standard practice to train new models on just about every book ever written and everything that can be scraped from the internet.

That's the good news. That was the good news. But we've gone past that point now, and we are now all up against the reality of widespread use of transformers. All transformers have a fundamental limitation, which cannot be eliminated by scaling to larger models, more training data or better fine-tuning. It is fundamental to the way that they operate. On each turn of the handle, transformers emit one new token (a token is analogous to a word, but in practice may represent word parts or even complete commonly used small phrases – this is why chatbots don't know how to spell!). In practice, the transformer actually generates a number for every possible output token, with the highest number being chosen in order to determine the token. This token is then fed back, so that the model generates the next token in the sequence.

The problem with this approach is that the model will always generate a token, regardless of whether the context has anything to do with its training data. Putting it another way, the model generates tokens on the basis of what 'looks most plausible' as a next token. If this is a bad choice, and gets fed back, the next token will be generated to match that bad choice. And as the handle keeps turning, the model will generate text that looks plausible. Models are very good at this, because this is what they are trained to do. Indeed, it's all they can do. This is the root of the hallucination problem in transformers, and it is unsolvable because hallucinating is all that transformers can do.

I would conjecture that this is another manifestation of the NP-completeness wall that slammed symbolic AI, causing the first AI winter. It's always possible to turn an NP-complete algorithm into one that runs quickly, if you don't mind that it fails to generate any output when you hit a timeout. The transformer equivalent of this is generating plausible, wrong, hallucinated output in cases where it can't pattern-match a good result based on its training. The problem, though, is that with traditional AI algorithms you typically know if you've hit a timeout, or if none of your knowledge rules match. With transformers, generating wrong output looks exactly like generating correct output, and there is no way to know which is which.

Practically, this manifests as transformers generating bad output a percentage of the time. Depending on the context, and how picky you need to be about recognizing good or bad output, this might be anywhere from a 60% to a 95% success rate, with the remaining 5%-40% being bad results. This just isn't good enough for most practical purposes. More concerning is the fact that larger transformer models produce extremely plausible bad output, that can only be identified as bad by genuine experts. The rumour mill has it that about 95% of generative AI projects in the corporate world are failures.
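To illustrate the feedback loop described above, here is a toy sketch (the vocabulary and scoring function are random stand-ins of my own, not a real model) of greedy next-token generation: each step produces a score for every vocabulary token, the highest-scoring token is appended, and generation continues whether or not any continuation is actually well-grounded.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "roof", "."]

def toy_scores(context):
    # Stand-in for a transformer's output layer: one number per vocabulary token.
    # A real model computes these from learned weights; here they are random,
    # which is the point -- *something* always gets the highest score.
    return rng.normal(size=len(VOCAB))

context = ["the", "cat", "sat"]
for _ in range(4):
    scores = toy_scores(context)
    next_token = VOCAB[int(np.argmax(scores))]  # a token is always emitted...
    context.append(next_token)                  # ...and fed back, plausible or not

print(" ".join(context))
```

Nothing in this loop can signal "I don't know"; the argmax always exists, which is the mechanical version of the argument being made here.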
This isn't really surprising to anyone who was around for the dot com bubble, where corporate executives all seemed to assume that just being online would somehow transform their businesses, and that new ventures only really needed user counts, with the financials sorting themselves out later. The same thing is happening again with generative AI, though the numbers are far larger. It is absolutely inevitable that the bubble will burst, and fairly soon. Expect OpenAI to crash, hard, with investors losing their shirts. Expect AI infra spends to be cancelled and/or clawed back. Expect small AI startups that aren't revenue positive to vanish overnight. Expect use cases based on unrealistic expectations of LLM capabilities to crash the hardest.

A good example is transformers used to assist in programming, or to generate code from scratch. This has convinced many non-programmers that they can program, but the results are consistently disastrous, because it still requires genuine expertise to spot the hallucinations. Plausible hallucinations in code often result in really horrible bugs, security holes, etc., and can be incredibly difficult to find and fix. My own suspicion is that this might get you close to what you think is finished, but actually getting over the line to real production code still requires real engineering, and it's a horrible liability to have to maintain a codebase that nobody on the team actually authored.

Transformers must never be used for certain applications – their failure rate is unacceptable for anything that might directly or indirectly harm (or even significantly inconvenience) a human. This means that they should never be used in medicine, for evaluation in school or college, for law enforcement, for tax assessment, or a myriad of other similar cases. It is difficult to spot errors even when you are an expert, so non-expert users have no chance whatsoever.

The technology won't disappear – existing models, particularly in the open source domain, will still be available, and will still be used – but expect only a few 'killer app' use cases to remain, with the rest falling away. We're probably stuck with spammy AI slop, and with high school kids using gen AI to skip their boring homework. We'll probably keep AI features in text editors, and a few other places.

I know that this is a currently unpopular opinion. It is based on solid science, however. For what it's worth, I founded a chatbot company back in the late 90s, based on symbolic AI technology, that went splat in the dot com crash. I've been around this block, and I've stayed up to date on the technology – I've built my own transformer from scratch, and have experimented quite a bit. My advice: unwind as much exposure as you can to a forthcoming AI bubble crash. Winter is coming, and it's harsh on tulips.

0 views
Gregory Gundersen 3 months ago

A History of Large Language Models

Large language models (LLMs) still feel a bit like magic to me. Of course, I understand the general machinery enough to know that they aren’t, but the gap between my outdated knowledge of the field and the state-of-the-art feels especially large right now. Things are moving fast. So six months ago, I decided to close that gap just a little by digging into what I believed was one of the core primitives underpinning LLMs: the attention mechanism in neural networks.

I started by reading one of the landmark papers in the literature, which was published by Google Brain in 2017 under the catchy title Attention is all you need (Vaswani et al., 2017). As the title suggests, the authors did not invent the attention mechanism. Rather, they introduced a neural network architecture which was in some sense “all attention”. This architecture is the now-famous transformer. Clearly the transformer stands in contrast to whatever came before it, but what was that, and what did the transformer do differently?

To answer these questions, I read a lot of papers, and the context that felt natural to provide here grew the more that I read. I went down the rabbit hole, and when I came out, I realized that what had started as a study of attention had grown into a bigger story. Attention is still the throughline, but there are other important themes, such as how neural networks generalize and the bitter lesson that simple methods that scale seem to triumph over clever methods which do not. This post is the product of that deep dive, and it is a stylized history of LLMs.

As a caveat, real life is endlessly detailed, and any summary or synthesis inevitably flattens this detail. So I will accidentally or intentionally skip over many important and related papers and ideas in the service of a synthesis. I will also skip over practicalities such as data preprocessing and advances in hardware and computing. My focus will be on what I view as the main methodological landmarks, and this history is simply one of many ways to tell this story.

I’ll start with an old idea, one so ubiquitous today that it might seem silly to belabor here. The idea is that neural networks automatically generalize using distributed representations. This idea has its roots in computational neuroscience, particularly Connectionism (McCulloch & Pitts, 1943), and was discussed explicitly in the 1980s in papers like Learning representations by back-propagating errors (Rumelhart et al., 1986) and Learning distributed representations of concepts (Hinton, 1986). Understanding it is key to understanding why LLMs work at all, and thus to understanding the long line of academic research driving towards them.

But first, a problem. The goal of natural language processing (NLP) is to model human language using computers. Until the 1980s, NLP systems were mostly based on handwritten rules and handcrafted features. However, by the early 1990s, researchers were exploring the use of statistical methods from machine learning. For an early and seminal example, see A statistical approach to machine translation (Brown et al., 1990). The core idea of statistical NLP is to model human language using a statistical language model, which is a probability distribution over all possible sequences in a language. This distribution is typically factorized such that each word depends on all words that precede it:

$$p(w_{1:T}) = \prod_{t=1}^T p\left(w_t \mid w_{1:t-1} \right). \tag{1}$$
Throughout this post, I will use the notation $w_{i:j}$ to denote elements in a sequence from positions $i$ to $j$ inclusive (where $i \leq j$):

$$w_{i:j} := \{w_i, w_{i+1}, \dots, w_{j-1}, w_j\}. \tag{2}$$

Given a good statistical model $p(w_{1:T})$, we can do many things. For example, we can rank the likelihood of different sequences of words and use that ranking to decide on things like a conversational agent’s output. Or we can translate a source sequence $s_{1:T}$ into a target sequence $w_{1:T}$ if we have the conditional probabilities between the two:

$$p(w_{1:T} \mid s_{1:T}) \propto p(s_{1:T} \mid w_{1:T})\, p(w_{1:T}). \tag{3}$$

Here, $p(w_{1:T})$ would be our language model of the target language, and $p(s_{1:T} \mid w_{1:T})$ would be our translation model.

Today, this view is so pervasive that it might feel obvious, but with a little imagination, I think it’s easy to see how wrong this might have felt to a linguist forty-odd years ago. Equation $1$ captures no language structure or parts of speech such as nouns or verbs or adjectives—see e.g. (Chomsky, 1956) on formal grammars. Instead, it reduces the complexity of human language to next-word prediction. If we didn’t know already that this worked, we might doubt that it would.

More importantly for us, estimating the model in Equation $1$ is hard! The main challenge is the curse of dimensionality. There are many, many words in a vocabulary. For example, linguists estimate that English has roughly a million words, give or take a few hundred thousand depending on how you count them. Furthermore, this problem explodes in some tasks such as translation, where there are many possible conditional probabilities $p(s_{1:T} \mid w_{1:T})$. So when estimating the conditional probabilities of our language model, we cannot possibly encounter all possible combinations. We have a data sparsity problem, and estimating the true probabilities becomes impossible.

Perhaps the oldest idea to tackle this problem was proposed in Andrey Markov’s pioneering mathematical analysis of Pushkin’s Eugene Onegin (Markov, 1913). He made the assumption that each conditional probability in Equation $1$ only depends on the previous $N$ terms:

$$p(w_{1:T}) = \prod_{t=1}^T p \left( w_t \mid w_{1:t-1} \right) \approx \prod_{t=1}^T p \left(w_t \mid w_{t-N:t-1} \right). \tag{4}$$

Today, we would call this a “Markov assumption”, and Equation $4$ is the famous $N$-gram model. Particularly for small $N$, say $N=1$ or $N=2$, we might be able to get reasonable estimates of data. But here is the problem, and this problem is a central theme driving towards the attention mechanism: the Markov assumption destroys context.
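As a concrete illustration of Equation 4, here is a minimal sketch of a count-based bigram model ($N$-gram with a one-word context), using a toy corpus of my own choosing; nothing here comes from the papers discussed in this post.

```python
from collections import Counter, defaultdict

# Toy corpus; any tokenized text would do.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, curr in zip(corpus[:-1], corpus[1:]):
    counts[prev][curr] += 1   # count how often `curr` follows `prev`

def p(curr, prev):
    """Maximum-likelihood estimate of p(w_t | w_{t-1}) from raw counts."""
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0

print(p("cat", "the"))  # 0.25: "the" is followed by cat/mat/dog/rug equally often
```

The sparsity problem is visible even here: any word pair that never occurs in the corpus gets probability zero, no matter how plausible it is.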
Without more context, a language model can never replicate the complexity and nuance of natural language. As I understand it, this was conceptually the state of the field circa 2000. But then in 2003, a seminal paper was published: A neural probabilistic language model (Bengio et al., 2003). In that paper, the authors proposed a novel idea: to avoid this data sparsity problem, this curse of dimensionality, we can use neural networks to learn a language model using what they call “distributed representations” of words. (Today, we might call these “word embeddings”.) They proposed three core ideas. First, they represented each word as a real-valued vector or embedding; then, they expressed Equation $1$ in terms of these embeddings; and finally, they trained a neural network to simultaneously learn the embeddings and the parameters of the probability function (neural network) in Equation $1$ using back-propagation (Rumelhart et al., 1986).

That’s a lot, so let’s break it down a bit. Our goal here is to learn a good model $f_{\boldsymbol{\Theta}}$ of natural language such that

$$p(w_t \mid w_{1:t-1}) \approx f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}). \tag{5}$$

So the left-hand side is the true conditional distribution, capturing next-word prediction. It’s the goal of language modeling. But in practice, modeling the full context is hard. So we settle for the right-hand side, which is a parametric approximation $f_{\boldsymbol{\Theta}}$ of this true distribution with a context window of size $N$.

In Bengio, they model $f_{\boldsymbol{\Theta}}$ using two components. First, they represent words as vectors. Let $\mathcal{V}$ denote our vocabulary, which is simply a set of integers $\mathcal{V} = \{1, 2, \dots, V\}$ indexing all $V$ words in a language. We will represent each word as a $D$-vector, and so we can represent the entire language as a matrix $\mathbf{C} \in \mathbb{R}^{V \times D}$ (Figure 1). Now for the $t$-th word in a sequence $w_{1:T}$, we have an associated index in the vocabulary, which we will denote as $I(w_t) \in \mathcal{V}$. This notation might be a bit odd, but I’m careful here because $w_t$ is not a well-defined mathematical object, and it cannot index $\mathbf{C}$. But $I(w_t)$ is an integer and can index $\mathbf{C}$, and so $\mathbf{c}_{I(w_t)}$ is a $D$-dimensional vector (a row vector of $\mathbf{C}$) representing the $I(w_t)$-th word in the vocabulary, associated with the $t$-th word in the sequence. This vector is what we are calling an “embedding” or “distributed representation”.

Second, Bengio et al represent the probability function over words (Equation $1$) as a feed-forward neural network $g$ with parameters $\boldsymbol{\Omega}$ and arguments $\mathbf{C}$:

$$f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}) = g_{\boldsymbol{\Omega}}\left(\mathbf{c}_{I(w_{t-1})}, \dots, \mathbf{c}_{I(w_{t-N})}\right). \tag{6}$$

They then use back-propagation to jointly estimate the parameters

$$\boldsymbol{\Theta} := \{\mathbf{C}, \boldsymbol{\Omega}\}. \tag{7}$$
In other words, they learn the neural network parameters $\boldsymbol{\Omega}$ at the same time as learning the word embeddings $\mathbf{C}$. Note that “distributed representation” can refer to either the continuously-valued vector, e.g. word embedding, or the concept distributed across neurons. This duality is exemplified in $\mathbf{C}$ which is both a set of learnable parameters and the embeddings themselves!

Why might this work? The authors explain the idea so well that it’s worth just quoting the original paper:

In the proposed model, it will so generalize because “similar” words are expected to have a similar feature vector, and because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability. Therefore, the presence of only one of the above sentences in the training data will increase the probability, not only of that sentence, but also of its combinatorial number of “neighbors” in sentence space.

This is a beautiful idea. If we have word embeddings that are “well-organized” in the sense that words that play similar roles in sentences (semantically and syntactically) have similar embeddings and if we have a smooth function from word embeddings to probabilities, then small changes in words lead to small changes in embeddings which lead to small changes in probabilities (Figure 2).

Pause for a moment to really think about this. Words are discrete objects, and a “small change in a word”, while intuitive to humans, is ill-defined. But this approach concretizes what that means. To quote the paper Linguistic regularities in continuous space word representations (Mikolov et al., 2013), which we’ll discuss later:

Whereas an $N$-gram model works in terms of discrete units that have no inherent relationship to one another, a continuous space model works in terms of word vectors where similar words are likely to have similar vectors. Thus, when the model parameters are adjusted in response to a particular word or word-sequence, the improvements will carry over to occurrences of similar words and sequences.

For example, if the words “dog” and “cat” are nearby in word-embedding space, then maybe “The cat is walking on the sidewalk” and “The dog is walking on the sidewalk” should have similar probabilities. And only one of these two sentences would need to exist in the training data for the model to generalize well to both sentences!

As I mentioned, this idea was not entirely new in 2003. Since the 1980s, researchers had known that neural networks can generalize because they distribute their representation across many neurons (Hinton, 1986). Each new example modifies the weights, incorporating new knowledge into the old. However, (Bengio et al., 2003) is a landmark paper in NLP because it was the first application of this idea to language modeling. The Bengio paper took seriously the idea that we could build a statistical model of language using the distributed representations of words. It was the first hint that we could use neural networks to overcome the curse of dimensionality that plagued statistical NLP.

This is a promising idea, but we glossed over an important detail: how do we actually train this model? What is the loss function or objective that the neural network should use? And given a fit model, how do we generate a new sequence?
These are important questions to answer per se, but they are also important questions because, at a conceptual level, there is really no difference between Bengio’s model and the frontier large language models today. So understanding this is critical to understanding LLMs. Both are autoregressive models and trained using next-word prediction.

As an example, imagine we have the following input sentence, which is a quote from Virginia Woolf’s A Room of One’s Own:

$$\text{``Intellectual freedom depends upon material things.''} \tag{8}$$

Now imagine that our model’s context window has size $N=2$ and let $\mathbf{c}_p$ denote a padding $D$-vector of all zeros. In Bengio’s model, we would start by representing just the first word, “intellectual”, as a word embedding. So the first non-zero input to our model would be:

$$\mathbf{x}_2 = \left[ \begin{array}{l} \mathbf{c}_p \\ \mathbf{c}_{I(w_1)} \end{array} \right] = \left[ \begin{array}{l} \mathbf{c}_p \\ \mathbf{c}_{I(\text{``intellectual''})} \end{array} \right]. \tag{9}$$

The output of the neural network would be a $V$-dimensional vector representing the probability distribution over $p(w_2 \mid w_1)$. Illustratively:

$$\mathbf{y}_2 = \left[ \begin{array}{l} p(w_2 = \text{``about''}) \\ p(w_2 = \text{``above''}) \\ \qquad\;\;\vdots \\ p(w_2 = \text{``freedom''}) \\ \qquad\;\;\vdots \end{array} \right]. \tag{10}$$

We would then compute the cross-entropy loss between this output vector and the true distribution, which is really just a one-hot vector with $1$ for the word “freedom” and $0$ everywhere else.

We would then repeat this process on the next word. So the next input sequence would be

$$\mathbf{x}_3 = \left[ \begin{array}{l} \mathbf{c}_{I(\text{``intellectual''})} \\ \mathbf{c}_{I(\text{``freedom''})} \end{array} \right], \tag{11}$$

and the output would represent the probability distribution $p(w_3 \mid w_{1:2})$. And again, we would minimize the cross-entropy loss between its associated output vector and a one-hot vector encoding the word “depends”. We would repeat this process until the end of the sentence.

Of course, longer sequences are more expensive to train in this way, and this is precisely the point of the context window in Bengio’s paper. We only consider the $N$ previous words when predicting the next word. This idea of a limited context window is critical, as it is a constraint that persists into the present day. In this example, since $N=2$, the third input would be

$$\mathbf{x}_4 = \left[ \begin{array}{l} \mathbf{c}_{I(\text{``freedom''})} \\ \mathbf{c}_{I(\text{``depends''})} \end{array} \right]. \tag{12}$$

So the model completely loses the word “intellectual”. It is now outside the context.
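To see the sliding window mechanically, here is a minimal sketch (my own toy token handling, not the paper’s code) that builds the (context, target) training pairs for $N = 2$, with a "<pad>" token standing in for the zero padding embedding $\mathbf{c}_p$.

```python
# Build (context, target) pairs with a fixed context window of N previous words.
N = 2
words = "intellectual freedom depends upon material things .".split()

padded = ["<pad>"] * N + words
pairs = [(padded[i:i + N], padded[i + N]) for i in range(1, len(words))]

for context, target in pairs:
    print(context, "->", target)
# ['<pad>', 'intellectual'] -> freedom
# ['intellectual', 'freedom'] -> depends
# ['freedom', 'depends'] -> upon   <- "intellectual" has already fallen out of context
# ...
```

Each pair corresponds to one cross-entropy term in the training objective; the third pair is exactly the point where the word “intellectual” drops out of the window.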
Since minimizing the cross-entropy loss is equivalent to maximizing the log likelihood—see here for an example if this idea is new to you—we can generalize the logic above by saying that we want to maximize the log likelihood of our training data, again using a neural network as a parametric function approximation of the true distribution:

$$\boldsymbol{\Theta}^{\star} = \arg\!\max_{\boldsymbol{\Theta}} \left\{ \sum_{t=1}^T \log g_{\boldsymbol{\Omega}} \left(\mathbf{c}_{I(w_{t-N})}, \dots, \mathbf{c}_{I(w_{t-1})} \right) \right\}. \tag{13}$$

Of course, we can estimate $\boldsymbol{\Theta}^{\star}$ by minimizing the negative log likelihood using gradient descent via back-propagation. That’s it. At the conceptual level, this framework is no different from how frontier large language models are trained today. As we will see later though, there is a lot of additional machinery that is needed to make these models work in practice.

Finally, imagine we fit our model, meaning we find good parameters $\boldsymbol{\Theta}^{\star}$ that maximize our log likelihood. How can we use these parameters to generate a random sequence or sentence? We could draw the first word at random from the vocabulary. And then we could draw the next word conditional on the first word from our parametric approximation of $p(w_2 \mid w_1)$. And then we could draw the third word conditional on the second and first words from our parametric approximation of $p(w_3 \mid w_{1:2})$. And so on. This is why LLMs can both understand natural language and generate new sentences. They are not just descriptive models; they are generative models.

There are some subtleties I am glossing over, such as special embeddings to denote the start and end of a sequence, preprocessing steps like lowercasing words, tokenization, and handling out-of-vocabulary words. But I don’t think these details matter much here. As an aside, we can call any model trained in this way autoregressive. In statistics, an autoregressive model is any model where a variable is predicted using its own previous values. A classic example of this is the family of AR models, such as AR(1).

While (Bengio et al., 2003) was a landmark paper, its full impact was delayed by roughly a decade. This is because training neural networks was hard at the time. It’s worth checking out that paper and seeing just how primitive the engineering feels today. For example, they trained on CPUs and without modern tooling like automatic differentiation libraries. In the intervening decade, there was some early work that built on Bengio’s model. For example, in A unified architecture for natural language processing: Deep neural networks with multitask learning (Collobert & Weston, 2008), the authors demonstrate that Bengio’s neural language model could be trained and used on a variety of downstream tasks. And in Word representations: A simple and general method for semi-supervised learning (Turian et al., 2010), the authors demonstrate that word embeddings improve state-of-the-art NLP systems when included as additional features. But none of these contributions were convincing demonstrations of Bengio’s main idea.
So seven years after Bengio et al, it was $N$-grams, not neural networks, which were still the state-of-the-art, at least in practice and outside specialized benchmarks. Honestly, I found this surprising, but I kept reading this claim in various papers. For example, in the introduction to Recurrent neural network based language model (Mikolov et al., 2010), the authors wrote:

It is questionable if there has been any significant progress in language modeling over simple $N$-gram models… In fact, most of the proposed advanced language modeling techniques provide only tiny improvements over simple baselines, and are rarely used in practice.

Or two years after that, in A fast and simple algorithm for training neural probabilistic language models (Mnih & Teh, 2012), the authors wrote:

In spite of their superior performance, neural probabilistic language models remain far less widely used than $N$-gram models due to their notoriously long training times, which are measured in weeks even for moderately-sized datasets.

Of course, advanced techniques existed and were well known, but they were often impractical. So roughly a hundred years after Andrey Markov’s pioneering work, researchers were still struggling to represent human language in a form amenable for mathematics and computation, and $N$-grams were still considered a reasonable choice in NLP.

Today, neural networks are definitively state-of-the-art. What changed? The answer is that we learned to train variants of Bengio’s model at scale. Around 2012, researchers were finally able to train neural networks on large datasets. My understanding is that it was the so-called “AlexNet” paper, ImageNet classification with deep convolutional neural networks (Krizhevsky et al., 2012), that convinced many in the research community to pay attention. Convolutional neural networks were already well known and had been trained on small datasets since the 1980s (LeCun et al., 1989). But AlexNet was the first time a deep convolutional neural network was trained end-to-end on a very large (at the time) dataset, ImageNet (Deng et al., 2009), and using GPUs. The results were a tour de force. To quote the paper:

We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-$5$ test error rate of $15.3\%$, compared to $26.2\%$ achieved by the second-best entry.

In other words, AlexNet demolished the state-of-the-art in computer vision. It achieved a roughly $40\%$ reduction in relative error rate. Nothing else came close. As a comparison, the current fastest time for a men’s marathon is 2 hours and 35 seconds. The previous record was 2 hours and 69 seconds, so 34 seconds slower. Now imagine if someone came along and beat the record by half an hour. It would revolutionize the running world.

At the time, computer vision was still dominated by handcrafted feature pipelines, and so the AlexNet results were extremely surprising. For example, in Introduction to the bag of features paradigm for image classification and retrieval (O’Hara & Draper, 2011), the authors wrote:

The past decade has seen the growing popularity of Bag of Features (BoF) approaches to many computer vision tasks, including image classification, video search, robot localization, and texture recognition… BoF-based systems have set new performance standards on popular image classification benchmarks and have achieved scalability breakthroughs in image retrieval.
This introduction to bag of feature models was put on arXiv in January 2011, whereas AlexNet was published at NeurIPS in December 2012, meaning that the claim above was contemporaneous with the training of AlexNet! My point here is to underscore just how surprising the rise of neural networks was. To be clear, I am sure many in the research community believed neural networks would work—Hinton has been a believer since probably the 1970s—, but this was hardly the consensus view that it is today. So the year 2012 was a changepoint. In 2003, Bengio et al set the stage conceptually. In 2012, Krizhevsky et al set the stage technologically. With hindsight, the obvious implication of AlexNet was that NLP researchers circa 2012 should try to train neural networks at scale. Of course, many researchers tried, but let’s ground ourselves in one particular model. This will help focus the narrative. To my knowledge, two of the earliest and most successful papers to try this idea were Efficient estimation of word representations in vector space (Mikolov et al., 2013) and Distributed representations of words and phrases and their compositionality (Mikolov et al., 2013) . These papers are tightly related by both authorship and time, and together, they helped unlock the core ideas in Bengio’s paper, as well as introduce the famous word2vec model. So I think it’s fair to treat them as both a unit and as a landmark in our story. To understand these two papers, we need to understand the computational problems Bengio faced, which means we need to understand the model in more technical detail. Let x t \mathbf{x}_t x t ​ be the input to the model, and y t \mathbf{y}_t y t ​ be the output. Bengio’s model did not support variable-length inputs, and thus the input sequence could be only a fixed number of N N N words, each represented as an D D D -dimensional embedding. Let’s represent this input as the concatenation of N N N different D D D -vectors from C \mathbf{C} C mentioned above, so: x t : = [ c I ( w t − 1 ) ⋮ c I ( w t − N + 1 ) ] . (14) \mathbf{x}_t := \left[ \begin{array}{l} \mathbf{c}_{I(w_{t-1})} \\ \quad\quad\vdots \\ \mathbf{c}_{I(w_{t-N+1})} \end{array} \right]. \tag{14} x t ​ : = ⎣ ⎢ ⎢ ⎡ ​ c I ( w t − 1 ​ ) ​ ⋮ c I ( w t − N + 1 ​ ) ​ ​ ⎦ ⎥ ⎥ ⎤ ​ . ( 1 4 ) One way we can imagine constructing x t \mathbf{x}_t x t ​ is if we represent every word in our context window as a V V V -dimensional one-hot vector. Call this a matrix Q t ∈ R N × V \mathbf{Q}_t \in \mathbb{R}^{N \times V} Q t ​ ∈ R N × V . Then x t = Q t C \mathbf{x}_t = \mathbf{Q}_t \mathbf{C} x t ​ = Q t ​ C gives us the associated embeddings. In practice, though, we would never do a dense matrix multiplication with complexity O ( V N D ) \mathcal{O}(VND) O ( V N D ) . Instead, we would simply index into C \mathbf{C} C . So this operation has computational complexity O ( N D ) \mathcal{O}(ND) O ( N D ) . I only belabor this point because I found it confusing when first reading Bengio’s paper. (This point is made more clearly in (Collobert & Weston, 2008) ) After construction, this input x t \mathbf{x}_t x t ​ is then fed into an extremely simple (relative to today’s models) architecture, a feed-forward neural network with a linear projection layer and a nonlinear hidden layer: g Ω ( x t ) = y t : = b + W x t + U tanh ⁡ ( z t ) , z t : = d + H x t . (15) \begin{aligned} g_{\boldsymbol{\Omega}}(\mathbf{x}_t) = \mathbf{y}_t &:= \mathbf{b} + \mathbf{Wx}_t + \mathbf{U} \tanh(\mathbf{z}_t), \\ \mathbf{z}_t &:= \mathbf{d} + \mathbf{Hx}_t. 
The output $\mathbf{y}_t \in \mathbb{R}^{V}$ represents the un-normalized probability of each word in the vocabulary. If normalized, this vector would represent the probability distribution we discussed in the autoregressive framework. Here, we see that $\mathbf{W} \in \mathbb{R}^{V \times ND}$ is a linear projection of the input embeddings $\mathbf{x}_t$, that $\mathbf{H} \in \mathbb{R}^{H \times ND}$ is a linear projection into a hidden state vector $\mathbf{z}_t \in \mathbb{R}^H$, and that $\mathbf{U} \in \mathbb{R}^{V \times H}$ is a linear projection of the nonlinear hidden state vector. So clearly the parameters mentioned in Equation $7$ can be concretized as

$$\{\mathbf{C}, \boldsymbol{\Omega}\} := \{\mathbf{C}, \mathbf{b}, \mathbf{W}, \mathbf{U}, \mathbf{d}, \mathbf{H}\}. \tag{16}$$

So why was this expensive to train? We can see that the computational complexity to compute $\mathbf{y}_t$ is proportional to:

$$\underbrace{ND}_{\mathbf{QC}} \;+\; \underbrace{VND}_{\mathbf{Wx}_t} \;+\; \underbrace{VH}_{\mathbf{U} \tanh(\mathbf{z}_t)} \;+\; \underbrace{HND}_{\mathbf{Hx}_t}. \tag{17}$$

Note that this complexity is for every single word in the corpus, and we must also account for the number of training epochs. In (Mikolov et al., 2013), the authors write that a “common choice” is $N=10$ and that $ND$ is typically around $500$ to $2000$. However, the hidden layer has dimension $H$ (commonly around $2000$ or so), and this is multiplied by the size of the vocabulary! What this means is that the dominating term in Equation $17$ is $VH$. Furthermore, this complexity is just for computing the un-normalized probabilities $\mathbf{y}_t$. To normalize these, we must compute the softmax function over the size of the vocabulary $V$:

$$p(w_t \mid w_{t-N:t-1}) = \frac{\exp\left(\mathbf{y}_t\right)}{\sum_{i=1}^V \exp\left( \mathbf{y}_i \right)}. \tag{18}$$

As I understand it, these were the computational problems Bengio faced. The two Mikolov papers did not present a single trick to solve them. Rather, the papers made a number of modeling choices, mostly already established in the literature, that in combination finally made learning distributed representations of words scalable.

First, in the first paper, they avoided computing the full softmax function using hierarchical softmax, introduced by Morin and Bengio in Hierarchical probabilistic neural network language model (Morin & Bengio, 2005). I don’t think the details of this matter much here. See this blog post for a nice explanation with code. Suffice to say that it’s an efficient way to compute the normalized probabilities in Equation $18$. The computational complexity is reduced from $\mathcal{O}(V)$ to $\mathcal{O}(\log_2 V)$.
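To tie Equations 14 through 18 together, here is a minimal NumPy sketch of the forward pass, with toy dimensions of my own choosing; the variable names mirror the equations, and H_ stands in for the matrix $\mathbf{H}$ so it does not clash with the hidden size $H$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, N, H = 10_000, 64, 2, 128        # toy sizes; real vocabularies are far larger

# Parameters Theta = {C, b, W, U, d, H_}, as in Equations 15-16.
C  = rng.normal(0, 0.1, (V, D))        # word embeddings, one row per vocabulary word
W  = rng.normal(0, 0.1, (V, N * D))    # direct input-to-output projection
U  = rng.normal(0, 0.1, (V, H))        # hidden-to-output projection
H_ = rng.normal(0, 0.1, (H, N * D))    # input-to-hidden projection
b, d = np.zeros(V), np.zeros(H)

def forward(context_ids):
    """Un-normalized scores y_t for a context of N word indices (Equation 15)."""
    x = C[context_ids].reshape(-1)     # embedding lookup + concatenation: O(ND), no one-hot matmul
    z = d + H_ @ x                     # hidden pre-activation
    return b + W @ x + U @ np.tanh(z)  # V un-normalized scores

def softmax(y):
    """Normalize over the whole vocabulary (Equation 18) -- the expensive O(V) step."""
    e = np.exp(y - y.max())
    return e / e.sum()

probs = softmax(forward([17, 42]))     # p(w_t | previous N=2 words), for arbitrary toy indices
print(probs.shape, round(probs.sum(), 6))   # (10000,) 1.0
```

Even in this toy, the two matrix products touching $V$ (the `W @ x` and `U @ tanh(z)` terms, plus the softmax) dominate the cost, which is exactly the bottleneck the text describes.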
In the second paper, they further sped up the softmax computation by introducing a technique called negative sampling. The theory here is rich and deserving of its own post, but the main idea is to draw $K$ samples from a noise distribution and train the model to disambiguate observations from noise. The important point here is that one can prove this converges to the correct probabilities without explicitly computing the normalizing constant. See (Gutmann & Hyvärinen, 2010) for details.

We don’t need to fully grok these techniques; just know that these two approaches are both ways of getting around the expensive normalization in Equation $18$. For example, if $V = 1 \times 10^6$, then $\log_2(V) \approx 20$. And in the second paper, they chose $K$ to be $2$ to $20$ depending on the dataset.

Second, they stripped out the non-linear part of Bengio’s model (so removing $\mathbf{U} \tanh(\mathbf{z}_t)$), reducing the model to a simple linear operation: a dot product. The result is a model that is log-linear on the features, which I’ll explain in a moment.

Now the models. In the first paper, they presented two models, a continuous bag-of-words model (CBOW) and a continuous skip-gram model (skip-gram). These are the foundations of the word2vec NLP toolkit. In the CBOW model, a set of neighboring words are averaged to predict a target word; and in the skip-gram model, a target word is used to predict its neighboring words (Figure 3). Both worked empirically in practice, but the authors only built on the skip-gram model in the second paper. And since I don’t think it’s that important here to understand both, I’ll just focus on the skip-gram model.

Let’s build a little intuition by going into detail. The objective of the skip-gram model is to minimize the cross-entropy loss between a single target word and its neighboring words. So the input to the model is only a single $D$-vector representing a single word (so no context window). The output, however, are the $N$ words surrounding the input. Let $N = 2C$. Then the objective function is:

$$\frac{1}{T} \sum_{t=1}^T \sum_{-C \leq j \leq C,\; j \neq 0} \log p(w_{t+j} \mid w_t). \tag{19}$$

I will continue to use the notation $N$ for this context window, but clearly it is different in precise meaning from the $N$ in an $N$-gram or the $N$ in Bengio’s paper. We model the conditional probability in Equation $19$ via a simple log-linear function:

$$p(w_{t+j} \mid w_t) = p(\mathbf{u}_{I(w_{t+j})} \mid \mathbf{c}_{I(w_{t})}) = \frac{\exp\left( \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle \right)}{\sum_{i \in \mathcal{V}} \exp\left( \langle \mathbf{u}_i, \mathbf{c}_{I(w_{t})} \rangle \right)}. \tag{20}$$

Here, $\mathbf{c}_i$ are word embeddings of the inputs. These are analogous to the row-vectors of $\mathbf{C}$ in Bengio’s model and again are constructed via a lookup.
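To ground Equations 19 and 20, here is a minimal NumPy sketch of the skip-gram score and its negative-sampling loss for a single (center, context) pair. The dimensions are toy values and the noise distribution is uniform for simplicity; word2vec itself uses a smoothed unigram distribution, so treat this as an illustration rather than the papers' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 10_000, 100, 5            # vocabulary size, embedding size, negative samples

C = rng.normal(0, 0.1, (V, D))      # input ("center") embeddings, c in Equation 20
U = rng.normal(0, 0.1, (V, D))      # output ("context") embeddings, u in Equation 20

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_id, context_id):
    """Negative-sampling objective for one skip-gram pair: push the true
    (center, context) dot product up, and K random "noise" words down."""
    pos_score = U[context_id] @ C[center_id]        # <u_context, c_center>
    noise_ids = rng.integers(0, V, size=K)          # toy: uniform noise distribution
    neg_scores = U[noise_ids] @ C[center_id]
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()

print(neg_sampling_loss(center_id=17, context_id=42))
```

Note that the sum over the whole vocabulary in Equation 20's denominator never appears: only $K + 1$ dot products are computed per pair, which is the entire speed-up.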
The output embeddings $\mathbf{u}$ are a little trickier to interpret. If we were using the full softmax function, we would have $V$ such output embeddings, and these would represent the weights of the softmax function. But when using hierarchical softmax or negative sampling, the interpretation changes a bit. Again, I don’t think the details really matter here. The key point is that we take a sequence $w_{1:T}$, select the appropriate embeddings $\mathbf{c}_{1:T}$, and compute Equation $20$ directly, learning both the parameters $\mathbf{C}$ and $\mathbf{U}$.

This is called a “log-linear model” because the log of the conditional probability is linear with respect to its arguments:

$$\log p(w_{t+j} \mid w_t) = \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle - Z. \tag{21}$$

Here, I just write $Z$ to denote the normalizing constant, the denominator in Equation $20$, because it is not particularly interesting, and we do not even need to compute it when using negative sampling. The key relationship that the model is learning is a simple linear weighting of the input embeddings that allows it to predict nearby words.

Hopefully, it is clear why this model is so fast to train. We have no hidden layers or nonlinearities. We simply compute a dot product and ignore the normalizing constant. For example, when using the full softmax, the computational complexity is:

$$N (D + D V). \tag{22}$$

Here, we have $D + DV$ dot products, and we need to do it over $N$ words in our context window. However, in practice, we can eliminate $V$ entirely, replacing it with something around $\log_2(V)$ or $K$. This is significantly smaller than Equation $17$. For example, if we assume that $H=D=500$, $N=10$, and $V=1 \times 10^{6}$, then hierarchical softmax is five orders of magnitude smaller in terms of complexity.

So in these two seminal Mikolov papers, the authors stripped down Bengio’s core idea to a simple log-linear model, and thus were able to train that model at scale. That said, I want to stress a subtlety that took me time to grok. Neither the CBOW nor the continuous skip-gram models presented here are full language models. Notice that their objective functions (nearby-word prediction) are not in the autoregressive framework and thus cannot easily plug into Equation $1$. That’s because the goal of these papers was not to learn a full language model but rather to learn good word embeddings. They say this explicitly in the first paper (emphasis mine):

Representation of words as continuous vectors has a long history. A very popular model architecture for estimating neural network language model (NNLM) was proposed in (Bengio et al., 2003), where a feed-forward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others. Another interesting architecture of NNLM was presented in (Mikolov, 2007; Mikolov et al., 2009), where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM.
Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.

So the word2vec models were simple and shallow (single layer) neural networks designed for fast training and to learn good embeddings. They were not full language models. This is a major distinction from similar prior art, such as A scalable hierarchical distributed language model (Mnih & Hinton, 2008). In this paper, the authors demonstrate more scalable inference of Bengio’s model by representing the vocabulary compactly through binary trees and by using a log-bilinear model. But they go end-to-end to a language model, as the paper title suggests. Mikolov et al’s two models were relentlessly simple and efficient.

As I understand it, both CBOW and skip-gram worked well in practice. It did not matter if neighboring words predict a target word or if that target word predicts its neighboring words. The real differentiator was that both models could be efficiently trained at scale. And with scale, something remarkable happened: the authors discovered that distributed representations of words, trained in this fashion, captured semantic and syntactic information.

Today, linguistic regularities in word embeddings are so well-established that it might seem boring to read here. But understood in context, these regularities should be surprising! How can a simple linear model, trained on essentially next- or nearby-word prediction via maximum likelihood estimation, learn distributed representations of words with remarkable syntactic and semantic properties and relationships? In my mind, this was the first big result that suggested neural networks would not just work but really work in language modeling.

The word2vec papers were not the first to observe these properties. My understanding is that that credit goes to yet another Mikolov paper from 2013, Linguistic regularities in continuous space word representations (Mikolov et al., 2013). Here, the authors showed that many semantic and syntactic relationships correspond to approximately constant vector offsets in the embedding’s vector space. To be clear, researchers had long observed that one could uncover structure in vector representations of words. For example, in the 1989 paper Self-organizing semantic maps (Ritter & Kohonen, 1989), the authors trained self-organizing maps (Kohonen, 1982) on pre-computed two-dimensional vectors representing words and demonstrated that these maps contain semantic structure. However, these models were not trained end-to-end (the representations themselves were not learned) and did not have linear structure. It would be a stretch to call these vectors “word embeddings”.

But log-linear models like word2vec were remarkable precisely because they enabled analogical reasoning through simple vector offsets, i.e. linear operations (Figure 4)! Perhaps the most famous example of analogical reasoning with word embeddings is the relationship “king is to queen as man is to woman”:

$$\text{vec}\left(\text{``king''}\right) - \text{vec}\left(\text{``man''}\right) + \text{vec}\left(\text{``woman''}\right) \approx \text{vec}\left(\text{``queen''}\right). \tag{23}$$
Or in (Mikolov et al., 2013), the authors give the example that “Russia” plus “river” is the Volga:

$$\text{vec}\left(\text{``Russia''}\right) + \text{vec}\left(\text{``river''}\right) \approx \text{vec}\left(\text{``Volga River''}\right). \tag{24}$$

In my mind, these are pretty fascinating and non-obvious results. It suggests that the methods are not mixing vector dimensions in undesirable ways and staying approximately linear. Again, viewed with fresh eyes, it is really quite remarkable! If you were a researcher in 2003 reading Bengio’s paper, would you have predicted this result with high confidence?

While these two Mikolov papers are landmark papers on learning word embeddings at scale, they are by no means the only ones. Many other researchers worked in this area. Perhaps the most famous paper on word embeddings that we do not have time to discuss is GloVe: Global vectors for word representation (Pennington et al., 2014). In this paper, the authors present a unifying view between two common methods for learning word embeddings, global matrix factorization methods and local context window methods. But there were many others as well, such as Skip-thought vectors (Kiros et al., 2015), Word embeddings through Hellinger PCA (Lebret & Collobert, 2013), and Eigenwords: spectral word embeddings (Dhillon et al., 2015), to cite just a few illustrative examples.

For ease of presentation, I have focused on word-level embeddings. But the idea was naturally and quickly extended to larger contexts. This was motivated by the fact that a word’s meaning is obviously context-dependent (polysemy). For example, the word “bank” might refer to a financial institution or the side of a river. A word embedding for “bank” that is not context dependent must somehow flatten this distinction. So lack of context is obviously a limitation. Researchers tackled this through a variety of approaches. One approach was to use the hidden states of a bidirectional long short-term memory network (LSTM) as context-specific embeddings, as in context2vec: Learning generic context embedding with bidirectional LSTM (Melamud et al., 2016) or Learned in translation: contextualized word vectors (McCann et al., 2017). But perhaps the most noteworthy example of this idea—and one I mention here because it will come up later—was Deep contextualized word representations (Peters et al., 2018) or ELMO. Here, the authors both used a bidirectional LSTM to extract more context-dependent word embeddings and then trained on an objective function that was dependent on the downstream task. This hints at combining pre-trained embeddings with supervised fine-tuning, which we’ll see later.

By 2013, word- and phrase-level embeddings demonstrably worked. The key to unlocking them was simple methods that scaled on modern hardware. However, the problem with these embeddings is that they were still with respect to a fixed window. It was not immediately obvious how this idea could be extended to longer phrases or sentences or to larger texts. Of course, researchers had tried. For example, (Collobert & Weston, 2008) used the idea of time-delay neural networks (Waibel et al., 1989) to model sentences of variable lengths, but the authors used convolutions that still had a fixed-width window size. The embedding itself, then, was not constructed while accounting for long-range dependencies.
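Returning briefly to Equations 23 and 24, here is a toy sketch of how that vector arithmetic is usually evaluated in practice: add and subtract embeddings, then look up the nearest remaining word by cosine similarity. The embeddings below are random placeholders of my own, so the analogy will not actually resolve to “queen”; with trained word2vec vectors it famously does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in embeddings; real word2vec vectors would be loaded from a trained model.
vocab = ["king", "queen", "man", "woman", "river", "bank"]
E = {w: rng.normal(size=50) for w in vocab}

def nearest(v, exclude=()):
    """Vocabulary word whose embedding has the highest cosine similarity to v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(E[w], v))

target = E["king"] - E["man"] + E["woman"]          # the vector offset in Equation 23
print(nearest(target, exclude={"king", "man", "woman"}))
```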
So word embeddings, while a beautiful idea, only set the stage for the next big idea in our history: tackling the problem of modeling long-range dependencies without an explicit context window. The key innovation here was sequence-to-sequence models. In a sequence-to-sequence model, a neural network encodes a variable-length input sequence into a fixed-length vector, while a second neural network decodes this fixed-length vector back into a variable-length output sequence. In both Bengio and Mikolov’s papers, the input was an embedding ($\mathbf{c}$ in Equations $14$ and $20$). In a sequence-to-sequence model, this intermediate fixed-length vector is now the word embedding. The precise architectures used for the encoder and decoder can vary, but clearly they should be architectures that support variable-length sequences, such as recurrent neural networks (RNNs) or LSTMs.

To me, the most intuitive example of a sequence-to-sequence model is a translation model. The input sequence is a sentence in a source language like English, and the output sequence is a sentence in a target language like Chinese (Figure 5). And since some of the most important early work in sequence-to-sequence modeling was in neural machine translation (NMT), I’ll often use translation as a default example. However, the more general case is any mapping from one sequence to another.

This idea is fairly straightforward; it is analogous to an auto-encoder but for variable-length sequences, and auto-encoders (Bourlard & Kamp, 1988) are nearly as old as back-propagation. However, as we have already seen, even seemingly simple ideas are hard-won. The original work in RNNs and LSTMs goes back to at least the early 1990s, with seminal papers like Finding structure in time (Elman, 1990), Serial order: A parallel distributed processing approach (Jordan, 1997), and Long short-term memory (Hochreiter & Schmidhuber, 1997). By the 2010s, these sequential models were well-known and already used in NLP. See (Mikolov et al., 2010; Sutskever et al., 2011; Graves, 2013) for example. These models were an important bridge, proving that we could train RNNs at scale and overcome the vanishing gradient problem discussed in Learning long-term dependencies with gradient descent is difficult (Bengio et al., 1994). But they were not yet sequence-to-sequence models.

To my knowledge, the first paper to propose a full encoder–decoder architecture for NLP was Recurrent continuous translation models (Kalchbrenner & Blunsom, 2013). Here, the authors proposed training two neural networks end-to-end. The decoder was an RNN, inspired by the model in (Mikolov et al., 2010). But somewhat surprisingly, the encoder was not also an RNN. With hindsight, two RNNs feels like the obvious choice, but instead the authors used a convolutional sentence model (CSM). The details don’t really matter here, but this is essentially an NLP model which uses convolutional layers. Why this choice? Well, CSMs were actually developed by the same authors in the same year, in Recurrent convolutional neural networks for discourse compositionality (Kalchbrenner & Blunsom, 2013), and my hypothesis is that this choice just felt obvious to them at the time. So (Kalchbrenner & Blunsom, 2013) was a landmark paper in the sense that it was the first attempt at a sequence-to-sequence model, but with hindsight we can immediately see how to improve it with a better sequential model for the encoder. And that is precisely what happens in two follow-up papers.
First, in Learning phrase representations using RNN encoder–decoder for statistical machine translation (Cho et al., 2014), the authors propose the first encoder–decoder architecture in which both neural networks were RNNs. And then in Sequence to sequence learning with neural networks (Sutskever et al., 2014), the authors proposed a similar model but using LSTMs, since LSTMs often work better at handling the aforementioned vanishing gradient problem. In this paper, Sutskever makes the connection to Kalchbrenner explicit:

Our work is closely related to Kalchbrenner and Blunsom, who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words.

As a nitpick, convolutional neural networks do model local patterns and order, but they lose global order without very large receptive fields. But Sutskever’s point is directionally correct. So even at the time, the academic history we are tracing here was clear.

To understand these models in a bit more detail, let’s go through the RNN encoder–decoder in (Cho et al., 2014), using Figure 6 as a reference. Let $\mathcal{X}$ be a variable-length input sequence with length $T_x$, and let $\mathcal{Y}$ be a variable-length output sequence with length $T_y$:

$$\begin{aligned} \mathcal{X} &= \{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{T_x} \}, \\ \mathcal{Y} &= \{ \mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_{T_y} \}. \end{aligned} \tag{25}$$

Note that $(\mathcal{X}, \mathcal{Y})$ is a single observation pair, but I am suppressing the sample index for ease of notation. Also, I bold each vector in both sequences because they are embedded words. In an RNN, we iteratively compute hidden state variables over $T_x$ steps, where for the $t$-th step we define a recurrence relation between hidden states as:

$$\mathbf{h}_t = f_{\textsf{enc}} \left( \mathbf{h}_{t-1}, \mathbf{x}_t \right). \tag{26}$$

This might be a little abstract. So concretely, a simple RNN might instantiate $f_{\textsf{enc}}$ as the following nonlinear function of the current word embedding and the previous hidden state:

$$\mathbf{h}_t = \tanh \left(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t \right). \tag{27}$$

The matrices hopefully have obvious dimensions, and we can initialize the first hidden state vector $\mathbf{h}_0$ however we like, such as a vector of all zeros. This is simply one choice, though. We can imagine many types of choices, such as a vanilla RNN unit or an LSTM unit. The key point is that the hidden state vectors

$$\mathcal{H} = \{\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_{T_x}\} \tag{28}$$

carry forward information from previous words in the sequence via these recurrent connections, much like a hidden Markov model (Baum & Petrie, 1966). A powerful consequence of this model is that RNNs do not limit the size of the input context window.
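To make Equation 27 concrete, here is a minimal NumPy sketch of the encoder recurrence. The dimensions are made up, and a real implementation would use an LSTM or GRU cell, but the point is that the same two weight matrices process an input sequence of any length.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 32, 64                      # hypothetical embedding / hidden sizes

# Shared encoder weights, reused at every time step (Equation 27).
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_xh = rng.normal(scale=0.1, size=(d_hid, d_emb))

def encode(X):
    """Run the simple tanh recurrence over an input sequence X of shape
    (T_x, d_emb) and return all hidden states H of shape (T_x, d_hid)."""
    h = np.zeros(d_hid)                    # h_0 initialized to zeros
    H = []
    for x_t in X:                          # one step per input word
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        H.append(h)
    return np.stack(H)

# Works for any sequence length: no fixed context window.
print(encode(rng.normal(size=(5, d_emb))).shape)   # (5, 64)
print(encode(rng.normal(size=(12, d_emb))).shape)  # (12, 64)
```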
Different input sequences $\mathcal{X}$ can be different sizes, unlike in the $N$-gram model or in Bengio’s model (Equation 14). See Andrej Karpathy’s excellent blog post, The unreasonable effectiveness of recurrent neural networks, for a more detailed presentation of RNNs. Finally, we define the context vector $\mathbf{c}$ as some function of the hidden states:

$$\mathbf{c} = q(\mathcal{H}). \tag{29}$$

Notice that $\mathbf{c}$ does not have a time index, because it compresses all the temporal information in the input sequence $\mathcal{X}$ into a fixed-width vector. The easiest definition of $\mathbf{c}$ is simply the last hidden state vector, or $\mathbf{c} = \mathbf{h}_{T_x}$. This context vector becomes an input to the decoder, another RNN with recurrence relation

$$\mathbf{s}_t = f_{\textsf{dec}} \left( \mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c} \right), \tag{30}$$

and hidden states

$$\mathcal{S} = \{\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_{T_y}\}. \tag{31}$$

The decoder then outputs the sequence $\mathcal{Y}$, one word at a time. The typical objective of a sequence-to-sequence model is again the autoregressive objective of next-word prediction: maximize a log likelihood, in which each conditional probability is modeled via the decoder RNN:

$$\log p(\mathcal{Y}) = \sum_{t=1}^{T_y} \log p(\mathbf{y}_t \mid \mathbf{y}_{1:t-1}) = \sum_{t=1}^{T_y} \log f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}). \tag{32}$$

Again, this might be a bit abstract. So for example, one possible instantiation of $f_{\textsf{dec}}$ is as a linear transformation of the input variables:

$$f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}) = \mathbf{W}_{zs} \mathbf{s}_{t-1} + \mathbf{W}_{zy} \mathbf{y}_{t-1} + \mathbf{W}_{zc} \mathbf{c}. \tag{33}$$

Of course, this is just one choice. Then all the model weights are learned end-to-end by optimizing this log likelihood (Equation 32). In this way, we can convert a variable-length input $\mathcal{X}$ into a variable-length output $\mathcal{Y}$. This RNN encoder–decoder framework is powerful, since many problems in NLP can be framed in this way. For example, text summarization, machine translation, and agentic conversation can all be framed as sequence-to-sequence modeling challenges. To be clear, other researchers around this time had attempted other approaches to handling variable-length sequences, such as the recursive neural tensor network in Recursive deep models for semantic compositionality over a sentiment treebank (Socher et al., 2013). But the RNN encoder–decoder would become the de facto framework of choice for a large range of NLP tasks.
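To round out the picture, here is a minimal sketch of the decoder side: the last encoder hidden state is used as the context vector $\mathbf{c}$, and a simple recurrence plus a softmax over a toy vocabulary generates one token at a time. All sizes and weights are hypothetical placeholders; a real model would use an LSTM cell, learned output embeddings, and beam search rather than greedy decoding.

```python
import numpy as np

rng = np.random.default_rng(1)
d_hid, d_emb, vocab_size = 64, 32, 1000     # hypothetical sizes

# Decoder weights (recurrence plus an output projection to the vocabulary).
W_ss = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_sy = rng.normal(scale=0.1, size=(d_hid, d_emb))
W_sc = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_out = rng.normal(scale=0.1, size=(vocab_size, d_hid))
Y_emb = rng.normal(scale=0.1, size=(vocab_size, d_emb))  # output word embeddings

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_decode(H, max_len=10):
    """Greedy decoding given encoder hidden states H of shape (T_x, d_hid).
    The context vector c is the last encoder state (one simple choice of q)."""
    c = H[-1]
    s = np.zeros(d_hid)
    y_prev = np.zeros(d_emb)                # embedding of a start-of-sequence token
    out = []
    for _ in range(max_len):
        s = np.tanh(W_ss @ s + W_sy @ y_prev + W_sc @ c)   # Equation 30, roughly
        probs = softmax(W_out @ s)          # distribution over the next word
        w = int(np.argmax(probs))           # greedy choice
        out.append(w)
        y_prev = Y_emb[w]
    return out

# Assuming `encode` from the earlier encoder sketch:
# print(greedy_decode(encode(rng.normal(size=(7, 32)))))
```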
As an aside, sometimes these models are called sequence transduction models, transduction models, or even just transducers. My understanding is that “transduction” here just means converting one sequence into another by learning a conditional distribution $p_{\theta}(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:S})$. In this context, “transduction” does not have the sense that Vladimir Vapnik gave it. In Vapnik’s definition, transduction loosely means classification of a specific example rather than a general rule for classifying future examples (Gammerman et al., 2013). But this is not the sense in which people mean it when they refer to models like the transformer as a “transducer”.

In my mind, Kalchbrenner, Cho, and Sutskever’s three papers (Kalchbrenner & Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014) were the foundations of sequence-to-sequence modeling, and many other papers have built around and off this core idea. But the key point for us here is that these three papers make the same logical choice: they lift the idea of a fixed-length embedding for words or phrases into the context vector $\mathbf{c}$ of a sequential model, such that the models can now support variable-length inputs and outputs and long-range dependencies in each. However, a problem with this approach was that long-range dependencies got “lost” in this context vector. For example, imagine we had a very long English-language text that we wanted to translate into Chinese. Even if our encoder LSTM was good at capturing long-range dependencies in the English sentence, it would be forced to compress that information into a much shorter, fixed-width vector with no temporal structure, which would then be fed into the decoder. This effect was observed by Cho et al. in On the properties of neural machine translation: encoder–decoder approaches (Cho et al., 2014). In this paper, the authors write:

Our analysis shows that the performance of the neural machine translation model degrades quickly as the length of a source sentence increases. The most obvious explanatory hypothesis is that the fixed-length vector representation does not have enough capacity to encode a long sentence with complicated structure and meaning.

The authors test this hypothesis through a variety of experiments. For example, in one experiment, they report the BLEU score for an RNN encoder–decoder as a function of sequence length, and they show that the model’s performance degrades as the sentences become longer. So the RNN encoder–decoder was promising, but the fixed-width context vector was a bottleneck on modeling long-range dependencies.

Then in 2014, a seminal paper was published that addressed this problem: Neural machine translation by jointly learning to align and translate (Bahdanau et al., 2014). The main invention of this paper was to use the (already well-known) attention mechanism to construct the context vector, rather than compressing everything into a single fixed vector. However, the authors barely use the word “attention” in the paper. Instead, they seem to conceptualize it more as a search problem:

In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

I say this paper is “seminal” because, at least to my knowledge, it was really the first paper to use a differentiable attention layer in the rapidly-growing field of NMT.
To be clear, the attention mechanism was already known and used outside of NLP. For example, see Learning to combine foveal glimpses with a third-order Boltzmann machine (Larochelle & Hinton, 2010), Learning where to attend with deep architectures for image tracking (Denil et al., 2012), or Recurrent models of visual attention (Mnih et al., 2014). These were all papers that were published between 2010 and 2014 and that applied an attention mechanism to a neural network computer vision system. However, to my knowledge, Bahdanau et al. (2014) was the first paper to successfully use attention in NLP. To quote Effective approaches to attention-based neural machine translation (Luong et al., 2015):

In the context of NMT, Bahdanau et al… has successfully applied such attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

All that said, “jointly align and translate” is pretty vague, so let’s get technical. Bahdanau’s solution to this bottleneck was to allow each hidden state vector in the decoder to pay attention to possibly all the hidden state vectors in the encoder. What do I mean by “pay attention to”? Here, each decoder hidden state variable $\mathbf{s}_i$ depends not only on the previous hidden state and previous word but also on its own context vector, which is a weighted combination of the encoder’s hidden states!

$$\begin{aligned} \mathbf{s}_i &= f_{\textsf{dec}}(\mathbf{s}_{i-1}, \mathbf{y}_{i-1}, \mathbf{c}_i), \\ \mathbf{c}_i &= \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j. \end{aligned} \tag{34}$$

This is the main idea of the paper. Each decoder hidden state $\mathbf{s}_i$ has access to all the hidden states in the encoder via this context vector $\mathbf{c}_i$ (Figure 7). We can finally define the attention mechanism! Here, it is the weighted sum of encoder hidden state vectors, which allows each $\mathbf{s}_i$ to attend to different parts of the input sequence via its own context vector. Each weight $\alpha_{ij}$ is a function of the previous decoder hidden state $\mathbf{s}_{i-1}$ and the $j$-th encoder hidden state $\mathbf{h}_j$:

$$\begin{aligned} \alpha_{ij} &:= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \\ e_{ij} &:= \mathbf{v}_a^{\top} \mathbf{z}_{ij}, \\ \mathbf{z}_{ij} &:= \tanh\left( \mathbf{W}_a \mathbf{s}_{i-1} + \mathbf{U}_a \mathbf{h}_j \right). \end{aligned} \tag{35}$$

Let’s call $\boldsymbol{\alpha}_i$ the alignment vector, which we infer one step at a time during the decoding process. So $\mathbf{z}_{ij}$ can be viewed as a shared hidden state, capturing nonlinear information about both the input and output sequences. Importantly, there is one such vector for each pair of decoder and encoder positions $(i, j)$. And for a given decoder hidden state, the model can up- or down-weight the relationship to $\mathbf{h}_j$ via the parameters $\mathbf{v}_a$.
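To make Equation 35 concrete, here is a minimal NumPy sketch of one decoding step of this additive (Bahdanau-style) attention. The weight matrices and sizes are hypothetical placeholders; in a real model they would be learned end-to-end.

```python
import numpy as np

rng = np.random.default_rng(2)
d_hid, d_att = 64, 32                       # hypothetical sizes

# Attention parameters from Equation 35.
W_a = rng.normal(scale=0.1, size=(d_att, d_hid))
U_a = rng.normal(scale=0.1, size=(d_att, d_hid))
v_a = rng.normal(scale=0.1, size=d_att)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_prev, H):
    """Given the previous decoder state s_prev (d_hid,) and all encoder
    hidden states H (T_x, d_hid), return the alignment weights alpha_i
    and the context vector c_i for this decoding step."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alpha = softmax(scores)                 # one weight per source position
    c = alpha @ H                           # weighted sum of encoder states
    return alpha, c

alpha, c = attention_step(rng.normal(size=d_hid), rng.normal(size=(9, d_hid)))
print(alpha.shape, alpha.sum(), c.shape)    # (9,) ~1.0 (64,)
```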
The neural network learns all these model parameters end-to-end via back-propagation, maximizing the log likelihood in Equation 32. So that’s it. As I understand it, (Bahdanau et al., 2014) was really the first paper to use attention in neural machine translation and probably the most successful use of attention in NLP at the time. The method worked surprisingly well. To quote the paper’s conclusion:

Perhaps more importantly, the proposed approach achieved a translation performance comparable to the existing phrase-based statistical machine translation. It is a striking result, considering that the proposed architecture, or the whole family of neural machine translation, has only been proposed as recently as this year.

As an aside, they actually used a bidirectional RNN for the encoder and concatenated the forward and backward hidden states. But I don’t think that adds much to our story or to intuition, and it would muddy Figure 7. The key point is that it was the attention mechanism that allowed the long-range dependencies encoded by the RNN to be captured through an adaptive context vector. Hopefully, we can now see why the paper uses the words “align and translate”. Here, alignment really means allowing the model to uncover which parts of the input sequence matter to each part of the output sequence—and it does this via the attention mechanism.

Finally, while writing this blog post, I came across this incredible comment by Edward Grefenstette, published on 3 May 2014:

By and large, the case for deep learning in language hasn’t been fully made. It works well for vision and speech, but that doesn’t entail that it would carry to semantics. Some excellent shallow models without non-linearities, like the Mnih and Hinton log-bilinear models, are excellent and can be trained very quickly. It’s a problem with much “deep learning” work in NLP these days that shallow baselines are never considered or compared to. Deep learning is fascinating and will certainly have an impact in NLP, but don’t rush to believe that it’s the best solution for your NLP problems.

I love this comment because it is a time-capsule, perfectly capturing how experts in the field felt about neural networks at the time. (Note that Grefenstette has published papers with other researchers in this story, such as Kalchbrenner and Graves.) So even around the time that Bahdanau et al. were publishing groundbreaking work on RNN encoder–decoders with attention, deep learning had still not fully proven itself to the community. The attentive reader might be wondering: wasn’t the argument around log-linear models that they were simple and therefore scalable? But Bahdanau’s RNN encoder–decoder with attention seems anything but simple. So on some level, yes, Bahdanau’s model was a step backwards in terms of complexity. But on another level, it was a proof-of-concept that the attention mechanism worked. (Also, Moore’s law.) So researchers quickly built on Bahdanau by studying simpler models and simpler types of attention. Perhaps the most important paper to directly build on Bahdanau’s model was (Luong et al., 2015). In this paper, the authors simplified the model used by Bahdanau, proposed several alternative forms of attention, and showed that an ensemble of attention-based methods produced state-of-the-art results on neural machine translation problems.
To be clear, Bahdanau had shown that attention worked and that it seemed to address problems in translating longer sentences, but it did not demonstrably beat the state-of-the-art. Luong’s results more directly suggested that attention might be the way forward. So before we get to the transformer, let’s understand the attention mechanism better through the lens of this paper.

The first dimension along which we can define attention is local versus global attention. For example, in the attention mechanism in an RNN encoder–decoder, the conceptual lynchpin is that at each time step $i \in \{1, \dots, T_y\}$ in the decoding phase, we construct a context vector $\mathbf{c}_i$ which summarizes information from the source sentence via the encoder’s hidden states:

$$\mathbf{c}_i = \sum_{j=a}^{b} \alpha_{ij} \mathbf{h}_j. \tag{36}$$

But now I don’t precisely define the limits of the sum, $a$ and $b$. If $a=1$ and $b=T_x$, then the context vector is constructed by considering all the hidden states of the source sentence. This is what Luong calls global attention (Figure 8, left), since each word in the target sentence has access to information about all the words in the source sentence. But we could also define $a$ and $b$ such that they form a window around the decoder’s hidden state or model the left-to-right structure of many natural languages. This is what Luong calls local attention (Figure 8, right). So these are two ways in which we can construct the context vector $\mathbf{c}_i$.

The second dimension along which we can define attention is how we define the alignment weights $\boldsymbol{\alpha}_i$. For example, the simplest choice is that $\boldsymbol{\alpha}_i$ is a one-hot vector, such that $\mathbf{c}_i$ selects a single encoder hidden state vector $\mathbf{h}_k$ to use in the $i$-th decoding step. This would be hard- rather than soft-search. But more generally, we can write the unnormalized alignment scores as the output of a score function. Using the notation from Equation 35 above, we can write this as:

$$e_{ij} := \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}). \tag{37}$$

And in Luong, the authors explore three main scoring functions. These are dot-product attention, general attention, and additive attention, defined as:

$$e_{ij} = \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}) = \begin{cases} \mathbf{h}_j^{\top} \mathbf{s}_{i-1} & \text{dot,} \\ \mathbf{h}_j^{\top} \mathbf{W}_a \mathbf{s}_{i-1} & \text{general,} \\ \mathbf{v}_a^{\top} \tanh \left( \mathbf{W}_a \mathbf{h}_j + \mathbf{U}_a \mathbf{s}_{i-1} \right) & \text{additive (Bahdanau).} \end{cases} \tag{38}$$

Of course, you can imagine many other score functions. My own view is that it’s too difficult here to reason about which form of attention is better in some theoretical sense. Which form works best is an empirical result.
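As a small illustration, here is a minimal NumPy sketch of the three score functions in Equation 38. The parameter shapes are hypothetical; in practice each would be learned along with the rest of the model.

```python
import numpy as np

rng = np.random.default_rng(3)
d_hid, d_att = 64, 32                       # hypothetical sizes

W_a_gen = rng.normal(scale=0.1, size=(d_hid, d_hid))  # for "general" attention
W_a = rng.normal(scale=0.1, size=(d_att, d_hid))       # for additive attention
U_a = rng.normal(scale=0.1, size=(d_att, d_hid))
v_a = rng.normal(scale=0.1, size=d_att)

def score(h_j, s_prev, kind):
    """The three Luong-style score functions from Equation 38."""
    if kind == "dot":
        return h_j @ s_prev
    if kind == "general":
        return h_j @ W_a_gen @ s_prev
    if kind == "additive":
        return v_a @ np.tanh(W_a @ h_j + U_a @ s_prev)
    raise ValueError(kind)

h_j, s_prev = rng.normal(size=d_hid), rng.normal(size=d_hid)
print({k: float(score(h_j, s_prev, k)) for k in ("dot", "general", "additive")})
```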
In (Luong et al., 2015), the empirical results were mixed in the sense that all three score functions worked well. In fact, the results weren’t even strong enough for the authors to claim that attention-based methods were demonstrably better. This was their conclusion:

Our analysis shows that attention-based NMT models are superior to non-attentional ones in many cases, for example in translating names and handling long sentences.

So by late 2015, just two years before the transformer, attention was just becoming popular in NMT but was not yet the de facto modeling choice. That said, obviously this will change, and when it does, there will be a clear winner amongst the choices above, and that winner is dot-product attention. Dot-product attention is the variant used by the transformer, and thankfully, in my mind it is the most intuitive, since the dot product is a standard way to measure the similarity between two vectors. So we can interpret the dot-product score function as measuring the similarity between the encoder and decoder hidden states.

The third and final dimension along which we can define attention is through the variables of interest. In order to understand what I mean, we can no longer refer to attention in terms of hidden states of RNNs. We need more general terminology. In the literature, attention is often viewed through the lens of information retrieval. In this literature, a query is what you are asking for; a key is what you can search through; and a value is what you can return. Let me give an example (Figure 9). Imagine I type some text into a search bar: “indian food near me”. This text is the query. Now imagine the search engine runs that query against a bunch of metadata associated with different restaurants. For example, restaurant descriptions, keywords, reviews, ratings, and distances from my location. These metadata are the keys. So the query is “run against” the keys. Finally, the things returned are candidate restaurants. These are values. In the language of information retrieval, we can describe the attention mechanism as a kind of soft-search, since it can return a linear combination of the values. As you may recall, this is precisely how Bahdanau described their model in the quote above.

So in Bahdanau’s RNN encoder–decoder, the decoder’s hidden states $\mathbf{s}_i$ are the queries, since for each hidden state $\mathbf{s}_i$ we want to search through the source sentence. The encoder’s hidden states $\mathbf{h}_j$ are the keys, since these are the metadata associated with the source sentence that we can search through. Finally, the encoder’s hidden states are also the values, since the context vector $\mathbf{c}_i$ is a weighted combination of these encoder hidden states. This language is useful because it disambiguates the attention mechanism from a specific choice of model and even from which variables in that model are being used for what. Now that we understand this terminology, we can express ourselves more cleanly and abstractly. And with this terminology, it becomes clear that the keys, queries, and values need not be different objects in our model at all! In fact, queries, keys, and values can all be taken from the same set. For example, imagine we have a model with a hidden state $\mathbf{h}$. This is not necessarily the hidden state of an RNN or even a sequential model.
We could define a kind of attention such that the queries ($\mathbf{q}$), keys ($\mathbf{k}$), and values ($\mathbf{v}$) are all functions of this hidden state:

$$\begin{aligned} \mathbf{q}_i &:= f_q(\mathbf{h}_i), \\ \mathbf{k}_j &:= f_k(\mathbf{h}_j), \\ \mathbf{v}_j &:= f_v(\mathbf{h}_j), \\ \alpha_{ij} &= \text{softmax}(\text{score}(\mathbf{q}_i, \mathbf{k}_j)), \qquad \sum_{j} \alpha_{ij} = 1, \\ \mathbf{c}_i &= \sum_j \alpha_{ij} \mathbf{v}_j. \end{aligned} \tag{39}$$

This is obviously different from the attention mechanism in Bahdanau. In Bahdanau, the authors use cross-attention, which is attention where the queries come from one set and the keys and values come from a different set. As you can imagine, typically the keys and values come from the same set, although they might have their own maps or projections such that they are correlated but not identical. For example, we might run a query against restaurants (keys) and also return restaurants (values). However, self-attention is when the queries, keys, and values all come from the same set of variables! To continue abusing our running example, we essentially compute the similarity between restaurants of interest and restaurants we have data about, and then use those weights to return a weighted combination of restaurants!

To my knowledge, the first paper to use self-attention in NLP was Long short-term memory-networks for machine reading (Cheng et al., 2016). This model is a bit complicated, and I don’t think it’s that important to understand here. The key point is only to grok that attention does not have to be cross-attention as in Bahdanau. Instead, we can have a sequence attend to itself to decide what parts of the sequence matter—or self-attention! This is how this idea was described in the paper:

A remaining practical bottleneck for RNNs is memory compression (Bahdanau et al., 2014): since the inputs are recursively combined into a single memory representation which is typically too small in terms of parameters, it becomes difficult to accurately memorize sequences (Zaremba & Sutskever, 2014). In the encoder-decoder architecture, this problem can be sidestepped with an attention mechanism which learns soft alignments between the decoding states and the encoded memories (Bahdanau et al., 2014). In our model, memory and attention are added within a sequence encoder allowing the network to uncover lexical relations between tokens.

The important phrase here is “within a sequence encoder”. Here, the attention is not applied across the encoder and decoder but rather is applied as intra- or self-attention within the encoder. So circa 2017, attention was being studied in its many forms: local versus global, additive versus multiplicative, and cross versus self. And it was being more widely used in NLP, with papers like A structured self-attentive sentence embedding (Lin et al., 2017) and Bidirectional attention flow for machine comprehension (Seo et al., 2016). That said, I do not think any specific form was clearly the dominant one. Rather, each showed promise in its own way.
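To make self-attention concrete, here is a minimal NumPy sketch in the spirit of Equation 39, where the queries, keys, and values are all (hypothetical) linear projections of the same sequence of hidden states and the score is a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(4)
d_hid, d_k = 64, 32                          # hypothetical sizes

# Linear projections standing in for f_q, f_k, f_v in Equation 39.
W_q = rng.normal(scale=0.1, size=(d_k, d_hid))
W_k = rng.normal(scale=0.1, size=(d_k, d_hid))
W_v = rng.normal(scale=0.1, size=(d_k, d_hid))

def self_attention(H):
    """Self-attention over hidden states H of shape (T, d_hid): queries,
    keys, and values are all computed from the same sequence."""
    Q, K, V = H @ W_q.T, H @ W_k.T, H @ W_v.T
    scores = Q @ K.T                         # dot-product scores, shape (T, T)
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)  # each row sums to one
    return alpha @ V                         # one context vector per position

print(self_attention(rng.normal(size=(7, d_hid))).shape)  # (7, 32)
```

With that general recipe in hand, let’s return to how unsettled the field still was just before the transformer.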
For example, in March 2017, Google Brain published Massive exploration of neural machine translation architectures (Britz et al., 2017). This appeared just months before the transformer was published, and even here, attention is only a minor player. In that paper’s conclusions, the authors list six main results, and the only one about attention is a single sentence:

Parameterized additive attention yielded the overall best results.

Notice that additive attention is not even the form of attention used by the transformer! So at least as best as I understand it, attention was well-understood and widely-studied in 2017, but it was by no means considered the main ingredient or the next logical step. Many researchers were still pushing the limits of training RNNs at scale, rather than trying other approaches. See Exploring the limits of language modeling (Jozefowicz et al., 2016) for example. However, in June 2017, all that was about to change. The transformer’s time had come.

In 2017, researchers at Google Brain published Attention is all you need (Vaswani et al., 2017), which is the original paper introducing the transformer architecture. This was their proposal, which I hope now makes sense given the context so far:

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

The authors acknowledge that the sequence-to-sequence framework with neural networks was state-of-the-art, and they specifically call out the RNN encoder–decoder architecture with attention from Bahdanau, Luong, and others. Their proposal is simple: keep the encoder–decoder framework but replace everything else with attention. How might someone have come to this idea at the time? Why would it be a good idea to try? Their observation is that the sequential nature of RNNs inhibits training these models at scale:

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

Their proposal is to use attention rather than RNNs to uncover dependencies within the input and output sequences. This is a good idea to try not because attention is obviously better than recurrence per se. It’s that attention is parallelizable! They write:

The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

We have seen this before. Recall how the unlock for word embeddings in (Mikolov et al., 2013; Mikolov et al., 2013) was simplifying the models and focusing on scale. But then the RNN encoder–decoder architecture in (Bahdanau et al., 2014) with attention took us backwards in terms of model complexity. So the transformer is a similar story: take the best modeling ideas, strip them down, and train the simplified model at scale.
That’s it. Properly understood in context, the transformer is a modest conceptual leap from the existing literature. My point is not that the transformer is “obvious” in the sense that it is not an impressive invention. My point is to demystify the research product by underscoring the process. In context, the transformer should make sense as something someone might have tried in 2017. The model architecture might look intimidating, but it is pretty straightforward when viewed in the right context (Figure 10). At a high level, the transformer is an encoder–decoder, with two big kinds of attention. First, we have cross-attention between the outputs of the encoder and the inputs to the decoder. This is completely analogous to the cross-attention in Bahdanau and others. But then we also have self-attention within the decoder and encoder. This completely replaces the recurrence relations of RNNs. Finally, the model uses something called positional encoding, which I’ll define shortly, to handle the fact that attention is not naturally sequential a la an RNN. Everything else is details. For example, the transformer also uses layer normalization (Ba et al., 2016) and residual connections (He et al., 2016), but these are not unique or novel contributions. Even multi-head attention is not conceptually hard. So understood in context, the transformer is pretty straightforward. Let’s go through the main bits in detail.

First, positional encoding. A key challenge for the attention mechanism is that it does not inherently capture sequential structure. Thus, the relative positions of words in a sequence can be easily lost. In Vaswani, the authors propose attaching vectors of numbers to the inputs to capture this position-dependent information. The precise functional form of these numbers doesn’t really matter to us. The point is that we’re encoding the position of each word so that we can still model the sequential structure of natural language.

After adding position-dependent information, the transformer encodes the input sequence. But rather than passing the data through an RNN, it passes the data through multi-head attention layers. We’ll discuss “multi-head” in a moment, but the basic attention mechanism is what the authors call scaled dot-product attention. Let’s define it. Let $\mathbf{Q} \in \mathbb{R}^{M \times D_k}$ be a matrix of queries, let $\mathbf{K} \in \mathbb{R}^{N \times D_k}$ be a matrix of keys, and let $\mathbf{V} \in \mathbb{R}^{N \times D_v}$ be a matrix of values. Then scaled dot-product attention is:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{D_k}} \right) \mathbf{V}. \tag{40}$$

When I first read Vaswani, I had not yet read Bahdanau or Luong, and thus I was completely confused by Equation 40. It was not at all obvious what any of these values represented or why any of this machinery worked. And the paper itself gave a pretty opaque explanation:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
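Before unpacking that description, here is a minimal NumPy sketch of Equation 40 itself, with made-up matrix sizes. The softmax is applied row-wise, so each query gets its own distribution over the keys.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation 40: softmax(Q K^T / sqrt(D_k)) V, with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (M, N) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to one
    return weights @ V                               # (M, D_v)

rng = np.random.default_rng(5)
M, N, D_k, D_v = 4, 9, 32, 64                        # hypothetical sizes
out = scaled_dot_product_attention(
    rng.normal(size=(M, D_k)),                       # queries
    rng.normal(size=(N, D_k)),                       # keys
    rng.normal(size=(N, D_v)),                       # values
)
print(out.shape)                                     # (4, 64)
```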
Without more context, the paper’s explanation is not very helpful. However, armed with a better understanding of attention, we can make sense of this. In the cross-attention between the encoder and decoder, the queries are analogous to the hidden states of the RNN decoder, while the keys and values are analogous to the hidden states of the RNN encoder. And if we consider a single query $\mathbf{q}_i$ (a single row of $\mathbf{Q}$), we can rewrite Equation 40 in a way that looks like the types of attention in Equation 38:

$$\begin{aligned} \text{score}(\mathbf{q}_i, \mathbf{k}_j) &= e_{ij} = \frac{\mathbf{q}_i^{\top} \mathbf{k}_j}{\sqrt{D_k}}, \\ \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{N} \exp(e_{ik})}, \\ \text{attention}(\boldsymbol{\alpha}_i, \mathbf{V}) &= \sum_{k=1}^{N} \alpha_{ik} \mathbf{v}_k. \end{aligned} \tag{41}$$

So this is identical to the multiplicative or dot-product attention proposed in Luong (Equation 38), modulo a scaling factor $\sqrt{D_k}$. In Equation 40, we are just packaging it into matrix form so that we can compute this attention over many queries at once. In other words, this is a highly parallelizable version of dot-product attention.

I think one of the reasons the transformer can be confusing is the use of two types of attention and the generic language of queries, keys, and values, whose definitions change depending on the type of attention. In the encoder, the transformer uses self-attention. So the query represents the current vector in the input sequence, while the keys and values are all the other vectors in the input sequence. And in the decoder, the query represents the current vector in the output sequence, while the keys and values are all the other vectors in the output sequence—modulo masking, which I’ll mention in a moment. Finally, in the attention between the encoder and decoder (in the paper, Vaswani calls this “encoder–decoder attention”), the query is the current vector in the decoder output (analogous to $\mathbf{s}_i$ in the RNN encoder–decoder), while the keys and values are the encoder’s hidden outputs (analogous to $\mathcal{H}$ in the RNN encoder–decoder).

Note that “masked” in “masked multi-head self-attention” just refers to masking out future positions in the decoder’s self-attention mechanism. This is because attention has no inherent sequential structure a la RNNs. So we have to enforce this by masking regions of the output. This allows the transformer to be trained in the standard autoregressive framework we have discussed since (Bengio et al., 2003).

Finally, the transformer learns multiple sets of parameters associated with the attention mechanism at once. This is what the paper calls multi-head attention. Instead of having a single attention function, we can run multiple attention functions in parallel, say $A$ times. By way of analogy, recall that in the RNN encoder–decoder, we had the following attention parameters (Equation 35):

$$\{\mathbf{W}_a, \mathbf{U}_a, \mathbf{v}_a\}. \tag{42}$$
In Bahdanau (Equation 35), the subscript $a$ just denotes that these are attention-related weights. It is not actually indexing into multiple such weights (that is, $A=1$). But we could do that. We could say that $a$ is indexing into different parameters, $a \in \{1, 2, \dots, A\}$. This would have made Bahdanau’s model slower to train, but it would have allowed for multiple cross-attention mechanisms to be learned at once. In Bahdanau, they don’t actually do this, likely because it’s too expensive! The precise details are different in Vaswani, but this is all multi-head attention is in theory. It is multiple parallel attention mechanisms.

So that’s it. That’s the transformer. The results were impressive. To be clear, it was not an AlexNet moment, but the results were clearly better than benchmarks, and more importantly, the model was way more efficient. For example, one of the benchmarks in Vaswani is the ConvS2S Ensemble from Convolutional sequence to sequence learning (Gehring et al., 2017). The idea of this paper is similar to the transformer: train a bigger sequence-to-sequence model by eschewing recurrent connections in favor of parallelizable convolutional layers. In both English-to-German and English-to-French translation, the transformer beats this model in BLEU score. But more importantly, it is more efficient. For example, according to Vaswani, the ConvS2S Ensemble required $1.2 \times 10^{21}$ flops to train their English-to-French model, whereas the transformer required $3.3 \times 10^{18}$ flops. So the transformer had comparable results with a 360x reduction in flops! In my mind, this is the real insight. It is not that attention is absolutely the best way to model the problem. Rather, the transformer is on the Pareto frontier between modeling the problem well enough and being scalable enough. To see the transformer in code, see Sasha Rush’s excellent The annotated transformer.

The transformer was a revolutionary architecture, and explicitly designed to scale. However, in reality the original model was tiny by today’s standards. The biggest variant had only 213 million parameters, and the largest dataset it was trained on, the WMT 2014 English–French dataset, had only 36 million sentence pairs. But the paper proved that the transformer worked well as a generic transduction model. However, despite the paper’s name, the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today. To simplify the discussion, I’ll focus on training for OpenAI’s GPT series. My understanding is that OpenAI made a lot of the big contributions here, and so their papers are good landmarks to follow. Loosely, the three training stages they discuss in their GPT papers are generative pre-training, discriminative fine-tuning, and reinforcement learning with human feedback. Let’s work through the first two in detail here and the last one in detail in the next section.

In 2018, roughly a year after the transformer was published, OpenAI published Improving language understanding by generative pre-training (Radford et al., 2018). The main idea of the paper is to pre-train a transformer with as much unlabeled data as possible before fine-tuning it with task-specific supervised training.
In the paper, the authors call the first step generative pre-training and the second step discriminative fine-tuning. (The words “generative” and “discriminative” have a long history in machine learning; see (Ng & Jordan, 2001) for a discussion.) As the OpenAI paper title suggests, the key focus was on generative pre-training. Supervised learning obviously matters, but the idea was that one could use unsupervised training at scale to build a base model and then use supervised learning to train more task-specific downstream models.

Let’s look at generative pre-training in a bit more detail. Since we do not have labels, we need some way to formalize the problem. In generative pre-training, the objective is next-word prediction as in the autoregressive framework. In other words, the objective is maximum likelihood estimation on Equation 1:

$$L_{\textsf{GPT}}(\boldsymbol{\Theta}) = \sum_{t=1}^T \log p_{\boldsymbol{\Theta}}\left(w_t \mid w_{t-N:t-1}\right). \tag{43}$$

As we saw around Equation 12, maximum likelihood estimation here is equivalent to minimizing the cross-entropy loss between our model’s prediction of $w_t$ and the ground truth. So this whole process is unsupervised, and we can train our model on lots and lots and lots of data.

It’s worth observing that Equation 43 is only one generative pre-training objective function, and it has limitations. In particular, note that the autoregressive framework means that the model is pre-trained “left to right” and thus limits the set of suitable downstream tasks. To address this limitation, in 2019, Google AI published BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., 2019). Here, the authors propose a pre-training objective that learns bidirectional representations. Rather than pre-training using the autoregressive framework, they pre-train using a “masked language model”, which randomly masks some of the tokens to predict, without assuming a left-to-right relationship. Quoting that paper:

Unlike left-to-right language model pre-training, the [masked language model] objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.

More formally, let $\mathcal{M} \subseteq \{1, 2, \dots, T\}$ be a mask denoting positions in the input sequence $w_{1:T}$, and let $\neg \mathcal{M}$ denote all indices that are not in $\mathcal{M}$. The denoising objective is to maximize

$$L_{\textsf{MLM}}(\boldsymbol{\Theta}) = \sum_{i \in \mathcal{M}} \log p_{\boldsymbol{\Theta}}\left(w_i \mid w_{\neg \mathcal{M}} \right). \tag{44}$$

This idea was inspired by the Cloze test (Taylor, 1953), and the idea was that this bidirectional transformer can then be fine-tuned on a much wider range of downstream tasks. That said, my understanding is that generative pre-training is fairly standard. The left-to-right assumption is simple and matches natural language, coding, and so forth. But I am not confident about what is used in absolutely state-of-the-art foundation models right now.
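As a small, self-contained illustration of the difference between Equations 43 and 44, here is a NumPy sketch that scores a toy token sequence under a next-token objective and under a masked-language-model objective. The `model_logits` function is a made-up stand-in for a trained network, so the numbers themselves are meaningless; only the shape of the two objectives matters.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab_size = 50
tokens = rng.integers(vocab_size, size=12)          # a toy token sequence

def model_logits(context):
    """Stand-in for a trained network: returns logits over the vocabulary
    given some context tokens. A real model would be a transformer."""
    rng_local = np.random.default_rng(int(np.sum(context)) + 1)
    return rng_local.normal(size=vocab_size)

def log_prob(target, context):
    logits = model_logits(context)
    logits -= logits.max()
    return logits[target] - np.log(np.exp(logits).sum())

# Equation 43: autoregressive (left-to-right) objective.
L_gpt = sum(log_prob(tokens[t], tokens[:t]) for t in range(1, len(tokens)))

# Equation 44: masked-LM objective; predict masked positions from the rest.
masked = [3, 7]
L_mlm = sum(
    log_prob(tokens[i], np.delete(tokens, masked))   # condition on unmasked tokens
    for i in masked
)
print(L_gpt, L_mlm)
```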
Either way, neither objective function is enough. For example, consider a conversational agent built on top of a large language model. Now imagine the user prompts an LLM with the following question: “I am having trouble getting a date. Any advice?” If the LLM is only trained on next-word prediction, a plausible response might be: “You’ll never find true love!” From the perspective of the distribution of English words on the internet, this is not an unreasonable response. But it is not helpful and hopefully not true. In other words, next-word prediction is obviously not enough for most meaningful tasks that leverage LLMs. So the second step in training is discriminative fine-tuning. “Discriminative fine-tuning” is just a fancy way of saying supervised learning on specific tasks:

$$L_{\textsf{DFT}}(\boldsymbol{\theta}) = \sum_{y, x_{1:T}} \log p_{\boldsymbol{\theta}}\left(y \mid x_{1:T} \right). \tag{45}$$

Here, I am using standard notation for supervised learning, $(x, y)$, rather than the notation in this post. There are some possible subtleties here. For example, in the GPT-1 paper, they optimize a weighted objective function to balance between generative pre-training and discriminative fine-tuning:

$$L_{\textsf{final}} = L_{\textsf{DFT}} + \lambda \, L_{\textsf{GPT}}. \tag{46}$$

This ensures that during fine-tuning, the model does not unlearn parameters that are good for next-word prediction. In the process of trying to fine-tune LLMs, researchers have built ever more task-specific datasets to tackle problems like question answering (Reddy et al., 2019), text summarization (Nallapati et al., 2016), commonsense inference (Zellers et al., 2019), code generation (Chen et al., 2021), broader discourse context (Paperno et al., 2016), and grade school math (Cobbe et al., 2021). A pre-trained LLM can be fine-tuned in a dizzying number of ways.

I have two caveats to the above presentation. First, I want to emphasize that this two-step training procedure was not a conceptual leap for researchers. At the time, researchers were already training models with pre-trained word embeddings, and even earlier, this two-step training procedure was both understood and used in practice. For examples, see (Collobert & Weston, 2008; Ramachandran et al., 2016; Hinton et al., 2012). Furthermore, researchers knew to use both pre-trained word embeddings and even task-specific objectives when training their word embeddings. Remember ELMo? The earliest reference I have found to this idea of pre-training—I am sure there are earlier ones—is from the 2006 paper Greedy layer-wise training of deep networks (Bengio et al., 2006). Here, the authors write:

We hypothesize that three aspects of this strategy are particularly important: first, pre-training one layer at a time in a greedy way; second, using unsupervised learning at each layer in order to preserve information from the input; and finally, fine-tuning the whole network with respect to the ultimate criterion of interest.

In these examples, it’s clear the authors recognize that one can pre-train a model with unsupervised learning and then fine-tune it with supervised learning. So even in the GPT paper, the novel contribution is not generative pre-training per se, but only applying it to language modeling at an unprecedented scale.
My second caveat is that while discriminative fine-tuning is used in commodity LLMs that many people interact with, the early GPT models were remarkable in part because they did not need fine-tuning! For example, as their titles suggest, the GPT-2 paper Language models are unsupervised multitask learners (Radford et al., 2019) and the GPT-3 paper Language models are few-shot learners (Brown et al., 2020) both focus on massively pre-trained transformers that excel in the zero-shot (Palatucci et al., 2009) and few-shot settings, on a variety of tasks like reading comprehension, summarization, and translation. For example, in the GPT-3 paper, the authors are explicit:

For all tasks, GPT-3 is applied without any gradient updates or fine-tuning.

That said, many related research projects did fine-tune these models, and the GPT-4 technical report (Achiam et al., 2023) does discuss post-training alignment, which we’ll discuss next. So while each LLM may be trained in slightly different ways, I am fairly confident most foundation models today are trained with some combination of massive pre-training and then optionally task-specific fine-tuning and alignment. I’m sure the precise details vary depending on the final product. For example, OpenAI’s Codex is a version of GPT-5 optimized for agentic coding.

Making LLMs bigger does not necessarily make them better at following a user’s intent or make them more aligned with human values. For example, we might not want conversational agents to lie, to make racist jokes, or to sexually harass the user. But nothing in the autoregressive framework accounts for this. We need to somehow encode these human values into the model. For some of these properties, we might be able to use a form of fine-tuning. There are datasets for this, such as the ETHICS dataset (Hendrycks et al., 2020) or the RealToxicityPrompts dataset (Gehman et al., 2020). But the limitations here are fairly obvious. And many human values would be difficult to encode this way because the properties themselves are hard to define.

To encode these properties, state-of-the-art LLMs are often trained using something called reinforcement learning with human feedback (RLHF). RLHF was developed around the same time as the transformer, in Deep reinforcement learning from human preferences (Christiano et al., 2017). The original motivation was to expand the reinforcement learning (RL) framework beyond problems with well-specified reward functions. For example, RL has been used to great effect to play Go (Silver et al., 2016), Atari (Mnih et al., 2013), and Dota 2 (Berner et al., 2019), but what these tasks have in common is that their reward functions are relatively simple and their environments are relatively easy to simulate. But to borrow two examples from Christiano et al., how would you teach a machine-learning model to clean a table or to scramble an egg? It’s hard to come up with an objective function or simulation environment for these kinds of tasks. What we need, then, is a reward function that can be defined by human feedback and thus by human preferences.

Broadly, RLHF is a three-step training procedure (Figure 11). First, humans are used to label a dataset which captures human preferences. For example, if the task is text summarization, the dataset might be different candidate summarizations, with the best summarization being defined by human scorers. Second, researchers train a reward function on these data, which predicts which output the humans would prefer.
Finally, given this reward function, researchers can apply standard RL algorithms such as proximal policy optimization, or PPO (Schulman et al., 2017), to fine-tune the model. Fine-tuning LLMs with RLHF is now fairly standard practice. For example, GPT-2 was fine-tuned this way in Fine-tuning language models from human preferences (Ziegler et al., 2019), while GPT-3 was fine-tuned this way in Training language models to follow instructions with human feedback (Ouyang et al., 2022) and in Learning to summarize with human feedback (Stiennon et al., 2020). And the GPT-4 whitepaper (Achiam et al., 2023) states that the model was trained with RLHF. That said, as the content of this post approaches the present day, it is increasingly likely I am writing things that lack nuance. For example, in the GPT-4 whitepaper, the authors write:

The model’s capabilities on exams appear to stem primarily from the pre-training process and are not significantly affected by RLHF.

So while I am confident that generative pre-training is not enough and that large foundation models trained today certainly do more than just pre-training, the precise details of what else goes into which models are both opaque and rapidly changing. Finally, it’s worth mentioning other work on LLM alignment beyond RLHF. In particular, Anthropic has a number of papers on model alignment. For example, the paper A general language assistant as a laboratory for alignment (Askell et al., 2021) focuses on encoding alignment into LLMs, where they define an aligned model as a model that is “helpful, honest, and harmless”. They explore a variety of techniques, such as imitation learning, binary discrimination, and ranked preference modeling. However, the best way to tackle alignment is still an open problem.

Large language models are the result of at least forty years of research, dating back to work by Hinton, Rumelhart, and others on distributed representations in the 1980s. In the early 2000s, Bengio et al. introduced the first probabilistic language model using neural networks. However, it wasn’t until after AlexNet, nearly a decade later, that researchers were finally able to train neural network language models at scale. They quickly discovered that these distributed representations captured semantic and syntactic structure, even when using simple log-linear models. This idea of word- and phrase-level embeddings was then extended to variable-length sequences with long-range dependencies via transduction models, particularly models with an attention mechanism on the hidden states. Finally, in 2017, Vaswani et al. introduced the transformer, which simplified transduction models by relying entirely on attention. In the eight years since, the main advancements have been training these models on more and more data, using techniques such as generative pre-training and reinforcement learning with human feedback.

After learning about how LLMs work, I am reminded of one of my favorite Richard Feynman quotes: “It is not complicated. It’s just a lot of it.” Of course, this is dramatic, but I do think it emphasizes an important point: none of the ideas in this post are terribly complicated. No single idea is beyond the abilities of a smart teenager to understand. But what is beautiful and surprising and remarkable is that the phenomena we observe in LLMs are not magic but simply the emergence of a complex system from simple rules. Today, LLMs are everywhere, and it’s easy to get lost in the models and benchmarks.
OpenAI has the GPT series (Radford et al., 2018; Radford et al., 2019; Brown et al., 2020; Achiam et al., 2023). Google has the Gemini family of models (Team et al., 2023) as well as PaLM (Chowdhery et al., 2023), LaMDA (Thoppilan et al., 2022), Gopher (Rae et al., 2021), and BERT (Devlin et al., 2019). Anthropic has the Claude family of models, named in ascending order of size and power: Haiku, Sonnet, and Opus. Finally, Meta has its LLaMA series (Touvron et al., 2023; Touvron et al., 2023). And there are many, many more, such as open-weight models like DeepSeek-R1 (Guo et al., 2025), which made headlines earlier this year. It would be its own blog post to cover the differences between these. But in essence, every model is the same: a large transformer-style model, pre-trained at massive scale using next-word prediction. The biggest differences have been the size of the training data and the size of the model. For example, GPT-1 is thought to have 117 million parameters (estimated from “Model specifications” in the original paper), while GPT-2 and GPT-3 had 1.5 billion and 175 billion parameters respectively—although in (Stiennon et al., 2020), the authors, OpenAI researchers, mention using “large pretrained GPT-3 models with as many as 6.7 billion parameters”. Regardless, there are roughly three orders of magnitude in the number of parameters in just two generations. OpenAI did not publish the model sizes for GPT-4 and GPT-5, and the latter does not even have a whitepaper but only a “system card”. I have not seen published numbers for Google’s large Gemini models, but the smallest model (the nano) has 1.8–3.25 billion parameters (Team et al., 2023). Google DeepMind’s Gopher had 280 billion parameters in 2021, while PaLM had 540 billion parameters in 2022! So industry secrets aside, it is safe to say that large foundation models today are likely pushing into the trillions of parameters. The era of truly large language models has begun.

In my mind, the main counterintuitive result of LLMs is that training ever larger models using primarily next-word prediction is enough to exhibit human-level performance on such a broad range of tasks. And scale truly does matter here. For example, in the GPT-4 technical report, the authors observe that on a simulated bar exam, GPT-4 scored in the 90th percentile, while GPT-3.5 scored in the 10th. Or consider chain-of-thought reasoning (Ling et al., 2017), a way of prompting LLMs to improve their reasoning by forcing them to explain each step. In Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022), the authors write:

Chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ∼100B parameters. We qualitatively found that models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting.

So you don’t get anything useful from chain-of-thought reasoning until you have a model roughly three orders of magnitude larger than the original transformer. Why does scaling work? I don’t think anyone knows.
But scaling is an effect that has been observed repeatedly since AlexNet, in the above works and also in meta-analyses such as Scaling laws for neural language models (Kaplan et al., 2020), Scaling language models: methods, analysis, and insights from training Gopher (Rae et al., 2021), and Emergent abilities of large language models (Wei et al., 2022). And this phenomenon was both observed and predicted in the famous blog post The Bitter Lesson. In that post, one of the pioneers of RL, Richard Sutton, argues that the “bitter lesson” from AI history is that general methods that scale with computation outperform approaches built on human knowledge and domain-specific insight. This lesson is bitter because it means that expert labor, clever domain-specific theories, handcrafted features, elegant mathematics, and beautiful algorithms all get dwarfed and outpaced by brute-force search and learned representations.

As a harsh example of this, consider the observation that early LLMs were bad at mathematics (Hendrycks et al., 2021). Today, state-of-the-art models are winning gold at the International Math Olympiad, and Terry Tao has compared o1 to a mediocre graduate student. The rate of change is immense. By “early LLMs”, I am referring to models from five years ago, and the transistor itself is not even eighty years old. Did you know that a modern graphics card can perform 36 trillion calculations a second? Moore’s law and all that.

If you feel that it’s a bit perverse that next-word prediction is a sufficient objective for solving elite math problems, if this feels like a stochastic parrot outsmarting you, then you might feel some of the discomfort early linguists felt at statistical language modeling. This is the visceral feeling of the bitter lesson. Our specialized knowledge feels expendable, and our intuitions about understanding seem irrelevant in the face of raw computation and speed.

But my own view, since you’ve read this far, is that for the time being, machine learning systems are powerful tools that can still be combined with real expertise. Perhaps the best example of this is AlphaFold from Google DeepMind, published in Highly accurate protein structure prediction with AlphaFold (Jumper et al., 2021). This model achieved near-experimental accuracy on the protein structure prediction problem. On the one hand, it did so with black-box deep learning. On the other hand, the work leaned heavily on prior biological knowledge, for example using sequences from evolutionarily related proteins and 3D coordinates of homologous structures as inputs to the model. It clearly sidestepped Levinthal’s combinatorial search, even if we do not know how.

So what happens next? Even the world’s leading experts disagree. But in my mind, if anyone deserves the last word here, it is Geoff Hinton, who has been a contrarian believer in neural networks since the 1970s and who, along with Yoshua Bengio and Yann LeCun, won the Turing Award in 2018. In a 2024 BBC interview, Hinton argued that LLMs do in fact understand natural language, and that they are our current best theory of how the brain understands language as well. In his view, it is only a matter of time before LLMs exceed human intelligence. Certainly, by some metrics and along some dimensions, they already have.

Below are some additional resources, which I found useful or interesting while writing this post:
3Blue1Brown: Inside an LLM
Stefania Cristina: The Bahdanau attention mechanism
Stefania Cristina: The attention mechanism from scratch
Dan Jurafsky and James H. Martin: Speech and language processing
Andrej Karpathy: The unreasonable effectiveness of recurrent neural networks
Andrej Karpathy: Let’s build GPT: from scratch, in code, spelled out
Chris Olah: Understanding LSTM networks
Dwarkesh Patel: Interview with Richard Sutton
Sasha Rush: The annotated transformer
Ari Seff: How ChatGPT is trained
Ari Seff: What are transformer neural networks?
StackOverflow: What exactly are keys, queries, and values in attention mechanisms?
Mohammed Terry-Jack: Deep learning: The transformer

0 views
alexiajm 3 months ago

Less is More: Recursive Reasoning with Tiny Networks

|| Paper | Code ||

In this new paper, I propose the Tiny Recursion Model (TRM), a recursive reasoning model that achieves amazing scores of 45% on ARC-AGI-1 and 8% on ARC-AGI-2 with a tiny 7M-parameter neural network. The idea that one must rely on massive foundation models trained for millions of dollars by some big corporation in order to succeed on hard tasks is a trap. Currently, there is too much focus on exploiting LLMs rather than devising and exploring new lines of research. With recursive reasoning, it turns out that “less is more”: you don’t always need to crank up model size for a model to reason and solve hard problems. A tiny model pretrained from scratch, recursing on itself and updating its answers over time, can achieve a lot without breaking the bank.

This work came about after I learned of the recent, innovative Hierarchical Reasoning Model (HRM). I was amazed that an approach using small models could do so well on hard tasks like the ARC-AGI competition (reaching 40% accuracy when normally only large language models could compete). But I kept thinking that it was too complicated, relying too much on biological arguments about the human brain, and that the recursive reasoning process could be greatly simplified and improved. Tiny Recursion Model (TRM) simplifies recursive reasoning to its core essence, which ultimately has nothing to do with the human brain and requires neither a mathematical (fixed-point) theorem nor any hierarchy. See the paper for more details.

TRM recursively improves its predicted answer y with a tiny network. It starts with the embedded input question x, an initial embedded answer y, and a latent z. For up to K improvement steps, it tries to improve its answer y. It does so by (i) recursively updating its latent z n times, given the question x, the current answer y, and the current latent z (recursive reasoning), and then (ii) updating its answer y given the current answer y and the current latent z. This recursive process allows the model to progressively improve its answer (potentially fixing errors from its previous answer) in an extremely parameter-efficient manner while minimizing overfitting.
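As a rough illustration of that loop, here is my own hedged sketch in PyTorch, not the author’s code: the layer sizes, the tiny network itself, and the values of K and n are arbitrary placeholders standing in for the real architecture.

```python
# Hedged sketch of the TRM-style recursion described above (not the official code).
import torch
import torch.nn as nn

class TinyRecursiveSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Placeholder "tiny network" pieces: one update for the latent z, one for the answer y.
        self.update_z = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.update_y = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, y, z, K=3, n=6):
        for _ in range(K):                                        # up to K improvement steps
            for _ in range(n):                                    # recursive reasoning on the latent z
                z = self.update_z(torch.cat([x, y, z], dim=-1))   # z <- f(x, y, z)
            y = self.update_y(torch.cat([y, z], dim=-1))          # y <- g(y, z): refine the answer
        return y

# Toy usage with random vectors standing in for the embedded question, answer, and latent.
model = TinyRecursiveSketch()
x, y, z = torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 64)
print(model(x, y, z).shape)  # torch.Size([1, 64])
```

The real model of course operates on embedded puzzle inputs and is trained end to end; the point of the sketch is only the nested update loop the post describes.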

0 views