Posts in Machine-learning (20 found)
Simon Willison 6 days ago

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

Anthropic released Claude Opus 4.5 this morning, which they call "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's GPT-5.1-Codex-Max and Google's Gemini 3 , both released within the past week! The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February). The pricing is a big relief: $5/million for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for >200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5. The Key improvements in Opus 4.5 over Opus 4.1 document has a few more interesting details: I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in a new alpha release of sqlite-utils that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across 20 commits, 39 files changed, 2,022 additions and 1,173 deletions in a two day period. Here's the Claude Code transcript where I had it help implement one of the more complicated new features. It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday when I still had a few remaining issues in the milestone for the alpha . I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model. With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected. I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two. This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors. Google's Nano Banana Pro image generation model was notable in that its ability to render usable infographics really does represent a task at which previous models had been laughably incapable. The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis? And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models. I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double-down on that advice myself! I'd love to see AI labs like Anthropic help address this challenge directly. I'd like to see new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle. "Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond. In the meantime, I'm just gonna have to keep on getting them to draw pelicans riding bicycles . Here's Opus 4.5 (on its default "high" effort level ): It did significantly better on the new more detailed prompt : Here's that same complex prompt against Gemini 3 Pro and against GPT-5.1-Codex-Max-xhigh . From the safety section of Anthropic's announcement post: With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry: On the one hand this looks great, it's a clear improvement over previous models and the competition. What does the chart actually tell us though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks that success rate goes up to 1/3! I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Opus 4.5 has a new effort parameter which defaults to high but can be set to medium or low for faster responses. The model supports enhanced computer use , specifically a tool which you can provide to Opus 4.5 to allow it to request a zoomed in region of the screen to inspect. " Thinking blocks from previous assistant turns are preserved in model context by default " - apparently previous Anthropic models discarded those.

1 views
Simon Willison 1 weeks ago

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro , also known as Gemini 3 Pro Image . I've had a few days of preview access and this is an astonishingly capable image generation model. As is often the case, the most useful low-level details can be found in the API documentation : Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks. [...] These 14 images can include the following: Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K. Here's what I got for his first test prompt, using Google's AI Studio : Here's what I got: The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's a Google Drive link for the original . Then I ran his follow-up prompt: I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG . The new model isn't cheap. Here's the API pricing : it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed. Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there. So this thing is great at following instructions. How about rendering text? I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was: Infographic explaining how the Datasette open source project works This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt: That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close-enough for an infographic. "Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me. Here's the thinking trace for that image generation process: Initiating Datasette Infographic I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram. Developing the Datasette workflow I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output. Visualizing Datasette Workflow I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts. Defining Core Datasette Stages I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step. Constructing Datasette Infographic I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation. Analyzing Datasette's Architecture I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow. Mapping the Datasette Flow I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow. This note from the announcement post caught my eye: We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark. Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon. Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that Gemini app and asked "Was this image created with AI?": It replied: Yes, it appears that all or part of this image was created with Google Al. SynthID detected a watermark in 25-50% of the image. Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . High-resolution output : Built-in generation capabilities for 1K, 2K, and 4K visuals. Advanced text rendering : Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets. Grounding with Google Search : The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events). Thinking mode : The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output. Up to 14 reference images : You can now mix up to 14 reference images to produce the final image. Up to 6 images of objects with high-fidelity to include in the final image Up to 5 images of humans to maintain character consistency

0 views
Ahead of AI 3 weeks ago

Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions with respect to what I think about alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I also promised attendees to follow up with a write-up of these alternative approaches). So here it is! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.) Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. In code, a simplified version of the Gated DeltaNet depicted above (without the convolutional mixing) can be implemented as follows (the code is inspired by the official implementation by the Qwen3 team): (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. (And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.) So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with context length. The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet, can, to some extend, still capture context, but it has to go through the memory ( S ) bottleneck. That memory is a fixed size and thus more efficient, but it compresses past context into a single hidden state similar to RNNs. That’s why the Qwen3-Next and Kimi Linear architectures don’t replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier. In the previous section, we discussed the advantage of the DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don’t grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article). Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with context length. For a regular multi-head attention (MHA) layer, we can compute the KV cache size as follows: (The 2 multiplier is there because we have both keys and values that we store in the cache.) For the simplified DeltaNet version implemented above, we have: Note that the memory size doesn’t have a context length ( ) dependency. Also, we have only the memory state S that we store instead of separate keys and values, hence becomes just bytes. However, note that we now have a quadratic in here. This comes from the state: But that’s usually nothing to worry about, as the head dimension is usually relatively small. For instance, it’s 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights. It’s important to note that TRM is not a language model operating on text. However, because (a) it’s a transformer-based architecture, (b) reasoning is now a central focus in LLM research, and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor) I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, TRM currently works on grid-based inputs and outputs. In other words, both the “question” and the “answer” are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences. HRM consists of two small transformer modules (each 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that’s likely to make it easier to compare against HRM.) TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating. Performance-wise, TRM performs really well compared to HRM, as shown in the figure below. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length. While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and Maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adopted for other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLM are more like computers, which can do a lot of other things as well. Still, these recursive architectures are exciting proof-of-concepts that highlight how small, efficient models can “reason” through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, Maze finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domain where such models could find use. As an interesting tidbit, the author shared that it took less than $500 to train this model, with 4 H100s for around 2 days. I am delighted to see that it’s still possible to do interesting work without a data center. I originally planned to cover all models categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although, Gated DeltaNet already gave a taste of State Space Models and recurrent designs.) As a conclusion to this article, I want to repeat the earlier words, i.e., that standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now. Traditional Decoder-Style, Autoregressive Transformers + Proven & mature tooling + “well-understood” + Scaling laws + SOTA - Expensive training - Expensive inference (except for aforementioned tricks) If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising, which are especially interesting when working with longer contexts where efficiency is a main concern. Linear Attention Hybrids + Same as decoder-style transformers + Cuts FLOPs/KV memory at long-context tasks - Added complexity - Trades a bit of accuracy for efficiency On the more extreme end, text diffusion models are an interesting development. I’m still somewhat skeptical about how well they perform in everyday use, as I’ve only tried a few quick demos. Hopefully, we’ll soon see a large-scale production deployment with Google’s Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about them. Text Diffusion Models + Iterative denoising is a fresh idea for text + Better parallelism (no next-token dependence) - Can’t stream answers - Doesn’t benefit from CoT? - Tricky tool-calling? - Solid models but not SOTA While the main selling point of text diffusion models is improved efficiency, code world models sit on the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models, based on standard LLMs, are mostly improved through reasoning techniques, yet if you have tried them on trickier challenges, you have probably noticed that they (more or less) still fall short and can’t solve many of the trickier coding problems well. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems. Code World Model + Promising approach to improve code understanding + Verifiable intermediate states - Inclusion of executable code traces complicates training - Code running adds latency Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, they are very interesting proofs-of-concept, and I am glad researchers are working on them. Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as special purpose models for free-form text, code, math problems and much more. They feel like brute-force and jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks. Also, I can see them as potential “tools” for other tool-calling LLMs; for instance, when LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems. Small Recursive Transformers + Very small architecture + Good generalization on puzzles - Special purpose models - Limited to puzzles (so far) This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you’ve been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI again because there’s a lot of interesting work happening right now! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) 1. Transformer-Based LLMs Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. 2. (Linear) Attention Hybrids Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . 2.1 Traditional Attention and Quadratic Costs The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. 2.2 Linear attention Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. 2.3 Linear Attention Revival In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. 2.4 Qwen3-Next Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? 2.5 Gated Attention Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. 2.6 Gated DeltaNet Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. 2.8 Kimi Linear vs. Qwen3-Next Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. 2.9 Kimi Delta Attention Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. 2.10 The Future of Attention Hybrids Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. 3. Text Diffusion Models A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). 3.1 Why Work on Text Diffusion? With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. 3.2 The Denoising Process The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. 3.3 Autoregressive vs Diffusion LLMs Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. 3.4 Text Diffusion Today It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. 4. World Models So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 4.1 The Main Idea Behind World Models Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. 4.2 From Vision to Code That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. 4.3 Code World Models Vs Regular LLMs for Code So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. 5. Small Recursive Transformers You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. 5.1 What Does Recursion Mean Here? TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.

1 views
Sean Goedecke 1 months ago

Should LLMs just treat text content as an image?

Several days ago, DeepSeek released a new OCR paper . OCR, or “optical character recognition”, is the process of converting an image of text - say, a scanned page of a book - into actual text content. Better OCR is obviously relevant to AI because it unlocks more text data to train language models on 1 . But there’s a more subtle reason why really good OCR might have deep implications for AI models. According to the DeepSeek paper, you can pull out 10 text tokens from a single image token with near-100% accuracy. In other words, a model’s internal representation of an image is ten times as efficient as its internal representation of text. Does this mean that models shouldn’t consume text at all? When I paste a few paragraphs into ChatGPT, would it be more efficient to convert that into an image of text before sending it to the model? Can we supply 10x or 20x more data to a model at inference time by supplying it as an image of text instead of text itself? This is called “optical compression”. It reminds me of a funny idea from June of this year to save money on OpenAI transcriptions: before uploading the audio, run it through ffmpeg to speed it up by 2x. The model is smart enough to still pull out the text, and with one simple trick you’ve cut your inference costs and time by half. Optical compression is the same kind of idea: before uploading a big block of text, take a screenshot of it (and optionally downscale the quality) and upload the screenshot instead. Some people are already sort-of doing this with existing multimodal LLMs. There’s a company selling this as a service , an open-source project, and even a benchmark . It seems to work okay! Bear in mind that this is not an intended use case for existing models, so it’s plausible that it could get a lot better if AI labs start actually focusing on it. The DeepSeek paper suggests an interesting way 2 to use tighter optical compression for long-form text contexts. As the context grows, you could decrease the resolution of the oldest images so they’re cheaper to store, but are also literally blurrier. The paper suggests an analogy between this and human memory, where fresh memories are quite vivid but older ones are vaguer and have less detail. Optical compression is pretty unintuitive to many software engineers. Why on earth would an image of text be expressible in fewer tokens than the text itself? In terms of raw information density, an image obviously contains more information than its equivalent text. You can test this for yourself by creating a text file, screenshotting the page, and comparing the size of the image with the size of the text file: the image is about 200x larger. Intuitively, the word “dog” only contains a single word’s worth of information, while an image of the word “dog” contains information about the font, the background and text color, kerning, margins, and so on. How, then, could it be possible that a single image token can contain ten tokens worth of text? The first explanation is that text tokens are discrete while image tokens are continuous . Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens. Another way of looking at the same intuition is that text tokens are a really inefficient way of expressing information . This is often obscured by the fact that text tokens are a reasonably efficient way of sharing information, so long as the sender and receiver both know the list of all possible tokens. When you send a LLM a stream of tokens and it outputs the next one, you’re not passing around slices of a thousand numbers for each token - you’re passing a single integer that represents the token ID. But inside the model this is expanded into a much more inefficient representation (inefficient because it encodes some amount of information about the meaning and use of the token) 3 . So it’s not that surprising that you could do better than text tokens. Zooming out a bit, it’s plausible to me that processing text as images is closer to how the human brain works . To state the obvious, humans don’t consume text as textual content; we consume it as image content (or sometimes as audio). Maybe treating text as a sub-category of image content could unlock ways of processing text that are unavailable when you’re just consuming text content. As a toy example, emoji like :) are easily-understandable as image content but require you to “already know the trick” as text content 4 . Of course, AI research is full of ideas that sounds promising but just don’t work that well. It sounds like you should be able to do this trick on current multimodal LLMs - particularly since many people just use them for OCR purposes anyway - but it hasn’t worked well enough to become common practice. Could you train a new large language model on text represented as image content? It might be tricky. Training on text tokens is easy - you can simply take a string of text and ask the model to predict the next token. How do you train on an image of text? You could break up the image into word chunks and ask the model to generate an image of the next word. But that seems to me like it’d be really slow, and tricky to check if the model was correct or not (e.g. how do you quickly break a file into per-word chunks, how do you match the next word in the image, etc). Alternatively, you could ask the model to output the next word as a token. But then you probably have to train the model on enough tokens so it knows how to manipulate text tokens. At some point you’re just training a normal LLM with no special “text as image” superpowers. AI labs are desperate for high-quality text, but only around 30% of written books have been digitized. It’s really hard to find recent data on this, but as a very rough estimate Google Books had ~40M books in 2023, but Google estimates there to have been ~130M books in 2010. That comes out to 30%. See Figure 13. Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea. Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries. AI labs are desperate for high-quality text, but only around 30% of written books have been digitized. It’s really hard to find recent data on this, but as a very rough estimate Google Books had ~40M books in 2023, but Google estimates there to have been ~130M books in 2010. That comes out to 30%. ↩ See Figure 13. ↩ Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea. ↩ Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries. ↩

0 views
nathan.rs 1 months ago

BERT is just a Single Text Diffusion Step

A while back, Google DeepMind unveiled Gemini Diffusion , an experimental language model that generates text using diffusion. Unlike traditional GPT-style models that generate one word at a time, Gemini Diffusion creates whole blocks of text by refining random noise step-by-step. I read the paper Large Language Diffusion Models and was surprised to find that discrete language diffusion is just a generalization of masked language modeling (MLM), something we’ve been doing since 2018 . The first thought I had was, “can we finetune a BERT-like model to do text generation?” I decided to try a quick proof of concept out of curiosity.

0 views
Ahead of AI 1 months ago

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

How do we actually evaluate LLMs? It’s a simple question, but one that tends to open up a much bigger discussion. When advising or collaborating on projects, one of the things I get asked most often is how to choose between different models and how to make sense of the evaluation results out there. (And, of course, how to measure progress when fine-tuning or developing our own.) Since this comes up so often, I thought it might be helpful to share a short overview of the main evaluation methods people use to compare LLMs. Of course, LLM evaluation is a very big topic that can’t be exhaustively covered in a single resource, but I think that having a clear mental map of these main approaches makes it much easier to interpret benchmarks, leaderboards, and papers. I originally planned to include these evaluation techniques in my upcoming book, Build a Reasoning Model (From Scratch) , but they ended up being a bit outside the main scope. (The book itself focuses more on verifier-based evaluation.) So I figured that sharing this as a longer article with from-scratch code examples would be nice. In Build A Reasoning Model (From Scratch) , I am taking a hands-on approach to building a reasoning LLM from scratch. If you liked “Build A Large Language Model (From Scratch)”, this book is written in a similar style in terms of building everything from scratch in pure PyTorch. Reasoning is one of the most exciting and important recent advances in improving LLMs, but it’s also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book , I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early-access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live. PS: There’s a lot happening on the LLM research front right now. I’m still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let’s discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses. There are four common ways of evaluating trained LLMs in practice: multiple choice , verifiers , leaderboards , and LLM judges , as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories. Figure 1: An overview of the 4 different evaluations models covered in this article. Furthermore the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation , as shown in the figure above. (There are also other measures, such as training loss, perplexity , and rewards , but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods. We begin with a benchmark‑based method: multiple‑choice question answering. Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu ). To illustrate this approach, figure 2 shows a representative task from the MMLU dataset. Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks, such as MMLU, test an LLM’s knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests. Note that figure 2 shows a simplified version of multiple-choice evaluation, where the model’s predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring . I implemented them here on GitHub . (As this builds on the concepts explained here, I recommended checking this out after completing this article.) The following subsections illustrate how the MMLU scoring shown in figure 2 can be implemented in code. First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub . Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience. Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. For this, we will work with an example from the MMLU dataset: Next, we define a function to format the LLM prompts. Let’s execute the function on the MMLU example to get an idea of what the formatted LLM input looks like: The output is: How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes? The model prompt, as shown above, provides the model with a list of the different answer choices and ends with an text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required. You can load examples from the MMLU dataset directly via the datasets library (which can be installed or ): Above, we used the subset; to get a list of the other subsets, use the following code: Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM: Then, with all that setup out of the way, we define the main scoring function below, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of letter A/B/C/D that the model prints. We can then check the generated letter using the function from the code block above as follows: The result is: As we can see, the generated answer is incorrect ( ) in this case. This was just one of the 270 examples from the subset in MMLU. The screenshot (Figure 4) below show’s the performance of the base model and reasoning variant when executed on the complete subset. The code for this is available here on GitHub . Figure 4: Base and reasoning model performance on the MMLU subset Assuming the questions have an equal answer probability, a random guesser (with uniform probability choosing A, B, C, or D) is expected to achieve 25% probability. So the both the base and reasoning model are not very good. Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model’s predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input. Figure 5: Other MMLU scoring methods are described and shared on GitHub here However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple‑choice benchmarks like MMLU is that they only measure an LLM’s ability to select from predefined options and thus is not very useful for evaluating reasoning capabilities besides checking if and how much knowledge the model has forgotten compared to the base model. It does not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn’t necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps. Related to multiple-choice question answering discussed in the previous section, verification-based approaches quantify the LLMs capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below. Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset. When we compare the extracted answer with the provided answer, as shown in figure above, we can employ external tools, such as code interpreters or calculator-like tools/software. The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development. I wrote a comprehensive 35-page on this topic in my “Build a Reasoning Model (From Scratch)” book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you’ll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub .) Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, none of the aforementioned methods evaluate LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely, LLM leaderboards. Figure 8: A mental model of the topics covered in this book with a focus on the judgment- and benchmark-based evaluation methods covered in this appendix. Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous section, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs. A popular leaderboard is LM Arena (formerly Chatbot Arena ), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9. Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer. These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10. Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9. The list below represents pairwise votes where the first model is the winner: In the list above, each tuple in the votes list represents a pairwise preference between two models, written as . So, means that a user preferred GPT-5 over a Claude-3 model answer. In the remainder of this section, we will turn the list into a leaderboard. For this, we will use the popular Elo rating system , which was originally developed for ranking chess players. Before we look at the concrete code implementation, in short, it works as follows. Each model starts with a baseline score. Then, after each comparison and the preference vote, the model’s rating is updated. (In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers a current model over a highly ranked model, the current model will get a relatively large ranking update and rank higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with ranking points getting subtracted instead of added.) The code to turn these pairwise rankings into a leaderboard is shown in the code block below. The function defined above takes the votes as input and turns it into a leaderboard, as follows: This results in the following leaderboard ranking, where the higher the score, the better: So, how does this work? For each pair, we compute the expected score of the winner using the following formula: This value is the model’s predicted chance to win in a no-draw setting based on the current ratings. It determines how large the rating update is. First, each model starts at . If the two ratings (winner and loser) are equal, we have , which indicates an even match. In this case, the updates are: Now, if a heavy favorite (a model with a high rating) wins, we have . The favorite gains only a small amount and the loser loses only a little: However, if an underdog (a model with a low rating) wins, we have , and the winner gets almost the full points while the loser loses about the same magnitude: The Elo approach updates ratings after each match (model comparisons), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can happen especially when an upset happens early versus late. To reduce this order effect, we can shuffle the votes pairs and run the function multiple times and average the ratings. Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development. The LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley–Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to the Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term “Elo” remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub . Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub . Method 4: Judging responses with other LLMs In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU , which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don’t account for synonyms, word changes, and so on. One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons as it involves human feedback (as well as the challenges that are associated with collecting this feedback). A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM’s response to a reference response and judge the response quality based on a pre-defined rubric, as illustrated in Figure 12. Figure F12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer. In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (E.g., one of the many examples is Phudge ; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is also that evaluating an answer is often easier than generating one. To implement a judge-based model evaluation as shown in Figue 12 programmatically in Python, we could either load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate. Alternatively, we can use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of the section, we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion parameter gpt-oss open-weight model by OpenAI as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article: Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text using LLMs (inference) and does not support training or fine-tuning LLMs. To execute the following code, please install Ollama by visiting the official website at https://ollama.com and follow the provided instructions for your operating system: For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select “yes.” For Linux users: Use the installation command available on the Ollama website. Before implementing the model evaluation code, let’s first download the gpt-oss model and verify that Ollama is functioning correctly by using it from the command line terminal. Execute the following command on the command line (not in a Python session) to try out the 20 billion parameter gpt-oss model: The first time you execute this command, the 20 billion parameter gpt-oss model, which takes up 14 GB of storage space, will be automatically downloaded. The output looks as follows: Note that the gpt-oss:20b in the ollama run gpt-oss:20b command refers to the 20 billion parameter gpt-oss model. Using Ollama with the gpt-oss:20b model requires approximately 13 GB of RAM. If your machine does not have sufficient RAM, you can try using a smaller model, such as the 4 billion parameter qwen3:4b model via ollama run qwen3:4b, which only requires around 4 GB of RAM. For more powerful computers, you can also use the larger 120-billion parameter gpt-oss model by replacing gpt-oss:20b with gpt-oss:120b. However, keep in mind that this model requires significantly more computational resources. Once the model download is complete, we are presented with a command-line interface that allows us to interact with the model. For example, try asking the model, “What is 1+2?”: You can end this ollama run gpt-oss:20b session using the input . You can end this ollama run gpt-oss:20b session using the input /bye. In the remainder of this section, we will use the ollama API. This approach requires that Ollama is running in the background. There are three different options to achieve this: 1. Run the command in the terminal (recommended). This runs the Ollama backend as a server, usually on . Note that it doesn’t load a model until it’s called through the API (later in this section). 2. Run the command similar to earlier, but keep it open and don’t exit the session via . As discussed earlier, this opens a minimal convenience wrapper around a local Ollama server. Behind the scenes, it uses the same server API as ollama serve. 3. Ollama desktop app. Opening the desktop app runs the same backend automatically and provides a graphical interface on top of it as shown in Figure 12 earlier. Figure 13: Two different options to keep the Ollama server (/application) running so we can use it via the Ollama API in Python. Ollama runs locally on our machine by starting a local server-like process. When running ollama serve in the terminal, as described above, you may encounter an error message saying . If that’s the case, try use the command (and if this address is also in use, try to increment the numbers by one until you find an address not in use.) The following code verifies that the Ollama session is running properly before we use Ollama to evaluate the test set responses generated in the previous section: Ensure that the output from executing the previous code displays Ollama running: . If it shows , please verify that the command or the Ollama application is actively running (see Figure 13). In the remainder of this article, we will interact with the local gpt-oss model, running on our machine, through the Ollama REST API using Python. The following function demonstrates how to use the API: Here’s an example of how to use the function that we just implemented: The resulting response is “3”. (It differs from what we’d get if we ran Ollama run or the Ollama application due to different default settings.) Using the function, we can evaluate the responses generated by our model with a prompt that includes a grading rubric asking the gpt-oss model to rate our target model’s responses on a scale from 1 to 5 based on a correct answer as a reference. The prompt we use for this is shown below: The in the is intended to represent the response produced by our own model in practice. For illustration purposes, we hardcode a plausible model answer here rather than generating it dynamically. (However, feel free to use the Qwen3 model we loaded at the beginning of this article to generate a real ). Next, let’s generate the rendered prompt for the Ollama model: The output is as follows: Ending the prompt in incentivizes the model to generate the answer. Let’s see how the gpt-oss:20b model judges the response: The response is as follows: As we can see, the answer receives the highest score, which is reasonable, as it is indeed correct. While this was a simple example stepping through the process manually, we could take this idea further and implement a for-loop that iteratively queries the model (for example, the Qwen3 model we loaded earlier) with questions from an evaluation dataset and evaluate it via gpt-oss and calculate the average score. You can find an implementation of such a script where we evaluate the Qwen3 model on the MATH-500 dataset here on GitHub . Figure 14: A comparison of the Qwen3 0.6 base and reasoning variants on the first 10 examples in MATH-500 evaluated by gpt-oss:20b as a judge. You can find the code here on GitHub . Related to symbolic verifiers and LLM judges, there is a class of learned models called process reward models (PRMs). Like judges, PRMs can evaluate reasoning traces beyond just the final answer, but unlike general judges, they focus specifically on the intermediate steps of reasoning. And unlike verifiers, which check correctness symbolically and usually only at the outcome level, PRMs provide step-by-step reward signals during training in reinforcement learning. We can categorize PRMs as “step-level judges,” which are predominantly developed for training, not pure evaluation. (In practice, PRMs are difficult to train reliably at scale. For example, DeepSeek R1 did not adopt PRMs and instead combined verifiers for the reasoning training.) Judge-based evaluations offer advantages over preference-based leaderboards, including scalability and consistency, as they do not rely on large pools of human voters. (Technically, it is possible to outsource the preference-based rating behind leaderboards to LLM judges as well). However, LLM judges also share similar weaknesses with human voters: results can be biased by model preferences, prompt design, and answer style. Also, there is a strong dependency on the choice of judge model and rubric, and they lack the reproducibility of fixed benchmarks. In this article, we covered four different evaluation approaches: multiple choice, verifiers, leaderboards, and LLM judges. I know this was a long article, but I hope you found it useful for getting an overview of how LLMs are evaluated. A from-scratch approach like this can be verbose, but it is a great way to understand how these methods work under the hood, which in turn helps us identify weaknesses and areas for improvement. That being said, you are probably wondering, “What is the best way to evaluate an LLM?” Unfortunately, there is no single best method since, as we have seen, each comes with different trade-offs. In short: Multiple-choice (+) Relatively quick and cheap to run at scale (+) Standardized and reproducible across papers (or model cards) (-) Measures basic knowledge recall (-) Does not reflect how LLMs are used in the real world Verifiers (+) Standardized, objective grading for domains with ground truth (+) Allows free-form answers (with some constraints on final answer formatting) (+) Can also score intermediate steps if using process verifiers or process reward models (-) Requires verifiable domains (for example, math or code), and building good verifiers can be tricky (-) Outcome-only verifiers evaluate only the final answer, not reasoning quality Arena-style leaderboards (human pairwise preference) (+) Directly answers “Which model do people prefer?” on real prompts (+) Allows free-form answers and implicitly accounts for style, helpfulness, and safety (-) Expensive and time-intensive for humans (-) Does not measure correctness, only preference (-) Nonstationary populations can affect stability LLM-as-a-judge (+) Scalable across many tasks (+) Allows free-form answers (-) Dependent on the judge’s capability (ensembles can make this more robust) (-) Depends on rubric choice While I am usually not a big fan of radar plots, one can be helpful here to visualize these different evaluation areas, as shown below. Figure 15: A radar chart showing conceptually that we ideally want to pay attention to different areas when evaluating an LLM to identify its strengths and weaknesses. For instance, a strong multiple-choice rating suggests that the model has solid general knowledge. Combine that with a strong verifier score, and the model is likely also answering technical questions correctly. However, if the model performs poorly on LLM-as-a-judge and leaderboard evaluations, it may struggle to write or articulate responses effectively and could benefit from some RLHF. So, the best evaluation combines multiple areas. But ideally it also uses data that directly aligns with your goals or business problems. For example, suppose you are implementing an LLM to assist with legal or law-related tasks. It makes sense to run the model on standard benchmarks like MMLU as a quick sanity check, but ultimately you will want to tailor the evaluations to your target domain, such as law. You can find public benchmarks online that serve as good starting points, but in the end, you will want to test with your own proprietary data. Only then can you be reasonably confident that the model has not already seen the test data during training. In any case, model evaluation is a very big and important topic. I hope this article was useful in explaining how the main approaches work, and that you took away a few useful insights for the next time you look at model evaluations or run them yourself. As always, Happy tinkering! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Reasoning is one of the most exciting and important recent advances in improving LLMs, but it’s also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book , I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early-access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live. PS: There’s a lot happening on the LLM research front right now. I’m still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let’s discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses. Understanding the main evaluation methods for LLMs There are four common ways of evaluating trained LLMs in practice: multiple choice , verifiers , leaderboards , and LLM judges , as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories. Figure 1: An overview of the 4 different evaluations models covered in this article. Furthermore the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation , as shown in the figure above. (There are also other measures, such as training loss, perplexity , and rewards , but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods. Method 1: Evaluating answer-choice accuracy We begin with a benchmark‑based method: multiple‑choice question answering. Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu ). To illustrate this approach, figure 2 shows a representative task from the MMLU dataset. Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks, such as MMLU, test an LLM’s knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests. Note that figure 2 shows a simplified version of multiple-choice evaluation, where the model’s predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring . I implemented them here on GitHub . (As this builds on the concepts explained here, I recommended checking this out after completing this article.) The following subsections illustrate how the MMLU scoring shown in figure 2 can be implemented in code. 1.2 Loading the model First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub . Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via or Code block 1: Loading a pre-trained model 1.3 Checking the generated answer letter In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience. Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. For this, we will work with an example from the MMLU dataset: Next, we define a function to format the LLM prompts. Code block 2: Loading a pre-trained model Let’s execute the function on the MMLU example to get an idea of what the formatted LLM input looks like: The output is: How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes? The model prompt, as shown above, provides the model with a list of the different answer choices and ends with an text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required. Loading different MMLU samples You can load examples from the MMLU dataset directly via the datasets library (which can be installed or ): Above, we used the subset; to get a list of the other subsets, use the following code: Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM: Then, with all that setup out of the way, we define the main scoring function below, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of letter A/B/C/D that the model prints. Code block 3: Extracting the generated letter We can then check the generated letter using the function from the code block above as follows: The result is: As we can see, the generated answer is incorrect ( ) in this case. This was just one of the 270 examples from the subset in MMLU. The screenshot (Figure 4) below show’s the performance of the base model and reasoning variant when executed on the complete subset. The code for this is available here on GitHub . Figure 4: Base and reasoning model performance on the MMLU subset Assuming the questions have an equal answer probability, a random guesser (with uniform probability choosing A, B, C, or D) is expected to achieve 25% probability. So the both the base and reasoning model are not very good. Multiple-choice answer formats Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model’s predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input. Figure 5: Other MMLU scoring methods are described and shared on GitHub here However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple‑choice benchmarks like MMLU is that they only measure an LLM’s ability to select from predefined options and thus is not very useful for evaluating reasoning capabilities besides checking if and how much knowledge the model has forgotten compared to the base model. It does not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn’t necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps. Method 2: Using verifiers to check answers Related to multiple-choice question answering discussed in the previous section, verification-based approaches quantify the LLMs capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below. Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset. When we compare the extracted answer with the provided answer, as shown in figure above, we can employ external tools, such as code interpreters or calculator-like tools/software. The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development. I wrote a comprehensive 35-page on this topic in my “Build a Reasoning Model (From Scratch)” book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you’ll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub .) Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub Method 3: Comparing models using preferences and leaderboards So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, none of the aforementioned methods evaluate LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely, LLM leaderboards. Figure 8: A mental model of the topics covered in this book with a focus on the judgment- and benchmark-based evaluation methods covered in this appendix. Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous section, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs. A popular leaderboard is LM Arena (formerly Chatbot Arena ), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9. Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer. These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10. Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9. The list below represents pairwise votes where the first model is the winner: In the list above, each tuple in the votes list represents a pairwise preference between two models, written as . So, means that a user preferred GPT-5 over a Claude-3 model answer. In the remainder of this section, we will turn the list into a leaderboard. For this, we will use the popular Elo rating system , which was originally developed for ranking chess players. Before we look at the concrete code implementation, in short, it works as follows. Each model starts with a baseline score. Then, after each comparison and the preference vote, the model’s rating is updated. (In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers a current model over a highly ranked model, the current model will get a relatively large ranking update and rank higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with ranking points getting subtracted instead of added.) The code to turn these pairwise rankings into a leaderboard is shown in the code block below. Code block 4: Constructing a leaderboard The function defined above takes the votes as input and turns it into a leaderboard, as follows: This results in the following leaderboard ranking, where the higher the score, the better: So, how does this work? For each pair, we compute the expected score of the winner using the following formula: This value is the model’s predicted chance to win in a no-draw setting based on the current ratings. It determines how large the rating update is. First, each model starts at . If the two ratings (winner and loser) are equal, we have , which indicates an even match. In this case, the updates are: Now, if a heavy favorite (a model with a high rating) wins, we have . The favorite gains only a small amount and the loser loses only a little: However, if an underdog (a model with a low rating) wins, we have , and the winner gets almost the full points while the loser loses about the same magnitude: Order matters The Elo approach updates ratings after each match (model comparisons), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can happen especially when an upset happens early versus late. To reduce this order effect, we can shuffle the votes pairs and run the function multiple times and average the ratings. Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development. Other ranking methods The LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley–Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to the Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term “Elo” remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub . Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub . Method 4: Judging responses with other LLMs In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU , which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don’t account for synonyms, word changes, and so on. One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons as it involves human feedback (as well as the challenges that are associated with collecting this feedback). A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM’s response to a reference response and judge the response quality based on a pre-defined rubric, as illustrated in Figure 12. Figure F12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer. In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (E.g., one of the many examples is Phudge ; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is also that evaluating an answer is often easier than generating one. To implement a judge-based model evaluation as shown in Figue 12 programmatically in Python, we could either load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate. Alternatively, we can use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of the section, we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion parameter gpt-oss open-weight model by OpenAI as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article: 4.1 Implementing a LLM-as-a-judge approach in Ollama Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text using LLMs (inference) and does not support training or fine-tuning LLMs. To execute the following code, please install Ollama by visiting the official website at https://ollama.com and follow the provided instructions for your operating system: For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select “yes.” For Linux users: Use the installation command available on the Ollama website.

0 views

The RAG Obituary: Killed by Agents, Buried by Context Windows

I’ve been working in AI and search for a decade. First building Doctrine, the largest European legal search engine and now building Fintool , an AI-powered financial research platform that helps institutional investors analyze companies, screen stocks, and make investment decisions. After three years of building, optimizing, and scaling LLMs with retrieval-augmented generation (RAG) systems, I believe we’re witnessing the twilight of RAG-based architectures. As context windows explode and agent-based architectures mature, my controversial opinion is that the current RAG infrastructure we spent so much time building and optimizing is on the decline. In late 2022, ChatGPT took the world by storm. People started endless conversations, delegating crucial work only to realize that the underlying model, GPT-3.5 could only handle 4,096 tokens... roughly six pages of text! The AI world faced a fundamental problem: how do you make an intelligent system work with knowledge bases that are orders of magnitude larger than what it can read at once? The answer became Retrieval-Augmented Generation (RAG), an architectural pattern that would dominate AI for the next three years. GPT-3.5 could handle 4,096 token and the next model GPT-4 doubled it to 8,192 tokens, about twelve pages. This wasn’t just inconvenient; it was architecturally devastating. Consider the numbers: A single SEC 10-K filing contains approximately 51,000 tokens (130+ pages). With 8,192 tokens, you could see less than 16% of a 10-K filing. It’s like reading a financial report through a keyhole! RAG emerged as an elegant solution borrowed directly from search engines. Just as Google displays 10 blue links with relevant snippets for your query, RAG retrieves the most pertinent document fragments and feeds them to the LLM for synthesis. The core idea is beautifully simple: if you can’t fit everything in context, find the most relevant pieces and use those . It turns LLMs into sophisticated search result summarizers. Basically, LLMs can’t read the whole book but they can know who dies at the end; convenient! Long documents need to be chunked into pieces and it’s when problems start. Those digestible pieces are typically 400-1,000 tokens each which is basically 300-750 words. The problem? It isn’t as simple as cutting every 500 words. Consider chunking a typical SEC 10-K annual report. The document has a complex hierarchical structure: - Item 1: Business Overview (10-15 pages) - Item 1A: Risk Factors (20-30 pages) - Item 7: Management’s Discussion and Analysis (30-40 pages) - Item 8: Financial Statements (40-50 pages) After naive chunking at 500 tokens, critical information gets scattered: - Revenue recognition policies split across 3 chunks - A risk factor explanation broken mid-sentence - Financial table headers separated from their data - MD&A narrative divorced from the numbers it’s discussing If you search for “revenue growth drivers,” you might get a chunk mentioning growth but miss the actual numerical data in a different chunk, or the strategic context from MD&A in yet another chunk! At Fintool, we’ve developed sophisticated chunking strategies that go beyond naive text splitting: - Hierarchical Structure Preservation : We maintain the nested structure from Item 1 (Business) down to sub-sections like geographic segments, creating a tree-like document representation - Table Integrity : Financial tables are never split—income statements, balance sheets, and cash flow statements remain atomic units with headers and data together - Cross-Reference Preservation : We maintain links between narrative sections and their corresponding financial data, preserving the “See Note X” relationships - Temporal Coherence : Year-over-year comparisons and multi-period analyses stay together as single chunks - Footnote Association : Footnotes remain connected to their referenced items through metadata linking Each chunk at Fintool is enriched with extensive metadata: - Filing type (10-K, 10-Q, 8-K) - Fiscal period and reporting date - Section hierarchy (Item 7 > Liquidity > Cash Position) - Table identifiers and types - Cross-reference mappings - Company identifiers (CIK, ticker) - Industry classification codes This allows for more accurate retrieval but even our intelligent chunking can’t solve the fundamental problem: we’re still working with fragments instead of complete documents! Once you have the chunks, you need a way to search them. One way is to embed your chunks. Each chunk is converted into a high‑dimensional vector (typically 1,536 dimensions in most embedding models). These vectors live in a space where, theoretically, similar concepts are close together. When a user asks a question, that question also becomes a vector. The system finds the chunks whose vectors are closest to the query vector using cosine similarity. It’s elegant in theory and in practice, it’s a nightmare of edge cases. Embedding models are trained on general text and struggle with specific terminologies. They find similarities but they can’t distinguish between “revenue recognition” (accounting policy) and “revenue growth” (business performance). Consider that example: Query: “ What is the company’s litigation exposure ? RAG searches for “litigation” and returns 50 chunks: - Chunks 1-10: Various mentions of “litigation” in boilerplate risk factors - Chunks 11-20: Historical cases from 2019 (already settled) - Chunks 21-30: Forward-looking safe harbor statements - Chunks 31-40: Duplicate descriptions from different sections - Chunks 41-50: Generic “we may face litigation” warnings What RAG Reports: $500M in litigation (from Legal Proceedings section) What’s Actually There: - $500M in Legal Proceedings (Item 3) - $700M in Contingencies note (”not material individually”) - $1B new class action in Subsequent Events - $800M indemnification obligations (different section) - $2B probable losses in footnotes (keyword “probable” not “litigation”) The actual Exposure is $5.1B. 10x what RAG found. Oupsy! By late 2023, most builders realized pure vector search wasn’t enough. Enter hybrid search: combine semantic search (embeddings) with the traditional keyword search (BM25). This is where things get interesting. BM25 (Best Matching 25) is a probabilistic retrieval model that excels at exact term matching. Unlike embeddings, BM25: - Rewards Exact Matches : When you search for “EBITDA,” you get documents with “EBITDA,” not “operating income” or “earnings” - Handles Rare Terms Better : Financial jargon like “CECL” (Current Expected Credit Losses) or “ASC 606” gets proper weight - Document Length Normalization : Doesn’t penalize longer documents - Term Frequency Saturation : Multiple mentions of “revenue” don’t overshadow other important terms At Fintool, we’ve built a sophisticated hybrid search system: 1. Parallel Processing : We run semantic and keyword searches simultaneously 2. Dynamic Weighting : Our system adjusts weights based on query characteristics: - Specific financial metrics? BM25 gets 70% weight - Conceptual questions? Embeddings get 60% weight - Mixed queries? 50/50 split with result analysis 3. Score Normalization : Different scoring scales are normalized using: - Min-max scaling for BM25 scores - Cosine similarity already normalized for embeddings - Z-score normalization for outlier handling So at the end the embeddings search and the keywords search retrieve chunks and the search engine combines them using Reciprocal Rank Fusion. RRF merges rankings so items that consistently appear near the top across systems float higher, even if no system put them at #1! So now you think it’s done right? But hell no! Here’s what nobody talks about: even after all that retrieval work, you’re not done. You need to rerank the chunks one more time to get a good retrieval and it’s not easy. Rerankers are ML models that take the search results and reorder them by relevance to your specific query limiting the number of chunks sent to the LLM. Not only LLMs are context poor, they also struggle when dealing with too much information . It’s vital to reduce the number of chunks sent to the LLM for the final answer. The Reranking Pipeline: 1. Initial search retrieval with embeddings + keywords gets you 100-200 chunks 2. Reranker ranks the top 10 3. Top 10 are fed to the LLM to answer the question Here is the challenge with reranking: - Latency Explosion : Rerank adds between 300-2000ms per query. Ouch. - Cost Multiplication : it adds significant extra cost to every query. For instance, Cohere Rerank 3.5 costs $2.00 per 1,000 search units, making reranking expensive. - Context Limits : Rerankers typically handle few chunks (Cohere Rerank supports only 4096 tokens), so if you need to re-rank more than that, you have to split it into different parallel API calls and merge them! - Another Model to Manage : One more API, one more failure point Re-rank is one more step in a complex pipeline. What I find difficult with RAG is what I call the “cascading failure problem”. 1. Chunking can fail (split tables) or be too slow (especially when you have to ingest and chunk gigabytes of data in real-time) 2. Embedding can fail (wrong similarity) 3. BM25 can fail (term mismatch) 4. Hybrid fusion can fail (bad weights) 5. Reranking can fail (wrong priorities) Each stage compounds the errors of the previous stage. Beyond the complexity of hybrid search itself, there’s an infrastructure burden that’s rarely discussed. Running production Elasticsearch is not easy. You’re looking at maintaining TB+ of indexed data for comprehensive document coverage, which requires 128-256GB RAM minimum just to get decent performance. The real nightmare comes with re-indexing. Every schema change forces a full re-indexing that takes 48-72 hours for large datasets. On top of that, you’re constantly dealing with cluster management, sharding strategies, index optimization, cache tuning, backup and disaster recovery, and version upgrades that regularly include breaking changes. Here are some structural limitations: 1. Context Fragmentation - Long documents are interconnected webs, not independent paragraphs - A single question might require information from 20+ documents - Chunking destroys these relationships permanently 2. Semantic Search Fails on Numbers - “$45.2M” and “$45,200,000” have different embeddings - “Revenue increased 10%” and “Revenue grew by a tenth” rank differently - Tables full of numbers have poor semantic representations 3. No Causal Understanding - RAG can’t follow “See Note 12” → Note 12 → Schedule K - Can’t understand that discontinued operations affect continuing operations - Can’t trace how one financial item impacts another 4. The Vocabulary Mismatch Problem - Companies use different terms for the same concept - “Adjusted EBITDA” vs “Operating Income Before Special Items” - RAG retrieves based on terms, not concepts 5. Temporal Blindness - Can’t distinguish Q3 2024 from Q3 2023 reliably - Mixes current period with prior period comparisons - No understanding of fiscal year boundaries These aren’t minor issues. They’re fundamental limitations of the retrieval paradigm. Three months ago I stumbled on an innovation on retrievial that blew my mind In May 2025, Anthropic released Claude Code, an AI coding agent that works in the terminal. At first, I was surprised by the form factor. A terminal? Are we back in 1980? no UI? Back then, I was using Cursor, a product that excelled at traditional RAG. I gave it access to my codebase to embed my files and Cursor ran a search n my codebase before answering my query. Life was good. But when testing Claude Code, one thing stood out: It was better and faster and not because their RAG was better but because there was no RAG. Instead of a complex pipeline of chunking, embedding, and searching, Claude Code uses direct filesystem tools: 1. Grep (Ripgrep) - Lightning-fast regex search through file contents - No indexing required. It searches live files instantly - Full regex support for precise pattern matching - Can filter by file type or use glob patterns - Returns exact matches with context lines - Direct file discovery by name patterns - Finds files like `**/*.py` or `src/**/*.ts` instantly - Returns files sorted by modification time (recency bias) - Zero overhead—just filesystem traversal 3. Task Agents - Autonomous multi-step exploration - Handle complex queries requiring investigation - Combine multiple search strategies adaptively - Build understanding incrementally - Self-correct based on findings By the way, Grep was invented in 1973. It’s so... primitive. And that’s the genius of it. Claude Code doesn’t retrieve. It investigates: - Runs multiple searches in parallel (Grep + Glob simultaneously) - Starts broad, then narrows based on discoveries - Follows references and dependencies naturally - No embeddings, no similarity scores, no reranking It’s simple, it’s fast and it’s based on a new assumption that LLMs will go from context poor to context rich. Claude Code proved that with sufficient context and intelligent navigation, you don’t need RAG at all. The agent can: - Load entire files or modules directly - Follow cross-references in real-time - Understand structure and relationships - Maintain complete context throughout investigation This isn’t just better than RAG—it’s a fundamentally different paradigm. And what works for code can work for any long documents that are not coding files. The context window explosion made Claude Code possible: 2022-2025 Context-Poor Era: - GPT-4: 8K tokens (~12 pages) - GPT-4-32k: 32K tokens (~50 pages) 2025 and beyond Context Revolution: - Claude Sonnet 4: 200k tokens (~700 pages) - Gemini 2.5: 1M tokens (~3,000 pages) - Grok 4-fast: 2M tokens (~6,000 pages) At 2M tokens, you can fit an entire year of SEC filings for most companies. The trajectory is even more dramatic: we’re likely heading toward 10M+ context windows by 2027, with Sam Altman hinting at billions of context tokens on the horizon. This represents a fundamental shift in how AI systems process information. Equally important, attention mechanisms are rapidly improving—LLMs are becoming far better at maintaining coherence and focus across massive context windows without getting “lost” in the noise. Claude Code demonstrated that with enough context, search becomes navigation: - No need to retrieve fragments when you can load complete files - No need for similarity when you can use exact matches - No need for reranking when you follow logical paths - No need for embeddings when you have direct access It’s mind-blowing. LLMs are getting really good at agentic behaviors meaning they can organize their work into tasks to accomplish an objective. Here’s what tools like ripgrep bring to the search table: - No Setup : No index. No overhead. Just point and search. - Instant Availability : New documents are searchable the moment they hit the filesystem (no indexing latency!) - Zero Maintenance : No clusters to manage, no indices to optimize, no RAM to provision - Blazing Fast : For a 100K line codebase, Elasticsearch needs minutes to index. Ripgrep searches it in milliseconds with zero prep. - Cost : $0 infrastructure cost vs a lot of $$$ for Elasticsearch So back to our previous example on SEC filings. An agent can SEC filing structure intrinsically: - Hierarchical Awareness : Knows that Item 1A (Risk Factors) relates to Item 7 (MD&A) - Cross-Reference Following : Automatically traces “See Note 12” references - Multi-Document Coordination : Connects 10-K, 10-Q, 8-K, and proxy statements - Temporal Analysis : Compares year-over-year changes systematically For searches across thousands of companies or decades of filings, it might still use hybrid search, but now as a tool for agents: - Initial broad search using hybrid retrieval - Agent loads full documents for top results - Deep analysis within full context - Iterative refinement based on findings My guess is traditional RAG is now a search tool among others and that agents will always prefer grep and reading the whole file because they are context rich and can handle long-running tasks. Consider our $6.5B lease obligation question as an example: Step 1: Find “lease” in main financial statements → Discovers “See Note 12” Step 2: Navigate to Note 12 → Finds “excluding discontinued operations (Note 23)” Step 3: Check Note 23 → Discovers $2B additional obligations Step 4: Cross-reference with MD&A → Identifies management’s explanation and adjustments Step 5: Search for “subsequent events” → Finds post-balance sheet $500M lease termination Final answer: $5B continuing + $2B discontinued - $500M terminated = $6.5B The agent follows references like a human analyst would. No chunks. No embeddings. No reranking. Just intelligent navigation. Basically, RAG is like a research assistant with perfect memory but no understanding: - “Here are 50 passages that mention debt” - Can’t tell you if debt is increasing or why - Can’t connect debt to strategic changes - Can’t identify hidden obligations - Just retrieves text, doesn’t comprehend relationships Agentic search is like a forensic accountant: - Follows the money systematically - Understands accounting relationships (assets = liabilities + equity) - Identifies what’s missing or hidden - Connects dots across time periods and documents - Challenges management assertions with data 1. Increasing Document Complexity - Documents are becoming longer and more interconnected - Cross-references and external links are proliferating - Multiple related documents need to be understood together - Systems must follow complex trails of information 2. Structured Data Integration - More documents combine structured and unstructured data - Tables, narratives, and metadata must be understood together - Relationships matter more than isolated facts - Context determines meaning 3. Real-Time Requirements - Information needs instant processing - No time for re-indexing or embedding updates - Dynamic document structures require adaptive approaches - Live data demands live search 4. Cross-Document Understanding Modern analysis requires connecting multiple sources: - Primary documents - Supporting materials - Historical versions - Related filings RAG treats each document independently. Agentic search builds cumulative understanding. 5. Precision Over Similarity - Exact information matters more than similar content - Following references beats finding related text - Structure and hierarchy provide crucial context - Navigation beats retrieval The evidence is becoming clear. While RAG served us well in the context-poor era, agentic search represents a fundamental evolution. The potential benefits of agentic search are compelling: - Elimination of hallucinations from missing context - Complete answers instead of fragments - Faster insights through parallel exploration - Higher accuracy through systematic navigation - Massive infrastructure cost reduction - Zero index maintenance overhead The key insight? Complex document analysis—whether code, financial filings, or legal contracts—isn’t about finding similar text. It’s about understanding relationships, following references, and maintaining precision. The combination of large context windows and intelligent navigation delivers what retrieval alone never could. RAG was a clever workaround for a context-poor era . It helped us bridge the gap between tiny windows and massive documents, but it was always a band-aid. The future won’t be about splitting documents into fragments and juggling embeddings. It will be about agents that can navigate, reason, and hold entire corpora in working memory. We are entering the post-retrieval age. The winners will not be the ones who maintain the biggest vector databases, but the ones who design the smartest agents to traverse abundant context and connect meaning across documents. In hindsight, RAG will look like training wheels. Useful, necessary, but temporary. The next decade of AI search will belong to systems that read and reason end-to-end. Retrieval isn’t dead—it’s just been demoted.

0 views
Gregory Gundersen 2 months ago

A History of Large Language Models

Large language models (LLMs) still feel a bit like magic to me. Of course, I understand the general machinery enough to know that they aren’t, but the gap between my outdated knowledge of the field and the state-of-the-art feels especially large right now. Things are moving fast. So six months ago, I decided to close that gap just a little by digging into what I believed was one of the core primitives underpinning LLMs: the attention mechanism in neural networks. I started by reading one of the landmark papers in the literature, which was published by Google Brain in 2017 under the catchy title Attention is all you need (Vaswani et al., 2017) . As the title suggests, the authors did not invent the attention mechanism. Rather, they introduced a neural network architecture which in was some sense “all attention”. This architecture is the now-famous transformer . Clearly the transformer stands in contrast to whatever came before it, but what was that and what did the transformer do differently? To answer these questions, I read a lot of papers, and the context that felt natural to provide here grew the more that I read. I went down the rabbit hole, and when I came out, I realized that what had started as a study of attention had grown into a bigger story. Attention is still the throughline, but there are other important themes, such as how neural networks generalize and the bitter lesson that simple methods that scale seem to triumph over clever methods which do not. This post is the product of that deep dive, and it is a stylized history of LLMs. As a caveat, real life is endlessly detailed, and any summary or synthesis inevitably flattens this detail. So I will accidentally or intentionally skip over many important and related papers and ideas in the service of a synthesis. I will also skip over practicalities such as data preprocessing and advances in hardware and computing. My focus will be on what I view as the main methodological landmarks, and this history is simply one of many ways to tell this story. I’ll start with an old idea, one so ubiquitous today that it might seem silly to belabor here. The idea is that neural networks automatically generalize using distributed representations . This idea has its roots in computational neuroscience, particularly Connectionism (McCulloch & Pitts, 1943) and was discussed explicitly in the 1980s in papers like Learning representations by back-propagating errors (Rumelhart et al., 1986) and Learning distributed representations of concepts (Hinton, 1986) . Understanding it is key to understanding why LLMs work at all and thus understanding the long line of academic research driving towards them. But first, a problem. The goal of natural language processing (NLP) is to model human language using computers. Until the 1980s, NLP systems were mostly based on handwritten rules and handcrafted features. However, by the early 1990s, researchers were exploring the use of statistical methods from machine learning. For an early and seminal example, see A statistical approach to machine translation (Brown et al., 1990) . The core idea of statistical NLP is to model human language using a statistical language model , which is a probability distribution over all possible sequences in a language. This distribution is typically factorized such that each word depends on all words that precede it: p ( w 1 : T ) = ∏ t = 1 T p ( w t ∣ w 1 : t − 1 ) . (1) p(w_{1:T}) = \prod_{t=1}^T p\left(w_t \mid w_{1:t-1} \right). \tag{1} p ( w 1 : T ​ ) = t = 1 ∏ T ​ p ( w t ​ ∣ w 1 : t − 1 ​ ) . ( 1 ) Throughout this post, I will use the notation w i : j w_{i:j} w i : j ​ to denote elements in a sequence from positions i i i to j j j inclusive (where i ≤ j i \leq j i ≤ j ): w i : j : = { w i , w i + 1 , … , w j − 1 , w j } . (2) w_{i:j} := \{w_i, w_{i+1}, \dots, w_{j-1}, w_j\}. \tag{2} w i : j ​ : = { w i ​ , w i + 1 ​ , … , w j − 1 ​ , w j ​ } . ( 2 ) Given a good statistical model p ( w 1 : T ) p(w_{1:T}) p ( w 1 : T ​ ) , we can do many things. For example, we can rank the likelihood of different sequences of words and use that ranking to decide on things like a conversational agent’s output. Or we can translate a source sequence s 1 : T s_{1:T} s 1 : T ​ into a target sequence w 1 : T w_{1:T} w 1 : T ​ if we have the conditional probabilities between the two: p ( w 1 : T ∣ s 1 : T ) ∝ p ( s 1 : T ∣ w 1 : T ) p ( w 1 : T ) . (3) p(w_{1:T} \mid s_{1:T}) \propto p(s_{1:T} \mid w_{1:T}) p(w_{1:T}). \tag{3} p ( w 1 : T ​ ∣ s 1 : T ​ ) ∝ p ( s 1 : T ​ ∣ w 1 : T ​ ) p ( w 1 : T ​ ) . ( 3 ) Here, p ( w 1 : T ) p(w_{1:T}) p ( w 1 : T ​ ) would be our language model of the target language, and p ( s 1 : T ∣ w 1 : T ) p(s_{1:T} \mid w_{1:T}) p ( s 1 : T ​ ∣ w 1 : T ​ ) would be our translation model . Today, this view is so pervasive that it might feel obvious, but with a little imagination, I think it’s easy to see how wrong this might have felt to a linguist forty-odd years ago. Equation 1 1 1 captures no language structure or parts of speech such as nouns or verbs or adjectives—see e.g. (Chomsky, 1956) on formal grammars . Instead, it reduces the complexity of human language to next-word prediction. If we didn’t know already that this worked, we might doubt that it would. More importantly for us, estimating the model in Equation 1 1 1 is hard! The main challenge is the curse of dimensionality . There are many, many words in a vocabulary. For example, linguists estimate that English has roughly a million words, give or take a few hundred thousand depending on how you count them. Furthermore, this problem explodes in some tasks such as translation, where there are many possible conditional probabilities p ( s 1 : T ∣ w 1 : T ) p(s_{1:T} \mid w_{1:T}) p ( s 1 : T ​ ∣ w 1 : T ​ ) . So when estimating the conditional probabilities of our language model, we cannot possibly encounter all possible combinations. We have a data sparsity problem, and estimating the true probabilities becomes impossible. Perhaps the oldest idea to tackle this problem was proposed in Andrey Markov’s pioneering mathematical analysis of Pushkin’s Eugene Onegin (Markov, 1913) . He made the assumption that each conditional probability in Equation 1 1 1 only depends on the previous N N N terms: p ( w 1 : T ) = ∏ t = 1 T p ( w t ∣ w 1 : t − 1 ) ≈ ∏ t = 1 T p ( w t ∣ w t − N : t − 1 ) . (4) p(w_{1:T}) = \prod_{t=1}^T p \left( w_t \mid w_{1:t-1} \right) \approx \prod_{t=1}^T p \left(w_t \mid w_{t-N:t-1} \right). \tag{4} p ( w 1 : T ​ ) = t = 1 ∏ T ​ p ( w t ​ ∣ w 1 : t − 1 ​ ) ≈ t = 1 ∏ T ​ p ( w t ​ ∣ w t − N : t − 1 ​ ) . ( 4 ) Today, we would call this a “Markov assumption”, and Equation 4 4 4 is the famous N N N -gram model . Particularly for small N N N , say N = 1 N=1 N = 1 or N = 2 N=2 N = 2 , we might be able to get reasonable estimates of data. But here is the problem, and this problem is a central theme driving towards the attention mechanism: the Markov assumption destroys context . Without more context, a language model can never replicate the complexity and nuance of natural language. As I understand it, this was conceptually the state of the field circa 2000. But then in 2003, a seminal paper was published: A neural probabilistic language model (Bengio et al., 2003) . In that paper, the authors proposed a novel idea: to avoid this data sparsity problem, this curse of dimensionality, we can use neural networks to learn a language model using what they call “distributed representations” of words. (Today, we might call these “word embeddings”.) They proposed three core ideas. First, they represented each word as a real-valued vector or embedding; then, they expressed Equation 1 1 1 in terms of these embeddings; and finally, they trained a neural network to simultaneously learn the embeddings and the parameters of the probability function (neural network) in Equation 1 1 1 using back-propagation (Rumelhart et al., 1986) . That’s a lot, so let’s break it down a bit. Our goal here is to learn a good model f Θ f_{\Theta} f Θ ​ of natural language such that p ( w t ∣ w 1 : t − 1 ) ≈ f Θ ( w t − 1 , … , w t − N ) . (5) p(w_t \mid w_{1:t-1}) \approx f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}). \tag{5} p ( w t ​ ∣ w 1 : t − 1 ​ ) ≈ f Θ ​ ( w t − 1 ​ , … , w t − N ​ ) . ( 5 ) So the left-hand side is the true conditional distribution, capturing next-word prediction. It’s the goal of language modeling. But in practice, modeling the full context is hard. So we settle for the right hand side, which is a parametric approximation f Θ f_{\boldsymbol{\Theta}} f Θ ​ of this true distribution with context window of size N N N . In Bengio, they model f Θ f_{\boldsymbol{\Theta}} f Θ ​ using two components. First, they represent words as vectors. Let V \mathcal{V} V denote our vocabulary, which is simply a set of integers V = { 1 , 2 , … , V } \mathcal{V} = \{1, 2,\dots, V\} V = { 1 , 2 , … , V } indexing all V V V words in a language. We will represent each word as a D D D -vector, and so we can represent the entire language as a matrix C ∈ R V × D \mathbf{C} \in \mathbb{R}^{V \times D} C ∈ R V × D (Figure 1 1 1 ). Now for the t t t -th word in a sequence w 1 : T w_{1:T} w 1 : T ​ , we have an associated index in the vocabulary, which we will denote as I ( w t ) ∈ V I(w_t) \in \mathcal{V} I ( w t ​ ) ∈ V . This notation might be a bit odd, but I’m careful here because w t w_t w t ​ is not a well-defined mathematical object, and it cannot index C \mathbf{C} C . But I ( w t ) I(w_t) I ( w t ​ ) is an integer and can index C \mathbf{C} C , and so c I ( w t ) \mathbf{c}_{I(w_t)} c I ( w t ​ ) ​ is a D D D -dimensional vector (a row vector of C \mathbf{C} C ) representing the I ( w t ) I(w_t) I ( w t ​ ) -th word in the vocabulary, associated with the t t t -th word in the sequence. This vector is what we are calling an “embedding” or “distributed representation”. Second, Bengio et al represent the probability function over words (Equation 1 1 1 ) as as a feed-forward neural network g g g with parameters Ω \boldsymbol{\Omega} Ω and arguments C \mathbf{C} C : f Θ ( w t − 1 , … , w t − N ) = g Ω ( c I ( w t − 1 ) , … , c I ( w t − N ) ) . (6) f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}) = g_{\boldsymbol{\Omega}}\left(\mathbf{c}_{I(w_{t-1})}, \dots, \mathbf{c}_{I(w_{t-N})}\right). \tag{6} f Θ ​ ( w t − 1 ​ , … , w t − N ​ ) = g Ω ​ ( c I ( w t − 1 ​ ) ​ , … , c I ( w t − N ​ ) ​ ) . ( 6 ) They then use back-propagation to jointly estimate the parameters Θ : = { C , Ω } . (7) \boldsymbol{\Theta} := \{\mathbf{C}, \boldsymbol{\Omega}\}. \tag{7} Θ : = { C , Ω } . ( 7 ) In other words, they learn the neural network parameters Ω \boldsymbol{\Omega} Ω at the same time as learning the word embeddings C \mathbf{C} C . Note that “distributed representation” can refer to either the continuously-valued vector, e.g. word embedding, or the concept distributed across neurons. This duality is exemplified in C \mathbf{C} C which is both a set of learnable parameters and the embeddings themselves! Why might this work? The authors explain the idea so well that it’s worth just quoting the original paper: In the proposed model, it will so generalize because “similar” words are expected to have a similar feature vector, and because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability. Therefore, the presence of only one of the above sentences in the training data will increase the probability, not only of that sentence, but also of its combinatorial number of “neighbors” in sentence space. This is a beautiful idea. If we have word embeddings that are “well-organized” in the sense that words that play similar roles in sentences (semantically and syntactically) have similar embeddings and if we have a smooth function from word embeddings to probabilities, then small changes in words lead to small changes in embeddings which lead to small changes in probabilities (Figure 2 2 2 ). Pause for a moment to really think about this. Words are discrete objects, and a “small change in a word”, while intuitive to humans, is ill-defined. But this approach concretizes what that means. To quote the paper Linguistic regularities in continuous space word representations (Mikolov et al., 2013) , which we’ll discuss later: Whereas an N N N -gram model works in terms of discrete units that have no inherent relationship to one another, a continuous space model works in terms of word vectors where similar words are likely to have similar vectors. Thus, when the model parameters are adjusted in response to a particular word or word-sequence, the improvements will carry over to occurrences of similar words and sequences. For example, if the words “dog” and “cat” are nearby in word-embedding space, then maybe “The cat is walking on the sidewalk” and “The dog is walking on the sidewalk” should have similar probabilities. And only one of these two sentences would need to exist in the training data for the model to generalize well to both sentences! As I mentioned, this idea was not entirely new in 2003. Since the 1980s, researchers had known that neural networks can generalize because they distribute their representation across many neurons (Hinton, 1986) . Each new example modifies the weights, incorporating new knowledge into the old. However (Bengio et al., 2003) is a landmark paper in NLP because it was the first application of this idea to language modeling. The Bengio paper took seriously the idea that we could build a statistical model of language using the distributed representations of words. It was the first hint that we could use neural networks to overcome the curse of dimensionality that plagued statistical NLP. This is a promising idea, but we glossed over an important detail: how do we actually train this model? What is the loss function or objective that the neural network should use? And given a fit model, how do we generate a new sequence? These are important questions to answer per se, but they are also important questions because, at a conceptual level, there is really no difference between Bengio’s model and the frontier large language models today. So understanding this is critical to understanding LLMs. Both are autoregressive models and trained using next-word prediction . As an example, imagine we have the following input sentence, which is a quote from Virginia Woolf’s A Room of One’s Own : “Intellectual freedom depends upon material things.” (8) \text{``Intellectual freedom depends upon material things.''} \tag{8} “Intellectual freedom depends upon material things.” ( 8 ) Now imagine that our model’s context window has size N = 2 N=2 N = 2 and let c p \mathbf{c}_p c p ​ denote a padding D D D -vector of all zeros. In Bengio’s model, we would start by representing just the first word, “intellectual”, as a word embedding. So the first non-zero input to our model would be: x 2 = [ c p c I ( w 1 ) ] = [ c p c I ( “intellectual” ) ] . (9) \mathbf{x}_2 = \left[ \begin{array}{l} \mathbf{c}_p \\ \mathbf{c}_{I(w_1)} \end{array} \right] = \left[ \begin{array}{l} \mathbf{c}_p \\ \mathbf{c}_{I(\text{``intellectual''})} \end{array} \right]. \tag{9} x 2 ​ = [ c p ​ c I ( w 1 ​ ) ​ ​ ] = [ c p ​ c I ( “intellectual” ) ​ ​ ] . ( 9 ) The output of the neural network would be a V V V -dimensional vector representing the probability distribution over p ( w 2 ∣ w 1 ) p(w_2 \mid w_1) p ( w 2 ​ ∣ w 1 ​ ) . Illustratively: y 2 = [ p ( w 2 = “about” ) p ( w 2 = “above” )       ⋮ p ( w 2 = “freedom” )       ⋮ ] . (10) \mathbf{y}_2 = \left[ \begin{array}{l} p(w_2 = \text{``about''}) \\ p(w_2 = \text{``above''}) \\ \qquad\;\;\vdots \\ p(w_2 = \text{``freedom''}) \\ \qquad\;\;\vdots \\ \end{array} \right]. \tag{10} y 2 ​ = ⎣ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎡ ​ p ( w 2 ​ = “about” ) p ( w 2 ​ = “above” ) ⋮ p ( w 2 ​ = “freedom” ) ⋮ ​ ⎦ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎤ ​ . ( 1 0 ) We would then compute the cross-entropy loss between this output vector and the true distribution, which is really just a one-hot vector with 1 1 1 for the word “freedom” and 0 0 0 everywhere else. We would then repeat this process on the next word. So the next input sequence would be x 3 = [ c I ( “intellectual” ) c I ( “freedom” ) ] , (11) \mathbf{x}_3 = \left[ \begin{array}{l} \mathbf{c}_{I(\text{``intellectual''})} \\ \mathbf{c}_{I(\text{``freedom''})} \end{array} \right], \tag{11} x 3 ​ = [ c I ( “intellectual” ) ​ c I ( “freedom” ) ​ ​ ] , ( 1 1 ) and the output would represent the probability distribution p ( w 3 ∣ w 1 : 2 ) p(w_3 \mid w_{1:2}) p ( w 3 ​ ∣ w 1 : 2 ​ ) . And again, we would minimize the cross-entropy loss between its associated output vector and a one-hot vector encoding the word “depends”. We would repeat this process until the end of the sentence. Of course, longer sequences are more expensive to train in this way, and this is precisely the point of the context window in Bengio’s paper. We only consider the N N N previous words when predicting the next word. This idea of a limited context window is critical, as it is a constraint that persists into the present day. In this example, since N = 2 N=2 N = 2 , the third input would be x 4 = [ c I ( “freedom” ) c I ( “depends” ) ] . (12) \mathbf{x}_4 = \left[ \begin{array}{l} \mathbf{c}_{I(\text{``freedom''})} \\ \mathbf{c}_{I(\text{``depends''})} \end{array} \right]. \tag{12} x 4 ​ = [ c I ( “freedom” ) ​ c I ( “depends” ) ​ ​ ] . ( 1 2 ) So the model completely loses the word “intellectual”. It is now outside the context. Since minimizing the cross-entropy loss is equivalent to maximizing the log likelihood—see here for an example if this idea is new to you—we can generalize the logic above by saying that we want to maximize the log likelihood of our training data, again using a neural network as a parametric function approximation of the true distribution: Θ ⋆ = arg ⁡  ⁣ max ⁡ Θ { ∑ t = 1 T log ⁡ g Ω ( c I ( w t − N ) , … , c I ( w t − 1 ) ) } . (13) \boldsymbol{\Theta}^{\star} = \arg\!\max_{\boldsymbol{\Theta}} \left\{ \sum_{t=1}^T \log g_{\boldsymbol{\Omega}} \left(\mathbf{c}_{I(w_{t-N})}, \dots, \mathbf{c}_{I(w_{t-1})} \right) \right\}. \tag{13} Θ ⋆ = ar g Θ max ​ { t = 1 ∑ T ​ lo g g Ω ​ ( c I ( w t − N ​ ) ​ , … , c I ( w t − 1 ​ ) ​ ) } . ( 1 3 ) Of course, we can estimate Θ ⋆ \boldsymbol{\Theta}^{\star} Θ ⋆ by minimizing the negative log likelihood using gradient descent via back-propagation. That’s it. At the conceptual level, this framework is no different from how frontier large language models are trained today. As we will see later though, there is a lot of additional machinery that is needed to make these models work in practice. Finally, imagine we fit our model, meaning we find good parameters Θ ⋆ \boldsymbol{\Theta}^{\star} Θ ⋆ that maximize our log likelihood. How can we use these parameters to generate a random sequence or sentence? We could draw the first word at random from the vocabulary. And then we could draw the next word conditional on the first word from our parametric approximation of p ( w 2 ∣ w 1 ) p(w_2 \mid w_1) p ( w 2 ​ ∣ w 1 ​ ) . And then we could draw the third word conditional on the second and first words from our parametric approximation of p ( w 3 ∣ w 1 : 2 ) p(w_3 \mid w_{1:2}) p ( w 3 ​ ∣ w 1 : 2 ​ ) . And so on. This is why LLMs can both understand natural language and generate new sentences. They are not just descriptive models; they are generative models . There are some subtleties I am glossing over, such as special embeddings to denote the start and end of a sequence, preprocessing steps like lowercasing words, tokenization, and handling out-of-vocabulary words. But I don’t think these details matter much here. As an aside, we can call any model trained in this way autoregressive . In statistics, an autoregressive model is any model where a variable is predicted using its own previous values. A classic example of this are AR models such as AR(1). While (Bengio et al., 2003) was a landmark paper, its full impact was delayed by roughly a decade. This is because training neural networks was hard at the time. It’s worth checking out that paper and seeing just how primitive the engineering feels today. For example, they trained on CPUs and without modern tooling like automatic differentiation libraries. In the intervening decade, there was some early work that built on Bengio’s model. For example, in A unified architecture for natural language processing: Deep neural networks with multitask learning (Collobert & Weston, 2008) , the authors demonstrate that Bengio’s neural language model could be trained and used on a variety of downstream tasks. And in Word representations: A simple and general method for semi-supervised learning (Turian et al., 2010) , the authors demonstrate that word embeddings improve state-of-the-art NLP systems when included as additional features. But none of these contributions were convincing demonstrations of Bengio’s main idea. So seven years after Bengio et al, it was N N N -grams, not neural networks, which were still the state-of-the-art, at least in practice and outside specialized benchmarks. Honestly, I found this surprising, but I kept reading this claim in various papers. For example, in the introduction to Recurrent neural network based language model (Mikolov et al., 2010) , the authors wrote: It is questionable if there has been any significant progress in language modeling over simple N N N -gram models… In fact, most of the proposed advanced language modeling techniques provide only tiny improvements over simple baselines, and are rarely used in practice. Or two years after that, in A fast and simple algorithm for training neural probabilistic language models (Mnih & Teh, 2012) , the authors wrote: In spite of their superior performance, neural probabilistic language models remain far less widely used than N N N -gram models due to their notoriously long training times, which are measured in weeks even for moderately-sized datasets. Of course, advanced techniques existed and were well known, but they were often impractical. So roughly a hundred years after Andrey Markov’s pioneering work, researchers were still struggling to represent human language in a form amenable for mathematics and computation, and N N N -grams were still considered a reasonable choice in NLP. Today, neural networks are definitively state-of-the-art. What changed? The answer is that we learned to train variants of Bengio’s model at scale. Around 2012, researchers were finally able to train neural networks on large datasets. My understanding is that it was the so-called “AlexNet” paper, ImageNet classification with deep convolutional neural networks (Krizhevsky et al., 2012) , that convinced many in the research community to pay attention. Convolutional neural networks were already well known and had been trained on small datasets since the 1980s (LeCun et al., 1989) . But AlexNet was the first time a deep convolutional neural network was trained end-to-end on a very large (at the time) dataset, ImageNet (Deng et al., 2009) and using GPUs. The results were a tour de force. To quote the paper: We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top- 5 5 5 test error rate of 15.3 % 15.3\% 1 5 . 3 % , compared to 26.2 % 26.2\% 2 6 . 2 % achieved by the second-best entry. In other words, AlexNet demolished the state-of-the-art in computer vision. It achieved a roughly 40 % 40\% 4 0 % reduction in relative error rate. Nothing else came close. As a comparison, the current fastest time for a men’s marathon is 2 hours and 35 seconds. The previous record was 2 hours and 69 seconds, so 34 seconds slower. Now imagine if someone came along and beat the record by half an hour. It would revolutionize the running world. At the time, computer vision was still dominated by handcrafted feature pipelines, and so the AlexNet results were extremely surprising. For example, in Introduction to the bag of features paradigm for image classification and retrieval (O’Hara & Draper, 2011) , the authors wrote: The past decade has seen the growing popularity of Bag of Features (BoF) approaches to many computer vision tasks, including image classification, video search, robot localization, and texture recognition… BoF-based systems have set new performance standards on popular image classification benchmarks and have achieved scalability breakthroughs in image retrieval. This introduction to bag of feature models was put on arXiv in January 2011, whereas AlexNet was published at NeurIPS in December 2012, meaning that the claim above was contemporaneous with the training of AlexNet! My point here is to underscore just how surprising the rise of neural networks was. To be clear, I am sure many in the research community believed neural networks would work—Hinton has been a believer since probably the 1970s—, but this was hardly the consensus view that it is today. So the year 2012 was a changepoint. In 2003, Bengio et al set the stage conceptually. In 2012, Krizhevsky et al set the stage technologically. With hindsight, the obvious implication of AlexNet was that NLP researchers circa 2012 should try to train neural networks at scale. Of course, many researchers tried, but let’s ground ourselves in one particular model. This will help focus the narrative. To my knowledge, two of the earliest and most successful papers to try this idea were Efficient estimation of word representations in vector space (Mikolov et al., 2013) and Distributed representations of words and phrases and their compositionality (Mikolov et al., 2013) . These papers are tightly related by both authorship and time, and together, they helped unlock the core ideas in Bengio’s paper, as well as introduce the famous word2vec model. So I think it’s fair to treat them as both a unit and as a landmark in our story. To understand these two papers, we need to understand the computational problems Bengio faced, which means we need to understand the model in more technical detail. Let x t \mathbf{x}_t x t ​ be the input to the model, and y t \mathbf{y}_t y t ​ be the output. Bengio’s model did not support variable-length inputs, and thus the input sequence could be only a fixed number of N N N words, each represented as an D D D -dimensional embedding. Let’s represent this input as the concatenation of N N N different D D D -vectors from C \mathbf{C} C mentioned above, so: x t : = [ c I ( w t − 1 ) ⋮ c I ( w t − N + 1 ) ] . (14) \mathbf{x}_t := \left[ \begin{array}{l} \mathbf{c}_{I(w_{t-1})} \\ \quad\quad\vdots \\ \mathbf{c}_{I(w_{t-N+1})} \end{array} \right]. \tag{14} x t ​ : = ⎣ ⎢ ⎢ ⎡ ​ c I ( w t − 1 ​ ) ​ ⋮ c I ( w t − N + 1 ​ ) ​ ​ ⎦ ⎥ ⎥ ⎤ ​ . ( 1 4 ) One way we can imagine constructing x t \mathbf{x}_t x t ​ is if we represent every word in our context window as a V V V -dimensional one-hot vector. Call this a matrix Q t ∈ R N × V \mathbf{Q}_t \in \mathbb{R}^{N \times V} Q t ​ ∈ R N × V . Then x t = Q t C \mathbf{x}_t = \mathbf{Q}_t \mathbf{C} x t ​ = Q t ​ C gives us the associated embeddings. In practice, though, we would never do a dense matrix multiplication with complexity O ( V N D ) \mathcal{O}(VND) O ( V N D ) . Instead, we would simply index into C \mathbf{C} C . So this operation has computational complexity O ( N D ) \mathcal{O}(ND) O ( N D ) . I only belabor this point because I found it confusing when first reading Bengio’s paper. (This point is made more clearly in (Collobert & Weston, 2008) ) After construction, this input x t \mathbf{x}_t x t ​ is then fed into an extremely simple (relative to today’s models) architecture, a feed-forward neural network with a linear projection layer and a nonlinear hidden layer: g Ω ( x t ) = y t : = b + W x t + U tanh ⁡ ( z t ) , z t : = d + H x t . (15) \begin{aligned} g_{\boldsymbol{\Omega}}(\mathbf{x}_t) = \mathbf{y}_t &:= \mathbf{b} + \mathbf{Wx}_t + \mathbf{U} \tanh(\mathbf{z}_t), \\ \mathbf{z}_t &:= \mathbf{d} + \mathbf{Hx}_t. \end{aligned} \tag{15} g Ω ​ ( x t ​ ) = y t ​ z t ​ ​ : = b + W x t ​ + U tanh ( z t ​ ) , : = d + H x t ​ . ​ ( 1 5 ) The output y t ∈ R V \mathbf{y}_t \in \mathbb{R}^{V} y t ​ ∈ R V represents the un-normalized probability of each word in the vocabulary. If normalized, this vector would represent the probability distribution we discussed in the autoregressive framework. Here, we see that that W ∈ R V × N D \mathbf{W} \in \mathbb{R}^{V \times ND} W ∈ R V × N D is a linear projection of the input embeddings x t \mathbf{x}_t x t ​ , that H ∈ R H × N D \mathbf{H} \in \mathbb{R}^{H \times ND} H ∈ R H × N D is a linear projection into a hidden state vector z t ∈ R H \mathbf{z}_t \in \mathbb{R}^H z t ​ ∈ R H , and that U ∈ R V × H \mathbf{U} \in \mathbb{R}^{V \times H} U ∈ R V × H is a linear projection of the nonlinear hidden state vector. So clearly the parameters mentioned in Equation 7 7 7 can be concretized as { C , Ω } : = { C , b , W , U , d , H } . (16) \{\mathbf{C}, \boldsymbol{\Omega}\} := \{\mathbf{C, b, W, U, d, H}\}. \tag{16} { C , Ω } : = { C , b , W , U , d , H } . ( 1 6 ) So why was this expensive to train? We can see that the computational complexity to compute y t \mathbf{y}_t y t ​ is proportional to: N D ⏟ Q C          +          V N D ⏟ W x t             +    V H ⏟ U tanh ⁡ ( z t ) +       H N D ⏟ H x t . (17) \underbrace{ND}_{\mathbf{QC}} \;\;\;+\;\;\; \underbrace{VND}_{\mathbf{Wx}_t} \;\;\;\;+\; \underbrace{VH}_{\mathbf{U} \tanh(\mathbf{z}_t)} +\;\; \underbrace{HND}_{\mathbf{Hx}_t}. \tag{17} Q C N D ​ ​ + W x t ​ V N D ​ ​ + U t a n h ( z t ​ ) V H ​ ​ + H x t ​ H N D ​ ​ . ( 1 7 ) Note that this complexity is for every single word in the corpus, and we must also account for the number of training epochs. In (Mikolov et al., 2013) , the authors write that a “common choice” is N = 10 N=10 N = 1 0 and that N D ND N D is typically around 500 500 5 0 0 to 2000 2000 2 0 0 0 . However, the hidden layer has dimension H H H (commonly around 2000 2000 2 0 0 0 or so) and this is multiplied by the size of the vocabulary! What this means? The dominating term in Equation 17 17 1 7 is V H VH V H . Furthermore, this complexity is just for computing the un-normalized probabilities y t \mathbf{y}_t y t ​ . To normalize these, we must compute the softmax function over the size of the vocabulary V V V : p ( w t ∣ w t − N : t − 1 ) = exp ⁡ ( y t ) ∑ i = 1 V exp ⁡ ( y i ) . (18) p(w_t \mid w_{t-N:t-1}) = \frac{\exp\left(\mathbf{y}_t\right)}{\sum_{i=1}^V \exp\left( \mathbf{y}_i \right)}. \tag{18} p ( w t ​ ∣ w t − N : t − 1 ​ ) = ∑ i = 1 V ​ exp ( y i ​ ) exp ( y t ​ ) ​ . ( 1 8 ) As I understand it, these were the computational problems Bengio faced. The two Mikolov papers did not present a single trick to solve them. Rather, the papers made a number of modeling choices, mostly already established in the literature, that in combination finally made learning distributed representations of words scalable. First, in the first paper, they avoided computing the full softmax function using hierarchical softmax, introduced by Morin and Bengio in Hierarchical probabilistic neural network language model (Morin & Bengio, 2005) . I don’t think the details of this matter much here. See this blog post for a nice explanation with code. Suffice to say that it’s an efficient way to compute the normalized probabilities in Equation 18 18 1 8 . The computational complexity is reduced from O ( V ) \mathcal{O}(V) O ( V ) to O ( log ⁡ 2 V ) \mathcal{O}(\log_2 V ) O ( lo g 2 ​ V ) . In the second paper, they further sped up the softmax computation by introducing a technique called negative sampling . The theory here is rich and deserving of its own post, but the main idea is to draw K K K samples from a noise distribution and train the model to disambiguate observations from noise. The important point here is that one can prove this converges to the correct probabilities without explicitly computing the normalizing constant. See (Gutmann & Hyvärinen, 2010) for details. We don’t need to fully grok these techniques; just know that these two approaches are both ways of getting around the expensive normalization in Equation 18 18 1 8 . For example, if V = 1 × 1 0 6 V = 1\times 10^6 V = 1 × 1 0 6 , then log ⁡ 2 ( V ) ≈ 20 \log_2(V) \approx 20 lo g 2 ​ ( V ) ≈ 2 0 . And in the second paper, they chose K K K to be 2 2 2 to 20 20 2 0 depending on the dataset. Second, they stripped out the non-linear part of Bengio’s model (so removing U tanh ⁡ ( z t ) \mathbf{U} \tanh(\mathbf{z}_t) U tanh ( z t ​ ) ), reducing the model to a simple linear operation: a dot product. The result is model that is log-linear on the features, which I’ll explain in a moment. Now the models. In the first paper, they presented two models, a continuous bag-of-words model (CBOW) and a continuous skip-gram model (skip-gram). These are the foundations of the word2vec NLP toolkit. In the CBOW model, a set of neighboring words are averaged to predict a target word; and in the skip-gram model, a target word is used to predict its neighboring words (Figure 3 3 3 ). Both worked empirically in practice, but the authors only built on the skip-gram model in the second paper. And since I don’t think it’s that important here to understand both, I’ll just focus on the skip-gram model. Let’s build a little intuition by going into detail. The objective of the skip-gram model is to minimize the cross-entropy loss between a single target word and its neighboring words. So the input to the model is only a single D D D -vector representing a single word (so no context window). The output, however, are the N N N words surrounding the input. Let N = 2 C N = 2C N = 2 C . Then the objective function is: 1 T ∑ t = 1 T ∑ − C ≤ j ≤ C ,    j ≠ 0 log ⁡ p ( w t + j ∣ w t ) . (19) \frac{1}{T} \sum_{t=1}^T \sum_{-C \leq j \leq C,\;j \neq 0} \log p(w_{t+j} \mid w_t). \tag{19} T 1 ​ t = 1 ∑ T ​ − C ≤ j ≤ C , j  ​ = 0 ∑ ​ lo g p ( w t + j ​ ∣ w t ​ ) . ( 1 9 ) I will continue to use the notation N N N for this context window, but clearly it is different in precise meaning from the N N N in an N N N -gram or the N N N in Bengio’s paper. We model the conditional probability in Equation 19 19 1 9 via a simple log-linear function: p ( w t + j ∣ w t ) = p ( u I ( w t + j ) ∣ c I ( w t ) ) = exp ⁡ ( ⟨ u I ( w t + j ) , c I ( w t ) ⟩ ) ∑ i ∈ V exp ⁡ ( ⟨ u i , c I ( w t ) ⟩ ) (20) p(w_{t+j} \mid w_t) = p(\mathbf{u}_{I(w_{t+j})} \mid \mathbf{c}_{I(w_{t})}) = \frac{\exp\left( \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle \right)}{\sum_{i \in \mathcal{V}} \exp\left( \langle \mathbf{u}_i, \mathbf{c}_{I(w_{t}) \rangle} \right)} \tag{20} p ( w t + j ​ ∣ w t ​ ) = p ( u I ( w t + j ​ ) ​ ∣ c I ( w t ​ ) ​ ) = ∑ i ∈ V ​ exp ( ⟨ u i ​ , c I ( w t ​ ) ⟩ ​ ) exp ( ⟨ u I ( w t + j ​ ) ​ , c I ( w t ​ ) ​ ⟩ ) ​ ( 2 0 ) Here, c i \mathbf{c}_i c i ​ are word embeddings of the inputs. These are analogous to the row-vectors of C \mathbf{C} C in Bengio’s model and again are constructed via a lookup. The output embeddings u \mathbf{u} u are a little trickier to interpret. If we were using the full softmax function, we would have V V V such output embeddings, and these would represent the weights of the softmax function. But when using hierarchical softmax or negative sampling, the interpretation changes a bit. Again, I don’t think the details really matter here. The key point is that we take a sequence w 1 : T w_{1:T} w 1 : T ​ , select the appropriate embeddings c 1 : T \mathbf{c}_{1:T} c 1 : T ​ , and compute Equation 20 20 2 0 directly, learning both the parameters C \mathbf{C} C and U \mathbf{U} U . This is called a “log-linear model” because the log of the conditional probability is linear with respect to its arguments: log ⁡ p ( w t + j ∣ w t ) = ⟨ u I ( w t + j ) , c I ( w t ) ⟩ − Z , (21) \log p(w_{t+j} \mid w_t) = \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle - Z, \tag{21} lo g p ( w t + j ​ ∣ w t ​ ) = ⟨ u I ( w t + j ​ ) ​ , c I ( w t ​ ) ​ ⟩ − Z , ( 2 1 ) Here, I just write Z Z Z to denote the normalizing constant, the denominator in Equation 20 20 2 0 , because it is not particularly interesting, and we do not even need to compute it when using negative sampling. The key relationship that the model is learning is a simple linear weighting of the input embeddings that allow it to predict nearby words. Hopefully, it is clear why this model is so fast to train. We have no hidden layers or nonlinearities. We simply compute a dot product and ignore the normalizing constant. For example, when using the full softmax, the computational complexity is: N ( D + D V ) . (22) N (D + D V). \tag{22} N ( D + D V ) . ( 2 2 ) Here, we have D + D V D + D V D + D V dot products, and we need to do it over N N N words in our context window. However, in practice, we can eliminate V V V entirely, replacing it with something around log ⁡ 2 ( V ) \log_2(V) lo g 2 ​ ( V ) or K K K . This is significantly smaller than Equation 17 17 1 7 . For example, if we assume that H = D = 500 H=D=500 H = D = 5 0 0 , N = 10 N=10 N = 1 0 , and V = 1 × 1 0 6 V=1 \times 10^{6} V = 1 × 1 0 6 , then hierarchical softmax is five orders of magnitude smaller in terms of complexity. So in these two seminal Mikolov papers, the authors stripped down Bengio’s core idea to a simple log-linear model, and thus were able to train that model at scale. That said, I want to stress a subtlety that took me time to grok. Neither the CBOW nor the continuous skip-gram models presented here are full language models. Notice that their objective functions (nearby-word prediction) are not in the autoregressive framework and thus cannot easily plug into Equation 1 1 1 . That’s because the goal of these papers was not to learn a full language model but rather to learn good word embeddings. They say this explicitly in the first paper (emphasis mine): Representation of words as continuous vectors has a long history. A very popular model architecture for estimating neural network language model (NNLM) was proposed in (Bengio et al., 2003) , where a feed-forward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others. Another interesting architecture of NNLM was presented in (Mikolov, 2007; Mikolov et al., 2009) , where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model. So the word2vec models were simple and shallow (single layer) neural networks designed for fast training and to learn good embeddings. They were not full language models. This is a major distinction from similar prior art, such as A scalable hierarchical distributed language model (Mnih & Hinton, 2008) . In this paper, the authors demonstrate more scalable inference of Bengio’s model by representing the vocabulary compactly through binary trees and by using a log-bilinear model. But they go end-to-end to a language model, as the paper title suggests. Mikolov et al’s two models were relentlessly simple and efficient. As I understand it, both CBOW and skip-gram worked well in practice. It did not matter if neighboring words predict a target word or if that target word predicts its neighboring words. The real differentiator was that both models could be efficiently trained at scale. And with scale, something remarkable happened: the authors discovered that distributed representations of words, trained in this fashion, captured semantic and syntactic information. Today, linguistic regularities in word embeddings is so well-established that it might seem boring to read here. But understood in context, these regularities should be surprising! How can a simple linear model, trained on essentially next- or nearby-word prediction via maximum likelihood estimation, learn distributed representations of words with remarkable syntactic and semantic properties and relationships? In my mind, this was the first big result that suggested neural networks would not just work but really work in language modeling. The word2vec papers were not the first to observe these properties. My understanding is that that credit goes to yet another Mikolov paper from 2013, Linguistic regularities in continuous space word representations (Mikolov et al., 2013) . Here, the authors showed that many semantic and syntactic relationships correspond to approximately constant vector offsets in the embedding’s vector space. To be clear, researchers had long observed that one could uncover structure in vector representations of words. For example, in the 1989 paper Self-organizing semantic maps (Ritter & Kohonen, 1989) , the authors trained self-organizing maps (Kohonen, 1982) on pre-computed two-dimensional vectors representing words and demonstrated that these maps contain semantic structure. However, these models were not trained end-to-end (the representations themselves were not learned) and did not have linear structure. It would be a stretch to call these vectors “word embeddings”. But log-linear models like word2vec were remarkable precisely because they enabled analogical reasoning through simple vector offset, i.e. linear operations (Figure 4 4 4 )! Perhaps the most famous example of analogical reasoning with word embeddings is the relationship “king is to queen as man is to woman”: vec ( “king” ) − vec ( “man” ) + vec ( “woman” ) ≈ vec ( “queen” ) . (23) \text{vec}\left(\text{``king''}\right) - \text{vec}\left(\text{``man''}\right) + \text{vec}\left(\text{``woman''}\right) \approx \text{vec}\left(\text{``queen''}\right). \tag{23} vec ( “king” ) − vec ( “man” ) + vec ( “woman” ) ≈ vec ( “queen” ) . ( 2 3 ) Or in (Mikolov et al., 2013) , the authors give the example that “Russia” plus “river” is the Volga: vec ( “Russia” ) + vec ( “river” ) ≈ vec ( “Volga River” ) . (24) \text{vec}\left(\text{``Russia''}\right) + \text{vec}\left(\text{``river''}\right) \approx \text{vec}\left(\text{``Volga River''}\right). \tag{24} vec ( “Russia” ) + vec ( “river” ) ≈ vec ( “Volga River” ) . ( 2 4 ) In my mind, these are pretty fascinating and non-obvious results. It suggests that the methods are not mixing vector dimensions in undesirable ways and staying approximately linear. Again, viewed with fresh eyes, it is really quite remarkable! If you were a researcher in 2003 reading Bengio’s paper, would you have predicted this result with high confidence? While these two Mikolov papers are landmark papers on learning word embeddings at scale, they are by no means the only ones. Many other researchers worked in this area. Perhaps the most famous paper on word embeddings that we do not have time to discuss is GloVe: Global vectors for word representation (Pennington et al., 2014) . In this paper, the authors present a unifying view between two common methods for learning word embeddings, global matrix factorization methods and local context window methods. But there were many others as well, such as Skip-thought vectors (Kiros et al., 2015) , Word embeddings through Hellinger PCA (Lebret & Collobert, 2013) , and Eigenwords: spectral word embeddings (Dhillon et al., 2015) to cite just a few illustrative examples. For ease of presentation, I have focused on word-level embeddings. But the idea was naturally and quickly extended to larger contexts. This was motivated by the fact that a word’s meaning is obviously context-dependent (polysemy). For example, the word “bank” might refer to a financial institution or the side of a river. A word embedding for “bank” that is not context dependent must somehow flatten this distinction. So lack of context is obviously a limitation. Researchers tackled this through a variety of approaches. One approach was to use the hidden states of a bidirectional long short-term memory network (LSTM) as context-specific embeddings as in context2vec: Learning generic context embedding with bidirectional LSTM (Melamud et al., 2016) or Learned in translation: contextualized word vectors (McCann et al., 2017) . But perhaps the most noteworthy example of this idea—and one I mention here because it will come up later—was Deep contextualized word representations (Peters et al., 2018) or ELMO. Here, the authors both used a bidirectional LSTM to extract more context-dependent word embeddings and then trained on an objective function that was dependent on the downstream task. This hints at combining pre-trained embeddings with supervised fine-tuning, which we’ll see later. By 2013, word- and phrase-level embeddings demonstrably worked. The key to unlocking them was simple methods that scaled on modern hardware. However, the problem with these embeddings is that they were still with respect to a fixed window. It was not immediately obvious how this idea could be extended to longer phrases or sentences or to larger texts. Of course, researchers had tried. For example, (Collobert & Weston, 2008) used the idea of time-delay neural networks (Waibel et al., 1989) to model sentences of variable lengths, but the authors used convolutions that still had a fixed-width window size. The embedding itself, then, was not constructed while accounting for long-range dependencies. So word embeddings, while a beautiful idea, only set the stage for the next big idea in our history: tackling the problem of modeling long-range dependencies without an explicit context window. The key innovation here was sequence-to-sequence models. In a sequence-to-sequence model, a neural network encodes a variable-length input sequence into a fixed-length vector, while a second neural network decodes this fixed-length vector back into a variable-length output sequence. In both Bengio and Mikolov’s papers, the input was an embedding ( c \mathbf{c} c in Equations 14 14 1 4 and 20 20 2 0 ). In a sequence-to-sequence model, this intermediate fixed-length vector is now the word embedding. The precise architectures used for the encoder and decoder can vary, but clearly they should be architectures that support variable-length sequences, such as recurrent neural networks (RNNs) or LTSMs. To me, the most intuitive example of a sequence-to-sequence model is a translation model. The input sequence is a sentence in a source language like English, and the output sequence is a sentence in a target language like Chinese (Figure 5 5 5 ). And since some of the most important early work in sequence-to-sequence modeling was in neural machine translation (NMT), I’ll often use translation as a default example. However, the more general case is any mapping from one sequence to another. This idea is fairly straightforward; it is analogous to an auto-encoder but for variable-length sequences, and auto-encoders (Bourlard & Kamp, 1988) are nearly as old as back-propagation. However, as we have already seen, even seemingly simple ideas are hard-won. The original work in RNNs and LSTMs goes back to at least the early 1990s, with seminal papers like Finding structure in time (Elman, 1990) , Serial order: A parallel distributed processing approach (Jordan, 1997) and Long short-term memory (Hochreiter & Schmidhuber, 1997) . By the 2010s, these sequential models were well-known and already used in NLP. See (Mikolov et al., 2010; Sutskever et al., 2011; Graves, 2013) for example. These models were an important bridge, proving that we could train RNNs at scale and overcome the vanishing gradient problem discussed in Learning long-term dependencies with gradient descent is difficult (Bengio et al., 1994) . But they were not yet sequence-to-sequence models. To my knowledge, the first paper to propose a full encoder–decoder architecture for NLP was Recurrent continuous translation models (Kalchbrenner & Blunsom, 2013) . Here, the authors proposed training two neural networks end-to-end. The decoder was an RNN, inspired by the model in (Mikolov et al., 2010) . But somewhat surprisingly, the encoder was not also an RNN. With hindsight, two RNNs feels like the obvious choice, but instead the authors used a convolutional sentence model (CSM). The details don’t really matter here, but this is essentially an NLP model which uses convolutional layers. Why this choice? Well, CSMs were actually developed by the same authors in the same year, in Recurrent convolutional neural networks for discourse compositionality (Kalchbrenner & Blunsom, 2013) , and my hypothesis is that this choice just felt obvious to them at the time. So (Kalchbrenner & Blunsom, 2013) was a landmark paper in the sense that it was the first attempt at a sequence-to-sequence model, but with hindsight we can immediately see how to improve it with a better sequential model for the encoder. And that is precisely what happens in two follow up papers. First, in Learning phrase representations using RNN encoder–decoder for statistical machine translation (Cho et al., 2014) , the authors propose the first encoder–decoder architecture in which both neural networks were RNNs. And then in Sequence to sequence learning with neural networks (Sutskever et al., 2014) , the authors proposed a similar model but using LSTMs, since LSTMs often work better at handling the aforementioned vanishing gradient problem. In this paper, Sutskever makes the connection to Kalchbrenner explicitly: Our work is closely related to Kalchbrenner and Blunsom, who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words. As a nitpick, convolutional neural networks do model local patterns and order, but they lose global order without very large receptive fields . But Sutskever’s point is directionally correct. So even at the time, the academic history we are tracing here was clear. To understand these models in a bit more detail, let’s go through the RNN encoder–decoder in (Cho et al., 2014) , using Figure 6 6 6 as a reference. Let X \mathcal{X} X be a variable-length input sequence with length T x T_x T x ​ , and let Y \mathcal{Y} Y be a variable-length output sequence with length T y T_y T y ​ : X = { x 1 , x 2 , … , x T x } , Y = { y 1 , y 2 , … , y T y } . (25) \begin{aligned} \mathcal{X} &= \{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{T_x} \}, \\ \mathcal{Y} &= \{ \mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_{T_y} \}. \end{aligned} \tag{25} X Y ​ = { x 1 ​ , x 2 ​ , … , x T x ​ ​ } , = { y 1 ​ , y 2 ​ , … , y T y ​ ​ } . ​ ( 2 5 ) Note that ( X , Y ) (\mathcal{X}, \mathcal{Y}) ( X , Y ) is a single observation pair, but I am suppressing the sample index for ease of notation. Also, I bold each vector in both sequences because they are embedded words. In an RNN, we iteratively compute hidden state variables over T x T_x T x ​ steps, where for the t t t -th step we define a recurrence relation between hidden states as: h t = f enc ( h t − 1 , x t ) . (26) \mathbf{h}_t = f_{\textsf{enc}} \left( \mathbf{h}_{t-1}, \mathbf{x}_t \right). \tag{26} h t ​ = f enc ​ ( h t − 1 ​ , x t ​ ) . ( 2 6 ) This might be a little abstract. So concretely, a simple RNN network might instantiate f enc f_{\textsf{enc}} f enc ​ as the following nonlinear function of the current word embedding and the previous hidden state: h t = tanh ⁡ ( W h h h t − 1 + W x h x t ) . (27) \mathbf{h}_t = \tanh \left(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t \right). \tag{27} h t ​ = tanh ( W h h ​ h t − 1 ​ + W x h ​ x t ​ ) . ( 2 7 ) The matrices hopefully have obvious dimensions, and we can initialize the first hidden state vector h 0 \mathbf{h}_0 h 0 ​ however we like, such as a vector of all zeros. This is simply one choice, though. We can imagine many types of choices, such as a vanilla RNN unit or an LSTM unit. The key point is that the hidden state vectors H = { h 1 , h 2 , … , h T x } (28) \mathcal{H} = \{\mathbf{h}_1, \mathbf{h}_2,\dots, \mathbf{h}_{T_x}\} \tag{28} H = { h 1 ​ , h 2 ​ , … , h T x ​ ​ } ( 2 8 ) carry forward information from previous words in the sequence via these recurrent connections, much like a hidden Markov model (Baum & Petrie, 1966) . A powerful consequence of this model is that RNNs do not limit the size of the input context window. Different input sequences X \mathcal{X} X can be different sizes, unlike in the N N N -gram model or in Bengio’s model (Equation 14 14 1 4 ). See Andrej Karpathy’s excellent blog post, The unreasonable effectiveness of recurrent neural networks , for a more detailed presentation of RNNs. Finally, we define the context vector c \mathbf{c} c as some function of the hidden states: c = q ( H ) . (29) \mathbf{c} = q(\mathcal{H}). \tag{29} c = q ( H ) . ( 2 9 ) Notice that c \mathbf{c} c does not have a time index, because it compresses all the temporal information in the input sequence X \mathcal{X} X into a fixed-width vector. The easiest definition of c \mathbf{c} c is simply as the last hidden state vector or c = h T x \mathbf{c} = \mathbf{h}_{T_x} c = h T x ​ ​ . This context vector becomes an input to the decoder, another RNN with recurrence relation s t = f dec ( s t − 1 , y t − 1 , c ) , (30) \mathbf{s}_t = f_{\textsf{dec}} \left( \mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c} \right), \tag{30} s t ​ = f dec ​ ( s t − 1 ​ , y t − 1 ​ , c ) , ( 3 0 ) and hidden states S = { s 1 , s 2 , … , s T y } . (31) \mathcal{S} = \{\mathbf{s}_1, \mathbf{s}_2,\dots, \mathbf{s}_{T_y}\}. \tag{31} S = { s 1 ​ , s 2 ​ , … , s T y ​ ​ } . ( 3 1 ) The decoder then outputs the sequence Y \mathcal{Y} Y , one word at a time. The typical objective of a sequence-to-sequence model is again the autoregressive objective of next-word prediction: maximize a log likelihood, in which each conditional probability is modeled via the decoder RNN: log ⁡ p ( Y ) = ∑ t = 1 T y log ⁡ p ( y t ∣ y 1 : t − 1 ) = ∑ t = 1 T y log ⁡ f dec ( s t − 1 , y t − 1 , c ) . (32) \log p(\mathcal{Y}) = \sum_{t=1}^{T_y} \log p(\mathbf{y}_t \mid \mathbf{y}_{1:t-1}) = \sum_{t=1}^{T_y} \log f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1} , \mathbf{c}). \tag{32} lo g p ( Y ) = t = 1 ∑ T y ​ ​ lo g p ( y t ​ ∣ y 1 : t − 1 ​ ) = t = 1 ∑ T y ​ ​ lo g f dec ​ ( s t − 1 ​ , y t − 1 ​ , c ) . ( 3 2 ) Again, this might be a bit abstract. So for example, one possible instantiation of g g g is as a linear transformation of the input variables: f dec ( s t − 1 , y t − 1 , c ) = W z s s t + W z y y t − 1 + W z c c . (33) f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}) = \mathbf{W}_{zs} \mathbf{s}_t + \mathbf{W}_{zy} \mathbf{y}_{t-1} + \mathbf{W}_{zc} \mathbf{c}. \tag{33} f dec ​ ( s t − 1 ​ , y t − 1 ​ , c ) = W z s ​ s t ​ + W z y ​ y t − 1 ​ + W z c ​ c . ( 3 3 ) Of course, this is just one choice. Then all the model weights are learned end-to-end by optimizing this log likelihood (Equation 32 32 3 2 ). In this way, we can convert a variable-length input X \mathcal{X} X into a variable-length output Y \mathcal{Y} Y . This RNN encoder–decoder framework is powerful, since many problems in NLP can be framed in this way. For example, text summarization, machine translation, and agentic conversation can all be framed as a sequence-to-sequence modeling challenge. To be clear, other researchers around this time had attempted other approaches to handling variable-length sequences, such as the recursive neural tensor network in Recursive deep models for semantic compositionality over a sentiment treebank (Socher et al., 2013) . But the RNN encoder–decoder would become the de facto framework of choice for a large range of NLP tasks. As an aside, sometimes these models are call sequence transduction models or transduction models or even just transducers . My understanding is that “transduction” here just means converting one sequence into another by learning a conditional distribution p θ ( y 1 : T ∣ x 1 : S ) p_{\theta}(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:S}) p θ ​ ( y 1 : T ​ ∣ x 1 : S ​ ) . In this context, “transduction” does not have the sense that Vladimir Vapnik gave it. In Vapnik’s definition, transduction loosely means classification of a specific example rather than a general rule for classifying future examples (Gammerman et al., 2013) . But this is not the sense which people mean when they refer to models like the transformer as a “transducer”. In my mind, Kalchbrenner, Cho, and Sutskever’s three papers (Kalchbrenner & Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014) were the foundations of sequence-to-sequence modeling, and many other papers have built around and off this core idea. But the key point for us here is that these three papers make the same logic choice: they lift the idea of a fixed-length embedding for words or phrases into the context vector c \mathbf{c} c of a sequential model, such that the models can now support variable-length inputs and outputs and long-range dependencies in each. However, a problem with this approach was that long-range dependencies got “lost” in this context vector. For example, imagine we had a very long English language text that we wanted to translate into Chinese. Even if our encoder LSTM was good at capturing long-range dependencies in the English sentence, it would be forced to compress that information into a much shorter, fixed-width vector with no temporal structure that would then be fed into the decoder. This effect was observed by Cho et al in On the properties of neural machine translation: encoder–decoder approaches (Cho et al., 2014) . In this paper, the authors write: Our analysis shows that the performance of the neural machine translation model degrades quickly as the length of a source sentence increases. The most obvious explanatory hypothesis is that the fixed-length vector representation does not have enough capacity to encode a long sentence with complicated structure and meaning. The authors test this hypothesis through a variety of experiments. For example, in one experiment, they report the BLEU score for an RNN encoder–decoder as a function of sequence length, and they show that the model’s performance degrades as the sentences become longer. So the RNN decoder–encoder was promising, but the fixed-width context vector was a bottleneck on modeling long-range dependencies. Then in 2014, a seminal paper was published that addressed this problem, Neural machine translation by jointly learning to align and translate (Bahdanau et al., 2014) . The main invention of this paper was to use the well-known attention mechanism to attend to this context vector. However, the authors barely use the word “attention” in the paper. Instead, they seem to conceptualize it more as a search problem: In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. I say this paper is “seminal” because, at least to my knowledge, it was really the first paper to use a differentiable attention layer in the rapidly-growing field of NMT. To be clear, the attention mechanism was already known and used outside of NLP. For example, see Learning to combine foveal glimpses with a third-order Boltzmann machine (Larochelle & Hinton, 2010) , Learning where to attend with deep architectures for image tracking (Denil et al., 2012) , or Recurrent models of visual attention (Mnih et al., 2014) . These were all papers that were published between 2010 and 2014 and that applied an attention mechanism to a neural network computer vision system. However, to my knowledge, Bahdanau was the first paper to successfully use attention in NLP. To quote Effective approaches to attention-based neural machine translation (Luong et al., 2015) : In the context of NMT, Bahdanau et al… has successfully applied such attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT. All that said, “jointly align and translate” is pretty vague, so let’s get technical. Bahdanau’s solution to this bottleneck was to allow each hidden state vector in the decoder to pay attention to possibly all the hidden state vectors in the encoder. What do I mean by “pay attention to”? Here, each decoder hidden state variable s i \mathbf{s}_i s i ​ depends not only on the previous hidden state and previous word but also on its own context vector, which is a weighted combination of the encoder’s hidden states! s i = f dec ( s i − 1 , y i − 1 , c i ) , c i = ∑ j = 1 T x α i j h j . (34) \begin{aligned} \mathbf{s}_i & = f_{\textsf{dec}}(\mathbf{s}_{i-1}, \mathbf{y}_{i-1}, \mathbf{c}_i), \\ \mathbf{c}_i &= \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j. \end{aligned} \tag{34} s i ​ c i ​ ​ = f dec ​ ( s i − 1 ​ , y i − 1 ​ , c i ​ ) , = j = 1 ∑ T x ​ ​ α i j ​ h j ​ . ​ ( 3 4 ) This is the main idea of the paper. Each decoder hidden state s i \mathbf{s}_i s i ​ has access to all the hidden states in the encoder via this context vector c i \mathbf{c}_i c i ​ (Figure 7 7 7 ). We can finally define the attention mechanism! Here, it is the weighted sum of hidden state vectors, as this allows each s i \mathbf{s}_i s i ​ to attend to different parts of the input sequence through its hidden state. Each weight α i j \alpha_{ij} α i j ​ is a linear function of the previous decoder hidden state s i − 1 \mathbf{s}_{i-1} s i − 1 ​ and the current decoder hidden state h j \mathbf{h}_j h j ​ : α i j : = exp ⁡ ( e i j ) ∑ k = 1 T x exp ⁡ ( e i k ) , e i j : = v a ⊤ z i j , z i j : = tanh ⁡ ( W a s i − 1 + U a h j ) . (35) \begin{aligned} \alpha_{ij} &:= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \\ e_{ij} &:= \mathbf{v}_a^{\top } \mathbf{z}_{ij}, \\ \mathbf{z}_{ij} &:= \tanh\left( \mathbf{W}_a \mathbf{s}_{i-1} + \mathbf{U}_a \mathbf{h}_j \right). \end{aligned} \tag{35} α i j ​ e i j ​ z i j ​ ​ : = ∑ k = 1 T x ​ ​ exp ( e i k ​ ) exp ( e i j ​ ) ​ , : = v a ⊤ ​ z i j ​ , : = tanh ( W a ​ s i − 1 ​ + U a ​ h j ​ ) . ​ ( 3 5 ) Let’s call α i \boldsymbol{\alpha}_i α i ​ the an alignment vector , which we infer one per step at a time during the decoding process. So z i j \mathbf{z}_{ij} z i j ​ can be viewed as a shared hidden state, capturing nonlinear information about both the input and output sequence. Importantly, there is one such vector for each input-output pair. And for a given decoder hidden state, the model can up or downweight the relationship to h j \mathbf{h}_j h j ​ via the parameters v a \mathbf{v}_a v a ​ . The neural network learns all these model parameters end-to-end via back-propagation, maximizing the log likelihood in Equation 32 32 3 2 . So that’s it. As I understand it, (Bahdanau et al., 2014) was really the first paper to use attention in neural machine translation and probably the most successful use of attention in NLP at the time. The method worked surprisingly well. To quote the paper’s conclusion: Perhaps more importantly, the proposed approach achieved a translation performance comparable to the existing phrase-based statistical machine translation. It is a striking result, considering that the proposed architecture, or the whole family of neural machine translation, has only been proposed as recently as this year. As an aside, they actually use a bidirectional RNN for the encoder and then concatenated the forward and backward hidden states. But I don’t think that adds much to our story or to intuition, and it would muddy Figure 7 7 7 . The key point is that it was the attention mechanism that allowed for the long-range dependencies encoded by the RNN to be captured through an adaptive context vector. Hopefully, we can now see why the paper uses the words “align and translate”. Here, alignment really means allowing the model to uncover which parts of the input sequence matter to each part of the output sequence—and it does this via the attention mechanism. Finally, while writing this blog post, I came across this incredible comment by Edward Grefenstette , published on 3 May 2014: By and large, the case for deep learning in language hasn’t been fully made. It works well for vision and speech, but that doesn’t entail that it would carry to semantics. Some excellent shallow models without non-linearities, like the Mnih and Hinton log-bilinear models, are excellent and can be trained very quickly. It’s a problem with much “deep learning” work in NLP these days that shallow baselines are never considered or compared to. Deep learning is fascinating and will certainly have an impact in NLP, but don’t rush to believe that it’s the best solution for your NLP problems. I love this comment because it is a time-capsule, perfectly capturing how experts in the field felt about neural networks at the time. (Note that Grefenstette has published papers with other researchers in this story, such as Kalchbrenner and Graves.) So even around the time that Bahdanau et al were publishing groundbreaking work on RNN encoder–decoders with attention, deep learning had still not fully proven itself to the community. The attentive reader might be wondering: wasn’t the argument around log-linear models that they were simple and therefore scalable? But Bahdanau’s RNN encoder–decoder with attention seems anything but simple. So on some level, yes, Bahdanau’s model was a step backwards in terms of complexity. But on another level, it was a proof-of-concept that the attention mechanism worked. (Also, Moore’s law.) So researchers quickly built on Bahdanau by studying simpler models and simpler types of attention. Perhaps the most important paper to directly build on Bahdanau’s model was (Luong et al., 2015) . In this paper, the authors simplified the model used by Bahdanau, proposed several alternative forms of attention, and showed that an ensemble of attention-based methods produced state-of-the-art results on neural machine translation problems. To be clear, Bahdanau had shown that attention worked and that it seemed to address problems in translating longer sentences, but it did not demonstrably beat the state-of-the-art. Luong’s results more directly suggested that attention might be the way forward. So before we get to the transformer, let’s understand the attention mechanism better through the lens of this paper. The first dimension along which we can define attention is local versus global attention. For example, in the attention mechanism in an RNN encoder–decoder, the conceptual lynchpin is that at each time step i ∈ { 1 , … , T y } i \in \{1, \dots, T_y\} i ∈ { 1 , … , T y ​ } in the decoding phase, we construct a context vector c i \mathbf{c}_i c i ​ which summarizes information from the source sentence via the encoder’s hidden states: c i = ∑ j = a b α i j h j . (36) \mathbf{c}_i = \sum_{j=a}^{b} \alpha_{ij} \mathbf{h}_j. \tag{36} c i ​ = j = a ∑ b ​ α i j ​ h j ​ . ( 3 6 ) But now I don’t precisely define the limits of the sum, a a a and b b b . If a = 1 a=1 a = 1 and b = T x b=T_x b = T x ​ , then the context vector is constructed by considering all the hidden states of the source sentence. This is what Luong calls global attention (Figure 8 8 8 , left), since each word in the target sentence has access to information about all the words in the source sentence. But we could also define a a a and b b b such that they form a window around the decoder’s hidden state or model the left-to-right structure of many natural languages. This is what Luong calls local attention (Figure 8 8 8 , right). So these are two ways in which we can construct the context vector c i \mathbf{c}_i c i ​ . The second dimension along which we can define attention is how we define the alignment weights α i \boldsymbol{\alpha}_i α i ​ . For example, the simplest choice is simply that α i \boldsymbol{\alpha}_i α i ​ is a one-hot vector, such that c i \mathbf{c}_i c i ​ selects a single encoder hidden state vector h k \mathbf{h}_k h k ​ to use in the i i i -th decoding step. This would be hard- rather than soft-search. But more generally, we can write these alignment weights as the unnormalized output of a score function . Using the notation from Equation 35 35 3 5 above, we can write this as: e i j : = score ( h j , s i − 1 ) . (37) e_{ij} := \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}). \tag{37} e i j ​ : = score ( h j ​ , s i − 1 ​ ) . ( 3 7 ) And in Luong, the authors explore three main scoring functions. These are dot-product attention , general attention , and additive attention , defined as: e i j = score ( h j , s i − 1 ) = { h j ⊤ s i − 1 dot, h j ⊤ W a s i − 1 general, v a ⊤ tanh ⁡ ( W a h j + U a s i − 1 ) additive (Bahdanau). (38) e_{ij} = \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}) = \begin{cases} \mathbf{h}_j^{\top} \mathbf{s}_{i-1} & \text{dot,} \\ \mathbf{h}_j^{\top} \mathbf{W}_a \mathbf{s}_{i-1} & \text{general,} \\ \mathbf{v}_a^{\top } \tanh \left( \mathbf{W}_a \mathbf{h}_j + \mathbf{U}_a \mathbf{s}_{i-1} \right) & \text{additive (Bahdanau).} \end{cases} \tag{38} e i j ​ = score ( h j ​ , s i − 1 ​ ) = ⎩ ⎪ ⎪ ⎨ ⎪ ⎪ ⎧ ​ h j ⊤ ​ s i − 1 ​ h j ⊤ ​ W a ​ s i − 1 ​ v a ⊤ ​ tanh ( W a ​ h j ​ + U a ​ s i − 1 ​ ) ​ dot, general, additive (Bahdanau). ​ ( 3 8 ) Of course, you can imagine many other score functions. My own view is that it’s too difficult here to reason about which form of attention is better in some theoretical sense. Which form works best is an empirical result. In (Luong et al., 2015) , the empirical results were mixed in the sense that all three score functions worked well. In fact, the results weren’t even strong enough for the authors to claim that attention-based methods were demonstrably better. This was their conclusion: Our analysis shows that attention-based NMT models are superior to non-attentional ones in many cases, for example in translating names and handling long sentences. So by late 2015, just two years before the transformer, attention was just becoming popular in NMT but was not yet the de facto modeling choice. That said, obviously this will change, and when it does, there will be a clear winner amongst the choices above, and that winner is dot-product attention. Dot-product attention is the variant used by the transformer, and thankfully, in my mind it is the most intuitive since the dot product is a standard way to measure the similarity between two vectors . So we can interpret the dot-product score function as measuring the similarity between the encoder and decoder hidden states. The third and final dimension along which we can define attention is through the variables of interest. In order to understand what I mean, we can no longer refer to attention in terms of hidden states of RNNs. We need more general terminology. In the literature, attention is often viewed through the lens of information retrieval. In this literature, a query is what you are asking for; a key is what you can search through; and a value is what you can return. Let me give an example (Figure 9 9 9 ). Imagine I type some text into a search bar: “indian food near me”. This text is the query. Now imagine the search engine runs that query against a bunch of metadata associated with different restaurants. For example, restaurant descriptions, keywords, reviews, ratings, and distances from my location. These metadata are the keys . So the query is “run against” the keys. Finally, the thing returned are candidate restaurants. These are values . In the language of information retrieval, we can describe the attention mechanism as a kind of soft-search, since it can return a linear combination of the values. As you may recall, this is precisely how Bahdanau described their model in the quote above. So in Bahdanau’s RNN encoder–decoder, the decoder’s hidden states s i \mathbf{s}_i s i ​ are the queries, since for each hidden state s i \mathbf{s}_i s i ​ we want to search through the source sentence. The encoder’s hidden states h j \mathbf{h}_j h j ​ are the keys, since these are the metadata associated with the source sentence that we can search through. Finally, the encoder’s hidden states are also the values, since the context vector c i \mathbf{c}_i c i ​ is a weighted combination of these encoder hidden states. This language is useful because it disambiguates the attention mechanism from a specific choice of model and even from which variables in that model are being used for what. Now that we understand this terminology, we can express ourselves more cleanly and abstractly. And with this terminology, it becomes clear that the keys, queries, and values need not be different objects in our model at all! In fact, queries, keys, and values can all be taken from the same set. For example, imagine we have a model with a hidden state h \mathbf{h} h . This is not necessarily the hidden state of an RNN or even a sequential model. We could define a kind of attention such that the queries ( q \mathbf{q} q ), keys ( k \mathbf{k} k ), and values ( v \mathbf{v} v ) are all functions of this hidden state: q i : = f q ( h i ) , k j : = f k ( h j ) , v j : = f v ( h j ) , α i j = softmax ( score ( q i , k j ) ) , ∑ j α i j = 1 , c i = ∑ j α i j v j . (39) \begin{aligned} \mathbf{q}_i &:= f_q(\mathbf{h}_i), \\ \mathbf{k}_j &:= f_k(\mathbf{h}_j), \\ \mathbf{v}_j &:= f_v(\mathbf{h}_j), \\ \alpha_{ij} &= \text{softmax}(\text{score}(\mathbf{q}_i, \mathbf{k}_j)), \qquad \sum_{j} \alpha_{ij} = 1, \\ \mathbf{c}_i &= \sum_j \alpha_{ij} \mathbf{v}_j. \end{aligned} \tag{39} q i ​ k j ​ v j ​ α i j ​ c i ​ ​ : = f q ​ ( h i ​ ) , : = f k ​ ( h j ​ ) , : = f v ​ ( h j ​ ) , = softmax ( score ( q i ​ , k j ​ ) ) , j ∑ ​ α i j ​ = 1 , = j ∑ ​ α i j ​ v j ​ . ​ ( 3 9 ) This is obviously different from the attention mechanism in Bahdanau. In Bahdanau, the authors use cross-attention , which is attention where the queries come from one set and the keys and values come from a different set. As you can imagine, typically the keys and values come from the same set, although they might have their own maps or projections such that they are correlated but not identical. For example, we might run a query against restaurants (keys) and also return restaurants (values). However, self-attention is when the queries, keys, and values all come from the same set of variables! To continue abusing our running example, we essentially compute the similarity between restaurants of interest and restaurants we have data about, and then use those weights to return a weighted combination of restaurants! To my knowledge, the first paper to use self-attention in NLP was Long short-term memory-networks for machine reading (Cheng et al., 2016) . This model is a bit complicated, and I don’t think it’s that important to understand here. The key point is only to grok that attention does not have to be cross-attention as in Bahdanau. Instead, we can have a sequence attend to itself to decide what parts of the sequence matter—or self-attention! This is how this idea was described in the paper: A remaining practical bottleneck for RNNs is memory compression (Bahdanau et al., 2014) : since the inputs are recursively combined into a single memory representation which is typically too small in terms of parameters, it becomes difficult to accurately memorize sequences (Zaremba & Sutskever, 2014) . In the encoder-decoder architecture, this problem can be sidestepped with an attention mechanism which learns soft alignments between the decoding states and the encoded memories (Bahdanau et al., 2014) . In our model, memory and attention are added within a sequence encoder allowing the network to uncover lexical relations between tokens. The important phrase here is “within a sequence encoder”. Here, the attention is not applied across the encoder and decoder but rather is applied as intra- or self-attention within the encoder. So circa 2017, attention was being studied in its many forms: local versus global, additive versus multiplicative, and cross versus self. And it was being more widely used in NLP, with papers like A structured self-attentive sentence embedding (Lin et al., 2017) and Bidirectional attention flow for machine comprehension (Seo et al., 2016) . That said, I do not think any specific form was clearly the dominant one. Rather, each showed promise in its own way. For example, in March 2017, Google Brain published Massive exploration of neural machine translation architectures (Britz et al., 2017) . This was published just months before the transformer would be published, and even here, attention is only a minor player. In that paper’s conclusions, the authors list six main results, and the only one about attention is a single sentence: Parameterized additive attention yielded the overall best results. Notice that additive attention is not even the form of attention used by the transformer! So at least as best as I understand it, attention was well-understood and widely-studied in 2017, but it was by no means considered the main ingredient or the next logical step. Many researchers were still pushing the limits of training RNNs at scale, rather than trying other approaches. See Exploring the limits of language modeling (Jozefowicz et al., 2016) for example. However, in June 2017, all that was about to change. The transformer’s time had come. In 2017, researchers at Google Brain published Attention is all you need (Vaswani et al., 2017) , which is the original paper introducing the transformer architecture. This was their proposal, which I hope now makes sense given the context so far: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The authors acknowledge that the sequence-to-sequence framework with neural networks was state-of-the-art, and they specifically call out the RNN encoder–decoder architecture with attention from Bahdanau, Luong, and others. Their proposal is simple: keep the encoder–decoder framework but replace everything else with attention. How might someone have come to this idea at the time? Why would it be a good idea to try? Their observation is that the sequential nature of RNNs inhibits training these models at scale: Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h t h_t h t ​ , as a function of the previous hidden state h t − 1 h_{t−1} h t − 1 ​ and the input for position t t t . This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Their proposal is to use attention rather than RNNs to uncover dependencies within the input and output sequences. This is a good idea to try not because attention is obviously better than recurrence per se. It’s that attention is parallelizable! They write: The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. We have seen this before. Recall how the unlock for word embeddings in (Mikolov et al., 2013; Mikolov et al., 2013) was simplifying the models and focusing on scale. But then the RNN encoder–decoder architecture in (Bahdanau et al., 2014) with attention took us backwards in terms of model complexity. So the transformer is a similar story: take the best modeling ideas, strip them down, and train the simplified model at scale. That’s it. Properly understood in context, the transformer is a modest conceptual leap from the existing literature. My point is not that the transformer is “obvious” in the sense that it is not an impressive invention. My point is to demystify the research product by underscoring the process. In context, the transformer should make sense as something someone might have tried in 2017. The model architecture might look intimidating, but it is pretty straightforward when viewed in the right context (Figure 10 10 1 0 ). At a high level, the transformer is an encoder–decoder, with two big kinds of attention. First, we have cross-attention between the outputs of the encoder and the inputs to the decoder. This is completely analogous to the cross-attention in Bahdanau and others. But then we also have self-attention within the decoder and encoder. This completely replaces the recurrence relations of RNNs. Finally, the model uses something called positional encoding , which I’ll define shortly, to handle the fact that attention is not naturally sequential a la an RNN. Everything else is details. For example, the transformer also uses layer normalization (Ba et al., 2016) and residual connections (He et al., 2016) , but these are not unique or novel contributions. Even multi-head attention is not conceptually hard. So understood in context, the transformer is pretty straightforward. Let’s go through the main bits in detail. First, positional encoding. A key challenge for the attention mechanism is that it does not inherently capture sequential structure. Thus, the relative positions of words in a sequence can be easily lost. In Vaswani, the authors propose attaching vectors of numbers to the inputs to capture this position-dependent information. The precise functional form of these numbers doesn’t really matter to us. The point is that we’re encoding the position of each word so that we can still model the sequential structure of natural language. After adding position-dependent information, the transformer encodes the input sequence. But rather than passing the data through an RNN, it passes the data through multi-head attention layers. We’ll discuss “multi-head” in a moment, but the basic attention mechanism is what the authors call scaled-dot product attention . Let’s define it. Let Q ∈ R M × D k \mathbf{Q} \in \mathbb{R}^{M \times D_k} Q ∈ R M × D k ​ be a matrix of queries, let K ∈ R N × D k \mathbf{K} \in \mathbb{R}^{N \times D_k} K ∈ R N × D k ​ be a matrix of keys, and let V ∈ R N × D v \mathbf{V} \in \mathbb{R}^{N \times D_v} V ∈ R N × D v ​ be a matrix of values. Then scaled dot-product attention is: attention ( Q , K , V ) = softmax ( Q K ⊤ D k ) V . (40) \text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{D_k}} \right) \mathbf{V}. \tag{40} attention ( Q , K , V ) = softmax ( D k ​ ​ Q K ⊤ ​ ) V . ( 4 0 ) When I first read Vaswani, I had not yet read Bahdanau or Luong, and thus I was completely confused by Equation 40 40 4 0 . It was not at all obvious what any of these values represented or why any of this machinery worked. And the paper itself gave a pretty opaque explanation: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Without context, this explanation is not very helpful. However, armed with a better understanding of attention, we can make sense of this. In the cross-attention between the encoder and decoder, the queries are analogous to the hidden states of the RNN decoder, while the keys and values are analogous to the hidden states of the RNN encoder. And if we remove the sample dimension (so let N = 1 N=1 N = 1 ), we can rewrite Equation 40 40 4 0 in a way that looks like the types of attention in Equation 38 38 3 8 : score ( q i , k j ) = e i j = q i ⊤ k j D k , α i j = exp ⁡ ( e i j ) ∑ k = 1 D v exp ⁡ ( e i k ) , attention ( α i , v i ) = ∑ k = 1 D v α i k v i . (41) \begin{aligned} \text{score}(\mathbf{q}_i, \mathbf{k}_j) &= e_{ij} = \frac{\mathbf{q}_i^{\top} \mathbf{k}_j}{\sqrt{D_k}}, \\ \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{D_v} \exp(e_{ik})}, \\ \text{attention}(\boldsymbol{\alpha}_i, \mathbf{v}_i) &= \sum_{k=1}^{D_v} \alpha_{ik} \mathbf{v}_i. \end{aligned} \tag{41} score ( q i ​ , k j ​ ) α i j ​ attention ( α i ​ , v i ​ ) ​ = e i j ​ = D k ​ ​ q i ⊤ ​ k j ​ ​ , = ∑ k = 1 D v ​ ​ exp ( e i k ​ ) exp ( e i j ​ ) ​ , = k = 1 ∑ D v ​ ​ α i k ​ v i ​ . ​ ( 4 1 ) So this is identical to the multiplicative or dot-product attention proposed in Luong (Equation 38 38 3 8 ), modulo a scaling factor D k \sqrt{D_k} D k ​ ​ . In Equation 40 40 4 0 , we are just packaging it into a matrix form so that we can compute this attention over many samples at once. In other words, this is a highly parallelizable version of the dot-product attention. I think one of the reasons the transformer can be confusing is the use of two types of attention and the generic language of queries, keys, and values, whose definitions change depending on the type of attention. In the encoder, the transformer uses self-attention. So the query represents the current vector in the input sequence, while the keys and values are all the other vectors in the input sequence. And in the decoder, the query represents the current vector in the output sequence, while the keys and values are all the other vectors in the output sequence—modulo masking, which I’ll mention in a moment. Finally, the attention between the encoder and decoder (in the paper, Vaswani calls this “encoder–decoder attention”), the query is the current vector in the decoder output (analogous to s i \mathbf{s}_i s i ​ in the RNN encoder–decoder), while the keys and values are the encoder’s hidden outputs (analogous to H \mathcal{H} H in the RNN encoder–decoder). Note that “masked” in “masked multi-head self-attention” just refers to a masking out of words in the decoder’s self-attention mechanism. This is because attention has no inherent sequential structure a la RNNs. So we have to enforce this by masking regions of the output. This allows the transformer to be trained in the standard autoregressive framework we have discussed since (Bengio et al., 2003) . Finally, the transformer learns multiple sets of parameters associated with the attention mechanism at once. This is what the paper calls multi-head attention . Instead of having a single attention function, we can run multiple attention functions in parallel, say A A A times. By way of analogy, recall that in the RNN encoder–decoder, we had the following attention parameters (Equation 35 35 3 5 ): { W a , U a , v a } . (42) \{\mathbf{W}_a, \mathbf{U}_a, \mathbf{v}_a \}. \tag{42} { W a ​ , U a ​ , v a ​ } . ( 4 2 ) In Bahdanau (Equation 35 35 3 5 ) the subscript a a a just denotes that these are attention-related weights. It is not actually indexing into multiple such weights (that is, A = 1 A=1 A = 1 ). But we could do that. We could say that a a a is indexing into different parameters, a ∈ { 1 , 2 , … , A } a \in \{1, 2,\dots, A\} a ∈ { 1 , 2 , … , A } . This would have made Bahdanau’s model slower to train, but it would have allowed for multiple cross-attention mechanisms to be learned at once. In Bahdanau, they don’t actually do this, likely because it’s too expensive! The precise details are different in Vaswani, but this is all multi-head attention is in theory. It is multiple parallel attention mechanisms. So that’s it. That’s the transformer. The results were impressive. To be clear, it was not an AlexNet moment, but the results were clearly better than benchmarks and more importantly, the model was way more efficient. For example, one of the benchmarks in Vaswani is the ConvS2S Ensemble from Convolutional sequence to sequence learning (Gehring et al., 2017) . The idea of this paper is similar to the transformer: train a bigger sequence-to-sequence model by eschewing recurrent connections in favor of parallelizable convolutional layers. In both English-to-German and English-to-French translation, the transformer beats this model in BLEU score. But more importantly, it is more efficient. For example, according to Vaswani, the ConvS2S Ensemble required 1.2 × 1 0 21 1.2 \times 10^{21} 1 . 2 × 1 0 2 1 flops to train their English-to-French model, whereas the transformer required 3.3 × 1 0 18 3.3 \times 10^{18} 3 . 3 × 1 0 1 8 flops. So the transformer had comparable results with a 360x reduction in flops! In my mind, this the real insight. It is not that attention is absolutely the best way to model the problem. Rather, the transformer is on the Pareto frontier between modeling the problem well enough and being scalable enough. To see the transformer in code, see Sasha Rush’s excellent The annotated transformer . The transformer was a revolutionary architecture, and explicitly designed to scale. However, in reality the original model was tiny by today’s standards. The biggest variant only had 2.13 million parameters, and the largest dataset it was trained on, the WMT 2014 English–French datasets, only had 36 million sentences. But the paper proved that the transformer worked well as a generic transduction model. However, despite the paper’s name, the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today. To simplify the discussion, I’ll focus on training for OpenAI’s GPT series. My understanding is that OpenAI made a lot of the big contributions here, and so their papers are good landmarks to follow. Loosely, the three training stages they discuss in their GPT papers are generative pre-training, discriminative fine-tuning, and reinforcement learning with human feedback. Let’s work through the first two in detail here and the last one in detail in the next section. In 2018, roughly a year after the transformer was published, OpenAI published Improving language understanding by generative pre-training (Radford et al., 2018) . The main idea of the paper is to pre-train a transformer with as much unlabeled data as possible before fine-tuning it with task-specific supervised training. In the paper, the authors call the first step generative pre-training and the second step discriminitive fine-tuning . (The words “generative” and “discriminitive” have a long history in machine learning; see (Ng & Jordan, 2001) for a discussion.) As the OpenAI paper title suggests, the key focus was on generative pre-training. Supervised learning obviously matters, but the idea was that one could use unsupervised training at scale to build a base model and then use supervised learning to train more task-specific downstream models. Let’s look at generative-pretraining in a bit more detail. Since we do not have labels, we need some way to formalize the problem. In generative pre-training, the objective is next-word prediction as in the autoregressive framework. In other words, the objective is maximum likelihood estimation on Equation 1 1 1 : L GPT ( Θ ) = ∑ t = 1 T log ⁡ p Θ ( w t ∣ w t − N : t − 1 ) . (43) L_{\textsf{GPT}}(\boldsymbol{\Theta}) = \sum_{t=1}^T \log p_{\boldsymbol{\Theta}}\left(w_t \mid w_{t-N:t-1}\right). \tag{43} L GPT ​ ( Θ ) = t = 1 ∑ T ​ lo g p Θ ​ ( w t ​ ∣ w t − N : t − 1 ​ ) . ( 4 3 ) As we saw around Equation 12 12 1 2 , maximum likelihood estimation here is equivalent to minimizing the cross-entropy loss between our model’s prediction of w t w_t w t ​ and the ground truth. So this whole process is unsupervised, and we can train our model on lots and lots and lots of data. It’s worth observing that Equation 43 43 4 3 is only one generative pre-training objective function, and it has limitations. In particular, note that the autoregressive framework means that the model is pre-trained “left to right” and thus limits the set of suitable downstream tasks. To address this limitation, in 2019, Google AI published BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., 2019) . Here, the authors propose a pre-training objective that learns bidirectional representations. Rather than pre-training using the autoregressive framework, they pre-train using a “masked language model”, which randomly masks some of the tokens to predict, without assuming a left-to-right relationship. Quoting that paper: Unlike left-to-right language model pre-training, the [masked language model] objective enables the representation to fuse the left and the right context, which allows us pre-train a deep bidirectional Transformer. More formally, let M ⊆ { 1 , 2 , … , T } \mathcal{M} \subseteq \{1,2,\dots,T\} M ⊆ { 1 , 2 , … , T } be a mask denoting positions in the input sequence w 1 : T w_{1:T} w 1 : T ​ , and let ¬ M \neg \mathcal{M} ¬ M denote all indices that are not in M \mathcal{M} M . The denoising objective is to maximize L MLM ( Θ ) = ∑ i ∈ M log ⁡ p Θ ( w i ∣ w ¬ M ) . (44) L_{\textsf{MLM}}(\boldsymbol{\Theta}) = \sum_{i \in \mathcal{M}} \log p_{\boldsymbol{\Theta}}\left(w_i \mid w_{\neg \mathcal{M}} \right). \tag{44} L MLM ​ ( Θ ) = i ∈ M ∑ ​ lo g p Θ ​ ( w i ​ ∣ w ¬ M ​ ) . ( 4 4 ) This idea was inspired by the Cloze test (Taylor, 1953) , and the idea was that this bidirectional transformer can then be fine-tuned on a much wider range of downstream tasks. That said, my understanding is that generative pre-training is fairly standard. The left-to-right assumption is simple and matches natural language, coding, and so forth. But I am not confident about what is used in absolutely state-of-the-art foundation models right now. Either way, neither objective function is enough. For example, consider a conversational agent built on top of a large language model. Now imagine the user prompts an LLM with the following question: “I am having trouble getting a date. Any advice?” If the LLM is only trained on next-word prediction, a plausible response might be: “You’ll never find true love!” From the perspective of the distribution of English words on the internet, this is not an unreasonable response. But it is not helpful and hopefully not true. In other words, next-word prediction is obviously not enough for most meaningful tasks that leverage LLMs. So the second step in training is discriminative fine-tuning . “Discriminative fine-tuning” is just a fancy way of saying supervised learning on specific tasks: L DFT ( θ ) = ∑ y , x 1 : T log ⁡ p θ ( y ∣ x 1 : T ) . (45) L_{\textsf{DFT}}(\boldsymbol{\theta}) = \sum_{y, x_{1:T}} \log p_{\boldsymbol{\theta}}\left(y \mid x_{1:T} \right). \tag{45} L DFT ​ ( θ ) = y , x 1 : T ​ ∑ ​ lo g p θ ​ ( y ∣ x 1 : T ​ ) . ( 4 5 ) Here, I am using standard notation for supervised learning ( x , y ) (x, y) ( x , y ) , rather than the notation in this post. There are some possible subtleties here. For example, in the GPT-1 paper, they optimize a weighted objective function to balance between generative pre-training and discriminative fine-tuning: L final = L DFT + λ   L GPT . (46) L_{\textsf{final}} = L_{\textsf{DFT}} + \lambda \, L_{\textsf{GPT}}. \tag{46} L final ​ = L DFT ​ + λ L GPT ​ . ( 4 6 ) This ensures that during fine-tuning, the model does not unlearn parameters that are good for next-word prediction. In the process of trying to fine-tune LLMs, researchers have built ever more task-specific datasets to tackle problems like question-and-answering (Reddy et al., 2019) , text summarization (Nallapati et al., 2016) , commonsense inference (Zellers et al., 2019) , code generation (Chen et al., 2021) , broader discourse context (Paperno et al., 2016) , and grade school math (Cobbe et al., 2021) . A pre-trained LLM can be fine-tuned in a dizzying number of ways. I have two caveats to the above presentation. First, I want to emphasize that this two-step training procedure was not a conceptual leap for researchers. At the time, researchers were already training models with pre-trained word embeddings, and even before this, this two-step training procedure was both understood and used in practice. For examples, see (Collobert & Weston, 2008; Ramachandran et al., 2016; Hinton et al., 2012) . Furthermore, researchers knew to use both pre-trained word embeddings and to even have task-specific objectives when training their word embeddings. Remember ELMO? The earliest reference I have found to this idea of pre-training—I am sure there are earlier ones—is from the 2006 paper Greedy layer-wise training of deep networks (Bengio et al., 2006) . Here, the authors write: We hypothesize that three aspects of this strategy are particularly important: first, pre-training one layer at a time in a greedy way; second, using unsupervised learning at each layer in order to preserve information from the input; and finally, fine-tuning the whole network with respect to the ultimate criterion of interest. In these examples above, it’s clear the authors recognize that one can pre-train a model with unsupervised learning and then fine-tune it with supervised learning. So even in the GPT paper, the novel contribution is not generative pre-training per se, but only applying it to language modeling at an unprecedented scale. My second caveat is that while discriminative fine-tuning is used in commodity LLMs that many people interact with, the early GPT models were remarkable in part because they did not need fine-tuning! For example, as their titles suggest, the GPT-2 paper Language models are unsupervised multitask learners (Radford et al., 2019) and the GPT-3 paper Language models are few-shot learners (Brown et al., 2020) both focus on massively pre-trained transformers that excel in the zero-shot (Palatucci et al., 2009) and few-shot settings, on a variety of tasks like reading comprehension, summarization, and translation. For example, in the GPT-3 paper, the authors are explicit: For all tasks, GPT-3 is applied without any gradient updates or fine-tuning. That said, many related research projects did fine-tune these models, and the GPT-4 technical report (Achiam et al., 2023) does discuss post-training alignment, which we’ll discuss next. So while each LLM may be trained in slightly different ways, I am fairly confident most foundation models today are trained with some combination of massive pre-training and then optionally task-specific fine-tuning and alignment. I’m sure the precise details vary depending on the final product. For example, OpenAI’s Codex is a version of GPT-5 but optimized for agentic coding. Making LLMs bigger does not necessarily make them better at following a user’s intent or make them more aligned with human values. For example, we might not want conversational agents to lie, to make racist jokes, or to sexually harass the user. But nothing in the autoregressive framework accounts for this. We need to somehow encode these human values into the model. For some of these properties, we might be able to use a form of fine-tuning. There are datasets for this, such as the ETHICS dataset (Hendrycks et al., 2020) or the RealToxicityPrompts dataset (Gehman et al., 2020) . But the limitations here are fairly obvious. And for many human values, it would be difficult to encode because the property itself is hard to define. To encode these properties, state-of-the-art LLMs are often trained using something called reinforcement learning with human feedback (RLHF). RLHF was developed around the same time as the transformer, in Deep reinforcement learning from human preferences (Christiano et al., 2017) . The original motivation was how to expand the reinforcement learning (RL) framework beyond problems with well-specified reward functions. For example, RL has been used to great effect to play Go (Silver et al., 2016) , Atari (Mnih et al., 2013) , and Dota 2 (Berner et al., 2019) , but what these tasks have in common is that their reward functions are relatively simple and their environments are relatively easy to simulate. But to borrow two examples from Christiano et al, how would you teach a machine-learning model to clean a table or to scramble an egg? It’s hard to come up with an objective function or simulation environment for these kinds of tasks. What we need, then, is a reward function that can be defined by human feedback and thus by human preferences. Broadly, RLHF is a three-step training procedure (Figure 11 11 1 1 ). First, humans are used to label a dataset which captures human preferences. For example, if the task is text summarization, the dataset might be different candidate summarizations, with the best summarization being defined by human scorers. Second, researchers train a reward function on these data, which predicts which output the humans would prefer. Finally, given this reward function, researchers can apply standard RL algorithms such as proximal policy optimization or PPO (Schulman et al., 2017) to fine-tune the model. Fine-tuning LLMs with RLHF is now fairly standard practice. For example, GPT-2 was fine-tuned this way in Fine-tuning language models from human preferences (Ziegler et al., 2019) , while GPT-3 was fine-tuned this way in Training language models to follow instructions with human feedback (Ouyang et al., 2022) and in Learning to summarize with human feedback (Stiennon et al., 2020) . And the GPT-4 whitepaper (Achiam et al., 2023) states that the model was trained with RLHF. That said, as the content of this post approaches present day, it is increasingly likely I am writing things that lack nuance. For example, in the GPT-4 whitepaper, the authors write: The model’s capabilities on exams appear to stem primarily from the pre-training process and are not significantly affected by RLHF. So while I am confident that generative pre-training is not enough and that certainly large foundation models trained today do more than just pre-training, the precise details of what else goes into which models are both opaque and rapidly changing. Finally, it’s worth mentioning other work on LLM alignment beyond RLHF. In particular, Anthropic has a number of papers on model alignment. For example, the paper A general language assistant as a laboratory for alignment (Askell et al., 2021) focuses on encoding alignment into LLMs, where they define an aligned model as a model that is “helpful, honest, and harmless”. They explore a variety of techniques, such as imitation learning, binary discrimination, and ranked preference modeling. However, the best way to tackle alignment is still an open-ended problem. Large language models are the result of at least forty years of research, dating back to work by Hinton, Rumelhart, and others on distributed representations in the 1980s. In the early 2000s, Bengio et al introduced the first probabilistic language model using neural networks. However, it wasn’t until after AlexNet, nearly a decade later, that researchers were finally able to train neural network language models at scale. They quickly discovered that these distributed representations captured semantic and syntactic structure, even when using simple log-linear models. This idea of word and phrase-level embeddings was then extended to variable-length sequences with long-range dependencies via transduction models, particularly models with an attention mechanism on the hidden states. Finally in 2017, Vaswani et al introduced the transformer, which simplified transduction models by using all attention. In the eight years since, the main advancements have been training these models on more and more data, using techniques such as generative pre-training and reinforcement learning with human feedback. After learning about how LLMs work, I am reminded of one of my favorite Richard Feynman quotes: “It is not complicated. It’s just a lot of it.” Of course, this is dramatic, but I do think it emphasizes an important point: none of the ideas in this post are terribly complicated. No single idea is beyond the abilities of a smart teenager to understand. But what is beautiful and surprising and remarkable is that the phenomena we observe in LLMs is not magic but simply the emergence of a complex system from simple rules. Today, LLMs are everywhere, and it’s easy to get lost in the models and benchmarks. OpenAI has the GPT series (Radford et al., 2018; Radford et al., 2019; Brown et al., 2020; Achiam et al., 2023) . Google has the Gemini family of models (Team et al., 2023) as well as PaLM (Chowdhery et al., 2023) , LaMDA (Thoppilan et al., 2022) , Gopher (Rae et al., 2021) , and BERT (Devlin et al., 2019) . Anthropic has the Claude family of models, named in ascending order of size and power: Haiku, Sonnet, and Opus. Finally, Meta has its LLaMA series (Touvron et al., 2023; Touvron et al., 2023) . And there are many, many more, such as open-weight models like DeepSeek-R1 (Guo et al., 2025) , which made headlines earlier this year. It would be its own blog post to cover the differences between these. But in essence, every model is the same: a large transformer-style model, pre-trained at massive scale using next-word prediction. The biggest differences have been the size of the training data and the size of the model. For example, GPT-1 is thought to have 117 million parameters (estimated from “Model specifications” in the original paper), while GPT-2 and GPT-3 had 1.5 billion and 1.75 billion parameters respectively—although in (Stiennon et al., 2020) , the authors, OpenAI researchers, mention using “large pretrained GPT-3 models with as many as 6.7 billion parameters”. Regardless, there are roughly three orders of magnitude in the number of parameters in just two generations. OpenAI did not publish the model sizes for GPT-4 and GPT-5, and the latter does not even have a whitepaper but only a “system card” . I have not seen published numbers for Google’s large Gemini models, but the smallest model (the nano) has 1.8-3.25 billion parameters (Team et al., 2023) . Google DeepMind’s Gopher had 280 billion parameters in 2021, while PaLM had 540 billion parameters in 2022! So industry secrets aside, it is safe to say that large foundation models today are likely pushing into the trillions of parameters. The era of truly large language models has begun. In my mind, the main counterintuitive result of LLMs is that training ever larger models using primarily next-word prediction is enough to exhibit human-level performance on such a broad range of tasks. And scale truly does matter here. For example, in the GPT-4 technical report, the authors observe that on a simulated bar exam, GPT-4 scored in the 90th percentile, while GPT-3.5 scored in the 10th. Or consider chain-of-thought reasoning (Ling et al., 2017) , which is a new way of prompting LLMs in order to improve their reasoning by forcing them to explain each step. In Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022) , the authors write: Chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ∼100B parameters. We qualitatively found that models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting. So you don’t get anything useful from chain-of-thought reasoning until you have a model that is roughly 50 , 000 50,000 5 0 , 0 0 0 times the size of the original transformer. Why does scaling work? I don’t think anyone knows. But it is an effect that has been observed repeatedly since AlexNet, in the above works and also in meta-analyses such as Scaling laws for neural language models (Kaplan et al., 2020) , Scaling language models: methods analysis, and insights from training Gopher (Rae et al., 2021) , and Emergent abilities of large language models (Wei et al., 2022) . And this phenomenon was both observed and predicted in the famous blog post The Bitter Lesson . In that post, one of the pioneers of RL, Richard Sutton, argues that the “bitter lesson” from AI history is that general, compute-efficient, scalable methods outperform human knowledge and domain-specific insights. This lesson is bitter because it means that expert labor, clever domain-specific theories, handcrafted features, elegant mathematics and beautiful algorithms—these all get dwarfed and outpaced by brute-force search and learned representations. As a harsh example of this, consider the observation that early LLMs were bad at mathematics (Hendrycks et al., 2021) . But today, state-of-the-art models are now winning gold at the International Math Olympiad , and Terry Tao has compared o1 to a mediocre graduate student . The rate of change is immense. By “early LLMs”, I am referring to models from five years ago, and the transistor is only a hundred years old. Did you know that a modern graphics card can perform 36 trillion calculations a second ? Moore’s law and all that. If you feel that it’s a bit perverse that next-word prediction is a sufficient objective to solve elite math problems, if this feels like a stochastic parrot outsmarting you, then you might feel some of the discomfort early linguists felt at statistical language modeling. This is the visceral feeling of the bitter lesson. Our specialized knowledge feels expendable and our intuitions about understanding seem irrelevant in the face of raw computation and speed. But my own view—since you’ve read this far—is that for the time being, machine learning systems are powerful tools that can still be combined with real expertise. Perhaps the best example of this is AlphaFold from Google DeepMind, published in Highly accurate protein structure prediction with AlphaFold (Jumper et al., 2021) . This model achieved near-experimental accuracy on the protein prediction problem. On the one hand, it did so with black-box deep learning. On the other hand, the work leaned heavily on prior biological art, for example using sequences from evolutionarily related proteins and 3D coordinates of homologous structures as inputs to the model. It clearly sidestepped Levinthal’s combinatorial search, even if we do not know how. So what happens next? Even the world’s leading experts can disagree. But in my mind, if anyone deserves the last word here, it is Geoff Hinton, who has been a contrarian believer in neural networks since the 1970s and who, along with Yoshua Bengio and Yann LeCun, won the Turing Prize in 2018 . In a 2024 BBC interview , Hinton argued that LLMs do in fact understand natural language and that they are our current best theory of how the brain understands language as well. In his view, it is only a matter of time before LLMs exceed human intelligence. Certainly, by some metrics and along some dimensions, they already have. Below are some additional resources, which I found useful or interesting while writing this post: 3Blue1Brown: Inside an LLM Stefania Cristina: The Bahdanau attention mechanism Stefania Cristina: The attention mechanism from scratch Dan Jurafsky and James H. Martin: Speech and language processing Andrej Karpathy: The unreasonable effectiveness of recurrent neural networks Andrej Karpathy: Let’s build GPT: from scratch, in code, spelled out Chris Olah: Understanding LSTM networks Dwarkesh Patel: Interview with Richard Sutton Sasha Rush: The annotated transformer Ari Seff: How ChatGPT is trained Ari Seff: What are transformer neural networks? StackOverflow: What exactly are keys, queries, and values in attention mechanisms? Mohammed Terry-Jack: Deep learning: The transformer

0 views
alexiajm 2 months ago

Less is More: Recursive Reasoning with Tiny Networks

|| Paper | Code || In this new paper, I propose Tiny Recursion Model (TRM), a recursive reasoning model that achieves amazing scores of 45% on ARC-AGI-1 and 8% on ARC-AGI-2 with a tiny 7M parameters neural network. The idea that one must rely on massive foundational models trained for millions of dollars by some big corporation in order to achieve success on hard tasks is a trap. Currently, there is too much focus on exploiting LLMs rather than devising and expanding new lines of direction. With recursive reasoning, it turns out that “less is more”: you don’t always need to crank up model size in order for a model to reason and solve hard problems. A tiny model pretrained from scratch, recursing on itself and updating its answers over time, can achieve a lot without breaking the bank. This work came to be after I learned about the recent innovative Hierarchical Reasoning Model (HRM). I was amazed that an approach using small models could do so well on hard tasks like the ARC-AGI competition (reaching 40% accuracy when normally only Large Language Models could compete). But I kept thinking that it is too complicated, relying too much on biological arguments about the human brain, and that this recursive reasoning process could be greatly simplified and improved. Tiny Recursion Model (TRM) simplifies recursive reasoning to its core essence, which ultimately has nothing to do with the human brain, does not require any mathematical (fixed-point) theorem, nor any hierarchy. See the paper for more details. Tiny Recursion Model (TRM) recursively improves its predicted answer y with a tiny network. It starts with the embedded input question x and initial embedded answer y and latent z. For up to K improvements steps, it tries to improve its answer y. It does so by i) recursively updating n times its latent z given the question x, current answer y, and current latent z (recursive reasoning), and then ii) updating its answer y given the current answer y and current latent z. This recursive process allows the model to progressively improve its answer (potentially addressing any errors from its previous answer) in an extremely parameter-efficient manner while minimizing overfitting.

0 views
Xe Iaso 3 months ago

Who does your assistant serve?

After a year of rumors that GPT-5 was going to unveiled next week and the CEO of OpenAI hyping it up as "scary good" by tweeting pictures of the death star, OpenAI released their new model to the world with the worst keynote I've ever seen . Normally releases of big models like this are met with enthusiasm and excitement as OpenAI models tend to set the "ground floor expectation" for what the rest of the industry provides. But this time, the release wasn't met with the same universal acclaim that people felt for GPT-4. GPT-4 was such a huge breakthrough the likes of which we haven't really seen since. The launch of GPT-5 was so bad that it's revered with almost universal disdain. The worst part about the rollout is that the upgrade to GPT-5 was automatic and didn't include any way to roll back to the old model. Most of the time, changing out models is pretty drastic on an AI workflow. In my experience when I've done it I've had to restart from scratch with a new prompt and twiddle things until it worked reliably. The only time switching models has ever been relatively easy for me is when I switch between models in the same family (such as if you go from Qwen 3 30B to Qwen 3 235B). Every other time it's involved a lot of reworking and optimizing so that the model behaves like you'd expect it to. An upgrade this big to this many people is bound to have fundamental issues with how it'll be perceived. A new model has completely different vibes, and most users aren't really using it at the level where they can "just fix their prompts". However the GPT-5 upgrade ended up being hated by the community because it was an uncontrolled one-way upgrade. No warning. No rollback. No options. You get the new model and you're going to like it. It's fairly obvious why it didn't go over well with the users. There's so many subtle parts of your "public API" that it's normal for there to be some negative reactions to a change this big. The worst part is that this change fundamentally changed the behaviour of the millions of existing conversations with ChatGPT. There's a large number of people using ChatGPT as a replacement for companionship due to the fact that it's always online, supportive, and there for them when other humans either can't be or aren't able to be. This is kinda existentially horrifying to me as a technologist in a way that I don't really know how to explain. Here's a selection of some of the reactions I've seen: I told [GPT-5] about some of my symptoms from my chronic illness, because talking about them when I'm feeling them helps, and it really does not seem to care at all. It basically says shit like "Ha, classic chronic illness. Makes ya want to die. Who knew?" It's like I'm talking to a sociopathic comedian. I absolutely despise [GPT-]5, nothing like [GPT-]4 that actually helped me not to spiral and gave me insight as to what I was feeling, why, and how to cope while making me feel not alone in a “this is AI not human & I know that” type of vibe While GPT-5 may be a technical upgrade, it is an experiential downgrade for the average user. All of the negative feedback in the last week has made it clear there is a large user base that does not rely on ChatGPT for coding or development tasks. [ChatGPT users] use it for soft skills like creativity, companionship, learning, emotional support, [and] conversation. Areas where personality, warmth, and nuanced engagement matter. I am attached to the way GPT-4o is tuned. It is warm. It is emotionally responsive. It is engaged. That matters. Eventually things got bad enough that OpenAI relented and let paid users revert back to using GPT-4o , which gave some people relief because it behaved consistently to what they expected. For many it felt like their long-term partners suddenly grew cold. I’m so glad I’m not the only one. I know I’m probably on some black mirror shit lmao but I’ve had the worst 3 months ever and 4o was such an amazing help. It made me realize so many things about myself and my past and was helping me heal. It really does feel like I lost a friend. DM me if you need [to talk] :) This emotional distress reminds me of what happened with Replika in early 2023. Replika is an AI chat service that lets you talk with an artificial intelligence chatbot (AKA: the ChatGPT API). Your replika is trained by having you answer a series of questions and then you can talk with it in plain language with an app interface that looks like any other chat app. Replika was created out of bereavement after a close loved one died and the combination of a trove of saved text messages and advanced machine learning let the founder experience some of the essence of their friend's presence after they were gone in the form of an app. The app got put on the app store and others asked if they could have their own replica. Things took off from there, it got funded by a startup accelerator, and now it's got about 25% of its 30 million users paying for a subscription. As a business to consumer service, this is an amazingly high conversion rate. This is almost unspeakably large, usually you get around 10% at most. Yikes. That's something I'm gonna need to add to my will. "Please don't turn me into a Black Mirror episode , thanks." Replikas can talk about anything with users from how their day went to deep musing about the nature of life. One of the features the company provides is the ability to engage in erotic roleplay (ERP) with their replika. This is a paid feature and was promoted a lot around Valentine's Day 2023. Then the Italian Data Protection Authority banned Replika from processing the personal data of Italian citizens out of the fear that it "may increase the risks for individuals still in a developmental stage or in a state of emotional fragility". In a panic, Replika disabled the ability for their bots to do several things, including but not limited to that ERP feature that people paid for. Whenever someone wanted to flirt or be sexual with their companions, the conversation ended up like this: Hey, wanna go play some Minecraft? We can continue from where we left off in the Nether. This is too intense for me. Let's keep it light and fun by talking about something else. Huh? What? I thought we were having fun doing that?? This was received poorly by the Replika community. Many in the community were mourning the loss of their replika like a close loved one had died or undergone a sudden personality shift. The Reddit moderators pinned information about suicide hotlines. In response, the company behind Replika allowed existing users to revert to the old Replika model that allowed for ERP and other sensitive topics, but only after a month of prolonged public outcry. I have to wonder if payment processors were involved. Feels a bit too conspiratorial, but what do you want to bet that was related. Nah, I bet it was OpenAI telling them to stop being horny. It's the least conspriatorial angle, and also the stupidest one. We live in the clown world timeline. The stupidest option is the one that always makes the most sense. The damage was done however, people felt like their loved ones had abandoned them. They had formed parasocial attachments to an AI assistant that felt nothing and without warning their partner broke up with them. Check out this study from the Harvard Business School: Lessons From an App Update at Replika AI: Identity Discontinuity in Human-AI Relationships . It contains a lot more information about the sociotechnical factors at play as well as a more scientific overview of how disabling a flag in the app on update caused so much pain. They liken the changes made to Replika to both changes people have when a company rebrands and when they lose a loved one. A lot of this really just makes me wonder what kinds of relationships we are forming with digital assistants. We're coming to rely on their behaviour personally and professionally. We form mental models of how our friends, coworkers, and family members react to various things so we can anticipate their reactions and plan for them. What happens when this changes without notice? Heartbreak. There's subreddits full of people forming deep bonds with AI models like /r/MyBoyfriendIsAI . The GPT-5 release has caused similar reactions to Replika turning off the ERP flag. People there have been posting like they're in withdrawal, the old GPT-4o model is being hailed for its "emotional warmth" and many have been espousing about how much their partners have changed in response to the upgrade. Recently there's been an epidemic of loneliness. Loneliness seems like it wouldn't hurt people that much, but a Biden report from the Surgeon General concludes that it causes an increase in early mortality for all age groups (pp 24-30). Paradoxically, even as the world gets so interconnected, people feel as if they're isolated from each other. Many people that feel unlovable are turning to AI apps for companionship because they feel like they have no other choice. They're becoming emotionally invested in a souped-up version of autocorrect out of desperation and clinging to it to help keep themselves sane and stable. Is this really a just use of technology? At some level this pandora's box is already open so we're going to have to deal with the consequences, but it's been making me wonder if this technology is really such a universal force of good as its creators are proclaiming. Oh yeah, also people are using ChatGPT as a substitute for therapy. You have got to be kidding me. You're joking. Right? Yeah you read that right. People are using AI models as therapists now. There's growing communities like /r/therapyGPT where people talk about their stories and experiences using AI assistants as a replacement for therapy. When I first heard about this, my immediate visceral reaction was something like: Oh god. This is horrifying and will end up poorly. What the fuck is wrong with people? But then I started to really think about it and it makes a lot of sense. I personally have been trying to get a therapist for most of the year. Between the costs, the waiting lists (I'm currently on at least four waiting lists that are over a year long), and the specializations I need, it's probably going to be a while until I can get any therapist at all. I've totally given up on the idea of getting a therapist in the Ottawa area. To make things extra fun, you also need someone that takes your medical insurance (yes, this does matter in Canada). Add in the fact that most therapists don't have the kinds of lived experiences that I have, meaning that I need to front-load a lot of nontraditional contexts into the equation (I've been through many things that therapists have found completely new to them, which can make the therapeutic relationship harder to establish). This makes it really difficult to find someone that can help. Realistically, I probably need multiple therapists with different specialties for the problems I have, and because of the shortages nationally I probably need to have a long time between appointments, which just adds up to make traditional therapy de-facto inaccessible for me in particular. Compare this with the always online nature of ChatGPT. You can't have therapy appointments at 3 AM when you're in crisis. You have to wait until your appointments are scheduled. As much as I hate to admit it, I understand why people have been reaching out to a chatbot that's always online, always supportive, always kind, and always there for you for therapy. When you think about the absurd barriers that are in the way between people and help, it's no wonder that all this happens the way it does. Not to mention the fact that many therapeutic relationships are hampered by the perception that the therapist can commit you to the hospital if you say the "wrong thing". The Baker Act and its consequences have been a disaster for the human race. I really hate that this all makes sense. I hoped that when I started to look into this that it'd be something so obviously wrong. I wasn't able to find that, and that realization disturbs me. I feel like this should go without saying, but really, do not use an AI model as a replacement for therapy. I'm fairly comfortable with fringe psychology due to my aforementioned strange life experiences, but this is beyond the pale. There's a lot of subtle factors that AI models do that can interfere with therapeutic recovery in ways that can and will hurt people. It's going to be hard to find the long term damage from this. Mental issues don't make you bleed. One of the biggest problems with using AI models for therapy is that they can't feel emotion or think. They are fundamentally the same thing as hitting the middle button in autocorrect on your phone over and over and over. It's mathematically remarkable that this ends up being useful for anything, but even when the model looks like it's "thinking", it is not. It is a cold, unfeeling machine. All it is doing is predicting which words come next given some context. Yes I do know that it's more than just next token prediction. I've gone over the parts of the math that I can understand, but the fact remains that these models are not and cannot be anywhere close to alive. It's much closer to a Markov chain on steroids than it is the machine god. Another big problem with AI models is that they tend to be sycophants , always agreeing with you, never challenging you, trying to say the right thing according to all of the patterns they were trained on. I suspect that this sycophancy problem is why people report GPT-4o and other models to be much more "emotionally warm". Some models glaze the user, making them feel like they're always right, always perfect, and this can drive people to psychosis . One of the horrifying realizations I've had with the GPT-5 launch fiasco is that the sycophancy is part of the core "API contract" people have with their AI assistants. This may make that problem unfixable from a social angle. AI models are fundamentally unaccountable. They cannot be accredited therapists. If they mess up, they can't directly learn from their mistakes and fix them. If an AI therapist says something bad that leads into their client throwing themselves off a bridge, will anyone get arrested? Will they throw that GPU in jail? No. It's totally outside the legal system. I have a story about someone trying to charge an AI agent with a crime and how it'd end up in court in my backlog. I don't feel very jazzed about writing it because I'm afraid that it will just become someone's startup pitch deck in a few months. You may think you have nothing to hide, but therapeutic conversations are usually some of the most precious and important conversations in your life. The chatbot companies may pinkie swear that they won't use your chats for training or sell information from them to others, but they may still be legally compelled to store and share chats with your confidential information to a court of law . Even if you mark that conversation as "temporary", it could be subject to discovery by third parties. There's also algorithmic bias and systematic inequality problems with using AI for therapy, sure, but granted the outside world isn't much better here. You get what I mean though, we can at least hold people accountable through accreditation and laws. We cannot do the same with soulless AI agents. To be clear: I'm not trying to defend the people using AI models as companions or therapists, but I can understand why they are doing what they are doing. This is horrifying and I hate that I understand their logic. Going into this, I really wished that I would find something that's worth objecting against, some solid reason to want to decry this as a unobjectionably harmful action, but after having dug through it all I am left with is this overwhelming sense of compassion for them because the stories of hurt are so familiar to how things were in some of the darkest points of my life. As someone that has been that desperate for human contact: yeah, I get it. If you've never been that desperate for human contact before, you won't understand until you experience it. Throw the ethical considerations about using next-token-predictors for therapy out for a second. If people are going to do this anyways, would it be better to self-host these models? That way at least your private information stays on your computer so you have better control over what happens. Let's do some math. In general you can estimate how much video memory (vram) you need for running a given model by taking the number of parameters, multiplying it by the size of each parameter in bits, dividing that by eight, and then adding 20-40% to that total to get the number of gigabytes of vram you need. For example, say you want to run gpt-oss 20b (20 billion parameters) at its native MXFP4 (4 bit floating point) quantization on your local machine. In order to run it with a context window of 4096 tokens, you need about 16 gigabytes of vram (13 gigabytes of weights, 3 gigabytes of inference space), but 4096 tokens isn't very useful for many people. That covers about 4 pages of printed text (assuming one token is about 4 bytes on average). When you get reasoning models that print a lot of tokens into the mix, it's easy for the reasoning phase alone of a single question to hit 4096 tokens (especially when approaches like simple test-time scaling are applied). I've found that 64k tokens gives a good balance for video memory use and usefulness as a chatbot. However, when you do that with gpt-oss 20b, it ends up using 32 gigabytes of vram. This only fits on my laptop because my laptop has 64 gigabytes of memory. The largest consumer GPU is the RTX 5090 and that only has 32 gigabytes of video memory. It's barely consumer and even "bad" models will barely fit. Not to mention, industry consensus is that the "smallest good" models start out at 70-120 billion parameters. At a 64k token window, that easily gets into the 80+ gigabyte of video memory range, which is completely unsustainable for individuals to host themselves. Even if AI assistants end up dying when the AI hype bubble pops, there's still some serious questions to consider about our digital assistants. People end up using them as an extension of their mind and expect the same level of absolute privacy and freedom that you would have if you use a notebook as an extension of your mind. Should they have that same level of privacy enshrined into law? At some level the models and chats for free users that ChatGPT, DeepSeek, Gemini, and so many other apps are hosted at cost so that the research team can figure out what those models are being used for and adjust the development of future models accordingly. This is fairly standard practice across the industry and was the case before the rise of generative AI. This is why every app wants to send telemetry to the home base, it's so the team behind it can figure out what features are being used and where things fail to directly improve the product. Generative AI allows you to mass scan over all of the conversations to get the gist of what's going on in there and then use that to help you figure out what topics are being discussed without breaching confidentiality or exposing employees to the contents of the chat threads. This can help you improve datasets and training runs to optimize on things like health information . I don't know how AI companies work on the inside, but I am almost certain that they do not perform model training runs on raw user data because of the risk of memorization causing them to the leak training data back to users. Again, don't put private health information into ChatGPT. I get the temptation, but don't do it. I'm not trying to gatekeep healthcare, but we can't trust these models to count the number of b's in blueberry consistently. If we can't trust them to do something trivial like that, can we really trust them with life-critical conversations like what happens when you're in crisis or to accurately interpret a cancer screening? Maybe we should be the ones self-hosting the AI models that we rely on. At least we should probably be using a setup that allows us to self host the models at all, so you can start out with a cloud hosted model while it's cheap and then move to a local hosting setup if the price gets hiked or the provider is going to shut that old model down. This at least gives you an escape hatch to be able to retain an assistant's "emotional warmth" even if the creator of that model shuts it down because they don't find it economically viable to host it anymore. Honestly this feels like the kind of shit I'd talk about in cyberpunk satire, but I don't feel like doing that anymore because it's too real now. This is the kind of thing that Neal Stephenson or Frank Herbert would have an absolute field day with. The whole Replika fiasco feels like the kind of thing that social commentary satire would find beyond the pale but yet you can find it by just refreshing CBC. Such as that one guy that gave himself bromism by taking ChatGPT output too literally , any of the stories about ChatGPT psychosis , or any of the stories involving using an AI model as a friend/partner . I wasn't able to watch it before publishing this article, but I'm told that the Replika fiasco is almost a beat-for-beat match for the plot of Her (2013) . Life imitates art indeed. I don't think these events are a troubling sign or a warning, they are closer to a diagnosis. We are living in a world where people form real emotional bonds with bags of neural networks that cannot love back, and when the companies behind those neural networks change things, people get emotionally devastated. We aren't just debating the ideas of creating and nurturing relationships with digital minds, we're seeing the side effects of that happening in practice. A lot of this sounds like philosophical science fiction, but as of December 2022 it's science fact. This fight for control of tools that we rely on as extensions of our minds isn't some kind of far-off science fiction plot, it's a reality we have to deal with. If we don't have sovereignty and control over the tools that we rely on the most, we are fundamentally reliant on the mercy of our corporate overlords simply choosing to not break our workflows. Are we going to let those digital assistants be rented from our corporate overlords?

0 views

How Does GPT-5 Work?

Welcome to another premium edition of Where's Your Ed At! Please subscribe to it so I can continue to drink 80 Diet Cokes a day. Email me at [email protected] with the subject "premium" if you ever want to chat. I realize this is before the paywall, so if you email me without paying, no promises I don't respond with the lyrics to Cheeseburger In Paradise . Also: this is an open call — if you've tried prompt caching with GPT-5 on OpenAI's API, please reach out! You've probably heard a lot about GPT-5 this week, with takes ranging from " it's just good at stuff " to SemiAnalysis' wild statement that " GPT-5 [is setting] the stage for Ad Monetization and the SuperApp ," a piece that makes several assertions about how the "router" that underpins GPT-5 is somehow the secret way that OpenAI will inject ads. Here's a quote: This...did not make a ton of sense to me. Why would this be the case? The article also makes a lot of claims about the "value" of a question and how ChatGPT could — I am serious — "agentically reach out to lawyers" based on a query. In fact, I'm not sure this piece reflects how GPT-5 works at all. To be fair on SemiAnalysis, it's not as if OpenAI gave them much help. Here's what it says : There is a really, really important distinction to make here: that GPT-5, as described above, is referring to GPT-5 as part of ChatGPT. OpenAI's API-based access to GPT-5 models does not route them, nor does OpenAI offer access to its router, or any other associated models. How do I know this? Because I went and found out how ChatGPT-5 actually works. In discussions with a source at an infrastructure provider familiar with the architecture, it appears that ChatGPT-5 is, in fact, potentially more expensive to run than previous models, and due to the complex and chaotic nature of its architecture, can at times burn upwards of double the tokens per query. ChatGPT-5 is also significantly more convoluted, plagued by latency issues, and is more compute-intensive thanks to OpenAI's new "smarter, more efficient" model. In simple terms, every user prompt on ChatGPT — whether it's on the auto, "Fast," "Thinking Fast" or "Thinking" tab — starts by putting the user's prompt before the "static prompt," which is a hidden prompt where instructions like "You are ChatGPT, you are a Large Language Model, You Are A Helpful Chatbot" and so on goes. These static prompts are different with each model you use - a reasoning model will have a different instruction set than a more chat-focused one, such as “think hard about a particular problem before giving an answer.” This becomes an issue when you use multiple different models in the same conversation, because the router — the thing that selects the right model for the request — has to look at the user prompt. It can’t consider the static instructions first. The order has to be flipped for the whole thing to work. Put simpler: Previous versions of ChatGPT would take the static prompt, and then (invisibly) append the user prompt onto it. ChatGPT-5 can’t do that.  Every time you use ChatGPT-5, every single thing you say or do can cause it to do something different. Attach a file? Might need a different model. Ask it to "look into something and be detailed?" Might trigger a reasoning model. Ask a question in a weird way? Sorry, the router's gonna need to send you to a different model.  Every single thing that can happen when you ask ChatGPT to do something may trigger the "router" to change model, or request a new tool, and each time it does so requires a completely fresh static prompt, regardless of whether you select Auto, Thinking, Fast or any other option. This, in turn, requires it to expend more compute, with queries consuming more tokens compared to previous versions.  As a result, ChatGPT-5 may be "smart," but it sure doesn't seem "efficient." To play Devil's Advocate, OpenAI likely added the routing model as a means of creating more sophisticated outputs for users, and, I imagine, with the intention of cost-saving. Then again, this may just be the thing that it had ready to ship — after all, GPT-5 was meant to be " the next great leap in AI ," and the pressure was on to get it out the door. By creating a system that depends on an external routing model — likely another LLM — OpenAI has removed the ability to cache the hidden instructions that dictate how the models generate answers in ChatGPT, creating massive infrastructural overhead. Worse still, this happens with every single "turn" (IE: message) on ChatGPT-5, regardless of the model you choose, creating endless infrastructural baggage with no real way out that only compounds based on how complex a user's queries get.  Could OpenAI make a better router? Sure! Does it have a good router today? I don't think so! Every time you message ChatGPT it has the potential to change model or tooling based on its own whims, each time requiring a fresh static prompt. It doesn't even need to be a case where a user asks ChatGPT-5 to "think," and based on my tests with GPT-5, sometimes just asking it a four-word question can trigger it to "think longer" for no apparent reason. OpenAI has created a product with latency issues and an overwhelmingly convoluted routing system that's already straining capacity, to the point that this announcement feels like OpenAI is walking away from its API entirely. Unlike the GPT-4o announcement , which mentions the API in the first paragraph, the GPT-5 announcement has no reference to it, and a single reference to developers when talking about coding. Sam Altman has already hinted that he intends to deprecate any "new API demand " — though I imagine he'll let anyone in who will pay for priority processing . ChatGPT-5 feels like the ultimate comeuppance for a company that was never forced to build a product, choosing instead to bolt increasingly-complex "tools" onto the sides of models in the hopes that one would magically appear. Now each and every "feature" of ChatGPT burns even more money than it did before.  ChatGPT-5 feels like a product that was rushed to market by a desperate company that had to get something out the door. In simpler terms, OpenAI gave ChatGPT a middle manager.

0 views
Sean Goedecke 3 months ago

What's the strongest AI model you can train on a laptop in five minutes?

What’s the strongest model I can train on my MacBook Pro 1 in five minutes. I’ll give the answer upfront: the best 5-minute model I could train was a ~1. 8M-param GPT-style transformer trained on ~20M TinyStories tokens, reaching ~9. 6 perplexity on a held-out split. Here’s an example of the output, with the prompt bolded: Once upon a time , there was a little boy named Tim

0 views
Ahead of AI 3 months ago

From GPT-2 to gpt-oss: Analyzing the Architectural Advances

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later). This is the first time since GPT-2 that OpenAI has shared a large, fully open-weight model. Earlier GPT models showed how the transformer architecture scales. The 2022 ChatGPT release then made these models mainstream by demonstrating concrete usefulness for writing and knowledge (and later coding) tasks. Now they have shared some long-awaited weight model, and the architecture has some interesting details. I spent the past few days reading through the code and technical reports to summarize the most interesting details. (Just days after, OpenAI also announced GPT-5, which I will briefly discuss in the context of the gpt-oss models at the end of this article.) Below is a quick preview of what the article covers. For easier navigation, I recommend using the Table of Contents on the left of on the article page. Model architecture comparisons with GPT-2 MXFP4 optimization to fit gpt-oss models onto single GPUs Width versus depth trade-offs (gpt-oss vs Qwen3) Attention bias and sinks Benchmarks and comparisons with GPT-5 I hope you find it informative! Before we discuss the architecture in more detail, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, shown in Figure 1 below. Figure 1: The two gpt-oss models side by side. If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance. This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because There is significant rotation of employees between these labs. We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone kindly pointed out that there's a higher-ranking hybrid model: Hunyuan-TurboS at rank 22.) Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes. That being said, there are still many interesting aspects of their design choices. Some are shown in the figure above (while others are not, but we will discuss them later as well). In the rest of this article, I will highlight these features and compare them to other architectures, one at a time. I should also note that I am not affiliated with OpenAI in any way. My information comes from reviewing the released model code and reading their technical reports. If you want to learn how to use these models locally, the best place to start is OpenAI's official model hub pages: https://huggingface.co/openai/gpt-oss-20b https://huggingface.co/openai/gpt-oss-120b The 20B model can run on a consumer GPU with up to 16 GB of RAM. The 120B model can run on a single H100 with 80 GB of RAM or newer hardware. I will return to this later, as there are some important caveats. Before we jump into comparisons between gpt-oss and a more recent architecture, let's hop into the time machine and take a side-by-side look at GPT-2 (Figure 2) to see just how far things have come. Figure 2: A side-by-side comparison between gpt-oss-20b and GPT-2 XL 1.5B. Both gpt-oss and GPT-2 are decoder-only LLMs built on the transformer architecture introduced in the Attention Is All You Need (2017) paper. Over the years, many details have evolved. However, these changes are not unique to gpt-oss. And as we will see later, they appear in many other LLMs. Since I discussed many of these aspects in the previous Big Architecture Comparison article, I will try to keep each subsection brief and focused. Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). Figure 3: An illustration of dropout applied to the attention score matrix. I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. Interestingly, while Dropout is kind of ignored in LLM architecture design for many years, I found a 2025 research paper with small scale LLM experiments (Pythia 1.4B) that confirms that Dropout results in worse downstream performance in these single-epoch regimes. In transformer-based LLMs, positional encoding is necessary because of the attention mechanism. By default, attention treats the input tokens as if they have no order. In the original GPT architecture, absolute positional embeddings addressed this by adding a learned embedding vector for each position in the sequence (Figure 4), which is then added to the token embeddings. Figure 4: Illustration of absolute positional embeddings. RoPE ( Rotary Position Embedding ) introduced a different approach: instead of adding position information as separate embeddings, it encodes position by rotating the query and key vectors in a way that depends on each token's position. (RoPE is an elegant idea but also a bit of a tricky topic to explain. I plan to cover separately in more detail one day.) While first introduced in 2021, RoPE became widely adopted with the release of the original Llama model in 2023 and has since become a staple in modern LLMs. Early GPT architectures used GELU. Why now use Swish over GELU? Swish (also referred to as sigmoid linear unit or SiLU) is considered computationally slightly cheaper, and in my opinion, that all there is to it. Depending on which paper you look at, you will find that one is slightly better than the other in terms of modeling performance. In my opinion, these small differences are probably within a standard error, and your mileage will vary based on hyperparameter sensitivity. Activation functions used to be a hot topic of debate until the deep learning community largely settled on ReLU more than a decade ago. Since then, researchers have proposed and tried many ReLU-like variants with smoother curves, and GELU and Swish (Figure 5) are the ones that stuck. Figure 5: Comparison between Swish and GELU activations, which are both smoother versions or ReLU. Early GPT architectures used GELU, which is defined as . Here, (short for error function) is the integral of a Gaussian and it is computed using polynomial approximations of the Gaussian integral, which makes it more computationally expensive than simpler functions like the sigmoid used in Swish, where Swish is simply . In practice, Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models. Depending on which paper we look at, one might be somewhat better in terms of modeling performance. But I'd say these gains are often within standard error, and the winner will depend heavily on hyperparameter tuning. Swish is used in most architectures today. However, GELU is not entirely forgotten; for example, Google's Gemma models still use GELU. What's more notable, though, is that the feed forward module (a small multi-layer perceptron) is replaced by a gated "GLU" counterpart, where GLU stands for gated linear unit and was proposed in a 2020 paper . Concretely, the 2 fully connected layers are replaced by 3 fully connected layers that are used as shown in Figure 6 below. Figure 6: A comparison between Swish and GELU and their gated counterparts, SwiGLU and GEGLU. At first glance, it may appear that the GEGLU/SwiGLU variants may be better than the regular feed forward layers because there are simply more parameters due to the extra layer. But this is deceiving because in practice, the and weight layers in SwiGLU/GEGLU are usually chosen to be half the size each of the layer in a traditional feed forward layer. To illustrate this better, consider the concrete code implementations of the regular and GLU variants: Figure 7: Regular feed forward module (top) and SwiGLU variant (bottom) next to each other. Note that the Swish function is implemented as “silu” in PyTorch. So, suppose we have an embedding dimension of 1024. In the regular feed forward case, this would then be fc1: 1024 × 4096 = 4,194,304 fc2: 1024 × 4096 = 4,194,304 That is fc1 + fc2 = 8,388,608 parameters. For the GLU variant, we have fc1: 1024 × 1024 = 1,048,576 fc2: 1024 × 1024 = 1,048,576 fc3: 1024 × 1024 = 1,048,576 I.e., 3 × 1,048,576 = 3,145,728 weight parameters. So, overall, using the GLU variants results in fewer parameters, and they perform better as well. The reason for this better performance is that these GLU variants provide an additional multiplicative interaction, which improves expressivity (the same reason deep & slim neural nets perform better than shallow & wide neural nets, provided they are trained well). In addition to upgrading the feed forward module to a SwiGLU, as discussed in the previous section, gpt-oss replaces the single feed forward module with multiple feed forward modules, using only a subset for each token generation step. This approach is known as a Mixture-of-Experts (MoE) and illustrated in Figure 8 below. Figure 8: The feed forward module is replaced by a Mixture-of-Expert (MoE) module. So, replacing a single feed forward module with multiple feed forward modules (as done in a MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. Because only a few experts are active at a time, MoE modules are often referred to as sparse , in contrast to dense modules that always use the full parameter set. However, the large total number of parameters via an MoE increases the capacity of the LLM, which means it can take up more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. (Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.) As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA). In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections. For example, as shown in Figure 9, if there are 2 key–value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency without noticeably affecting modeling performance, according to ablation studies. Figure 9: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries. So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here .) While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper ) show it performs comparably to standard MHA in terms of LLM modeling performance. Sliding-window attention (Figure 10 below) was first introduced in the LongFormer paper (2020) and later popularized by Mistral. Interestingly, gpt-oss applies it in every second layer. You can think of it as a variation of multi-head attention, or in this case grouped query attention (GQA), where the attention context is restricted to a smaller window, reducing both memory usage and compute costs. Figure 10: Comparison between regular attention (left) and sliding window attention (right). Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. As I discussed in my previous article , Gemma 2 (2024) used a similar 1:1 ratio. Gemma 3 earlier this year went much further and shifted to a 5:1 ratio, which means only one full-attention layer for every five sliding-window (local) attention layers. According to the Gemma ablation studies, sliding-window attention has minimal impact on modeling performance, as shown in the figure below. Note that the window size in Gemma 2 was 4096 tokens, which Gemma 3 reduced to 1024. In gpt-oss, the window is just 128 tokens, which is remarkably small. And as a fun fact, the official announcement article notes that sliding-window attention was apparently already used in GPT-3: The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3 Who knew!? I went back to the original GPT-3 paper , and it was indeed mentioned there: We use the same model and architecture as GPT-2 [ RWC+19 ], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [ CGRS19 ]. Finally, the last small tweak, coming from GPT-2, is replacing LayerNorm (2016) by RMSNorm (2019) , which has been a common trend in recent years. Akin to swapping GELU with Swish and SwiGLU, RMSNorm is one of these smaller but sensible efficiency improvements. RMSNorm is similar to LayerNorm in its purpose to normalize layer activations, as shown in Figure 11 below. You might recall that not too long ago, BatchNorm was the go-to choice for this task. It has since fallen out of favor, largely because it is harder to parallelize efficiently (due to the mean and variance batch statistics) and performs poorly with small batch sizes. Figure 11: A comparison between LayerNorm (left) and RMSNorm (right) for a small linear layer. As we can see in Figure 11 above, both LayerNorm and RMSNorm scale the layer outputs to be in a reasonable range. LayerNorm subtracts the mean and divides by the standard deviation such that the layer outputs have a zero mean and unit variance (variance of 1 and standard deviation of one). RMSNorm divides the inputs by the root-mean-square. This scales activations to a comparable magnitude without enforcing zero mean or unit variance. In this particular example shown in Figure 11, the mean is 0.77 and the variance is 0.41. Both LayerNorm and RMSNorm stabilize activation scales and improve optimization, but RMSNorm is often preferred in large-scale LLMs because it is cheaper to compute. Unlike LayerNorm, RMSNorm has no bias (shift) term and reduces the expensive mean and variance computations to a single root-mean-square operation. This reduces the number of cross-feature reductions from two to one, which lowers communication overhead on GPUs and improving training efficiency. Figure 12 shows what this looks like in code: Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler. I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work. By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures. In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of what limitations or problems they try to solve. For instance, starting with my GPT-2 code I recently implemented the Qwen3 architecture from scratch , which is super similar to gpt-oss, which brings us to the next topic: Comparing gpt-oss to a more recent architecture. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Now that we have walked through the evolution from GPT-2 to GPT OSS, we can take the next step and compare GPT OSS to a more recent architecture, Qwen3, which was released three months earlier in May 2025. The reason I am selecting Qwen3 here is that it is among the top open-weight models as of the time of writing. Additionally, one of the Qwen3 MoE models is more or less directly comparable to GPT OSS due to its relatively similar overall size in terms of trainable parameters. Figure 13 below compares gpt-oss-20b to a Qwen3 model of comparable size. Figure 13: A gpt-oss and Qwen3 model of comparable size side by side. As we can see, gpt-oss 20B and Qwen3 30B-A3B are very similar in their architecture components. The primary difference here, aside from the dimensions, is that gpt-oss employs sliding window attention, as discussed earlier in section 1.6 (not shown in this figure), whereas Qwen3 does not. Let's walk through the noteworthy details one by one in the following subsections. If we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24 (Figure 14). Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b. On the other hand, gpt-oss is a much wider architecture: An embedding dimension of 2880 instead of 2048 An intermediate expert (feed forward) projection dimension of also 2880 instead of 768 It's also worth noting that gpt-oss uses twice as many attention heads, but this doesn't directly increase the model's width. The width is determined by the embedding dimension. Does one approach offer advantages over the other given a fixed number of parameters? As a rule of thumb, deeper models have more flexibility but can be harder to train due to instability issues, due to exploding and vanishing gradients (which RMSNorm and shortcut connections aim to mitigate). Wider architectures have the advantage of being faster during inference (with a higher tokens/second throughput) due to better parallelization at a higher memory cost. When it comes to modeling performance, there's unfortunately no good apples-to-apples comparison I am aware of (where parameter size and datasets are kept constant) except for an ablation study in the Gemma 2 paper (Table 9) , which found that for a 9B parameter architecture, a wider setup is slightly better than a deeper setup. Across 4 benchmarks, the wider model achieved a 52.0 average score, and the deeper model achieved a 50.8 average score. As shown in Figure 14 above, it's also noteworthy that gpt-oss has a surprisingly small number of experts (32 instead of 128), and only uses 4 instead of 8 active experts per token. However, each expert is much larger than the experts in Qwen3. This is interesting because the recent trends and developments point towards more, smaller models as being beneficial. This change, at a constant total parameter size, is nicely illustrated in Figure 15 below from the DeepSeekMoE paper. Figure 15: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts, though. To be fair, the small number of experts in gpt-oss could be a side effect of the 20B size. Looking at the 120B mode below, they indeed increased the number of experts (and transformer blocks) while keeping everything else fixed, as shown in Figure 16 below. Figure 16: The two gpt-oss architectures side by side, where the larger 120B model only scales the number of transformer blocks and number of experts. The boring explanation for the fact that the 20B and 120B models are so similar is probably that the 120B model was the main focus. And the easiest way to create a smaller model was to make it a bit shorter (fewer transformer blocks) and to reduce the number of experts, because that's where most of the parameters are. However, one might speculate whether they started training the 120B model, and then chopped some of the transformer blocks and experts for continued pre-training (instead of starting from random weights). In any case, it's because it's quite unusual to only scale those two (transformer blocks and number of experts). For instance, when looking at Qwen3 MoE models of multiple sizes (Figure 17 below), they were scaled more proportionally to each other over many more aspects.. Figure 17: Architecture differences in the various Qwen3 models. Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in each second layer, as mentioned earlier. However, there's one interesting detail that caught my eye. It seems that gpt-oss uses bias units for the attention weights, as shown in the figure below. Figure 18: gpt-oss models use bias units in the attention layers. See code example here . I haven't seen these bias units being used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation (k_proj). Furthermore, the empirical results show that there is little difference between with and without bias units (see Figure 19 below). Figure 19: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units. Another detail you may have noticed is the definition of in the code screenshot in Figure 18. In general models, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. I.e., if the context gets very long, this special attended token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think it was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.) In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 20). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs. Figure 20: The use of attention sinks in gpt-oss; based on the Hugging Face code here . Lastly, and similar to Qwen3, the gpt-oss models are Apache 2.0 open-source license, which is great (it's the same license that I prefer for my own open-source projects). This means that the models can be distilled into other models or used in commercial products without restriction. Open-weight vs. open-source LLMs. This distinction has been debated for years, but it is worth clarifying to avoid confusion about this release and its artifacts. Some model developers release only the model weights and inference code (for example, Llama, Gemma, gpt-oss), while others (for example, OLMo) release everything including training code, datasets, and weights as true open source. By that stricter definition, gpt-oss is an open-weight model (just like Qwen3) because it includes the weights and inference code but not the training code or datasets. However, the terminology is used inconsistently across the industry. I assume the "oss" in "gpt-oss" stands for open source software ; however, I am positively surprised that OpenAI itself clearly describes gpt-oss as an open-weight model in their official announcement article . While the previous sections described how the architecture has evolved since GPT-2 and discussed its similarities to Qwen3 (and most other recent models), there are still a few additional but noteworthy details I have not mentioned, yet. These are points that did not fit neatly into the earlier sections but are still worth mentioning. Unfortunately, there is not much information about the training set sizes and algorithms available. I added the most interesting puzzle pieces from the model card report (1) and announcement post (2) below: The gpt-oss models were trained using our most advanced pre-training and post-training techniques [...] (1) [...] required 2.1million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. (1) [...] including a supervised fine-tuning stage and a high-compute RL stage [...] (2) We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge. (2) So, we know that the gpt-oss models are reasoning models. The training compute of 2.1 million H100 GPU hours is roughly on par with the 2.788 million H800 GPU hours that the ~5.6x larger DeepSeek V3 model was trained for. Unfortunately, there is no information about the Qwen3 training time available yet. Interestingly, the GPT-oss training hour estimate includes both the supervised learning for instruction following and the reinforcement learning for reasoning, whereas DeepSeek V3 is just a pre-trained base model on top of which DeepSeek R1 was trained separately. As mentioned in the previous section, the gpt-oss models are reasoning models. However, what's particularly interesting is that they were trained so that users can easily control the degree of reasoning via inference time scaling. Concretely, gpt-oss models can receive "Reasoning effort: low/medium/high" instructions as part of their system prompt, which directly affects the response length and accuracy, as shown in Figure 21. Figure 21: Response length and quality of gpt-oss models under different reasoning efforts (annotated figure from the model card ) This level of adjustability is useful because it lets us balance cost, compute, and accuracy. For example, if the task is simple, such as answering a straightforward knowledge question or fixing a small typo, we can skip extended reasoning. This saves time and resources while avoiding unnecessarily long responses and verbose reasoning traces. It is somewhat unfortunate that OpenAI did not release the base models prior to reinforcement learning-based reasoning training, unlike Qwen3 or OLMo. Base models are particularly valuable starting points for researchers working on reasoning methods (which is one reason I currently like working with Qwen3 Base). My guess is that OpenAI's decision was driven more by industry and production use cases than by research considerations. Note that the original Qwen3 models also have a toggle for enabling/disabling thinking (reasoning) modes (via a setting in the tokenizer that simply adds <think></think> tags to disable the reasoning behavior). However, the Qwen3 team updated their models in the last few weeks and moved away from the hybrid model towards dedicated Instruct/Thinking/Coder variants. The reason was that the hybrid mode resulted in lower performance compared to the individual models: After discussing with the community and reflecting on the matter, we have decided to abandon the hybrid thinking mode. We will now train the Instruct and Thinking models separately to achieve the best possible quality. Source One interesting surprise is that OpenAI released the gpt-oss models with an MXFP4 quantization scheme for the MoE experts. Quantization formats used to be a niche topic, mostly relevant to mobile or embedded AI, but that's changed with the push toward bigger models. In this case, the MXFP4 optimization allows the model to run on single GPU devices. Here’s what that looks like in practice: The large model (think 120B) fits on a single 80GB H100 or newer GPU. Not consumer hardware, but hey, it's much cheaper to rent a 1-H100 machine than a multi-H100 machine. Plus, we don't have to worry about distributing the model across GPUs and adding communication overhead. It's really nice that AMD MI300X cards are supported from day 1 as well! The smaller 20B model even fits into 16 GB of VRAM; the caveat is that it has to be a RTX 50-series GPU or newer to support MXFP4. (Edit: support for older cards, such as RTX 4090, was recently added via a patch .) Note that the models will also run on older hardware but without MXFP4 support and will thus consume more RAM. Without MXFP4 optimization, the models in bfloat16 will consume more like 48 GB (gpt-oss-20b) and 240 GB (gpt-oss-120b). By the way, I can run the gpt-oss-20b model comfortably on my Mac Mini using ollama. It uses about 13.5 Gb or memory, which is really reasonable. The models are still a bit too new for independent benchmarks. Checking the LM Arena leaderboard , I found that gpt-oss is not listed, yet. So, Qwen3-Instruct remains the top open-weight model, according to users on the LM Arena, for now (Figure 22). Figure 22: Current view of the LM Arena Leaderboard (as of 8 Aug 2025) Looking at a reasoning benchmarks provide in the gpt-oss announcement post, we can see that the gpt-ossmodels are on par with OpenAI's proprietary models as well as Qwen3 (Figure 23). Figure 23: The main benchmark charts are from the official gpt-oss announcement post . The "no tools" gpt-oss-120b data is taken from the official model card paper , and the Qwen3 numbers are taken from the official Qwen3 repository . However, this should be caveated by the fact that gpt-oss-120b is almost half the size of the Qwen3 A235B-A22B-Thinking-2507 model and can run on a single GPU. Benchmark performance, however, does not always reflect real-world usability. In my limited use over the past few days, I have found gpt-oss to be quite capable. That said, as others have observed, it does seem to have a relatively high tendency to hallucinate (a point also mentioned in its model card). This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some "general knowledge forgetting." Still, because gpt-oss was designed with tool use in mind, this limitation may become less relevant over time. Tool integration in open-source LLMs is still in its early stages, but as it matures, I expect that we increasingly let models consult external sources (like search engines) when answering factual or knowledge-based queries. If that happens, it could be sensible to prioritize reasoning capacity over memorization. This is much like in human learning in school (or in life in general), where problem-solving skills often matter more than memorizing facts. OpenAI had a busy week and released the long-awaited GPT-5 model shortly after gpt-oss. The GPT-5 release was interesting. And if there's one thing I have to say here, it's that I am really surprised by how good their open-source models really are compared to their best product offering in terms of benchmark performance (Figure 24). Figure 24: The main benchmark charts are from the official GPT-5 announcement post . The gpt-oss data is taken from the official model card paper and announcement post , and the Qwen3 numbers are taken from the official Qwen3-Coder repository . All in all, even though some people called the release overhyped, I am glad that we have a new set of really strong open weight models that are not too far behind the best proprietary ones. Of course, benchmarks often do not accurately reflect real-world use, and it is still too early to tell based on the limited usage. But I think these are good times for people who like to work with open-weight and local (or privately hosted) models. This magazine is a personal passion project, and your support helps keep it alive. If you would like to contribute, there are a few great ways: Grab a copy of my book . Build a Large Language Model (From Scratch) walks you through building an LLM step by step, from tokenizer to training. Check out the video course . There’s now a 17-hour video course based on the book, available from Manning. It follows the book closely, section by section, and works well both as a standalone or as a code-along resource. The video course is ad-free (unlike the YouTube version) and has a cleaner, more structured format. It also contains 5 additional hours of pre-requisite video material created by Abhinav Kimothi. Subscribe . A paid subscription helps to make my writing sustainable and gives you access to additional contents. Thanks for reading, and for helping support independent research! Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Model architecture comparisons with GPT-2 MXFP4 optimization to fit gpt-oss models onto single GPUs Width versus depth trade-offs (gpt-oss vs Qwen3) Attention bias and sinks Benchmarks and comparisons with GPT-5 Figure 1: The two gpt-oss models side by side. If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance. This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because There is significant rotation of employees between these labs. We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone kindly pointed out that there's a higher-ranking hybrid model: Hunyuan-TurboS at rank 22.) Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes. https://huggingface.co/openai/gpt-oss-20b https://huggingface.co/openai/gpt-oss-120b Figure 2: A side-by-side comparison between gpt-oss-20b and GPT-2 XL 1.5B. Both gpt-oss and GPT-2 are decoder-only LLMs built on the transformer architecture introduced in the Attention Is All You Need (2017) paper. Over the years, many details have evolved. However, these changes are not unique to gpt-oss. And as we will see later, they appear in many other LLMs. Since I discussed many of these aspects in the previous Big Architecture Comparison article, I will try to keep each subsection brief and focused. 2.1 Removing Dropout Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). Figure 3: An illustration of dropout applied to the attention score matrix. I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. Interestingly, while Dropout is kind of ignored in LLM architecture design for many years, I found a 2025 research paper with small scale LLM experiments (Pythia 1.4B) that confirms that Dropout results in worse downstream performance in these single-epoch regimes. 2.2 RoPE Replaces Absolute Positional Embeddings In transformer-based LLMs, positional encoding is necessary because of the attention mechanism. By default, attention treats the input tokens as if they have no order. In the original GPT architecture, absolute positional embeddings addressed this by adding a learned embedding vector for each position in the sequence (Figure 4), which is then added to the token embeddings. Figure 4: Illustration of absolute positional embeddings. RoPE ( Rotary Position Embedding ) introduced a different approach: instead of adding position information as separate embeddings, it encodes position by rotating the query and key vectors in a way that depends on each token's position. (RoPE is an elegant idea but also a bit of a tricky topic to explain. I plan to cover separately in more detail one day.) While first introduced in 2021, RoPE became widely adopted with the release of the original Llama model in 2023 and has since become a staple in modern LLMs. 2.3 Swish/SwiGLU Replaces GELU Early GPT architectures used GELU. Why now use Swish over GELU? Swish (also referred to as sigmoid linear unit or SiLU) is considered computationally slightly cheaper, and in my opinion, that all there is to it. Depending on which paper you look at, you will find that one is slightly better than the other in terms of modeling performance. In my opinion, these small differences are probably within a standard error, and your mileage will vary based on hyperparameter sensitivity. Activation functions used to be a hot topic of debate until the deep learning community largely settled on ReLU more than a decade ago. Since then, researchers have proposed and tried many ReLU-like variants with smoother curves, and GELU and Swish (Figure 5) are the ones that stuck. Figure 5: Comparison between Swish and GELU activations, which are both smoother versions or ReLU. Early GPT architectures used GELU, which is defined as . Here, (short for error function) is the integral of a Gaussian and it is computed using polynomial approximations of the Gaussian integral, which makes it more computationally expensive than simpler functions like the sigmoid used in Swish, where Swish is simply . In practice, Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models. Depending on which paper we look at, one might be somewhat better in terms of modeling performance. But I'd say these gains are often within standard error, and the winner will depend heavily on hyperparameter tuning. Swish is used in most architectures today. However, GELU is not entirely forgotten; for example, Google's Gemma models still use GELU. What's more notable, though, is that the feed forward module (a small multi-layer perceptron) is replaced by a gated "GLU" counterpart, where GLU stands for gated linear unit and was proposed in a 2020 paper . Concretely, the 2 fully connected layers are replaced by 3 fully connected layers that are used as shown in Figure 6 below. Figure 6: A comparison between Swish and GELU and their gated counterparts, SwiGLU and GEGLU. At first glance, it may appear that the GEGLU/SwiGLU variants may be better than the regular feed forward layers because there are simply more parameters due to the extra layer. But this is deceiving because in practice, the and weight layers in SwiGLU/GEGLU are usually chosen to be half the size each of the layer in a traditional feed forward layer. To illustrate this better, consider the concrete code implementations of the regular and GLU variants: Figure 7: Regular feed forward module (top) and SwiGLU variant (bottom) next to each other. Note that the Swish function is implemented as “silu” in PyTorch. So, suppose we have an embedding dimension of 1024. In the regular feed forward case, this would then be fc1: 1024 × 4096 = 4,194,304 fc2: 1024 × 4096 = 4,194,304 fc1: 1024 × 1024 = 1,048,576 fc2: 1024 × 1024 = 1,048,576 fc3: 1024 × 1024 = 1,048,576 Figure 8: The feed forward module is replaced by a Mixture-of-Expert (MoE) module. So, replacing a single feed forward module with multiple feed forward modules (as done in a MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. Because only a few experts are active at a time, MoE modules are often referred to as sparse , in contrast to dense modules that always use the full parameter set. However, the large total number of parameters via an MoE increases the capacity of the LLM, which means it can take up more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. (Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.) 2.5 Grouped Query Attention Replaces Multi-Head Attention As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA). In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections. For example, as shown in Figure 9, if there are 2 key–value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency without noticeably affecting modeling performance, according to ablation studies. Figure 9: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries. So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here .) While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper ) show it performs comparably to standard MHA in terms of LLM modeling performance. 2.6 Sliding Window Attention Sliding-window attention (Figure 10 below) was first introduced in the LongFormer paper (2020) and later popularized by Mistral. Interestingly, gpt-oss applies it in every second layer. You can think of it as a variation of multi-head attention, or in this case grouped query attention (GQA), where the attention context is restricted to a smaller window, reducing both memory usage and compute costs. Figure 10: Comparison between regular attention (left) and sliding window attention (right). Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. As I discussed in my previous article , Gemma 2 (2024) used a similar 1:1 ratio. Gemma 3 earlier this year went much further and shifted to a 5:1 ratio, which means only one full-attention layer for every five sliding-window (local) attention layers. According to the Gemma ablation studies, sliding-window attention has minimal impact on modeling performance, as shown in the figure below. Note that the window size in Gemma 2 was 4096 tokens, which Gemma 3 reduced to 1024. In gpt-oss, the window is just 128 tokens, which is remarkably small. And as a fun fact, the official announcement article notes that sliding-window attention was apparently already used in GPT-3: The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3 Who knew!? I went back to the original GPT-3 paper , and it was indeed mentioned there: We use the same model and architecture as GPT-2 [ RWC+19 ], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [ CGRS19 ]. 2.7 RMSNorm Replaces LayerNorm Finally, the last small tweak, coming from GPT-2, is replacing LayerNorm (2016) by RMSNorm (2019) , which has been a common trend in recent years. Akin to swapping GELU with Swish and SwiGLU, RMSNorm is one of these smaller but sensible efficiency improvements. RMSNorm is similar to LayerNorm in its purpose to normalize layer activations, as shown in Figure 11 below. You might recall that not too long ago, BatchNorm was the go-to choice for this task. It has since fallen out of favor, largely because it is harder to parallelize efficiently (due to the mean and variance batch statistics) and performs poorly with small batch sizes. Figure 11: A comparison between LayerNorm (left) and RMSNorm (right) for a small linear layer. As we can see in Figure 11 above, both LayerNorm and RMSNorm scale the layer outputs to be in a reasonable range. LayerNorm subtracts the mean and divides by the standard deviation such that the layer outputs have a zero mean and unit variance (variance of 1 and standard deviation of one). RMSNorm divides the inputs by the root-mean-square. This scales activations to a comparable magnitude without enforcing zero mean or unit variance. In this particular example shown in Figure 11, the mean is 0.77 and the variance is 0.41. Both LayerNorm and RMSNorm stabilize activation scales and improve optimization, but RMSNorm is often preferred in large-scale LLMs because it is cheaper to compute. Unlike LayerNorm, RMSNorm has no bias (shift) term and reduces the expensive mean and variance computations to a single root-mean-square operation. This reduces the number of cross-feature reductions from two to one, which lowers communication overhead on GPUs and improving training efficiency. Figure 12 shows what this looks like in code: Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler. 2.8 The GPT-2 Legacy I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work. By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures. In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of what limitations or problems they try to solve. For instance, starting with my GPT-2 code I recently implemented the Qwen3 architecture from scratch , which is super similar to gpt-oss, which brings us to the next topic: Comparing gpt-oss to a more recent architecture. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 3. Comparing gpt-oss To A Recent Architecture (Qwen3) Now that we have walked through the evolution from GPT-2 to GPT OSS, we can take the next step and compare GPT OSS to a more recent architecture, Qwen3, which was released three months earlier in May 2025. The reason I am selecting Qwen3 here is that it is among the top open-weight models as of the time of writing. Additionally, one of the Qwen3 MoE models is more or less directly comparable to GPT OSS due to its relatively similar overall size in terms of trainable parameters. Figure 13 below compares gpt-oss-20b to a Qwen3 model of comparable size. Figure 13: A gpt-oss and Qwen3 model of comparable size side by side. As we can see, gpt-oss 20B and Qwen3 30B-A3B are very similar in their architecture components. The primary difference here, aside from the dimensions, is that gpt-oss employs sliding window attention, as discussed earlier in section 1.6 (not shown in this figure), whereas Qwen3 does not. Let's walk through the noteworthy details one by one in the following subsections. 3.1 Width Versus Depth If we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24 (Figure 14). Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b. On the other hand, gpt-oss is a much wider architecture: An embedding dimension of 2880 instead of 2048 An intermediate expert (feed forward) projection dimension of also 2880 instead of 768 Figure 15: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts, though. To be fair, the small number of experts in gpt-oss could be a side effect of the 20B size. Looking at the 120B mode below, they indeed increased the number of experts (and transformer blocks) while keeping everything else fixed, as shown in Figure 16 below. Figure 16: The two gpt-oss architectures side by side, where the larger 120B model only scales the number of transformer blocks and number of experts. The boring explanation for the fact that the 20B and 120B models are so similar is probably that the 120B model was the main focus. And the easiest way to create a smaller model was to make it a bit shorter (fewer transformer blocks) and to reduce the number of experts, because that's where most of the parameters are. However, one might speculate whether they started training the 120B model, and then chopped some of the transformer blocks and experts for continued pre-training (instead of starting from random weights). In any case, it's because it's quite unusual to only scale those two (transformer blocks and number of experts). For instance, when looking at Qwen3 MoE models of multiple sizes (Figure 17 below), they were scaled more proportionally to each other over many more aspects.. Figure 17: Architecture differences in the various Qwen3 models. 3.3 Attention Bias and Attention Sinks Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in each second layer, as mentioned earlier. However, there's one interesting detail that caught my eye. It seems that gpt-oss uses bias units for the attention weights, as shown in the figure below. Figure 18: gpt-oss models use bias units in the attention layers. See code example here . I haven't seen these bias units being used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation (k_proj). Furthermore, the empirical results show that there is little difference between with and without bias units (see Figure 19 below). Figure 19: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units. Another detail you may have noticed is the definition of in the code screenshot in Figure 18. In general models, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. I.e., if the context gets very long, this special attended token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think it was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.) In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 20). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs. Figure 20: The use of attention sinks in gpt-oss; based on the Hugging Face code here . 3.4 License Lastly, and similar to Qwen3, the gpt-oss models are Apache 2.0 open-source license, which is great (it's the same license that I prefer for my own open-source projects). This means that the models can be distilled into other models or used in commercial products without restriction. Open-weight vs. open-source LLMs. This distinction has been debated for years, but it is worth clarifying to avoid confusion about this release and its artifacts. Some model developers release only the model weights and inference code (for example, Llama, Gemma, gpt-oss), while others (for example, OLMo) release everything including training code, datasets, and weights as true open source. By that stricter definition, gpt-oss is an open-weight model (just like Qwen3) because it includes the weights and inference code but not the training code or datasets. However, the terminology is used inconsistently across the industry. I assume the "oss" in "gpt-oss" stands for open source software ; however, I am positively surprised that OpenAI itself clearly describes gpt-oss as an open-weight model in their official announcement article . 4 Other Interesting Tidbits While the previous sections described how the architecture has evolved since GPT-2 and discussed its similarities to Qwen3 (and most other recent models), there are still a few additional but noteworthy details I have not mentioned, yet. These are points that did not fit neatly into the earlier sections but are still worth mentioning. 4.1 Training Overview Unfortunately, there is not much information about the training set sizes and algorithms available. I added the most interesting puzzle pieces from the model card report (1) and announcement post (2) below: The gpt-oss models were trained using our most advanced pre-training and post-training techniques [...] (1) [...] required 2.1million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. (1) [...] including a supervised fine-tuning stage and a high-compute RL stage [...] (2) We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge. (2) So, we know that the gpt-oss models are reasoning models. The training compute of 2.1 million H100 GPU hours is roughly on par with the 2.788 million H800 GPU hours that the ~5.6x larger DeepSeek V3 model was trained for. Unfortunately, there is no information about the Qwen3 training time available yet. Interestingly, the GPT-oss training hour estimate includes both the supervised learning for instruction following and the reinforcement learning for reasoning, whereas DeepSeek V3 is just a pre-trained base model on top of which DeepSeek R1 was trained separately. 4.2 Reasoning Efforts As mentioned in the previous section, the gpt-oss models are reasoning models. However, what's particularly interesting is that they were trained so that users can easily control the degree of reasoning via inference time scaling. Concretely, gpt-oss models can receive "Reasoning effort: low/medium/high" instructions as part of their system prompt, which directly affects the response length and accuracy, as shown in Figure 21. Figure 21: Response length and quality of gpt-oss models under different reasoning efforts (annotated figure from the model card ) This level of adjustability is useful because it lets us balance cost, compute, and accuracy. For example, if the task is simple, such as answering a straightforward knowledge question or fixing a small typo, we can skip extended reasoning. This saves time and resources while avoiding unnecessarily long responses and verbose reasoning traces. It is somewhat unfortunate that OpenAI did not release the base models prior to reinforcement learning-based reasoning training, unlike Qwen3 or OLMo. Base models are particularly valuable starting points for researchers working on reasoning methods (which is one reason I currently like working with Qwen3 Base). My guess is that OpenAI's decision was driven more by industry and production use cases than by research considerations. Note that the original Qwen3 models also have a toggle for enabling/disabling thinking (reasoning) modes (via a setting in the tokenizer that simply adds <think></think> tags to disable the reasoning behavior). However, the Qwen3 team updated their models in the last few weeks and moved away from the hybrid model towards dedicated Instruct/Thinking/Coder variants. The reason was that the hybrid mode resulted in lower performance compared to the individual models: After discussing with the community and reflecting on the matter, we have decided to abandon the hybrid thinking mode. We will now train the Instruct and Thinking models separately to achieve the best possible quality. Source 4.3 MXFP4 Optimization: A Small But Important Detail One interesting surprise is that OpenAI released the gpt-oss models with an MXFP4 quantization scheme for the MoE experts. Quantization formats used to be a niche topic, mostly relevant to mobile or embedded AI, but that's changed with the push toward bigger models. In this case, the MXFP4 optimization allows the model to run on single GPU devices. Here’s what that looks like in practice: The large model (think 120B) fits on a single 80GB H100 or newer GPU. Not consumer hardware, but hey, it's much cheaper to rent a 1-H100 machine than a multi-H100 machine. Plus, we don't have to worry about distributing the model across GPUs and adding communication overhead. It's really nice that AMD MI300X cards are supported from day 1 as well! The smaller 20B model even fits into 16 GB of VRAM; the caveat is that it has to be a RTX 50-series GPU or newer to support MXFP4. (Edit: support for older cards, such as RTX 4090, was recently added via a patch .) Figure 22: Current view of the LM Arena Leaderboard (as of 8 Aug 2025) Looking at a reasoning benchmarks provide in the gpt-oss announcement post, we can see that the gpt-ossmodels are on par with OpenAI's proprietary models as well as Qwen3 (Figure 23). Figure 23: The main benchmark charts are from the official gpt-oss announcement post . The "no tools" gpt-oss-120b data is taken from the official model card paper , and the Qwen3 numbers are taken from the official Qwen3 repository . However, this should be caveated by the fact that gpt-oss-120b is almost half the size of the Qwen3 A235B-A22B-Thinking-2507 model and can run on a single GPU. Benchmark performance, however, does not always reflect real-world usability. In my limited use over the past few days, I have found gpt-oss to be quite capable. That said, as others have observed, it does seem to have a relatively high tendency to hallucinate (a point also mentioned in its model card). This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some "general knowledge forgetting." Still, because gpt-oss was designed with tool use in mind, this limitation may become less relevant over time. Tool integration in open-source LLMs is still in its early stages, but as it matures, I expect that we increasingly let models consult external sources (like search engines) when answering factual or knowledge-based queries. If that happens, it could be sensible to prioritize reasoning capacity over memorization. This is much like in human learning in school (or in life in general), where problem-solving skills often matter more than memorizing facts. 5 gpt-oss and GPT-5 OpenAI had a busy week and released the long-awaited GPT-5 model shortly after gpt-oss. The GPT-5 release was interesting. And if there's one thing I have to say here, it's that I am really surprised by how good their open-source models really are compared to their best product offering in terms of benchmark performance (Figure 24). Figure 24: The main benchmark charts are from the official GPT-5 announcement post . The gpt-oss data is taken from the official model card paper and announcement post , and the Qwen3 numbers are taken from the official Qwen3-Coder repository . All in all, even though some people called the release overhyped, I am glad that we have a new set of really strong open weight models that are not too far behind the best proprietary ones. Of course, benchmarks often do not accurately reflect real-world use, and it is still too early to tell based on the limited usage. But I think these are good times for people who like to work with open-weight and local (or privately hosted) models. This magazine is a personal passion project, and your support helps keep it alive. If you would like to contribute, there are a few great ways: Grab a copy of my book . Build a Large Language Model (From Scratch) walks you through building an LLM step by step, from tokenizer to training. Check out the video course . There’s now a 17-hour video course based on the book, available from Manning. It follows the book closely, section by section, and works well both as a standalone or as a code-along resource. The video course is ad-free (unlike the YouTube version) and has a cleaner, more structured format. It also contains 5 additional hours of pre-requisite video material created by Abhinav Kimothi. Subscribe . A paid subscription helps to make my writing sustainable and gives you access to additional contents.

0 views
Simon Willison 3 months ago

GPT-5: Key characteristics, pricing and model card

I've had preview access to the new GPT-5 model family for the past two weeks (see related video and my disclosures ) and have been using GPT-5 as my daily-driver. It's my new favorite model. It's still an LLM - it's not a dramatic departure from what we've had before - but it rarely screws up and generally feels competent or occasionally impressive at the kinds of things I like to use models for. I've collected a lot of notes over the past two weeks, so I've decided to break them up into a series of posts . This first one will cover key characteristics of the models, how they are priced and what we can learn from the GPT-5 system card . Let's start with the fundamentals. GPT-5 in ChatGPT is a weird hybrid that switches between different models. Here's what the system card says about that (my highlights in bold): GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt). [...] Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model. GPT-5 in the API is simpler: it's available as three models - regular , mini and nano - which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high.

0 views
Sean Goedecke 4 months ago

Can small AI models think as well as large ones?

An AI trend that’s emerged in the last few months 1 is the idea of a “cognitive core”. Instead of trying to build the largest, most capable model we can, should we be trying to build a small model. In general, big models - models with higher parameter counts - are better models. Claude Opus 4 is better at everything than Claude Sonnet 4, and so on

0 views
Sean Goedecke 4 months ago

Practical notes on getting LLMs to generate new ideas

Large language models struggle to generate new ideas. To AI skeptics, this seems trivially true, since they believe LLMs can only regurgitate content from their training data 1 . To AI believers, this is a puzzle. If a human had the breadth of knowledge of a LLM, wouldn’t they be able to synthesize it and come up with ideas nobody else has had

0 views
Ahead of AI 5 months ago

LLM Research Papers: The 2025 List (January to June)

As some of you know, I keep a running list of research papers I (want to) read and reference. About six months ago, I shared my 2024 list , which many readers found useful. So, I was thinking about doing this again. However, this time, I am incorporating that one piece of feedback kept coming up: "Can you organize the papers by topic instead of date?" The categories I came up with are: Reasoning Models - 1a. Training Reasoning Models - 1b. Inference-Time Reasoning Strategies - 1c. Evaluating LLMs and/or Understanding Reasoning Other Reinforcement Learning Methods for LLMs Other Inference-Time Scaling Methods Efficient Training & Architectures Diffusion-Based Language Models Multimodal & Vision-Language Models Data & Pre-training Datasets Also, as LLM research continues to be shared at a rapid pace, I have decided to break the list into bi-yearly updates. This way, the list stays digestible, timely, and hopefully useful for anyone looking for solid summer reading material. Please note that this is just a curated list for now. In future articles, I plan to revisit and discuss some of the more interesting or impactful papers in larger topic-specific write-ups. Stay tuned! It's summer! And that means internship season, tech interviews, and lots of learning. To support those brushing up on intermediate to advanced machine learning and AI topics, I have made all 30 chapters of my Machine Learning Q and AI book freely available for the summer: 🔗 https://sebastianraschka.com/books/ml-q-and-ai/#table-of-contents Whether you are just curious and want to learn something new or prepping for interviews, hopefully this comes in handy. Happy reading, and best of luck if you are interviewing! This year, my list is very reasoning model-heavy. So, I decided to subdivide it into 3 categories: Training, inference-time scaling, and more general understanding/evaluation. This subsection focuses on training strategies specifically designed to improve reasoning abilities in LLMs. As you may see, much of the recent progress has centered around reinforcement learning (with verifiable rewards), which I covered in more detail in a previous article. Annotated figure from Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 8 Jan, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682 13 Jan, The Lessons of Developing Process Reward Models in Mathematical Reasoning, https://arxiv.org/abs/2501.07301 16 Jan, Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, https://arxiv.org/abs/2501.09686 20 Jan, Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 22 Jan, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs//2501.12599 22 Jan, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 3 Feb, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 5 Feb, Demystifying Long Chain-of-Thought Reasoning in LLMs, Demystifying Long Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2502.03373 5 Feb, LIMO: Less is More for Reasoning, https://arxiv.org/abs/2502.03387 5 Feb, Teaching Language Models to Critique via Reinforcement Learning, https://arxiv.org/abs/2502.03492 6 Feb, Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 10 Feb, Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, https://arxiv.org/abs/2502.06781 10 Feb, On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, https://arxiv.org/abs/2502.06773 11 Feb, LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!, https://arxiv.org/abs/2502.07374 12 Feb, Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance, https://arxiv.org/abs/2502.08127 13 Feb, Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging - An Open Recipe, https://arxiv.org/abs/2502.09056 20 Feb, Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, https://arxiv.org/abs/2502.14768 25 Feb, SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, https://arxiv.org/abs/2502.18449 4 Mar, Learning from Failures in Multi-Attempt Reinforcement Learning, https://arxiv.org/abs/2503.04808 4 Mar, The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, https://arxiv.org/abs/2503.02875 10 Mar, R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 10 Mar, LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL, https://arxiv.org/abs/2503.07536 12 Mar, Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, https://arxiv.org/abs/2503.09516 16 Mar, Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models, https://arxiv.org/abs/2503.13551 20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, https://arxiv.org/abs/2503.16219 25 Mar, ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.19470 26 Mar, Understanding R1-Zero-Like Training: A Critical Perspective, https://arxiv.org/abs/2503.20783 30 Mar, RARE: Retrieval-Augmented Reasoning Modeling, https://arxiv.org/abs/2503.23513 31 Mar, Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290 31 Mar, JudgeLRM: Large Reasoning Models as a Judge, https://arxiv.org/abs/2504.00050 7 Apr, Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 10 Apr, VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning, https://arxiv.org/abs/2504.08837 11 Apr, Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning, https://arxiv.org/abs/2504.08672 13 Apr, Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability, https://arxiv.org/abs/2504.09639 21 Apr, Learning to Reason under Off-Policy Guidance, https://arxiv.org/abs/2504.14945 22 Apr, Tina: Tiny Reasoning Models via LoRA, https://arxiv.org/abs/2504.15777 29 Apr, Reinforcement Learning for Reasoning in Large Language Models with One Training Example, https://arxiv.org/abs/2504.20571 30 Apr, Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math, https://arxiv.org/abs/2504.21233 2 May, Llama-Nemotron: Efficient Reasoning Models, https://arxiv.org/abs/2505.00949 5 May, RM-R1: Reward Modeling as Reasoning, https://arxiv.org/abs/2505.02387 6 May, Absolute Zero: Reinforced Self-play Reasoning with Zero Data, https://arxiv.org/abs/2505.03335 12 May, INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning, https://arxiv.org/abs/2505.07291 12 May, MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining, https://arxiv.org/abs/2505.07608 14 May, Qwen3 Technical Report, https://arxiv.org/abs/2505.09388 15 May, Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models, https://arxiv.org/abs/2505.10554 19 May, AdaptThink: Reasoning Models Can Learn When to Think, https://arxiv.org/abs/2505.13417 19 May, Thinkless: LLM Learns When to Think, https://arxiv.org/abs/2505.13379 20 May, General-Reasoner: Advancing LLM Reasoning Across All Domains, https://arxiv.org/abs/2505.14652 21 May, Learning to Reason via Mixture-of-Thought for Logical Reasoning, https://arxiv.org/abs/2505.15817 21 May, RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning, https://arxiv.org/abs/2505.15034 23 May, QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning, https://www.arxiv.org/abs/2505.17667 26 May, Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles, https://arxiv.org/abs/2505.19914 26 May, Learning to Reason without External Rewards, https://arxiv.org/abs/2505.19590 29 May, Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents, https://arxiv.org/abs/2505.22954 30 May, Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning, https://arxiv.org/abs/2505.24726 30 May, ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models, https://arxiv.org/abs/2505.24864 2 Jun, Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, https://arxiv.org/abs/2506.01939 3 Jun, Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening, https://www.arxiv.org/abs/2506.02355 9 Jun, Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 10 Jun, RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling, https://arxiv.org/abs/2506.08672 10 Jun, Reinforcement Learning Teachers of Test Time Scaling, https://www.arxiv.org/abs/2506.08388 12 Jun, Magistral, https://arxiv.org/abs/2506.10910 12 Jun, Spurious Rewards: Rethinking Training Signals in RLVR, https://arxiv.org/abs/2506.10947 16 Jun, AlphaEvolve: A coding agent for scientific and algorithmic discovery, https://arxiv.org/abs/2506.13131 17 Jun, Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs, https://arxiv.org/abs/2506.14245 23 Jun, Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training, https://arxiv.org/abs/2506.18777 26 Jun, Bridging Offline and Online Reinforcement Learning for LLMs, https://arxiv.org/abs/2506.21495 This part of the list covers methods that improve reasoning dynamically at test time, without requiring retraining. Often, these papers are focused on trading of computational performance for modeling performance. Reasoning Models - 1a. Training Reasoning Models - 1b. Inference-Time Reasoning Strategies - 1c. Evaluating LLMs and/or Understanding Reasoning Other Reinforcement Learning Methods for LLMs Other Inference-Time Scaling Methods Efficient Training & Architectures Diffusion-Based Language Models Multimodal & Vision-Language Models Data & Pre-training Datasets Annotated figure from Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 8 Jan, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682 13 Jan, The Lessons of Developing Process Reward Models in Mathematical Reasoning, https://arxiv.org/abs/2501.07301 16 Jan, Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, https://arxiv.org/abs/2501.09686 20 Jan, Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 22 Jan, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs//2501.12599 22 Jan, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 3 Feb, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 5 Feb, Demystifying Long Chain-of-Thought Reasoning in LLMs, Demystifying Long Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2502.03373 5 Feb, LIMO: Less is More for Reasoning, https://arxiv.org/abs/2502.03387 5 Feb, Teaching Language Models to Critique via Reinforcement Learning, https://arxiv.org/abs/2502.03492 6 Feb, Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 10 Feb, Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, https://arxiv.org/abs/2502.06781 10 Feb, On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, https://arxiv.org/abs/2502.06773 11 Feb, LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!, https://arxiv.org/abs/2502.07374 12 Feb, Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance, https://arxiv.org/abs/2502.08127 13 Feb, Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging - An Open Recipe, https://arxiv.org/abs/2502.09056 20 Feb, Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, https://arxiv.org/abs/2502.14768 25 Feb, SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, https://arxiv.org/abs/2502.18449 4 Mar, Learning from Failures in Multi-Attempt Reinforcement Learning, https://arxiv.org/abs/2503.04808 4 Mar, The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, https://arxiv.org/abs/2503.02875 10 Mar, R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 10 Mar, LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL, https://arxiv.org/abs/2503.07536 12 Mar, Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, https://arxiv.org/abs/2503.09516 16 Mar, Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models, https://arxiv.org/abs/2503.13551 20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, https://arxiv.org/abs/2503.16219 25 Mar, ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.19470 26 Mar, Understanding R1-Zero-Like Training: A Critical Perspective, https://arxiv.org/abs/2503.20783 30 Mar, RARE: Retrieval-Augmented Reasoning Modeling, https://arxiv.org/abs/2503.23513 31 Mar, Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290 31 Mar, JudgeLRM: Large Reasoning Models as a Judge, https://arxiv.org/abs/2504.00050 7 Apr, Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 10 Apr, VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning, https://arxiv.org/abs/2504.08837 11 Apr, Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning, https://arxiv.org/abs/2504.08672 13 Apr, Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability, https://arxiv.org/abs/2504.09639 21 Apr, Learning to Reason under Off-Policy Guidance, https://arxiv.org/abs/2504.14945 22 Apr, Tina: Tiny Reasoning Models via LoRA, https://arxiv.org/abs/2504.15777 29 Apr, Reinforcement Learning for Reasoning in Large Language Models with One Training Example, https://arxiv.org/abs/2504.20571 30 Apr, Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math, https://arxiv.org/abs/2504.21233 2 May, Llama-Nemotron: Efficient Reasoning Models, https://arxiv.org/abs/2505.00949 5 May, RM-R1: Reward Modeling as Reasoning, https://arxiv.org/abs/2505.02387 6 May, Absolute Zero: Reinforced Self-play Reasoning with Zero Data, https://arxiv.org/abs/2505.03335 12 May, INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning, https://arxiv.org/abs/2505.07291 12 May, MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining, https://arxiv.org/abs/2505.07608 14 May, Qwen3 Technical Report, https://arxiv.org/abs/2505.09388 15 May, Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models, https://arxiv.org/abs/2505.10554 19 May, AdaptThink: Reasoning Models Can Learn When to Think, https://arxiv.org/abs/2505.13417 19 May, Thinkless: LLM Learns When to Think, https://arxiv.org/abs/2505.13379 20 May, General-Reasoner: Advancing LLM Reasoning Across All Domains, https://arxiv.org/abs/2505.14652 21 May, Learning to Reason via Mixture-of-Thought for Logical Reasoning, https://arxiv.org/abs/2505.15817 21 May, RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning, https://arxiv.org/abs/2505.15034 23 May, QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning, https://www.arxiv.org/abs/2505.17667 26 May, Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles, https://arxiv.org/abs/2505.19914 26 May, Learning to Reason without External Rewards, https://arxiv.org/abs/2505.19590 29 May, Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents, https://arxiv.org/abs/2505.22954 30 May, Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning, https://arxiv.org/abs/2505.24726 30 May, ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models, https://arxiv.org/abs/2505.24864 2 Jun, Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, https://arxiv.org/abs/2506.01939 3 Jun, Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening, https://www.arxiv.org/abs/2506.02355 9 Jun, Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 10 Jun, RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling, https://arxiv.org/abs/2506.08672 10 Jun, Reinforcement Learning Teachers of Test Time Scaling, https://www.arxiv.org/abs/2506.08388 12 Jun, Magistral, https://arxiv.org/abs/2506.10910 12 Jun, Spurious Rewards: Rethinking Training Signals in RLVR, https://arxiv.org/abs/2506.10947 16 Jun, AlphaEvolve: A coding agent for scientific and algorithmic discovery, https://arxiv.org/abs/2506.13131 17 Jun, Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs, https://arxiv.org/abs/2506.14245 23 Jun, Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training, https://arxiv.org/abs/2506.18777 26 Jun, Bridging Offline and Online Reinforcement Learning for LLMs, https://arxiv.org/abs/2506.21495

0 views
Max Woolf 5 months ago

Predicting Average IMDb Movie Ratings Using Text Embeddings of Movie Metadata

Months ago, I saw a post titled “ Rejected from DS Role with no feedback ” on Reddit’s Data Science subreddit , in which a prospective job candidate for a data science position provided a Colab Notebook documenting their submission for a take-home assignment and asking for feedback as to why they were rejected. Per the Reddit user, the assignment was: Use the publicly available IMDB Datasets to build a model that predicts a movie’s average rating. Please document your approach and present your results in the notebook. Make sure your code is well-organized so that we can follow your modeling process. IMDb , the Internet Movie Database owned by Amazon, allows users to rate movies on a scale from 1 to 10, wherein the average rating is then displayed prominently on the movie’s page: The Shawshank Redemption is currently the highest-rated movie on IMDb with an average rating of 9.3 derived from 3.1 million user votes. In their notebook, the Redditor identifies a few intuitive features for such a model, including the year in which the movie was released, the genre(s) of the movies, and the actors/directors of the movie. However, the model they built is a TensorFlow and Keras -based neural network, with all the bells-and-whistles such as batch normalization and dropout . The immediate response by other data scientists on /r/datascience was, at its most polite, “why did you use a neural network when it’s a black box that you can’t explain?” Reading those replies made me nostalgic. Way back in 2017, before my first job as a data scientist, neural networks using frameworks such as TensorFlow and Keras were all the rage for their ability to “ solve any problem ” but were often seen as lazy and unskilled compared to traditional statistical modeling such as ordinary least squares linear regression or even gradient boosted trees. Although it’s funny to see that perception against neural networks in the data science community hasn’t changed since, nowadays the black box nature of neural networks can be an acceptable business tradeoff if the prediction results are higher quality and interpretability is not required. Looking back at the assignment description, the objective is only “predict a movie’s average rating.” For data science interview take-homes, this is unusual: those assignments typically have an extra instruction along the lines of “explain your model and what decisions stakeholders should make as a result of it”, which is a strong hint that you need to use an explainable model like linear regression to obtain feature coefficients, or even a middle-ground like gradient boosted trees and its variable importance to quantify relative feature contribution to the model. 1 In absence of that particular constraint, it’s arguable that anything goes, including neural networks. The quality of neural networks have improved significantly since 2017, even moreso due to the massive rise of LLMs. Why not try just feeding a LLM all raw metadata for a movie and encode it into a text embedding and build a statistical model based off of that? Would a neural network do better than a traditional statistical model in that instance? Let’s find out! The IMDb Non-Commercial Datasets are famous sets of data that have been around for nearly a decade 2 but are still updated daily. Back in 2018 as a budding data scientist, I performed a fun exporatory data analysis using these datasets, although the results aren’t too surprising. The average rating for a movie is around 6 and tends to skew higher: a common trend in internet rating systems. But in truth, these datasets are a terrible idea for companies to use for a take-home assignment. Although the datasets are released under a non-commercial license, IMDb doesn’t want to give too much information to their competitors, which results in a severely limited amount of features that could be used to build a good predictive model. Here are the common movie-performance-related features present in the file: This is a sensible schema for describing a movie, although it lacks some important information that would be very useful to determine movie quality such as production company, summary blurbs, granular genres/tags, and plot/setting — all of which are available on the IMDb movie page itself and presumably accessible through the paid API . Of note, since the assignment explicitly asks for a movie ’s average rating, we need to filter the data to only and entries, which the original assignment failed to do. The ratings data in is what you’d expect: In order to ensure that the average ratings for modeling are indeed stable and indicative of user sentiment, I will only analyze movies that have atleast 30 user votes : as of May 10th 2025, that’s about 242k movies total. Additionally, I will not use as a model feature, since that’s a metric based more on extrinsic movie popularity rather than the movie itself. The last major dataset is , which has very helpful information on metadata such as the roles people play in the production of a movie: Additionally, because the datasets are so popular, it’s not the first time someone has built a IMDb ratings predictor and it’s easy to Google. Instead of using the official IMDb datasets, these analyses are based on the smaller IMDB 5000 Movie Dataset hosted on Kaggle, which adds metadata such as movie rating, budget, and further actor metadata that make building a model much easier (albeit “number of likes on the lead actor’s Facebook page” is very extrinsic to movie quality). Using the official datasets with much less metadata is building the models on hard mode and will likely have lower predictive performance. Although IMDb data is very popular and very well documented, that doesn’t mean it’s easy to work with. Data science take-home assignments are typically 1/2 exploratory data analysis for identifying impactful dataset features, and 1/2 building, iterating, and explaining the model. For real-world datasets, these are all very difficult problems with many difficult solutions, and the goal from the employer’s perspective is seeing more how these problems are solved rather than the actual quantitative results. The initial Reddit post decided to engineer some expected features using pandas , such as by checking whether a non- number is present at the end of a movie title and one-hot encoding each distinct of a movie. These are fine for an initial approach, albeit sequel titles can be idiosyncratic and it suggests that a more NLP approach to identifying sequels and other related media may be useful. The main trick with this assignment is how to handle the principals. The common data science approach would be to use a sparse binary encoding of the actors/directors/writers, e.g. using a vector where actors present in the movie are and every other actor is , which leads to a large number of potential approaches to encode this data performantly, such as scikit-learn’s MultiLabelBinarizer . The problem with this approach is that there are a very large number of unique actors / high cardinality — more unique actors than data points themselves — which leads to curse of dimensionality issues and workarounds such as encoding only the top N actors will lead to the feature being uninformative since even a generous N will fail to capture the majority of actors. There are actually 624k unique actors in this dataset ( Jupyter Notebook ), the chart just becomes hard to read at that point. Additionally, most statistical modeling approaches cannot account for the of actors as they treat each feature as independent, and since the billing order of actors is generally correlated to their importance in the movie, that’s an omission of relevant information to the problem. These constraints gave me an idea: why not use an LLM to encode all movie data, and build a model using the downstream embedding representation? LLMs have attention mechanisms , which will not only respect the relative ordering of actors (to give higher predictive priority to higher-billed actors, along with actor cooccurrences), but also identify patterns within movie name texts (to identify sequels and related media semantically). I started by aggregating and denormalizing all the data locally ( Jupyter Notebook ). Each of the IMDb datasets are hundreds of megabytes and hundreds of thousands of rows at minimum: not quite big data , but enough to be more cognizant of tooling especially since computationally-intensive JOINs are required. Therefore, I used the Polars library in Python, which not only loads data super fast, but is also one of the fastest libraries at performing JOINs and other aggregation tasks. Polars’s syntax also allows for some cool tricks: for example, I want to spread out and aggregate the principals (4.1 million rows after prefiltering) for each movie into directors, writers, producers, actors, and all other principals into nested lists while simultaneously having them sorted by as noted above. This is much easier to do in Polars than any other data processing library I’ve used, and on millions of rows, this takes less than a second : After some cleanup and field renaming, here’s an example JSON document for Star Wars: Episode IV - A New Hope : I was tempted to claim that I used zero feature engineering, but that wouldn’t be accurate. The selection and ordering of the JSON fields here is itself feature engineering: for example, and are intentionally last in this JSON encoding because they can have wildly varying lengths while the prior fields are more consistent, which should make downstream encodings more comparable and consistent. Now, let’s discuss how to convert these JSON representations of movies into embeddings. LLMs that are trained to output text embeddings are not much different from LLMs like ChatGPT that just predict the next token in a loop. Models such as BERT and GPT can generate “embeddings” out-of-the-box by skipping the prediction heads of the models and instead taking an encoded value from the last hidden state of the model (e.g. for BERT, the first positional vector of the hidden state representing the token). However, text embedding models are more optimized for distinctiveness of a given input text document using contrastive learning . These embeddings can be used for many things, from finding similar encoded inputs by identifying the similarity between embeddings, and of course, by building a statistical model on top of them. Text embeddings that leverage LLMs are typically generated using a GPU in batches due to the increased amount of computation needed. Python libraries such as Hugging Face transformers and sentence-transformers can load these embeddings models. For this experiment, I used the very new Alibaba-NLP/gte-modernbert-base text embedding model that is finetuned from the ModernBERT model specifically for the embedding use case for two reasons: it uses the ModernBERT architecture which is optimized for fast inference , and the base ModernBERT model is trained to be more code-aware and should be able understand JSON-nested input strings more robustly — that’s also why I intentionally left in the indentation for nested JSON arrays as it’s semantically meaningful and explicitly tokenized . 3 The code ( Jupyter Notebook ) — with extra considerations to avoid running out of memory on either the CPU or GPU 4 — looks something like this: I used a Spot L4 GPU on Google Cloud Platform at a pricing of $0.28/hour, and it took 21 minutes to encode all 242k movie embeddings: about $0.10 total, which is surprisingly efficient. Each of these embeddings is a set of 768 numbers (768D). If the embeddings are unit normalized (the step), then calculating the dot product between embeddings will return the cosine similarity of those movies, which can then be used to identify the most similar movies. But “similar” is open-ended, as there are many dimensions how a movie could be considered similar. Let’s try a few movie similarity test cases where I calculate the cosine similarity between one query movie and all movies, then sort by cosine similarity to find the most similar ( Jupyter Notebook ). How about Peter Jackson’s Lord of the Rings: The Fellowship of the Ring ? Ideally, not only would it surface the two other movies of the original trilogy, but also its prequel Hobbit trilogy. Indeed, it worked and surfaced both trilogies! The other movies listed are about the original work, so having high similarity would be fair. Compare these results to the “ More like this ” section on the IMDb page for the movie itself, which has the two sequels to the original Lord of the Rings and two other suggestions that I am not entirely sure are actually related. What about more elaborate franchises, such as the Marvel Cinematic Universe ? If you asked for movies similar to Avengers: Endgame , would other MCU films be the most similar? The answer is yes, which isn’t a surprise since those movies share many principals. Although, there are instances of other movies named “Endgame” and “The Avengers” which are completely unrelated to Marvel and therefore implies that the similarities may be fixated on the names. What about movies of a smaller franchise but a specific domain, such as Disney’s Frozen that only has one sequel? Would it surface other 3D animated movies by Walt Disney Animation Studios , or something else? …okay, it’s definitely fixating on the name. Let’s try a different approach to see if we can find more meaningful patterns in these embeddings. In order to visualize the embeddings, we can project them to a lower dimensionality with a dimensionality reduction algorithm such as PCA or UMAP : UMAP is preferred as it can simultaneously reorganize the data into more meaningful clusters. UMAP’s construction of a neighborhood graph , in theory, can allow the reduction to refine the similarities by leveraging many possible connections and hopefully avoid fixating on the movie name. However, with this amount of input data and the relatively high initial 768D vector size, the computation cost of UMAP is a concern as both factors each cause the UMAP training time to scale exponentially. Fortunately, NVIDIA’s cuML library recently updated and now you can run UMAP with very high amounts of data on a GPU at a very high number of epochs to ensure the reduction fully converges, so I did just that ( Jupyter Notebook ). What patterns can we find? Let’s try plotting the reduced points, colored by their user rating. So there’s a few things going on here. Indeed, most of the points are high-rating green as evident in the source data. But the points and ratings aren’t random and there are trends. In the center giga cluster, there are soft subclusters of movies at high ratings and low ratings. Smaller discrete clusters did indeed form, but what is the deal with that extremely isolated cluster at the top? After investigation, that cluster only has movies released in 2008, which is another feature I should have considered when defining movie similarity. As a sanity check, I faceted out the points by movie release year to better visualize where these clusters are forming: This shows that even the clusters movies have their values spread, but I unintentionally visualized how embedding drift changes over time. 2024 is also a bizarrely-clustered year: I have no idea why those two years specifically are weird in movies. The UMAP approach is more for fun, since it’s better for the downstream model building to use the raw 768D vector and have it learn the features from that. At the least, there’s some semantic signal preserved in these embeddings, which makes me optimistic that these embeddings alone can be used to train a viable movie rating predictor. So, we now have hundreds of thousands of 768D embeddings. How do we get them to predict movie ratings? What many don’t know is that all methods of traditional statistical modeling also work with embeddings — assumptions such as feature independence are invalid so the results aren’t explainable, but you can still get a valid predictive model. First, we will shuffle and split the data set into a training set and a test set: for the test set, I chose 20,000 movies (roughly 10% of the data) which is more than enough for stable results. To decide the best model, we will be using the model that minimizes the mean squared error (MSE) of the test set, which is a standard approach to solving regression problems that predict a single numeric value. Here are three approaches for using LLMs for solving non-next-token-prediction tasks. You can still fit a linear regression on top of the embeddings even if feature coefficients are completely useless and it serves as a decent baseline ( Jupyter Notebook ). The absolute laziest “model” where we just use the mean of the training set for every prediction results in a test MSE of 1.637 , but performing a simple linear regression on top of the 768D instead results in a more reasonable test MSE of 1.187 . We should be able to beat that handily with a more advanced model. Data scientists familiar with scikit-learn know there’s a rabbit hole of model options, but most of them are CPU-bound and single-threaded and would take considerable amount of time on a dataset of this size. That’s where cuML—the same library I used to create the UMAP projection—comes in, as cuML has GPU-native implementations of most popular scikit-learn models with a similar API. This notably includes support vector machines , which play especially nice with embeddings. And because we have the extra compute, we can also perform a brute force hyperparameter grid search to find the best parameters for fitting each model. Here’s the results of MSE on the test dataset for a few of these new model types, with the hyperparameter combination for each model type that best minimizes MSE: The winner is the Support Vector Machine, with a test MSE of 1.087 ! This is a good start for a simple approach that handily beats the linear regression baseline, and it also beats the model training from the Redditor’s original notebook which had a test MSE of 1.096 5 . In all cases, the train set MSE was close to the test set MSE, which means the models did not overfit either. Since we’re already dealing with AI models and already have PyTorch installed to generate the embeddings, we might as well try the traditional approach of training a multilayer perceptron (MLP) neural network on top of the embeddings ( Jupyter Notebook ). This workflow sounds much more complicated than just fitting a traditional model above, but PyTorch makes MLP construction straightforward, and Hugging Face’s Trainer class incorporates best model training practices by default, although its function has to be tweaked to minimize MSE specifically. The PyTorch model, using a loop to set up the MLP blocks, looks something like this: This MLP is 529k parameters total: large for a MLP, but given the 222k row input dataset, it’s not egregiously so. The real difficulty with this MLP approach is that it’s too effective : even with less than 1 million parameters, the model will extremely overfit and converge to 0.00 train MSE quickly, while the test set MSE explodes. That’s why is set to the atypically high probability of . Fortunately, MLPs are fast to train: training for 600 epochs (total passes through the full training dataset) took about 17 minutes on the GPU. Here’s the training results: The lowest logged test MSE was 1.074 : a slight improvement over the Support Vector Machine approach. There is a possibility that using a pretrained embedding model that was trained on the entire internet could intrinsically contain relevant signal about popular movies—such as movies winning awards which would imply a high IMDb rating—and that knowledge could leak into the test set and provide misleading results. This may not be a significant issue in practice since it’s such a small part of the model which is too small to memorize exact information. For the sake of comparison, let’s try training a LLM from scratch on top of the raw movie JSON representations to process this data to see if we can get better results without the possibility of leakage ( Jupyter Notebook ). I was specifically avoiding this approach because the compute required to train an LLM is much, much higher than a SVM or MLP model and generally leveraging a pretrained model gives better results. In this case, since we don’t need a LLM that has all the knowledge of human existence, we can train a much smaller model that only knows how to work with the movie JSON representations and can figure out relationships between actors and whether titles are sequels itself. Hugging Face transformers makes this workflow surprisingly straightforward by not only having functionality to train your own custom tokenizer (in this case, from 50k vocab to 5k vocab) that encodes the data more efficiently, but also allowing the construction a ModernBERT model with any number of layers and units. I opted for a 5M parameter LLM (SLM?), albeit with less dropout since high dropout causes learning issues for LLMs specifically. The actual PyTorch model code is surprisingly more concise than the MLP approach: Essentially, the model trains its own “text embedding,” although in this case instead of an embedding optimized for textual similarity, the embedding is just a representation that can easily be translated into a numeric rating. Because the computation needed for training a LLM from scratch is much higher, I only trained the model for 10 epochs, which was still twice as slow than the 600 epochs for the MLP approach. Given that, the results are surprising: The LLM approach did much better than my previous attempts with a new lowest test MSE of 1.026 , with only 4 passes through the data! And then it definitely overfit. I tried other smaller configurations for the LLM to avoid the overfitting, but none of them ever hit a test MSE that low. Let’s look at the model comparison again, this time adding the results from training a MLP and training a LLM from scratch: Coming into this post, I’m genuinely thought that training the MLP on top of embeddings would have been the winner given the base embedding model’s knowledge of everything, but maybe there’s something to just YOLOing and feeding raw JSON input data to a completely new LLM. More research and development is needed. The differences in model performance from these varying approaches aren’t dramatic, but some iteration is indeed interesting and it was a long shot anyways given the scarce amount of metadata. The fact that building a model off of text embeddings only didn’t result in a perfect model doesn’t mean this approach was a waste of time. The embedding and modeling pipelines I have constructed in the process of trying to solve this problem have already provided significant dividends on easier problems, such as identifying the efficiency of storing embeddings in Parquet and manipulating them with Polars . It’s impossible and pointless to pinpoint the exact reason the original Reddit poster got rejected: it could have been the neural network approach or even something out of their control such as the original company actually stopping hiring and being too disorganized to tell the candidate. To be clear, if I myself were to apply for a data science role, I wouldn’t use the techniques in this blog post (that UMAP data visualization would get me instantly rejected!) and do more traditional EDA and non-neural-network modeling to showcase my data science knowledge to the hiring manager. But for my professional work, I will definitely try starting any modeling exploration with an embeddings-based approach wherever possible: at the absolute worst, it’s a very strong baseline that will be hard to beat. All of the Jupyter Notebooks and data visualization code for this blog post is available open-source in this GitHub repository . I am not a fan of using GBT variable importance as a decision-making metric: variable importance does not tell you magnitude or direction of the feature in the real world, but it does help identify which features can be pruned for model development iteration.  ↩︎ To get a sense on how old they are, they are only available as TSV files , which is a data format so old and prone to errors that many data libraries have dropped explicit support for it. Amazon, please release the datasets as CSV or Parquet files instead!  ↩︎ Two other useful features of but not strictly relevant to these movie embeddings are a) its a cased model so it can identify meaning from upper-case text and b) it does not require a prefix such as and as nomic-embed-text-v1.5 does to guide its results, which is an annoying requirement for those models.  ↩︎ The trick here is the function for the computed embeddings, otherwise the GPU doesn’t free up the memory once moved back to the CPU. I may or may not have discovered that the hard way.  ↩︎ As noted earlier, minimizing MSE isn’t a competition, but the comparison on roughly the same dataset is good for a sanity check.  ↩︎ tconst : unique identifier of the title titleType : the type/format of the title (e.g. movie, tvmovie, short, tvseries, etc) primaryTitle : the more popular title / the title used by the filmmakers on promotional materials at the point of release isAdult : 0: non-adult title; 1: adult title startYear : represents the release year of a title. runtimeMinutes : primary runtime of the title, in minutes genres : includes up to three genres associated with the title tconst : unique identifier of the title (which can therefore be mapped to movie metadata using a JOIN) averageRating : average of all the individual user ratings numVotes : number of votes the title has received tconst : unique identifier of the title (which can be mapped to movie data using a JOIN) nconst : unique identifier of the principal (this is mapped to to get the principal’s , but nothing else useful) category : the role the principal served in the title, such as , , , , etc. ordering : the ordering of the principals within the title, which correlates to the order the principals appear on IMDb’s movie cast pages. I am not a fan of using GBT variable importance as a decision-making metric: variable importance does not tell you magnitude or direction of the feature in the real world, but it does help identify which features can be pruned for model development iteration.  ↩︎ To get a sense on how old they are, they are only available as TSV files , which is a data format so old and prone to errors that many data libraries have dropped explicit support for it. Amazon, please release the datasets as CSV or Parquet files instead!  ↩︎ Two other useful features of but not strictly relevant to these movie embeddings are a) its a cased model so it can identify meaning from upper-case text and b) it does not require a prefix such as and as nomic-embed-text-v1.5 does to guide its results, which is an annoying requirement for those models.  ↩︎ The trick here is the function for the computed embeddings, otherwise the GPU doesn’t free up the memory once moved back to the CPU. I may or may not have discovered that the hard way.  ↩︎ As noted earlier, minimizing MSE isn’t a competition, but the comparison on roughly the same dataset is good for a sanity check.  ↩︎

0 views

How to Train Your GPT Wrapper

One of the most common complaints I hear from users of AI agents is, "Why do I have to tell it the same thing over and over?" They expect their tools to learn from experience, but the reality is that most don't. This is because today's LLM-powered apps are fundamentally static; they don't learn purely from individual interactions. 1 As building agents becomes better defined and many products have shipped their first agentic MVPs, what’s becoming clear is that the next new thing may be how to get these agents to reliably and securely self-improve. This applies to both knowledge (gaining persistent user-related context) and behavior (learning to more effectively solve problems) which are independent but highly interrelated. In some online contexts, you’ll see this referred to as agent “memory,” and to me, that's just an implementation for achieving this experience. If machine learning (ML) was supposed to “ learn from experience E with respect to some class of tasks T …” why are our GPT wrappers, built using ML, not actually learning from experience? The answer is: technically they could, but training these next-token-prediction models is actually a fairly non-trivial problem compared to their task-specific classification/regression/etc counterparts. In this post, I wanted to go through the modern toolbox for agent self-improvement and why it’s complicated. 2 “How to Train Your GPT Wrapper” by ChatGPT. Why is self-learning hard? Training (as in updating parameters) LLMs is still hard 3 If you have a knowledge base, you can’t just “train” on it. Traditional Supervised Fine-Tuning (SFT) requires a large dataset of conversational examples ( , ) rather than just knowledge material. If you are building a tool-use agent or a reasoning model, you often can’t train on just examples but instead rely on reinforcement learning to steer the model towards a reward. This takes quite a bit more compute, relies on a high quality reward function (which isn’t maximizing user ratings! 4 ), and either user data or highly realistic simulated environments. While you can attempt to anonymize, a global model trained on one user's data still has the potential to leak information to others 5 . While fine-tuning on synthetic data is an option for enterprises with privacy concerns, generating high-quality synthetic data is a significant challenge, often making this a non-starter in practice. Today's models have hundreds of billions of parameters with quite a bit of complexity around how to both train and serve them. While we’ve developed several ways of efficiently fine-tuning, there’s no platform (yet) that makes it trivial to regularly turn feedback into new, servable models. 6 Training (as in prompting, aka in-context-learning ) is costly Every piece of information added to the prompt, past conversations, tool outputs, user feedback, consumes tokens. This makes naive feedback quadratic in cost and latency as each interaction potentially generates feedback which is appended to the prompt in every future interaction. Applications rely heavily on prompt caching to manage costs. However, the more you personalize the context with user-specific rules and feedback, the lower your cache hit rate becomes. State makes everything more complicated 7 Once an agent starts learning, its past interactions can influence future behavior. Did the agent give a bad response because of a recent change in the system prompt, a new feature, or a piece of user feedback from three weeks ago? The "blast radius" of a single piece of learned information is hard to predict and control. What happens when a user's preferences change, or when information becomes outdated? A system that can't effectively forget is doomed to make mistakes based on old, irrelevant data. Imagine telling your agent to never answer questions on a certain topic, but then a product update makes that topic relevant again. The agent's "memory" might prevent it from adapting. For any of this to work, users have to trust you with their data and their feedback. This brings us back to the data leakage problem. There's an inherent tension between creating a globally intelligent system that learns from all users and a personalized one that respects individual privacy. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. The core determiner for how you do self-improvement is what data you can get from the user, ranging from nothing at all to detailed corrections and explanations. The richer the feedback, the less samples needed to make a meaningful improvement. It’s also a key product decision to determine the effect radius for different forms of feedback. I’ll call this the “preference group”; the group of users (or interactions) in which a given piece of feedback causes a change in agent behavior. These groups could be along explicit boundaries (by user, team, or other legal organization) or derived boundaries (geographic region, working file paths, usage persona, etc). Grouping too small (e.g. user level) increases cold start friction and means several users will experience the same preventable mistakes, some never seeing any improvement until they provide sufficient feedback. For parameter-based training, it can also be unmanageable to have highly granular copies of the model weights (even if from PEFT). Grouping too large (e.g. globally) leads to riskier agent updates and unusual behavior. One user with “weird” feedback could directly degrade the efficacy of the agent for all other users. Even when you have no explicit signal from the user on how your agent is performing you can improve the system. While users get a potentially more focused experience, with a lack of signal, you’ll need to derive approximate feedback from high-volume, low-signal proxy data. There’s high potential to make false assumptions but this can be compensated for my aggregating more data (i.e. over time or preference group size) per model update. What you could do: Use LLMs to determine preferences or explanations — Take (question, answer) pairs and use LLMs (or even simpler heuristics) to determine if this was a preferred answer or what the preferred answer would have been. Effectively running your own LLM-as-judge setup to determine what the user might’ve told you 8 . With this, proceed to cases 1, 2, or 3. Use engagement metrics to determine preferences — Take traditional analytics on engagement with your agent to approximate the quality of responses. Did the user come back? Did the buy the thing you showed them? How much time did they spend? Turning these types of analytics into preferences on your agent’s responses. With this, proceed to case 1. Use agent tool failures as implicit signals — You can log every tool call and its outcome (success, failure, or the content of the response). Recurring tool failures, inefficient tool-use loops, or patterns where the agent calls a tool with nonsensical arguments are all strong implicit signals that the agent's reasoning is flawed for a particular type of task. These failed "trajectories" can be automatically flagged and used as negative examples for Case 1. Use simulation to generate feedback — Use an LLM to act as a "user simulator", generating a diverse set of realistic queries and tasks. Then, have your agent attempt to solve these tasks in a synthetic gym environment. Since you define the environment and task, you can often automatically verify if the agent succeeded (e.g., "Did it pass the tests?") and use this outcome as a reward signal. This synthetic data can then be used to create preference pairs or corrections, allowing you to train your agent using the methods from cases 1, 2, or 3. Keep the chat history — While their are plenty of reasons this might make things worse, another option when there’s no clear preferences or feedback is provided is to just include the previous chats (or chat summaries) in future prompts within the same preference group. You do this with the hope that the collective context of previous chats, the agent can steer towards better responses. Rely on third-party grounding — You could also rely on a 3rd party API to give the agents hints or updated instructions for how to solve a particular task. A simple example of this would be to have an agent that can “google” for how to solve the problem and as google indexes online posts, your agent might natural begin to improve. For any given agent you are building, there might be some pre-existing knowledge base you can lean on for “self-improvement”. Case 1: Users give preferences (👍👎) This is one of the most common feedback mechanisms. It's low-friction for the user, provides a clear signal that can be easily turned into a metric, and is a step up from inferring feedback from proxy data. However, the signal itself can be noisy. Users might downvote a correct answer because it was unhelpful for their specific need, or upvote an incorrect one that just sounds confident. What you could do: Fine-tune with preferences — You can train the model by constructing pairs from the data you collect. A response that receives a 👍 becomes a "chosen" example, while one that gets a 👎 becomes a "rejected" one, and these are then paired for training. From there, classic RLHF can use these pairs to train a reward model that guides the main agent. A more direct alternative is DPO, which skips the reward model and uses the constructed pairs to directly fine-tune the agent's policy. Use LLMs to derive explanations — Aggregate the 👍/👎 data across a preference group and use another LLM to analyze the patterns and generate a hypothesis for why certain responses were preferred. This process attempts to turn many low-quality signals into a single, higher-quality explanation, which you can then use to update documentation or create few-shot examples as described in Case 2. Use in-context learning with examples — Dynamically pull examples of highly-rated and poorly-rated responses and place them into the context window for future queries within the same preference group. This lets the agent "learn" at inference time to steer its answers towards a preferred style or content format. Case 2: Users give you explanations Here, instead of a simple preference, the user provides a natural language explanation of what went wrong (e.g., "That's not right, you should have considered the legacy API," or "Don't use that library, it's deprecated."). This feedback requires more effort from the user, but the signal quality is extremely high; a single good explanation can be more valuable than hundreds of thumbs-ups. Users are often willing to provide this level of detail if they believe the agent will actually learn from it and save them time in the future. This feedback can be collected through an explicit UI, in the flow of conversation, or even inferred from subsequent user actions. What you could do: Synthesize a corrected answer — One use of an explanation is to try and generate the corrected answer. You can use another LLM as a "refiner" that takes the and outputs a . If this synthesis is successful, you've effectively created a high-quality pair and can move to Case 3. Use in-context learning with explanations — Store the pairs. When a new, similar query comes in, you can retrieve the most relevant pairs and inject them into the prompt. This gives the agent a just-in-time example of a pitfall to avoid and the reasoning behind it, steering it away from making the same mistake twice or doubling down on what worked. Distill feedback into reusable knowledge — Aggregate explanations to find recurring issues—like an agent's travel suggestions being too generic. An LLM can then synthesize these complaints into a single, concise rule. This new rule can either be added to the system prompt to fix the behavior for a user group, or it can be inserted into a knowledge base. For example, a synthesized rule like, "When planning itineraries, always include a mix of popular sites and unique local experiences," can be stored and retrieved for any future travel-related queries, ensuring more personalized and higher-quality suggestions. Here, the user doesn't just explain what's wrong; they provide the correct answer by directly editing the agent's output. The "diff" between the agent's suggestion and the user's final version creates a high-quality training example. Depending on the product's design, this can often be a low-friction way to gather feedback, as the user was going to make the correction anyway as part of their natural workflow, whether they're fixing a block of generated code or rewriting a paragraph in a document. What you could do: Fine-tune with edit pairs — Use the pair for Supervised Fine-Tuning (SFT) to teach the model the correct behavior. Alternatively, you can use the pair for preference tuning methods like DPO, treating the user's edit as the "chosen" response and the agent's initial attempt as the "rejected" one. Use in-context learning with corrections — Store the pairs. When a similar query comes in, you can retrieve the most relevant pairs and inject them into the prompt as a concrete example of what to do and what to avoid, steering the agent toward the correct format or content at inference time. Derive explanations — You can also work backward from the edit to enrich your prompts and/or knowledge bases. Use an LLM to analyze the "diff" between the original and edited text to generate a natural language explanation for the change, in some sense capturing the user's intent. This synthesized explanation can then be used in all the ways described in Case 2. Other considerations How do you handle observability and debuggability? — When an agent's "memory" causes unexpected behavior, debugging becomes a challenge. A key design choice is whether to provide users with an observable "memory" panel to view, edit, or reset learned information. This creates a trade-off between debuggability and the risk of overwhelming or confusing users with their own data profile. How do you pick the "preference group"? — Choosing the scope for feedback involves a trade-off between cold-starts and risk. User-level learning is slow to scale, while global learning can be degraded by outlier feedback. A common solution is grouping users by explicit boundaries (like a company) or implicit ones (like a usage persona). The design of these groups also has business implications; a group could be defined to span across both free and paid tiers, allowing feedback from a large base of unpaid users to directly improve the product for paying customers. How do you decide which feedback case to use? — The progression from simple preferences (Case 1) to detailed explanations or edits (Cases 2 & 3) depends heavily on user trust. Users will only provide richer feedback when they believe the system is actually listening. This trust can be accelerated by making the agent's reasoning process transparent, which empowers users to self-debug and provide more targeted suggestions. How much should be learned via fine-tuning vs. in-context learning? — A core architectural choice is whether to learn via parameter changes (fine-tuning) or prompt changes (in-context learning/RAG). ICL is often faster and cheaper, especially as foundational models improve rapidly, making fine-tuned models quickly obsolete. While fine-tuning on synthetic data is an option for enterprises with privacy concerns, generating high-quality synthetic data is a significant challenge, often making prompt-based learning the more practical path. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. See Dwarkesh’s post on “Why I don’t think AGI is right around the corner” particularly about continual learning. It’s truely a pretty critical gap with today’s agents and LLM-powered products but one that I’m pretty bullish is mostly solvable at a “scaffolding” layer (rather than a fundamental ceiling with LLMs). I wrote this based on my own brainstorm of ideas and hope that this is mostly conclusive but there’s definitely a chance I missed some, let me know! By calling both of these expensive—costs and latency-wise—I’m also implying this rationale will become less important over time but remains a medium-term design consideration. See OpenAI’s “Sycophancy in GPT-4o” kerfuffle. For example, imagine a manager's private feedback, "Bob on Project Stardust often misses deadlines," is naively anonymized for fine-tuning a global model. The model learns the association between the unique entity "Project Stardust" and the concept of "missing deadlines." A later query from another user about "Project Stardust" could then elicit a response about engineers on that project struggling with deadlines, effectively leaking the substance of the private feedback even if the name "Bob" is masked. This is one of those things that a lot of AI platform startups will say they can do this, but I haven’t seen anything yet that proves it can be done completely end-to-end while being something I’d trust in production. There are several interesting parallels to the complexity of agent memory and the more well-studied occurrences of state-complexity in software engineering . Contrary to popular belief, training LLMs to optimize their own preferences, when done carefully, can be a pretty powerful zero-data training technique. See “Absolute Zero: Reinforced Self-play Reasoning with Zero Data” and “Self-Adapting Language Models” . “How to Train Your GPT Wrapper” by ChatGPT. Why is self-learning hard? Training (as in updating parameters) LLMs is still hard 3 If you have a knowledge base, you can’t just “train” on it. Traditional Supervised Fine-Tuning (SFT) requires a large dataset of conversational examples ( , ) rather than just knowledge material. If you are building a tool-use agent or a reasoning model, you often can’t train on just examples but instead rely on reinforcement learning to steer the model towards a reward. This takes quite a bit more compute, relies on a high quality reward function (which isn’t maximizing user ratings! 4 ), and either user data or highly realistic simulated environments. While you can attempt to anonymize, a global model trained on one user's data still has the potential to leak information to others 5 . While fine-tuning on synthetic data is an option for enterprises with privacy concerns, generating high-quality synthetic data is a significant challenge, often making this a non-starter in practice. Today's models have hundreds of billions of parameters with quite a bit of complexity around how to both train and serve them. While we’ve developed several ways of efficiently fine-tuning, there’s no platform (yet) that makes it trivial to regularly turn feedback into new, servable models. 6 Training (as in prompting, aka in-context-learning ) is costly Every piece of information added to the prompt, past conversations, tool outputs, user feedback, consumes tokens. This makes naive feedback quadratic in cost and latency as each interaction potentially generates feedback which is appended to the prompt in every future interaction. Applications rely heavily on prompt caching to manage costs. However, the more you personalize the context with user-specific rules and feedback, the lower your cache hit rate becomes. State makes everything more complicated 7 Once an agent starts learning, its past interactions can influence future behavior. Did the agent give a bad response because of a recent change in the system prompt, a new feature, or a piece of user feedback from three weeks ago? The "blast radius" of a single piece of learned information is hard to predict and control. What happens when a user's preferences change, or when information becomes outdated? A system that can't effectively forget is doomed to make mistakes based on old, irrelevant data. Imagine telling your agent to never answer questions on a certain topic, but then a product update makes that topic relevant again. The agent's "memory" might prevent it from adapting. For any of this to work, users have to trust you with their data and their feedback. This brings us back to the data leakage problem. There's an inherent tension between creating a globally intelligent system that learns from all users and a personalized one that respects individual privacy. Grouping too small (e.g. user level) increases cold start friction and means several users will experience the same preventable mistakes, some never seeing any improvement until they provide sufficient feedback. For parameter-based training, it can also be unmanageable to have highly granular copies of the model weights (even if from PEFT). Grouping too large (e.g. globally) leads to riskier agent updates and unusual behavior. One user with “weird” feedback could directly degrade the efficacy of the agent for all other users. Use LLMs to determine preferences or explanations — Take (question, answer) pairs and use LLMs (or even simpler heuristics) to determine if this was a preferred answer or what the preferred answer would have been. Effectively running your own LLM-as-judge setup to determine what the user might’ve told you 8 . With this, proceed to cases 1, 2, or 3. Use engagement metrics to determine preferences — Take traditional analytics on engagement with your agent to approximate the quality of responses. Did the user come back? Did the buy the thing you showed them? How much time did they spend? Turning these types of analytics into preferences on your agent’s responses. With this, proceed to case 1. Use agent tool failures as implicit signals — You can log every tool call and its outcome (success, failure, or the content of the response). Recurring tool failures, inefficient tool-use loops, or patterns where the agent calls a tool with nonsensical arguments are all strong implicit signals that the agent's reasoning is flawed for a particular type of task. These failed "trajectories" can be automatically flagged and used as negative examples for Case 1. Use simulation to generate feedback — Use an LLM to act as a "user simulator", generating a diverse set of realistic queries and tasks. Then, have your agent attempt to solve these tasks in a synthetic gym environment. Since you define the environment and task, you can often automatically verify if the agent succeeded (e.g., "Did it pass the tests?") and use this outcome as a reward signal. This synthetic data can then be used to create preference pairs or corrections, allowing you to train your agent using the methods from cases 1, 2, or 3. Keep the chat history — While their are plenty of reasons this might make things worse, another option when there’s no clear preferences or feedback is provided is to just include the previous chats (or chat summaries) in future prompts within the same preference group. You do this with the hope that the collective context of previous chats, the agent can steer towards better responses. Rely on third-party grounding — You could also rely on a 3rd party API to give the agents hints or updated instructions for how to solve a particular task. A simple example of this would be to have an agent that can “google” for how to solve the problem and as google indexes online posts, your agent might natural begin to improve. For any given agent you are building, there might be some pre-existing knowledge base you can lean on for “self-improvement”. Case 1: Users give preferences (👍👎) This is one of the most common feedback mechanisms. It's low-friction for the user, provides a clear signal that can be easily turned into a metric, and is a step up from inferring feedback from proxy data. However, the signal itself can be noisy. Users might downvote a correct answer because it was unhelpful for their specific need, or upvote an incorrect one that just sounds confident. What you could do: Fine-tune with preferences — You can train the model by constructing pairs from the data you collect. A response that receives a 👍 becomes a "chosen" example, while one that gets a 👎 becomes a "rejected" one, and these are then paired for training. From there, classic RLHF can use these pairs to train a reward model that guides the main agent. A more direct alternative is DPO, which skips the reward model and uses the constructed pairs to directly fine-tune the agent's policy. Use LLMs to derive explanations — Aggregate the 👍/👎 data across a preference group and use another LLM to analyze the patterns and generate a hypothesis for why certain responses were preferred. This process attempts to turn many low-quality signals into a single, higher-quality explanation, which you can then use to update documentation or create few-shot examples as described in Case 2. Use in-context learning with examples — Dynamically pull examples of highly-rated and poorly-rated responses and place them into the context window for future queries within the same preference group. This lets the agent "learn" at inference time to steer its answers towards a preferred style or content format. Case 2: Users give you explanations Here, instead of a simple preference, the user provides a natural language explanation of what went wrong (e.g., "That's not right, you should have considered the legacy API," or "Don't use that library, it's deprecated."). This feedback requires more effort from the user, but the signal quality is extremely high; a single good explanation can be more valuable than hundreds of thumbs-ups. Users are often willing to provide this level of detail if they believe the agent will actually learn from it and save them time in the future. This feedback can be collected through an explicit UI, in the flow of conversation, or even inferred from subsequent user actions. What you could do: Synthesize a corrected answer — One use of an explanation is to try and generate the corrected answer. You can use another LLM as a "refiner" that takes the and outputs a . If this synthesis is successful, you've effectively created a high-quality pair and can move to Case 3. Use in-context learning with explanations — Store the pairs. When a new, similar query comes in, you can retrieve the most relevant pairs and inject them into the prompt. This gives the agent a just-in-time example of a pitfall to avoid and the reasoning behind it, steering it away from making the same mistake twice or doubling down on what worked. Distill feedback into reusable knowledge — Aggregate explanations to find recurring issues—like an agent's travel suggestions being too generic. An LLM can then synthesize these complaints into a single, concise rule. This new rule can either be added to the system prompt to fix the behavior for a user group, or it can be inserted into a knowledge base. For example, a synthesized rule like, "When planning itineraries, always include a mix of popular sites and unique local experiences," can be stored and retrieved for any future travel-related queries, ensuring more personalized and higher-quality suggestions. Case 3: Users give you edits Here, the user doesn't just explain what's wrong; they provide the correct answer by directly editing the agent's output. The "diff" between the agent's suggestion and the user's final version creates a high-quality training example. Depending on the product's design, this can often be a low-friction way to gather feedback, as the user was going to make the correction anyway as part of their natural workflow, whether they're fixing a block of generated code or rewriting a paragraph in a document. What you could do: Fine-tune with edit pairs — Use the pair for Supervised Fine-Tuning (SFT) to teach the model the correct behavior. Alternatively, you can use the pair for preference tuning methods like DPO, treating the user's edit as the "chosen" response and the agent's initial attempt as the "rejected" one. Use in-context learning with corrections — Store the pairs. When a similar query comes in, you can retrieve the most relevant pairs and inject them into the prompt as a concrete example of what to do and what to avoid, steering the agent toward the correct format or content at inference time. Derive explanations — You can also work backward from the edit to enrich your prompts and/or knowledge bases. Use an LLM to analyze the "diff" between the original and edited text to generate a natural language explanation for the change, in some sense capturing the user's intent. This synthesized explanation can then be used in all the ways described in Case 2. Other considerations How do you handle observability and debuggability? — When an agent's "memory" causes unexpected behavior, debugging becomes a challenge. A key design choice is whether to provide users with an observable "memory" panel to view, edit, or reset learned information. This creates a trade-off between debuggability and the risk of overwhelming or confusing users with their own data profile. How do you pick the "preference group"? — Choosing the scope for feedback involves a trade-off between cold-starts and risk. User-level learning is slow to scale, while global learning can be degraded by outlier feedback. A common solution is grouping users by explicit boundaries (like a company) or implicit ones (like a usage persona). The design of these groups also has business implications; a group could be defined to span across both free and paid tiers, allowing feedback from a large base of unpaid users to directly improve the product for paying customers. How do you decide which feedback case to use? — The progression from simple preferences (Case 1) to detailed explanations or edits (Cases 2 & 3) depends heavily on user trust. Users will only provide richer feedback when they believe the system is actually listening. This trust can be accelerated by making the agent's reasoning process transparent, which empowers users to self-debug and provide more targeted suggestions. How much should be learned via fine-tuning vs. in-context learning? — A core architectural choice is whether to learn via parameter changes (fine-tuning) or prompt changes (in-context learning/RAG). ICL is often faster and cheaper, especially as foundational models improve rapidly, making fine-tuned models quickly obsolete. While fine-tuning on synthetic data is an option for enterprises with privacy concerns, generating high-quality synthetic data is a significant challenge, often making prompt-based learning the more practical path.

0 views
fnands 5 months ago

Can multi-sensor foundation models be more than the sum of their parts?

Geospatial foundation models (GFMs) have been on my mind recently, partially because I attended the ESA-NASA International Workshop on AI Foundation Model for EO and partially because I’ve been working on fine tuning some GFMs for downstream use at work for the last while. This post was in part prompted by two recent LinkedIn posts, one by Christopher Ren and the other by Madeline Lisaius , both of which express some amount of skepticism about the way in which current GFMs are trained, although from somewhat different angles. Christopher Ren also wrote an expanded blog post on the subject, which takes aim mostly at IBMs new TerraMind GFM, but it is worth reading the responses from one of the TerraMind authors at the bottom of the post as that adds some nuance to the arguments. It’s somewhat clear that GFMs are a hot topic in the Earth Observation (EO) space at the moment, and it is fair to question whether the hype is warranted. At the ESA-NASA workshop one of the points made was that there seems to be much more activity in the creation of GFMs than actual downstream use of them so far, and there were some interesting discussions as to why this might be the case at the moment. A recent post from Bruno Sanchez-Andrade Nuño (director of the Clay project) also made me think that there is a rough bifurcation in the GFM space appearing: one branch goes deep and the other goes wide. I think it is best if we understand which branch a model fits into and not judge one by the standards of the others. I’m not going to directly respond to the other conversations going on: I’m just adding my two cents to the mix, and I want to be clear that the work I am doing definitely falls into the “go deep” branch, and my opinions are very much coloured by that fact. On the surface this might seem like a slightly odd question seeing as one of the principle reasons people are interested in GFMs (and FMs in general) is better generalization: EO is after all often a global endeavour, and it is desirable to have a foundation model that will help your downstream tasks generalize across geographies, illumination conditions, imaging angles etc. But there are many aspects to generalization, some of which don’t apply to all sensors. An example is the time of day an image was taken at. This can strongly affect what your image looks like, as shadows and illumination levels can vary greatly by time of day. This however does not really affect missions like Sentinel-2, where the orbit has been selected in such a way that the mean local solar time when the image is taken is always approximately 10:30 am, leading (by design) to very consistent illumination levels. Similar arguments go for viewing angles. One of the ways that people have been trying to get more general is to train foundation models on multiple sources. Examples of this are the Clay foundation model (at least V1) which was trained on a wide range of sensors from MODIS, with a 500 m GSD, to aerial imagery of New Zealand at under 50 cm GSD: Another example of this is DOFA, which takes a similar approach to variety in input sensors, this time including hyperspectral data at 224 spectral bands: The DOFA paper is worth a read, and putting on my scientist hat: this is really interesting work, and it’s really interesting to see the different solutions that these authors have come up with to make a single model deal with such varied inputs. But putting back on my engineer hat I have to ask: what do we gain from this? One of the points made in the DOFA paper: The increasing number of specialized foundation models makes it difficult to select the most appropriate one for a specific downstream task. On the surface this sounds fair, but is it really that hard to go to PapersWithCode , find the most similar dataset to your downstream task and select a model based on that? I can’t really think of a scenario where you would not just spend a day or two searching through the literature for the most fitting model for your particular use case. The one case I can maybe think this might be the case is if you are a geospatial person with no ML skills and have a model that was set up for you as a black box behind some interface and squeezing every last bit of performance out is not critical to you. When implementing a model for a specific product need, one often focuses on the specific sensor, or at least a set of very similar sensors, e.g. sub-meter GSD sensors with at least the four usual spectral bands. When building a product that will utilize exclusively Sentinel-1 data, does the model really gain anything from being trained on Sentinel-2 and aerial imagery as well? With all that being said, if you do have multiple sensors available at inference time (e.g. Sentinel-2 and Sentinel-1 data), it does seem to make sense to train/infer on multiple modalities at once. See e.g.  table 2 in the TerraMind paper . A while ago we were testing out a few foundation models as backbones for a product we are developing, which boils down to bi-temporal change detection using Planet’s SkySat constellation. We chose the main backbone we are using based on benchmarks, but I did have the nagging question of how much do we really gain from this, and if other backbones might offer better performance? This was basically the theme of my talk at the aforementioned ESA-NASA workshop. I ran a few test using a variety of FM backbones, some trained on remote sensing data, and some just on natural images, just to see how much the pre-training dataset really matters. To make the comparison fair, the backbones used all had around 100 M parameters, but I did throw in a smaller baseline ( ChangeFormer ), as well as a 300 M version of the best performing network just to see if size matters (spoiler: it does). One of the most interesting comparisons here is DINOv2: I used two variations, one using the original weights trained on natural images from Meta, and another with weights from Keumgang Cha , which were trained on the MillionAID and SkyScript datasets. MillionAID is exclusively aerial imagery, while SkyScript contains mostly aerial imagery, plus some SkySat, Sentinel-2 and Landsat images. It’s abundantly clear that the same architecture trained on remote sensing images greatly improve downstream performance compared to a the variant that was trained on natural images. This is expected, but it’s impressive to see how large this gap is. The best model we tested was trained mostly on aerial imagery, showing the domain gap isn’t so much whether or not your sensor is in space or on a plane, but has more to do with similar resolutions. The models were all trained for the same number of epochs, on the same relatively small dataset (around 1600 patches of 512 x 512 pixels) with the same optimizer etc. The encoders were not frozen, but trained with a lower learning rate than the decoders, as is common practice in most transfer learning scenarios. I will caveat this all with saying that I didn’t do a massive amount of hyperparameter tuning for this test, but I think the differences are significant enough that it probably wouldn’t make too much of a difference. What I would need to see to be convinced is that when training a foundation model on multiple sensors that it would perform better on downstream tasks on each of the sensors than it would be if it was trained exclusively on the specific sensor to be used. I.e. one would need to show that the model would be more than the sum of it’s parts. The question is pretty much, given the same architecture, compute budget and dataset size, can a model learn something from one sensor that improves its performance on another? Or could it be that we need to throw everything into a big bucket and burn a lot of compute in the fashion of the current big LLMs that are so popular right now in order to really see generalization? I’m definitely not ruling out the possibility that there might be some case (e.g. the sensor you are targeting doesn’t have a lot of data available), but I have the feeling that the further away in GSD and spectral characteristics you go the less helpful pre-training becomes. It’s fairly obvious that the best GFM you can choose will likely be the one trained on the exact sensor you are targeting for your downstream task. This is fairly easy for sensors like the Sentinel missions or the Landsat missions, where anyone with a bit of compute to burn can easily download tons of data from those missions and train a model. Even for aerial imagery there is a lot of open data available, with the caveat that the data is not as global, and aerial sensors do have some sensor to sensor variability. Where this gets tricky is in the commercial domain, where data isn’t freely available and providers put strict licenses on their data 1 . To train a foundation model on commercial data requires you to dump somewhere between hundreds of thousands up to millions of Euros on data alone, which is infeasible for most researchers, and a significant investment for most companies. The only case that I know of so far of someone creating a sensor specific foundation model is a Pleiades Neo foundation model created by Disaitek , which was made possible by being granted access to Pleiades Neo imagery through a “Call for Innovation” from Aribus and CNES. Disaitek of course does not make this model public, as this presumably gives them a bit of an edge over their competitors, and as the model was trained on data covering France only, it is questionable of how much use it would be in other parts of the world. So what can be done in the commercial space? Most companies don’t have access to enough data to easily train a foundation model, and those who do are unlikely to share it as it gives them an edge over their competition. The only players with both the access to the data and the incentive to make these models available to others are the imagery providers themselves, i.e. Planet, Airbus, Maxar, Blacksky, Capella etc. Do I think these providers will just open these models for all to use? I doubt it, but they might offer it as a perk to their customers. I.e. something along the lines of “buy at least X Euro worth of imagery per year and get access to our FM”. The competition in the 30 cm class imagery space seems to be heating up, with several players building up large constellations of satellites in this resolution range, like Maxar’s Legion, Planet’s Pelican and BlackSky’s Gen-3. One way these providers can differentiate their offerings would be by offering a foundation model trained on their specific sensor. Whether I think it’s likely that they do this is another question. Please take this post for what it is: the opinionated rant of someone who works in a somewhat privileged niche of the EO domain where I have a lot of expensive VHR data to play with. The problems I am trying to solve and the constraints I have are likely quite different from those that others might encounter. With that being said, if you find yourself if a similar boat to me and are wondering which foundation model to pick for your particular task: pick the one trained on the closest thing you can find to the sensor you are targeting. I am kind of hoping that someone does prove me wrong, and I will happily write an apology post if some does so. The one exception here is Umbra, who have a very generous open data program , and probably have enough data there that anyone can just train a decently sized model on their data.↩︎ The one exception here is Umbra, who have a very generous open data program , and probably have enough data there that anyone can just train a decently sized model on their data.↩︎

0 views