GreatReads - Blog Aggregator · Phoenix Framework

Beyond Standard LLMs

From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs popping up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance. After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions with respect to what I think about alternative approaches. (I also recently gave a short talk about that at the PyTorch Conference 2025, where I also promised attendees to follow up with a write-up of these alternative approaches). So here it is! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 and many more. (The list above focuses on the open-weight models; there are proprietary models like GPT-5, Grok 4, Gemini 2.5, etc. that also fall into this category.) Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. In code, a simplified version of the Gated DeltaNet depicted above (without the convolutional mixing) can be implemented as follows (the code is inspired by the official implementation by the Qwen3 team): (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. (And the final output gate, not shown in the snippet above, is similar to gated attention; it controls how much of the output is kept.) So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with context length. The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention. Gated DeltaNet, can, to some extend, still capture context, but it has to go through the memory ( S ) bottleneck. That memory is a fixed size and thus more efficient, but it compresses past context into a single hidden state similar to RNNs. That’s why the Qwen3-Next and Kimi Linear architectures don’t replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier. In the previous section, we discussed the advantage of the DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don’t grow the KV cache. (For more information about KV caching, see my Understanding and Coding the KV Cache in LLMs from Scratch article). Instead, as mentioned earlier, they keep a fixed-size recurrent state, so memory stays constant with context length. For a regular multi-head attention (MHA) layer, we can compute the KV cache size as follows: (The 2 multiplier is there because we have both keys and values that we store in the cache.) For the simplified DeltaNet version implemented above, we have: Note that the memory size doesn’t have a context length ( ) dependency. Also, we have only the memory state S that we store instead of separate keys and values, hence becomes just bytes. However, note that we now have a quadratic in here. This comes from the state: But that’s usually nothing to worry about, as the head dimension is usually relatively small. For instance, it’s 128 in Qwen3-Next. The full version with the convolutional mixing is a bit more complex, including the kernel size and so on, but the formulas above should illustrate the main trend and motivation behind the Gated DeltaNet. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights. It’s important to note that TRM is not a language model operating on text. However, because (a) it’s a transformer-based architecture, (b) reasoning is now a central focus in LLM research, and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor) I decided to include it here. While TRM could be extended to textual question-answer tasks in the future, TRM currently works on grid-based inputs and outputs. In other words, both the “question” and the “answer” are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences. HRM consists of two small transformer modules (each 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that’s likely to make it easier to compare against HRM.) TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few. HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop iterating. Performance-wise, TRM performs really well compared to HRM, as shown in the figure below. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length. While HRM and TRM achieve really good reasoning performance on these benchmarks, comparing them to large LLMs is not quite fair. HRM and TRM are specialized models for tasks like ARC, Sudoku, and Maze pathfinding, whereas LLMs are generalists. Sure, HRM and TRM can be adopted for other tasks as well, but they have to be specially trained on each task. So, in that sense, we can perhaps think of HRM and TRM as efficient pocket calculators, whereas LLM are more like computers, which can do a lot of other things as well. Still, these recursive architectures are exciting proof-of-concepts that highlight how small, efficient models can “reason” through iterative self-refinement. Perhaps, in the future, such models could act as reasoning or planning modules embedded within larger tool-using LLM systems. For now, LLMs remain ideal for broad tasks, but domain-specific recursive models like TRM can be developed to solve certain problems more efficiently once the target domain is well understood. Beyond the Sudoku, Maze finding, and ARC proof-of-concept benchmarks, there are possibly lots of use cases in the physics and biology domain where such models could find use. As an interesting tidbit, the author shared that it took less than $500 to train this model, with 4 H100s for around 2 days. I am delighted to see that it’s still possible to do interesting work without a data center. I originally planned to cover all models categories in the overview figure, but since the article ended up longer than I expected, I will have to save xLSTMs, Liquid Foundation Models, Transformer-RNN hybrids, and State Space Models for another time (although, Gated DeltaNet already gave a taste of State Space Models and recurrent designs.) As a conclusion to this article, I want to repeat the earlier words, i.e., that standard autoregressive transformer LLMs are proven and have stood the test of time so far. They are also, if efficiency is not the main factor, the best we have for now. Traditional Decoder-Style, Autoregressive Transformers + Proven & mature tooling + “well-understood” + Scaling laws + SOTA - Expensive training - Expensive inference (except for aforementioned tricks) If I were to start a new LLM-based project today, autoregressive transformer-based LLMs would be my first choice. I definitely find the upcoming attention hybrids very promising, which are especially interesting when working with longer contexts where efficiency is a main concern. Linear Attention Hybrids + Same as decoder-style transformers + Cuts FLOPs/KV memory at long-context tasks - Added complexity - Trades a bit of accuracy for efficiency On the more extreme end, text diffusion models are an interesting development. I’m still somewhat skeptical about how well they perform in everyday use, as I’ve only tried a few quick demos. Hopefully, we’ll soon see a large-scale production deployment with Google’s Gemini Diffusion that we can test on daily and coding tasks, and then find out how people actually feel about them. Text Diffusion Models + Iterative denoising is a fresh idea for text + Better parallelism (no next-token dependence) - Can’t stream answers - Doesn’t benefit from CoT? - Tricky tool-calling? - Solid models but not SOTA While the main selling point of text diffusion models is improved efficiency, code world models sit on the other end of the spectrum, where they aim to improve modeling performance. As of this writing, coding models, based on standard LLMs, are mostly improved through reasoning techniques, yet if you have tried them on trickier challenges, you have probably noticed that they (more or less) still fall short and can’t solve many of the trickier coding problems well. I find code world models particularly interesting and believe they could be an important next step toward developing more capable coding systems. Code World Model + Promising approach to improve code understanding + Verifiable intermediate states - Inclusion of executable code traces complicates training - Code running adds latency Lastly, we covered small recursive transformers such as hierarchical and tiny reasoning models. These are super interesting proof-of-concept models. However, as of today, they are primarily puzzle solvers, not general text or coding models. So, they are not in the same category as the other non-standard LLM alternatives covered in this article. Nonetheless, they are very interesting proofs-of-concept, and I am glad researchers are working on them. Right now, LLMs like GPT-5, DeepSeek R1, Kimi K2, and so forth are developed as special purpose models for free-form text, code, math problems and much more. They feel like brute-force and jack-of-all-trades approach that we use on a variety of tasks, from general knowledge questions to math and code. However, when we perform the same task repeatedly, such brute-force approaches become inefficient and may not even be ideal in terms of specialization. This is where tiny recursive transformers become interesting: they could serve as lightweight, task-specific models that are both efficient and purpose-built for repeated or structured reasoning tasks. Also, I can see them as potential “tools” for other tool-calling LLMs; for instance, when LLMs use Python or calculator APIs to solve math problems, special tiny reasoning models could fill this niche for other types of puzzle- or reasoning-like problems. Small Recursive Transformers + Very small architecture + Good generalization on puzzles - Special purpose models - Limited to puzzles (so far) This has been a long article, but I hope you discovered some of the fascinating approaches that often stay outside the spotlight of mainstream LLMs. And if you’ve been feeling a bit bored by the more or less conventional LLM releases, I hope this helped rekindle your excitement about AI again because there’s a lot of interesting work happening right now! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: Overview of the LLM landscape. This article covers those architectures surrounded by the black frames. The decoder-style transformers are covered in my “The Big Architecture Comparison” article. Other non-framed architectures may be covered in future articles. Note that ideally each of these topics shown in the figure above would deserve at least a whole article itself (and hopefully get it in the future). So, to keep this article at a reasonable length, many sections are reasonably short. However, I hope this article is still useful as an introduction to all the interesting LLM alternatives that emerged in recent years. PS: The aforementioned PyTorch conference talk will be uploaded to the official PyTorch YouTube channel. In the meantime, if you are curious, you can find a practice recording version below. (There is also a YouTube version here .) 1. Transformer-Based LLMs Transformer-based LLMs based on the classic Attention Is All You Need architecture are still state-of-the-art across text and code. If we just consider some of the highlights from late 2024 to today, notable models include DeepSeek V3/R1 Mistral Small 3.1 Figure 2: An overview of the most notable decoder-style transformers released in the past year. Since I talked and wrote about transformer-based LLMs so many times, I assume you are familiar with the broad idea and architecture. If you’d like a deeper coverage, I compared the architectures listed above (and shown in the figure below) in my The Big LLM Architecture Comparison article. (Side note: I could have grouped Qwen3-Next and Kimi Linear with the other transformer-state space model (SSM) hybrids in the overview figure. Personally, I see these other transformer-SSM hybrids as SSMs with transformer components, whereas I see the models discussed here (Qwen3-Next and Kimi Linear) as transformers with SSM components. However, since I have listed IBM Granite 4.0 and NVIDIA Nemotron Nano 2 in the transformer-SSM box, an argument could be made for putting them into a single category.) Figure 3. A subset of the architectures discussed in my The Big Architecture Comparison (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison) article. If you are working with or on LLMs, for example, building applications, fine-tuning models, or trying new algorithms, I would make these models my go-to. They are tested, proven, and perform well. Moreover, as discussed in the The Big Architecture Comparison article, there are many efficiency improvements, including grouped-query attention, sliding-window attention, multi-head latent attention, and others. However, it would be boring (and shortsighted) if researchers and engineers didn’t work on trying alternatives. So, the remaining sections will cover some of the interesting alternatives that emerged in recent years. 2. (Linear) Attention Hybrids Before we discuss the “more different” approaches, let’s first look at transformer-based LLMs that have adopted more efficient attention mechanisms. In particular, the focus is on those that scale linearly rather than quadratically with the number of input tokens. There’s recently been a revival in linear attention mechanisms to improve the efficiency of LLMs. The attention mechanism introduced in the Attention Is All You Need paper (2017), aka scaled-dot-product attention, remains the most popular attention variant in today’s LLMs. Besides traditional multi-head attention, it’s also used in the more efficient flavors like grouped-query attention, sliding window attention, and multi-head latent attention as discussed in my talk . 2.1 Traditional Attention and Quadratic Costs The original attention mechanism scales quadratically with the sequence length: This is because the query (Q), key (K), and value (V) are n -by- d matrices, where d is the embedding dimension (a hyperparameter) and n is the sequence length (i.e., the number of tokens). (You can find more details in my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article ) Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention due to sequence length n. 2.2 Linear attention Linear attention variants have been around for a long time, and I remember seeing tons of papers in the 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism: Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x)+1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QK T . I don’t want to dwell too long on these older attempts. But the bottom line was that they reduced both time and memory complexity from O(n 2 ) to O(n) to make attention much more efficient for long sequences. However, they never really gained traction as they degraded the model accuracy, and I have never really seen one of these variants applied in an open-weight state-of-the-art LLM. 2.3 Linear Attention Revival In the second half of this year, there has been revival of linear attention variants, as well as a bit of a back-and-forth from some model developers as illustrated in the figure below. Figure 5: An overview of the linear attention hybrid architectures. The first notable model was MiniMax-M1 with lightning attention. MiniMax-M1 is a 456B parameter mixture-of-experts (MoE) model with 46B active parameters, which came out back in June. Then, in August, the Qwen3 team followed up with Qwen3-Next, which I discussed in more detail above. Then, in September, the DeepSeek Team announced DeepSeek V3.2 . (DeepSeek V3.2 sparse attention mechanism is not strictly linear but at least subquadratic in terms of computational costs, so I think it’s fair to put it into the same category as MiniMax-M1, Qwen3-Next, and Kimi Linear.) All three models (MiniMax-M1, Qwen3-Next, DeepSeek V3.2) replace the traditional quadratic attention variants in most or all of their layers with efficient linear variants. Interestingly, there was a recent plot twist, where the MiniMax team released their new 230B parameter M2 model without linear attention, going back to regular attention. The team stated that linear attention is tricky in production LLMs. It seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks, which are not only important for regular chat sessions but also agentic applications. This could have been a turning point where linear attention may not be worth pursuing after all. However, it gets more interesting. In October, the Kimi team released their new Kimi Linear model with linear attention. For this linear attention aspect, both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which I wanted to discuss in the next few sections as one example of a hybrid attention architecture. 2.4 Qwen3-Next Let’s start with Qwen3-Next, which replaced the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) Their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks within a 3:1 ratio as shown in the figure below. Figure 6: Qwen3-Next with gated attention and Gated DeltaNet. As depicted in the figure above, the attention mechanism is either implemented as gated attention or Gated DeltaNet. This simply means the 48 transformer blocks (layers) in this architecture alternate between this. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the transformer blocks are as follows: Otherwise, the architecture is pretty standard and similar to Qwen3: Figure 7: A previous “regular” Qwen3 model (left) next to Qwen3-Next (right). So, what are gated attention and Gated DeltaNet? 2.5 Gated Attention Before we get to the Gated DeltaNet itself, let’s briefly talk about the gate. As you can see in the upper part of the Qwen3-Next architecture in the previous figure, Qwen3-Next uses “gated attention”. This is essentially regular full attention with an additional sigmoid gate. This gating is a simple modification that I added to an implementation (based on code from chapter 3 of my LLMs from Scratch book ) below for illustration purposes: As we can see, after computing attention as usual, the model uses a separate gating signal from the same input, applies a sigmoid to keep it between 0 and 1, and multiplies it with the attention output. This allows the model to scale up or down certain features dynamically. The Qwen3-Next developers state that this helps with training stability: [...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model. In short, gated attention modulates the output of standard attention. In the next section, we discuss Gated DeltaNet, which replaces the attention mechanism itself with a recurrent delta-rule memory update. 2.6 Gated DeltaNet Now, what is Gated DeltaNet? Gated DeltaNet (short for Gated Delta Network ) is Qwen3-Next’s linear-attention layer, which is intended as an alternative to standard softmax attention. It was adopted from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper as mentioned earlier. Gated DeltaNet was originally proposed as an improved version of Mamba2, where it combines the gated decay mechanism of Mamba2 with a delta rule. Mamba is a state-space model (an alternative to transformers), a big topic that deserves separate coverage in the future. The delta rule part refers to computing the difference (delta, Δ) between new and predicted values to update a hidden state that is used as a memory state (more on that later). (Side note: Readers with classic machine learning literature can think of this as similar to Hebbian learning inspired by biology: “Cells that fire together wire together.” It’s basically a precursor of the perceptron update rule and gradient descent-based learning, but without supervision.) Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.) Figure 8: Gated attention compared to Gated DeltaNet. However, as shown in the figure above, next to the output gate, the “gated” in the Gated DeltaNet also refers to several additional gates: α (decay gate) controls how fast the memory decays or resets over time, β (update gate) controls how strongly new inputs modify the state. (Note that for simplicity, I omitted the convolutional mixing that Qwen3-Next and Kimi Linear use to keep the code more readable and focus on the recurrent aspects.) So, as we can see above, there are lots of differences to standard (or gated) attention. In gated attention, the model computes normal attention between all tokens (every token attends or looks at every other token). Then, after getting the attention output, a gate (a sigmoid) decides how much of that output to keep. The takeaway is that it’s still the regular scaled-dot product attention that scales quadratically with the context length. As a refresher, scaled-dot product attention is computed as softmax(QKᵀ)V, where Q and K are n -by- d matrices, where n is the number of input tokens, and d is the embedding dimension. So QKᵀ results in an attention n -by- n matrix, that is multiplied by an n -by- d dimensional value matrix V . Figure 9: The traditional attention mechanism (again), which scales with the number of tokens n . In Gated DeltaNet, there’s no n -by- n attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is what’s implemented as, where S is the state that gets updated recurrently for each time step t . And the gates control how that memory changes: α (alpha) regulates how much of the old memory to forget (decay). β (beta) regulates how much the current token at time step t updates the memory. Figure 10: A comparison of the growing KV cache size. The 3:1 ratio refers to the ratio of Gated DeltaNet to full attention layers. The calculation assumes emb_dim=2048, n_heads=16, n_layers=48, bf16. You can find the code to reproduce this here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/08_deltanet. 2.8 Kimi Linear vs. Qwen3-Next Kimi Linear shares several structural similarities with Qwen3-Next. Both models rely on a hybrid attention strategy. Concretely, they combine lightweight linear attention with heavier full attention layers. Specifically, both use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there’s one block that uses full attention as shown in the figure below. Figure 11: Qwen3-Next and Kimi Linear side by side. Gated DeltaNet is a linear attention variant with inspiration from recurrent neural networks, including a gating mechanism from the Gated Delta Networks: Improving Mamba2 with Delta Rule paper. In a sense, Gated DeltaNet is a DeltaNet with Mamba-style gating, and DeltaNet is a linear attention mechanism (more on that in the next section) The MLA in Kimi Linear, depicted in the upper right box in the Figure 11 above, does not use the sigmoid gate.This omission was intentional so that the authors could compare the architecture more directly to standard MLA, however, they stated that they plan to add it in the future. Also note that the omission of the RoPE box in the Kimi Linear part of the figure above is intentional as well. Kimi applies NoPE (No Positional Embedding) in multi-head latent attention MLA) layers (global attention). As the authors state, this lets MLA run as pure multi-query attention at inference and avoids RoPE retuning for long‑context scaling (the positional bias is supposedly handled by the Kimi Delta Attention blocks). For more information on MLA, and multi-query attention, which is a special case of grouped-query attention, please see my The Big LLM Architecture Comparison article. 2.9 Kimi Delta Attention Kimi Linear modifies the linear attention mechanism of Qwen3-Next by the Kimi Delta Attention (KDA) mechanism, which is essentially a refinement of Gated DeltaNet. Whereas Qwen3-Next applies a scalar gate (one value per attention head) to control the memory decay rate, Kimi Linear replaces it with a channel-wise gating for each feature dimension. According to the authors, this gives more control over the memory, and this, in turn, improves long-context reasoning. In addition, for the full attention layers, Kimi Linear replaces Qwen3-Next’s gated attention layers (which are essentially standard multi-head attention layers with output gating) with multi-head latent attention (MLA). This is the same MLA mechanism used by DeepSeek V3/R1 (as discussed in my The Big LLM Architecture Comparison article) but with an additional gate. (To recap, MLA compresses the key/value space to reduce the KV cache size.) There’s no direct comparison to Qwen3-Next, but compared to the Gated DeltaNet-H1 model from the Gated DeltaNet paper (which is essentially Gated DeltaNet with sliding-window attention), Kimi Linear achieves higher modeling accuracy while maintaining the same token-generation speed. Figure 12: Annotated figure from the Kimi Linear paper (https://arxiv.org/abs/2510.26692) showing that Kimi Linear is as fast as GatedDeltaNet, and much faster than an architecture with multi-head latent attention (like DeepSeek V3/R1), while having a higher benchmark performance. Furthermore, according to the ablation studies in the DeepSeek-V2 paper , MLA is on par with regular full attention when the hyperparameters are carefully chosen. And the fact that Kimi Linear compares favorably to MLA on long-context and reasoning benchmarks makes linear attention variant once again promising for larger state-of-the-art models. That being said, Kimi Linear is 48B-parameter large, but it’s 20x smaller than Kimi K2. It will be interesting to see if the Kimi team adopts this approach for their upcoming K3 model. 2.10 The Future of Attention Hybrids Linear attention is not a new concept, but the recent revival of hybrid approaches shows that researchers are again seriously looking for practical ways to make transformers more efficient. For example Kimi Linear, compared to regular full attention, has a 75% KV cache reduction and up to 6x decoding throughput. What makes this new generation of linear attention variants different from earlier attempts is that they are now used together with standard attention rather than replacing it completely. Looking ahead, I expect that the next wave of attention hybrids will focus on further improving long-context stability and reasoning accuracy so that they get closer to the full-attention state-of-the-art. 3. Text Diffusion Models A more radical departure from the standard autoregressive LLM architecture is the family of text diffusion models. You are probably familiar with diffusion models, which are based on the Denoising Diffusion Probabilistic Models paper from 2020 for generating images (as a successor to generative adversarial networks) that was later implemented, scaled, and popularized by Stable Diffusion and others. Figure 13: Illustration of an image diffusion process from my very first Substack article in 2022. Here, Gaussian noise is added from left to right, and the model’s task is to learn how to remove the noise (from right to left). 3.1 Why Work on Text Diffusion? With the Diffusion‑LM Improves Controllable Text Generation paper in 2022, we also started to see the beginning of a trend where researchers started to adopt diffusion models for generating text. And I’ve seen a whole bunch of text diffusion papers in 2025. When I just checked my paper bookmark list, there are 39 text diffusion models on there! Given the rising popularity of these models, I thought it was finally time to talk about them. Figure 14: This section covers text diffusion models. So, what’s the advantage of diffusion models, and why are researchers looking into this as an alternative to traditional, autoregressive LLMs? Traditional transformer-based (autoregressive) LLMs generate one token at a time. For brevity, let’s refer to them simply as autoregressive LLMs . Now, the main selling point of text diffusion-based LLMs (let’s call them “diffusion LLMs”) is that they can generate multiple tokens in parallel rather than sequentially. Note that diffusion LLMs still require multiple denoising steps. However, even if a diffusion model needs, say, 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2,000 sequential generation steps to produce a 2,000-token response. 3.2 The Denoising Process The denoising process in a diffusion LLM, analogous to the denoising process in regular image diffusion models, is shown in the GIF below. (The key difference is that, instead of adding Gaussian noise to pixels, text diffusion corrupts sequences by masking tokens probabilistically.) For this experiment, I ran the 8B instruct model from the Large Language Diffusion Models (LLaDA) paper that came out earlier this year. Figure 15: Illustration of the denoising process using the 8B LLaDA model. As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer. If you are familiar with BERT and masked language modeling, you can think of this diffusion process as an iterative application of the BERT forward pass (where BERT is used with different masking rates). Architecture-wise, diffusion LLMs are usually decoder-style transformers but without the causal attention mask. For instance, the aforementioned LLaDA model uses the Llama 3 architecture. We call those architectures without a causal mask “bidirectional” as they have access to all sequence elements all at once. (Note that this is similar to the BERT architecture, which is called “encoder-style” for historical reasons.) So, the main difference between autoregressive LLMs and diffusion LLMs (besides removing the causal mask) is the training objective. Diffusion LLMs like LLaDA use a generative diffusion objective instead of a next-token prediction objective. In image models, the generative diffusion objective is intuitive because we have a continuous pixel space. For instance, adding Gaussian noise and learning to denoise are mathematically natural operations. Text, however, consists of discrete tokens, so we can’t directly add or remove “noise” in the same continuous sense. So, instead of perturbing pixel intensities, these diffusion LLMs corrupt text by progressively masking tokens at random, where each token is replaced by a special mask token with a specified probability. The model then learns a reverse process that predicts the missing tokens at each step, which effectively “denoises” (or unmasks) the sequence back to the original text, as shown in the animation in Figure 15 earlier. Explaining the math behind it would be better suited for a separate tutorial, but roughly, we can think about it as BERT extended into a probabilistic maximum-likelihood framework. 3.3 Autoregressive vs Diffusion LLMs Earlier, I said that what makes diffusion LLMs appealing is that they generate (or denoise) tokens in parallel instead of generating them sequentially as in a regular autoregressive LLM. This has the potential for making diffusion models more efficient than autoregressive LLMs. That said, the autoregressive nature of traditional LLMs is one of their key strengths, though. And the problem with pure parallel decoding can be illustrated with an excellent example from the recent ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper. Figure 16: Annotated figure from ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper (https://arxiv.org/abs/2510.04767) showing the issue with parallel decoding. For example, consider the following prompt: > “Pick a random city for travel: New York, New Orleans, Mexico City, or Panama > City?” Suppose we ask the LLM to generate a two-token answer. It might first sample the token “New” according to the conditional probability p(y t = ”New” | X). In the next iteration, it would then condition on the previously-generated token and likely choose “York” or “Orleans,” since both conditional probabilities p(y t+1 = ”York” | X, y t = ”New”) and p(y t+1 = ”Orleans” | X, y t = ”New”) are relatively high (because “New” frequently co-occurs with these continuations in the training set). But if instead both tokens were sampled in parallel, the model might independently select the two highest-probability tokens p(y t = “New” | X) and p(y {t+1} = “City” | X) leading to awkward outputs like “New City.” (This is because the model lacks autoregressive conditioning and fails to capture token dependencies.) In any case, the above is a simplification that makes it sound as if there is no conditional dependency in diffusion LLMs at all. This is not true. A diffusion LLM predicts all tokens in parallel, as said earlier, but the predictions are jointly dependent through the iterative refinement (denoising) steps. Here, each diffusion step conditions on the entire current noisy text. And tokens influence each other through cross-attention and self-attention in every step. So, even though all positions are updated simultaneously, the updates are conditioned on each other through shared attention layers. However, as mentioned earlier, in theory, 20-60 diffusion steps may be cheaper than the 2000 inference steps in an autoregressive LLM when generating a 2000-token answer. 3.4 Text Diffusion Today It’s an interesting trend that vision models adopt components from LLMs like attention and the transformer architecture itself, whereas text-based LLMs are getting inspired by pure vision models, implementing diffusion for text. Personally, besides trying a few demos, I haven’t used many diffusion models yet, but I consider it a trade-off. If we use a low number of diffusion steps, we generate the answer faster but may produce an answer with degraded quality. If we increase the diffusion steps to generate better answers, we may end up with a model that has similar costs to an autoregressive one. To quote the authors of the ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs paper: [...] we systematically analyse both [diffusion LLMs] and autoregressive LLMs, revealing that: (i) [diffusion LLMs] under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speed-up without compromising quality. Additionally, another particular downside I see is that diffusion LLMs cannot use tools as part of their chain because there is no chain. Maybe it’s possible to interleave them between diffusion steps, but I assume this is not trivial. (Please correct me if I am wrong.) In short, it appears that diffusion LLMs are an interesting direction to explore, but for now, they may not replace autoregressive LLMs. However, I can see them as interesting alternatives to smaller, on-device LLMs, or perhaps replacing smaller, distilled autoregressive LLMs. For instance, Google announced that it is working on a Gemini Diffusion model for text, where they state Rapid response: Generates content significantly faster than even our fastest model so far. And while being faster, it appears that the benchmark performance remains on par with their fast Gemini 2.0 Flash-Lite model. It will be interesting to see what the adoption and feedback will be like once the model is released and users try it on different tasks and domains. Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities. 4. World Models So far, we discussed approaches that focused on improving efficiency and making models faster or more scalable. And these approaches usually come at a slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an “understanding of the world.” World models have traditionally been developed independently of language modeling, but the recent Code World Models paper in September 2025 has made them directly relevant in this context for the first time. Ideally, similar to the other topics of this article, world models are a whole dedicated article (or book) by themselves. However, before we get to the Code World Models (CWM) paper, let me provide at least a short introduction to world models. 4.1 The Main Idea Behind World Models Originally, the idea behind world models is to model outcomes implicitly, i.e., to anticipate what might happen next without those outcomes actually occurring (as illustrated in the figure below). It is similar to how the human brain continuously predicts upcoming events based on prior experience. For example, when we reach for a cup of coffee or tea, our brain already predicts how heavy it will feel, and we adjust our grip before we even touch or lift the cup. Figure 18: Conceptual overview of a world model system. The agent interacts with the environment by observing its current state(t) and taking action(t) to achieve a given objective. In parallel, the agent learns an internal world mode l , which serves as a mental simulation of the environment, which allows it to predict outcomes and plan actions before executing them in the real world. The term “world model”, as far as I know, was popularized by Ha and Schmidhuber’s 2018 paper of the same name: World Models , which used a VAE plus RNN architecture to learn an internal environment simulator for reinforcement learning agents. (But the term or concept itself essentially just refers to modeling a concept of a world or environment, so it goes back to reinforcement learning and robotics research in the 1980s.) To be honest, I didn’t have the new interpretation of world models on my radar until Yann LeCun’s 2022 article A Path Towards Autonomous Machine Intelligence . It was essentially about mapping an alternative path to AI instead of LLMs. 4.2 From Vision to Code That being said, world model papers were all focused on vision domains and spanned a wide range of architectures: from early VAE- and RNN-based models to transformers, diffusion models, and even Mamba-layer hybrids. Now, as someone currently more focused on LLMs, the Code World Model paper (Sep 30, 2025) is the first paper to capture my full attention (no pun intended). This is the first world model (to my knowledge) that maps from text to text (or, more precisely, from code to code). CWM is a 32-billion-parameter open-weight model with a 131k-token context window. Architecturally, it is still a dense decoder-only Transformer with sliding-window attention. Also, like other LLMs, it goes through pre-training, mid-training, supervised fine-tuning (SFT), and reinforcement learning stages, but the mid-training data introduces the world-modeling component. 4.3 Code World Models Vs Regular LLMs for Code So, how does this differ from a regular code LLM such as Qwen3-Coder ? Regular models like Qwen3-Coder are trained purely with next-token prediction. They learn patterns of syntax and logic to produce plausible code completions, which gives them a static text-level understanding of programming. CWM, in contrast, learns to simulate what happens when the code runs. It is trained to predict the resulting program state, such as the value of a variable, after performing an action like modifying a line of code, as shown in the figure below. Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code’s behavior . Annotated figure from https://www.arxiv.org/abs/2510.02387. At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text. So, I would maybe not call it a world model, but a world model-augmented LLM. For a first attempt, it performs surprisingly well, and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time-scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller. Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort). Figure 20: Performance of the code world model (CWM) compared to other popular LLMs on a coding benchmark (SWE-bench). Annotated figure from https://www.arxiv.org/abs/2510.02387. 5. Small Recursive Transformers You may have noticed that all previous approaches still build on the transformer architecture. The topic of this last section does too, but in contrast to the models we discussed earlier, these are small, specialized transformers designed for reasoning. Yes, reasoning-focused architectures don’t always have to be large. In fact, with the Hierarchical Reasoning Model (HRM) a new approach to small recursive transformers has recently gained a lot of attention in the research community. Figure 21: LLM landscape overview; this section small recursive transformers. More specifically, the HRM developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge. Figure 22: Example ARC-AGI 1 task (top) from arcprize.org/arc-agi/1 and the Hierarchical Reasoning Model (HRM) ranked on the leaderboard (bottom) from arcprize.org/blog/hrm-analysis. The idea behind recursive models like HRM is that instead of producing an answer in one forward pass, the model repeatedly refines its own output in a recursive fashion. (As part of this process, each iteration refines a latent representation, which the authors see as the model’s “thought” or “reasoning” process.) The first major example was HRM earlier in the summer, followed by the Mixture-of-Recursions (MoR) paper . And most recently, Less is More: Recursive Reasoning with Tiny Networks (October 2025) proposes the Tiny Recursive Model (TRM, illustrated in the figure below), which is a simpler and even smaller model (7 million parameters, about 4× smaller than HRM) that performs even better on the ARC benchmark. Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871. In the remainder of this section, let’s take a look at TRM in a bit more detail. 5.1 What Does Recursion Mean Here? TRM refines its answer through two alternating updates: It computes a latent reasoning state from the current question and answer. It then updates the answer based on that latent state. Figure 24: Performance comparison of the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The paper included a surprising number of ablation studies, which yielded some interesting additional insights. Here are two that stood out to me: Fewer layers leads to better generalization. Reducing from 4 to 2 layers improved Sudoku accuracy from 79.5% to 87.4%. Attention is not required . Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%). But this is only feasible here because the context is small and fixed-length.

Python

Machine Learning

1 views

Ahead of AI 1 months ago

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

How do we actually evaluate LLMs? It’s a simple question, but one that tends to open up a much bigger discussion. When advising or collaborating on projects, one of the things I get asked most often is how to choose between different models and how to make sense of the evaluation results out there. (And, of course, how to measure progress when fine-tuning or developing our own.) Since this comes up so often, I thought it might be helpful to share a short overview of the main evaluation methods people use to compare LLMs. Of course, LLM evaluation is a very big topic that can’t be exhaustively covered in a single resource, but I think that having a clear mental map of these main approaches makes it much easier to interpret benchmarks, leaderboards, and papers. I originally planned to include these evaluation techniques in my upcoming book, Build a Reasoning Model (From Scratch) , but they ended up being a bit outside the main scope. (The book itself focuses more on verifier-based evaluation.) So I figured that sharing this as a longer article with from-scratch code examples would be nice. In Build A Reasoning Model (From Scratch) , I am taking a hands-on approach to building a reasoning LLM from scratch. If you liked “Build A Large Language Model (From Scratch)”, this book is written in a similar style in terms of building everything from scratch in pure PyTorch. Reasoning is one of the most exciting and important recent advances in improving LLMs, but it’s also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book , I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early-access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live. PS: There’s a lot happening on the LLM research front right now. I’m still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let’s discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses. There are four common ways of evaluating trained LLMs in practice: multiple choice , verifiers , leaderboards , and LLM judges , as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories. Figure 1: An overview of the 4 different evaluations models covered in this article. Furthermore the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation , as shown in the figure above. (There are also other measures, such as training loss, perplexity , and rewards , but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods. We begin with a benchmark‑based method: multiple‑choice question answering. Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu ). To illustrate this approach, figure 2 shows a representative task from the MMLU dataset. Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks, such as MMLU, test an LLM’s knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests. Note that figure 2 shows a simplified version of multiple-choice evaluation, where the model’s predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring . I implemented them here on GitHub . (As this builds on the concepts explained here, I recommended checking this out after completing this article.) The following subsections illustrate how the MMLU scoring shown in figure 2 can be implemented in code. First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub . Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience. Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. For this, we will work with an example from the MMLU dataset: Next, we define a function to format the LLM prompts. Let’s execute the function on the MMLU example to get an idea of what the formatted LLM input looks like: The output is: How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes? The model prompt, as shown above, provides the model with a list of the different answer choices and ends with an text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required. You can load examples from the MMLU dataset directly via the datasets library (which can be installed or ): Above, we used the subset; to get a list of the other subsets, use the following code: Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM: Then, with all that setup out of the way, we define the main scoring function below, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of letter A/B/C/D that the model prints. We can then check the generated letter using the function from the code block above as follows: The result is: As we can see, the generated answer is incorrect ( ) in this case. This was just one of the 270 examples from the subset in MMLU. The screenshot (Figure 4) below show’s the performance of the base model and reasoning variant when executed on the complete subset. The code for this is available here on GitHub . Figure 4: Base and reasoning model performance on the MMLU subset Assuming the questions have an equal answer probability, a random guesser (with uniform probability choosing A, B, C, or D) is expected to achieve 25% probability. So the both the base and reasoning model are not very good. Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model’s predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input. Figure 5: Other MMLU scoring methods are described and shared on GitHub here However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple‑choice benchmarks like MMLU is that they only measure an LLM’s ability to select from predefined options and thus is not very useful for evaluating reasoning capabilities besides checking if and how much knowledge the model has forgotten compared to the base model. It does not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn’t necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps. Related to multiple-choice question answering discussed in the previous section, verification-based approaches quantify the LLMs capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below. Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset. When we compare the extracted answer with the provided answer, as shown in figure above, we can employ external tools, such as code interpreters or calculator-like tools/software. The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development. I wrote a comprehensive 35-page on this topic in my “Build a Reasoning Model (From Scratch)” book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you’ll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub .) Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, none of the aforementioned methods evaluate LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely, LLM leaderboards. Figure 8: A mental model of the topics covered in this book with a focus on the judgment- and benchmark-based evaluation methods covered in this appendix. Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous section, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs. A popular leaderboard is LM Arena (formerly Chatbot Arena ), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9. Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer. These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10. Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9. The list below represents pairwise votes where the first model is the winner: In the list above, each tuple in the votes list represents a pairwise preference between two models, written as . So, means that a user preferred GPT-5 over a Claude-3 model answer. In the remainder of this section, we will turn the list into a leaderboard. For this, we will use the popular Elo rating system , which was originally developed for ranking chess players. Before we look at the concrete code implementation, in short, it works as follows. Each model starts with a baseline score. Then, after each comparison and the preference vote, the model’s rating is updated. (In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers a current model over a highly ranked model, the current model will get a relatively large ranking update and rank higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with ranking points getting subtracted instead of added.) The code to turn these pairwise rankings into a leaderboard is shown in the code block below. The function defined above takes the votes as input and turns it into a leaderboard, as follows: This results in the following leaderboard ranking, where the higher the score, the better: So, how does this work? For each pair, we compute the expected score of the winner using the following formula: This value is the model’s predicted chance to win in a no-draw setting based on the current ratings. It determines how large the rating update is. First, each model starts at . If the two ratings (winner and loser) are equal, we have , which indicates an even match. In this case, the updates are: Now, if a heavy favorite (a model with a high rating) wins, we have . The favorite gains only a small amount and the loser loses only a little: However, if an underdog (a model with a low rating) wins, we have , and the winner gets almost the full points while the loser loses about the same magnitude: The Elo approach updates ratings after each match (model comparisons), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can happen especially when an upset happens early versus late. To reduce this order effect, we can shuffle the votes pairs and run the function multiple times and average the ratings. Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development. The LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley–Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to the Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term “Elo” remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub . Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub . Method 4: Judging responses with other LLMs In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU , which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don’t account for synonyms, word changes, and so on. One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons as it involves human feedback (as well as the challenges that are associated with collecting this feedback). A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM’s response to a reference response and judge the response quality based on a pre-defined rubric, as illustrated in Figure 12. Figure F12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer. In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (E.g., one of the many examples is Phudge ; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is also that evaluating an answer is often easier than generating one. To implement a judge-based model evaluation as shown in Figue 12 programmatically in Python, we could either load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate. Alternatively, we can use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of the section, we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion parameter gpt-oss open-weight model by OpenAI as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article: Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text using LLMs (inference) and does not support training or fine-tuning LLMs. To execute the following code, please install Ollama by visiting the official website at https://ollama.com and follow the provided instructions for your operating system: For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select “yes.” For Linux users: Use the installation command available on the Ollama website. Before implementing the model evaluation code, let’s first download the gpt-oss model and verify that Ollama is functioning correctly by using it from the command line terminal. Execute the following command on the command line (not in a Python session) to try out the 20 billion parameter gpt-oss model: The first time you execute this command, the 20 billion parameter gpt-oss model, which takes up 14 GB of storage space, will be automatically downloaded. The output looks as follows: Note that the gpt-oss:20b in the ollama run gpt-oss:20b command refers to the 20 billion parameter gpt-oss model. Using Ollama with the gpt-oss:20b model requires approximately 13 GB of RAM. If your machine does not have sufficient RAM, you can try using a smaller model, such as the 4 billion parameter qwen3:4b model via ollama run qwen3:4b, which only requires around 4 GB of RAM. For more powerful computers, you can also use the larger 120-billion parameter gpt-oss model by replacing gpt-oss:20b with gpt-oss:120b. However, keep in mind that this model requires significantly more computational resources. Once the model download is complete, we are presented with a command-line interface that allows us to interact with the model. For example, try asking the model, “What is 1+2?”: You can end this ollama run gpt-oss:20b session using the input . You can end this ollama run gpt-oss:20b session using the input /bye. In the remainder of this section, we will use the ollama API. This approach requires that Ollama is running in the background. There are three different options to achieve this: 1. Run the command in the terminal (recommended). This runs the Ollama backend as a server, usually on . Note that it doesn’t load a model until it’s called through the API (later in this section). 2. Run the command similar to earlier, but keep it open and don’t exit the session via . As discussed earlier, this opens a minimal convenience wrapper around a local Ollama server. Behind the scenes, it uses the same server API as ollama serve. 3. Ollama desktop app. Opening the desktop app runs the same backend automatically and provides a graphical interface on top of it as shown in Figure 12 earlier. Figure 13: Two different options to keep the Ollama server (/application) running so we can use it via the Ollama API in Python. Ollama runs locally on our machine by starting a local server-like process. When running ollama serve in the terminal, as described above, you may encounter an error message saying . If that’s the case, try use the command (and if this address is also in use, try to increment the numbers by one until you find an address not in use.) The following code verifies that the Ollama session is running properly before we use Ollama to evaluate the test set responses generated in the previous section: Ensure that the output from executing the previous code displays Ollama running: . If it shows , please verify that the command or the Ollama application is actively running (see Figure 13). In the remainder of this article, we will interact with the local gpt-oss model, running on our machine, through the Ollama REST API using Python. The following function demonstrates how to use the API: Here’s an example of how to use the function that we just implemented: The resulting response is “3”. (It differs from what we’d get if we ran Ollama run or the Ollama application due to different default settings.) Using the function, we can evaluate the responses generated by our model with a prompt that includes a grading rubric asking the gpt-oss model to rate our target model’s responses on a scale from 1 to 5 based on a correct answer as a reference. The prompt we use for this is shown below: The in the is intended to represent the response produced by our own model in practice. For illustration purposes, we hardcode a plausible model answer here rather than generating it dynamically. (However, feel free to use the Qwen3 model we loaded at the beginning of this article to generate a real ). Next, let’s generate the rendered prompt for the Ollama model: The output is as follows: Ending the prompt in incentivizes the model to generate the answer. Let’s see how the gpt-oss:20b model judges the response: The response is as follows: As we can see, the answer receives the highest score, which is reasonable, as it is indeed correct. While this was a simple example stepping through the process manually, we could take this idea further and implement a for-loop that iteratively queries the model (for example, the Qwen3 model we loaded earlier) with questions from an evaluation dataset and evaluate it via gpt-oss and calculate the average score. You can find an implementation of such a script where we evaluate the Qwen3 model on the MATH-500 dataset here on GitHub . Figure 14: A comparison of the Qwen3 0.6 base and reasoning variants on the first 10 examples in MATH-500 evaluated by gpt-oss:20b as a judge. You can find the code here on GitHub . Related to symbolic verifiers and LLM judges, there is a class of learned models called process reward models (PRMs). Like judges, PRMs can evaluate reasoning traces beyond just the final answer, but unlike general judges, they focus specifically on the intermediate steps of reasoning. And unlike verifiers, which check correctness symbolically and usually only at the outcome level, PRMs provide step-by-step reward signals during training in reinforcement learning. We can categorize PRMs as “step-level judges,” which are predominantly developed for training, not pure evaluation. (In practice, PRMs are difficult to train reliably at scale. For example, DeepSeek R1 did not adopt PRMs and instead combined verifiers for the reasoning training.) Judge-based evaluations offer advantages over preference-based leaderboards, including scalability and consistency, as they do not rely on large pools of human voters. (Technically, it is possible to outsource the preference-based rating behind leaderboards to LLM judges as well). However, LLM judges also share similar weaknesses with human voters: results can be biased by model preferences, prompt design, and answer style. Also, there is a strong dependency on the choice of judge model and rubric, and they lack the reproducibility of fixed benchmarks. In this article, we covered four different evaluation approaches: multiple choice, verifiers, leaderboards, and LLM judges. I know this was a long article, but I hope you found it useful for getting an overview of how LLMs are evaluated. A from-scratch approach like this can be verbose, but it is a great way to understand how these methods work under the hood, which in turn helps us identify weaknesses and areas for improvement. That being said, you are probably wondering, “What is the best way to evaluate an LLM?” Unfortunately, there is no single best method since, as we have seen, each comes with different trade-offs. In short: Multiple-choice (+) Relatively quick and cheap to run at scale (+) Standardized and reproducible across papers (or model cards) (-) Measures basic knowledge recall (-) Does not reflect how LLMs are used in the real world Verifiers (+) Standardized, objective grading for domains with ground truth (+) Allows free-form answers (with some constraints on final answer formatting) (+) Can also score intermediate steps if using process verifiers or process reward models (-) Requires verifiable domains (for example, math or code), and building good verifiers can be tricky (-) Outcome-only verifiers evaluate only the final answer, not reasoning quality Arena-style leaderboards (human pairwise preference) (+) Directly answers “Which model do people prefer?” on real prompts (+) Allows free-form answers and implicitly accounts for style, helpfulness, and safety (-) Expensive and time-intensive for humans (-) Does not measure correctness, only preference (-) Nonstationary populations can affect stability LLM-as-a-judge (+) Scalable across many tasks (+) Allows free-form answers (-) Dependent on the judge’s capability (ensembles can make this more robust) (-) Depends on rubric choice While I am usually not a big fan of radar plots, one can be helpful here to visualize these different evaluation areas, as shown below. Figure 15: A radar chart showing conceptually that we ideally want to pay attention to different areas when evaluating an LLM to identify its strengths and weaknesses. For instance, a strong multiple-choice rating suggests that the model has solid general knowledge. Combine that with a strong verifier score, and the model is likely also answering technical questions correctly. However, if the model performs poorly on LLM-as-a-judge and leaderboard evaluations, it may struggle to write or articulate responses effectively and could benefit from some RLHF. So, the best evaluation combines multiple areas. But ideally it also uses data that directly aligns with your goals or business problems. For example, suppose you are implementing an LLM to assist with legal or law-related tasks. It makes sense to run the model on standard benchmarks like MMLU as a quick sanity check, but ultimately you will want to tailor the evaluations to your target domain, such as law. You can find public benchmarks online that serve as good starting points, but in the end, you will want to test with your own proprietary data. Only then can you be reasonably confident that the model has not already seen the test data during training. In any case, model evaluation is a very big and important topic. I hope this article was useful in explaining how the main approaches work, and that you took away a few useful insights for the next time you look at model evaluations or run them yourself. As always, Happy tinkering! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Reasoning is one of the most exciting and important recent advances in improving LLMs, but it’s also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. So, in this book , I am taking a hands-on approach to building a reasoning LLM from scratch. The book is currently in early-access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live. PS: There’s a lot happening on the LLM research front right now. I’m still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let’s discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses. Understanding the main evaluation methods for LLMs There are four common ways of evaluating trained LLMs in practice: multiple choice , verifiers , leaderboards , and LLM judges , as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories. Figure 1: An overview of the 4 different evaluations models covered in this article. Furthermore the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation , as shown in the figure above. (There are also other measures, such as training loss, perplexity , and rewards , but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods. Method 1: Evaluating answer-choice accuracy We begin with a benchmark‑based method: multiple‑choice question answering. Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu ). To illustrate this approach, figure 2 shows a representative task from the MMLU dataset. Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks, such as MMLU, test an LLM’s knowledge recall in a straightforward, quantifiable way similar to standardized tests, many school exams, or theoretical driving tests. Note that figure 2 shows a simplified version of multiple-choice evaluation, where the model’s predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring . I implemented them here on GitHub . (As this builds on the concepts explained here, I recommended checking this out after completing this article.) The following subsections illustrate how the MMLU scoring shown in figure 2 can be implemented in code. 1.2 Loading the model First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub . Instead of copy & pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via or Code block 1: Loading a pre-trained model 1.3 Checking the generated answer letter In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience. Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset. For this, we will work with an example from the MMLU dataset: Next, we define a function to format the LLM prompts. Code block 2: Loading a pre-trained model Let’s execute the function on the MMLU example to get an idea of what the formatted LLM input looks like: The output is: How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes? The model prompt, as shown above, provides the model with a list of the different answer choices and ends with an text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required. Loading different MMLU samples You can load examples from the MMLU dataset directly via the datasets library (which can be installed or ): Above, we used the subset; to get a list of the other subsets, use the following code: Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM: Then, with all that setup out of the way, we define the main scoring function below, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of letter A/B/C/D that the model prints. Code block 3: Extracting the generated letter We can then check the generated letter using the function from the code block above as follows: The result is: As we can see, the generated answer is incorrect ( ) in this case. This was just one of the 270 examples from the subset in MMLU. The screenshot (Figure 4) below show’s the performance of the base model and reasoning variant when executed on the complete subset. The code for this is available here on GitHub . Figure 4: Base and reasoning model performance on the MMLU subset Assuming the questions have an equal answer probability, a random guesser (with uniform probability choosing A, B, C, or D) is expected to achieve 25% probability. So the both the base and reasoning model are not very good. Multiple-choice answer formats Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model’s predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input. Figure 5: Other MMLU scoring methods are described and shared on GitHub here However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple‑choice benchmarks like MMLU is that they only measure an LLM’s ability to select from predefined options and thus is not very useful for evaluating reasoning capabilities besides checking if and how much knowledge the model has forgotten compared to the base model. It does not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn’t necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps. Method 2: Using verifiers to check answers Related to multiple-choice question answering discussed in the previous section, verification-based approaches quantify the LLMs capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below. Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset. When we compare the extracted answer with the provided answer, as shown in figure above, we can employ external tools, such as code interpreters or calculator-like tools/software. The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development. I wrote a comprehensive 35-page on this topic in my “Build a Reasoning Model (From Scratch)” book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you’ll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub .) Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub Method 3: Comparing models using preferences and leaderboards So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, none of the aforementioned methods evaluate LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely, LLM leaderboards. Figure 8: A mental model of the topics covered in this book with a focus on the judgment- and benchmark-based evaluation methods covered in this appendix. Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous section, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs. A popular leaderboard is LM Arena (formerly Chatbot Arena ), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9. Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer. These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10. Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9. The list below represents pairwise votes where the first model is the winner: In the list above, each tuple in the votes list represents a pairwise preference between two models, written as . So, means that a user preferred GPT-5 over a Claude-3 model answer. In the remainder of this section, we will turn the list into a leaderboard. For this, we will use the popular Elo rating system , which was originally developed for ranking chess players. Before we look at the concrete code implementation, in short, it works as follows. Each model starts with a baseline score. Then, after each comparison and the preference vote, the model’s rating is updated. (In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers a current model over a highly ranked model, the current model will get a relatively large ranking update and rank higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with ranking points getting subtracted instead of added.) The code to turn these pairwise rankings into a leaderboard is shown in the code block below. Code block 4: Constructing a leaderboard The function defined above takes the votes as input and turns it into a leaderboard, as follows: This results in the following leaderboard ranking, where the higher the score, the better: So, how does this work? For each pair, we compute the expected score of the winner using the following formula: This value is the model’s predicted chance to win in a no-draw setting based on the current ratings. It determines how large the rating update is. First, each model starts at . If the two ratings (winner and loser) are equal, we have , which indicates an even match. In this case, the updates are: Now, if a heavy favorite (a model with a high rating) wins, we have . The favorite gains only a small amount and the loser loses only a little: However, if an underdog (a model with a low rating) wins, we have , and the winner gets almost the full points while the loser loses about the same magnitude: Order matters The Elo approach updates ratings after each match (model comparisons), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can happen especially when an upset happens early versus late. To reduce this order effect, we can shuffle the votes pairs and run the function multiple times and average the ratings. Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development. Other ranking methods The LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley–Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to the Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term “Elo” remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub . Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub . Method 4: Judging responses with other LLMs In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU , which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don’t account for synonyms, word changes, and so on. One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons as it involves human feedback (as well as the challenges that are associated with collecting this feedback). A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM’s response to a reference response and judge the response quality based on a pre-defined rubric, as illustrated in Figure 12. Figure F12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer. In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (E.g., one of the many examples is Phudge ; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is also that evaluating an answer is often easier than generating one. To implement a judge-based model evaluation as shown in Figue 12 programmatically in Python, we could either load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate. Alternatively, we can use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of the section, we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion parameter gpt-oss open-weight model by OpenAI as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article: 4.1 Implementing a LLM-as-a-judge approach in Ollama Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text using LLMs (inference) and does not support training or fine-tuning LLMs. To execute the following code, please install Ollama by visiting the official website at https://ollama.com and follow the provided instructions for your operating system: For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select “yes.” For Linux users: Use the installation command available on the Ollama website.

Tutorial

Machine Learning

C++

Python

0 views

Ahead of AI 2 months ago

Understanding and Implementing Qwen3 From Scratch

Previously, I compared the most notable open-weight architectures of 2025 in The Big LLM Architecture Comparison . Then, I zoomed in and discussed the various architecture components in From GPT-2 to gpt-oss: Analyzing the Architectural Advances on a conceptual level. Since all good things come in threes, before covering some of the noteworthy research highlights of this summer, I wanted to now dive into these architectures hands-on, in code. By following along, you will understand how it actually works under the hood and gain building blocks you can adapt for your own experiments or projects. For this, I picked Qwen3 ( initially released in May and updated in July) because it is one of the most widely liked and used open-weight model families as of this writing. The reasons why Qwen3 models are so popular are, in my view, as follows: A developer- and commercially friendly open-source ( Apache License v2.0 ) without any strings attached beyond the original open-source license terms (some other open-weight LLMs impose additional usage limits) The performance is really good; for example, as of this writing, the open-weight 235B-Instruct variant is ranked 8 on the LMArena leaderboard , tied with the proprietary Claude Opus 4. The only 2 other open-weight LLMs that rank higher are DeepSeek 3.1 (3x larger) and Kimi K2 (4x larger). On September 5th, Qwen3 released a 1T parameter “max” variant on their platform that beats Kimi K2, DeepSeek 3.1, and Claude Opus 4 on all major benchmarks; however, this model is closed-source for now. There are many different model sizes available for different compute budgets and use-cases, from 0.6B dense models to 480B parameter Mixture-of-Experts models. This is going to be a long article due to the from-scratch code in pure PyTorch. While the code sections may look verbose, I hope that they help explain the building blocks better than conceptual figures alone! Tip 1: If you are reading this article in your email inbox, the narrow line width may cause code snippets to wrap awkwardly. For a better experience, I recommend opening it in your web browser . Tip 2: You can use the table of contents on the left side of the website for easier navigation between sections. Figure 1: Preview of the Qwen3 Dense and Mixture-of-Experts architectures discussed and (re)implemented in pure PyTorch in this article. A developer- and commercially friendly open-source ( Apache License v2.0 ) without any strings attached beyond the original open-source license terms (some other open-weight LLMs impose additional usage limits) The performance is really good; for example, as of this writing, the open-weight 235B-Instruct variant is ranked 8 on the LMArena leaderboard , tied with the proprietary Claude Opus 4. The only 2 other open-weight LLMs that rank higher are DeepSeek 3.1 (3x larger) and Kimi K2 (4x larger). On September 5th, Qwen3 released a 1T parameter “max” variant on their platform that beats Kimi K2, DeepSeek 3.1, and Claude Opus 4 on all major benchmarks; however, this model is closed-source for now. There are many different model sizes available for different compute budgets and use-cases, from 0.6B dense models to 480B parameter Mixture-of-Experts models.

Tutorial

0 views

Ahead of AI 3 months ago

From GPT-2 to gpt-oss: Analyzing the Architectural Advances

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later). This is the first time since GPT-2 that OpenAI has shared a large, fully open-weight model. Earlier GPT models showed how the transformer architecture scales. The 2022 ChatGPT release then made these models mainstream by demonstrating concrete usefulness for writing and knowledge (and later coding) tasks. Now they have shared some long-awaited weight model, and the architecture has some interesting details. I spent the past few days reading through the code and technical reports to summarize the most interesting details. (Just days after, OpenAI also announced GPT-5, which I will briefly discuss in the context of the gpt-oss models at the end of this article.) Below is a quick preview of what the article covers. For easier navigation, I recommend using the Table of Contents on the left of on the article page. Model architecture comparisons with GPT-2 MXFP4 optimization to fit gpt-oss models onto single GPUs Width versus depth trade-offs (gpt-oss vs Qwen3) Attention bias and sinks Benchmarks and comparisons with GPT-5 I hope you find it informative! Before we discuss the architecture in more detail, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, shown in Figure 1 below. Figure 1: The two gpt-oss models side by side. If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance. This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because There is significant rotation of employees between these labs. We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone kindly pointed out that there's a higher-ranking hybrid model: Hunyuan-TurboS at rank 22.) Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes. That being said, there are still many interesting aspects of their design choices. Some are shown in the figure above (while others are not, but we will discuss them later as well). In the rest of this article, I will highlight these features and compare them to other architectures, one at a time. I should also note that I am not affiliated with OpenAI in any way. My information comes from reviewing the released model code and reading their technical reports. If you want to learn how to use these models locally, the best place to start is OpenAI's official model hub pages: https://huggingface.co/openai/gpt-oss-20b https://huggingface.co/openai/gpt-oss-120b The 20B model can run on a consumer GPU with up to 16 GB of RAM. The 120B model can run on a single H100 with 80 GB of RAM or newer hardware. I will return to this later, as there are some important caveats. Before we jump into comparisons between gpt-oss and a more recent architecture, let's hop into the time machine and take a side-by-side look at GPT-2 (Figure 2) to see just how far things have come. Figure 2: A side-by-side comparison between gpt-oss-20b and GPT-2 XL 1.5B. Both gpt-oss and GPT-2 are decoder-only LLMs built on the transformer architecture introduced in the Attention Is All You Need (2017) paper. Over the years, many details have evolved. However, these changes are not unique to gpt-oss. And as we will see later, they appear in many other LLMs. Since I discussed many of these aspects in the previous Big Architecture Comparison article, I will try to keep each subsection brief and focused. Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). Figure 3: An illustration of dropout applied to the attention score matrix. I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. Interestingly, while Dropout is kind of ignored in LLM architecture design for many years, I found a 2025 research paper with small scale LLM experiments (Pythia 1.4B) that confirms that Dropout results in worse downstream performance in these single-epoch regimes. In transformer-based LLMs, positional encoding is necessary because of the attention mechanism. By default, attention treats the input tokens as if they have no order. In the original GPT architecture, absolute positional embeddings addressed this by adding a learned embedding vector for each position in the sequence (Figure 4), which is then added to the token embeddings. Figure 4: Illustration of absolute positional embeddings. RoPE ( Rotary Position Embedding ) introduced a different approach: instead of adding position information as separate embeddings, it encodes position by rotating the query and key vectors in a way that depends on each token's position. (RoPE is an elegant idea but also a bit of a tricky topic to explain. I plan to cover separately in more detail one day.) While first introduced in 2021, RoPE became widely adopted with the release of the original Llama model in 2023 and has since become a staple in modern LLMs. Early GPT architectures used GELU. Why now use Swish over GELU? Swish (also referred to as sigmoid linear unit or SiLU) is considered computationally slightly cheaper, and in my opinion, that all there is to it. Depending on which paper you look at, you will find that one is slightly better than the other in terms of modeling performance. In my opinion, these small differences are probably within a standard error, and your mileage will vary based on hyperparameter sensitivity. Activation functions used to be a hot topic of debate until the deep learning community largely settled on ReLU more than a decade ago. Since then, researchers have proposed and tried many ReLU-like variants with smoother curves, and GELU and Swish (Figure 5) are the ones that stuck. Figure 5: Comparison between Swish and GELU activations, which are both smoother versions or ReLU. Early GPT architectures used GELU, which is defined as . Here, (short for error function) is the integral of a Gaussian and it is computed using polynomial approximations of the Gaussian integral, which makes it more computationally expensive than simpler functions like the sigmoid used in Swish, where Swish is simply . In practice, Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models. Depending on which paper we look at, one might be somewhat better in terms of modeling performance. But I'd say these gains are often within standard error, and the winner will depend heavily on hyperparameter tuning. Swish is used in most architectures today. However, GELU is not entirely forgotten; for example, Google's Gemma models still use GELU. What's more notable, though, is that the feed forward module (a small multi-layer perceptron) is replaced by a gated "GLU" counterpart, where GLU stands for gated linear unit and was proposed in a 2020 paper . Concretely, the 2 fully connected layers are replaced by 3 fully connected layers that are used as shown in Figure 6 below. Figure 6: A comparison between Swish and GELU and their gated counterparts, SwiGLU and GEGLU. At first glance, it may appear that the GEGLU/SwiGLU variants may be better than the regular feed forward layers because there are simply more parameters due to the extra layer. But this is deceiving because in practice, the and weight layers in SwiGLU/GEGLU are usually chosen to be half the size each of the layer in a traditional feed forward layer. To illustrate this better, consider the concrete code implementations of the regular and GLU variants: Figure 7: Regular feed forward module (top) and SwiGLU variant (bottom) next to each other. Note that the Swish function is implemented as “silu” in PyTorch. So, suppose we have an embedding dimension of 1024. In the regular feed forward case, this would then be fc1: 1024 × 4096 = 4,194,304 fc2: 1024 × 4096 = 4,194,304 That is fc1 + fc2 = 8,388,608 parameters. For the GLU variant, we have fc1: 1024 × 1024 = 1,048,576 fc2: 1024 × 1024 = 1,048,576 fc3: 1024 × 1024 = 1,048,576 I.e., 3 × 1,048,576 = 3,145,728 weight parameters. So, overall, using the GLU variants results in fewer parameters, and they perform better as well. The reason for this better performance is that these GLU variants provide an additional multiplicative interaction, which improves expressivity (the same reason deep & slim neural nets perform better than shallow & wide neural nets, provided they are trained well). In addition to upgrading the feed forward module to a SwiGLU, as discussed in the previous section, gpt-oss replaces the single feed forward module with multiple feed forward modules, using only a subset for each token generation step. This approach is known as a Mixture-of-Experts (MoE) and illustrated in Figure 8 below. Figure 8: The feed forward module is replaced by a Mixture-of-Expert (MoE) module. So, replacing a single feed forward module with multiple feed forward modules (as done in a MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. Because only a few experts are active at a time, MoE modules are often referred to as sparse , in contrast to dense modules that always use the full parameter set. However, the large total number of parameters via an MoE increases the capacity of the LLM, which means it can take up more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. (Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.) As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA). In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections. For example, as shown in Figure 9, if there are 2 key–value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency without noticeably affecting modeling performance, according to ablation studies. Figure 9: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries. So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here .) While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper ) show it performs comparably to standard MHA in terms of LLM modeling performance. Sliding-window attention (Figure 10 below) was first introduced in the LongFormer paper (2020) and later popularized by Mistral. Interestingly, gpt-oss applies it in every second layer. You can think of it as a variation of multi-head attention, or in this case grouped query attention (GQA), where the attention context is restricted to a smaller window, reducing both memory usage and compute costs. Figure 10: Comparison between regular attention (left) and sliding window attention (right). Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. As I discussed in my previous article , Gemma 2 (2024) used a similar 1:1 ratio. Gemma 3 earlier this year went much further and shifted to a 5:1 ratio, which means only one full-attention layer for every five sliding-window (local) attention layers. According to the Gemma ablation studies, sliding-window attention has minimal impact on modeling performance, as shown in the figure below. Note that the window size in Gemma 2 was 4096 tokens, which Gemma 3 reduced to 1024. In gpt-oss, the window is just 128 tokens, which is remarkably small. And as a fun fact, the official announcement article notes that sliding-window attention was apparently already used in GPT-3: The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3 Who knew!? I went back to the original GPT-3 paper , and it was indeed mentioned there: We use the same model and architecture as GPT-2 [ RWC+19 ], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [ CGRS19 ]. Finally, the last small tweak, coming from GPT-2, is replacing LayerNorm (2016) by RMSNorm (2019) , which has been a common trend in recent years. Akin to swapping GELU with Swish and SwiGLU, RMSNorm is one of these smaller but sensible efficiency improvements. RMSNorm is similar to LayerNorm in its purpose to normalize layer activations, as shown in Figure 11 below. You might recall that not too long ago, BatchNorm was the go-to choice for this task. It has since fallen out of favor, largely because it is harder to parallelize efficiently (due to the mean and variance batch statistics) and performs poorly with small batch sizes. Figure 11: A comparison between LayerNorm (left) and RMSNorm (right) for a small linear layer. As we can see in Figure 11 above, both LayerNorm and RMSNorm scale the layer outputs to be in a reasonable range. LayerNorm subtracts the mean and divides by the standard deviation such that the layer outputs have a zero mean and unit variance (variance of 1 and standard deviation of one). RMSNorm divides the inputs by the root-mean-square. This scales activations to a comparable magnitude without enforcing zero mean or unit variance. In this particular example shown in Figure 11, the mean is 0.77 and the variance is 0.41. Both LayerNorm and RMSNorm stabilize activation scales and improve optimization, but RMSNorm is often preferred in large-scale LLMs because it is cheaper to compute. Unlike LayerNorm, RMSNorm has no bias (shift) term and reduces the expensive mean and variance computations to a single root-mean-square operation. This reduces the number of cross-feature reductions from two to one, which lowers communication overhead on GPUs and improving training efficiency. Figure 12 shows what this looks like in code: Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler. I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work. By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures. In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of what limitations or problems they try to solve. For instance, starting with my GPT-2 code I recently implemented the Qwen3 architecture from scratch , which is super similar to gpt-oss, which brings us to the next topic: Comparing gpt-oss to a more recent architecture. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Now that we have walked through the evolution from GPT-2 to GPT OSS, we can take the next step and compare GPT OSS to a more recent architecture, Qwen3, which was released three months earlier in May 2025. The reason I am selecting Qwen3 here is that it is among the top open-weight models as of the time of writing. Additionally, one of the Qwen3 MoE models is more or less directly comparable to GPT OSS due to its relatively similar overall size in terms of trainable parameters. Figure 13 below compares gpt-oss-20b to a Qwen3 model of comparable size. Figure 13: A gpt-oss and Qwen3 model of comparable size side by side. As we can see, gpt-oss 20B and Qwen3 30B-A3B are very similar in their architecture components. The primary difference here, aside from the dimensions, is that gpt-oss employs sliding window attention, as discussed earlier in section 1.6 (not shown in this figure), whereas Qwen3 does not. Let's walk through the noteworthy details one by one in the following subsections. If we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24 (Figure 14). Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b. On the other hand, gpt-oss is a much wider architecture: An embedding dimension of 2880 instead of 2048 An intermediate expert (feed forward) projection dimension of also 2880 instead of 768 It's also worth noting that gpt-oss uses twice as many attention heads, but this doesn't directly increase the model's width. The width is determined by the embedding dimension. Does one approach offer advantages over the other given a fixed number of parameters? As a rule of thumb, deeper models have more flexibility but can be harder to train due to instability issues, due to exploding and vanishing gradients (which RMSNorm and shortcut connections aim to mitigate). Wider architectures have the advantage of being faster during inference (with a higher tokens/second throughput) due to better parallelization at a higher memory cost. When it comes to modeling performance, there's unfortunately no good apples-to-apples comparison I am aware of (where parameter size and datasets are kept constant) except for an ablation study in the Gemma 2 paper (Table 9) , which found that for a 9B parameter architecture, a wider setup is slightly better than a deeper setup. Across 4 benchmarks, the wider model achieved a 52.0 average score, and the deeper model achieved a 50.8 average score. As shown in Figure 14 above, it's also noteworthy that gpt-oss has a surprisingly small number of experts (32 instead of 128), and only uses 4 instead of 8 active experts per token. However, each expert is much larger than the experts in Qwen3. This is interesting because the recent trends and developments point towards more, smaller models as being beneficial. This change, at a constant total parameter size, is nicely illustrated in Figure 15 below from the DeepSeekMoE paper. Figure 15: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts, though. To be fair, the small number of experts in gpt-oss could be a side effect of the 20B size. Looking at the 120B mode below, they indeed increased the number of experts (and transformer blocks) while keeping everything else fixed, as shown in Figure 16 below. Figure 16: The two gpt-oss architectures side by side, where the larger 120B model only scales the number of transformer blocks and number of experts. The boring explanation for the fact that the 20B and 120B models are so similar is probably that the 120B model was the main focus. And the easiest way to create a smaller model was to make it a bit shorter (fewer transformer blocks) and to reduce the number of experts, because that's where most of the parameters are. However, one might speculate whether they started training the 120B model, and then chopped some of the transformer blocks and experts for continued pre-training (instead of starting from random weights). In any case, it's because it's quite unusual to only scale those two (transformer blocks and number of experts). For instance, when looking at Qwen3 MoE models of multiple sizes (Figure 17 below), they were scaled more proportionally to each other over many more aspects.. Figure 17: Architecture differences in the various Qwen3 models. Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in each second layer, as mentioned earlier. However, there's one interesting detail that caught my eye. It seems that gpt-oss uses bias units for the attention weights, as shown in the figure below. Figure 18: gpt-oss models use bias units in the attention layers. See code example here . I haven't seen these bias units being used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation (k_proj). Furthermore, the empirical results show that there is little difference between with and without bias units (see Figure 19 below). Figure 19: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units. Another detail you may have noticed is the definition of in the code screenshot in Figure 18. In general models, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. I.e., if the context gets very long, this special attended token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think it was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.) In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 20). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs. Figure 20: The use of attention sinks in gpt-oss; based on the Hugging Face code here . Lastly, and similar to Qwen3, the gpt-oss models are Apache 2.0 open-source license, which is great (it's the same license that I prefer for my own open-source projects). This means that the models can be distilled into other models or used in commercial products without restriction. Open-weight vs. open-source LLMs. This distinction has been debated for years, but it is worth clarifying to avoid confusion about this release and its artifacts. Some model developers release only the model weights and inference code (for example, Llama, Gemma, gpt-oss), while others (for example, OLMo) release everything including training code, datasets, and weights as true open source. By that stricter definition, gpt-oss is an open-weight model (just like Qwen3) because it includes the weights and inference code but not the training code or datasets. However, the terminology is used inconsistently across the industry. I assume the "oss" in "gpt-oss" stands for open source software ; however, I am positively surprised that OpenAI itself clearly describes gpt-oss as an open-weight model in their official announcement article . While the previous sections described how the architecture has evolved since GPT-2 and discussed its similarities to Qwen3 (and most other recent models), there are still a few additional but noteworthy details I have not mentioned, yet. These are points that did not fit neatly into the earlier sections but are still worth mentioning. Unfortunately, there is not much information about the training set sizes and algorithms available. I added the most interesting puzzle pieces from the model card report (1) and announcement post (2) below: The gpt-oss models were trained using our most advanced pre-training and post-training techniques [...] (1) [...] required 2.1million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. (1) [...] including a supervised fine-tuning stage and a high-compute RL stage [...] (2) We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge. (2) So, we know that the gpt-oss models are reasoning models. The training compute of 2.1 million H100 GPU hours is roughly on par with the 2.788 million H800 GPU hours that the ~5.6x larger DeepSeek V3 model was trained for. Unfortunately, there is no information about the Qwen3 training time available yet. Interestingly, the GPT-oss training hour estimate includes both the supervised learning for instruction following and the reinforcement learning for reasoning, whereas DeepSeek V3 is just a pre-trained base model on top of which DeepSeek R1 was trained separately. As mentioned in the previous section, the gpt-oss models are reasoning models. However, what's particularly interesting is that they were trained so that users can easily control the degree of reasoning via inference time scaling. Concretely, gpt-oss models can receive "Reasoning effort: low/medium/high" instructions as part of their system prompt, which directly affects the response length and accuracy, as shown in Figure 21. Figure 21: Response length and quality of gpt-oss models under different reasoning efforts (annotated figure from the model card ) This level of adjustability is useful because it lets us balance cost, compute, and accuracy. For example, if the task is simple, such as answering a straightforward knowledge question or fixing a small typo, we can skip extended reasoning. This saves time and resources while avoiding unnecessarily long responses and verbose reasoning traces. It is somewhat unfortunate that OpenAI did not release the base models prior to reinforcement learning-based reasoning training, unlike Qwen3 or OLMo. Base models are particularly valuable starting points for researchers working on reasoning methods (which is one reason I currently like working with Qwen3 Base). My guess is that OpenAI's decision was driven more by industry and production use cases than by research considerations. Note that the original Qwen3 models also have a toggle for enabling/disabling thinking (reasoning) modes (via a setting in the tokenizer that simply adds <think></think> tags to disable the reasoning behavior). However, the Qwen3 team updated their models in the last few weeks and moved away from the hybrid model towards dedicated Instruct/Thinking/Coder variants. The reason was that the hybrid mode resulted in lower performance compared to the individual models: After discussing with the community and reflecting on the matter, we have decided to abandon the hybrid thinking mode. We will now train the Instruct and Thinking models separately to achieve the best possible quality. Source One interesting surprise is that OpenAI released the gpt-oss models with an MXFP4 quantization scheme for the MoE experts. Quantization formats used to be a niche topic, mostly relevant to mobile or embedded AI, but that's changed with the push toward bigger models. In this case, the MXFP4 optimization allows the model to run on single GPU devices. Here’s what that looks like in practice: The large model (think 120B) fits on a single 80GB H100 or newer GPU. Not consumer hardware, but hey, it's much cheaper to rent a 1-H100 machine than a multi-H100 machine. Plus, we don't have to worry about distributing the model across GPUs and adding communication overhead. It's really nice that AMD MI300X cards are supported from day 1 as well! The smaller 20B model even fits into 16 GB of VRAM; the caveat is that it has to be a RTX 50-series GPU or newer to support MXFP4. (Edit: support for older cards, such as RTX 4090, was recently added via a patch .) Note that the models will also run on older hardware but without MXFP4 support and will thus consume more RAM. Without MXFP4 optimization, the models in bfloat16 will consume more like 48 GB (gpt-oss-20b) and 240 GB (gpt-oss-120b). By the way, I can run the gpt-oss-20b model comfortably on my Mac Mini using ollama. It uses about 13.5 Gb or memory, which is really reasonable. The models are still a bit too new for independent benchmarks. Checking the LM Arena leaderboard , I found that gpt-oss is not listed, yet. So, Qwen3-Instruct remains the top open-weight model, according to users on the LM Arena, for now (Figure 22). Figure 22: Current view of the LM Arena Leaderboard (as of 8 Aug 2025) Looking at a reasoning benchmarks provide in the gpt-oss announcement post, we can see that the gpt-ossmodels are on par with OpenAI's proprietary models as well as Qwen3 (Figure 23). Figure 23: The main benchmark charts are from the official gpt-oss announcement post . The "no tools" gpt-oss-120b data is taken from the official model card paper , and the Qwen3 numbers are taken from the official Qwen3 repository . However, this should be caveated by the fact that gpt-oss-120b is almost half the size of the Qwen3 A235B-A22B-Thinking-2507 model and can run on a single GPU. Benchmark performance, however, does not always reflect real-world usability. In my limited use over the past few days, I have found gpt-oss to be quite capable. That said, as others have observed, it does seem to have a relatively high tendency to hallucinate (a point also mentioned in its model card). This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some "general knowledge forgetting." Still, because gpt-oss was designed with tool use in mind, this limitation may become less relevant over time. Tool integration in open-source LLMs is still in its early stages, but as it matures, I expect that we increasingly let models consult external sources (like search engines) when answering factual or knowledge-based queries. If that happens, it could be sensible to prioritize reasoning capacity over memorization. This is much like in human learning in school (or in life in general), where problem-solving skills often matter more than memorizing facts. OpenAI had a busy week and released the long-awaited GPT-5 model shortly after gpt-oss. The GPT-5 release was interesting. And if there's one thing I have to say here, it's that I am really surprised by how good their open-source models really are compared to their best product offering in terms of benchmark performance (Figure 24). Figure 24: The main benchmark charts are from the official GPT-5 announcement post . The gpt-oss data is taken from the official model card paper and announcement post , and the Qwen3 numbers are taken from the official Qwen3-Coder repository . All in all, even though some people called the release overhyped, I am glad that we have a new set of really strong open weight models that are not too far behind the best proprietary ones. Of course, benchmarks often do not accurately reflect real-world use, and it is still too early to tell based on the limited usage. But I think these are good times for people who like to work with open-weight and local (or privately hosted) models. This magazine is a personal passion project, and your support helps keep it alive. If you would like to contribute, there are a few great ways: Grab a copy of my book . Build a Large Language Model (From Scratch) walks you through building an LLM step by step, from tokenizer to training. Check out the video course . There’s now a 17-hour video course based on the book, available from Manning. It follows the book closely, section by section, and works well both as a standalone or as a code-along resource. The video course is ad-free (unlike the YouTube version) and has a cleaner, more structured format. It also contains 5 additional hours of pre-requisite video material created by Abhinav Kimothi. Subscribe . A paid subscription helps to make my writing sustainable and gives you access to additional contents. Thanks for reading, and for helping support independent research! Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Model architecture comparisons with GPT-2 MXFP4 optimization to fit gpt-oss models onto single GPUs Width versus depth trade-offs (gpt-oss vs Qwen3) Attention bias and sinks Benchmarks and comparisons with GPT-5 Figure 1: The two gpt-oss models side by side. If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance. This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because There is significant rotation of employees between these labs. We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone kindly pointed out that there's a higher-ranking hybrid model: Hunyuan-TurboS at rank 22.) Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes. https://huggingface.co/openai/gpt-oss-20b https://huggingface.co/openai/gpt-oss-120b Figure 2: A side-by-side comparison between gpt-oss-20b and GPT-2 XL 1.5B. Both gpt-oss and GPT-2 are decoder-only LLMs built on the transformer architecture introduced in the Attention Is All You Need (2017) paper. Over the years, many details have evolved. However, these changes are not unique to gpt-oss. And as we will see later, they appear in many other LLMs. Since I discussed many of these aspects in the previous Big Architecture Comparison article, I will try to keep each subsection brief and focused. 2.1 Removing Dropout Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). Figure 3: An illustration of dropout applied to the attention score matrix. I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. Interestingly, while Dropout is kind of ignored in LLM architecture design for many years, I found a 2025 research paper with small scale LLM experiments (Pythia 1.4B) that confirms that Dropout results in worse downstream performance in these single-epoch regimes. 2.2 RoPE Replaces Absolute Positional Embeddings In transformer-based LLMs, positional encoding is necessary because of the attention mechanism. By default, attention treats the input tokens as if they have no order. In the original GPT architecture, absolute positional embeddings addressed this by adding a learned embedding vector for each position in the sequence (Figure 4), which is then added to the token embeddings. Figure 4: Illustration of absolute positional embeddings. RoPE ( Rotary Position Embedding ) introduced a different approach: instead of adding position information as separate embeddings, it encodes position by rotating the query and key vectors in a way that depends on each token's position. (RoPE is an elegant idea but also a bit of a tricky topic to explain. I plan to cover separately in more detail one day.) While first introduced in 2021, RoPE became widely adopted with the release of the original Llama model in 2023 and has since become a staple in modern LLMs. 2.3 Swish/SwiGLU Replaces GELU Early GPT architectures used GELU. Why now use Swish over GELU? Swish (also referred to as sigmoid linear unit or SiLU) is considered computationally slightly cheaper, and in my opinion, that all there is to it. Depending on which paper you look at, you will find that one is slightly better than the other in terms of modeling performance. In my opinion, these small differences are probably within a standard error, and your mileage will vary based on hyperparameter sensitivity. Activation functions used to be a hot topic of debate until the deep learning community largely settled on ReLU more than a decade ago. Since then, researchers have proposed and tried many ReLU-like variants with smoother curves, and GELU and Swish (Figure 5) are the ones that stuck. Figure 5: Comparison between Swish and GELU activations, which are both smoother versions or ReLU. Early GPT architectures used GELU, which is defined as . Here, (short for error function) is the integral of a Gaussian and it is computed using polynomial approximations of the Gaussian integral, which makes it more computationally expensive than simpler functions like the sigmoid used in Swish, where Swish is simply . In practice, Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models. Depending on which paper we look at, one might be somewhat better in terms of modeling performance. But I'd say these gains are often within standard error, and the winner will depend heavily on hyperparameter tuning. Swish is used in most architectures today. However, GELU is not entirely forgotten; for example, Google's Gemma models still use GELU. What's more notable, though, is that the feed forward module (a small multi-layer perceptron) is replaced by a gated "GLU" counterpart, where GLU stands for gated linear unit and was proposed in a 2020 paper . Concretely, the 2 fully connected layers are replaced by 3 fully connected layers that are used as shown in Figure 6 below. Figure 6: A comparison between Swish and GELU and their gated counterparts, SwiGLU and GEGLU. At first glance, it may appear that the GEGLU/SwiGLU variants may be better than the regular feed forward layers because there are simply more parameters due to the extra layer. But this is deceiving because in practice, the and weight layers in SwiGLU/GEGLU are usually chosen to be half the size each of the layer in a traditional feed forward layer. To illustrate this better, consider the concrete code implementations of the regular and GLU variants: Figure 7: Regular feed forward module (top) and SwiGLU variant (bottom) next to each other. Note that the Swish function is implemented as “silu” in PyTorch. So, suppose we have an embedding dimension of 1024. In the regular feed forward case, this would then be fc1: 1024 × 4096 = 4,194,304 fc2: 1024 × 4096 = 4,194,304 fc1: 1024 × 1024 = 1,048,576 fc2: 1024 × 1024 = 1,048,576 fc3: 1024 × 1024 = 1,048,576 Figure 8: The feed forward module is replaced by a Mixture-of-Expert (MoE) module. So, replacing a single feed forward module with multiple feed forward modules (as done in a MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. Because only a few experts are active at a time, MoE modules are often referred to as sparse , in contrast to dense modules that always use the full parameter set. However, the large total number of parameters via an MoE increases the capacity of the LLM, which means it can take up more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. (Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.) 2.5 Grouped Query Attention Replaces Multi-Head Attention As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA). In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections. For example, as shown in Figure 9, if there are 2 key–value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency without noticeably affecting modeling performance, according to ablation studies. Figure 9: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries. So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here .) While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper ) show it performs comparably to standard MHA in terms of LLM modeling performance. 2.6 Sliding Window Attention Sliding-window attention (Figure 10 below) was first introduced in the LongFormer paper (2020) and later popularized by Mistral. Interestingly, gpt-oss applies it in every second layer. You can think of it as a variation of multi-head attention, or in this case grouped query attention (GQA), where the attention context is restricted to a smaller window, reducing both memory usage and compute costs. Figure 10: Comparison between regular attention (left) and sliding window attention (right). Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. As I discussed in my previous article , Gemma 2 (2024) used a similar 1:1 ratio. Gemma 3 earlier this year went much further and shifted to a 5:1 ratio, which means only one full-attention layer for every five sliding-window (local) attention layers. According to the Gemma ablation studies, sliding-window attention has minimal impact on modeling performance, as shown in the figure below. Note that the window size in Gemma 2 was 4096 tokens, which Gemma 3 reduced to 1024. In gpt-oss, the window is just 128 tokens, which is remarkably small. And as a fun fact, the official announcement article notes that sliding-window attention was apparently already used in GPT-3: The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3 Who knew!? I went back to the original GPT-3 paper , and it was indeed mentioned there: We use the same model and architecture as GPT-2 [ RWC+19 ], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [ CGRS19 ]. 2.7 RMSNorm Replaces LayerNorm Finally, the last small tweak, coming from GPT-2, is replacing LayerNorm (2016) by RMSNorm (2019) , which has been a common trend in recent years. Akin to swapping GELU with Swish and SwiGLU, RMSNorm is one of these smaller but sensible efficiency improvements. RMSNorm is similar to LayerNorm in its purpose to normalize layer activations, as shown in Figure 11 below. You might recall that not too long ago, BatchNorm was the go-to choice for this task. It has since fallen out of favor, largely because it is harder to parallelize efficiently (due to the mean and variance batch statistics) and performs poorly with small batch sizes. Figure 11: A comparison between LayerNorm (left) and RMSNorm (right) for a small linear layer. As we can see in Figure 11 above, both LayerNorm and RMSNorm scale the layer outputs to be in a reasonable range. LayerNorm subtracts the mean and divides by the standard deviation such that the layer outputs have a zero mean and unit variance (variance of 1 and standard deviation of one). RMSNorm divides the inputs by the root-mean-square. This scales activations to a comparable magnitude without enforcing zero mean or unit variance. In this particular example shown in Figure 11, the mean is 0.77 and the variance is 0.41. Both LayerNorm and RMSNorm stabilize activation scales and improve optimization, but RMSNorm is often preferred in large-scale LLMs because it is cheaper to compute. Unlike LayerNorm, RMSNorm has no bias (shift) term and reduces the expensive mean and variance computations to a single root-mean-square operation. This reduces the number of cross-feature reductions from two to one, which lowers communication overhead on GPUs and improving training efficiency. Figure 12 shows what this looks like in code: Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler. 2.8 The GPT-2 Legacy I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work. By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures. In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of what limitations or problems they try to solve. For instance, starting with my GPT-2 code I recently implemented the Qwen3 architecture from scratch , which is super similar to gpt-oss, which brings us to the next topic: Comparing gpt-oss to a more recent architecture. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 3. Comparing gpt-oss To A Recent Architecture (Qwen3) Now that we have walked through the evolution from GPT-2 to GPT OSS, we can take the next step and compare GPT OSS to a more recent architecture, Qwen3, which was released three months earlier in May 2025. The reason I am selecting Qwen3 here is that it is among the top open-weight models as of the time of writing. Additionally, one of the Qwen3 MoE models is more or less directly comparable to GPT OSS due to its relatively similar overall size in terms of trainable parameters. Figure 13 below compares gpt-oss-20b to a Qwen3 model of comparable size. Figure 13: A gpt-oss and Qwen3 model of comparable size side by side. As we can see, gpt-oss 20B and Qwen3 30B-A3B are very similar in their architecture components. The primary difference here, aside from the dimensions, is that gpt-oss employs sliding window attention, as discussed earlier in section 1.6 (not shown in this figure), whereas Qwen3 does not. Let's walk through the noteworthy details one by one in the following subsections. 3.1 Width Versus Depth If we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24 (Figure 14). Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b. On the other hand, gpt-oss is a much wider architecture: An embedding dimension of 2880 instead of 2048 An intermediate expert (feed forward) projection dimension of also 2880 instead of 768 Figure 15: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts, though. To be fair, the small number of experts in gpt-oss could be a side effect of the 20B size. Looking at the 120B mode below, they indeed increased the number of experts (and transformer blocks) while keeping everything else fixed, as shown in Figure 16 below. Figure 16: The two gpt-oss architectures side by side, where the larger 120B model only scales the number of transformer blocks and number of experts. The boring explanation for the fact that the 20B and 120B models are so similar is probably that the 120B model was the main focus. And the easiest way to create a smaller model was to make it a bit shorter (fewer transformer blocks) and to reduce the number of experts, because that's where most of the parameters are. However, one might speculate whether they started training the 120B model, and then chopped some of the transformer blocks and experts for continued pre-training (instead of starting from random weights). In any case, it's because it's quite unusual to only scale those two (transformer blocks and number of experts). For instance, when looking at Qwen3 MoE models of multiple sizes (Figure 17 below), they were scaled more proportionally to each other over many more aspects.. Figure 17: Architecture differences in the various Qwen3 models. 3.3 Attention Bias and Attention Sinks Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in each second layer, as mentioned earlier. However, there's one interesting detail that caught my eye. It seems that gpt-oss uses bias units for the attention weights, as shown in the figure below. Figure 18: gpt-oss models use bias units in the attention layers. See code example here . I haven't seen these bias units being used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation (k_proj). Furthermore, the empirical results show that there is little difference between with and without bias units (see Figure 19 below). Figure 19: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units. Another detail you may have noticed is the definition of in the code screenshot in Figure 18. In general models, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. I.e., if the context gets very long, this special attended token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think it was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.) In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 20). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs. Figure 20: The use of attention sinks in gpt-oss; based on the Hugging Face code here . 3.4 License Lastly, and similar to Qwen3, the gpt-oss models are Apache 2.0 open-source license, which is great (it's the same license that I prefer for my own open-source projects). This means that the models can be distilled into other models or used in commercial products without restriction. Open-weight vs. open-source LLMs. This distinction has been debated for years, but it is worth clarifying to avoid confusion about this release and its artifacts. Some model developers release only the model weights and inference code (for example, Llama, Gemma, gpt-oss), while others (for example, OLMo) release everything including training code, datasets, and weights as true open source. By that stricter definition, gpt-oss is an open-weight model (just like Qwen3) because it includes the weights and inference code but not the training code or datasets. However, the terminology is used inconsistently across the industry. I assume the "oss" in "gpt-oss" stands for open source software ; however, I am positively surprised that OpenAI itself clearly describes gpt-oss as an open-weight model in their official announcement article . 4 Other Interesting Tidbits While the previous sections described how the architecture has evolved since GPT-2 and discussed its similarities to Qwen3 (and most other recent models), there are still a few additional but noteworthy details I have not mentioned, yet. These are points that did not fit neatly into the earlier sections but are still worth mentioning. 4.1 Training Overview Unfortunately, there is not much information about the training set sizes and algorithms available. I added the most interesting puzzle pieces from the model card report (1) and announcement post (2) below: The gpt-oss models were trained using our most advanced pre-training and post-training techniques [...] (1) [...] required 2.1million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. (1) [...] including a supervised fine-tuning stage and a high-compute RL stage [...] (2) We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge. (2) So, we know that the gpt-oss models are reasoning models. The training compute of 2.1 million H100 GPU hours is roughly on par with the 2.788 million H800 GPU hours that the ~5.6x larger DeepSeek V3 model was trained for. Unfortunately, there is no information about the Qwen3 training time available yet. Interestingly, the GPT-oss training hour estimate includes both the supervised learning for instruction following and the reinforcement learning for reasoning, whereas DeepSeek V3 is just a pre-trained base model on top of which DeepSeek R1 was trained separately. 4.2 Reasoning Efforts As mentioned in the previous section, the gpt-oss models are reasoning models. However, what's particularly interesting is that they were trained so that users can easily control the degree of reasoning via inference time scaling. Concretely, gpt-oss models can receive "Reasoning effort: low/medium/high" instructions as part of their system prompt, which directly affects the response length and accuracy, as shown in Figure 21. Figure 21: Response length and quality of gpt-oss models under different reasoning efforts (annotated figure from the model card ) This level of adjustability is useful because it lets us balance cost, compute, and accuracy. For example, if the task is simple, such as answering a straightforward knowledge question or fixing a small typo, we can skip extended reasoning. This saves time and resources while avoiding unnecessarily long responses and verbose reasoning traces. It is somewhat unfortunate that OpenAI did not release the base models prior to reinforcement learning-based reasoning training, unlike Qwen3 or OLMo. Base models are particularly valuable starting points for researchers working on reasoning methods (which is one reason I currently like working with Qwen3 Base). My guess is that OpenAI's decision was driven more by industry and production use cases than by research considerations. Note that the original Qwen3 models also have a toggle for enabling/disabling thinking (reasoning) modes (via a setting in the tokenizer that simply adds <think></think> tags to disable the reasoning behavior). However, the Qwen3 team updated their models in the last few weeks and moved away from the hybrid model towards dedicated Instruct/Thinking/Coder variants. The reason was that the hybrid mode resulted in lower performance compared to the individual models: After discussing with the community and reflecting on the matter, we have decided to abandon the hybrid thinking mode. We will now train the Instruct and Thinking models separately to achieve the best possible quality. Source 4.3 MXFP4 Optimization: A Small But Important Detail One interesting surprise is that OpenAI released the gpt-oss models with an MXFP4 quantization scheme for the MoE experts. Quantization formats used to be a niche topic, mostly relevant to mobile or embedded AI, but that's changed with the push toward bigger models. In this case, the MXFP4 optimization allows the model to run on single GPU devices. Here’s what that looks like in practice: The large model (think 120B) fits on a single 80GB H100 or newer GPU. Not consumer hardware, but hey, it's much cheaper to rent a 1-H100 machine than a multi-H100 machine. Plus, we don't have to worry about distributing the model across GPUs and adding communication overhead. It's really nice that AMD MI300X cards are supported from day 1 as well! The smaller 20B model even fits into 16 GB of VRAM; the caveat is that it has to be a RTX 50-series GPU or newer to support MXFP4. (Edit: support for older cards, such as RTX 4090, was recently added via a patch .) Figure 22: Current view of the LM Arena Leaderboard (as of 8 Aug 2025) Looking at a reasoning benchmarks provide in the gpt-oss announcement post, we can see that the gpt-ossmodels are on par with OpenAI's proprietary models as well as Qwen3 (Figure 23). Figure 23: The main benchmark charts are from the official gpt-oss announcement post . The "no tools" gpt-oss-120b data is taken from the official model card paper , and the Qwen3 numbers are taken from the official Qwen3 repository . However, this should be caveated by the fact that gpt-oss-120b is almost half the size of the Qwen3 A235B-A22B-Thinking-2507 model and can run on a single GPU. Benchmark performance, however, does not always reflect real-world usability. In my limited use over the past few days, I have found gpt-oss to be quite capable. That said, as others have observed, it does seem to have a relatively high tendency to hallucinate (a point also mentioned in its model card). This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some "general knowledge forgetting." Still, because gpt-oss was designed with tool use in mind, this limitation may become less relevant over time. Tool integration in open-source LLMs is still in its early stages, but as it matures, I expect that we increasingly let models consult external sources (like search engines) when answering factual or knowledge-based queries. If that happens, it could be sensible to prioritize reasoning capacity over memorization. This is much like in human learning in school (or in life in general), where problem-solving skills often matter more than memorizing facts. 5 gpt-oss and GPT-5 OpenAI had a busy week and released the long-awaited GPT-5 model shortly after gpt-oss. The GPT-5 release was interesting. And if there's one thing I have to say here, it's that I am really surprised by how good their open-source models really are compared to their best product offering in terms of benchmark performance (Figure 24). Figure 24: The main benchmark charts are from the official GPT-5 announcement post . The gpt-oss data is taken from the official model card paper and announcement post , and the Qwen3 numbers are taken from the official Qwen3-Coder repository . All in all, even though some people called the release overhyped, I am glad that we have a new set of really strong open weight models that are not too far behind the best proprietary ones. Of course, benchmarks often do not accurately reflect real-world use, and it is still too early to tell based on the limited usage. But I think these are good times for people who like to work with open-weight and local (or privately hosted) models. This magazine is a personal passion project, and your support helps keep it alive. If you would like to contribute, there are a few great ways: Grab a copy of my book . Build a Large Language Model (From Scratch) walks you through building an LLM step by step, from tokenizer to training. Check out the video course . There’s now a 17-hour video course based on the book, available from Manning. It follows the book closely, section by section, and works well both as a standalone or as a code-along resource. The video course is ad-free (unlike the YouTube version) and has a cleaner, more structured format. It also contains 5 additional hours of pre-requisite video material created by Abhinav Kimothi. Subscribe . A paid subscription helps to make my writing sustainable and gives you access to additional contents.

Machine Learning

0 views

Ahead of AI 4 months ago

The Big LLM Architecture Comparison

It has been seven years since the original GPT architecture was developed. At first glance, looking back at GPT-2 (2019) and forward to DeepSeek-V3 and Llama 4 (2024-2025), one might be surprised at how structurally similar these models still are. Sure, positional embeddings have evolved from absolute to rotational (RoPE), Multi-Head Attention has largely given way to Grouped-Query Attention, and the more efficient SwiGLU has replaced activation functions like GELU. But beneath these minor refinements, have we truly seen groundbreaking changes, or are we simply polishing the same architectural foundations? Comparing LLMs to determine the key ingredients that contribute to their good (or not-so-good) performance is notoriously challenging: datasets, training techniques, and hyperparameters vary widely and are often not well documented. However, I think that there is still a lot of value in examining the structural changes of the architectures themselves to see what LLM developers are up to in 2025. (A subset of them are shown in Figure 1 below.) Figure 1: A subset of the architectures covered in this article. So, in this article, rather than writing about benchmark performance or training algorithms, I will focus on the architectural developments that define today's flagship open models. (As you may remember, I wrote about multimodal LLMs not too long ago; in this article, I will focus on the text capabilities of recent models and leave the discussion of multimodal capabilities for another time.) Tip: This is a fairly comprehensive article, so I recommend using the navigation bar to access the table of contents (just hover over the left side of the Substack page). Optional: The video below is a narrated and abridged version of this article. As you have probably heard more than once by now, DeepSeek R1 made a big impact when it was released in January 2025. DeepSeek R1 is a reasoning model built on top of the DeepSeek V3 architecture , which was introduced in December 2024. While my focus here is on architectures released in 2025, I think it’s reasonable to include DeepSeek V3, since it only gained widespread attention and adoption following the launch of DeepSeek R1 in 2025. If you are interested in the training of DeepSeek R1 specifically, you may also find my article from earlier this year useful: In this section, I’ll focus on two key architectural techniques introduced in DeepSeek V3 that improved its computational efficiency and distinguish it from many other LLMs: Multi-Head Latent Attention (MLA) Mixture-of-Experts (MoE) Before discussing Multi-Head Latent Attention (MLA), let's briefly go over some background to motivate why it's used. For that, let's start with Grouped-Query Attention (GQA), which has become the new standard replacement for a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA) in recent years. So, here's a brief GQA summary. Unlike MHA, where each head also has its own set of keys and values, to reduce memory usage, GQA groups multiple heads to share the same key and value projections. For example, as further illustrated in Figure 2 below, if there are 2 key-value groups and 4 attention heads, then heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This reduces the total number of key and value computations, which leads to lower memory usage and improved efficiency (without noticeably affecting the modeling performance, according to ablation studies). Figure 2: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries. So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here .) While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper ) show it performs comparably to standard MHA in terms of LLM modeling performance. Now, Multi-Head Latent Attention (MLA) offers a different memory-saving strategy that also pairs particularly well with KV caching. Instead of sharing key and value heads like GQA, MLA compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache. At inference time, these compressed tensors are projected back to their original size before being used, as shown in the Figure 3 below. This adds an extra matrix multiplication but reduces memory usage. Figure 3: Comparison between MLA (used in DeepSeek V3 and R1) and regular MHA. (As a side note, the queries are also compressed, but only during training, not inference.) By the way, MLA is not new in DeepSeek V3, as its DeepSeek-V2 predecessor also used (and even introduced) it. Also, the V2 paper contains a few interesting ablation studies that may explain why the DeepSeek team chose MLA over GQA (see Figure 4 below). Figure 4: Annotated tables from the DeepSeek-V2 paper, https://arxiv.org/abs/2405.04434 As shown in Figure 4 above, GQA appears to perform worse than MHA, whereas MLA offers better modeling performance than MHA, which is likely why the DeepSeek team chose MLA over GQA. (It would have been interesting to see the "KV Cache per Token" savings comparison between MLA and GQA as well!) To summarize this section before we move on to the next architecture component, MLA is a clever trick to reduce KV cache memory use while even slightly outperforming MHA in terms of modeling performance. The other major architectural component in DeepSeek worth highlighting is its use of Mixture-of-Experts (MoE) layers. While DeepSeek did not invent MoE, it has seen a resurgence this year, and many of the architectures we will cover later also adopt it. You are likely already familiar with MoE, but a quick recap may be helpful. The core idea in MoE is to replace each FeedForward module in a transformer block with multiple expert layers, where each of these expert layers is also a FeedForward module. This means that we swap a single FeedForward block for multiple FeedForward blocks, as illustrated in the Figure 5 below. Figure 5: An illustration of the Mixture-of-Experts (MoE) module in DeepSeek V3/R1 (right) compared to an LLM with a standard FeedForward block (left). The FeedForward block inside a transformer block (shown as the dark gray block in the figure above) typically contains a large number of the model's total parameters. (Note that the transformer block, and thereby the FeedForward block, is repeated many times in an LLM; in the case of DeepSeek-V3, 61 times.) So, replacing a single FeedForward block with multiple FeedForward blocks (as done in a MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. (In the interest of time, or rather article space, I'll cover the router in more detail another time.) Because only a few experts are active at a time, MoE modules are often referred to as sparse , in contrast to dense modules that always use the full parameter set. However, the large total number of parameters via an MoE increases the capacity of the LLM, which means it can take up more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. For example, DeepSeek-V3 has 256 experts per MoE module and a total of 671 billion parameters. Yet during inference, only 9 experts are active at a time (1 shared expert plus 8 selected by the router). This means just 37 billion parameters are used per inference step as opposed to all 671 billion. One notable feature of DeepSeek-V3's MoE design is the use of a shared expert. This is an expert that is always active for every token. This idea is not new and was already introduced in the DeepSeek 2024 MoE and 2022 DeepSpeedMoE paper s. Figure 6: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 The benefit of having a shared expert was first noted in the DeepSpeedMoE paper , where they found that it boosts overall modeling performance compared to no shared experts. This is likely because common or repeated patterns don't have to be learned by multiple individual experts, which leaves them with more room for learning more specialized patterns. To summarize, DeepSeek-V3 is a massive 671-billion-parameter model that, at launch, outperformed other open-weight models, including the 405B Llama 3. Despite being larger, it is much more efficient at inference time thanks to its Mixture-of-Experts (MoE) architecture, which activates only a small subset of (just 37B) parameters per token. Another key distinguishing feature is DeepSeek-V3's use of Multi-Head Latent Attention (MLA) instead of Grouped-Query Attention (GQA). Both MLA and GQA are inference-efficient alternatives to standard Multi-Head Attention (MHA), particularly when using KV caching. While MLA is more complex to implement, a study in the DeepSeek-V2 paper has shown it delivers better modeling performance than GQA. The OLMo series of models by the non-profit Allen Institute for AI is noteworthy due to its transparency in terms of training data and code, as well as the relatively detailed technical reports. While you probably won’t find OLMo models at the top of any benchmark or leaderboard, they are pretty clean and, more importantly, a great blueprint for developing LLMs, thanks to their transparency. And while OLMo models are popular because of their transparency, they are not that bad either. In fact, at the time of release in January (before Llama 4, Gemma 3, and Qwen 3), OLMo 2 models were sitting at the Pareto frontier of compute to performance, as shown in Figure 7 below. Figure 7: Modeling benchmark performance (higher is better) vs pre-training cost (FLOPs; lower is better) for different LLMs. This is an annotated figure from the OLMo 2 paper, https://arxiv.org/abs/2501.00656 As mentioned earlier in this article, I aim to focus only on the LLM architecture details (not training or data) to keep it at a manageable length. So, what were the interesting architectural design choices in OLMo2 ? It mainly comes down to normalizations: the placement of RMSNorm layers as well as the addition of a QK-norm, which I will discuss below. Another thing worth mentioning is that OLMo 2 still uses traditional Multi-Head Attention (MHA) instead of MLA or GQA. Overall, OLMo 2 largely follows the architecture of the original GPT model, similar to other contemporary LLMs. However, there are some noteworthy deviations. Let's start with the normalization layers. Similar to Llama, Gemma, and most other LLMs, OLMo 2 switched from LayerNorm to RMSNorm. But since RMSNorm is old hat (it's basically a simplified version of LayerNorm with fewer trainable parameters), I will skip the discussion of RMSNorm vs LayerNorm. (Curious readers can find an RMSNorm code implementation in my GPT-2 to Llama conversion guide .) However, it's worth discussing the placement of the RMSNorm layer. The original transformer (from the " Attention is all you need " paper) placed the two normalization layers in the transformer block after the attention module and the FeedForward module, respectively. This is also known as Post-LN or Post-Norm. GPT and most other LLMs that came after placed the normalization layers before the attention and FeedForward modules, which is known as Pre-LN or Pre-Norm. A comparison between Post- and Pre-Norm is shown in the figure below. Figure 8: A comparison of Post-Norm, Pre-Norm, and OLMo 2's flavor of Post-Norm. In 2020, Xiong et al. showed that Pre-LN results in more well-behaved gradients at initialization. Furthermore, the researchers mentioned that Pre-LN even works well without careful learning rate warm-up, which is otherwise a crucial tool for Post-LN. Now, the reason I am mentioning that is that OLMo 2 adopted a form of Post-LN (but with RMSNorm instead of LayerNorm, so I am calling it Post-Norm ). In OLMo 2, instead of placing the normalization layers before the attention and FeedForward layers, they place them after, as shown in the figure above. However, notice that in contrast to the original transformer architecture, the normalization layers are still inside the residual layers (skip connections). So, why did they move the position of the normalization layers? The reason is that it helped with training stability, as shown in the figure below. Figure 9: A plot showing the training stability for Pre-Norm (like in GPT-2, Llama 3, and many others) versus OLMo 2's flavor of Post-Norm. This is an annotated figure from the OLMo 2 paper, https://arxiv.org/abs/2501.00656 Unfortunately this figure shows the results of the reordering together with QK-Norm, which is a separate concept. So, it’s hard to tell how much the normalization layer reordering contributed by itself. Since the previous section already mentioned the QK-norm, and other LLMs we discuss later, such as Gemma 2 and Gemma 3, also use QK-norm, let's briefly discuss what this is. QK-Norm is essentially yet another RMSNorm layer. It's placed inside the Multi-Head Attention (MHA) module and applied to the queries (q) and keys (k) before applying RoPE. To illustrate this, below is an excerpt of a Grouped-Query Attention (GQA) layer I wrote for my Qwen3 from-scratch implementation (the QK-norm application in GQA is similar to MHA in OLMo): As mentioned earlier, together with Post-Norm, QK-Norm stabilizes the training. Note that QK-Norm was not invented by OLMo 2 but goes back to the 2023 Scaling Vision Transformers paper . In short, the noteworthy OLMo 2 architecture design decisions are primarily the RMSNorm placements: RMSNorm after instead of before the attention and FeedForward modules (a flavor of Post-Norm), as well as the addition of RMSNorm for the queries and keys inside the attention mechanism (QK-Norm), which both, together, help stabilize the training loss. Below is a figure that further compares OLMo 2 to Llama 3 side by side; as one can see, the architectures are otherwise relatively similar except for the fact that OLMo 2 still uses the traditional MHA instead of GQA. (However, the OLMo 2 team released a 32B variant 3 months later that uses GQA.) Figure 10: An architecture comparison between Llama 3 and OLMo 2. Google's Gemma models have always been really good, and I think they have always been a bit underhyped compared to other popular models, like the Llama series. One of the distinguishing aspects of Gemma is the rather large vocabulary size (to support multiple languages better), and the stronger focus on the 27B size (versus 8B or 70B). But note that Gemma 2 also comes in smaller sizes: 1B, 4B, and 12B. The 27B size hits a really nice sweet spot: it's much more capable than an 8B model but not as resource-intensive as a 70B model, and it runs just fine locally on my Mac Mini. So, what else is interesting in Gemma 3 ? As discussed earlier, other models like Deepseek-V3/R1 use a Mixture-of-Experts (MoE) architecture to reduce memory requirements at inference, given a fixed model size. (The MoE approach is also used by several other models we will discuss later.) Gemma 3 uses a different "trick" to reduce computational costs, namely sliding window attention. With sliding window attention (originally introduced in the LongFormer paper in 2020 and also already used by Gemma 2 ), the Gemma 3 team was able to reduce the memory requirements in the KV cache by a substantial amount, as shown in the figure below. Figure 11: An annotated figure from Gemma 3 paper (https://arxiv.org/abs/2503.19786) showing the KV cache memory savings via sliding window attention. So, what is sliding window attention? If we think of regular self-attention as a global attention mechanism, since each sequence element can access every other sequence element, then we can think of sliding window attention as local attention, because here we restrict the context size around the current query position. This is illustrated in the figure below. Figure 12: A comparison between regular attention (left) and sliding window attention (right). Please note that sliding window attention can be used with both Multi-Head Attention and Grouped-Query Attention; Gemma 3 uses grouped-query attention. As mentioned above, sliding window attention is also referred to as local attention because the local window surrounds and moves with the current query position. In contrast, regular attention is global as each token can access all other tokens. Now, as briefly mentioned above, the Gemma 2 predecessor architecture also used sliding window attention before. The difference in Gemma 3 is that they adjusted the ratio between global (regular) and local (sliding) attention. For instance, Gemma 2 uses a hybrid attention mechanism that combines sliding window (local) and global attention in a 1:1 ratio. Each token can attend to a 4k-token window of nearby context. Where Gemma 2 used sliding window attention in every other layer, Gemma 3 now has a 5:1 ratio, meaning there's only 1 full attention layer for every 5 sliding windows (local) attention layers; moreover, the sliding window size was reduced from 4096 (Gemma 2) to just 1024 (Gemma 3). This shifts the model's focus towards more efficient, localized computations. According to their ablation study, the use of sliding window attention has minimal impact on modeling performance, as shown in the figure below. Figure 13: An annotated figure from Gemma 3 paper (https://arxiv.org/abs/2503.19786) showing that sliding window attention has little to no impact on the LLM-generated output perplexity. While sliding window attention is the most notable architecture aspect of Gemma 3, I want to also briefly go over the placement of the normalization layers as a follow-up to the previous OLMo 2 section. A small but interesting tidbit to highlight is that Gemma 3 uses RMSNorm in both a Pre-Norm and Post-Norm setting around its grouped-query attention module. This is similar to Gemma 2 but still worth highlighting, as it differs from (1) the Post-Norm used in the original transformer (“Attention is all you need”), (2) the Pre-Norm, which was popularized by GPT-2 and used in many other architectures afterwards, and (3) the Post-Norm flavor in OLMo 2 that we saw earlier. Figure 14: An architecture comparison between OLMo2 and Gemma 3; note the additional normalization layers in Gemma 3. I think this normalization layer placement is a relatively intuitive approach as it gets the best of both worlds: Pre-Norm and Post-Norm. In my opinion, a bit of extra normalization can't hurt. In the worst case, if the extra normalization is redundant, this adds a bit of inefficiency through redundancy. In practice, since RMSNorm is relatively cheap in the grand scheme of things, this shouldn't have any noticeable impact, though. Gemma 3 is a well-performing open-weight LLM that, in my opinion, is a bit underappreciated in the open-source circles. The most interesting part is the use of sliding window attention to improve efficiency (it will be interesting to combine it with MoE in the future). Also, Gemma 3 has a unique normalization layer placement, placing RMSNorm layers both before and after the attention and FeedForward modules. A few months after the Gemma 3 release, Google shared Gemma 3n , which is a Gemma 3 model that has been optimized for small-device efficiency with the goal of running on phones. One of the changes in Gemma 3n to achieve better efficiency is the so-called Per-Layer Embedding (PLE) parameters layer. The key idea here is to keep only a subset of the model's parameters in GPU memory. Token-layer specific embeddings, such as those for text, audio, and vision modalities, are then streamed from the CPU or SSD on demand. The figure below illustrates the PLE memory savings, listing 5.44 billion parameters for a standard Gemma 3 model. This likely refers to the Gemma 3 4-billion variant. Figure 15: An annotated figure from Google's Gemma 3n blog (https://developers.googleblog.com/en/introducing-gemma-3n/) illustrating the PLE memory savings. The 5.44 vs. 4 billion parameter discrepancy is because Google has an interesting way of reporting parameter counts in LLMs. They often exclude embedding parameters to make the model appear smaller, except in cases like this, where it is convenient to include them to make the model appear larger. This is not unique to Google, as this approach has become a common practice across the field. Another interesting trick is the MatFormer concept (short for Matryoshka Transformer). For instance, Gemma 3n uses a single shared LLM (transformer) architecture that can be sliced into smaller, independently usable models. Each slice is trained to function on its own, so at inference time, we can run just the part you need (instead of the large model). Mistral Small 3.1 24B , which was released in March shortly after Gemma 3, is noteworthy for outperforming Gemma 3 27B on several benchmarks (except for math) while being faster. The reasons for the lower inference latency of Mistral Small 3.1 over Gemma 3 are likely due to their custom tokenizer, as well as shrinking the KV cache and layer count. Otherwise, it's a standard architecture as shown in the figure below. Figure 16: An architecture comparison between Gemma 3 27B and Mistral 3.1 Small 24B. Interestingly, earlier Mistral models had utilized sliding window attention, but they appear to have abandoned it in Mistral Small 3.1 if we consider the default setting ( ) in the official Model Hub configuration file . Also, the model card makes no mention of it. So, since Mistral uses regular Grouped-Query Attention instead of Grouped-Query Attention with a sliding window as in Gemma 3, maybe there are additional inference compute savings due to being able to use more optimized code (i.e., FlashAttention). For instance, I speculate that while sliding window attention reduces memory usage, it doesn't necessarily reduce inference latency, which is what Mistral Small 3.1 is focused on. The extensive introductory discussion on Mixture-of-Experts (MoE) earlier in this article pays off again. Llama 4 has also adopted an MoE approach and otherwise follows a relatively standard architecture that is very similar to DeepSeek-V3, as shown in the figure below. (Llama 4 includes native multimodal support, similar to models like Gemma and Mistral. However, since this article focuses on language modeling, we only focus on the text model.) Figure 17: An architecture comparison between DeepSeek V3 (671-billion parameters) and Llama 4 Maverick (400-billion parameters). While the Llama 4 Maverick architecture looks very similar to DeepSeek-V3 overall, there are some interesting differences worth highlighting. First, Llama 4 uses Grouped-Query Attention similar to its predecessors, whereas DeepSeek-V3 uses Multi-Head Latent Attention, which we discussed at the beginning of this article. Now, both DeepSeek-V3 and Llama 4 Maverick are very large architectures, with DeepSeek-V3 being approximately 68% larger in its total parameter count. However, with 37 billion active parameters, DeepSeek-V3 has more than twice as many active parameters as Llama 4 Maverick (17B). Llama 4 Maverick uses a more classic MoE setup with fewer but larger experts (2 active experts with 8,192 hidden size each) compared to DeepSeek-V3 (9 active experts with 2,048 hidden size each). Also, DeepSeek uses MoE layers in each transformer block (except the first 3), whereas Llama 4 alternates MoE and dense modules in every other transformer block. Given the many small differences between architectures, it is difficult to determine their exact impact on final model performance. The main takeaway, however, is that MoE architectures have seen a significant rise in popularity in 2025. The Qwen team consistently delivers high-quality open-weight LLMs. When I helped co-advising the LLM efficiency challenge at NeurIPS 2023, I remember that the top winning solutions were all Qwen2-based. Now, Qwen3 is another hit model series at the top of the leaderboards for their size classes. There are 7 dense models: 0.6B, 1.7B, 4B, 8B, 14B, and 32B. And there are 2 MoE models: 30B-A3B, and 235B-A22B. (By the way, note that the missing whitespace in "Qwen3" is not a typo; I simply try to preserve the original spelling the Qwen developers chose.) Let's discuss the dense model architecture first. As of this writing, the 0.6B model may well be the smallest current-generation open-weight model out there. And based on my personal experience, it performs really well given its small size. It has great token/sec throughput and a low memory footprint if you are planning to run it locally. But what's more, it's also easy to train locally (for educational purposes) due to its small size. So, Qwen3 0.6B has replaced Llama 3 1B for me for most purposes. A comparison between these two architectures is shown below. Figure 18: An architecture comparison between Qwen3 0.6B and Llama 3 1B; notice that Qwen3 is a deeper architecture with more layers, whereas Llama 3 is a wider architecture with more attention heads. If you are interested in a human-readable Qwen3 implementation without external third-party LLM library dependencies, I recently implemented Qwen3 from scratch (in pure PyTorch) . The computational performance numbers in the figure above are based on my from-scratch PyTorch implementations when run on an A100 GPU. As one can see, Qwen3 has a smaller memory footprint as it is a smaller architecture overall, but also uses smaller hidden layers and fewer attention heads. However, it uses more transformer blocks than Llama 3, which leads to a slower runtime (lower tokens/sec generation speed). As mentioned earlier, Qwen3 also comes in two MoE flavors: 30B-A3B and 235B-A22B. Why do some architectures, like Qwen3, come as regular (dense) and MoE (sparse) variants? As mentioned at the beginning of this article, MoE variants help reduce inference costs for large base models. Offering both dense and MoE versions gives users flexibility depending on their goals and constraints. Dense models are typically more straightforward to fine-tune, deploy, and optimize across various hardware. On the other hand, MoE models are optimized for scaling inference. For instance, at a fixed inference budget, they can achieve a higher overall model capacity (i.e., knowledge uptake during training due to being larger) without proportionally increasing inference costs. By releasing both types, the Qwen3 series can support a broader range of use cases: dense models for robustness, simplicity, and fine-tuning, and MoE models for efficient serving at scale. To round up this section, let's look at Qwen3 235B-A22B (note that the A22B stands for "22B active parameters) to DeepSeek-V3, which has almost twice as many active parameters (37B). Figure 19: An architecture comparison between DeepSeek-V3 and Qwen3 235B-A22B. As shown in the figure above, the DeepSeek-V3 and Qwen3 235B-A22B architectures are remarkably similar. What's noteworthy, though, is that the Qwen3 model moved away from using a shared expert (earlier Qwen models, such as Qwen2.5-MoE did use a shared expert). Unfortunately, the Qwen3 team did not disclose any reason as to why they moved away from shared experts. If I had to guess, it was perhaps simply not necessary for training stability for their setup when they increased the experts from 2 (in Qwen2.5-MoE) to 8 (in Qwen3). And then they were able to save the extra compute/memory cost by using only 8 instead of 8+1 experts. (However, this doesn't explain why DeepSeek-V3 is still keeping their shared expert.) Update. Junyang Lin , one of the developers of Qwen3, responded as follows: At that moment we did not find significant enough improvement on shared expert and we were worrying about the optimization for inference caused by shared expert. No straight answer to this question honestly. SmolLM3 is perhaps not as nearly as popular as the other LLMs covered in this article, but I thought it is still an interesting model to include as it offers really good modeling performance at a relatively small and convenient 3-billion parameter model size that sits between the 1.7B and 4B Qwen3 model, as shown in the figure below. Moreover, it also shared a lot of the training details, similar to OLMo, which is rare and always appreciated! Figure 20: An annotated figure from the SmolLM3 announcement post, https://huggingface.co/blog/smollm3, comparing the SmolLM3 win rate to Qwen3 1.7B and 4B as well as Llama 3 3B and Gemma 3 4B. As shown in the architecture comparison figure below, the SmolLM3 architecture looks fairly standard. The perhaps most interesting aspect is its use of NoPE (No Positional Embeddings), though. Figure 21: A side-by-side architecture comparison between Qwen3 4B and SmolLM3 3B. NoPE is, in LLM contexts, an older idea that goes back to a 2023 paper ( The Impact of Positional Encoding on Length Generalization in Transformers ) to remove explicit positional information injection (like through classic absolute positional embedding layers in early GPT architectures or nowadays RoPE). In transformer-based LLMs, positional encoding is typically necessary because self-attention treats tokens independently of order. Absolute position embeddings solve this by adding an additional embedding layer that adds information to the token embeddings. Figure 22: A modified figure from my Build A Large Language Model (From Scratch) book (https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167) illustrating absolute positional embeddings. RoPE, on the other hand, solves this by rotating the query and key vectors relative to their token position. In NoPE layers, however, no such positional signal is added at all: not fixed, not learned, not relative. Nothing. Even though there is no positional embedding, the model still knows which tokens come before, thanks to the causal attention mask. This mask prevents each token from attending to future ones. As a result, a token at position t can only see tokens at positions ≤ t , which preserves the autoregressive ordering. So while there is no positional information that is explicitly added, there is still an implicit sense of direction baked into the model's structure, and the LLM, in the regular gradient-descent-based training, can learn to exploit it if it finds it beneficial for the optimization objective. (Check out the NoPE paper's theorems for more information.) So, overall, the NoPE paper not only found that no positional information injection is necessary, but it also found that NoPE has better length generalization, which means that LLM answering performance deteriorates less with increased sequence length, as shown in the figure below. Figure 23: An annotated figure from the NoPE paper (https://arxiv.org/abs/2305.19466) showing better length generalization with NoPE. Note that the experiments shown above were conducted with a relatively small GPT-style model of approximately 100 million parameters and relatively small context sizes. It is unclear how well these findings generalize to larger, contemporary LLMs. For this reason, the SmolLM3 team likely only "applied" NoPE (or rather omitted RoPE) in every 4th layer. Kimi K2 recently made big waves in the AI community due to being an open-weight model with an incredibly good performance. According to benchmarks, it's on par with the best proprietary models like Google's Gemini, Anthropic's Claude, and OpenAI's ChatGPT models. A notable aspect is its use of a variant of the relatively new Muon optimizer over AdamW. As far as I know, this is the first time Muon was used over AdamW for any production model of this size ( previously , it has only been shown to scale up to 16B). This resulted in very nice training loss curves, which probably helped catapult this model to the top of the aforementioned benchmarks. While people commented that the loss was exceptionally smooth (due to the lack of spikes), I think it's not exceptionally smooth (e.g., see the OLMo 2 loss curve in the figure below; also, the L2 norm of the gradient would probably be a better metric to track training stability). However, what's remarkable is how well the loss curve decays. However, as mentioned in the introduction of this article, training methodologies are a topic for another time. Figure 24: Annotated figures from the Kimi K2 announcement blog article (https://moonshotai.github.io/Kimi-K2/) and the OLMo 2 paper (https://arxiv.org/abs/2305.19466). The model itself is 1 trillion parameters large, which is truly impressive. It may be the biggest LLM of this generation as of this writing (given the constraints that Llama 4 Behemoth is not released, proprietary LLMs don't count, and Google's 1.6 trillion Switch Transformer is an encoder-decoder architecture from a different generation). It's also coming full circle as Kimi K2 uses the DeepSeek-V3 architecture we covered at the beginning of this article except they made it larger, as shown in the figure below. Figure 25: An architecture comparison between DeepSeek V3 and Kimi K2. As shown in the figure above, Kimi K2 is basically the same as DeepSeek V3, except that it uses more experts in the MoE modules and fewer heads in the Multi-head Latent Attention (MLA) module. Kimi K2 is not coming out of nowhere. The earlier Kimi 1.5 model discussed in the Kimi k1.5: Scaling Reinforcement Learning with LLMs paper , was impressive as well. However, it had the bad luck that the DeepSeek R1 model paper was published on exactly the same date on January 22nd. Moreover, as far as I know, the Kimi 1.5 weights were never publicly shared. So, most likely the Kimi K2 team took these lessons to heart and shared Kimi K2 as an open-weight model, before DeepSeek R2 was released. As of this writing, Kimi K2 is the most impressive open-weight model. OpenAI’s released gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019, about one week after I wrote this article. Since OpenAI’s open-weight models have been so widely anticipated, I updated this article to include them. I will keep this section brief, but I have written another, much more detailed article dedicated to the gpt-oss models here: Before summarizing the interesting tidbits, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, as shown in Figure 26 below. Figure 26: Architecture overview of the two gpt-oss models. Looking at Figure 26, the architecture contains all the familiar components we have seen in other architectures discussed previously. For instance, Figure 27 puts the smaller gpt-oss architecture next to Qwen3 30B-A3B, which is also an MoE model with a similar number of active parameters (gpt-oss has 3.6B active parameters, and Qwen3 30B-A3B has 3.3B). Figure 27: Architecture comparison between gpt-oss and Qwen3 One aspect not shown in Figure 27 is that gpt-oss uses sliding window attention (similar to Gemma 3, but in every other layer instead of using a 5:1 ratio). Figure 27 shows that gpt-oss and Qwen3 use similar components. But if we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24. On the other hand, gpt-oss is a much wider architecture: An embedding dimension of 2880 instead of 2048 An intermediate expert (feed forward) projection dimension of also 2880 instead of 768 It's also worth noting that gpt-oss uses twice as many attention heads, but this doesn't directly increase the model's width. The width is determined by the embedding dimension. Does one approach offer advantages over the other given a fixed number of parameters? As a rule of thumb, deeper models have more flexibility but can be harder to train due to instability issues, due to exploding and vanishing gradients (which RMSNorm and shortcut connections aim to mitigate). Wider architectures have the advantage of being faster during inference (with a higher tokens/second throughput) due to better parallelization at a higher memory cost. When it comes to modeling performance, there's unfortunately no good apples-to-apples comparison I am aware of (where parameter size and datasets are kept constant) except for an ablation study in the Gemma 2 paper (Table 9) , which found that for a 9B parameter architecture, a wider setup is slightly better than a deeper setup. Across 4 benchmarks, the wider model achieved a 52.0 average score, and the deeper model achieved a 50.8 average score. As shown in Figure 27 above, it's also noteworthy that gpt-oss has a surprisingly small number of experts (32 instead of 128), and only uses 4 instead of 8 active experts per token. However, each expert is much larger than the experts in Qwen3. This is interesting because the recent trends and developments point towards more, smaller models as being beneficial. This change, at a constant total parameter size, is nicely illustrated in Figure 28 below from the DeepSeekMoE paper. Figure 28: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts, though. Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in each second layer, as mentioned earlier. However, there's one interesting detail that caught my eye. It seems that gpt-oss uses bias units for the attention weights, as shown in Figure 29 below. Figure 29: gpt-oss models use bias units in the attention layers. See code example here. I haven't seen these bias units being used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation ( ). Furthermore, the empirical results show that there is little difference between with and without bias units (see Figure 30 below). Figure 30: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units. Another detail you may have noticed is the definition of in the code screenshot in Figure 30. In general models, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. I.e., if the context gets very long, this special attended token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think it was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.) In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 31). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs. Figure 31: The use of attention sinks in gpt-oss; based on the Hugging Face code here . For more information about gpt-oss, and how it compares to GPT-2, please see my other gpt-oss article: A few weeks after this article first went online, xAI released the weights of their 270B-parameter Grok 2.5 model. I thought it would be worth including here, since Grok 2.5 was xAI's flagship production model last year. Up to this point, all models we discussed were released as open-weight models from the start. For example, gpt-oss is likely not an open-weight clone of GPT-4 but rather a custom model trained specifically for the open-source community. With Grok 2.5, we get a rare look at a real production system, even if it is last year's. Architecturally, Grok 2.5 looks fairly standard overall (Figure 32), but there are a few noteworthy details. Figure 32: Grok 2.5 next to a Qwen3 model of comparable size For instance, Grok 2.5 uses a small number of large experts (eight), which reflects an older trend. As discussed earlier, more recent designs such as those in the DeepSeekMoE paper favor a larger number of smaller experts (this is also present in Qwen3). Another interesting choice is the use of what amounts to a shared expert. The additional SwiGLU module shown on the left in Figure 32 functions as an always-on, shared expert. It is not identical to the classic shared-expert design since its intermediate dimension is doubled, but the idea is the same. (I still find it interesting that Qwen3 omitted shared experts, and it will be interesting to see if that changes with Qwen4 and later models.) GLM-4.5 is another major release this year. It is an instruction/reasoning hybrid similar to Qwen3, but even better optimized for function calling and agent-style contexts. Figure 33: GLM-4.5 benchmark from the official GitHub repository at https://github.com/zai-org/GLM-4.5 As shown in Figure 34, GLM-4.5 comes in two variants. The flagship 355-billion-parameter model outperforms Claude 4 Opus on average across 12 benchmarks and trails only slightly behind OpenAI’s o3 and xAI’s Grok 4. There is also GLM-4.5-Air, a more compact 106-billion-parameter version that delivers performance only marginally below the 355-billion model. Figure 35 compares the 355-billion architecture to Qwen3. Figure 34: GLM-4.5 next to a similarly-sized Qwen3 model. The designs are largely similar, but GLM-4.5 adopts a structural choice first introduced by DeepSeek V3: 3 dense layers precede the Mixture-of-Experts (MoE) blocks. Why? Starting with several dense layers improves convergence stability and overall performance in large MoE systems. If MoE routing is introduced immediately, the instability of sparse expert selection can interfere with early syntactic and semantic feature extraction. So, one might say that by keeping the initial layers dense ensures the model forms stable low-level representations before routing decisions begin to shape higher-level processing. Also, GLM-4.5 uses a shared expert similar to DeepSeek-V3 (and unlike Qwen3). (Interestingly, GLM-4.5 also retains the attention bias mechanism used in GPT-2 and gpt-oss.) On 11 September 2025, the Qwen3 team released Qwen3 Next 80B-A3B (Figure 35), available in both Instruct and Thinking variants. While its design builds on the previously discussed Qwen3 architecture, I included it here as a separate entry to keep the figure numbering consistent and to draw attention to some of its design changes. The new Qwen3 Next architecture stands out because, despite being 3× smaller than the previous 235B-A22B model (Figure 35), it introduces four times as many experts and even adds a shared expert. Both of these design choices (a high expert count and the inclusion of a shared expert) were future directions I had highlighted prior to this release, particularly in the video version of the article that I linked at the top. Figure 35: The original Qwen3 model released in May (left) next to the Qwen3 Next model released in September (right). The other highlight is that they replace the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) So how does this new attention hybrid work? Compared to grouped‑query attention (GQA), which is still standard scaled dot‑product attention (sharing K/V across query‑head groups to cut KV‑cache size and memory bandwidth as discussed earlier but whose decode cost and cache still grow with sequence length), their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks with in a 3:1 ratio as shown in Figure 36. Figure 36: The Gated DeltaNet + Gated Attention hybrid mechanism. Note that these are arranges in a 3:1 ratio, meaning that 3 transformer blocks with Gated DeltaNet are followed by 1 transformer block with Gated Attention. The right subfigure is from the official Qwen3 blog: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list We can think of the gated attention block as standard scaled-dot-product attention that can be used in GQA, but it has a few tweaks on top. The main differences between gated attention and plain GQA block are: an output gate (sigmoid-controlled, usually per-channel) that scales the attention result before it is added back to the residual; zero-centered RMSNorm for QKNorm, rather than a standard RMSNorm; partial RoPE (on a subset of dimensions). Note that these are essentially just stability changes to GQA. The Gated DeltaNet is a more significant change. In the DeltaNet block, q, k, v and two gates (α, β) are produced by linear and lightweight convolutional layers with normalization, and the layer replaces attention with a fast‑weight delta rule update. However, the tradeoff is that DeltaNet offers less precise content‑based retrieval than full attention, which is why one gated attention layer remains. Given that attention grows quadratically, the DeltaNet component was added to help with memory efficiency. In the "linear-time, cache-free" family, the DeltaNet block is a essentially an alternative to Mamba. Mamba keeps a state with a learned state-space filter (essentially a dynamic convolution over time). DeltaNet keeps a tiny fast-weight memory updated with α and β and reads it with q, with small convolutions only used only to help form q, k, v, α, β. The two subsections above describe two design decisions geared towards efficiency. Since all good things come in threes, the Qwen3 added another technique on top: Multi-Token Prediction (MTP). Multi-token prediction trains the LLM to predict several future tokens, instead of a single one, at each step. Here, at each position t , small extra heads (linear layers) output logits for t+1...t+k , and we sum cross-entropy losses for these offsets (in the MTP paper the researchers recommended k=4 ). This additional signal speeds up training, and inference may remains one token at a time. However, the extra heads can be used in speculative multi-token decoding, which is what Qwen3-Next seems to do, however, the details are still a bit sparse: Qwen3-Next introduces a native Multi-Token Prediction (MTP) mechanism, which not only yields an MTP module with a high acceptance rate for Speculative Decoding but also enhances the overall performance.Additionally, Qwen3-Next specifically optimizes the multi-step inference performance of MTP, further improving the acceptance rate of Speculative Decoding in real scenarios through multi-step training that maintains consistency between training and inference. Souce: Qwen3-Next blog post After all these years, LLM releases remain exciting, and I am curious to see what's next! This magazine is a personal passion project, and your support helps keep it alive. If you’d like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch) . (I’m confident you’ll get a lot out of these; they explain how LLMs work in depth you won’t find elsewhere.) Thanks for reading, and for helping support independent research! Build a Large Language Model (From Scratch) is now available on Amazon . Build a Reasoning Model (From Scratch) is in Early Access at Manning . If you read the book and have a few minutes to spare, I’d really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Figure 1: A subset of the architectures covered in this article. So, in this article, rather than writing about benchmark performance or training algorithms, I will focus on the architectural developments that define today's flagship open models. (As you may remember, I wrote about multimodal LLMs not too long ago; in this article, I will focus on the text capabilities of recent models and leave the discussion of multimodal capabilities for another time.) Tip: This is a fairly comprehensive article, so I recommend using the navigation bar to access the table of contents (just hover over the left side of the Substack page). Optional: The video below is a narrated and abridged version of this article. 1. DeepSeek V3/R1 As you have probably heard more than once by now, DeepSeek R1 made a big impact when it was released in January 2025. DeepSeek R1 is a reasoning model built on top of the DeepSeek V3 architecture , which was introduced in December 2024. While my focus here is on architectures released in 2025, I think it’s reasonable to include DeepSeek V3, since it only gained widespread attention and adoption following the launch of DeepSeek R1 in 2025. If you are interested in the training of DeepSeek R1 specifically, you may also find my article from earlier this year useful: In this section, I’ll focus on two key architectural techniques introduced in DeepSeek V3 that improved its computational efficiency and distinguish it from many other LLMs: Multi-Head Latent Attention (MLA) Mixture-of-Experts (MoE) Figure 2: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries. So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here .) While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper ) show it performs comparably to standard MHA in terms of LLM modeling performance. Now, Multi-Head Latent Attention (MLA) offers a different memory-saving strategy that also pairs particularly well with KV caching. Instead of sharing key and value heads like GQA, MLA compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache. At inference time, these compressed tensors are projected back to their original size before being used, as shown in the Figure 3 below. This adds an extra matrix multiplication but reduces memory usage. Figure 3: Comparison between MLA (used in DeepSeek V3 and R1) and regular MHA. (As a side note, the queries are also compressed, but only during training, not inference.) By the way, MLA is not new in DeepSeek V3, as its DeepSeek-V2 predecessor also used (and even introduced) it. Also, the V2 paper contains a few interesting ablation studies that may explain why the DeepSeek team chose MLA over GQA (see Figure 4 below). Figure 4: Annotated tables from the DeepSeek-V2 paper, https://arxiv.org/abs/2405.04434 As shown in Figure 4 above, GQA appears to perform worse than MHA, whereas MLA offers better modeling performance than MHA, which is likely why the DeepSeek team chose MLA over GQA. (It would have been interesting to see the "KV Cache per Token" savings comparison between MLA and GQA as well!) To summarize this section before we move on to the next architecture component, MLA is a clever trick to reduce KV cache memory use while even slightly outperforming MHA in terms of modeling performance. 1.2 Mixture-of-Experts (MoE) The other major architectural component in DeepSeek worth highlighting is its use of Mixture-of-Experts (MoE) layers. While DeepSeek did not invent MoE, it has seen a resurgence this year, and many of the architectures we will cover later also adopt it. You are likely already familiar with MoE, but a quick recap may be helpful. The core idea in MoE is to replace each FeedForward module in a transformer block with multiple expert layers, where each of these expert layers is also a FeedForward module. This means that we swap a single FeedForward block for multiple FeedForward blocks, as illustrated in the Figure 5 below. Figure 5: An illustration of the Mixture-of-Experts (MoE) module in DeepSeek V3/R1 (right) compared to an LLM with a standard FeedForward block (left). The FeedForward block inside a transformer block (shown as the dark gray block in the figure above) typically contains a large number of the model's total parameters. (Note that the transformer block, and thereby the FeedForward block, is repeated many times in an LLM; in the case of DeepSeek-V3, 61 times.) So, replacing a single FeedForward block with multiple FeedForward blocks (as done in a MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. (In the interest of time, or rather article space, I'll cover the router in more detail another time.) Because only a few experts are active at a time, MoE modules are often referred to as sparse , in contrast to dense modules that always use the full parameter set. However, the large total number of parameters via an MoE increases the capacity of the LLM, which means it can take up more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. For example, DeepSeek-V3 has 256 experts per MoE module and a total of 671 billion parameters. Yet during inference, only 9 experts are active at a time (1 shared expert plus 8 selected by the router). This means just 37 billion parameters are used per inference step as opposed to all 671 billion. One notable feature of DeepSeek-V3's MoE design is the use of a shared expert. This is an expert that is always active for every token. This idea is not new and was already introduced in the DeepSeek 2024 MoE and 2022 DeepSpeedMoE paper s. Figure 6: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 The benefit of having a shared expert was first noted in the DeepSpeedMoE paper , where they found that it boosts overall modeling performance compared to no shared experts. This is likely because common or repeated patterns don't have to be learned by multiple individual experts, which leaves them with more room for learning more specialized patterns. 1.3 DeepSeek Summary To summarize, DeepSeek-V3 is a massive 671-billion-parameter model that, at launch, outperformed other open-weight models, including the 405B Llama 3. Despite being larger, it is much more efficient at inference time thanks to its Mixture-of-Experts (MoE) architecture, which activates only a small subset of (just 37B) parameters per token. Another key distinguishing feature is DeepSeek-V3's use of Multi-Head Latent Attention (MLA) instead of Grouped-Query Attention (GQA). Both MLA and GQA are inference-efficient alternatives to standard Multi-Head Attention (MHA), particularly when using KV caching. While MLA is more complex to implement, a study in the DeepSeek-V2 paper has shown it delivers better modeling performance than GQA. 2. OLMo 2 The OLMo series of models by the non-profit Allen Institute for AI is noteworthy due to its transparency in terms of training data and code, as well as the relatively detailed technical reports. While you probably won’t find OLMo models at the top of any benchmark or leaderboard, they are pretty clean and, more importantly, a great blueprint for developing LLMs, thanks to their transparency. And while OLMo models are popular because of their transparency, they are not that bad either. In fact, at the time of release in January (before Llama 4, Gemma 3, and Qwen 3), OLMo 2 models were sitting at the Pareto frontier of compute to performance, as shown in Figure 7 below. Figure 7: Modeling benchmark performance (higher is better) vs pre-training cost (FLOPs; lower is better) for different LLMs. This is an annotated figure from the OLMo 2 paper, https://arxiv.org/abs/2501.00656 As mentioned earlier in this article, I aim to focus only on the LLM architecture details (not training or data) to keep it at a manageable length. So, what were the interesting architectural design choices in OLMo2 ? It mainly comes down to normalizations: the placement of RMSNorm layers as well as the addition of a QK-norm, which I will discuss below. Another thing worth mentioning is that OLMo 2 still uses traditional Multi-Head Attention (MHA) instead of MLA or GQA. 2.1 Normalization Layer Placement Overall, OLMo 2 largely follows the architecture of the original GPT model, similar to other contemporary LLMs. However, there are some noteworthy deviations. Let's start with the normalization layers. Similar to Llama, Gemma, and most other LLMs, OLMo 2 switched from LayerNorm to RMSNorm. But since RMSNorm is old hat (it's basically a simplified version of LayerNorm with fewer trainable parameters), I will skip the discussion of RMSNorm vs LayerNorm. (Curious readers can find an RMSNorm code implementation in my GPT-2 to Llama conversion guide .) However, it's worth discussing the placement of the RMSNorm layer. The original transformer (from the " Attention is all you need " paper) placed the two normalization layers in the transformer block after the attention module and the FeedForward module, respectively. This is also known as Post-LN or Post-Norm. GPT and most other LLMs that came after placed the normalization layers before the attention and FeedForward modules, which is known as Pre-LN or Pre-Norm. A comparison between Post- and Pre-Norm is shown in the figure below. Figure 8: A comparison of Post-Norm, Pre-Norm, and OLMo 2's flavor of Post-Norm. In 2020, Xiong et al. showed that Pre-LN results in more well-behaved gradients at initialization. Furthermore, the researchers mentioned that Pre-LN even works well without careful learning rate warm-up, which is otherwise a crucial tool for Post-LN. Now, the reason I am mentioning that is that OLMo 2 adopted a form of Post-LN (but with RMSNorm instead of LayerNorm, so I am calling it Post-Norm ). In OLMo 2, instead of placing the normalization layers before the attention and FeedForward layers, they place them after, as shown in the figure above. However, notice that in contrast to the original transformer architecture, the normalization layers are still inside the residual layers (skip connections). So, why did they move the position of the normalization layers? The reason is that it helped with training stability, as shown in the figure below. Figure 9: A plot showing the training stability for Pre-Norm (like in GPT-2, Llama 3, and many others) versus OLMo 2's flavor of Post-Norm. This is an annotated figure from the OLMo 2 paper, https://arxiv.org/abs/2501.00656 Unfortunately this figure shows the results of the reordering together with QK-Norm, which is a separate concept. So, it’s hard to tell how much the normalization layer reordering contributed by itself. 2.2 QK-Norm Since the previous section already mentioned the QK-norm, and other LLMs we discuss later, such as Gemma 2 and Gemma 3, also use QK-norm, let's briefly discuss what this is. QK-Norm is essentially yet another RMSNorm layer. It's placed inside the Multi-Head Attention (MHA) module and applied to the queries (q) and keys (k) before applying RoPE. To illustrate this, below is an excerpt of a Grouped-Query Attention (GQA) layer I wrote for my Qwen3 from-scratch implementation (the QK-norm application in GQA is similar to MHA in OLMo): As mentioned earlier, together with Post-Norm, QK-Norm stabilizes the training. Note that QK-Norm was not invented by OLMo 2 but goes back to the 2023 Scaling Vision Transformers paper . 2.3 OLMo 2 Summary In short, the noteworthy OLMo 2 architecture design decisions are primarily the RMSNorm placements: RMSNorm after instead of before the attention and FeedForward modules (a flavor of Post-Norm), as well as the addition of RMSNorm for the queries and keys inside the attention mechanism (QK-Norm), which both, together, help stabilize the training loss. Below is a figure that further compares OLMo 2 to Llama 3 side by side; as one can see, the architectures are otherwise relatively similar except for the fact that OLMo 2 still uses the traditional MHA instead of GQA. (However, the OLMo 2 team released a 32B variant 3 months later that uses GQA.) Figure 10: An architecture comparison between Llama 3 and OLMo 2. 3. Gemma 3 Google's Gemma models have always been really good, and I think they have always been a bit underhyped compared to other popular models, like the Llama series. One of the distinguishing aspects of Gemma is the rather large vocabulary size (to support multiple languages better), and the stronger focus on the 27B size (versus 8B or 70B). But note that Gemma 2 also comes in smaller sizes: 1B, 4B, and 12B. The 27B size hits a really nice sweet spot: it's much more capable than an 8B model but not as resource-intensive as a 70B model, and it runs just fine locally on my Mac Mini. So, what else is interesting in Gemma 3 ? As discussed earlier, other models like Deepseek-V3/R1 use a Mixture-of-Experts (MoE) architecture to reduce memory requirements at inference, given a fixed model size. (The MoE approach is also used by several other models we will discuss later.) Gemma 3 uses a different "trick" to reduce computational costs, namely sliding window attention. 3.1 Sliding Window Attention With sliding window attention (originally introduced in the LongFormer paper in 2020 and also already used by Gemma 2 ), the Gemma 3 team was able to reduce the memory requirements in the KV cache by a substantial amount, as shown in the figure below. Figure 11: An annotated figure from Gemma 3 paper (https://arxiv.org/abs/2503.19786) showing the KV cache memory savings via sliding window attention. So, what is sliding window attention? If we think of regular self-attention as a global attention mechanism, since each sequence element can access every other sequence element, then we can think of sliding window attention as local attention, because here we restrict the context size around the current query position. This is illustrated in the figure below. Figure 12: A comparison between regular attention (left) and sliding window attention (right). Please note that sliding window attention can be used with both Multi-Head Attention and Grouped-Query Attention; Gemma 3 uses grouped-query attention. As mentioned above, sliding window attention is also referred to as local attention because the local window surrounds and moves with the current query position. In contrast, regular attention is global as each token can access all other tokens. Now, as briefly mentioned above, the Gemma 2 predecessor architecture also used sliding window attention before. The difference in Gemma 3 is that they adjusted the ratio between global (regular) and local (sliding) attention. For instance, Gemma 2 uses a hybrid attention mechanism that combines sliding window (local) and global attention in a 1:1 ratio. Each token can attend to a 4k-token window of nearby context. Where Gemma 2 used sliding window attention in every other layer, Gemma 3 now has a 5:1 ratio, meaning there's only 1 full attention layer for every 5 sliding windows (local) attention layers; moreover, the sliding window size was reduced from 4096 (Gemma 2) to just 1024 (Gemma 3). This shifts the model's focus towards more efficient, localized computations. According to their ablation study, the use of sliding window attention has minimal impact on modeling performance, as shown in the figure below. Figure 13: An annotated figure from Gemma 3 paper (https://arxiv.org/abs/2503.19786) showing that sliding window attention has little to no impact on the LLM-generated output perplexity. While sliding window attention is the most notable architecture aspect of Gemma 3, I want to also briefly go over the placement of the normalization layers as a follow-up to the previous OLMo 2 section. 3.2 Normalization Layer Placement in Gemma 3 A small but interesting tidbit to highlight is that Gemma 3 uses RMSNorm in both a Pre-Norm and Post-Norm setting around its grouped-query attention module. This is similar to Gemma 2 but still worth highlighting, as it differs from (1) the Post-Norm used in the original transformer (“Attention is all you need”), (2) the Pre-Norm, which was popularized by GPT-2 and used in many other architectures afterwards, and (3) the Post-Norm flavor in OLMo 2 that we saw earlier. Figure 14: An architecture comparison between OLMo2 and Gemma 3; note the additional normalization layers in Gemma 3. I think this normalization layer placement is a relatively intuitive approach as it gets the best of both worlds: Pre-Norm and Post-Norm. In my opinion, a bit of extra normalization can't hurt. In the worst case, if the extra normalization is redundant, this adds a bit of inefficiency through redundancy. In practice, since RMSNorm is relatively cheap in the grand scheme of things, this shouldn't have any noticeable impact, though. 3.3 Gemma 3 Summary Gemma 3 is a well-performing open-weight LLM that, in my opinion, is a bit underappreciated in the open-source circles. The most interesting part is the use of sliding window attention to improve efficiency (it will be interesting to combine it with MoE in the future). Also, Gemma 3 has a unique normalization layer placement, placing RMSNorm layers both before and after the attention and FeedForward modules. 3.4 Bonus: Gemma 3n A few months after the Gemma 3 release, Google shared Gemma 3n , which is a Gemma 3 model that has been optimized for small-device efficiency with the goal of running on phones. One of the changes in Gemma 3n to achieve better efficiency is the so-called Per-Layer Embedding (PLE) parameters layer. The key idea here is to keep only a subset of the model's parameters in GPU memory. Token-layer specific embeddings, such as those for text, audio, and vision modalities, are then streamed from the CPU or SSD on demand. The figure below illustrates the PLE memory savings, listing 5.44 billion parameters for a standard Gemma 3 model. This likely refers to the Gemma 3 4-billion variant. Figure 15: An annotated figure from Google's Gemma 3n blog (https://developers.googleblog.com/en/introducing-gemma-3n/) illustrating the PLE memory savings. The 5.44 vs. 4 billion parameter discrepancy is because Google has an interesting way of reporting parameter counts in LLMs. They often exclude embedding parameters to make the model appear smaller, except in cases like this, where it is convenient to include them to make the model appear larger. This is not unique to Google, as this approach has become a common practice across the field. Another interesting trick is the MatFormer concept (short for Matryoshka Transformer). For instance, Gemma 3n uses a single shared LLM (transformer) architecture that can be sliced into smaller, independently usable models. Each slice is trained to function on its own, so at inference time, we can run just the part you need (instead of the large model). 4. Mistral Small 3.1 Mistral Small 3.1 24B , which was released in March shortly after Gemma 3, is noteworthy for outperforming Gemma 3 27B on several benchmarks (except for math) while being faster. The reasons for the lower inference latency of Mistral Small 3.1 over Gemma 3 are likely due to their custom tokenizer, as well as shrinking the KV cache and layer count. Otherwise, it's a standard architecture as shown in the figure below. Figure 16: An architecture comparison between Gemma 3 27B and Mistral 3.1 Small 24B. Interestingly, earlier Mistral models had utilized sliding window attention, but they appear to have abandoned it in Mistral Small 3.1 if we consider the default setting ( ) in the official Model Hub configuration file . Also, the model card makes no mention of it. So, since Mistral uses regular Grouped-Query Attention instead of Grouped-Query Attention with a sliding window as in Gemma 3, maybe there are additional inference compute savings due to being able to use more optimized code (i.e., FlashAttention). For instance, I speculate that while sliding window attention reduces memory usage, it doesn't necessarily reduce inference latency, which is what Mistral Small 3.1 is focused on. 5. Llama 4 The extensive introductory discussion on Mixture-of-Experts (MoE) earlier in this article pays off again. Llama 4 has also adopted an MoE approach and otherwise follows a relatively standard architecture that is very similar to DeepSeek-V3, as shown in the figure below. (Llama 4 includes native multimodal support, similar to models like Gemma and Mistral. However, since this article focuses on language modeling, we only focus on the text model.) Figure 17: An architecture comparison between DeepSeek V3 (671-billion parameters) and Llama 4 Maverick (400-billion parameters). While the Llama 4 Maverick architecture looks very similar to DeepSeek-V3 overall, there are some interesting differences worth highlighting. First, Llama 4 uses Grouped-Query Attention similar to its predecessors, whereas DeepSeek-V3 uses Multi-Head Latent Attention, which we discussed at the beginning of this article. Now, both DeepSeek-V3 and Llama 4 Maverick are very large architectures, with DeepSeek-V3 being approximately 68% larger in its total parameter count. However, with 37 billion active parameters, DeepSeek-V3 has more than twice as many active parameters as Llama 4 Maverick (17B). Llama 4 Maverick uses a more classic MoE setup with fewer but larger experts (2 active experts with 8,192 hidden size each) compared to DeepSeek-V3 (9 active experts with 2,048 hidden size each). Also, DeepSeek uses MoE layers in each transformer block (except the first 3), whereas Llama 4 alternates MoE and dense modules in every other transformer block. Given the many small differences between architectures, it is difficult to determine their exact impact on final model performance. The main takeaway, however, is that MoE architectures have seen a significant rise in popularity in 2025. 6. Qwen3 The Qwen team consistently delivers high-quality open-weight LLMs. When I helped co-advising the LLM efficiency challenge at NeurIPS 2023, I remember that the top winning solutions were all Qwen2-based. Now, Qwen3 is another hit model series at the top of the leaderboards for their size classes. There are 7 dense models: 0.6B, 1.7B, 4B, 8B, 14B, and 32B. And there are 2 MoE models: 30B-A3B, and 235B-A22B. (By the way, note that the missing whitespace in "Qwen3" is not a typo; I simply try to preserve the original spelling the Qwen developers chose.) 6.1 Qwen3 (Dense) Let's discuss the dense model architecture first. As of this writing, the 0.6B model may well be the smallest current-generation open-weight model out there. And based on my personal experience, it performs really well given its small size. It has great token/sec throughput and a low memory footprint if you are planning to run it locally. But what's more, it's also easy to train locally (for educational purposes) due to its small size. So, Qwen3 0.6B has replaced Llama 3 1B for me for most purposes. A comparison between these two architectures is shown below. Figure 18: An architecture comparison between Qwen3 0.6B and Llama 3 1B; notice that Qwen3 is a deeper architecture with more layers, whereas Llama 3 is a wider architecture with more attention heads. If you are interested in a human-readable Qwen3 implementation without external third-party LLM library dependencies, I recently implemented Qwen3 from scratch (in pure PyTorch) . The computational performance numbers in the figure above are based on my from-scratch PyTorch implementations when run on an A100 GPU. As one can see, Qwen3 has a smaller memory footprint as it is a smaller architecture overall, but also uses smaller hidden layers and fewer attention heads. However, it uses more transformer blocks than Llama 3, which leads to a slower runtime (lower tokens/sec generation speed). 6.2 Qwen3 (MoE) As mentioned earlier, Qwen3 also comes in two MoE flavors: 30B-A3B and 235B-A22B. Why do some architectures, like Qwen3, come as regular (dense) and MoE (sparse) variants? As mentioned at the beginning of this article, MoE variants help reduce inference costs for large base models. Offering both dense and MoE versions gives users flexibility depending on their goals and constraints. Dense models are typically more straightforward to fine-tune, deploy, and optimize across various hardware. On the other hand, MoE models are optimized for scaling inference. For instance, at a fixed inference budget, they can achieve a higher overall model capacity (i.e., knowledge uptake during training due to being larger) without proportionally increasing inference costs. By releasing both types, the Qwen3 series can support a broader range of use cases: dense models for robustness, simplicity, and fine-tuning, and MoE models for efficient serving at scale. To round up this section, let's look at Qwen3 235B-A22B (note that the A22B stands for "22B active parameters) to DeepSeek-V3, which has almost twice as many active parameters (37B). Figure 19: An architecture comparison between DeepSeek-V3 and Qwen3 235B-A22B. As shown in the figure above, the DeepSeek-V3 and Qwen3 235B-A22B architectures are remarkably similar. What's noteworthy, though, is that the Qwen3 model moved away from using a shared expert (earlier Qwen models, such as Qwen2.5-MoE did use a shared expert). Unfortunately, the Qwen3 team did not disclose any reason as to why they moved away from shared experts. If I had to guess, it was perhaps simply not necessary for training stability for their setup when they increased the experts from 2 (in Qwen2.5-MoE) to 8 (in Qwen3). And then they were able to save the extra compute/memory cost by using only 8 instead of 8+1 experts. (However, this doesn't explain why DeepSeek-V3 is still keeping their shared expert.) Update. Junyang Lin , one of the developers of Qwen3, responded as follows: At that moment we did not find significant enough improvement on shared expert and we were worrying about the optimization for inference caused by shared expert. No straight answer to this question honestly. 7. SmolLM3 SmolLM3 is perhaps not as nearly as popular as the other LLMs covered in this article, but I thought it is still an interesting model to include as it offers really good modeling performance at a relatively small and convenient 3-billion parameter model size that sits between the 1.7B and 4B Qwen3 model, as shown in the figure below. Moreover, it also shared a lot of the training details, similar to OLMo, which is rare and always appreciated! Figure 20: An annotated figure from the SmolLM3 announcement post, https://huggingface.co/blog/smollm3, comparing the SmolLM3 win rate to Qwen3 1.7B and 4B as well as Llama 3 3B and Gemma 3 4B. As shown in the architecture comparison figure below, the SmolLM3 architecture looks fairly standard. The perhaps most interesting aspect is its use of NoPE (No Positional Embeddings), though. Figure 21: A side-by-side architecture comparison between Qwen3 4B and SmolLM3 3B. 7.1 No Positional Embeddings (NoPE) NoPE is, in LLM contexts, an older idea that goes back to a 2023 paper ( The Impact of Positional Encoding on Length Generalization in Transformers ) to remove explicit positional information injection (like through classic absolute positional embedding layers in early GPT architectures or nowadays RoPE). In transformer-based LLMs, positional encoding is typically necessary because self-attention treats tokens independently of order. Absolute position embeddings solve this by adding an additional embedding layer that adds information to the token embeddings. Figure 22: A modified figure from my Build A Large Language Model (From Scratch) book (https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167) illustrating absolute positional embeddings. RoPE, on the other hand, solves this by rotating the query and key vectors relative to their token position. In NoPE layers, however, no such positional signal is added at all: not fixed, not learned, not relative. Nothing. Even though there is no positional embedding, the model still knows which tokens come before, thanks to the causal attention mask. This mask prevents each token from attending to future ones. As a result, a token at position t can only see tokens at positions ≤ t , which preserves the autoregressive ordering. So while there is no positional information that is explicitly added, there is still an implicit sense of direction baked into the model's structure, and the LLM, in the regular gradient-descent-based training, can learn to exploit it if it finds it beneficial for the optimization objective. (Check out the NoPE paper's theorems for more information.) So, overall, the NoPE paper not only found that no positional information injection is necessary, but it also found that NoPE has better length generalization, which means that LLM answering performance deteriorates less with increased sequence length, as shown in the figure below. Figure 23: An annotated figure from the NoPE paper (https://arxiv.org/abs/2305.19466) showing better length generalization with NoPE. Note that the experiments shown above were conducted with a relatively small GPT-style model of approximately 100 million parameters and relatively small context sizes. It is unclear how well these findings generalize to larger, contemporary LLMs. For this reason, the SmolLM3 team likely only "applied" NoPE (or rather omitted RoPE) in every 4th layer. 8. Kimi K2 Kimi K2 recently made big waves in the AI community due to being an open-weight model with an incredibly good performance. According to benchmarks, it's on par with the best proprietary models like Google's Gemini, Anthropic's Claude, and OpenAI's ChatGPT models. A notable aspect is its use of a variant of the relatively new Muon optimizer over AdamW. As far as I know, this is the first time Muon was used over AdamW for any production model of this size ( previously , it has only been shown to scale up to 16B). This resulted in very nice training loss curves, which probably helped catapult this model to the top of the aforementioned benchmarks. While people commented that the loss was exceptionally smooth (due to the lack of spikes), I think it's not exceptionally smooth (e.g., see the OLMo 2 loss curve in the figure below; also, the L2 norm of the gradient would probably be a better metric to track training stability). However, what's remarkable is how well the loss curve decays. However, as mentioned in the introduction of this article, training methodologies are a topic for another time. Figure 24: Annotated figures from the Kimi K2 announcement blog article (https://moonshotai.github.io/Kimi-K2/) and the OLMo 2 paper (https://arxiv.org/abs/2305.19466). The model itself is 1 trillion parameters large, which is truly impressive. It may be the biggest LLM of this generation as of this writing (given the constraints that Llama 4 Behemoth is not released, proprietary LLMs don't count, and Google's 1.6 trillion Switch Transformer is an encoder-decoder architecture from a different generation). It's also coming full circle as Kimi K2 uses the DeepSeek-V3 architecture we covered at the beginning of this article except they made it larger, as shown in the figure below. Figure 25: An architecture comparison between DeepSeek V3 and Kimi K2. As shown in the figure above, Kimi K2 is basically the same as DeepSeek V3, except that it uses more experts in the MoE modules and fewer heads in the Multi-head Latent Attention (MLA) module. Kimi K2 is not coming out of nowhere. The earlier Kimi 1.5 model discussed in the Kimi k1.5: Scaling Reinforcement Learning with LLMs paper , was impressive as well. However, it had the bad luck that the DeepSeek R1 model paper was published on exactly the same date on January 22nd. Moreover, as far as I know, the Kimi 1.5 weights were never publicly shared. So, most likely the Kimi K2 team took these lessons to heart and shared Kimi K2 as an open-weight model, before DeepSeek R2 was released. As of this writing, Kimi K2 is the most impressive open-weight model. 9. GPT-OSS OpenAI’s released gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019, about one week after I wrote this article. Since OpenAI’s open-weight models have been so widely anticipated, I updated this article to include them. I will keep this section brief, but I have written another, much more detailed article dedicated to the gpt-oss models here: Before summarizing the interesting tidbits, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, as shown in Figure 26 below. Figure 26: Architecture overview of the two gpt-oss models. Looking at Figure 26, the architecture contains all the familiar components we have seen in other architectures discussed previously. For instance, Figure 27 puts the smaller gpt-oss architecture next to Qwen3 30B-A3B, which is also an MoE model with a similar number of active parameters (gpt-oss has 3.6B active parameters, and Qwen3 30B-A3B has 3.3B). Figure 27: Architecture comparison between gpt-oss and Qwen3 One aspect not shown in Figure 27 is that gpt-oss uses sliding window attention (similar to Gemma 3, but in every other layer instead of using a 5:1 ratio). 9.1 Width Versus Depth Figure 27 shows that gpt-oss and Qwen3 use similar components. But if we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24. On the other hand, gpt-oss is a much wider architecture: An embedding dimension of 2880 instead of 2048 An intermediate expert (feed forward) projection dimension of also 2880 instead of 768 Figure 28: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066 Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts, though. 9.3 Attention Bias and Attention Sinks Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in each second layer, as mentioned earlier. However, there's one interesting detail that caught my eye. It seems that gpt-oss uses bias units for the attention weights, as shown in Figure 29 below. Figure 29: gpt-oss models use bias units in the attention layers. See code example here. I haven't seen these bias units being used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation ( ). Furthermore, the empirical results show that there is little difference between with and without bias units (see Figure 30 below). Figure 30: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units. Another detail you may have noticed is the definition of in the code screenshot in Figure 30. In general models, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. I.e., if the context gets very long, this special attended token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think it was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.) In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 31). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs. Figure 31: The use of attention sinks in gpt-oss; based on the Hugging Face code here . For more information about gpt-oss, and how it compares to GPT-2, please see my other gpt-oss article: 10. Grok 2.5 A few weeks after this article first went online, xAI released the weights of their 270B-parameter Grok 2.5 model. I thought it would be worth including here, since Grok 2.5 was xAI's flagship production model last year. Up to this point, all models we discussed were released as open-weight models from the start. For example, gpt-oss is likely not an open-weight clone of GPT-4 but rather a custom model trained specifically for the open-source community. With Grok 2.5, we get a rare look at a real production system, even if it is last year's. Architecturally, Grok 2.5 looks fairly standard overall (Figure 32), but there are a few noteworthy details. Figure 32: Grok 2.5 next to a Qwen3 model of comparable size For instance, Grok 2.5 uses a small number of large experts (eight), which reflects an older trend. As discussed earlier, more recent designs such as those in the DeepSeekMoE paper favor a larger number of smaller experts (this is also present in Qwen3). Another interesting choice is the use of what amounts to a shared expert. The additional SwiGLU module shown on the left in Figure 32 functions as an always-on, shared expert. It is not identical to the classic shared-expert design since its intermediate dimension is doubled, but the idea is the same. (I still find it interesting that Qwen3 omitted shared experts, and it will be interesting to see if that changes with Qwen4 and later models.) 11. GLM-4.5 GLM-4.5 is another major release this year. It is an instruction/reasoning hybrid similar to Qwen3, but even better optimized for function calling and agent-style contexts. Figure 33: GLM-4.5 benchmark from the official GitHub repository at https://github.com/zai-org/GLM-4.5 As shown in Figure 34, GLM-4.5 comes in two variants. The flagship 355-billion-parameter model outperforms Claude 4 Opus on average across 12 benchmarks and trails only slightly behind OpenAI’s o3 and xAI’s Grok 4. There is also GLM-4.5-Air, a more compact 106-billion-parameter version that delivers performance only marginally below the 355-billion model. Figure 35 compares the 355-billion architecture to Qwen3. Figure 34: GLM-4.5 next to a similarly-sized Qwen3 model. The designs are largely similar, but GLM-4.5 adopts a structural choice first introduced by DeepSeek V3: 3 dense layers precede the Mixture-of-Experts (MoE) blocks. Why? Starting with several dense layers improves convergence stability and overall performance in large MoE systems. If MoE routing is introduced immediately, the instability of sparse expert selection can interfere with early syntactic and semantic feature extraction. So, one might say that by keeping the initial layers dense ensures the model forms stable low-level representations before routing decisions begin to shape higher-level processing. Also, GLM-4.5 uses a shared expert similar to DeepSeek-V3 (and unlike Qwen3). (Interestingly, GLM-4.5 also retains the attention bias mechanism used in GPT-2 and gpt-oss.) 12. Qwen3-Next On 11 September 2025, the Qwen3 team released Qwen3 Next 80B-A3B (Figure 35), available in both Instruct and Thinking variants. While its design builds on the previously discussed Qwen3 architecture, I included it here as a separate entry to keep the figure numbering consistent and to draw attention to some of its design changes. 12.1 Expert Size and Number The new Qwen3 Next architecture stands out because, despite being 3× smaller than the previous 235B-A22B model (Figure 35), it introduces four times as many experts and even adds a shared expert. Both of these design choices (a high expert count and the inclusion of a shared expert) were future directions I had highlighted prior to this release, particularly in the video version of the article that I linked at the top. Figure 35: The original Qwen3 model released in May (left) next to the Qwen3 Next model released in September (right). 12.2 Gated DeltaNet + Gated Attention Hybrid The other highlight is that they replace the regular attention mechanism by a Gated DeltaNet + Gated Attention hybrid, which helps enable the native 262k token context length in terms of memory usage (the previous 235B-A22B model model supported 32k natively, and 131k with YaRN scaling.) So how does this new attention hybrid work? Compared to grouped‑query attention (GQA), which is still standard scaled dot‑product attention (sharing K/V across query‑head groups to cut KV‑cache size and memory bandwidth as discussed earlier but whose decode cost and cache still grow with sequence length), their hybrid mechanism mixes Gated DeltaNet blocks with Gated Attention blocks with in a 3:1 ratio as shown in Figure 36. Figure 36: The Gated DeltaNet + Gated Attention hybrid mechanism. Note that these are arranges in a 3:1 ratio, meaning that 3 transformer blocks with Gated DeltaNet are followed by 1 transformer block with Gated Attention. The right subfigure is from the official Qwen3 blog: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list We can think of the gated attention block as standard scaled-dot-product attention that can be used in GQA, but it has a few tweaks on top. The main differences between gated attention and plain GQA block are: an output gate (sigmoid-controlled, usually per-channel) that scales the attention result before it is added back to the residual; zero-centered RMSNorm for QKNorm, rather than a standard RMSNorm; partial RoPE (on a subset of dimensions).

Deep Learning

0 views

Ahead of AI 5 months ago

LLM Research Papers: The 2025 List (January to June)

As some of you know, I keep a running list of research papers I (want to) read and reference. About six months ago, I shared my 2024 list , which many readers found useful. So, I was thinking about doing this again. However, this time, I am incorporating that one piece of feedback kept coming up: "Can you organize the papers by topic instead of date?" The categories I came up with are: Reasoning Models - 1a. Training Reasoning Models - 1b. Inference-Time Reasoning Strategies - 1c. Evaluating LLMs and/or Understanding Reasoning Other Reinforcement Learning Methods for LLMs Other Inference-Time Scaling Methods Efficient Training & Architectures Diffusion-Based Language Models Multimodal & Vision-Language Models Data & Pre-training Datasets Also, as LLM research continues to be shared at a rapid pace, I have decided to break the list into bi-yearly updates. This way, the list stays digestible, timely, and hopefully useful for anyone looking for solid summer reading material. Please note that this is just a curated list for now. In future articles, I plan to revisit and discuss some of the more interesting or impactful papers in larger topic-specific write-ups. Stay tuned! It's summer! And that means internship season, tech interviews, and lots of learning. To support those brushing up on intermediate to advanced machine learning and AI topics, I have made all 30 chapters of my Machine Learning Q and AI book freely available for the summer: 🔗 https://sebastianraschka.com/books/ml-q-and-ai/#table-of-contents Whether you are just curious and want to learn something new or prepping for interviews, hopefully this comes in handy. Happy reading, and best of luck if you are interviewing! This year, my list is very reasoning model-heavy. So, I decided to subdivide it into 3 categories: Training, inference-time scaling, and more general understanding/evaluation. This subsection focuses on training strategies specifically designed to improve reasoning abilities in LLMs. As you may see, much of the recent progress has centered around reinforcement learning (with verifiable rewards), which I covered in more detail in a previous article. Annotated figure from Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 8 Jan, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682 13 Jan, The Lessons of Developing Process Reward Models in Mathematical Reasoning, https://arxiv.org/abs/2501.07301 16 Jan, Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, https://arxiv.org/abs/2501.09686 20 Jan, Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 22 Jan, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs//2501.12599 22 Jan, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 3 Feb, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 5 Feb, Demystifying Long Chain-of-Thought Reasoning in LLMs, Demystifying Long Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2502.03373 5 Feb, LIMO: Less is More for Reasoning, https://arxiv.org/abs/2502.03387 5 Feb, Teaching Language Models to Critique via Reinforcement Learning, https://arxiv.org/abs/2502.03492 6 Feb, Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 10 Feb, Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, https://arxiv.org/abs/2502.06781 10 Feb, On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, https://arxiv.org/abs/2502.06773 11 Feb, LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!, https://arxiv.org/abs/2502.07374 12 Feb, Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance, https://arxiv.org/abs/2502.08127 13 Feb, Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging - An Open Recipe, https://arxiv.org/abs/2502.09056 20 Feb, Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, https://arxiv.org/abs/2502.14768 25 Feb, SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, https://arxiv.org/abs/2502.18449 4 Mar, Learning from Failures in Multi-Attempt Reinforcement Learning, https://arxiv.org/abs/2503.04808 4 Mar, The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, https://arxiv.org/abs/2503.02875 10 Mar, R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 10 Mar, LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL, https://arxiv.org/abs/2503.07536 12 Mar, Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, https://arxiv.org/abs/2503.09516 16 Mar, Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models, https://arxiv.org/abs/2503.13551 20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, https://arxiv.org/abs/2503.16219 25 Mar, ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.19470 26 Mar, Understanding R1-Zero-Like Training: A Critical Perspective, https://arxiv.org/abs/2503.20783 30 Mar, RARE: Retrieval-Augmented Reasoning Modeling, https://arxiv.org/abs/2503.23513 31 Mar, Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290 31 Mar, JudgeLRM: Large Reasoning Models as a Judge, https://arxiv.org/abs/2504.00050 7 Apr, Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 10 Apr, VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning, https://arxiv.org/abs/2504.08837 11 Apr, Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning, https://arxiv.org/abs/2504.08672 13 Apr, Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability, https://arxiv.org/abs/2504.09639 21 Apr, Learning to Reason under Off-Policy Guidance, https://arxiv.org/abs/2504.14945 22 Apr, Tina: Tiny Reasoning Models via LoRA, https://arxiv.org/abs/2504.15777 29 Apr, Reinforcement Learning for Reasoning in Large Language Models with One Training Example, https://arxiv.org/abs/2504.20571 30 Apr, Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math, https://arxiv.org/abs/2504.21233 2 May, Llama-Nemotron: Efficient Reasoning Models, https://arxiv.org/abs/2505.00949 5 May, RM-R1: Reward Modeling as Reasoning, https://arxiv.org/abs/2505.02387 6 May, Absolute Zero: Reinforced Self-play Reasoning with Zero Data, https://arxiv.org/abs/2505.03335 12 May, INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning, https://arxiv.org/abs/2505.07291 12 May, MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining, https://arxiv.org/abs/2505.07608 14 May, Qwen3 Technical Report, https://arxiv.org/abs/2505.09388 15 May, Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models, https://arxiv.org/abs/2505.10554 19 May, AdaptThink: Reasoning Models Can Learn When to Think, https://arxiv.org/abs/2505.13417 19 May, Thinkless: LLM Learns When to Think, https://arxiv.org/abs/2505.13379 20 May, General-Reasoner: Advancing LLM Reasoning Across All Domains, https://arxiv.org/abs/2505.14652 21 May, Learning to Reason via Mixture-of-Thought for Logical Reasoning, https://arxiv.org/abs/2505.15817 21 May, RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning, https://arxiv.org/abs/2505.15034 23 May, QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning, https://www.arxiv.org/abs/2505.17667 26 May, Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles, https://arxiv.org/abs/2505.19914 26 May, Learning to Reason without External Rewards, https://arxiv.org/abs/2505.19590 29 May, Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents, https://arxiv.org/abs/2505.22954 30 May, Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning, https://arxiv.org/abs/2505.24726 30 May, ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models, https://arxiv.org/abs/2505.24864 2 Jun, Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, https://arxiv.org/abs/2506.01939 3 Jun, Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening, https://www.arxiv.org/abs/2506.02355 9 Jun, Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 10 Jun, RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling, https://arxiv.org/abs/2506.08672 10 Jun, Reinforcement Learning Teachers of Test Time Scaling, https://www.arxiv.org/abs/2506.08388 12 Jun, Magistral, https://arxiv.org/abs/2506.10910 12 Jun, Spurious Rewards: Rethinking Training Signals in RLVR, https://arxiv.org/abs/2506.10947 16 Jun, AlphaEvolve: A coding agent for scientific and algorithmic discovery, https://arxiv.org/abs/2506.13131 17 Jun, Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs, https://arxiv.org/abs/2506.14245 23 Jun, Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training, https://arxiv.org/abs/2506.18777 26 Jun, Bridging Offline and Online Reinforcement Learning for LLMs, https://arxiv.org/abs/2506.21495 This part of the list covers methods that improve reasoning dynamically at test time, without requiring retraining. Often, these papers are focused on trading of computational performance for modeling performance. Reasoning Models - 1a. Training Reasoning Models - 1b. Inference-Time Reasoning Strategies - 1c. Evaluating LLMs and/or Understanding Reasoning Other Reinforcement Learning Methods for LLMs Other Inference-Time Scaling Methods Efficient Training & Architectures Diffusion-Based Language Models Multimodal & Vision-Language Models Data & Pre-training Datasets Annotated figure from Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 8 Jan, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682 13 Jan, The Lessons of Developing Process Reward Models in Mathematical Reasoning, https://arxiv.org/abs/2501.07301 16 Jan, Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, https://arxiv.org/abs/2501.09686 20 Jan, Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 22 Jan, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs//2501.12599 22 Jan, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 3 Feb, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 5 Feb, Demystifying Long Chain-of-Thought Reasoning in LLMs, Demystifying Long Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2502.03373 5 Feb, LIMO: Less is More for Reasoning, https://arxiv.org/abs/2502.03387 5 Feb, Teaching Language Models to Critique via Reinforcement Learning, https://arxiv.org/abs/2502.03492 6 Feb, Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 10 Feb, Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, https://arxiv.org/abs/2502.06781 10 Feb, On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, https://arxiv.org/abs/2502.06773 11 Feb, LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!, https://arxiv.org/abs/2502.07374 12 Feb, Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance, https://arxiv.org/abs/2502.08127 13 Feb, Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging - An Open Recipe, https://arxiv.org/abs/2502.09056 20 Feb, Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, https://arxiv.org/abs/2502.14768 25 Feb, SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, https://arxiv.org/abs/2502.18449 4 Mar, Learning from Failures in Multi-Attempt Reinforcement Learning, https://arxiv.org/abs/2503.04808 4 Mar, The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, https://arxiv.org/abs/2503.02875 10 Mar, R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 10 Mar, LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL, https://arxiv.org/abs/2503.07536 12 Mar, Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, https://arxiv.org/abs/2503.09516 16 Mar, Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models, https://arxiv.org/abs/2503.13551 20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, https://arxiv.org/abs/2503.16219 25 Mar, ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.19470 26 Mar, Understanding R1-Zero-Like Training: A Critical Perspective, https://arxiv.org/abs/2503.20783 30 Mar, RARE: Retrieval-Augmented Reasoning Modeling, https://arxiv.org/abs/2503.23513 31 Mar, Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290 31 Mar, JudgeLRM: Large Reasoning Models as a Judge, https://arxiv.org/abs/2504.00050 7 Apr, Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 10 Apr, VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning, https://arxiv.org/abs/2504.08837 11 Apr, Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning, https://arxiv.org/abs/2504.08672 13 Apr, Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability, https://arxiv.org/abs/2504.09639 21 Apr, Learning to Reason under Off-Policy Guidance, https://arxiv.org/abs/2504.14945 22 Apr, Tina: Tiny Reasoning Models via LoRA, https://arxiv.org/abs/2504.15777 29 Apr, Reinforcement Learning for Reasoning in Large Language Models with One Training Example, https://arxiv.org/abs/2504.20571 30 Apr, Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math, https://arxiv.org/abs/2504.21233 2 May, Llama-Nemotron: Efficient Reasoning Models, https://arxiv.org/abs/2505.00949 5 May, RM-R1: Reward Modeling as Reasoning, https://arxiv.org/abs/2505.02387 6 May, Absolute Zero: Reinforced Self-play Reasoning with Zero Data, https://arxiv.org/abs/2505.03335 12 May, INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning, https://arxiv.org/abs/2505.07291 12 May, MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining, https://arxiv.org/abs/2505.07608 14 May, Qwen3 Technical Report, https://arxiv.org/abs/2505.09388 15 May, Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models, https://arxiv.org/abs/2505.10554 19 May, AdaptThink: Reasoning Models Can Learn When to Think, https://arxiv.org/abs/2505.13417 19 May, Thinkless: LLM Learns When to Think, https://arxiv.org/abs/2505.13379 20 May, General-Reasoner: Advancing LLM Reasoning Across All Domains, https://arxiv.org/abs/2505.14652 21 May, Learning to Reason via Mixture-of-Thought for Logical Reasoning, https://arxiv.org/abs/2505.15817 21 May, RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning, https://arxiv.org/abs/2505.15034 23 May, QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning, https://www.arxiv.org/abs/2505.17667 26 May, Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles, https://arxiv.org/abs/2505.19914 26 May, Learning to Reason without External Rewards, https://arxiv.org/abs/2505.19590 29 May, Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents, https://arxiv.org/abs/2505.22954 30 May, Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning, https://arxiv.org/abs/2505.24726 30 May, ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models, https://arxiv.org/abs/2505.24864 2 Jun, Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, https://arxiv.org/abs/2506.01939 3 Jun, Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening, https://www.arxiv.org/abs/2506.02355 9 Jun, Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 10 Jun, RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling, https://arxiv.org/abs/2506.08672 10 Jun, Reinforcement Learning Teachers of Test Time Scaling, https://www.arxiv.org/abs/2506.08388 12 Jun, Magistral, https://arxiv.org/abs/2506.10910 12 Jun, Spurious Rewards: Rethinking Training Signals in RLVR, https://arxiv.org/abs/2506.10947 16 Jun, AlphaEvolve: A coding agent for scientific and algorithmic discovery, https://arxiv.org/abs/2506.13131 17 Jun, Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs, https://arxiv.org/abs/2506.14245 23 Jun, Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training, https://arxiv.org/abs/2506.18777 26 Jun, Bridging Offline and Online Reinforcement Learning for LLMs, https://arxiv.org/abs/2506.21495

Machine Learning

0 views

Ahead of AI 5 months ago

Understanding and Coding the KV Cache in LLMs from Scratch

KV caches are one of the most critical techniques for efficient inference in LLMs in production. KV caches are an important component for compute-efficient LLM inference in production. This article explains how they work conceptually and in code with a from-scratch, human-readable implementation. It's been a while since I shared a technical tutorial explaining fundamental LLM concepts. As I am currently recovering from an injury and working on a bigger LLM research-focused article, I thought I'd share a tutorial article on a topic several readers asked me about (as it was not included in my Building a Large Language Model From Scratch book). Happy reading! In short, a KV cache stores intermediate key (K) and value (V) computations for reuse during inference (after training), which results in a substantial speed-up when generating text. The downside of a KV cache is that it adds more complexity to the code, increases memory requirements (the main reason I initially didn't include it in the book), and can't be used during training. However, the inference speed-ups are often well worth the trade-offs in code complexity and memory when using LLMs in production. Imagine the LLM is generating some text. Concretely, suppose the LLM is given the following prompt: "Time". As you may already know, LLMs generate one word (or token) at a time, and the two following text generation steps may look as illustrated in the figure below: The diagram illustrates how an LLM generates text one token at a time. Starting with the prompt "Time", the model generates the next token "flies." In the next step, the full sequence "Time flies" is reprocessed to generate the token "fast". Note that there is some redundancy in the generated LLM text outputs, as highlighted in the next figure: This figure highlights the repeated context ("Time flies") that must be reprocessed by the LLM at each generation step. Since the LLM does not cache intermediate key/value states, it re-encodes the full sequence every time a new token (e.g., "fast") is generated. When we implement an LLM text generation function, we typically only use the last generated token from each step. However, the visualization above highlights one of the main inefficiencies on a conceptual level. This inefficiency (or redundancy) becomes more clear if we zoom in on the attention mechanism itself. (If you are curious about attention mechanisms, you can read more in Chapter 3 of my Build a Large Language Model (From Scratch) book or my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article). The following figure shows an excerpt of an attention mechanism computation that is at the core of an LLM. Here, the input tokens ("Time" and "flies") are encoded as 3-dimensional vectors (in reality, these vectors are much larger, but this would make it challenging to fit them into a small figure). The matrices W are the weight matrices of the attention mechanism that transform these inputs into key, value, and query vectors. The figure below shows an excerpt of the underlying attention score computation with the key and value vectors highlighted: This figure illustrates how the LLM derives key ( ) and value ( ) vectors from token embeddings during attention computation. Each input token (e.g., "Time" and "flies") is projected using learned matrices and to obtain its corresponding key and value vectors. As mentioned earlier, LLMs generate one word (or token) at a time. Suppose the LLM generated the word "fast" so that the prompt for the next round becomes "Time flies fast". This is illustrated in the next figure below: This diagram shows how the LLM recomputes key and value vectors for previously seen tokens ("Time" and "flies") during each generation step. When generating the third token ("fast"), the model recomputes the same and vectors again, rather than reusing them. This repeated computation highlights the inefficiency of not using a KV cache during autoregressive decoding. As we can see, based on comparing the previous 2 figures, the keys and value vectors for the first two tokens are exactly the same, and it would be wasteful to recompute them in each next-token text generation round. Now, the idea of the KV cache is to implement a caching mechanism that stores the previously generated key and value vectors for reuse, which helps us to avoid these unnecessary recomputations. After we went over the basic concept in the previous section, let's go into a bit more detail before we look at a concrete code implementation. If we have a text generation process without KV cache for "Time flies fast", we can think of it as follows: Notice the redundancy: tokens "Time" and "flies" are recomputed at every new generation step. The KV cache resolves this inefficiency by storing and reusing previously computed key and value vectors: Initially, the model computes and caches key and value vectors for the input tokens. For each new token generated, the model only computes key and value vectors for that specific token. Previously computed vectors are retrieved from the cache to avoid redundant computations. The table below summarizes the computation and caching steps and states: The benefits here are that is computed once and reused twice, and is computed once and reused once. (It's a short text example for simplicity, but it should be intuitive to see that the longer the text, the more we get to reuse already computed keys and values, which increases the generation speed.)n speed.) The following figure illustrates generation step 3 with and without a KV cache side by side. Comparing text generation with and without a KV cache. In the top panel (without cache), key and value vectors are recomputed for each token step, which results in redundant operations. In the bottom panel (with cache), previously computed keys and values are retrieved from the KV cache to avoid recomputation for faster generation. So, if we want to implement a KV cache in code, all we have to do is compute the keys and values as usual but then store them so that we can retrieve them in the next round. The next section illustrates this with a concrete code example. There are many ways to implement a KV cache, with the main idea being that we only compute the key and value tensors for the newly generated tokens in each generation step. I opted for a simple one that emphasizes code readability. I think it's easiest to just scroll through the code changes to see how it's implemented. There are two files I shared on GitHub, which are self-contained Python scripts that implement an LLM with and without KV cache from scratch: gpt_ch04.py : Self-contained code taken from Chapters 3 and 4 of my Build a Large Language Model (From Scratch) book to implement the LLM and run the simple text generation function gpt_with_kv_cache.py : The same as above, but with the necessary changes made to implement the KV cache. To read through the KV cache-relevant code modifications, you can either: a. Open the file and look out for the sections that mark the new changes: b. Check out the two code files via a file diff tool of your choice to compare the changes: In additoin, to summarize the implementation details, there's a short walkthrough in the following subsections. Inside the constructor, we add two non-persistent buffers, and , which will hold concatenated keys and values across steps: (I made a YouTube video if you want to learn more about buffers: Understanding PyTorch Buffers .) Next, we extend the method of the class to accept a argument: The storage and retrieval of keys and values here implements the core idea of the KV cache. Concretely, after the cache is initialized via the if , we add the newly generated keys and values via and to the cache, respectively. Then, retrieves the stored values and keys from the cache. And that's basically it: the core store & retrieve mechanism of a KV cache. The following sections, 3 and 4, just take care of minor implementation details. When generating text, we have to remember to reset both the keys and value buffers between two separate text-generation calls. Otherwise, the queries of a new prompt will attend to stale keys left over from the previous sequence, which causes the model to rely on irrelevant context and produce incoherent output. To prevent this, we add a method to the class that we can use between text-generation calls later: With the changes to the class in place, we now modify the class. First, we add a position tracking for the token indices to the instructor: This is a simple counter that remembers how many tokens the model has already cached during an incremental generation session. Then, we replace the one-liner block call with an explicit loop, passing through each transformer block: What happens above if we set is that we start at the and count steps. Then, bump the counter so the next decoding call continues where we left off. The reason for the tracking is that new queries must line up directly after the keys and values that are already stored. Without using a counter, every new step would start at position 0 again, so the model would treat the new tokens as if they overlapped the earlier ones. (Alternatively, we could also keep track via an .) The above change then also requires a small modification to the class to accept the argument: Lastly, we add a model-level reset to to clear all block caches at once for our convenience: With the changes to the , , and , finally, here's how we use the KV cache in a simple text generation function: Note that we only feed the model the new token in c) via . Without caching, we feed the model the whole input as it has no stored keys and values to reuse. After covering the KV cache on a conceptual level, the big question is how well it actually performs in practice on a small example. To give the implementation a try, we can run the two aforementioned code files as Python scripts, which will run the small 124 M parameter LLM to generate 200 new tokens (given a 4-token prompt "Hello, I am" to start with): On a Mac Mini with M4 chip (CPU), the results are as follows: So, as we can see, we already get a ~5x speed-up with a small 124 M parameter model and a short 200-token sequence length. (Note that this implementation is optimized for code readability and not optimized for CUDA or MPS runtime speed, which would require pre-allocating tensors instead of reinstating and concatenating them.) Note: The model generates "gibberish" in both cases, i.e., text that looks like this: Output text: Hello, I am Featureiman Byeswickattribute argue logger Normandy Compton analogous bore ITVEGIN ministriesysics Kle functional recountrictionchangingVirgin embarrassedgl ... This is because we haven't trained the model yet. The next chapter trains the model, and you can use the KV cache on the trained model (however, the KV cache is only meant to be used during inference) to generate coherent text. Here, we are using the untrained model to keep the code simple(r). What's more important, though, is that both the and implementations produce exactly the same text. This tells us that the KV cache is implemented correctly -- it is easy to make indexing mistakes that can lead to divergent results. Thanks for reading Ahead of AI! Subscribe for free to receive new posts and support my work. As sequence length increases, the benefits and downsides of a KV cache become more pronounced in the following ways: [Good] Computational efficiency increases : Without caching, the attention at step t must compare the new query with t previous keys, so the cumulative work scales quadratically, O(n²). With a cache, each key and value is computed once and then reused, reducing the total per-step complexity to linear, O(n). [Bad] Memory usage increases linearly : Each new token appends to the KV cache. For long sequences and larger LLMs, the cumulative KV cache grows larger, which can consume a significant or even prohibitive amount of (GPU) memory. As a workaround, we can truncate the KV cache, but this adds even more complexity (but again, it may well be worth it when deploying LLMs.) While my conceptual implementation of a KV cache above helps with clarity and is mainly geared towards code readability and educational purposes, deploying it in real-world scenarios (especially with larger models and longer sequence lengths) requires more careful optimization. Memory fragmentation and repeated allocations : Continuously concatenating tensors via , as shown earlier, leads to performance bottlenecks due to frequent memory allocation and reallocation. Linear growth in memory usage : Without proper handling, the KV cache size becomes impractical for very long sequences. Rather than concatenating tensors repeatedly, we could pre-allocate a sufficiently large tensor based on the expected maximum sequence length. This ensures consistent memory use and reduces overhead. In pseudo-code, this may look like as follows: During inference, we can then simply write into slices of these pre-allocated tensors. To avoid blowing up our GPU memory, we can implement a sliding window approach with dynamic truncation. Via the sliding window, we maintain only the last tokens in the cache: You can find these optimizations in the gpt_with_kv_cache_optimized.py file. On a Mac Mini with an M4 chip (CPU), with a 200-token generation and a window size equal to the LLM's context length (to guarantee the same results and thus a fair comparison) below, the code runtimes compare as follows: Unfortunately, the speed advantages disappear on CUDA devices as this is a tiny model, and the device transfer and communication outweigh the benefits of a KV cache for this small model. Although caching introduces additional complexity and memory considerations, the noticeable gains in efficiency typically outweigh these trade-offs, especially in production environments. Remember, while I prioritized code clarity and readability over efficiency here, the takeaway is that practical implementations often require thoughtful optimizations, such as pre-allocating memory or applying a sliding-window cache to manage memory growth effectively. In that sense, I hope this article turned out to be informative. Feel free to experiment with these techniques, and happy coding! After adding KV caches to my from-scratch implementations of Qwen3 (0.6 B) and Llama 3 (1 B), I ran additional experiments comparing the model runtimes with and without KV cache. Note that I opted for the torch.cat approach mentioned above rather than pre-allocating the KV cache tensors as described in the Optimizing the KV Cache Implementation section. Since Llama 3 and Qwen3 have very large supported context sizes (131k and 41k tokens, respectively), the pre-allocated tensors consume ~8 GB of additional memory, which is quite expensive. Moreover, because I am using the more memory-efficient approach to creating the tensors on the fly, I moved the KV cache outside the model to compile the model with for a computational efficiency boost. The codes can be found here: qwen3.py | README llama3.py | README And the performances are shown below. As we can see, on CPUs, the KV cache results in the most substantial speed-up. And compilation boosts that performance even further. However, on a GPU, the best performance can be achieved with the regular compiled model, which is likely because we don’t pre-allocate the tensors on the GPU, and the models are relatively small. This magazine is a personal passion project. To support me as an independent researcher, please consider purchasing a copy of my book, Build a Large Language Model (From Scratch) book , or signing up for a paid subscription . Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! The diagram illustrates how an LLM generates text one token at a time. Starting with the prompt "Time", the model generates the next token "flies." In the next step, the full sequence "Time flies" is reprocessed to generate the token "fast". Note that there is some redundancy in the generated LLM text outputs, as highlighted in the next figure: This figure highlights the repeated context ("Time flies") that must be reprocessed by the LLM at each generation step. Since the LLM does not cache intermediate key/value states, it re-encodes the full sequence every time a new token (e.g., "fast") is generated. When we implement an LLM text generation function, we typically only use the last generated token from each step. However, the visualization above highlights one of the main inefficiencies on a conceptual level. This inefficiency (or redundancy) becomes more clear if we zoom in on the attention mechanism itself. (If you are curious about attention mechanisms, you can read more in Chapter 3 of my Build a Large Language Model (From Scratch) book or my Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article). The following figure shows an excerpt of an attention mechanism computation that is at the core of an LLM. Here, the input tokens ("Time" and "flies") are encoded as 3-dimensional vectors (in reality, these vectors are much larger, but this would make it challenging to fit them into a small figure). The matrices W are the weight matrices of the attention mechanism that transform these inputs into key, value, and query vectors. The figure below shows an excerpt of the underlying attention score computation with the key and value vectors highlighted: This figure illustrates how the LLM derives key ( ) and value ( ) vectors from token embeddings during attention computation. Each input token (e.g., "Time" and "flies") is projected using learned matrices and to obtain its corresponding key and value vectors. As mentioned earlier, LLMs generate one word (or token) at a time. Suppose the LLM generated the word "fast" so that the prompt for the next round becomes "Time flies fast". This is illustrated in the next figure below: This diagram shows how the LLM recomputes key and value vectors for previously seen tokens ("Time" and "flies") during each generation step. When generating the third token ("fast"), the model recomputes the same and vectors again, rather than reusing them. This repeated computation highlights the inefficiency of not using a KV cache during autoregressive decoding. As we can see, based on comparing the previous 2 figures, the keys and value vectors for the first two tokens are exactly the same, and it would be wasteful to recompute them in each next-token text generation round. Now, the idea of the KV cache is to implement a caching mechanism that stores the previously generated key and value vectors for reuse, which helps us to avoid these unnecessary recomputations. How LLMs Generate Text (Without and With a KV Cache) After we went over the basic concept in the previous section, let's go into a bit more detail before we look at a concrete code implementation. If we have a text generation process without KV cache for "Time flies fast", we can think of it as follows: Notice the redundancy: tokens "Time" and "flies" are recomputed at every new generation step. The KV cache resolves this inefficiency by storing and reusing previously computed key and value vectors: Initially, the model computes and caches key and value vectors for the input tokens. For each new token generated, the model only computes key and value vectors for that specific token. Previously computed vectors are retrieved from the cache to avoid redundant computations. Comparing text generation with and without a KV cache. In the top panel (without cache), key and value vectors are recomputed for each token step, which results in redundant operations. In the bottom panel (with cache), previously computed keys and values are retrieved from the KV cache to avoid recomputation for faster generation. So, if we want to implement a KV cache in code, all we have to do is compute the keys and values as usual but then store them so that we can retrieve them in the next round. The next section illustrates this with a concrete code example. Implementing a KV Cache from Scratch There are many ways to implement a KV cache, with the main idea being that we only compute the key and value tensors for the newly generated tokens in each generation step. I opted for a simple one that emphasizes code readability. I think it's easiest to just scroll through the code changes to see how it's implemented. There are two files I shared on GitHub, which are self-contained Python scripts that implement an LLM with and without KV cache from scratch: gpt_ch04.py : Self-contained code taken from Chapters 3 and 4 of my Build a Large Language Model (From Scratch) book to implement the LLM and run the simple text generation function gpt_with_kv_cache.py : The same as above, but with the necessary changes made to implement the KV cache. b. Check out the two code files via a file diff tool of your choice to compare the changes: In additoin, to summarize the implementation details, there's a short walkthrough in the following subsections. 1. Registering the Cache Buffers Inside the constructor, we add two non-persistent buffers, and , which will hold concatenated keys and values across steps: (I made a YouTube video if you want to learn more about buffers: Understanding PyTorch Buffers .) 2. Forward pass with flag Next, we extend the method of the class to accept a argument: The storage and retrieval of keys and values here implements the core idea of the KV cache. Storing Concretely, after the cache is initialized via the if , we add the newly generated keys and values via and to the cache, respectively. Retrieving Then, retrieves the stored values and keys from the cache. And that's basically it: the core store & retrieve mechanism of a KV cache. The following sections, 3 and 4, just take care of minor implementation details. 3. Clearing the Cache When generating text, we have to remember to reset both the keys and value buffers between two separate text-generation calls. Otherwise, the queries of a new prompt will attend to stale keys left over from the previous sequence, which causes the model to rely on irrelevant context and produce incoherent output. To prevent this, we add a method to the class that we can use between text-generation calls later: 4. Propagating in the Full Model With the changes to the class in place, we now modify the class. First, we add a position tracking for the token indices to the instructor: This is a simple counter that remembers how many tokens the model has already cached during an incremental generation session. Then, we replace the one-liner block call with an explicit loop, passing through each transformer block: What happens above if we set is that we start at the and count steps. Then, bump the counter so the next decoding call continues where we left off. The reason for the tracking is that new queries must line up directly after the keys and values that are already stored. Without using a counter, every new step would start at position 0 again, so the model would treat the new tokens as if they overlapped the earlier ones. (Alternatively, we could also keep track via an .) The above change then also requires a small modification to the class to accept the argument: Lastly, we add a model-level reset to to clear all block caches at once for our convenience: 5. Using the Cache in Generation With the changes to the , , and , finally, here's how we use the KV cache in a simple text generation function: Note that we only feed the model the new token in c) via . Without caching, we feed the model the whole input as it has no stored keys and values to reuse. A Simple Performance Comparison After covering the KV cache on a conceptual level, the big question is how well it actually performs in practice on a small example. To give the implementation a try, we can run the two aforementioned code files as Python scripts, which will run the small 124 M parameter LLM to generate 200 new tokens (given a 4-token prompt "Hello, I am" to start with): On a Mac Mini with M4 chip (CPU), the results are as follows: So, as we can see, we already get a ~5x speed-up with a small 124 M parameter model and a short 200-token sequence length. (Note that this implementation is optimized for code readability and not optimized for CUDA or MPS runtime speed, which would require pre-allocating tensors instead of reinstating and concatenating them.) Note: The model generates "gibberish" in both cases, i.e., text that looks like this: Output text: Hello, I am Featureiman Byeswickattribute argue logger Normandy Compton analogous bore ITVEGIN ministriesysics Kle functional recountrictionchangingVirgin embarrassedgl ... This is because we haven't trained the model yet. The next chapter trains the model, and you can use the KV cache on the trained model (however, the KV cache is only meant to be used during inference) to generate coherent text. Here, we are using the untrained model to keep the code simple(r). What's more important, though, is that both the and implementations produce exactly the same text. This tells us that the KV cache is implemented correctly -- it is easy to make indexing mistakes that can lead to divergent results. Thanks for reading Ahead of AI! Subscribe for free to receive new posts and support my work. KV cache Advantages and Disadvantages As sequence length increases, the benefits and downsides of a KV cache become more pronounced in the following ways: [Good] Computational efficiency increases : Without caching, the attention at step t must compare the new query with t previous keys, so the cumulative work scales quadratically, O(n²). With a cache, each key and value is computed once and then reused, reducing the total per-step complexity to linear, O(n). [Bad] Memory usage increases linearly : Each new token appends to the KV cache. For long sequences and larger LLMs, the cumulative KV cache grows larger, which can consume a significant or even prohibitive amount of (GPU) memory. As a workaround, we can truncate the KV cache, but this adds even more complexity (but again, it may well be worth it when deploying LLMs.) Memory fragmentation and repeated allocations : Continuously concatenating tensors via , as shown earlier, leads to performance bottlenecks due to frequent memory allocation and reallocation. Linear growth in memory usage : Without proper handling, the KV cache size becomes impractical for very long sequences. qwen3.py | README llama3.py | README

Tutorial

Python

0 views

Ahead of AI 6 months ago

Coding LLMs from the Ground Up: A Complete Course

I wrote a lot about reasoning models in recent months (4 articles in a row)! Next to everything "agentic," reasoning is one of the biggest LLM topics of 2025. This month, however, I wanted to share more fundamental or "foundational" content with you on how to code LLMs, which is one of the best ways to understand how LLMs work. Why? Many people really liked and benefited from the abbreviated LLM workshop I shared last year: So, I thought this ~5× longer and more detailed content (~15 hours in total) would be even more useful. Also, I'm sadly dealing with a bad neck injury and haven't really been able to work on a computer for the past 3 weeks. I am currently trying a conservative treatment before considering the suggested surgical route. This is the worst timing as I just started to get back on track before life threw another curveball. So, during my recovery, I thought sharing these videos I recorded in the last couple of months would be a nice in-between content. I hope you find this useful, and thanks for your support! PS: The videos originally started as supplementary content for my Build a Large Language Model (From Scratch) book . But it turns out they also work pretty well as standalone content. Why build from scratch? It's probably the best and most efficient way to learn how LLMs really work. Plus, many readers have told me they had a lot of fun doing it. To offer an analogy: if you are into cars and want to understand how they work, following a tutorial that walks you through building one from the ground up is a great way to learn. Of course, we probably wouldn't want to start by building a Formula 1 race car since it would be prohibitively expensive and overly complex for a first project. Instead, it makes more sense to start with something simpler, like a go-kart. Building a go-kart still teaches you how the steering works, how the motor functions, and more. You can even take it to the track and practice (and have a lot of fun with it) before stepping into a professional race car (or joining a company or team that is focused on building one). After all, the best race drivers often got their start by building and tinkering with their own go-karts (think Michael Schumacher and Ayrton Senna). By doing that, they not only developed a great feel for the car but could also provide valuable feedback to their mechanics, which gave them an edge over the other drivers. Build an LLM from Scratch book ( Manning | Amazon ) Build an LLM from Scratch GitHub repository This is a supplementary video explaining how to set up a Python environment using uv. In particular, we are using “ , which is explained in this document . Alternatively, the native “ syntax (mentioned but not explicitly covered in this video) is described here . Note / Tip: The installation may cause issues on certain versions of Windows. If you are on a Windows machine and have troubles with the installation (likely due to a TensorFlow dependency to load the original GPT-2 model weights from OpenAI in video 5), please don’t worry about it and feel free to skip the TensorFlow installation (you can do this by removing the TensorFlow line from the requirements file.) To provide an alternative, I converted the GPT-2 model weights from a TensorFlow tensor format to PyTorch tensors and shared them on the Hugging Face model hub, which you can use as an alternative to the weight loading portion in video 5: https://huggingface.co/rasbt/gpt2-from-scratch-pytorch . In any case, you don’t have to worry about this weight-loading code until the end of video 5. This video goes over text data preparations steps (tokenization, byte pair encoding, data loaders, etc.) for LLM training. This is a supplementary video explaining how attention mechanisms (self-attention, causal attention, multi-head attention) work by coding them from scratch. You can think of it as building the engine of a car (before adding the frame, seats, and wheels). This video covers how to code an LLM architecture from scratch. This video explains how to pretrain a LLM from scratch. This is a video explaining how to fine-tune an LLM as a classifier (here using a spam classification example) as a gentle introduction to finetuning, before instruction finetuning the LLM in the next video. Finally, this video explains how to instruction finetune the LLM. Happy viewing & tinkering! As a big thank you to the paid subscribers, I want to share a 2.5h (non-coding) bonus video I recorded earlier in April, approximately 2 days after the Llama 4 release. In this talk, I discuss the current LLM landscape in 2025 with a focus on what and how things have changed since GPT-2 in 2018. Thanks for your support, as an independent and self-employed researcher, this really means a lot to me! Hopefully, things will improve in the next few weeks/months as I have lots of ideas for upcoming articles and can’t wait to work on them! Build an LLM from Scratch book ( Manning | Amazon ) Build an LLM from Scratch GitHub repository

Tutorial

Programming

Python

0 views

Ahead of AI 7 months ago

The State of Reinforcement Learning for LLM Reasoning

A lot has happened this month, especially with the releases of new flagship models like GPT-4.5 and Llama 4. But you might have noticed that reactions to these releases were relatively muted. Why? One reason could be that GPT-4.5 and Llama 4 remain conventional models, which means they were trained without explicit reinforcement learning for reasoning. Meanwhile, competitors such as xAI and Anthropic have added more reasoning capabilities and features into their models. For instance, both the xAI Grok and Anthropic Claude interfaces now include a "thinking" (or "extended thinking") button for certain models that explicitly toggles reasoning capabilities. In any case, the muted response to GPT-4.5 and Llama 4 (non-reasoning) models suggests we are approaching the limits of what scaling model size and data alone can achieve. However, OpenAI's recent release of the o3 reasoning model demonstrates there is still considerable room for improvement when investing compute strategically, specifically via reinforcement learning methods tailored for reasoning tasks. (According to OpenAI staff during the recent livestream, o3 used 10× more training compute compared to o1.) Source: OpenAI livestream (https://openai.com/live/) on April 16, 2025 While reasoning alone isn't a silver bullet, it reliably improves model accuracy and problem-solving capabilities on challenging tasks (so far). And I expect reasoning-focused post-training to become standard practice in future LLM pipelines. So, in this article, let's explore the latest developments in reasoning via reinforcement learning. This article focuses on reinforcement learning training methods used to develop and improve reasoning models Because it is a relatively long article, I am providing a Table of Contents overview below. To navigate the table of contents, please use the slider on the left-hand side in the web view. Understanding reasoning models RLHF basics: where it all started A brief introduction to PPO: RL's workhorse algorithm RL algorithms: from PPO to GRPO RL reward modeling: from RLHF to RLVR How the DeepSeek-R1 reasoning models were trained Lessons from recent RL papers on training reasoning models Noteworthy research papers on training reasoning models Tip: If you are already familiar with reasoning basics, RL, PPO, and GRPO, please feel free to directly jump ahead to the “Lessons from recent RL papers on training reasoning models” section, which contains summaries of interesting insights from recent reasoning research papers. The big elephant in the room is, of course, the definition of reasoning. In short, reasoning is about inference and training techniques that make LLMs better at handling complex tasks. To provide a bit more detail on how this is achieved (so far), I'd like to define reasoning as follows: Reasoning, in the context of LLMs, refers to the model's ability to produce intermediate steps before providing a final answer. This is a process that is often described as chain-of-thought (CoT) reasoning. In CoT reasoning, the LLM explicitly generates a structured sequence of statements or computations that illustrate how it arrives at its conclusion. And below is a figure along with the definition. A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation. If you are new to reasoning models and would like a more comprehensive introduction, I recommend my previous articles: Now, as hinted at the beginning of this section, the reasoning abilities of LLMs can be improved in two ways, as nicely illustrated in a figure from an OpenAI blog post: Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/ In my previous article: I solely focused on the test-time compute methods. In this article, I finally want to take a closer look at the training methods. The reinforcement learning (RL) training methods used to build and improve reasoning models are more or less related to the reinforcement learning with human feedback (RLHF) methodology that is used to develop and align conventional LLMs. So, I want to start with a small recap of how RLHF works before discussing reasoning-specific modification based on RL-based training. Conventional LLMs typically undergo a 3-step training procedure: Pre-training Supervised fine-tuning Alignment (typically via RLHF) The "original" LLM alignment method is RLHF, which is part of the standard repertoire when developing LLMs following the InstructGPT paper, which described the recipe that was used to develop the first ChatGPT model. The original goal of RLHF is to align LLMs with human preferences. For instance, suppose you use an LLM multiple times where the LLM generates multiple answers for a given prompt. RLHF guides the LLM towards generating more of the style of answer that you prefer. (Often, RLHF is also used to safety-tune LLMs: to avoid sharing sensitive information, using swear words, and so on.) If you are new to RLHF, here is an excerpt from a talk I gave a few years ago that explains RLHF in less than 5 minutes: Alternatively, the paragraphs below describe RLHF in text form. The RLHF pipeline takes a pre-trained model and fine-tunes it in a supervised fashion. This fine-tuning is not the RL part yet but is mainly a prerequisite. Then, RLHF further aligns the LLM using an algorithm called proximal policy optimization (PPO). (Note that there are other algorithms that can be used instead of PPO; I was specifically saying PPO because that's what was originally used in RLHF and is still the most popular one today.) For simplicity, we will look at the RLHF pipeline in three separate steps: RLHF Step 1 (prerequisite): Supervised fine-tuning (SFT) of the pre-trained model RLHF Step 2: Creating a reward model RLHF Step 3: Fine-tuning via proximal policy optimization (PPO) RLHF Step 1, shown below, is a supervised fine-tuning step to create the base model for further RLHF fine-tuning. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 In RLHF step 1, we create or sample prompts (from a database, for example) and ask humans to write good-quality responses. We then use this dataset to fine-tune the pre-trained base model in a supervised fashion. As mentioned before, this is not technically part of RL training but merely a prerequisite. In RLHF Step 2, we then use this model from supervised fine-tuning (SFT) to create a reward model, as shown below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 As depicted in the figure above, for each prompt, we generate four responses from the fine-tuned LLM created in the prior step. Human annotators then rank these responses based on their preferences. Although this ranking process is time-consuming, it might be somewhat less labor-intensive than creating the dataset for supervised fine-tuning. This is because ranking responses is likely simpler than writing them. Upon compiling a dataset with these rankings, we can design a reward model that outputs a reward score for the optimization subsequent stage in RLHF Step 3. The idea here is that the reward model replaces and automates the labor-intensive human ranking to make the training feasible on large datasets. This reward model (RM) generally originates from the LLM created in the prior supervised fine-tuning (SFT) step. To turn the model from RLHF Step 1 into a reward model, its output layer (the next-token classification layer) is substituted with a regression layer, which features a single output node. The third step in the RLHF pipeline is to use the reward model (RM) to fine-tune the previous model from supervised fine-tuning (SFT), which is illustrated in the figure below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 In RLHF Step 3, the final stage, we are now updating the SFT model using proximal policy optimization (PPO) based on the reward scores from the reward model we created in RLHF Step 2. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. As mentioned earlier, the original RLHF method uses a reinforcement learning algorithm called proximal policy optimization (PPO). PPO was developed to improve the stability and efficiency of training a policy. (In reinforcement learning, “policy” just means the model we want to train; in this case, policy = LLM.) One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step. This is done using a clipped loss function, which helps prevent the model from making overly large updates that could destabilize training. On top of that, PPO also includes a KL divergence penalty in the loss. This term compares the current policy (the model being trained) to the original SFT model. This encourages the updates to stay reasonably close. The idea is to preference-tune the model, not to completely re-train, after all. This is where the “proximal” in proximal policy optimization comes from: the algorithm tries to keep the updates close to the existing model while still allowing for improvement. And to encourage a bit of exploration, PPO also adds an entropy bonus, which this encourages the model to vary the outputs during training. In the following paragraphs, I want to introduce some more terminology to illustrate PPO on a relatively high level. Still, there's a lot of jargon involved, so I tried to summarize the key terminology in the figure below before we continue. Illustration of the key terms in RLHF. For instance, several models are involved in PPO, where PPO is an algorithm used in RLHF (and RLHF is one of the most popular LLM alignment methods). Below, I aim to illustrate the key steps in PPO via pseudo-code. In addition, to make it more intuitive, I will also use an analogy: Imagine you are a chef running a small food delivery service. And you are constantly trying out new recipe variations to improve customer satisfaction. Your overall goal is to tweak your recipe (policy) based on customer feedback (reward). 1. Compute the ratio of the next-token probabilities from the new vs the old policy: In short, this checks how different our new recipe is from the old one. Side note: Regarding "new_policy_prob", we are not using the final updated policy yet. We are using the current version of the policy (i.e., the model we are in the middle of training). However, it's a convention to call it "new". So, even though you're still experimenting, we call your current draft the "new policy" as per convention. 2. Multiply that ratio by how good the action was (called the advantage): Here, for simplicity, we may assume the advantage is computed based on the reward signal: In the chef analogy, we can think of the advantage as how well the new dish performed: For example, if a customer rates the new dish with a 9/10, and the customers normally give us a 7/10, that's a +2 advantage. Note that this is a simplification. In reality, this involves generalized advantage estimation (GAE), which I am omitting here so as not to bloat the article further. However, one important detail to mention is that the expected reward is computed by a so-called "critic" (sometimes also called "value model"), and a reward model computes the actual reward. I.e., the advantage computation involves 2 other models, typically the same size as the original model we are fine-tuning. In the analogy, we can think of this critic or value model as a friend we ask to try our new dish before serving it to the customers. We also ask our friend to estimate how a customer would rank it (that's the expected reward). The reward model is the actual customer then who gives the feedback (i.e., the actual reward). 3. Compute a clipped score: If the new policy changes too much (e.g., ratio > 1.2 or < 0.8), we clip the ratio, as follows: In the analogy, imagine that the new recipe got an exceptionally great (or bad) review. We might be tempted to overhaul the entire menu now. But that's risky. So, instead, we clip how much our recipe can change for now. (For instance, maybe we made the dish much spicier, and that one customer happened to love spicy food, but that doesn't mean everyone else will.) 4. Then we use the smaller of the raw score and clipped scor e: Again, this is related to being a bit cautious. For instance, if the advantage is positive (the new behavior is better), we cap the reward. That's because we don't want to over-trust a good result that might be a coincidence or luck. If the advantage is negative (the new behavior is worse), we limit the penalty. The idea here is similar. Namely, we don't want to overreact to one bad result unless we are really sure. In short, we use the smaller of the two scores if the advantage is positive (to avoid over-rewarding), and the larger when the advantage is negative (to avoid over-penalizing). In the analogy, this ensures that if a recipe is doing better than expected, we don't over-reward it unless we are confident. And if it's underperforming, we don't over-penalize it unless it's consistently bad. 5. Calculating the loss: This final score is what we maximize during training (using gradient descent after flipping the sign of the score to minimize). In addition, we also add a KL penalty term, where β is a hyperparameter for the penalty strength: In the analogy, we add the penalty to ensure new recipes are not too different from our original style. This prevents you from "reinventing the kitchen" every week. For example, we don't want to turn an Italian restaurant into a BBQ place all of a sudden. This was a lot of information, so I summarized it with a concrete, numeric example in an LLM context via the figure below. But please feel free to skip it if it's too complicated; you should be able to follow the rest of the article just fine. I admit that I may have gone overboard with the PPO walkthrough. But once I had written it, it was hard to delete it. I hope some of you will find it useful! That being said, the main takeaways that will be relevant in the next section are that there are multiple models involved in PPO: 1. The policy, which is the LLM that has been trained with SFT and that we want to further align). 2. The reward model, which is a model that has been trained to predict the reward (see RLHF step 2). 3. The critic, which is a trainable model that estimates the reward. 4. A reference model (original policy) that we use to make sure that the policy doesn't deviate too much. By the way, you might wonder why we need both a reward model and a critic model. The reward model is usually trained before training the policy with PPO. It's to automate the preference labeling by human judges, and it gives the score for the complete responses generated by the policy LLM. The critic, in contrast, judges partial responses. We use it to create the final response. While the reward model typically remains frozen, the critic model is updated during training to estimate the reward created by the reward model better. More details about PPO are out of the scope of this article, but interested readers can find the mathematical details in these four papers that predate the InstructGPT paper: (1) Asynchronous Methods for Deep Reinforcement Learning (2016) by Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu introduces policy gradient methods as an alternative to Q-learning in deep learning-based RL. (2) Proximal Policy Optimization Algorithms (2017) by Schulman, Wolski, Dhariwal, Radford, and Klimov presents a modified proximal policy-based reinforcement learning procedure that is more data-efficient and scalable than the vanilla policy optimization algorithm above. (3) Fine-Tuning Language Models from Human Preferences (2020) by Ziegler, Stiennon, Wu, Brown, Radford, Amodei, Christiano, Irving illustrates the concept of PPO and reward learning to pretrained language models including KL regularization to prevent the policy from diverging too far from natural language. (4) Learning to Summarize from Human Feedback (2022) by Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, Christiano introduces the popular RLHF three-step procedure that was later also used in the InstructGPT paper . As mentioned before, PPO was the original algorithm used in RLHF. From a technical standpoint, it works perfectly fine in the RL pipeline that's being used to develop reasoning models. However, what DeepSeek-R1 used for their RL pipeline is an algorithm called Group Relative Policy Optimization (GRPO), which was introduced in one of their earlier papers: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024) The DeepSeek team introduced GRPO as a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO. So, the key motivation here is to improve computational efficiency. The efficiency improvements are achieved by dropping the "critic" (value model), i.e., the LLM that computes the value function (i.e., the expected future reward). Instead of relying on this additional model to compute the estimated reward to compute the advantages, GRPO takes a simpler approach: it samples multiple answers from the policy model itself and uses their relative quality to compute the advantages. To illustrate the differences between PPO and GRPO, I borrowed a nice figure from the DeepSeekMath paper: Annotated figure from DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (https://arxiv.org/abs/2402.03300) to illustrate the differences between PPO and GRPO. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. So far, we looked at RLHF as a procedure, and we have introduced two reinforcement learning algorithms commonly used for it: PPO and GRPO. But if RLHF is already a core part of the LLM alignment toolkit, what does any of this have to do with reasoning? The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models. The difference is that instead of relying on human preferences and training a reward mode l, the DeepSeek-R1 team used verifiable rewards . This approach is called reinforcement learning with verifiable rewards (RLVR). Again, it's worth emphasizing: In contrast to standard RLHF, RLVR bypasses the need for a reward model. So, rather than learning what counts as a "good" answer from human-labeled examples, the model gets direct binary feedback (correct or wrong) from a deterministic tool, such as symbolic verifiers or rule-based tools. Think calculators for math problems or compilers for code generation. Example of reinforcement learning with verifiable rewards (RLVR). The model is prompted to solve a math problem and produces an answer. Instead of using a learned reward model, a symbolic verifier (e.g., a calculator) checks the output and provides binary feedback based on correctness. One motivation here is to avoid noisy or expensive human or learned rewards by using automatic correctness checks as supervision signals during RL. The other motivation is that by using "cheap" tools like calculators, we can replace the expensive reward model training and the reward model itself. Since the reward model is usually the whole pre-trained model (but with a regression head), RLVR is much more efficient. So, in short, DeepSeek-R1 used RLVR with GRPO, which eliminates two expensive models in the training procedure: the reward model and the value model (critic), as illustrated in the figure below. Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by also removing the reward model, relying instead on verifiable rewards from symbolic tools like calculators or compilers. In the next section, I want to briefly go over the DeepSeek-R1 pipeline and discuss the different verifiable rewards that the DeepSeek team used. Now that we have clarified what RLHF and RLVR are, as well as PPO and GRPO, let's briefly recap the main insights from the DeepSeek-R1 paper in the context of RL and reasoning. First, there were three types of models: DeepSeek-R1-Zero trained with pure RL DeepSeek-R1 trained with instruction fine-tuning (SFT) and RL DeepSeek-Distill variants created via instruction fine-tuning SFT without RL I created a DeepSeek-R1 pipeline diagram to illustrate how these models relate to each other, as shown below. Training pipeline for the DeepSeek-R1 family DeepSeek-R1-Zero was trained using the verifiable rewards (RLVR) with GRPO, and this turned out to be sufficient for the model to exhibit reasoning abilities via intermediate-step generation. This showed that it's possible to skip the SFT stage. The model improves its reasoning abilities through exploration instead of learning from examples. DeepSeek-R1 is the flagship model, the one with the best performance. The difference compared to DeepSeek-R1-Zero is that they alternated instruction fine-tuning, RLVR, and RLHF. DeepSeek-Distill variants are meant to be small and more easily deployable models; they were generated by instruction fine-tuning Llama 3 and Qwen 2.5 models using instruction data from the DeepSeek-R1 model. This approach didn't use any RL for the reasoning part (however, RLHF was used to create the Llama 3 and Qwen 2.5 base models). For more details on explaining the DeepSeek-R1 pipeline, please see my previous article "Understanding Reasoning LLMs": The main takeaway here is that the DeepSeek team didn't use an LLM-based reward model to train DeepSeek-R1-Zero. Instead, they used rule-based rewards for the reasoning training of DeepSeek-R1-Zero and DeepSeek-R1: We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process [...] To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: (1) Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases. (2) Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between '<think>' and '</think>’ tags. I realize that the introduction (i.e., everything up to this point) turned out to be much longer than I expected. Nonetheless, I think that this lengthy introduction is perhaps necessary to put the following lessons into context. After going through a large number of recent papers on reasoning models last month, I have put together a summary of the most interesting ideas and insights in this section. (References like “[1]” point to the corresponding papers listed at the end of the article.) The original DeepSeek-R1 paper demonstrated clearly that supervised fine-tuning (SFT) followed by reinforcement learning (RL) outperforms RL alone. Given this observation, it's intuitive that additional RL should further improve distilled models (as distilled models essentially represent models trained via SFT using reasoning examples generated by a larger model.) Indeed, the DeepSeek team observed this phenomenon explicitly: Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here. Several teams independently verified these observations: [8] Using the 1.5B DeepSeek-R1-Distill-Qwen model, researchers demonstrated substantial performance improvements from RL fine-tuning with just 7,000 examples and a modest $42 compute budget. Impressively, this small model surpassed OpenAI’s o1-preview on the AIME24 math benchmark. [15] However, another team cautioned that these gains might not always be statistically significant. This suggests that, although RL can improve smaller distilled models, the benchmark results might sometimes be overstating the improvements. Annotated figure from A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086 I previously mentioned that RL with verifiable rewards (RLVR) does not strictly require the GRPO algorithm; DeepSeek's GRPO simply happens to be efficient and to perform well. However, [12] showed that vanilla PPO paired with a basic binary correctness reward was sufficient to scale models in reasoning capability and response length. More interestingly, both PPO and GRPO have a length bias. And several papers explored methods to tackle excessively long incorrect answers: [14] Provided an analysis illustrating how PPO inadvertently favors longer responses due to mathematical biases in loss calculations; GRPO may suffer from the same issue. Annotated figure from Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 As a follow-up to the statement above, [7] [10] specifically identified length and difficulty-level biases in GRPO. The modified variant "Dr. GRPO" simplifies advantage calculations by removing length and standard deviation normalization, providing clearer training signals. [1] Explicitly penalized lengthy incorrect answers in GRPO while rewarding concise, correct ones. [3] [6] Didn’t directly control response length in GRPO but found token-level rewards beneficial, allowing models to better focus on critical reasoning steps. [5] Introduced explicit penalties in GRPO for responses exceeding specific lengths, enabling precise length control during inference. Beyond "AHA" moments mentioned in the DeepSeek-R1 paper, RL has been shown to induce valuable self-verification and reflective reasoning capabilities in models [2] [9]. Interestingly, similar to the AHA moment, these capabilities emerged naturally during training without explicit instruction. [1] Showed that extending context lengths (up to 128k tokens) further improves the model's self-reflection and self-correction capabilities. Most research efforts so far has focused on reasoning tasks in math or coding contexts. However, [4] demonstrated successful generalization by training models on logic puzzles. And models trained on logic puzzles also achieved strong performance in mathematical reasoning tasks. This is evidence for RL's ability to induce general reasoning behaviors independent of specific domain knowledge. As a follow-up to the section above, another interesting insight [11] is that reasoning capabilities can naturally extend beyond structured domains like math, code, and logic. Models successfully applied reasoning to areas including medicine, chemistry, psychology, economics, and education, leveraging generative soft-scoring methods to effectively handle free-form answers. Notable next steps for reasoning models include: Integrating existing reasoning models (e.g., o1, DeepSeek-R1) with capabilities such as external tool use and retrieval-augmented generation (RAG); the just-released o3 model from Open AI paves the way here Speaking of tool-use and search, [9] showed that giving reasoning models the ability to search induces behaviors such as self-correction and robust generalization across benchmarks, despite minimal training datasets. Based on the hoops DeepSeek-R1 team went through in terms of maintaining the performance on knowledge-based tasks, I believe adding search abilities to reasoning models is almost a no-brainer. The fundamental claim behind DeepSeek-R1 (and R1-Zero) is that RLVR explicitly induces reasoning capabilities. However, recent findings [10] suggest that reasoning behaviors, including the "Aha moment," might already be present in base models due to pre-training on extensive chain-of-thought data. My recent comparisons between DeepSeek V3 base and R1 reinforce this observation, as the updated base model also demonstrates reasoning-like behaviors. For instance, the comparison between the original V3 and R1 models clearly shows the difference between a non-reasoning and a reasoning model: However, this is no longer true when comparing the updated V3 base model to R1: Additionally, [13] identified that self-reflection and self-correction behaviors emerge progressively throughout pre-training across various domains and model sizes. This further complicates the attribution of reasoning capabilities solely to RL methods. Perhaps the conclusion is that RL definitely turns simple base models into reasoning models. However, it's not the only way to induce or improve reasoning abilities. As the DeepSeek-R1 team showed, distillation also improves reasoning. And since distillation, in this paper, meant instruction fine-tuning on chain-of-thought data, it's likely that pre-training on data that includes chain-of-thought data induces these abilities as well. (As I explained in my book through hands-on code, pre-training and instruction fine-tuning are based on the same next-token prediction task and loss functions, after all.) After reading through a large number of reasoning papers last month, I tried to summarize the most interesting takeaways in the previous section. However, for those who are curious about the sources with a bit more detail, I also listed 15 relevant papers in this section below as an optional read. (For simplicity, the following summaries are sorted by date.) Please note that this list is also not comprehensive (I capped it at 15), as this article is already more than too long! 📄 22 Jan, Kimi k1.5: Scaling Reinforcement Learning with LLMs , https://arxiv.org/abs/2501.12599 It's interesting that this paper came out the same day as the DeepSeek-R1 paper! Here, the authors showcase a multi-modal LLM trained with RL. Similar to DeepSeek-R1, they didn't use process reward models (PRMs) but employed verifiable rewards. A PRM is a type of reward model used in RL (especially in LLM training) that evaluates not just the final answer but also the reasoning steps that led to it. Another key idea here is that scaling the context length (up to 128k tokens) helps the model plan, reflect, and self-correct during reasoning. So, in addition to the correctness reward that is similar to DeepSeek-R1 they also have a length reward. Specifically, they promote shorter correct responses, and incorrect long answers get penalized more. And they propose a method called long2short to distill these long-chain-of-thought skills into more efficient short-CoT models. (It does this by distilling shorter correct responses from the long-CoT model using methods like model merging, shortest rejection sampling, DPO, and a 2nd round of RL with stronger length penalties.) Annotated figure from Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs//2501.12599 📄 3 Feb, Competitive Programming with Large Reasoning Models , https://arxiv.org/abs/2502.06807 This paper from OpenAI evaluates their o-models (like o1, o1-ioi, and o3) on competitive programming tasks. While it doesn't go into the technical details of how RL was applied, it still offers some interesting takeaways. First, the models were trained using outcome-based RL, rather than process-based reward models. This is similar to approaches like DeepSeek-R1 and Kimi. One of the interesting findings is that o3 can learn its own test-time (i.e., inference-time scaling) strategies. For example, it often writes a simple brute-force version of a problem (something that trades efficiency for correctness) and then uses it to verify the outputs of its more optimized solution. This kind of strategy wasn't hand-coded; the model figured it out on its own. So overall, the paper argues that scaling general-purpose RL allows models to develop their own reasoning and verification methods, without needing any human heuristics or domain-specific inference pipelines. In contrast, other (earlier) models like o1-ioi relied on handcrafted test-time strategies like clustering thousands of samples and reranking them, which required a lot of manual design and tuning. Annotated figure from Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 [3] Exploring the Limit of Outcome Reward 📄 10 Feb, Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning , https://arxiv.org/abs/2502.06781 This paper explores how far RL with just binary "correct" or "wrong" feedback (like in DeepSeek-R1) can go for solving math problems. To do this, they start by using Best-of-N sampling to collect positive examples and apply behavior cloning on them, which they show is theoretically enough to optimize the policy. To deal with the challenge of sparse rewards (especially when long chains of thought include partially correct steps) they add a token-level reward model that learns to assign importance weights to different parts of the reasoning. This helps the model focus on the most critical steps when learning and improves the overall performance. Annotated figure from Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, https://arxiv.org/abs/2502.06781 📄 20 Feb, Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning , https://arxiv.org/abs/2502.14768 DeepSeek-R1 focused on math and code tasks. This paper trains a 7B model using logic puzzles as the main training data. The researchers adopt a similar rule-based RL setup as DeepSeek-R1 but make several adjustments: 1. They introduce a strict format reward that penalizes shortcuts and ensures the model separates its reasoning from its final answer using <think> and <answer> tags. 2. They also use a system prompt that explicitly tells the model to first think through the problem step-by-step before giving the final answer. Even with only 5K synthetic logic problems, the model develops good reasoning skills that generalize well to harder math benchmarks like AIME and AMC. This is particularly interesting because it shows that logic-based RL training can teach models to reason in ways that transfer beyond the original domain. Annotated figure from Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, https://arxiv.org/abs/2502.14768 📄 6 Mar, L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning , https://arxiv.org/abs/2503.04697 One hallmark of reasoning models is that they tend to generate longer outputs because of chain-of-thought reasoning. But by default, there is no explicit way to control how long the responses are. This paper introduces Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that helps models to adhere to user-specified length constraints while still optimizing for accuracy. In short, LCPO is similar to GRPO, i.e., "GRPO + Custom Reward for Length Control" implemented as where the target length is provided as part of the user prompt. This LCPO method above encourages the model to adhere to the provided target length exactly. In addition, they also introduce an LCPO-Max variant, which, instead of encouraging the model to match the target length exactly, encourages the model to stay below a maximum token length: The authors train a 1.5B model called L1 using LCPO, which can adjust its output length based on the prompt. This lets users trade-off between accuracy and compute, depending on the task. Interestingly, the paper also finds that these long-chain models actually become surprisingly good at short reasoning too, even outperforming much larger models like GPT-4o at the same token lengths. Annotated figure from L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, https://arxiv.org/abs/2503.04697 📄 10 Mar, R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning , https://arxiv.org/abs/2503.05592 Reasoning models like DeepSeek-R1 that have been trained with RL rely on their internal knowledge. The authors here focus on improving these models on knowledge-based tasks that require more time-sensitive or recent information by adding access to external search systems. So, this paper improves these models by teaching them to use external search systems during the reasoning process. Instead of relying on test-time strategies or supervised training, the authors use a two-stage reinforcement learning method that helps the model learn how and when to search on its own. The model first learns the search format, and then learns how to use search results to find correct answers. Annotated figure from R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 📄 18 Mar, DAPO: An Open-Source LLM Reinforcement Learning System at Scale , https://arxiv.org/abs/2503.14476 While this paper is mainly about developing a DeepSeek-R1-like training pipeline and open-sourcing it, it also proposes interesting improvements to the GRPO algorithm that was used in DeepSeek-R1 training. 1. Clip-higher: Increases the upper bound of the PPO clipping range to encourage exploration and prevent entropy collapse during training. 2. Dynamic sampling: Improves training efficiency by filtering out prompts where all sampled responses are either always correct or always wrong. 3. Token-level policy gradient loss: moves from sample-level to token-level loss calculation so that longer responses can have more influence on the gradient update.* 4. Overlong reward shaping: Adds a soft penalty for responses that get truncated for being too long, which reduces reward noise and helps stabilize training. * Standard GRPO uses a sample-level loss calculation. This involves first averaging the loss over the tokens for each sample and then averaging the loss over the samples. Since the samples have equal weight, the tokens in samples with longer responses may disproportionally contribute less to the overall loss. At the same time, researchers observed that longer responses often contain gibberish before the final answer, and this gibberish wouldn't be sufficiently penalized in the original GRPO sample-level loss calculation. Annotated figure from DAPO: An Open-Source LLM Reinforcement Learning System at Scale, https://arxiv.org/abs/2503.14476 📄 20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't , https://arxiv.org/abs/2503.16219 The original DeepSeek-R1 paper showed that when developing small(er) reasoning models, distillation gives better results than pure RL. In this paper, researchers follow up on this and investigate ways to improve small, distilled reasoning models further with RL. So, using the 1.5B DeepSeek-R1-Distill-Qwen model, they find that with only 7000 training examples and a $42 compute budget, RL fine-tuning can lead to strong improvements. In this case, the improvements are enough to outperform OpenAI's o1-preview on the AIME24 math benchmark, for example. Furthermore, there were 3 interesting learnings in that paper: 1. Small LLMs can achieve fast reasoning improvements within the first 50–100 training steps using a compact, high-quality dataset. But the performance quickly drops if training continues too long, mainly due to length limits and output instability. 2. Mixing easier and harder problems helps the model produce shorter, more stable responses early in training. However, performance still degrades over time. 3. Using a cosine-shaped reward function helps control output length more effectively and improves training consistency. But this slightly reduces peak performance compared to standard accuracy-based rewards. Annotated figure from Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, https://arxiv.org/abs/2503.16219 📄 25 Mar, ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning , https://arxiv.org/abs/2503.19470 The ReSearch framework proposed in this paper extends the RL method from the DeepSeek-R1 paper to include search results as part of the reasoning process. The model learns when and how to search based on its ongoing reasoning chain, and it then uses the retrieved information for the next steps of reasoning. This is all done without supervised data on reasoning steps. The researchers also show that this approach can lead to useful behaviors like self-correction and reflection, and that it generalizes well across multiple benchmarks despite being trained on just one dataset. Annotated figure from ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.19470 PS: How does this method differ from the R1-Searcher discussed earlier? R1-Searcher uses a two-stage, outcome-based reinforcement learning approach. In the first stage, it teaches the model how to invoke external retrieval; in the second, it learns to use the retrieved information to answer questions. ReSearch, in contrast, integrates search directly into the reasoning process. It trains the model end-to-end using reinforcement learning, without any supervision on reasoning steps. Behaviors such as reflecting on incorrect queries and correcting them emerge naturally during training here. 📄 26 Mar, Understanding R1-Zero-Like Training: A Critical Perspective, https://arxiv.org/abs/2503.20783 This paper investigates why DeepSeek-R1-Zero's pure RL approach works to improve reasoning. The authors find that some base models like Qwen2.5 already show strong reasoning and even the "Aha moment" without any RL. So the "Aha moment" might not be induced by RL, but instead inherited from pre-training. This challenges the idea that RL alone is what creates deep reasoning behaviors. The paper also identifies two biases in GRPO: 1. Response-length bias: GRPO divides the advantage by the length of the response. This makes long incorrect answers get smaller penalties, so the model learns to generate longer bad answers. 2. Difficulty-level bias: GRPO also normalizes by the standard deviation of rewards for each question. Easy or hard questions (with low reward variance) get overweighted. To fix this, the authors introduce Dr. GRPO, which is a modification of standard GRPO. Here, they get rid of the response length normalization in the advantage computation. Also, they get rid of the question-level standard deviation. This will result in more efficient training and fewer unnecessary long answers. Especially if the model is wrong, generating a long answer is no longer encouraged. 📄 31 Mar, Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains , https://arxiv.org/abs/2503.23829 DeepSeek-R1 and most other reasoning models that followed focused on reward signals from easily verifiable domains like code and math. This paper explores how to extend these methods to more complex areas like medicine, chemistry, psychology, economics, and education, where answers are usually free-form and harder to verify (beyond a simple correct/incorrect). The authors find that using expert-written reference answers makes evaluation more feasible than expected, even in these broader domains. To provide reward signals, they introduce a generative, soft-scoring method without needing heavy domain-specific annotation. Annotated figure from Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains, https://arxiv.org/abs/2503.23829 📄 31 Mar, Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model , https://arxiv.org/abs/2503.24290 In this paper, the authors explore a minimalist reinforcement learning setup for training LLMs on reasoning tasks. They use vanilla PPO instead of GRPO (which was used in DeepSeek-R1-Zero) and skip the usual KL regularization commonly included in RLHF pipelines. Interestingly, they find that this simple setup (vanilla PPO and a basic binary reward function based on answer correctness) is sufficient to train models that scale up in both reasoning performance and response length. Using the same Qwen-32B base as DeepSeek-R1-Zero, their model outperforms it on multiple reasoning benchmarks while requiring only 1/10 the training steps. Annotated figure from Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290 📄 5 Apr, Rethinking Reflection in Pre-Training , https://arxiv.org/abs/2504.04022 Based on the interesting insights from the DeepSeek-R1 paper, namely applying pure RL to a base model, we think that reasoning abilities in LLMs emerge from RL. This paper provides a bit of a plot twist, saying that self-correction already appears earlier during pre-training. Concretely, by introducing deliberately flawed chains-of-thought into tasks, the authors measure whether models can identify and correct these errors. They find that both explicit and implicit forms of reflection emerge steadily throughout pre-training. This happens across many domains and model sizes. Even relatively early checkpoints show signs of self-correction, and the ability becomes stronger as pre-training compute increases. Annotated figure from Rethinking Reflection in Pre-Training, https://arxiv.org/abs/2504.04022 📄 7 Apr, Concise Reasoning via Reinforcement Learning , https://arxiv.org/abs/2504.05185 As we all know by now, reasoning models often generate longer responses, which raises compute costs. Now, this new paper shows that this behavior comes from the RL training process, not from an actual need for long answers for better accuracy. The RL loss tends to favor longer responses when the model gets negative rewards, which I think explains the "aha" moments and longer chains of thought that arise from pure RL training. I.e., if the model gets a negative reward (i.e., the answer is wrong), the math behind PPO causes the average per-token loss becomes smaller when the response is longer. So, the model is indirectly encouraged to make its responses longer. This is true even if those extra tokens don't actually help solve the problem. What does the response length have to do with the loss? When the reward is negative, longer responses can dilute the penalty per individual token, which results in lower (i.e., better) loss values (even though the model is still getting the answer wrong). So the model "learns" that longer responses reduce the punishment, even though they are not helping correctness. However, it's important to emphasize that this analysis was done for PPO: Of note, our current analysis is not applicable to GRPO, and a precise analysis of such methods is left for future work. In addition, the researchers show that a second round of RL (using just a few problems that are sometimes solvable) can shorten responses while preserving or even improving accuracy. This has big implications for deployment efficiency. Annotated figure from Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 📄 9 Apr, A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility , https://arxiv.org/abs/2504.07086 This paper takes a closer look at recent claims that RL can improve distilled language models, like those based on DeepSeek-R1. For instance, I previously discussed the "20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" paper that found RL is effective for distilled models. And also the DeepSeek-R1 paper mentioned Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here. So, while earlier papers reported large performance boosts from RL, this work finds that many of those improvements might just be noise. The authors show that results on small benchmarks like AIME24 are highly unstable: just changing a random seed can shift scores by several percentage points. When RL models are evaluated under more controlled and standardized setups, the gains turn out to be much smaller than originally reported, and often not statistically significant. However, some models trained with RL do show modest improvements, but these are usually weaker than what supervised fine-tuning achieves, and they often don't generalize well to new benchmarks. So, while RL might help in some cases to improve smaller distilled models, this paper argues that its benefits have been overstated and better evaluation standards are needed to understand what’s actually working. Annotated figure from A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086 This magazine is a personal passion project. To support me as an independent researcher, please consider purchasing a copy of my book, Build a Large Language Model (From Scratch) book , or signing up for a paid subscription . Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Source: OpenAI livestream (https://openai.com/live/) on April 16, 2025 While reasoning alone isn't a silver bullet, it reliably improves model accuracy and problem-solving capabilities on challenging tasks (so far). And I expect reasoning-focused post-training to become standard practice in future LLM pipelines. So, in this article, let's explore the latest developments in reasoning via reinforcement learning. This article focuses on reinforcement learning training methods used to develop and improve reasoning models Because it is a relatively long article, I am providing a Table of Contents overview below. To navigate the table of contents, please use the slider on the left-hand side in the web view. Understanding reasoning models RLHF basics: where it all started A brief introduction to PPO: RL's workhorse algorithm RL algorithms: from PPO to GRPO RL reward modeling: from RLHF to RLVR How the DeepSeek-R1 reasoning models were trained Lessons from recent RL papers on training reasoning models Noteworthy research papers on training reasoning models A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation. If you are new to reasoning models and would like a more comprehensive introduction, I recommend my previous articles: Now, as hinted at the beginning of this section, the reasoning abilities of LLMs can be improved in two ways, as nicely illustrated in a figure from an OpenAI blog post: Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/ In my previous article: I solely focused on the test-time compute methods. In this article, I finally want to take a closer look at the training methods. RLHF basics: where it all started The reinforcement learning (RL) training methods used to build and improve reasoning models are more or less related to the reinforcement learning with human feedback (RLHF) methodology that is used to develop and align conventional LLMs. So, I want to start with a small recap of how RLHF works before discussing reasoning-specific modification based on RL-based training. Conventional LLMs typically undergo a 3-step training procedure: Pre-training Supervised fine-tuning Alignment (typically via RLHF) RLHF Step 1 (prerequisite): Supervised fine-tuning (SFT) of the pre-trained model RLHF Step 2: Creating a reward model RLHF Step 3: Fine-tuning via proximal policy optimization (PPO) Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 In RLHF step 1, we create or sample prompts (from a database, for example) and ask humans to write good-quality responses. We then use this dataset to fine-tune the pre-trained base model in a supervised fashion. As mentioned before, this is not technically part of RL training but merely a prerequisite. In RLHF Step 2, we then use this model from supervised fine-tuning (SFT) to create a reward model, as shown below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 As depicted in the figure above, for each prompt, we generate four responses from the fine-tuned LLM created in the prior step. Human annotators then rank these responses based on their preferences. Although this ranking process is time-consuming, it might be somewhat less labor-intensive than creating the dataset for supervised fine-tuning. This is because ranking responses is likely simpler than writing them. Upon compiling a dataset with these rankings, we can design a reward model that outputs a reward score for the optimization subsequent stage in RLHF Step 3. The idea here is that the reward model replaces and automates the labor-intensive human ranking to make the training feasible on large datasets. This reward model (RM) generally originates from the LLM created in the prior supervised fine-tuning (SFT) step. To turn the model from RLHF Step 1 into a reward model, its output layer (the next-token classification layer) is substituted with a regression layer, which features a single output node. The third step in the RLHF pipeline is to use the reward model (RM) to fine-tune the previous model from supervised fine-tuning (SFT), which is illustrated in the figure below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 In RLHF Step 3, the final stage, we are now updating the SFT model using proximal policy optimization (PPO) based on the reward scores from the reward model we created in RLHF Step 2. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. A brief introduction to PPO: RL's workhorse algorithm As mentioned earlier, the original RLHF method uses a reinforcement learning algorithm called proximal policy optimization (PPO). PPO was developed to improve the stability and efficiency of training a policy. (In reinforcement learning, “policy” just means the model we want to train; in this case, policy = LLM.) One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step. This is done using a clipped loss function, which helps prevent the model from making overly large updates that could destabilize training. On top of that, PPO also includes a KL divergence penalty in the loss. This term compares the current policy (the model being trained) to the original SFT model. This encourages the updates to stay reasonably close. The idea is to preference-tune the model, not to completely re-train, after all. This is where the “proximal” in proximal policy optimization comes from: the algorithm tries to keep the updates close to the existing model while still allowing for improvement. And to encourage a bit of exploration, PPO also adds an entropy bonus, which this encourages the model to vary the outputs during training. In the following paragraphs, I want to introduce some more terminology to illustrate PPO on a relatively high level. Still, there's a lot of jargon involved, so I tried to summarize the key terminology in the figure below before we continue. Illustration of the key terms in RLHF. For instance, several models are involved in PPO, where PPO is an algorithm used in RLHF (and RLHF is one of the most popular LLM alignment methods). Below, I aim to illustrate the key steps in PPO via pseudo-code. In addition, to make it more intuitive, I will also use an analogy: Imagine you are a chef running a small food delivery service. And you are constantly trying out new recipe variations to improve customer satisfaction. Your overall goal is to tweak your recipe (policy) based on customer feedback (reward). 1. Compute the ratio of the next-token probabilities from the new vs the old policy: In short, this checks how different our new recipe is from the old one. Side note: Regarding "new_policy_prob", we are not using the final updated policy yet. We are using the current version of the policy (i.e., the model we are in the middle of training). However, it's a convention to call it "new". So, even though you're still experimenting, we call your current draft the "new policy" as per convention. 2. Multiply that ratio by how good the action was (called the advantage): Here, for simplicity, we may assume the advantage is computed based on the reward signal: In the chef analogy, we can think of the advantage as how well the new dish performed: For example, if a customer rates the new dish with a 9/10, and the customers normally give us a 7/10, that's a +2 advantage. Note that this is a simplification. In reality, this involves generalized advantage estimation (GAE), which I am omitting here so as not to bloat the article further. However, one important detail to mention is that the expected reward is computed by a so-called "critic" (sometimes also called "value model"), and a reward model computes the actual reward. I.e., the advantage computation involves 2 other models, typically the same size as the original model we are fine-tuning. In the analogy, we can think of this critic or value model as a friend we ask to try our new dish before serving it to the customers. We also ask our friend to estimate how a customer would rank it (that's the expected reward). The reward model is the actual customer then who gives the feedback (i.e., the actual reward). 3. Compute a clipped score: If the new policy changes too much (e.g., ratio > 1.2 or < 0.8), we clip the ratio, as follows: In the analogy, imagine that the new recipe got an exceptionally great (or bad) review. We might be tempted to overhaul the entire menu now. But that's risky. So, instead, we clip how much our recipe can change for now. (For instance, maybe we made the dish much spicier, and that one customer happened to love spicy food, but that doesn't mean everyone else will.) 4. Then we use the smaller of the raw score and clipped scor e: Again, this is related to being a bit cautious. For instance, if the advantage is positive (the new behavior is better), we cap the reward. That's because we don't want to over-trust a good result that might be a coincidence or luck. If the advantage is negative (the new behavior is worse), we limit the penalty. The idea here is similar. Namely, we don't want to overreact to one bad result unless we are really sure. In short, we use the smaller of the two scores if the advantage is positive (to avoid over-rewarding), and the larger when the advantage is negative (to avoid over-penalizing). In the analogy, this ensures that if a recipe is doing better than expected, we don't over-reward it unless we are confident. And if it's underperforming, we don't over-penalize it unless it's consistently bad. 5. Calculating the loss: This final score is what we maximize during training (using gradient descent after flipping the sign of the score to minimize). In addition, we also add a KL penalty term, where β is a hyperparameter for the penalty strength: In the analogy, we add the penalty to ensure new recipes are not too different from our original style. This prevents you from "reinventing the kitchen" every week. For example, we don't want to turn an Italian restaurant into a BBQ place all of a sudden. This was a lot of information, so I summarized it with a concrete, numeric example in an LLM context via the figure below. But please feel free to skip it if it's too complicated; you should be able to follow the rest of the article just fine. I admit that I may have gone overboard with the PPO walkthrough. But once I had written it, it was hard to delete it. I hope some of you will find it useful! That being said, the main takeaways that will be relevant in the next section are that there are multiple models involved in PPO: 1. The policy, which is the LLM that has been trained with SFT and that we want to further align). 2. The reward model, which is a model that has been trained to predict the reward (see RLHF step 2). 3. The critic, which is a trainable model that estimates the reward. 4. A reference model (original policy) that we use to make sure that the policy doesn't deviate too much. By the way, you might wonder why we need both a reward model and a critic model. The reward model is usually trained before training the policy with PPO. It's to automate the preference labeling by human judges, and it gives the score for the complete responses generated by the policy LLM. The critic, in contrast, judges partial responses. We use it to create the final response. While the reward model typically remains frozen, the critic model is updated during training to estimate the reward created by the reward model better. More details about PPO are out of the scope of this article, but interested readers can find the mathematical details in these four papers that predate the InstructGPT paper: (1) Asynchronous Methods for Deep Reinforcement Learning (2016) by Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu introduces policy gradient methods as an alternative to Q-learning in deep learning-based RL. (2) Proximal Policy Optimization Algorithms (2017) by Schulman, Wolski, Dhariwal, Radford, and Klimov presents a modified proximal policy-based reinforcement learning procedure that is more data-efficient and scalable than the vanilla policy optimization algorithm above. (3) Fine-Tuning Language Models from Human Preferences (2020) by Ziegler, Stiennon, Wu, Brown, Radford, Amodei, Christiano, Irving illustrates the concept of PPO and reward learning to pretrained language models including KL regularization to prevent the policy from diverging too far from natural language. (4) Learning to Summarize from Human Feedback (2022) by Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, Christiano introduces the popular RLHF three-step procedure that was later also used in the InstructGPT paper . RL algorithms: from PPO to GRPO As mentioned before, PPO was the original algorithm used in RLHF. From a technical standpoint, it works perfectly fine in the RL pipeline that's being used to develop reasoning models. However, what DeepSeek-R1 used for their RL pipeline is an algorithm called Group Relative Policy Optimization (GRPO), which was introduced in one of their earlier papers: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024) Annotated figure from DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (https://arxiv.org/abs/2402.03300) to illustrate the differences between PPO and GRPO. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. RL reward modeling: from RLHF to RLVR So far, we looked at RLHF as a procedure, and we have introduced two reinforcement learning algorithms commonly used for it: PPO and GRPO. But if RLHF is already a core part of the LLM alignment toolkit, what does any of this have to do with reasoning? The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models. The difference is that instead of relying on human preferences and training a reward mode l, the DeepSeek-R1 team used verifiable rewards . This approach is called reinforcement learning with verifiable rewards (RLVR). Again, it's worth emphasizing: In contrast to standard RLHF, RLVR bypasses the need for a reward model. So, rather than learning what counts as a "good" answer from human-labeled examples, the model gets direct binary feedback (correct or wrong) from a deterministic tool, such as symbolic verifiers or rule-based tools. Think calculators for math problems or compilers for code generation. Example of reinforcement learning with verifiable rewards (RLVR). The model is prompted to solve a math problem and produces an answer. Instead of using a learned reward model, a symbolic verifier (e.g., a calculator) checks the output and provides binary feedback based on correctness. One motivation here is to avoid noisy or expensive human or learned rewards by using automatic correctness checks as supervision signals during RL. The other motivation is that by using "cheap" tools like calculators, we can replace the expensive reward model training and the reward model itself. Since the reward model is usually the whole pre-trained model (but with a regression head), RLVR is much more efficient. So, in short, DeepSeek-R1 used RLVR with GRPO, which eliminates two expensive models in the training procedure: the reward model and the value model (critic), as illustrated in the figure below. Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by also removing the reward model, relying instead on verifiable rewards from symbolic tools like calculators or compilers. In the next section, I want to briefly go over the DeepSeek-R1 pipeline and discuss the different verifiable rewards that the DeepSeek team used. How the DeepSeek-R1 reasoning models were trained Now that we have clarified what RLHF and RLVR are, as well as PPO and GRPO, let's briefly recap the main insights from the DeepSeek-R1 paper in the context of RL and reasoning. First, there were three types of models: DeepSeek-R1-Zero trained with pure RL DeepSeek-R1 trained with instruction fine-tuning (SFT) and RL DeepSeek-Distill variants created via instruction fine-tuning SFT without RL Training pipeline for the DeepSeek-R1 family DeepSeek-R1-Zero was trained using the verifiable rewards (RLVR) with GRPO, and this turned out to be sufficient for the model to exhibit reasoning abilities via intermediate-step generation. This showed that it's possible to skip the SFT stage. The model improves its reasoning abilities through exploration instead of learning from examples. DeepSeek-R1 is the flagship model, the one with the best performance. The difference compared to DeepSeek-R1-Zero is that they alternated instruction fine-tuning, RLVR, and RLHF. DeepSeek-Distill variants are meant to be small and more easily deployable models; they were generated by instruction fine-tuning Llama 3 and Qwen 2.5 models using instruction data from the DeepSeek-R1 model. This approach didn't use any RL for the reasoning part (however, RLHF was used to create the Llama 3 and Qwen 2.5 base models). For more details on explaining the DeepSeek-R1 pipeline, please see my previous article "Understanding Reasoning LLMs": The main takeaway here is that the DeepSeek team didn't use an LLM-based reward model to train DeepSeek-R1-Zero. Instead, they used rule-based rewards for the reasoning training of DeepSeek-R1-Zero and DeepSeek-R1: We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process [...] To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: (1) Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases. (2) Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between '<think>' and '</think>’ tags. Lessons from recent RL papers on training reasoning models I realize that the introduction (i.e., everything up to this point) turned out to be much longer than I expected. Nonetheless, I think that this lengthy introduction is perhaps necessary to put the following lessons into context. After going through a large number of recent papers on reasoning models last month, I have put together a summary of the most interesting ideas and insights in this section. (References like “[1]” point to the corresponding papers listed at the end of the article.) 1. Reinforcement learning further improves distilled models The original DeepSeek-R1 paper demonstrated clearly that supervised fine-tuning (SFT) followed by reinforcement learning (RL) outperforms RL alone. Given this observation, it's intuitive that additional RL should further improve distilled models (as distilled models essentially represent models trained via SFT using reasoning examples generated by a larger model.) Indeed, the DeepSeek team observed this phenomenon explicitly: Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here. Several teams independently verified these observations: [8] Using the 1.5B DeepSeek-R1-Distill-Qwen model, researchers demonstrated substantial performance improvements from RL fine-tuning with just 7,000 examples and a modest $42 compute budget. Impressively, this small model surpassed OpenAI’s o1-preview on the AIME24 math benchmark. [15] However, another team cautioned that these gains might not always be statistically significant. This suggests that, although RL can improve smaller distilled models, the benchmark results might sometimes be overstating the improvements. Annotated figure from A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086 2. The problem of long incorrect answers I previously mentioned that RL with verifiable rewards (RLVR) does not strictly require the GRPO algorithm; DeepSeek's GRPO simply happens to be efficient and to perform well. However, [12] showed that vanilla PPO paired with a basic binary correctness reward was sufficient to scale models in reasoning capability and response length. More interestingly, both PPO and GRPO have a length bias. And several papers explored methods to tackle excessively long incorrect answers: [14] Provided an analysis illustrating how PPO inadvertently favors longer responses due to mathematical biases in loss calculations; GRPO may suffer from the same issue. Annotated figure from Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 As a follow-up to the statement above, [7] [10] specifically identified length and difficulty-level biases in GRPO. The modified variant "Dr. GRPO" simplifies advantage calculations by removing length and standard deviation normalization, providing clearer training signals. [1] Explicitly penalized lengthy incorrect answers in GRPO while rewarding concise, correct ones. [3] [6] Didn’t directly control response length in GRPO but found token-level rewards beneficial, allowing models to better focus on critical reasoning steps. [5] Introduced explicit penalties in GRPO for responses exceeding specific lengths, enabling precise length control during inference. Integrating existing reasoning models (e.g., o1, DeepSeek-R1) with capabilities such as external tool use and retrieval-augmented generation (RAG); the just-released o3 model from Open AI paves the way here Speaking of tool-use and search, [9] showed that giving reasoning models the ability to search induces behaviors such as self-correction and robust generalization across benchmarks, despite minimal training datasets.

Machine Learning

0 views

Ahead of AI 8 months ago

First Look at Reasoning From Scratch: Chapter 1

Hi everyone, As you know, I've been writing a lot lately about the latest research on reasoning in LLMs. Before my next research-focused blog post, I wanted to offer something special to my paid subscribers as a thank-you for your ongoing support. So, I've started writing a new book on how reasoning works in LLMs, and here I'm sharing the first Chapter 1 with you. This ~15-page chapter is an introduction reasoning in the context of LLMs and provides an overview of methods like inference-time scaling and reinforcement learning. Thanks for your support! I hope you enjoy the chapter, and stay tuned for my next blog post on reasoning research! Happy reading, Sebastian Welcome to the next stage of large language models (LLMs): reasoning . LLMs have transformed how we process and generate text, but their success has been largely driven by statistical pattern recognition. However, new advances in reasoning methodologies now enable LLMs to tackle more complex tasks, such as solving logical puzzles or multi-step arithmetic. Understanding these methodologies is the central focus of this book. In this introductory chapter, you will learn: What "reasoning" means specifically in the context of LLMs. How reasoning differs fundamentally from pattern matching. The conventional pre-training and post-training stages of LLMs. Key approaches to improving reasoning abilities in LLMs. Why building reasoning models from scratch can improve our understanding of their strengths, limitations, and practical trade-offs. After building foundational concepts in this chapter, the following chapters shift toward practical, hands-on coding examples to directly implement reasoning techniques for LLMs. What "reasoning" means specifically in the context of LLMs. How reasoning differs fundamentally from pattern matching. The conventional pre-training and post-training stages of LLMs. Key approaches to improving reasoning abilities in LLMs. Why building reasoning models from scratch can improve our understanding of their strengths, limitations, and practical trade-offs.

Tutorial

0 views

Ahead of AI 8 months ago

The State of LLM Reasoning Model Inference

Improving the reasoning abilities of large language models (LLMs) has become one of the hottest topics in 2025, and for good reason. Stronger reasoning skills allow LLMs to tackle more complex problems, making them more capable across a wide range of tasks users care about. In the last few weeks, researchers have shared a large number of new strategies to improve reasoning, including scaling inference-time compute, reinforcement learning, supervised fine-tuning, and distillation. And many approaches combine these techniques for greater effect. This article explores recent research advancements in reasoning-optimized LLMs, with a particular focus on inference-time compute scaling that have emerged since the release of DeepSeek R1. The four main categories of implementing reasoning models I explained in Understanding Reasoning LLMs . This article focuses on inference-time-scaling methods. Implementing and improving reasoning in LLMs: The four main categories Since most readers are likely already familiar with LLM reasoning models, I will keep the definition short: An LLM-based reasoning model is an LLM designed to solve multi-step problems by generating intermediate steps or structured "thought" processes. Unlike simple question-answering LLMs that just share the final answer, reasoning models either explicitly display their thought process or handle it internally, which helps them to perform better at complex tasks such as puzzles, coding challenges, and mathematical problems. Side-by-side comparison of a basic LLM’s one-line answer and a reasoning LLM’s explanatory response. In general, there are two main strategies to improve reasoning: (1) increasing training compute or (2) increasing inference compute, also known as inference-time scaling or test-time scalin g. (Inference compute refers to the processing power required to generate model outputs in response to a user query after training.) Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/ Note that the plots shown above make it look like we improve reasoning either via train-time compute OR test-time compute. However, LLMs are usually designed to improve reasoning by combining heavy train-time compute (extensive training or fine-tuning, often with reinforcement learning or specialized data) and increased test-time compute (allowing the model to "think longer" or perform extra computation during inference). The many terms that are used synonymously with inference-time scaling. To understand how reasoning models are being developed and improved, I think it remains useful to look at the different techniques separately. In my previous article, Understanding Reasoning LLMs , I discussed a finer categorization into four categories, as summarized in the figure below. Methods 2-4 in the figure above typically produce models that generate longer responses because they include intermediate steps and explanations in their outputs. Since inference costs scale with response length (e.g., a response twice as long requires twice the compute), these training approaches are inherently linked to inference scaling. However, in this section on inference-time compute scaling, I focus specifically on techniques that explicitly regulate the number of generated tokens, whether through additional sampling strategies, self-correction mechanisms, or other methods. In this article, I focus on the interesting new research papers and model releases focused on scaling inference-time compute scaling that followed after the DeepSeek R1 release on January 22nd, 2025. (Originally, I wanted to cover methods from all categories in this article, but due to the excessive length, I decided to release a separate article focused on train-time compute methods in the future.) Development process of DeepSeek's reasoning models that I discussed in my previous article, Understanding Reasoning LLMs (https://magazine.sebastianraschka.com/p/understanding-reasoning-llms). Before we look into Inference-time compute scaling methods and the different areas of progress on the reasoning model with a focus on the inference-time compute scaling category, let me at least provide a brief overview of all the different categories. 1. Inference-time compute scaling This category includes methods that improve model reasoning capabilities at inference time without training or modifying the underlying model weights. The core idea is to trade off increased computational resources for improved performance, which helps with making even fixed models more capable through techniques such as chain-of-thought reasoning, and various sampling procedures. While I categorize inference-time compute scaling separately to focus on methods in this context, it is important to note that this technique can be applied to any LLM. For example, OpenAI developed its o1 model using reinforcement learning and then additionally leveraged inference-time compute scaling. Interestingly, as I discussed in my previous article on reasoning models ( Understanding Reasoning LLMs ), the DeepSeek R1 paper explicitly categorized common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model’s natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling over the V3 base model. However, since explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, DeepSeek acknowledged that they could easily incorporate it into the R1 deployment or application. 2. Pure reinforcement learning This approach focuses solely on reinforcement learning (RL) to develop or improve reasoning capabilities. It typically involves training models with verifiable reward signals from math or coding domains. While RL allows models to develop more strategic thinking and self-improvement capabilities, it comes with challenges such as reward hacking, instability, and high computational costs. 3. Reinforcement learning and supervised fine-tuning This hybrid approach combines RL with supervised fine-tuning (SFT) to achieve more stable and generalizable improvements than pure RL. Typically, a model is first trained with SFT on high-quality instruction data and then further refined using RL to optimize specific behaviors . 4. Supervised fine-tuning and model distillation This method improves the reasoning capabilities of a model by instruction fine-tuning it on high-quality labeled datasets (SFT). If this high-quality dataset is generated by a larger LLM, then this methodology is also referred to as "knowledge distillation" or just "distillation" in LLM contexts. However, note that this differs slightly from traditional knowledge distillation in deep learning, which typically involves training a smaller model using not only the outputs (labels) but also the logits of a larger teacher model. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. The previous section already briefly summarized inference-time compute scaling. Before discussing the recent research in this category, let me describe the inference-time scaling in a bit more detail. Inference-time scaling improves an LLM's reasoning by increasing computational resources ("compute") during inference. The idea why this can improve reasoning can be given with a simple analogy: humans give better responses when given more time to think, and similarly, LLMs can improve with techniques that encourage more "thought" during generation. One approach here is prompt engineering, such as chain-of-thought (CoT) prompting, where phrases like "think step by step" guide the model to generate intermediate reasoning steps. This improves accuracy on complex problems but is unnecessary for simple factual queries. Since CoT prompts generate more tokens, they effectively make inference more expensive. An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916). Another method involves voting and search strategies, such as majority voting or beam search, which refine responses by selecting the best output. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 The remainder of this article will be focused on the recent research advances in the inference-time scaling category for improving reasoning capabilities of LLMs. Let me start with a more detailed discussion of a paper that serves as an example of inference-time scaling. So, one of the interesting recent research papers in this category is s1: Simple Test-Time Scaling (31 Jan, 2025), which introduces so-called "wait" tokens, which can be considered as a more modern version of the aforementioned "think step by step" prompt modification. Note that this involves supervised finetuning (SFT) to generate the initial model, so it's not a pure inference-time scaling approach. However, the end goal is actively controlling the reasoning behavior through inference-time scaling; hence, I considered this paper for the "1. Inference-time compute scaling" category. In short, their approach is twofold: Create a curated SFT dataset with 1k training examples that include reasoning traces. Control the length of responses by: a) Appending "Wait" tokens to get the LLM to generate longer responses, self-verify, and correct itself, or b) Stopping generation by adding an end-of-thinking token delimiter ("Final Answer:"). They call this length control "budget forcing." Illustration of "wait" token insertion to control the length of the output. Annotated figure from https://arxiv.org/abs/2501.19393. Budget forcing can be seen as a sequential inference scaling technique because it still generates one token at a time (but just more of it). In contrast, we have parallel techniques like majority voting, which aggregate multiple independent completions. Correlation between response accuracy and length. Annotated figure from https://arxiv.org/abs/2501.19393. They found their budget-forcing method more effective than other inference-scaling techniques I've discussed, like majority voting. If there's something to criticize or improve, I would've liked to see results for more sophisticated parallel inference-scaling methods, like beam search, lookahead search, or the best compute-optimal search described in Google's Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters paper last year. Or even a simple comparison with a classic sequential method like chain-of-thought prompting ("Think step by step"). Anyway, it's a really interesting paper and approach! PS: Why "Wait" tokens? My guess is the researchers were inspired by the "Aha moment" figure in the DeepSeek-R1 paper, where researchers saw LLMs coming up with something like " Wait, wait. Wait. That's an aha moment I can flag here. " which showed that pure reinforcement learning can induce reasoning behavior in LLMs. Interestingly, they also tried other tokens like " Hmm " but found that " Wait " performed slightly better. " Wait" vs " Hmm" tokens. Annotated figure from https://arxiv.org/abs/2501.19393. Since it's been a very active month on the reasoning model research front, I need to keep the summaries of other papers relatively brief to manage a reasonable length for this article. Hence, below are brief summaries of other interesting research articles related to inference-time compute scaling, sorted in ascending order by publication date. As mentioned earlier, not all of these articles fall neatly into the inference-time compute scaling category, as some of them also involve specific training. However, these papers have in common that controlling inference-time compute is a specific mechanism of action. (Many distilled or SFT methods that I will cover in upcoming articles will lead to longer responses, which can be seen as a form of inference-time compute scaling. However, they do not actively control the length during inference, which makes these methods different from those covered here.) 📄 22 Jan, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback , https://arxiv.org/abs/2501.12895 Test-time Preference Optimization (TPO) is an iterative process that aligns LLM outputs with human preferences during inference (this is without altering its underlying model weights). In each iteration, the model: Generates multiple responses for a given prompt. Score the responses with a reward model to select the highest- and lowest-scoring ones as "chosen" and "rejected" responses Prompt the model to compare and critique the "chosen" and "rejected" responses Refine the output by converting the critiques into textual suggestions to update the original model responses By doing steps 1-4 iteratively, the model refines its original responses. Annotated figure from "Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback", https://arxiv.org/abs/2501.12895 📄 30 Jan, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs , https://arxiv.org/abs/2501.18585 The researchers explore a phenomenon called "underthinking", where reasoning models frequently switch between reasoning paths instead of fully focusing on exploring promising ones, which lowers the problem solving accuracy. To address this "underthinking" issue, they introduce a method called the Thought Switching Penalty (TIP), which modifies the logits of thought-switching tokens to discourage premature reasoning path transitions. Their approach does not require model fine-tuning and empirically improves accuracy across multiple challenging test sets. Annotated figure from "Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs", https://arxiv.org/abs/2501.18585 📄 31 Jan, Trading Inference-Time Compute for Adversarial Robustness , https://arxiv.org/abs/2501.18841 Increasing inference-time compute improves the adversarial robustness of reasoning LLMs in many cases in terms of reducing the rate of successful attacks. Unlike adversarial training, this method does not need any special training or require prior knowledge of specific attack types. However, there are some important exceptions. For example, the improvements in settings involving policy ambiguities or loophole exploitation are limited. Additionally, the reasoning-improved robustness increases can be reduced by new attack strategies such as "Think Less" and "Nerd Sniping". So, while these findings suggest that scaling inference-time compute can improve LLM safety, this alone is not a complete solution to adversarial robustness. Annotated figure from "Trading Inference-Time Compute for Adversarial Robustness", https://arxiv.org/abs/2501.18841 📄 4 Feb, CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, https://arxiv.org/abs/2502.02390 The researchers combine classic Monte Carlo Tree Search inference-time scaling with an "associative memory" that serves as the LLM's knowledge base during the exploration of reasoning pathways. Using this so-called associative memory, it's easier for the LLM to consider earlier reasoning pathways and use dynamically involving information during the response generation. Annotated figure from "CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning", https://arxiv.org/abs/2502.02390 📄 6 Feb, Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models, https://arxiv.org/abs/2502.0440 This paper proposes a self-backtracking mechanism that allows LLMs to improve their reasoning by learning when and where to backtrack during training and inference. While training involves teaching the model to recognize and correct suboptimal reasoning paths using a <backtrack> token, the key contribution is an inference-time tree-based search that uses this learned backtracking ability to explore alternative solutions. What's unique is that this exploration does not require without relying on external reward models (unlike the search-based methods that use a process-reward-based model that I mentioned at the beginning of the "1. Inference-time compute scaling methods" section in this article). Annotated figure from "Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models", https://arxiv.org/abs/2502.04404 I added this paper here as it's heavily focused on the proposed backtracking inference-time scaling method, which improves reasoning by dynamically adjusting search depth and breadth rather than fundamentally altering the training paradigm (although, the training with <backtrack> tokens is required). 📄 7 Feb, Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, https://arxiv.org/abs/2502.05171 Instead of improving reasoning by generating more tokens, the researchers propose a model that scales inference-time compute by iterating over a recurrent depth block in latent space. This block functions like a hidden state in RNNs, which allows the model to refine its reasoning without requiring longer token outputs. However, a key drawback is the lack of explicit reasoning steps, which are (in my opinion) useful for human interpretability and a major advantage of chain-of-thought methods. Annotated figure from "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach", https://arxiv.org/abs/2502.05171 📄 10 Feb, Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, https://arxiv.org/abs/2502.06703 Many inference-time scaling techniques depend on sampling, which requires a Process Reward Model (PRM) to select the best solution. This paper systematically analyzes how inference-time compute scaling interacts with PRMs and problem difficulty. The researchers develop a compute-optimal scaling strategy that adapts to the choice of PRM, policy model, and task complexity. Their results show that with the right inference-time scaling approach, a 1B parameter model can outperform a 405B Llama 3 model that lacks inference-time scaling. Similarly, they show how a 7B model with inference-time scaling surpasses DeepSeek-R1 while maintaining higher inference efficiency. These findings highlight how inference-time scaling can significantly improve LLMs, where small LLMs, with the right inference compute budget, can outperform much larger models. Annotated figure from "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling", https://arxiv.org/abs/2502.06703 9. Learning to Reason from Feedback at Test-Time 📄 16 Feb, Learning to Reason from Feedback at Test-Time, https://www.arxiv.org/abs/2502.12521 It's a bit hard to classify this as either an inference-time or training-time method, because it optimizes the LLM, changing its weight parameters, at inference-time. So, this paper explores a way to make LLMs learn from their mistakes during inference time without having to store failed attempts in the prompt (which gets expensive). Instead of the usual method of refining answers by adding previous attempts to the context (sequential revision) or blindly generating new answers (parallel sampling), this approach updates the model's weights at inference time. To do this, the authors introduce OpTune, a small, trainable optimizer that updates the model's weights based on the mistakes it made in a previous attempt. This means the model remembers what it did wrong without needing to keep the incorrect answer in the prompt/context. Annotated figure from "Learning to Reason from Feedback at Test-Time” , https://www.arxiv.org/abs/2502.12521 📄 18 Feb, Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, https://www.arxiv.org/abs/2502.12521 This paper benchmarks various inference-time compute scaling techniques for reasoning and planning tasks with a focus on analyzing their trade-offs between computational cost and performance. The authors evaluate multiple techniques—such as Chain-of-Thought, Tree-of-Thought, and Reasoning as Planning across eleven tasks spanning arithmetic, logical, commonsense, algorithmic reasoning, and planning. The main finding is that while scaling inference-time computation can improve reasoning, no single technique consistently outperforms others across all tasks. Annotated figure from Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights , https://www.arxiv.org/abs/2502.12521 📄 19 Feb, Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, https://arxiv.org/abs/2502.13842 The Inner Thinking Transformer (ITT) dynamically allocates more compute during inference. Instead of using a fixed depth (= using same number of layers) for all tokens as in standard transformer-based LLMs, ITT employs Adaptive Token Routing to allocate more compute to difficult tokens. These difficult tokens pass through the same layer multiple times to undergo additional processing, which increases the inference-compute budget for these difficult tokens. Annotated figure from "Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking", https://arxiv.org/abs/2502.13842 📄 20 Feb, S*: Test Time Scaling for Code Generation, https://arxiv.org/abs/2502.14382 Inference-time scaling can be achieved by parallel scaling (generating multiple answers), sequential scaling (iteratively refining answers), or both as described in the Google paper from Summer 2024 ( Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters ). S* is a test-time compute scaling method designed specifically for code generation that improves both parallel scaling (generating multiple solutions) and sequential scaling (iterative debugging). Annotated figure from "S*: Test Time Scaling for Code Generation", https://arxiv.org/abs/2502.14382 The approach operates in two stages: Stage 1: Generation The model generates multiple code solutions and iteratively refines them using execution results and test cases provided in the problem prompt. Think of this like a coding competition where a model submits solutions, runs tests, and fixes mistakes: 1. The model generates multiple candidate solutions. 2. Each solution is executed on public test cases (predefined input-output pairs). 3. If a solution fails (incorrect output or crashes), the model analyzes the execution results (errors, outputs) and modifies the code to improve it. 4. This refinement process continues iteratively until the model finds solutions that pass the test cases. For example, suppose the model is asked to implement a function is_even(n) that returns True for even numbers and False otherwise. The model’s first attempt might be: The model tests this implementation with public test cases: After reviewing the results, the model realizes that 4 % 2 returns 0, not True, so it modifies the function: Now the function passes all public tests , completing the debugging phase. Stage 2: Selection Once multiple solutions have passed public tests, the model must choose the best one (if possible). Here, S* introduces adaptive input synthesis to avoid random picking: 1. The model compares two solutions that both pass public tests. 2. It asks itself: "Can I generate an input that will reveal a difference between these solutions?" 3. It creates a new test input and runs both solutions on it. 4. If one solution produces the correct output while the other fails, the model selects the better one. 5. If both solutions behave identically, the model randomly picks one. For example, consider two different implementations of : Both pass the provided test cases for simple examples: But when the LLM generates edge cases we can see one of them fail, so the model would select the solution A in this case: 📄 25 Feb, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 The researchers observe that while reasoning LLMs often generate verbose step-by-step explanations, humans typically rely on concise drafts that capture only essential information. Inspired by this, they propose Chain of Draft (CoD), a prompting strategy that reduces verbosity by generating minimal but informative intermediate steps. So, in a sense it's a method for inference-time scaling that improves the efficiency of inference-time scaling through generating fewer tokens. Annotated figures from "Chain of Draft: Thinking Faster by Writing Less", https://arxiv.org/abs/2502.18600 Looking at the results, it seems that CoD is almost as brief as standard prompting, but as accurate as Chain of Thought (CoT) prompting. As I mentioned earlier, in my opinion, one of the advantages of reasoning models is that users can read the reasoning traces to learn and to better evaluate / trust the response. CoD somewhat diminishes the advantage of CoD. However, it might come in very handy where verbose intermediate steps are not needed as it speeds up the generation while maintaining the accuracy of CoT. 📄 6 Mar, Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks, https://arxiv.org/abs/2503.04378 Many techniques for scaling inference-time reasoning rely on tasks with verifiable answers (like math and code that can be checked), which makes them difficult to apply to open-ended tasks like writing and general problem-solving. To address this limitation regarding verifiable answers, the researchers develop a system where one model generates an initial response, another provides feedback ("feedback model"), and a third refines the response based on that feedback ("edit model"). They train these specialized "feedback" and "edit" models using a large dataset of human-annotated responses and feedback. These models then help improve responses by generating better feedback and making more effective edits during inference time. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Inference-time compute scaling has become one of the hottest research topics this year to improve the reasoning abilities of large language models without requiring modification to model weights. The many techniques I summarized above range from simple token-based interventions like “Wait” tokens to sophisticated search and optimization-based strategies such as Test-Time Preference Optimization and Chain-of-Associated-Thoughts. On the big-picture level, one recurring theme is that increasing compute at inference allows even relatively small models to achieve substantial improvements (on reasoning benchmarks) compared to standard approaches. This suggests that inference strategies can help narrow the performance gap between smaller, more cost-effective models and their larger counterparts. The cost caveat The caveat is that inference-time scaling increases the inference costs, so whether to use a small model with substantial inference scaling or training a larger model and using it with less or no inference scaling is a math that has to be worked out based on how much use the model gets. As an example, an o1 model, which uses heavy inference time scaling, is actually still slightly cheaper than a likely larger GPT-4.5 model that likely doesn't use inference time scaling. (It will be interesting to see how well GPT-4.5 will perform with o1- or o3-style inference-time scaling.) Which technique? However, inference-time compute scaling is not a silver bullet. While methods like Monte Carlo Tree Search, self-backtracking, and dynamic-depth scaling can substantially improve reasoning performance, the effectiveness also still depends on the task and difficulty. As one of the earlier papers showed, there's no inference-time compute scaling technique that performs best across all tasks. Additionally, many of these approaches trade off response latency for improved reasoning, and slow responses can be annoying to some users. For instance, I usually switch from o1 to GPT4o if I have simple tasks due to the faster response time. What's next Looking ahead, I think we will see many more papers this year centered around the two main branches of "reasoning via inference-time compute scaling" research: 1. Research that is purely centered around developing the best possible model topping the benchmarks. 2. Research that is concerned with balancing cost and performance trade-offs across different reasoning tasks. Either way, what's nice about inference-time compute scaling is that it can be applied to any type of existing LLM to make it better for specific tasks. Thinking on Demand An interesting trend on the industry side is what I refer to as "thinking on demand". Following the release of DeepSeek R1, it feels like companies have been rushing to add reasoning capabilities to their offerings. An interesting development here is that most LLM providers started to add the option for users to enable or disable thinking. An interesting development is that most LLM providers now allow users to enable or disable these "thinking" features. The mechanism is not publicly shared, but it's likely the same model with dialed-back inference-time compute scaling. For instance, Claude 3.7 Sonnet and Grok 3 now have a "thinking" that users can enable for their model, whereas OpenAI requires users to switch between models. For example, GPT4o/4.5 and o1/o3-mini if they want to use explicit reasoning models. However, the OpenAI CEO mentioned that GPT4.5 will likely be their last model, which doesn't explicitly have a reasoning or "thinking" mode. On the open-source side, even IBM added an explicit "thinking" toggle to their Granite models . Overall, the trend of adding reasoning capabilities whether via inference-time or train-time compute scaling is a major step forward for LLMs in 2025. In time, I expect that reasoning will no longer be treated as an optional or special feature but will instead become the standard, much as instruction-finetuned or RLHF-tuned models are now the norm over raw pretrained models. As mentioned earlier, this article solely focused on inference-time compute length due to its already long lengths, thanks to the very active reasoning research activity. In a future article, I plan to cover all the interesting train-time compute scaling methods for reasoning. This magazine is a personal passion project. To support me as an independent researcher, please consider purchasing a copy of my book, Build a Large Language Model (From Scratch) book , or signing up for a paid subscription . Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! The four main categories of implementing reasoning models I explained in Understanding Reasoning LLMs . This article focuses on inference-time-scaling methods. Implementing and improving reasoning in LLMs: The four main categories Since most readers are likely already familiar with LLM reasoning models, I will keep the definition short: An LLM-based reasoning model is an LLM designed to solve multi-step problems by generating intermediate steps or structured "thought" processes. Unlike simple question-answering LLMs that just share the final answer, reasoning models either explicitly display their thought process or handle it internally, which helps them to perform better at complex tasks such as puzzles, coding challenges, and mathematical problems. Side-by-side comparison of a basic LLM’s one-line answer and a reasoning LLM’s explanatory response. In general, there are two main strategies to improve reasoning: (1) increasing training compute or (2) increasing inference compute, also known as inference-time scaling or test-time scalin g. (Inference compute refers to the processing power required to generate model outputs in response to a user query after training.) Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/ Note that the plots shown above make it look like we improve reasoning either via train-time compute OR test-time compute. However, LLMs are usually designed to improve reasoning by combining heavy train-time compute (extensive training or fine-tuning, often with reinforcement learning or specialized data) and increased test-time compute (allowing the model to "think longer" or perform extra computation during inference). The many terms that are used synonymously with inference-time scaling. To understand how reasoning models are being developed and improved, I think it remains useful to look at the different techniques separately. In my previous article, Understanding Reasoning LLMs , I discussed a finer categorization into four categories, as summarized in the figure below. Methods 2-4 in the figure above typically produce models that generate longer responses because they include intermediate steps and explanations in their outputs. Since inference costs scale with response length (e.g., a response twice as long requires twice the compute), these training approaches are inherently linked to inference scaling. However, in this section on inference-time compute scaling, I focus specifically on techniques that explicitly regulate the number of generated tokens, whether through additional sampling strategies, self-correction mechanisms, or other methods. In this article, I focus on the interesting new research papers and model releases focused on scaling inference-time compute scaling that followed after the DeepSeek R1 release on January 22nd, 2025. (Originally, I wanted to cover methods from all categories in this article, but due to the excessive length, I decided to release a separate article focused on train-time compute methods in the future.) Development process of DeepSeek's reasoning models that I discussed in my previous article, Understanding Reasoning LLMs (https://magazine.sebastianraschka.com/p/understanding-reasoning-llms). Before we look into Inference-time compute scaling methods and the different areas of progress on the reasoning model with a focus on the inference-time compute scaling category, let me at least provide a brief overview of all the different categories. 1. Inference-time compute scaling This category includes methods that improve model reasoning capabilities at inference time without training or modifying the underlying model weights. The core idea is to trade off increased computational resources for improved performance, which helps with making even fixed models more capable through techniques such as chain-of-thought reasoning, and various sampling procedures. While I categorize inference-time compute scaling separately to focus on methods in this context, it is important to note that this technique can be applied to any LLM. For example, OpenAI developed its o1 model using reinforcement learning and then additionally leveraged inference-time compute scaling. Interestingly, as I discussed in my previous article on reasoning models ( Understanding Reasoning LLMs ), the DeepSeek R1 paper explicitly categorized common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model’s natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling over the V3 base model. However, since explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, DeepSeek acknowledged that they could easily incorporate it into the R1 deployment or application. 2. Pure reinforcement learning This approach focuses solely on reinforcement learning (RL) to develop or improve reasoning capabilities. It typically involves training models with verifiable reward signals from math or coding domains. While RL allows models to develop more strategic thinking and self-improvement capabilities, it comes with challenges such as reward hacking, instability, and high computational costs. 3. Reinforcement learning and supervised fine-tuning This hybrid approach combines RL with supervised fine-tuning (SFT) to achieve more stable and generalizable improvements than pure RL. Typically, a model is first trained with SFT on high-quality instruction data and then further refined using RL to optimize specific behaviors . 4. Supervised fine-tuning and model distillation This method improves the reasoning capabilities of a model by instruction fine-tuning it on high-quality labeled datasets (SFT). If this high-quality dataset is generated by a larger LLM, then this methodology is also referred to as "knowledge distillation" or just "distillation" in LLM contexts. However, note that this differs slightly from traditional knowledge distillation in deep learning, which typically involves training a smaller model using not only the outputs (labels) but also the logits of a larger teacher model. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Inference-time compute scaling methods The previous section already briefly summarized inference-time compute scaling. Before discussing the recent research in this category, let me describe the inference-time scaling in a bit more detail. Inference-time scaling improves an LLM's reasoning by increasing computational resources ("compute") during inference. The idea why this can improve reasoning can be given with a simple analogy: humans give better responses when given more time to think, and similarly, LLMs can improve with techniques that encourage more "thought" during generation. One approach here is prompt engineering, such as chain-of-thought (CoT) prompting, where phrases like "think step by step" guide the model to generate intermediate reasoning steps. This improves accuracy on complex problems but is unnecessary for simple factual queries. Since CoT prompts generate more tokens, they effectively make inference more expensive. An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916). Another method involves voting and search strategies, such as majority voting or beam search, which refine responses by selecting the best output. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 1. "s1: Simple test-time scaling" The remainder of this article will be focused on the recent research advances in the inference-time scaling category for improving reasoning capabilities of LLMs. Let me start with a more detailed discussion of a paper that serves as an example of inference-time scaling. So, one of the interesting recent research papers in this category is s1: Simple Test-Time Scaling (31 Jan, 2025), which introduces so-called "wait" tokens, which can be considered as a more modern version of the aforementioned "think step by step" prompt modification. Note that this involves supervised finetuning (SFT) to generate the initial model, so it's not a pure inference-time scaling approach. However, the end goal is actively controlling the reasoning behavior through inference-time scaling; hence, I considered this paper for the "1. Inference-time compute scaling" category. In short, their approach is twofold: Create a curated SFT dataset with 1k training examples that include reasoning traces. Control the length of responses by: a) Appending "Wait" tokens to get the LLM to generate longer responses, self-verify, and correct itself, or b) Stopping generation by adding an end-of-thinking token delimiter ("Final Answer:"). They call this length control "budget forcing." Illustration of "wait" token insertion to control the length of the output. Annotated figure from https://arxiv.org/abs/2501.19393. Budget forcing can be seen as a sequential inference scaling technique because it still generates one token at a time (but just more of it). In contrast, we have parallel techniques like majority voting, which aggregate multiple independent completions. Correlation between response accuracy and length. Annotated figure from https://arxiv.org/abs/2501.19393. They found their budget-forcing method more effective than other inference-scaling techniques I've discussed, like majority voting. If there's something to criticize or improve, I would've liked to see results for more sophisticated parallel inference-scaling methods, like beam search, lookahead search, or the best compute-optimal search described in Google's Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters paper last year. Or even a simple comparison with a classic sequential method like chain-of-thought prompting ("Think step by step"). Anyway, it's a really interesting paper and approach! PS: Why "Wait" tokens? My guess is the researchers were inspired by the "Aha moment" figure in the DeepSeek-R1 paper, where researchers saw LLMs coming up with something like " Wait, wait. Wait. That's an aha moment I can flag here. " which showed that pure reinforcement learning can induce reasoning behavior in LLMs. Interestingly, they also tried other tokens like " Hmm " but found that " Wait " performed slightly better. " Wait" vs " Hmm" tokens. Annotated figure from https://arxiv.org/abs/2501.19393. Other noteworthy research papers on inference-time compute scaling Since it's been a very active month on the reasoning model research front, I need to keep the summaries of other papers relatively brief to manage a reasonable length for this article. Hence, below are brief summaries of other interesting research articles related to inference-time compute scaling, sorted in ascending order by publication date. As mentioned earlier, not all of these articles fall neatly into the inference-time compute scaling category, as some of them also involve specific training. However, these papers have in common that controlling inference-time compute is a specific mechanism of action. (Many distilled or SFT methods that I will cover in upcoming articles will lead to longer responses, which can be seen as a form of inference-time compute scaling. However, they do not actively control the length during inference, which makes these methods different from those covered here.) 2. Test-Time Preference Optimization 📄 22 Jan, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback , https://arxiv.org/abs/2501.12895 Test-time Preference Optimization (TPO) is an iterative process that aligns LLM outputs with human preferences during inference (this is without altering its underlying model weights). In each iteration, the model: Generates multiple responses for a given prompt. Score the responses with a reward model to select the highest- and lowest-scoring ones as "chosen" and "rejected" responses Prompt the model to compare and critique the "chosen" and "rejected" responses Refine the output by converting the critiques into textual suggestions to update the original model responses

Machine Learning

0 views

Ahead of AI 9 months ago

Understanding Reasoning LLMs

This article describes the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic. In 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., "specializations"). Stages 1-3 are the common steps to developing LLMs. Stage 4 specializes LLMs for specific use cases. The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications. Because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later. To give you a brief glimpse of what's covered below, in this article, I will: Explain the meaning of "reasoning model" Discuss the advantages and disadvantages of reasoning models Outline the methodology behind DeepSeek R1 Describe the four main approaches to building and improving reasoning models Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases Provide tips for developing reasoning models on a tight budget I hope you find this article useful as AI continues its rapid development this year! If you work in AI (or machine learning in general), you are probably familiar with vague and hotly debated definitions. The term "reasoning models" is no exception. Eventually, someone will define it formally in a paper, only for it to be redefined in the next, and so on. In this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" requires some simple reasoning. For instance, it requires recognizing the relationship between distance, speed, and time before arriving at the answer. A regular LLM may only provide a short answer (as shown on the left), whereas reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs who have not been specifically developed for reasoning tasks can also provide intermediate reasoning steps in their answers. Most modern LLMs are capable of basic reasoning and can answer questions like, "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" So, today, when we refer to reasoning models, we typically mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs. Additionally, most LLMs branded as reasoning models today include a "thought" or "thinking" process as part of their response. Whether and how an LLM actually "thinks" is a separate discussion. Intermediate steps in reasoning models can appear in two ways. First, they may be explicitly included in the response, as shown in the previous figure. Second, some reasoning LLMs, such as OpenAI's o1, run multiple iterations with intermediate steps that are not shown to the user. "Reasoning" is used at two different levels: 1) processing the input and generating via multiple intermediate steps and 2) providing some sort of reasoning as part of the response to the user. Now that we have defined reasoning models, we can move on to the more interesting part: how to build and improve LLMs for reasoning tasks. However, before diving into the technical details, it is important to consider when reasoning models are actually needed. When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to "overthinking." Also here the simple rule applies: Use the right tool (or type of LLM) for the task. The key strengths and limitations of reasoning models are summarized in the figure below. The key strengths and weaknesses of reasoning models. Before discussing four main approaches to building and improving reasoning models in the next section, I want to briefly outline the DeepSeek R1 pipeline, as described in the DeepSeek R1 technical report . This report serves as both an interesting case study and a blueprint for developing reasoning LLMs. Note that DeepSeek did not release a single R1 reasoning model but instead introduced three distinct variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill. Based on the descriptions in the technical report, I have summarized the development process of these models in the diagram below. Development process of DeepSeeks three different reasoning models that are discussed in the DeepSeek R1 technical report. Next, let's briefly go over the process shown in the diagram above. More details will be covered in the next section, where we discuss the four main approaches to building and improving reasoning models. (1) DeepSeek-R1-Zero: This model is based on the 671B pre-trained DeepSeek-V3 base model released in December 2024. The research team trained it using reinforcement learning (RL) with two types of rewards. This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step, which is typically part of reinforcement learning with human feedback (RLHF). (2) DeepSeek-R1: This is DeepSeek's flagship reasoning model, built upon DeepSeek-R1-Zero. The team further refined it with additional SFT stages and further RL training, improving upon the "cold-started" R1-Zero model. (3) DeepSeek-R1-Distill*: Using the SFT data generated in the previous steps, the DeepSeek team fine-tuned Qwen and Llama models to enhance their reasoning abilities. While not distillation in the traditional sense, this process involved training smaller models (Llama 8B and 70B, and Qwen 1.5B–30B) on outputs from the larger DeepSeek-R1 671B model. In this section, I will outline the key techniques currently used to enhance the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1, OpenAI's o1 & o3, and others. Note: The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques. One way to improve an LLM's reasoning capabilities (or any capability in general) is inference-time scaling. This term can have multiple meanings, but in this context, it refers to increasing computational resources during inference to improve output quality. A rough analogy is how humans tend to generate better responses when given more time to think through complex problems. Similarly, we can apply techniques that encourage the LLM to "think" more while generating an answer. (Although, whether LLMs actually "think" is a different discussion.) One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting , where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems. (Note that it doesn't make sense to employ this strategy for simpler knowledge-based questions, like "What is the capital of France", which is again a good rule of thumb to find out whether a reasoning model makes sense on your given input query.) An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916). The aforementioned CoT approach can be seen as inference-time scaling because it makes inference more expensive through generating more output tokens. Another approach to inference-time scaling is the use of voting and search strategies. One simple example is majority voting where we have the LLM generate multiple answers, and we select the correct answer by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses. I highly recommend the Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters paper that I described in my previous Noteworthy AI Research Papers of 2024 (Part Two) article (https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2) for more details on these different strategies. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 The DeepSeek R1 technical report categorizes common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model's natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling compared to the V3 base model. However, explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, so DeepSeek may still apply such techniques within their app. I suspect that OpenAI's o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive compared to models like GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained using RL pipelines similar to those used for DeepSeek R1. More on reinforcement learning in the next two sections below. One of my personal highlights from the DeepSeek R1 paper is their discovery that reasoning emerges as a behavior from pure reinforcement learning (RL). Let's explore what this means in more detail. As outlined earlier, DeepSeek developed three types of R1 models. The first, DeepSeek-R1-Zero , was built on top of the DeepSeek-V3 base model, a standard pre-trained LLM they released in December 2024. Unlike typical RL pipelines, where supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained exclusively with reinforcement learning without an initial SFT stage as highlighted in the diagram below. The development process of DeepSeek-R1-Zero model. Still, this RL process is similar to the commonly used RLHF approach, which is typically applied to preference-tune LLMs. (I covered RLHF in more detail in my article, LLM Training: RLHF and Its Alternatives .) However, as mentioned above, the key difference in DeepSeek-R1-Zero is that they skipped the supervised fine-tuning (SFT) stage for instruction tuning. This is why they refer to it as "pure" RL. (Although, RL in the context of LLMs differs significantly from traditional RL, which is a topic for another time.) For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward. The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses. The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside <think> tags. Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an "Aha!" moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so, as shown in the figure below. A figure from the DeepSeek R1 technical report (https://arxiv.org/abs/2501.12948) showing the emergence of the "Aha" moment. While R1-Zero is not a top-performing reasoning model, it does demonstrate reasoning capabilities by generating intermediate "thinking" steps, as shown in the figure above. This confirms that it is possible to develop a reasoning model using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Next, let's look at the development of DeepSeek-R1, DeepSeek’s flagship reasoning model, which serves as a blueprint for building reasoning models. This model improves upon DeepSeek-R1-Zero by incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) to improve its reasoning performance. Note that it is actually common to include an SFT stage before RL, as seen in the standard RLHF pipeline. OpenAI's o1 was likely developed using a similar approach. The development process of DeepSeek-R1 model. As shown in the diagram above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold-start" SFT data. The term "cold start" refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning (SFT) data. Using this cold-start SFT data, DeepSeek then trained the model via instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage retained the same accuracy and format rewards used in DeepSeek-R1-Zero’s RL process. However, they added a consistency reward to prevent language mixing, which occurs when the model switches between multiple languages within a response. The RL stage was followed by another round of SFT data collection. In this phase, the most recent model checkpoint was used to generate 600K Chain-of-Thought (CoT) SFT examples, while an additional 200K knowledge-based SFT examples were created using the DeepSeek-V3 base model. These 600K + 200K SFT samples were then used for instruction-finetuning DeepSeek-V3 base before following up with a final round of RL. In this stage, they again used rule-based methods for accuracy rewards for math and coding questions, while human preference labels used for other question types. All in all, this is very similar to regular RLHF except that the SFT data contains (more) CoT examples. And the RL has verifiable rewards in addition to human preference-based rewards. The final model, DeepSeek-R1 has a noticeable performance boost over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below. Benchmark comparison of OpenAI O1 and DeepSeek R1 models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). So far, we have covered three key approaches to building and improving reasoning models: 1. Inference-time scaling, a technique that improves reasoning capabilities without training or otherwise modifying the underlying model. 2. Pure reinforcement learning (RL) as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning. 3. Supervised fine-tuning (SFT) plus RL, which led to DeepSeek-R1, DeepSeek’s flagship reasoning model. So, what’s left? Model "distillation." Surprisingly, DeepSeek also released smaller models trained via a process they call distillation . However, in the context of LLMs, distillation does not necessarily follow the classical knowledge distillation approach used in deep learning. Traditionally, in knowledge distillation (as briefly described in Chapter 6 of my Machine Learning Q and AI book), a smaller student model is trained on both the logits of a larger teacher model and a target dataset. Instead, here distillation refers to instruction fine-tuning smaller LLMs, such as Llama 8B and 70B and Qwen 2.5 models (0.5B to 32B), on an SFT dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset that was used to train DeepSeek-R1, as described in the previous section. To clarify this process, I have highlighted the distillation portion in the diagram below. The development process of DeepSeek-R1-Distill models. Why did they develop these distilled models? In my opinion, there are two key reasons: 1. Smaller models are more efficient. This means they are cheaper to run, but they also can run on lower-end hardware, which makes these especially interesting for many researchers and tinkerers like me. 2. A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning. The table below compares the performance of these distilled models against other popular models, as well as DeepSeek-R1-Zero and DeepSeek-R1. Benchmark comparison of distilled versus non-distilled models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong relative to DeepSeek-R1-Zero, despite being orders of magnitude smaller. It's also interesting to note how well these models perform compared to o1 mini (I suspect o1-mini itself might be a similarly distilled version of o1). Before wrapping up this section with a conclusion, there’s one more interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure RL approach from DeepSeek-R1-Zero directly to Qwen-32B. The results of this experiment are summarized in the table below, where QwQ-32B-Preview serves as a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team (I think the training details were never disclosed). This comparison provides some additional insights into whether pure RL alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero. Benchmark comparison distillation and RL on a smaller 32B model. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). Interestingly, the results suggest that distillation is far more effective than pure RL for smaller models. This aligns with the idea that RL alone may not be sufficient to induce strong reasoning abilities in models of this scale, whereas SFT on high-quality reasoning data can be a more effective strategy when working with small models. For completeness, it would have been useful to see additional comparisons in the table: 1. Qwen-32B trained with SFT + RL, similar to how DeepSeek-R1 was developed. This would help determine how much improvement can be made, compared to pure RL and pure SFT, when RL is combined with SFT. 2. DeepSeek-V3 trained with pure SFT, similar to how the distilled models were created. This would allow for a direct comparison to see how effective RL + SFT is over pure SFT. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. In this section, we explored four different strategies for building and improving reasoning models: 1. Inference-time scaling requires no additional training but increases inference costs, making large-scale deployment more expensive as the number or users or query volume grows. Still, it remains a no-brainer for improving the performance of already strong models. I strongly suspect that o1 leverages inference-time scaling, which helps explain why it is more expensive on a per-token basis compared to DeepSeek-R1. 2. Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. However, in practical model development, RL + SFT is the preferred approach as it leads to stronger reasoning models. I strongly suspect that o1 was trained using RL + SFT as well. More precisely, I believe o1 starts from a weaker, smaller base model than DeepSeek-R1 but compensates with RL + SFT and inference-time scaling. 3. As mentioned above, RL + SFT is the key approach for building high-performance reasoning models. DeepSeek-R1 is a nice blueprint showing how this can be done. 4. Distillation is an attractive approach, especially for creating smaller, more efficient models. However, the limitation is that distillation does not drive innovation or produce the next generation of reasoning models. For instance, distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data. One interesting aspect I expect to see next is to combine RL + SFT (approach 3) with inference-time scaling (approach 1). This is likely what OpenAI o1 is doing, except it's probably based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time. In recent weeks, many people have asked for my thoughts on the DeepSeek-R1 models. In short, I think they are an awesome achievement. As a research engineer, I particularly appreciate the detailed technical report, which provides insights into their methodology that I can learn from. One of the most fascinating takeaways is how reasoning emerged as a behavior from pure RL. And it's impressive that DeepSeek has open-sourced their models under a permissive open-source MIT license, which has even fewer restrictions than Meta's Llama models. How does it compare to o1? Is DeepSeek-R1 better than o1? I’d say it’s roughly in the same ballpark. However, what stands out is that DeepSeek-R1 is more efficient at inference time. This suggests that DeepSeek likely invested more heavily in the training process, while OpenAI may have relied more on inference-time scaling for o1. That said, it's difficult to compare o1 and DeepSeek-R1 directly because OpenAI has not disclosed much about o1. For instance, we don’t know: Is o1 also a Mixture of Experts (MoE)? How large is o1? Could o1 just be a slightly refined version of GPT-4o with minimal RL + SFT and only extensive inference-time scaling? Without knowing these details, a direct comparison remains an apples-to-oranges comparison. The cost of training DeepSeek-R1 Another point of discussion has been the cost of developing DeepSeek-R1. Some have mentioned a ~$6 million training cost, but they likely conflated DeepSeek-V3 (the base model released in December last year) and DeepSeek-R1. The $6 million estimate is based on an assumed $2 per GPU hour and the number of GPU hours required for the final training run of DeepSeek-V3, which was originally discussed back in December 2024. However, the DeepSeek team has never disclosed the exact GPU hours or development cost for R1, so any cost estimates remain pure speculation. Either way, ultimately, DeepSeek-R1 is a major milestone in open-weight reasoning models, and its efficiency at inference time makes it an interesting alternative to OpenAI’s o1. Developing a DeepSeek-R1-level reasoning model likely requires hundreds of thousands to millions of dollars, even when starting with an open-weight base model like DeepSeek-V3. This can feel discouraging for researchers or engineers working with limited budgets. The good news: Distillation can go a long way Fortunately, model distillation offers a more cost-effective alternative. The DeepSeek team demonstrated this with their R1-distilled models, which achieve surprisingly strong reasoning performance despite being significantly smaller than DeepSeek-R1. However, even this approach isn’t entirely cheap. Their distillation process used 800K SFT samples, which requires substantial compute. Interestingly, just a few days before DeepSeek-R1 was released, I came across an article about Sky-T1 , a fascinating project where a small team trained an open-weight 32B model using only 17K SFT samples. The total cost? Just $450, which is less than the registration fee for most AI conferences. This example highlights that while large-scale training remains expensive, smaller, targeted fine-tuning efforts can still yield impressive results at a fraction of the cost. Figure from the "Sky-T1: Train your own O1 preview model within $450" article, https://novasky-ai.github.io/posts/sky-t1/ According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost. Pure RL on a budget: TinyZero While Sky-T1 focused on model distillation, I also came across some interesting work in the "pure RL" space. One notable example is TinyZero , a 3B parameter model that replicates the DeepSeek-R1-Zero approach (side note: it costs less than $30 to train). Surprisingly, even at just 3B parameters, TinyZero exhibits some emergent self-verification abilities, which supports the idea that reasoning can emerge through pure RL, even in small models. The TinyZero repository mentions that a research report is still work in progress, and I’ll definitely be keeping an eye out for further details. A figure from the TinyZero repository (https://github.com/Jiayi-Pan/TinyZero) showing that the model is capable of self-verification. (It would have been interesting to see the response of the base model in comparison.) The two projects mentioned above demonstrate that interesting work on reasoning models is possible even with limited budgets. While both approaches replicate methods from DeepSeek-R1, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1), it would be fascinating to explore how these ideas can be extended further. Beyond Traditional SFT: Journey Learning One particularly interesting approach I came across last year is described in the paper O1 Replication Journey: A Strategic Progress Report – Part 1 . Despite its title, the paper does not actually replicate o1. Instead, it introduces an different way to improve the distillation (pure SFT) process. The key idea in the paper is "journey learning" as an alternative to "shortcut learning." Shortcut learning refers to the traditional approach in instruction fine-tuning, where models are trained using only correct solution paths. Journey learning, on the other hand, also includes incorrect solution paths, allowing the model to learn from mistakes. This approach is kind of related to the self-verification abilities observed in TinyZero’s pure RL training, but it focuses on improving the model entirely through SFT. By exposing the model to incorrect reasoning paths and their corrections, journey learning may also reinforce self-correction abilities, potentially making reasoning models more reliable this way. Journey learning, as opposed to traditional shortcut learning, includes wrong solutions paths in the SFT data. Annotated figure from the O1 Replication Journey: A Strategic Progress Report – Part 1 (https://arxiv.org/abs/2410.18982) This could be an exciting direction for future work, particularly for low-budget reasoning model development, where RL-based approaches may be computationally impractical. Anyways, a lot of interesting work is currently happening on the reasoning model front, and I'm sure we will see a lot more exciting work in the upcoming months! This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Stages 1-3 are the common steps to developing LLMs. Stage 4 specializes LLMs for specific use cases. The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications. Because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later. To give you a brief glimpse of what's covered below, in this article, I will: Explain the meaning of "reasoning model" Discuss the advantages and disadvantages of reasoning models Outline the methodology behind DeepSeek R1 Describe the four main approaches to building and improving reasoning models Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases Provide tips for developing reasoning models on a tight budget A regular LLM may only provide a short answer (as shown on the left), whereas reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs who have not been specifically developed for reasoning tasks can also provide intermediate reasoning steps in their answers. Most modern LLMs are capable of basic reasoning and can answer questions like, "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" So, today, when we refer to reasoning models, we typically mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs. Additionally, most LLMs branded as reasoning models today include a "thought" or "thinking" process as part of their response. Whether and how an LLM actually "thinks" is a separate discussion. Intermediate steps in reasoning models can appear in two ways. First, they may be explicitly included in the response, as shown in the previous figure. Second, some reasoning LLMs, such as OpenAI's o1, run multiple iterations with intermediate steps that are not shown to the user. "Reasoning" is used at two different levels: 1) processing the input and generating via multiple intermediate steps and 2) providing some sort of reasoning as part of the response to the user. When should we use reasoning models? Now that we have defined reasoning models, we can move on to the more interesting part: how to build and improve LLMs for reasoning tasks. However, before diving into the technical details, it is important to consider when reasoning models are actually needed. When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to "overthinking." Also here the simple rule applies: Use the right tool (or type of LLM) for the task. The key strengths and limitations of reasoning models are summarized in the figure below. The key strengths and weaknesses of reasoning models. A brief look at the DeepSeek training pipeline Before discussing four main approaches to building and improving reasoning models in the next section, I want to briefly outline the DeepSeek R1 pipeline, as described in the DeepSeek R1 technical report . This report serves as both an interesting case study and a blueprint for developing reasoning LLMs. Note that DeepSeek did not release a single R1 reasoning model but instead introduced three distinct variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill. Based on the descriptions in the technical report, I have summarized the development process of these models in the diagram below. Development process of DeepSeeks three different reasoning models that are discussed in the DeepSeek R1 technical report. Next, let's briefly go over the process shown in the diagram above. More details will be covered in the next section, where we discuss the four main approaches to building and improving reasoning models. (1) DeepSeek-R1-Zero: This model is based on the 671B pre-trained DeepSeek-V3 base model released in December 2024. The research team trained it using reinforcement learning (RL) with two types of rewards. This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step, which is typically part of reinforcement learning with human feedback (RLHF). (2) DeepSeek-R1: This is DeepSeek's flagship reasoning model, built upon DeepSeek-R1-Zero. The team further refined it with additional SFT stages and further RL training, improving upon the "cold-started" R1-Zero model. (3) DeepSeek-R1-Distill*: Using the SFT data generated in the previous steps, the DeepSeek team fine-tuned Qwen and Llama models to enhance their reasoning abilities. While not distillation in the traditional sense, this process involved training smaller models (Llama 8B and 70B, and Qwen 1.5B–30B) on outputs from the larger DeepSeek-R1 671B model. The 4 main ways to build and improve reasoning models In this section, I will outline the key techniques currently used to enhance the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1, OpenAI's o1 & o3, and others. Note: The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques. 1) Inference-time scaling One way to improve an LLM's reasoning capabilities (or any capability in general) is inference-time scaling. This term can have multiple meanings, but in this context, it refers to increasing computational resources during inference to improve output quality. A rough analogy is how humans tend to generate better responses when given more time to think through complex problems. Similarly, we can apply techniques that encourage the LLM to "think" more while generating an answer. (Although, whether LLMs actually "think" is a different discussion.) One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting , where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems. (Note that it doesn't make sense to employ this strategy for simpler knowledge-based questions, like "What is the capital of France", which is again a good rule of thumb to find out whether a reasoning model makes sense on your given input query.) An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916). The aforementioned CoT approach can be seen as inference-time scaling because it makes inference more expensive through generating more output tokens. Another approach to inference-time scaling is the use of voting and search strategies. One simple example is majority voting where we have the LLM generate multiple answers, and we select the correct answer by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses. I highly recommend the Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters paper that I described in my previous Noteworthy AI Research Papers of 2024 (Part Two) article (https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2) for more details on these different strategies. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 The DeepSeek R1 technical report categorizes common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model's natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling compared to the V3 base model. However, explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, so DeepSeek may still apply such techniques within their app. I suspect that OpenAI's o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive compared to models like GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained using RL pipelines similar to those used for DeepSeek R1. More on reinforcement learning in the next two sections below. 2) Pure reinforcement learning (RL) One of my personal highlights from the DeepSeek R1 paper is their discovery that reasoning emerges as a behavior from pure reinforcement learning (RL). Let's explore what this means in more detail. As outlined earlier, DeepSeek developed three types of R1 models. The first, DeepSeek-R1-Zero , was built on top of the DeepSeek-V3 base model, a standard pre-trained LLM they released in December 2024. Unlike typical RL pipelines, where supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained exclusively with reinforcement learning without an initial SFT stage as highlighted in the diagram below. The development process of DeepSeek-R1-Zero model. Still, this RL process is similar to the commonly used RLHF approach, which is typically applied to preference-tune LLMs. (I covered RLHF in more detail in my article, LLM Training: RLHF and Its Alternatives .) However, as mentioned above, the key difference in DeepSeek-R1-Zero is that they skipped the supervised fine-tuning (SFT) stage for instruction tuning. This is why they refer to it as "pure" RL. (Although, RL in the context of LLMs differs significantly from traditional RL, which is a topic for another time.) For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward. The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses. The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside <think> tags. A figure from the DeepSeek R1 technical report (https://arxiv.org/abs/2501.12948) showing the emergence of the "Aha" moment. While R1-Zero is not a top-performing reasoning model, it does demonstrate reasoning capabilities by generating intermediate "thinking" steps, as shown in the figure above. This confirms that it is possible to develop a reasoning model using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 3) Supervised finetuning and reinforcement learning (SFT + RL) Next, let's look at the development of DeepSeek-R1, DeepSeek’s flagship reasoning model, which serves as a blueprint for building reasoning models. This model improves upon DeepSeek-R1-Zero by incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) to improve its reasoning performance. Note that it is actually common to include an SFT stage before RL, as seen in the standard RLHF pipeline. OpenAI's o1 was likely developed using a similar approach. The development process of DeepSeek-R1 model. As shown in the diagram above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold-start" SFT data. The term "cold start" refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning (SFT) data. Using this cold-start SFT data, DeepSeek then trained the model via instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage retained the same accuracy and format rewards used in DeepSeek-R1-Zero’s RL process. However, they added a consistency reward to prevent language mixing, which occurs when the model switches between multiple languages within a response. The RL stage was followed by another round of SFT data collection. In this phase, the most recent model checkpoint was used to generate 600K Chain-of-Thought (CoT) SFT examples, while an additional 200K knowledge-based SFT examples were created using the DeepSeek-V3 base model. These 600K + 200K SFT samples were then used for instruction-finetuning DeepSeek-V3 base before following up with a final round of RL. In this stage, they again used rule-based methods for accuracy rewards for math and coding questions, while human preference labels used for other question types. All in all, this is very similar to regular RLHF except that the SFT data contains (more) CoT examples. And the RL has verifiable rewards in addition to human preference-based rewards. The final model, DeepSeek-R1 has a noticeable performance boost over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below. Benchmark comparison of OpenAI O1 and DeepSeek R1 models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). 4) Pure supervised finetuning (SFT) and distillation So far, we have covered three key approaches to building and improving reasoning models: 1. Inference-time scaling, a technique that improves reasoning capabilities without training or otherwise modifying the underlying model. 2. Pure reinforcement learning (RL) as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning. 3. Supervised fine-tuning (SFT) plus RL, which led to DeepSeek-R1, DeepSeek’s flagship reasoning model. So, what’s left? Model "distillation." Surprisingly, DeepSeek also released smaller models trained via a process they call distillation . However, in the context of LLMs, distillation does not necessarily follow the classical knowledge distillation approach used in deep learning. Traditionally, in knowledge distillation (as briefly described in Chapter 6 of my Machine Learning Q and AI book), a smaller student model is trained on both the logits of a larger teacher model and a target dataset. Instead, here distillation refers to instruction fine-tuning smaller LLMs, such as Llama 8B and 70B and Qwen 2.5 models (0.5B to 32B), on an SFT dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset that was used to train DeepSeek-R1, as described in the previous section. To clarify this process, I have highlighted the distillation portion in the diagram below. The development process of DeepSeek-R1-Distill models. Why did they develop these distilled models? In my opinion, there are two key reasons: 1. Smaller models are more efficient. This means they are cheaper to run, but they also can run on lower-end hardware, which makes these especially interesting for many researchers and tinkerers like me. 2. A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning. The table below compares the performance of these distilled models against other popular models, as well as DeepSeek-R1-Zero and DeepSeek-R1. Benchmark comparison of distilled versus non-distilled models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong relative to DeepSeek-R1-Zero, despite being orders of magnitude smaller. It's also interesting to note how well these models perform compared to o1 mini (I suspect o1-mini itself might be a similarly distilled version of o1). Before wrapping up this section with a conclusion, there’s one more interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure RL approach from DeepSeek-R1-Zero directly to Qwen-32B. The results of this experiment are summarized in the table below, where QwQ-32B-Preview serves as a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team (I think the training details were never disclosed). This comparison provides some additional insights into whether pure RL alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero. Benchmark comparison distillation and RL on a smaller 32B model. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). Interestingly, the results suggest that distillation is far more effective than pure RL for smaller models. This aligns with the idea that RL alone may not be sufficient to induce strong reasoning abilities in models of this scale, whereas SFT on high-quality reasoning data can be a more effective strategy when working with small models. For completeness, it would have been useful to see additional comparisons in the table: 1. Qwen-32B trained with SFT + RL, similar to how DeepSeek-R1 was developed. This would help determine how much improvement can be made, compared to pure RL and pure SFT, when RL is combined with SFT. 2. DeepSeek-V3 trained with pure SFT, similar to how the distilled models were created. This would allow for a direct comparison to see how effective RL + SFT is over pure SFT. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Conclusion In this section, we explored four different strategies for building and improving reasoning models: 1. Inference-time scaling requires no additional training but increases inference costs, making large-scale deployment more expensive as the number or users or query volume grows. Still, it remains a no-brainer for improving the performance of already strong models. I strongly suspect that o1 leverages inference-time scaling, which helps explain why it is more expensive on a per-token basis compared to DeepSeek-R1. 2. Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. However, in practical model development, RL + SFT is the preferred approach as it leads to stronger reasoning models. I strongly suspect that o1 was trained using RL + SFT as well. More precisely, I believe o1 starts from a weaker, smaller base model than DeepSeek-R1 but compensates with RL + SFT and inference-time scaling. 3. As mentioned above, RL + SFT is the key approach for building high-performance reasoning models. DeepSeek-R1 is a nice blueprint showing how this can be done. 4. Distillation is an attractive approach, especially for creating smaller, more efficient models. However, the limitation is that distillation does not drive innovation or produce the next generation of reasoning models. For instance, distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data. One interesting aspect I expect to see next is to combine RL + SFT (approach 3) with inference-time scaling (approach 1). This is likely what OpenAI o1 is doing, except it's probably based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time. Thoughts about DeepSeek R1 In recent weeks, many people have asked for my thoughts on the DeepSeek-R1 models. In short, I think they are an awesome achievement. As a research engineer, I particularly appreciate the detailed technical report, which provides insights into their methodology that I can learn from. One of the most fascinating takeaways is how reasoning emerged as a behavior from pure RL. And it's impressive that DeepSeek has open-sourced their models under a permissive open-source MIT license, which has even fewer restrictions than Meta's Llama models. How does it compare to o1? Is DeepSeek-R1 better than o1? I’d say it’s roughly in the same ballpark. However, what stands out is that DeepSeek-R1 is more efficient at inference time. This suggests that DeepSeek likely invested more heavily in the training process, while OpenAI may have relied more on inference-time scaling for o1. That said, it's difficult to compare o1 and DeepSeek-R1 directly because OpenAI has not disclosed much about o1. For instance, we don’t know: Is o1 also a Mixture of Experts (MoE)? How large is o1? Could o1 just be a slightly refined version of GPT-4o with minimal RL + SFT and only extensive inference-time scaling? Figure from the "Sky-T1: Train your own O1 preview model within $450" article, https://novasky-ai.github.io/posts/sky-t1/ According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost. Pure RL on a budget: TinyZero While Sky-T1 focused on model distillation, I also came across some interesting work in the "pure RL" space. One notable example is TinyZero , a 3B parameter model that replicates the DeepSeek-R1-Zero approach (side note: it costs less than $30 to train). Surprisingly, even at just 3B parameters, TinyZero exhibits some emergent self-verification abilities, which supports the idea that reasoning can emerge through pure RL, even in small models. The TinyZero repository mentions that a research report is still work in progress, and I’ll definitely be keeping an eye out for further details. A figure from the TinyZero repository (https://github.com/Jiayi-Pan/TinyZero) showing that the model is capable of self-verification. (It would have been interesting to see the response of the base model in comparison.) The two projects mentioned above demonstrate that interesting work on reasoning models is possible even with limited budgets. While both approaches replicate methods from DeepSeek-R1, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1), it would be fascinating to explore how these ideas can be extended further. Beyond Traditional SFT: Journey Learning One particularly interesting approach I came across last year is described in the paper O1 Replication Journey: A Strategic Progress Report – Part 1 . Despite its title, the paper does not actually replicate o1. Instead, it introduces an different way to improve the distillation (pure SFT) process. The key idea in the paper is "journey learning" as an alternative to "shortcut learning." Shortcut learning refers to the traditional approach in instruction fine-tuning, where models are trained using only correct solution paths. Journey learning, on the other hand, also includes incorrect solution paths, allowing the model to learn from mistakes.

Machine Learning

0 views

Ahead of AI 10 months ago

Noteworthy AI Research Papers of 2024 (Part Two)

I hope your 2025 is off to a great start! To kick off the year, I've finally been able to complete the draft and second part of this AI Research Highlights of 2024 article. It covers a variety of relevant topics, from mixture-of-experts models to new LLM scaling laws for precision. Note that this article is Part Two in this series, focusing on the second half of 2024 from July through December. You can find Part One, covering January to June here. The selection criteria are admittedly subjective, based on what stood out to me this year. I've also aimed for some variety, so it's not all just about LLM model releases. I hope you are having a great 2025, and happy reading! Readers are probably already well familiar with Meta AI's Llama 3 models and paper, but since these are such important and widely-used models, I want to dedicate the July section to The Llama 3 Herd of Models (July 2024) paper by Grattafiori and colleagues. What's notable about the Llama 3 model family is the increased sophistication of the pre-training and post-training pipelines compared to its Llama 2 predecessor. Note that this is not only true for Llama 3 but other LLMs like Gemma 2 , Qwen 2 , Apple's Foundation Models , and others, as I described a few months ago in my New LLM Pre-training and Post-training Paradigms article. Llama 3 was first released in 8 billion and 70 billion parameter sizes, but the team kept iterating on the model, releasing 3.1, 3.2, and 3.3 versions of Llama. The sizes are summarized below: Llama 3 (April 2024) 8B parameters 70B parameters Llama 3.1 (July 2024, discussed in the paper) 8B parameters 70B parameters 405B parameters Llama 3.2 (September 2024) 1B parameters 3B parameters 11B parameters (vision-enabled) 90B parameters (vision-enabled) L lama 3.3 (December 2024) 70B parameters Overall, the Llama 3 architecture closely resembles that of Llama 2. The key differences lie in its larger vocabulary and the introduction of grouped-query attention for the smaller model variant. A summary of the differences is shown in the figure below. Llama 2 vs 3 comparison from the bonus material of my Build a Large Language from Scratch book If you're curious about architectural details, a great way to learn is by implementing the model from scratch and loading pretrained weights as a sanity check. I have a GitHub repository with a from-scratch implementation that converts GPT-2 to Llama 2, Llama 3, Llama 3.1, and Llama 3.2. GPT-2 to Llama 2, Llama 3, Llama 3.1, and Llama 3.2 conversion from the bonus material of my Build a Large Language from Scratch book Another noteworthy update over Llama 2 is that Llama 3 has now been trained on 15 trillion tokens. Comparison of the training set sizes of various models. The pre-training process is now multi-staged. The paper primarily focuses on Llama 3.1, and for the sake of brevity, I have summarized its pre-training techniques in the figure below. Summary of techniques used in pre-training Llama 3.1. In post-training, a notable change from Llama 2 is the switch from RLHF-PPO to DPO. These methods are also summarized in the figure below. Summary of techniques used in pre-training Llama 3.1. For the interest of brevity, since there are 5 more papers to be covered in this article, I will defer the additional details and comparisons to other models to one of my previous articles. New LLM Pre-training and Post-training Paradigms . Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Note that Llama 3.2 models were also released with multimodal support. However, I haven't observed widespread use of these models in practice, and they aren't widely discussed. We'll revisit multimodal techniques in the September section later in this article. While it's been over half a year since Llama 3 was released, Llama models continue to be among the most widely recognized and used open-weight LLMs (based on my personal perception, as I don’t have a specific source to cite). These models are relatively easy to understand and use. The reason for their popularity is likely the Llama brand recognition coupled with robust performance across a variety of general tasks, and making it easy to finetune them. Meta AI has also maintained momentum by iterating on the Llama 3 model, releasing versions 3.1, 3.2, and now 3.3, which span a variety of sizes to cater to diverse use cases, from on-device scenarios (1B) to high-performance applications (400B). Although the field now includes many competitive open-source and open-weight LLMs like Olmo 2, Qwen 2.5, Gemma 2, and Phi-4, and many others, I believe Llama will remain the go-to model for most users, much like ChatGPT has retained its popularity despite competition from options like Anthropic Claude, Google Gemini, DeepSeek, and others. Personally, I’m excited for Llama 4, which I hope will be released sometime in 2025. My pick for this month is Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (August 2024) because it is a very well-written and detailed paper that offers some interesting insights into improving LLM responses during inference time (i.e., deployment). The main premise of this paper is to investigate if and how increased test-time computation can be used to improve LLM outputs. As a rough analogy, suppose that humans, on hard tasks, can generate better responses if they are given more time to think. Analogously, LLMs may be able to produce better outputs given more time/resources to generate their responses. In more technical terms, the researchers try to find out how much better models can perform than they are trained to do if additional compute is used during inference. In addition, the researchers also looked into whether, given a fixed compute budget, spending more compute on test time can improve the results over spending that compute for further pre-training a model. But more on that later. The paper describes techniques for increasing and improving and test-time compute in great detail, and if you are serious about deploying LLMs in practice (e.g., the aforementioned Llama models), I highly recommend giving this paper a full read. In short, the 2 main methods to scale test-time compute are 1. Generating multiple solutions and using a process-based verifier reward model (it has to be separately trained) to select the best response 2. Updating the model's response distribution adaptively, which essentially means revising the responses during inference generation (this also requires a separate model). To provide a simple example for category 1: One naive way to improve test time compute is to use best-of-N sampling. This means that we let the LLM generate multiple answers in parallel and then pick the best one based on a verifier reward model. Best of N is also just one example. Multiple search algorithms fall into this category: beam-search, lookahead-search, and best-of-N, as shown in the figure below. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 Another approach, which falls into category 2, is sequentially revising the model's response, as illustrated in the figure below. Sequential revision approaches. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 Which approach works better? Unfortunately, there is no one-size-fits-all answer. It depends on the base LLM and the specific problem or query. For example, revision-based approaches perform better on harder questions, and they can actually harm performance on easy questions. In the paper, they developed an "optimal" strategy based on a model that assesses the query's difficulty level and then chooses the right strategy appropriately. An interesting question to answer is, given a fixed compute budget, what gives the bigger bang for the buck: using a larger model or using an increased inference-time budget? Here, suppose the price you pay for a query is the same because running a large model in inference is more costly than a small one. They found that for challenging questions, larger models outperform smaller models that get additional inference compute via the inference scaling strategies discussed earlier. However, for easy and medium questions, inference time compute can be used to match the performance of 14x larger models at the same compute budget! When using open-weight models like Llama 3 and others, we often let them generate responses as-is. However, as this paper highlights, response quality can be significantly enhanced by allocating more inference compute. (If you are deploying models, this is definitely THE paper to read.) Of course, increasing the inference-compute budget for large, expensive models makes them even costlier to operate. Yet, when applied selectively based on the difficulty of the queries, it can provide a valuable boost in quality and accuracy for certain responses, which is something most users would undoubtedly appreciate. (It’s safe to assume that OpenAI, Anthropic, and Google already leverage such techniques behind the scenes.) Another compelling use case is enhancing the performance of smaller, on-device LLMs. I think this will remain a hot topic in the months and years ahead as we've also seen with the big announcements and investments in Apple Intelligence and Microsoft’s Copilot PCs. Multimodal LLMs were one of the major things I thought would make big leaps in 2024. And yes, we got some more open-weight LLMs this year! An illustration of a multimodal LLM that can accept different input modalities (audio, text, images, and videos) and returns text as the output modality. One paper that particularly stood out to me was NVIDIA's NVLM: Open Frontier-Class Multimodal LLMs (September 2024) by Dai and colleagues, because it nicely compares the two leading multimodal paradigms. There are two main approaches to building multimodal LLMs: Method A: Unified Embedding Decoder Architecture approach; Method B: Cross-modality Attention Architecture approach. The two main approaches to developing multimodal LLM architectures. As illustrated in the figure above, the Unified Embedding-Decoder Architecture (Method A) relies on a single decoder model, resembling an unmodified LLM architecture such as GPT-2 or Llama 3.2. This method converts images into tokens that share the same embedding size as text tokens, enabling the LLM to process concatenated text and image input tokens. In contrast, the Cross-Modality Attention Architecture (Method B) incorporates a cross-attention mechanism to integrate image and text embeddings within the attention layer directly. If you are interested in additional details, I dedicated a whole article to multimodal LLMs earlier this year that goes over these two methods step by step: Understanding Multimodal LLMs -- An introduction to the main techniques and latest models . Given all the multimodal developments this year, to me, NVIDIA's paper NVLM: Open Frontier-Class Multimodal LLMs stands out for its comprehensive apples-to-apples comparison of these multimodal approaches. Rather than focusing on a single method, they directly compared: Method A: The Unified Embedding Decoder Architecture ("decoder-only architecture," NVLM-D), Method B: The Cross-Modality Attention Architecture ("cross-attention-based architecture," NVLM-X), A hybrid approach (NVLM-H). Overview of the three multimodal approaches. (Annotated figure from the NVLM: Open Frontier-Class Multimodal LLMs paper: https://arxiv.org/abs/2409.11402) As summarized in the figure above, NVLM-D aligns with Method A, and NVLM-X corresponds to Method B, as discussed earlier. The hybrid model (NVLM-H) combines the strengths of both approaches: it first accepts an image thumbnail as input, followed by a dynamic number of patches processed through cross-attention to capture finer high-resolution details. In summary, the key findings are as follows: NVLM-X: Offers superior computational efficiency for high-resolution images. NVLM-D: Delivers higher accuracy for OCR-related tasks. NVLM-H: Combines the strengths of both approaches for optimal performance. Multimodal LLMs are an interesting one. I think they are the next logical development up from regular text-based LLMs. Most LLM service providers like (OpenAI, Google, and Anthropic) support multimodal inputs like images. Personally, I need multimodal capabilities maybe 1% of the time (usually, it's something like: "extract the table in markdown format" or something like that). I do expect the default of open-weight LLMs to be purely text-based because it adds less complexity. At the same time I do think we will see more options and widespread use of open-weight LLMs as the tooling and APIs evolve. My pick for October is the O1 Replication Journey: A Strategic Progress Report -- Part 1 . (October 2024) by Quin and colleagues. OpenAI ChatGPT's o1 (and now o3) have gained significant popularity, as they seem to represent a paradigm shift in improving LLMs' performance on reasoning tasks. The exact details of OpenAI's o1 remain undisclosed, and several papers have attempted to describe or replicate it. So, why did I choose this one? Its unusual structure and broader philosophical arguments about the state of academic research resonated with me. In other words, there was something distinctive about it that stood out and made it an interesting choice. One of the key points of this paper is the researchers' hypothesis that O1 employs a process called journey learning as opposed to shortcut learning, as illustrated in the figure below. Traditionally, LLMs are trained on the correct solution path (shortcut learning); in journey learning, the supervised finetuning encompasses the whole trial-and-error correction process. Annotated figure from the O1 Replication Report, https://arxiv.org/abs/2410.18982 It's worth noting that the journey learning approach is somewhat similar to the tree-based or beam-search methods with revisions, as discussed earlier in the "8. August: Improving LLMs by Scaling Inference-Time Compute" section of this article. The subtle difference, however, is that the researchers create journey learning training examples for model finetuning, rather than simply applying this technique during inference. (It's worth noting that I couldn't find any information on the techniques they used to augment the inference process.) The researchers constructed a reasoning tree to derive an extended thought process from it, emphasizing trial and error. This approach diverges from traditional methods that prioritize finding a direct path to the correct answer with valid intermediate steps. In their framework, each node in the reasoning tree was annotated with a rating provided by a reward model, indicating whether the step was correct or incorrect, along with reasoning to justify this evaluation. Next, they trained a deepseek-math-7b-base model via supervised finetuning and DPO. Here, they trained two models. 1. First they used the traditional shortcut training paradigm where only the correct intermediate steps were provided. 2. Second, they trained the model with their proposed journey learning approach that included the thought process three with correct and incorrect answers, backtracking, and so forth. (Sidenote: They only used 327 examples in each case!) As shown in the figure below, the journey learning process outperformed shortcut learning by quite a wide margin on the MATH500 benchmark dataset. LLMs trained with shortcut and journey learning. Annotated figure from the O1 Replication Report, https://arxiv.org/abs/2410.18982 One month later, the team released another report: O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? (November 2024) by Huang and colleagues. Here, they used a distillation approach, meaning they used careful prompting to extract the thought processes from o1 to train a model to reach the same performance. Since this is a long article, I won't go over the details, but I wanted to share an interesting figure from that paper that summarizes the cost trade-offs of collecting long-thought data. They got really good performance with this distillation approach, performing on-part with o1-preview and o1-mini. However, along with these experiments, the researchers also shared some interesting and important thoughts about the state of research in light of this approach, which I will summarize in the next section. One big focus of the part 2 report was the "Bitter Lesson of Simple Distillation". Sure, distillation works well in practice, but it isn't what drives progress. In the best case, using distillation, you are matching the performance of of an existing upstream model (but you are not setting a new performance record.) Below are three quotes from the paper that might serve as a warning call about the current status quo: "This shift from “how it works” to 'what works' represents a fundamental change in research mentality that could have far-reaching consequences for the field’s future innovation capacity." "This erosion of first-principles thinking is particularly concerning as it undermines the very foundation of scientific innovation." "Pressure to produce quick results may overshadow the value of deeper technical investigations, while students may be discouraged from pursuing more challenging, fundamental research directions." My personal take is that I still think there are tons of great and important ideas coming out of academic labs (today also often in partnership with industry), and they can be really practical and impactful. (A couple of my favorites that come to mind are LoRA and DPO.) The catch is that a lot of promising ideas never get tested at scale because universities usually don't have the massive resources needed for that. I'm not sure what the perfect solution is, and I do realize that companies can't just give away their trade secrets. But it would be really helpful if, whenever companies do end up using ideas from academic papers, they'd openly acknowledge it. That kind of recognition goes a long way in motivating and rewarding researchers who make their work freely available. Also, it helps move the field forward by finding out what actually works in practice. Does the O1 Replication Journey paper replicate the exact mechanism behind o1? Probably not. But it is still a valuable read full of ideas that can help achieve better results. I believe that “long-thought” models like o1 and o3 will continue to play a key role in LLM research. They are more expensive to run, but they are basically the gold standard or the upper limit for performance on reasoning tasks. But because of their higher cost, o1-type models are not always the best option for every situation. For simpler tasks like grammar fixes or translations, we likely do not need a reasoning-heavy model. It all comes down to balancing cost and utility. We pick the right LLM for the job based on budget, latency, and other factors. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. I was originally tempted to pick Allen AI's Tulu 3: Pushing Frontiers in Open Language Model Post-Training paper because they included a detailed description of their Llama post-training methods and recipe, including ablation studies of DPO vs PPO, and a new preference alignment method called reinforcement learning with verifiable feedbacks, where they use verifiable queries where one can easily generate a ground truth answer (such as math and code questions) instead of a reward model. But after some internal debate, I ultimately decided to go with the Scaling Laws for Precision paper (November 2024) by Kumar and colleagues, as it provides a much-needed update for the Chinchilla scaling laws from the 2022 Training Compute-Optimal Large Language Models paper that is used to determine compute-optimal LLM parameter counts and dataset sizes for pretraining. In short, the Precision Scaling Laws paper (November 2024) extends Chinchilla's scaling laws to account for training and inference in low-precision settings (16-bit and below), which have become very popular in recent years. For instance, this paper unifies various low-precision and quantization-related observations into a single functional form that predicts the added loss from both low-precision training and post-training quantization. The original Chinchilla scaling laws from the 2022 Training Compute-Optimal Large Language Models paper model how LLM parameter counts ( N ) and dataset sizes ( D ) jointly affect the validation loss of an LLM and are used as guidelines for deciding upon the LLM and training dataset sizes. As a rule of thumb, the best tradeoff between dataset size D and the number of parameters N (when you have a fixed compute budget) is approximately D/N ≈ 20. This data-parameter ratio is often referred to as "Chinchilla-optimal" because it yields lower validation loss than other ratios at the same total training cost. Note that there are many modern exceptions, though; for example, the Llama 3 team trained on 15 trillion tokens, as discussed earlier, and for the 8B version, that'd be 15,000,000,000,000 ÷ 8,000,000,000 = 1,875, for example. In my opinion, what's more important than the exact data-parameter ratio is the takeaway that model and dataset sizes have to be scaled proportionally. Before discussing (or rather summarizing) the low-precision scaling laws further, let me start with a very short primer on different numeric precision formats for LLM (or deep neural network) weights in general. To the best of my knowledge, these were the precision formats used for training GPT 2 & 3 and Llama 2 & 3 models for comparison: Float32 was the standard 32-bit floating-point format widely used for training deep neural networks, as it offers a good balance between range and precision. Everything below float32 is nowadays considered low-precision (although the definition of "low" is kind of a moving goalpost similar to the "large" in large language models). Float16, or half-precision, uses just 16 bits, saving memory and speeding up computation but providing a narrower dynamic range. Comparison between 32-bit and 16-bit floating point precision Bfloat16 (brain float 16) is also a 16-bit format but trades off some of float16’s precision for a larger exponent, allowing it to represent very large and very small numbers more effectively. As a result, bfloat16 can help avoid numeric overflow or underflow in deep learning applications, although its lower precision can still lead to rounding errors Comparison between regular 16-bit floating point and the popular 16-bit brain floating point precision If you want to learn more about the different precision formats and their impact on LLM model behavior, you might like the lengthier intro in my previous The Missing Bits: Llama 2 Weights Have Changed article. Also note that I am only showing 32- and 16-bit formats, whereas there's currently a race to even lower precisions for training, e.g., the 8-bit format that was mentioned (as experimental) in the Llama 3 paper. (The DeepSeek-v3 model that was released on Dec 26 was entirely pretrained in 8-bit floating point precision.) It's a long and interesting paper that I recommend reading in full. However, to get to the main point, the researchers extend the original Chinchilla scaling laws by adding a "precision" factor P. Concretely, they reinterpret the model parameter count N as an "effective parameter count" that shrinks as the precision decreases. (For the mathematical formulas, defer to the paper.) Plus, they added an extra term to capture how post-training quantization degrades model performance. (I realize that I didn't write an intro to quantization, but due to the excessive length of this article already, I may have to defer this to another time.) The figure below is a nice illustration that more pretraining data is not always better and can actually be harmful if models are quantized after training with very small precision (int3), which I found super interesting. The effect of more training data on the validation loss for various post-quantization formats So, as a takeaway from the figure above, one might say that models trained on more and more data (like Llama 3) become harder to quantize to lower precision formats after training due to being "overtrained" on too much data. Besides providing a much-needed update to the Chinchilla scaling laws, the research on Precision Scaling Laws provides an interesting perspective on a critical challenge for 2025: as models like LLaMA-3 are trained on larger datasets, they may become harder to quantize to low precision formats like INT3 without performance loss. This finding underscores the need to rethink the "more data is better" mindset, balancing dataset size with the practical constraints of efficient inference. It's also an important insight for driving hardware optimization. One of the aspects that I think is often neglected in these scaling laws studies is the dataset's quality. I think the pretraining data's nature can have a significant impact. (More on that in the Phi-4 discussion below.) Several interesting models were released in the latter half of 2024, including the impressive DeepSeek-V3 on Christmas day. While it might not be the biggest model release, ultimately, I decided to go with Microsoft's Phi-4 Technical Report because it offers interesting insights into the use of synthetic data. The Phi-4 Technical Report (December 2024) by Abdin and colleagues describes the training of Microsoft's latest 14-billion-parameter open-weight LLM. What makes Phi-4 particularly interesting is that it was trained primarily on synthetic data generated by GPT-4o. According to the benchmarks, it outperforms other LLMs of a similar size, including its predecessor, Phi-3, which was trained predominantly on non-synthetic data. Performance of phi-4 compared to other models of similar and different sizes (annotated table from the phi-4 paper, https://arxiv.org/abs/2412.08905) I’m not entirely sure why the model performs worse on SimpleQA, as shown in the table above. But one possible explanation is that SimpleQA is a relatively new benchmark, released on October 30, 2024. Since it was developed by OpenAI as part of their evaluation suite, it might not have been included in the training data for GPT-4o or incorporated into the web-crawled datasets. Furthermore, because GPT-4o was used to generate the synthetic data for this evaluation, none of the models would have encountered SimpleQA during training. However, phi-4 might be overfitting to other benchmarks, which could explain its comparatively lower performance on this unseen SimpleQA dataset. Anyways, that's just my hypothesis. Let's look at the dataset composition before summarizing some of the ablation studies presented in this paper. Dataset mix for training phi-4 (annotated table from the phi-4 paper, https://arxiv.org/abs/2412.08905). The researchers observed that while synthetic data is generally beneficial, models trained exclusively on synthetic data performed poorly on knowledge-based benchmarks. To me, this raises the question: does synthetic data lack sufficient knowledge-specific information, or does it include a higher proportion of factual errors, such as those caused by hallucinations? At the same time, the researchers found that increasing the number of training epochs on synthetic data boosted the performance more than just adding more web data, as shown in the figure below. Model performance comparison for different synthetic/web dataset ratios. (Annotated figure from the phi-4 paper, https://arxiv.org/abs/2412.08905). In summary, an excessive proportion of synthetic data in the mix negatively impacts knowledge-based performance. However, within a more balanced synthetic-to-web data mix, increasing the number of iterations (epochs) over the synthetic dataset is beneficial. The phi-4 technical report offers interesting insights into the use of synthetic data, namely that it can be highly beneficial for model pre-training. Especially since scaling laws are said to be plateauing concerning both model and dataset sizes (although the Llama 3 paper noted that they haven't seen a convergence at the 15T token level yet), researchers and engineers are looking for alternative ways to keep pushing the envelope. Of course, the refinement and addition of pre- and especially post-training techniques will likely remain one of the big needle movers. Still, I think that the use of synthetic data will be regarded as an effective way to either create a) pretrained base models with less data or b) create even better base models (think 15 trillion tokens from the Llama 3 dataset plus 40% synthetic data tokens added to it). I see the use of high-quality data as analogous to transfer learning. Instead of pre-training a model on raw, unstructured internet data and refining it during post-training, leveraging (some) synthetic data generated by a high-quality model (such as GPT-4o, which has already undergone extensive refinement) may serve as a kind of jumpstart. In other words, the use of high-quality training data might enable the model to learn more effectively from the outset. I hope you found these research summaries useful! As always, this article ended up being longer than I originally intended. But let me close out with a relatively short and snappy section on my predictions (or expectations) for 2025. Multimodal LLMs Last year, I predicted LLMs would become increasingly multimodal. Now, all major proprietary LLM providers offer multimodal (or at least image) support. So, the transformation is now fully underway, and we will also see more open-source efforts toward this. Based on what I've seen and read, there's definitely been a sharp increase in multimodal papers. Maybe followed by my open-source finetuning methods and resources; although I'd argue for many use cases, text-only suffices and will continue to suffice, and the main focus will be on developing better reasoning models (like o1 and the upcoming o3). Computational efficiency Pretraining and using LLMs is relatively expensive. So, I expect that we are going to see more clever tricks to improve computational efficiency of LLMs in the foreseeable future. For reference, training the recent DeepSeek-v3 model would cost $5 million dollars assuming the GPU rental sticker prices (and this doesn't include hyperparameter tuning, failed runs, and personnel cost). Back-off-the-envelope calculation from the DeepSeek-v3 report, https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf By the way, according to the official Meta AI Llama 3 model card , Llama 3 405B used even ~10x more compute (30.84 million GPU hours vs 2.66 million GPU hours). Popular examples of techniques to make LLMs efficient (although not all apply during training) include a mixture of experts (as discussed in my part 1 article), grouped-query attention as found in Llama models, and many others. Another interesting one is the use of multihead latent attention, as found in DeepSeek models, to make KV-caching in multihead attention more efficient. Another interesting recent route is targeting the model input. For instance, the recently proposed Byte Latent Transformer improves efficiency by dynamically encoding bytes into entropy-based patches, optimizing compute for scalability and faster inference without tokenization. State space models You may have noticed that I didn't cover state space models this year. That’s because my current focus is primarily on transformer-based LLMs. While I find state space models super interesting, they still seem quite experimental at this stage. Besides, transformers continue to demonstrate exceptional performance across a wide range of tasks, making it not very tempting to consider alternatives. However, that doesn't mean there hasn't been any progress on the state space model front. I've seen a bunch of interesting papers in this area. And one interesting trend I noticed is that they are now all more or less hybrid models integrating self-attention from transformer models. For example, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale , The Mamba in the Llama: Distilling and Accelerating Hybrid Models , and Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling . In that sense, they are also getting more computationally expensive. With efficiency tweaks to transformer-based LLMs and adding attention to state space models, they will probably meet somewhere in the middle if the current trends continue. It's definitely an interesting field of research to watch though. LLM progress through scaling Towards the end of the year, there was also some discussion of LLM scaling "being over" as there is no more internet data. This discussion came from a NeurIPS talk by Ilya Sutskever (one of OpenAI's co-founders and co-author on the GPT papers), but unfortunately, I couldn't attend the conference this year, so I am not familiar with the details. In any case, it's an interesting point because the internet grows exponentially fast. I could find resources saying that it grows "15.87 terabytes of data daily." Sure, the challenge is that not all of the data is text or useful for LLM training. However, as we have seen with Phi-4, there are still a lot of opportunities in data curation and refinement that can help make some leaps from training data alone. I agree with the diminishing returns of scaling via data, though. I expect that the gains will be smaller as we are probably heading towards plateauing. It's not a bad thing, though, as it brings other improvement opportunities. One notable area where I expect a lot of future gains to come from is post-training. We've already seen a taste of these developments in this area with recent LLM releases, as I wrote about last summer in my New LLM Pre-training and Post-training Paradigms article. What I am looking forward to in 2025 I really enjoyed tinkering and (re)implementing the various Llama models (3, 3.1, and 3.2) this year. I am really looking forward to the Llama 4 release, which hopefully also comes in small and convenient sizes that I can experiment with on my laptop or affordable cloud GPUs. Moreover, it's also the year where I want to experiment more with special-purpose model finetuning rather than generating general chatbots (it's already pretty crowded in this space). We've seen a bit of that with various code and math models (the recent Qwen 2.5 Coder and Qwen 2.5 Math come to mind, which I unfortunately haven't had a chance to cover in this report yet). In any case, I could keep on going with this wish list and plans, as 2025 will be another interesting and fast-moving year! It's definitely not going to be boring, that's for sure! This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Subscribe now 8B parameters 70B parameters 8B parameters 70B parameters 405B parameters 1B parameters 3B parameters 11B parameters (vision-enabled) 90B parameters (vision-enabled) 70B parameters Llama 2 vs 3 comparison from the bonus material of my Build a Large Language from Scratch book If you're curious about architectural details, a great way to learn is by implementing the model from scratch and loading pretrained weights as a sanity check. I have a GitHub repository with a from-scratch implementation that converts GPT-2 to Llama 2, Llama 3, Llama 3.1, and Llama 3.2. GPT-2 to Llama 2, Llama 3, Llama 3.1, and Llama 3.2 conversion from the bonus material of my Build a Large Language from Scratch book 7.3 Llama 3 training Another noteworthy update over Llama 2 is that Llama 3 has now been trained on 15 trillion tokens. Comparison of the training set sizes of various models. The pre-training process is now multi-staged. The paper primarily focuses on Llama 3.1, and for the sake of brevity, I have summarized its pre-training techniques in the figure below. Summary of techniques used in pre-training Llama 3.1. In post-training, a notable change from Llama 2 is the switch from RLHF-PPO to DPO. These methods are also summarized in the figure below. Summary of techniques used in pre-training Llama 3.1. For the interest of brevity, since there are 5 more papers to be covered in this article, I will defer the additional details and comparisons to other models to one of my previous articles. New LLM Pre-training and Post-training Paradigms . Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 7.4 Multimodal Llamas Note that Llama 3.2 models were also released with multimodal support. However, I haven't observed widespread use of these models in practice, and they aren't widely discussed. We'll revisit multimodal techniques in the September section later in this article. 7.5 Llama 3 impact and usage While it's been over half a year since Llama 3 was released, Llama models continue to be among the most widely recognized and used open-weight LLMs (based on my personal perception, as I don’t have a specific source to cite). These models are relatively easy to understand and use. The reason for their popularity is likely the Llama brand recognition coupled with robust performance across a variety of general tasks, and making it easy to finetune them. Meta AI has also maintained momentum by iterating on the Llama 3 model, releasing versions 3.1, 3.2, and now 3.3, which span a variety of sizes to cater to diverse use cases, from on-device scenarios (1B) to high-performance applications (400B). Although the field now includes many competitive open-source and open-weight LLMs like Olmo 2, Qwen 2.5, Gemma 2, and Phi-4, and many others, I believe Llama will remain the go-to model for most users, much like ChatGPT has retained its popularity despite competition from options like Anthropic Claude, Google Gemini, DeepSeek, and others. Personally, I’m excited for Llama 4, which I hope will be released sometime in 2025. 8. August: Improving LLMs by scaling inference-time compute My pick for this month is Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (August 2024) because it is a very well-written and detailed paper that offers some interesting insights into improving LLM responses during inference time (i.e., deployment). 8.1 Improve outputs by using more test-time computation The main premise of this paper is to investigate if and how increased test-time computation can be used to improve LLM outputs. As a rough analogy, suppose that humans, on hard tasks, can generate better responses if they are given more time to think. Analogously, LLMs may be able to produce better outputs given more time/resources to generate their responses. In more technical terms, the researchers try to find out how much better models can perform than they are trained to do if additional compute is used during inference. In addition, the researchers also looked into whether, given a fixed compute budget, spending more compute on test time can improve the results over spending that compute for further pre-training a model. But more on that later. 8.2 Optimizing test-time computation techniques The paper describes techniques for increasing and improving and test-time compute in great detail, and if you are serious about deploying LLMs in practice (e.g., the aforementioned Llama models), I highly recommend giving this paper a full read. In short, the 2 main methods to scale test-time compute are 1. Generating multiple solutions and using a process-based verifier reward model (it has to be separately trained) to select the best response 2. Updating the model's response distribution adaptively, which essentially means revising the responses during inference generation (this also requires a separate model). To provide a simple example for category 1: One naive way to improve test time compute is to use best-of-N sampling. This means that we let the LLM generate multiple answers in parallel and then pick the best one based on a verifier reward model. Best of N is also just one example. Multiple search algorithms fall into this category: beam-search, lookahead-search, and best-of-N, as shown in the figure below. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 Another approach, which falls into category 2, is sequentially revising the model's response, as illustrated in the figure below. Sequential revision approaches. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 Which approach works better? Unfortunately, there is no one-size-fits-all answer. It depends on the base LLM and the specific problem or query. For example, revision-based approaches perform better on harder questions, and they can actually harm performance on easy questions. In the paper, they developed an "optimal" strategy based on a model that assesses the query's difficulty level and then chooses the right strategy appropriately. 8.3 Test-time computation versus pretraining a larger model An interesting question to answer is, given a fixed compute budget, what gives the bigger bang for the buck: using a larger model or using an increased inference-time budget? Here, suppose the price you pay for a query is the same because running a large model in inference is more costly than a small one. They found that for challenging questions, larger models outperform smaller models that get additional inference compute via the inference scaling strategies discussed earlier. However, for easy and medium questions, inference time compute can be used to match the performance of 14x larger models at the same compute budget! 8.4 Future relevance of test-time compute scaling When using open-weight models like Llama 3 and others, we often let them generate responses as-is. However, as this paper highlights, response quality can be significantly enhanced by allocating more inference compute. (If you are deploying models, this is definitely THE paper to read.) Of course, increasing the inference-compute budget for large, expensive models makes them even costlier to operate. Yet, when applied selectively based on the difficulty of the queries, it can provide a valuable boost in quality and accuracy for certain responses, which is something most users would undoubtedly appreciate. (It’s safe to assume that OpenAI, Anthropic, and Google already leverage such techniques behind the scenes.) Another compelling use case is enhancing the performance of smaller, on-device LLMs. I think this will remain a hot topic in the months and years ahead as we've also seen with the big announcements and investments in Apple Intelligence and Microsoft’s Copilot PCs. 9. September: Comparing multimodal LLM paradigms Multimodal LLMs were one of the major things I thought would make big leaps in 2024. And yes, we got some more open-weight LLMs this year! An illustration of a multimodal LLM that can accept different input modalities (audio, text, images, and videos) and returns text as the output modality. One paper that particularly stood out to me was NVIDIA's NVLM: Open Frontier-Class Multimodal LLMs (September 2024) by Dai and colleagues, because it nicely compares the two leading multimodal paradigms. 9.1 Multimodal LLM paradigms There are two main approaches to building multimodal LLMs: Method A: Unified Embedding Decoder Architecture approach; Method B: Cross-modality Attention Architecture approach. The two main approaches to developing multimodal LLM architectures. As illustrated in the figure above, the Unified Embedding-Decoder Architecture (Method A) relies on a single decoder model, resembling an unmodified LLM architecture such as GPT-2 or Llama 3.2. This method converts images into tokens that share the same embedding size as text tokens, enabling the LLM to process concatenated text and image input tokens. In contrast, the Cross-Modality Attention Architecture (Method B) incorporates a cross-attention mechanism to integrate image and text embeddings within the attention layer directly. If you are interested in additional details, I dedicated a whole article to multimodal LLMs earlier this year that goes over these two methods step by step: Understanding Multimodal LLMs -- An introduction to the main techniques and latest models . 9.2 Nvidia's hybrid approach Given all the multimodal developments this year, to me, NVIDIA's paper NVLM: Open Frontier-Class Multimodal LLMs stands out for its comprehensive apples-to-apples comparison of these multimodal approaches. Rather than focusing on a single method, they directly compared: Method A: The Unified Embedding Decoder Architecture ("decoder-only architecture," NVLM-D), Method B: The Cross-Modality Attention Architecture ("cross-attention-based architecture," NVLM-X), A hybrid approach (NVLM-H). Overview of the three multimodal approaches. (Annotated figure from the NVLM: Open Frontier-Class Multimodal LLMs paper: https://arxiv.org/abs/2409.11402) As summarized in the figure above, NVLM-D aligns with Method A, and NVLM-X corresponds to Method B, as discussed earlier. The hybrid model (NVLM-H) combines the strengths of both approaches: it first accepts an image thumbnail as input, followed by a dynamic number of patches processed through cross-attention to capture finer high-resolution details. In summary, the key findings are as follows: NVLM-X: Offers superior computational efficiency for high-resolution images. NVLM-D: Delivers higher accuracy for OCR-related tasks. NVLM-H: Combines the strengths of both approaches for optimal performance. Traditionally, LLMs are trained on the correct solution path (shortcut learning); in journey learning, the supervised finetuning encompasses the whole trial-and-error correction process. Annotated figure from the O1 Replication Report, https://arxiv.org/abs/2410.18982 It's worth noting that the journey learning approach is somewhat similar to the tree-based or beam-search methods with revisions, as discussed earlier in the "8. August: Improving LLMs by Scaling Inference-Time Compute" section of this article. The subtle difference, however, is that the researchers create journey learning training examples for model finetuning, rather than simply applying this technique during inference. (It's worth noting that I couldn't find any information on the techniques they used to augment the inference process.) 10.2 Constructing long thoughts The researchers constructed a reasoning tree to derive an extended thought process from it, emphasizing trial and error. This approach diverges from traditional methods that prioritize finding a direct path to the correct answer with valid intermediate steps. In their framework, each node in the reasoning tree was annotated with a rating provided by a reward model, indicating whether the step was correct or incorrect, along with reasoning to justify this evaluation. Next, they trained a deepseek-math-7b-base model via supervised finetuning and DPO. Here, they trained two models. 1. First they used the traditional shortcut training paradigm where only the correct intermediate steps were provided. 2. Second, they trained the model with their proposed journey learning approach that included the thought process three with correct and incorrect answers, backtracking, and so forth. (Sidenote: They only used 327 examples in each case!) As shown in the figure below, the journey learning process outperformed shortcut learning by quite a wide margin on the MATH500 benchmark dataset. LLMs trained with shortcut and journey learning. Annotated figure from the O1 Replication Report, https://arxiv.org/abs/2410.18982 10.3 Distillation -- the quick fix? One month later, the team released another report: O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? (November 2024) by Huang and colleagues. Here, they used a distillation approach, meaning they used careful prompting to extract the thought processes from o1 to train a model to reach the same performance. Since this is a long article, I won't go over the details, but I wanted to share an interesting figure from that paper that summarizes the cost trade-offs of collecting long-thought data. They got really good performance with this distillation approach, performing on-part with o1-preview and o1-mini. However, along with these experiments, the researchers also shared some interesting and important thoughts about the state of research in light of this approach, which I will summarize in the next section. 10.4 The state of AI research One big focus of the part 2 report was the "Bitter Lesson of Simple Distillation". Sure, distillation works well in practice, but it isn't what drives progress. In the best case, using distillation, you are matching the performance of of an existing upstream model (but you are not setting a new performance record.) Below are three quotes from the paper that might serve as a warning call about the current status quo: "This shift from “how it works” to 'what works' represents a fundamental change in research mentality that could have far-reaching consequences for the field’s future innovation capacity." "This erosion of first-principles thinking is particularly concerning as it undermines the very foundation of scientific innovation." "Pressure to produce quick results may overshadow the value of deeper technical investigations, while students may be discouraged from pursuing more challenging, fundamental research directions." Float32 was the standard 32-bit floating-point format widely used for training deep neural networks, as it offers a good balance between range and precision. Everything below float32 is nowadays considered low-precision (although the definition of "low" is kind of a moving goalpost similar to the "large" in large language models). Float16, or half-precision, uses just 16 bits, saving memory and speeding up computation but providing a narrower dynamic range. Comparison between 32-bit and 16-bit floating point precision Bfloat16 (brain float 16) is also a 16-bit format but trades off some of float16’s precision for a larger exponent, allowing it to represent very large and very small numbers more effectively. As a result, bfloat16 can help avoid numeric overflow or underflow in deep learning applications, although its lower precision can still lead to rounding errors Comparison between regular 16-bit floating point and the popular 16-bit brain floating point precision If you want to learn more about the different precision formats and their impact on LLM model behavior, you might like the lengthier intro in my previous The Missing Bits: Llama 2 Weights Have Changed article. Also note that I am only showing 32- and 16-bit formats, whereas there's currently a race to even lower precisions for training, e.g., the 8-bit format that was mentioned (as experimental) in the Llama 3 paper. (The DeepSeek-v3 model that was released on Dec 26 was entirely pretrained in 8-bit floating point precision.) 11.3 Precision scaling laws takeaways It's a long and interesting paper that I recommend reading in full. However, to get to the main point, the researchers extend the original Chinchilla scaling laws by adding a "precision" factor P. Concretely, they reinterpret the model parameter count N as an "effective parameter count" that shrinks as the precision decreases. (For the mathematical formulas, defer to the paper.) Plus, they added an extra term to capture how post-training quantization degrades model performance. (I realize that I didn't write an intro to quantization, but due to the excessive length of this article already, I may have to defer this to another time.) The figure below is a nice illustration that more pretraining data is not always better and can actually be harmful if models are quantized after training with very small precision (int3), which I found super interesting. The effect of more training data on the validation loss for various post-quantization formats So, as a takeaway from the figure above, one might say that models trained on more and more data (like Llama 3) become harder to quantize to lower precision formats after training due to being "overtrained" on too much data. 11.4 Model scaling laws in 2025 Besides providing a much-needed update to the Chinchilla scaling laws, the research on Precision Scaling Laws provides an interesting perspective on a critical challenge for 2025: as models like LLaMA-3 are trained on larger datasets, they may become harder to quantize to low precision formats like INT3 without performance loss. This finding underscores the need to rethink the "more data is better" mindset, balancing dataset size with the practical constraints of efficient inference. It's also an important insight for driving hardware optimization. One of the aspects that I think is often neglected in these scaling laws studies is the dataset's quality. I think the pretraining data's nature can have a significant impact. (More on that in the Phi-4 discussion below.) 12. December: Phi-4 and Learning from Synthetic Data Several interesting models were released in the latter half of 2024, including the impressive DeepSeek-V3 on Christmas day. While it might not be the biggest model release, ultimately, I decided to go with Microsoft's Phi-4 Technical Report because it offers interesting insights into the use of synthetic data. 12.1 Phi-4 performance The Phi-4 Technical Report (December 2024) by Abdin and colleagues describes the training of Microsoft's latest 14-billion-parameter open-weight LLM. What makes Phi-4 particularly interesting is that it was trained primarily on synthetic data generated by GPT-4o. According to the benchmarks, it outperforms other LLMs of a similar size, including its predecessor, Phi-3, which was trained predominantly on non-synthetic data. Performance of phi-4 compared to other models of similar and different sizes (annotated table from the phi-4 paper, https://arxiv.org/abs/2412.08905) I’m not entirely sure why the model performs worse on SimpleQA, as shown in the table above. But one possible explanation is that SimpleQA is a relatively new benchmark, released on October 30, 2024. Since it was developed by OpenAI as part of their evaluation suite, it might not have been included in the training data for GPT-4o or incorporated into the web-crawled datasets. Furthermore, because GPT-4o was used to generate the synthetic data for this evaluation, none of the models would have encountered SimpleQA during training. However, phi-4 might be overfitting to other benchmarks, which could explain its comparatively lower performance on this unseen SimpleQA dataset. Anyways, that's just my hypothesis. 12.2 Synthetic data learnings Let's look at the dataset composition before summarizing some of the ablation studies presented in this paper. Dataset mix for training phi-4 (annotated table from the phi-4 paper, https://arxiv.org/abs/2412.08905). The researchers observed that while synthetic data is generally beneficial, models trained exclusively on synthetic data performed poorly on knowledge-based benchmarks. To me, this raises the question: does synthetic data lack sufficient knowledge-specific information, or does it include a higher proportion of factual errors, such as those caused by hallucinations? At the same time, the researchers found that increasing the number of training epochs on synthetic data boosted the performance more than just adding more web data, as shown in the figure below. Model performance comparison for different synthetic/web dataset ratios. (Annotated figure from the phi-4 paper, https://arxiv.org/abs/2412.08905). In summary, an excessive proportion of synthetic data in the mix negatively impacts knowledge-based performance. However, within a more balanced synthetic-to-web data mix, increasing the number of iterations (epochs) over the synthetic dataset is beneficial. 12.3 Future importance of synthetic data The phi-4 technical report offers interesting insights into the use of synthetic data, namely that it can be highly beneficial for model pre-training. Especially since scaling laws are said to be plateauing concerning both model and dataset sizes (although the Llama 3 paper noted that they haven't seen a convergence at the 15T token level yet), researchers and engineers are looking for alternative ways to keep pushing the envelope. Of course, the refinement and addition of pre- and especially post-training techniques will likely remain one of the big needle movers. Still, I think that the use of synthetic data will be regarded as an effective way to either create a) pretrained base models with less data or b) create even better base models (think 15 trillion tokens from the Llama 3 dataset plus 40% synthetic data tokens added to it). I see the use of high-quality data as analogous to transfer learning. Instead of pre-training a model on raw, unstructured internet data and refining it during post-training, leveraging (some) synthetic data generated by a high-quality model (such as GPT-4o, which has already undergone extensive refinement) may serve as a kind of jumpstart. In other words, the use of high-quality training data might enable the model to learn more effectively from the outset. Conclusion & Outlook I hope you found these research summaries useful! As always, this article ended up being longer than I originally intended. But let me close out with a relatively short and snappy section on my predictions (or expectations) for 2025. Multimodal LLMs Last year, I predicted LLMs would become increasingly multimodal. Now, all major proprietary LLM providers offer multimodal (or at least image) support. So, the transformation is now fully underway, and we will also see more open-source efforts toward this. Based on what I've seen and read, there's definitely been a sharp increase in multimodal papers. Maybe followed by my open-source finetuning methods and resources; although I'd argue for many use cases, text-only suffices and will continue to suffice, and the main focus will be on developing better reasoning models (like o1 and the upcoming o3). Computational efficiency Pretraining and using LLMs is relatively expensive. So, I expect that we are going to see more clever tricks to improve computational efficiency of LLMs in the foreseeable future. For reference, training the recent DeepSeek-v3 model would cost $5 million dollars assuming the GPU rental sticker prices (and this doesn't include hyperparameter tuning, failed runs, and personnel cost). Back-off-the-envelope calculation from the DeepSeek-v3 report, https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf By the way, according to the official Meta AI Llama 3 model card , Llama 3 405B used even ~10x more compute (30.84 million GPU hours vs 2.66 million GPU hours). Popular examples of techniques to make LLMs efficient (although not all apply during training) include a mixture of experts (as discussed in my part 1 article), grouped-query attention as found in Llama models, and many others. Another interesting one is the use of multihead latent attention, as found in DeepSeek models, to make KV-caching in multihead attention more efficient. Another interesting recent route is targeting the model input. For instance, the recently proposed Byte Latent Transformer improves efficiency by dynamically encoding bytes into entropy-based patches, optimizing compute for scalability and faster inference without tokenization. State space models You may have noticed that I didn't cover state space models this year. That’s because my current focus is primarily on transformer-based LLMs. While I find state space models super interesting, they still seem quite experimental at this stage. Besides, transformers continue to demonstrate exceptional performance across a wide range of tasks, making it not very tempting to consider alternatives. However, that doesn't mean there hasn't been any progress on the state space model front. I've seen a bunch of interesting papers in this area. And one interesting trend I noticed is that they are now all more or less hybrid models integrating self-attention from transformer models. For example, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale , The Mamba in the Llama: Distilling and Accelerating Hybrid Models , and Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling .

0 views

Ahead of AI 11 months ago

Noteworthy AI Research Papers of 2024 (Part One)

To kick off the year, I've finally been able to complete the draft of this AI Research Highlights of 2024 article. It covers a variety of topics, from mixture-of-experts models to new LLM scaling laws for precision. Reflecting on all the major research highlights of 2024 would probably require writing an entire book. It's been an extraordinarily productive year, even for such a fast-moving field. To keep things reasonably concise, I decided to focus exclusively on LLM research this year. But even then, how does one choose a subset of papers from such an eventful year? The simplest approach I could think of was to highlight one paper per month: January through December 2024. So, in this article, I'll share research papers that I personally found fascinating, impactful, or, ideally, both. However, note that this article is just Part One , focusing on the first half of 2024 from January through June. Part 2 of this series, covering July to December, will be shared later in January. The selection criteria are admittedly subjective, based on what stood out to me this year. I've also aimed for some variety, so it's not all just about LLM model releases. If you're looking for a broader list of AI research papers, feel free to check out my earlier article ( LLM Research Papers: The 2024 List ). For those who read my previous article , I’m happy to share that I’m already feeling a bit better and slowly but steadily recovering! I also want to express my heartfelt thanks for all the kind wishes and support. It truly meant the world to me and helped me through some tough days! Happy new year and happy reading! Only a few days into January 2024, the Mistral AI team shared the Mixtral of Experts paper (8 Jan 2024), which described Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) model. The paper and model were both very influential at the time, as Mixtral 8x7B was (one of) the first open-weight MoE LLMs with an impressive performance: it outperformed Llama 2 70B and GPT-3.5 across various benchmarks. An MoE, or Mixture of Experts, is an ensemble model that combines several smaller "expert" subnetworks inside the GPT-like decoder architecture. Each subnetwork is said to be responsible for handling different types of tasks or, more concretely, tokens. The idea here is that by using multiple smaller subnetworks instead of one large network, MoEs aim to allocate computational resources more efficiently. In particular, in Mixtral 8x7B, is to replace each feed-forward module in a transformer architecture with 8 expert layers, as illustrated in the figure below. Annotated transformer architecture from Attention Is All You Need, https://arxiv.org/abs/1706.03762 "Sparse" in the context of a "Sparse Mixture of Experts" refers to the fact that at any given time, only a subset of the expert layers (typically 1 or 2 out of the 8 in Mixtral 8x7B) are actively used for processing a token. As illustrated in the figure above, the subnetworks replace the feed-forward module in the LLM. A feed-forward module is essentially a multilayer perceptron. In PyTorch-like pseudocode, it essentially looks like this: In addition, there is also a Router module (also known as a gating network ) that redirects each of the token embeddings to the 8 expert feed-forward modules, where only a subset of these experts are active at a time. Since there are 11 more papers to cover in this article, I want to keep this description of the Mixtral model brief. However, you can find additional details in my previous article, Model Merging, Mixtures of Experts, and Towards Smaller LLMs . At the beginning of the year, I would have thought that open-weight MoE models would be more popular and widely used than they are today. While they are not irrelevant, many state-of-the-art models still rely on dense (traditional) LLMs rather than MoEs though, e.g., Llama 3, Qwen 2.5, Gemma 2, etc. However, it is, of course, impossible to say what proprietary architectures like GPT-4, Gemini, and Claude are based on; they might as well be using MoE under the hood. In any case, MoE architectures are still relevant, especially as they offer a way to scale large language models efficiently by activating only a subset of the model's parameters for each input, thus reducing computation costs without sacrificing model capacity. By the way, after writing this article, there was a nice surprise release of the very well-performing DeepSeek-V3 model in December , which uses a MoE architecture. So, yes, MoEs continue to be very relevant! If you are finetuning open-weight LLMs, chances are high that you have been using low-rank adaptation (LoRA), a method for parameter-efficient LLM finetuning, at some point. If you are new to LoRA, I have written a previous article on Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) that you might helpful, and I have a from-scratch code implementation in Appendix D of my Build A Large Language Model (From Scratch) book. Since LoRA is such a popular and widely used method, and since I had so much fun implementing and playing with a newer variant, my pick for February is DoRA: Weight-Decomposed Low-Rank Adaptation (February 2024) by Liu and colleagues. Before introducing DoRA, here’s a quick LoRA refresher: Full finetuning updates each large weight matrix W in an LLM by computing a large weight update matrix ΔW . LoRA approximates ΔW as the product of two smaller matrices A and B . So, Instead of W + ΔW , we have W + A.B . This greatly reduces computational and memory overhead. The figure below illustrates these formulas for full finetuning (left) and LoRA (right) side by side. An illustration of regular finetuning (left) and LoRA finetuning (right). In DoRA: Weight-Decomposed Low-Rank Adaptation (February 2024), Liu and colleagues.extend LoRA by first decomposing a pretrained weight matrix into two parts: a magnitude vector m and a directional matrix V . This decomposition is rooted in the idea that any vector can be represented by its length (magnitude) and direction (orientation), and here we apply it to each column vector of a weight matrix. Once we have m and V , DoRA applies LoRA-style low-rank updates only to the directional matrix V , while allowing the magnitude vector m to be trained separately. Annotated illustration from the DoRA paper (https://arxiv.org/abs/2402.09353) This two-step approach gives DoRA more flexibility than standard LoRA. Rather than uniformly scaling both magnitude and direction as LoRA tends to do, DoRA can make subtle directional adjustments without necessarily increasing the magnitude. The result is improved performance and robustness, as DoRA can outperform LoRA even when using fewer parameters and is less sensitive to the choice of rank. Again, I am keeping this section brief since there are 10 more to go, but if you are interested in additional details, I dedicated a whole article to this method earlier this year: Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch . DoRA is a small, logical improvement over the original LoRA method. While it hasn’t been widely adopted yet, it adds minimal complexity and is worth considering the next time you finetune an LLM. In general, I expect LoRA and similar methods to remain popular. For example, Apple recently mentioned in their Apple Intelligence Foundation Language Models paper that they use LoRA for on-device task specialization of LLMs. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. As far as I can tell, instruction-finetuning is the most popular form of finetuning by LLM practitioners. The goal here is to get openly available LLMs to better follow instructions or specialize these LLMs on subsets or new instructions. However, when it comes to taking in new knowledge, continued pretraining (sometimes also referred to continually pretraining) is the way to go. In this section, I want to briefly summarize the refreshingly straightforward Simple and Scalable Strategies to Continually Pre-train Large Language Models (March 2024) paper by Ibrahim and colleagues. This 24-page Continually Pre-train Large Language Models paper reports a large number of experiments and comes with countless figures, which is very thorough for today's standards. What were the main tips for applying continued pretraining successfully? 1. Simple re-warming and re-decaying the learning rate. 2. Adding a small portion (e.g., 5%) of the original pretraining data to the new dataset to prevent catastrophic forgetting. Note that smaller fractions like 0.5% and 1% were also effective. To be a bit more concrete regarding point 1, re-warming and re-decaying, this means we employ the exact same learning rate schedule that was used during the initial pretraining stage of an LLM as shown in the figure below. A schedule for continued pretraining. Figure based on Build a Large Language Model From Scratch, https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-D/01_main-chapter-code/appendix-D.ipynb As far as I know, the re-warming and re-decaying, as well as adding original pretraining data to the new data, is more or less common knowledge. However, I really appreciate that the researchers took the time to formally test this method in this very detailed 24-page report. If you are interested in additional details, I discussed this paper more thoroughly in my previous Tips for LLM Pretraining and Evaluating Reward Models article . I have no reason to believe that these methods will not continue to work for future LLMs. However, it is important to note that pretraining pipelines have become more sophisticated in recent months, consisting of multiple stages, including short- and long-context pretraining. (I’ve written more about it in New LLM Pre-training and Post-training Paradigms ). So, for optimal results, the recipes suggested in this paper may need to be tweaked under certain circumstances. April is a tough choice. For instance, Kolmogorov-Arnold Networks made a big wave that month. But as far as I can tell, the excitement fizzled out pretty quickly. This is likely because their theoretical guarantees are difficult to implement practically, they lack competitive results or benchmarks, and other architectures are much more scalable. So, instead, my pick for April goes to a more practical paper: Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (April 2024) by Xu and colleagues. Before summarizing the paper itself, here's an overview of Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), both popular methods in aligning LLMs via Reinforcement Learning with Human Feedback (RLHF). RLHF is the method of choice to align LLMs with human preferences, improving the quality but also the safety of their responses. The typical (simplified) LLM training lifecycle. Traditionally, RLHF-PPO has been a crucial step in training LLMs for models and platforms like InstructGPT and ChatGPT. However, DPO started gaining traction last year due to its simplicity and effectiveness. In contrast to RLHF-PPO, DPO does not require a separate reward model. Instead, it directly updates the LLM using a classification-like objective. Many LLMs now utilize DPO, although comprehensive comparisons with PPO are lacking. Below are two resources on RLHF and DPO I developed and shared earlier this year: LLM Training: RLHF and Its Alternatives Direct Preference Optimization (DPO) for LLM Alignment (From Scratch) Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study is a well-written paper with numerous experiments and results. The key conclusions are that PPO tends to outperform DPO, and that DPO is inferior when dealing with out-of-distribution data. Here, out-of-distribution data means the language model was previously trained on instruction data (via supervised finetuning) that differs from the preference data used for DPO. For instance, a model might be trained on the general Alpaca dataset before undergoing DPO finetuning on a different preference-labeled dataset. (However, one way to improve DPO on such out-of-distribution data is to first conduct a supervised instruction-finetuning step using the preference dataset, and then perform DPO finetuning.) The main findings are summarized in the figure below. Annotated table from the Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (https://arxiv.org/abs/2404.10719) paper. 4.3 How are PPO and DPO used today? PPO might have a slight edge when it comes to the raw modeling performance of the resulting LLM. However, DPO is much easier to implement and computationally more efficient to apply (you don't have to train and use a separate reward model, after all). Hence, to the best of my knowledge, DPO is also much more widely used in practice than RLHF-PPO. One interesting example is Meta AI's Llama models. While Llama 2 was trained with RLHF-PPO, the newer Llama 3 models used DPO. Interestingly, recent models even use both PPO and DPO nowadays. Recent examples include Apple's Foundation Models and Allen AI's Tulu 3 . I found another LoRA paper this year particularly interesting (this is the last LoRA paper in this 12-paper selection, I promise!). I wouldn't call it groundbreaking, but I really like it since it formalizes some of the common knowledge around finetuning LLMs with (and without) LoRA: LoRA Learns Less and Forgets Less (May 2024) by Biderman and colleagues. LoRA Learns Less and Forgets Less is an empirical study comparing low-rank adaptation (LoRA) to full finetuning on large language models (LLMs), focusing on two domains (programming and mathematics) and two tasks (instruction finetuning and continued pretraining). Check out the February section above if you'd like a refresher on LoRA before proceeding. The LoRA Learns Less and Forgets Less study shows LoRA learns noticeably less than full finetuning, especially in tasks like coding, where new knowledge needs to be acquired. The gap is smaller when only instruction finetuning is performed. This suggests that pretraining on new data (learning new knowledge) benefit more from full finetuning than converting a pretrained model into an instruction follower. Full finetuning vs LoRA. The performance is measured on HumanEval, which is a dataset consisting of 164 coding challenges. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673 . There are some more nuances, though. For math tasks, for example, the difference between LoRA and full finetuning shrinks. This may be because math problems are more familiar to the LLM, and they likely encountered similar problems during pretraining. In contrast, coding involves a more distinct domain, requiring more new knowledge. Thus, the farther a new task is from the model’s pretraining data, the more beneficial full finetuning becomes in terms of learning capacity. When examining how much previously acquired knowledge is lost, LoRA consistently forgets less. This is particularly clear when adapting to data far from the source domain (e.g., coding). With coding tasks, full finetuning leads to significant forgetting, while LoRA preserves more original capabilities. In math, where the model’s original knowledge was already closer to the new task, the difference is less pronounced. Full finetuning vs LoRA on the original source tasks after training on programming data. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673 . Overall, there is a trade-off: full finetuning is better for absorbing new knowledge from more distant domains but leads to more forgetting of previously learned tasks. LoRA, by changing fewer parameters, learns less new information but retains more of the original capabilities. The study primarily compares LoRA to full finetuning. In practice, LoRA has gained popularity because it is far more resource-efficient than full finetuning. In many cases, full finetuning is simply not feasible due to hardware constraints. Moreover, if you only need to address specialized applications, LoRA alone may be sufficient. Since LoRA adapters can be stored separately from the base LLM, it's easy to preserve the original capabilities while adding new ones. Additionally, it's possible to combine both methods by using full finetuning for knowledge updates and LoRA for subsequent specialization. In short, I think both methods will continue to be very relevant in the upcoming year(s). It's more about using the right approach for the task at hand. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (June 2024) paper by Penedo and colleagues describes the creation of a 15 trillion token dataset for LLMs and making it publicly available, including a link to download the dataset and a code repository ( datatrove/examples/fineweb.py ) to reproduce the dataset preparation steps. Since several other large datasets for LLM pretraining are available, what's so special about this one? Other datasets are comparatively small: RefinedWeb (500B tokens), C4 (172B tokens), the Common Crawl-based part of Dolma 1.6 (3T tokens) and 1.7 (1.2T tokens), The Pile (340B tokens), SlimPajama (627B tokens), the deduplicated variant of RedPajama (20T tokens), English CommonCrawl section of Matrix (1.3T tokens), English CC-100 (70B tokens), Colossal-OSCAR (850B tokens). For example, ~360 billion tokens are only suited for small LLMs (for instance, 1.7 B, according to the Chinchilla scaling laws ). On the other hand, the 15 trillion tokens in the FineWeb dataset should be optimal for models up to 500 billion parameters according to the Chinchilla scaling laws. (Note that RedPajama contains 20 trillion tokens, but the researchers found that models trained on RedPajama result in poorer quality than FineWeb due to the different filtering rules applied.) Illustration of the dataset sizes used to pretrain LLMs over the years. Note that this is simply a general reference and is not directly related to the FineWeb paper or the Chinchilla scaling laws paper. In short, the FineWeb dataset (English-only) makes it theoretically possible for researchers and practitioners to train large-scale LLMs. (Side note: The Llama 3 models with 8B, 70B, and 405B sizes were trained on 15 trillion tokens as well, but Meta AI's training dataset is not publicly available.) In addition, the paper contains principled ablation studies and insights into how the filtering rules were developed and applied to arrive at the FineWeb dataset (starting from the CommonCrawl web corpus). In short, for each filtering rule they tried, they took a 360 billion token random sample from the original and the filtered data and then trained a small 1.71 billion parameter Llama-like model to see whether the filtering rule is beneficial or not based on the models' performances on standard benchmarks such as HellaSwag, ARC, MMLU, and others. 6.3 The relevance of FineWeb today Overall, while pretraining multi-billion parameter LLMs may still be beyond the reach of most research labs and companies, this dataset is a substantial step toward democratizing the study and development of LLMs. In summary, this paper represents a commendable effort and introduces a valuable public resource for advancing pretraining in LLMs. I hope you found the research summaries useful! Since I am still recovering from my injury, and since it would have been an excessively long article anyway, I decided to split this year's review article into two parts. The second (July to December) part is actually even more exciting (for me personally), as I am discussing the more recent papers on scaling laws, reproducing O1, and the role of synthetic data in LLM training. In addition, I will also share my thoughts for 2025 and what I expect to be on the horizon. Stay tuned! This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Subscribe now Annotated transformer architecture from Attention Is All You Need, https://arxiv.org/abs/1706.03762 "Sparse" in the context of a "Sparse Mixture of Experts" refers to the fact that at any given time, only a subset of the expert layers (typically 1 or 2 out of the 8 in Mixtral 8x7B) are actively used for processing a token. As illustrated in the figure above, the subnetworks replace the feed-forward module in the LLM. A feed-forward module is essentially a multilayer perceptron. In PyTorch-like pseudocode, it essentially looks like this: In addition, there is also a Router module (also known as a gating network ) that redirects each of the token embeddings to the 8 expert feed-forward modules, where only a subset of these experts are active at a time. Since there are 11 more papers to cover in this article, I want to keep this description of the Mixtral model brief. However, you can find additional details in my previous article, Model Merging, Mixtures of Experts, and Towards Smaller LLMs . 1.2 The relevance of MoE models today At the beginning of the year, I would have thought that open-weight MoE models would be more popular and widely used than they are today. While they are not irrelevant, many state-of-the-art models still rely on dense (traditional) LLMs rather than MoEs though, e.g., Llama 3, Qwen 2.5, Gemma 2, etc. However, it is, of course, impossible to say what proprietary architectures like GPT-4, Gemini, and Claude are based on; they might as well be using MoE under the hood. In any case, MoE architectures are still relevant, especially as they offer a way to scale large language models efficiently by activating only a subset of the model's parameters for each input, thus reducing computation costs without sacrificing model capacity. By the way, after writing this article, there was a nice surprise release of the very well-performing DeepSeek-V3 model in December , which uses a MoE architecture. So, yes, MoEs continue to be very relevant! 2. February: Weight-decomposed LoRA If you are finetuning open-weight LLMs, chances are high that you have been using low-rank adaptation (LoRA), a method for parameter-efficient LLM finetuning, at some point. If you are new to LoRA, I have written a previous article on Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) that you might helpful, and I have a from-scratch code implementation in Appendix D of my Build A Large Language Model (From Scratch) book. Since LoRA is such a popular and widely used method, and since I had so much fun implementing and playing with a newer variant, my pick for February is DoRA: Weight-Decomposed Low-Rank Adaptation (February 2024) by Liu and colleagues. 2.2 LoRA Recap Before introducing DoRA, here’s a quick LoRA refresher: Full finetuning updates each large weight matrix W in an LLM by computing a large weight update matrix ΔW . LoRA approximates ΔW as the product of two smaller matrices A and B . So, Instead of W + ΔW , we have W + A.B . This greatly reduces computational and memory overhead. The figure below illustrates these formulas for full finetuning (left) and LoRA (right) side by side. An illustration of regular finetuning (left) and LoRA finetuning (right). 2.2 From LoRA to DoRA In DoRA: Weight-Decomposed Low-Rank Adaptation (February 2024), Liu and colleagues.extend LoRA by first decomposing a pretrained weight matrix into two parts: a magnitude vector m and a directional matrix V . This decomposition is rooted in the idea that any vector can be represented by its length (magnitude) and direction (orientation), and here we apply it to each column vector of a weight matrix. Once we have m and V , DoRA applies LoRA-style low-rank updates only to the directional matrix V , while allowing the magnitude vector m to be trained separately. Annotated illustration from the DoRA paper (https://arxiv.org/abs/2402.09353) This two-step approach gives DoRA more flexibility than standard LoRA. Rather than uniformly scaling both magnitude and direction as LoRA tends to do, DoRA can make subtle directional adjustments without necessarily increasing the magnitude. The result is improved performance and robustness, as DoRA can outperform LoRA even when using fewer parameters and is less sensitive to the choice of rank. Again, I am keeping this section brief since there are 10 more to go, but if you are interested in additional details, I dedicated a whole article to this method earlier this year: Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch . 2.3 The future of LoRA and LoRA-like methods DoRA is a small, logical improvement over the original LoRA method. While it hasn’t been widely adopted yet, it adds minimal complexity and is worth considering the next time you finetune an LLM. In general, I expect LoRA and similar methods to remain popular. For example, Apple recently mentioned in their Apple Intelligence Foundation Language Models paper that they use LoRA for on-device task specialization of LLMs. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 3. March: Tips for Continually Pretraining LLMs As far as I can tell, instruction-finetuning is the most popular form of finetuning by LLM practitioners. The goal here is to get openly available LLMs to better follow instructions or specialize these LLMs on subsets or new instructions. However, when it comes to taking in new knowledge, continued pretraining (sometimes also referred to continually pretraining) is the way to go. In this section, I want to briefly summarize the refreshingly straightforward Simple and Scalable Strategies to Continually Pre-train Large Language Models (March 2024) paper by Ibrahim and colleagues. 3.1 Simple techniques work This 24-page Continually Pre-train Large Language Models paper reports a large number of experiments and comes with countless figures, which is very thorough for today's standards. What were the main tips for applying continued pretraining successfully? 1. Simple re-warming and re-decaying the learning rate. 2. Adding a small portion (e.g., 5%) of the original pretraining data to the new dataset to prevent catastrophic forgetting. Note that smaller fractions like 0.5% and 1% were also effective. To be a bit more concrete regarding point 1, re-warming and re-decaying, this means we employ the exact same learning rate schedule that was used during the initial pretraining stage of an LLM as shown in the figure below. A schedule for continued pretraining. Figure based on Build a Large Language Model From Scratch, https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-D/01_main-chapter-code/appendix-D.ipynb As far as I know, the re-warming and re-decaying, as well as adding original pretraining data to the new data, is more or less common knowledge. However, I really appreciate that the researchers took the time to formally test this method in this very detailed 24-page report. If you are interested in additional details, I discussed this paper more thoroughly in my previous Tips for LLM Pretraining and Evaluating Reward Models article . 3.2 Will these simple techniques continue to work? I have no reason to believe that these methods will not continue to work for future LLMs. However, it is important to note that pretraining pipelines have become more sophisticated in recent months, consisting of multiple stages, including short- and long-context pretraining. (I’ve written more about it in New LLM Pre-training and Post-training Paradigms ). So, for optimal results, the recipes suggested in this paper may need to be tweaked under certain circumstances. 4. April: DPO or PPO for LLM alignment, or both? April is a tough choice. For instance, Kolmogorov-Arnold Networks made a big wave that month. But as far as I can tell, the excitement fizzled out pretty quickly. This is likely because their theoretical guarantees are difficult to implement practically, they lack competitive results or benchmarks, and other architectures are much more scalable. So, instead, my pick for April goes to a more practical paper: Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (April 2024) by Xu and colleagues. 4.1 RLHF-PPO and DPO: What Are They? Before summarizing the paper itself, here's an overview of Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), both popular methods in aligning LLMs via Reinforcement Learning with Human Feedback (RLHF). RLHF is the method of choice to align LLMs with human preferences, improving the quality but also the safety of their responses. The typical (simplified) LLM training lifecycle. Traditionally, RLHF-PPO has been a crucial step in training LLMs for models and platforms like InstructGPT and ChatGPT. However, DPO started gaining traction last year due to its simplicity and effectiveness. In contrast to RLHF-PPO, DPO does not require a separate reward model. Instead, it directly updates the LLM using a classification-like objective. Many LLMs now utilize DPO, although comprehensive comparisons with PPO are lacking. Below are two resources on RLHF and DPO I developed and shared earlier this year: LLM Training: RLHF and Its Alternatives Direct Preference Optimization (DPO) for LLM Alignment (From Scratch)

Python

Machine Learning

0 views

Ahead of AI 11 months ago

LLM Research Papers: The 2024 List

It’s been a very eventful and exciting year in AI research. This is especially true if you are interested in LLMs. I had big plans for this December edition and was planning to publish a new article with a discussion of all my research highlights from 2024. I still plan to do so, but due to an accident and serious injury, I am currently unable to work at a computer and finish the draft. But I hope to recover in the upcoming weeks and be back on my feet soon. In the meantime, I want to share my running bookmark list of many fascinating (mostly LLM-related) papers I stumbled upon in 2024. It’s just a list, but maybe it will come in handy for those who are interested in finding some gems to read for the holidays. And if you are interested in more code-heavy reading and tinkering, My Build A Large Language Model (From Scratch) book is out on Amazon as of last month. In addition, I added a lot of bonus materials to the GitHub repository . Bonus materials in the GitHub repository (stars highlight my personal favorites) Thanks for your understanding and support, and I hope to make a full recovery soon and be back with the Research Highlights 2024 article in a few weeks! Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 1 Jan, Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models , https://arxiv.org/abs/2401.00788 2 Jan, A Comprehensive Study of Knowledge Editing for Large Language Models , https://arxiv.org/abs/2401.01286 2 Jan, LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning , https://arxiv.org/abs/2401.01325 2 Jan, Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models , https://arxiv.org/abs/2401.01335 2 Jan, LLaMA Beyond English: An Empirical Study on Language Capability Transfer , https://arxiv.org/abs/2401.01055 3 Jan, A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity , https://arxiv.org/abs/2401.01967 4 Jan, LLaMA Pro: Progressive LLaMA with Block Expansion , https://arxiv.org/abs/2401.02415 4 Jan, LLM Augmented LLMs: Expanding Capabilities through Composition , https://arxiv.org/abs/2401.02412 4 Jan, Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM , https://arxiv.org/abs/2401.02994 5 Jan, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism , https://arxiv.org/abs/2401.02954 5 Jan, Denoising Vision Transformers , https://arxiv.org/abs/2401.02957 7 Jan, Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon , https://arxiv.org/abs/2401.03462 8 Jan, Mixtral of Experts , https://arxiv.org/abs/2401.04088 8 Jan, MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts , https://arxiv.org/abs/2401.04081 8 Jan, A Minimaximalist Approach to Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2401.04056 9 Jan, RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation , https://arxiv.org/abs/2401.04679 10 Jan, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , https://arxiv.org/abs/2401.05566 11 Jan, Transformers are Multi-State RNNs , https://arxiv.org/abs/2401.06104 11 Jan, A Closer Look at AUROC and AUPRC under Class Imbalance , https://arxiv.org/abs/2401.06091 12 Jan, An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models , https://arxiv.org/abs/2401.06692 16 Jan, Tuning Language Models by Proxy , https://arxiv.org/abs/2401.08565 16 Jan, Scalable Pre-training of Large Autoregressive Image Models , https://arxiv.org/abs/2401.08541 16 Jan, Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering , https://arxiv.org/abs/2401.08500 16 Jan, RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture , https://arxiv.org/abs/2401.08406 17 Jan, ReFT: Reasoning with Reinforced Fine-Tuning , https://arxiv.org/abs/2401.08967 18 Jan, DiffusionGPT: LLM-Driven Text-to-Image Generation System , https://arxiv.org/abs/2401.10061 18 Jan, Self-Rewarding Language Models , https://arxiv.org/abs/2401.10020 18 Jan, VMamba: Visual State Space Model , https://arxiv.org/abs/2401.10166 19 Jan, Knowledge Fusion of Large Language Models , https://arxiv.org/abs/2401.10491 22 Jan, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities , https://arxiv.org/abs/2401.12168 22 Jan, WARM: On the Benefits of Weight Averaged Reward Models , https://arxiv.org/abs/2401.12187 22 Jan, Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text , https://arxiv.org/abs/2401.12070 24 Jan, MambaByte: Token-free Selective State Space Model , https://arxiv.org/abs/2401.13660 24 Jan, SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection , https://arxiv.org/abs/2401.13160 25 Jan, Rethinking Patch Dependence for Masked Autoencoders , https://arxiv.org/abs/2401.14391 25 Jan, Pix2gestalt: Amodal Segmentation by Synthesizing Wholes , https://arxiv.org/abs/2401.14398 25 Jan, Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities , https://arxiv.org/abs/2401.14405 26 Jan, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty , https://arxiv.org/abs/2401.15077 29 Jan, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models , https://arxiv.org/abs/2401.15947 29 Jan, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling , https://arxiv.org/abs/2401.16380 31 Jan, KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization , https://arxiv.org/abs/2401.18079 1 Feb, Efficient Exploration for LLMs , https://arxiv.org/abs/2402.00396 1 Feb, OLMo: Accelerating the Science of Language Models , https://arxiv.org/abs/2402.00838 1 Feb, Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization? , https://arxiv.org/abs/2402.00841 1 Feb, Repeat After Me: Transformers are Better than State Space Models at Copying , https://arxiv.org/abs/2402.01032 2 Feb, LiPO: Listwise Preference Optimization through Learning-to-Rank , https://arxiv.org/abs/2402.01878 2 Feb, FindingEmo: An Image Dataset for Emotion Recognition in the Wild , https://arxiv.org/abs/2402.01355 3 Feb, More Agents Is All You Need , https://arxiv.org/abs/2402.05120 5 Feb, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , https://arxiv.org/abs/2402.03300 6 Feb, MobileVLM V2: Faster and Stronger Baseline for Vision Language Model , https://arxiv.org/abs/2402.03766 6 Feb, A Phase Transition Between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention , https://arxiv.org/abs/2402.03902 6 Feb, Scaling Laws for Downstream Task Performance of Large Language Models , https://arxiv.org/abs/2402.04177 6 Feb, MOMENT: A Family of Open Time-series Foundation Models , https://arxiv.org/abs/2402.03885 6 Feb, Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models , https://arxiv.org/abs/2402.03749 6 Feb, Self-Discover: Large Language Models Self-Compose Reasoning Structures , https://arxiv.org/abs/2402.03620 7 Feb, Grandmaster-Level Chess Without Search , https://arxiv.org/abs/2402.04494 7 Feb, Direct Language Model Alignment from Online AI Feedback , https://arxiv.org/abs/2402.04792 8 Feb, Buffer Overflow in Mixture of Experts , https://arxiv.org/abs/2402.05526 9 Feb, The Boundary of Neural Network Trainability is Fractal , https://arxiv.org/abs/2402.06184 11 Feb, ODIN: Disentangled Reward Mitigates Hacking in RLHF , https://arxiv.org/abs/2402.07319 12 Feb, Policy Improvement using Language Feedback Models , https://arxiv.org/abs/2402.07876 12 Feb, Scaling Laws for Fine-Grained Mixture of Experts , https://arxiv.org/abs/2402.07871 12 Feb, Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model , https://arxiv.org/abs/2402.07610 12 Feb, Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping , https://arxiv.org/abs/2402.07610 12 Feb, Suppressing Pink Elephants with Direct Principle Feedback , https://arxiv.org/abs/2402.07896 13 Feb, World Model on Million-Length Video And Language With RingAttention , https://arxiv.org/abs/2402.08268 13 Feb, Mixtures of Experts Unlock Parameter Scaling for Deep RL , https://arxiv.org/abs/2402.08609 14 Feb, DoRA: Weight-Decomposed Low-Rank Adaptation , https://arxiv.org/abs/2402.09353 14 Feb, Transformers Can Achieve Length Generalization But Not Robustly , https://arxiv.org/abs/2402.09371 15 Feb, BASE TTS: Lessons From Building a Billion-Parameter Text-to-Speech Model on 100K Hours of Data , https://arxiv.org/abs/2402.08093 15 Feb, Recovering the Pre-Fine-Tuning Weights of Generative Models , https://arxiv.org/abs/2402.10208 15 Feb, Generative Representational Instruction Tuning , https://arxiv.org/abs/2402.09906 16 Feb, FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models , https://arxiv.org/abs/2402.10986 17 Feb, OneBit: Towards Extremely Low-bit Large Language Models , https://arxiv.org/abs/2402.11295 18 Feb, LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration , https://arxiv.org/abs/2402.11550 19 Feb, Reformatted Alignment , https://arxiv.org/abs/2402.12219 19 Feb, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling , https://arxiv.org/abs/2402.12226 19 Feb, Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs , https://arxiv.org/abs/2402.12030 19 Feb, LoRA+: Efficient Low Rank Adaptation of Large Models , https://arxiv.org/abs/2402.12354 20 Feb, Neural Network Diffusion , https://arxiv.org/abs/2402.13144 21 Feb, YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information , https://arxiv.org/abs/2402.13616 21 Feb, LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens , https://arxiv.org/abs/2402.13753 21 Feb, Large Language Models for Data Annotation: A Survey , https://arxiv.org/abs/2402.13446 22 Feb, TinyLLaVA: A Framework of Small-scale Large Multimodal Models , https://arxiv.org/abs/2402.14289 22 Feb, Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs , https://arxiv.org/abs/2402.14740 23 Feb, Genie: Generative Interactive Environments , https://arxiv.org/abs/2402.15391 26 Feb, CARTE: Pretraining and Transfer for Tabular Learning , https://arxiv.org/abs/2402.16785 27 Feb, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits , https://arxiv.org/abs/2402.17764 27 Feb, Sora Generates Videos with Stunning Geometrical Consistency , https://arxiv.org/abs/2402.17403 27 Feb, When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method , https://arxiv.org/abs/2402.17193 29 Feb, Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models , https://arxiv.org/abs/2402.19427 1 Mar, Learning and Leveraging World Models in Visual Representation Learning , https://arxiv.org/abs/2403.00504 3 Mar, Improving LLM Code Generation with Grammar Augmentation , https://arxiv.org/abs/2403.01632 3 Mar, The Hidden Attention of Mamba Models , https://arxiv.org/abs/2403.01590 4 Mar, Training-Free Pretrained Model Merging , https://arxiv.org/abs/2403.01753 4 Mar, Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures , https://arxiv.org/abs/2403.02308 5 Mar, The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning , https://arxiv.org/abs/2403.03218 5 Mar, Evolution Transformer: In-Context Evolutionary Optimization , https://arxiv.org/abs/2403.02985 5 Mar, Enhancing Vision-Language Pre-training with Rich Supervisions , https://arxiv.org/abs/2403.03346 5 Mar, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis , https://arxiv.org/abs/2403.03206 5 Mar, Design2Code: How Far Are We From Automating Front-End Engineering? , https://arxiv.org/abs/2403.03163 6 Mar, ShortGPT: Layers in Large Language Models are More Redundant Than You Expect , https://arxiv.org/abs/2403.03853 6 Mar, Backtracing: Retrieving the Cause of the Query , https://arxiv.org/abs/2403.03956 6 Mar, Learning to Decode Collaboratively with Multiple Language Models , https://arxiv.org/abs/2403.03870 6 Mar, SaulLM-7B: A pioneering Large Language Model for Law , https://arxiv.org/abs/2403.03883 6 Mar, Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning , https://arxiv.org/abs/2403.03864 6 Mar, 3D Diffusion Policy , https://arxiv.org/abs/2403.03954 6 Mar, MedMamba: Vision Mamba for Medical Image Classification , https://arxiv.org/abs/2403.03849 6 Mar, GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection , https://arxiv.org/abs/2403.03507 6 Mar, Stop Regressing: Training Value Functions via Classification for Scalable Deep RL , https://arxiv.org/abs/2403.03950 7 Mar, How Far Are We from Intelligent Visual Deductive Reasoning? , https://arxiv.org/abs/2403.04732 7 Mar, Common 7B Language Models Already Possess Strong Math Capabilities , https://arxiv.org/abs/2403.04706 8 Mar, Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context , https://arxiv.org/abs/2403.05530 8 Mar, Is Cosine-Similarity of Embeddings Really About Similarity? , https://arxiv.org/abs/2403.05440 8 Mar, LLM4Decompile: Decompiling Binary Code with Large Language Models , https://arxiv.org/abs/2403.05286 9 Mar, Algorithmic Progress in Language Models , https://arxiv.org/abs/2403.05812 11 Mar, Stealing Part of a Production Language Model , https://arxiv.org/abs/2403.06634 12 Mar, Chronos: Learning the Language of Time Series , https://arxiv.org/abs/2403.07815 13 Mar, Simple and Scalable Strategies to Continually Pre-train Large Language Models , https://arxiv.org/abs/2403.08763 13 Mar, Language Models Scale Reliably With Over-Training and on Downstream Tasks , https://arxiv.org/abs/2403.08540 14 Mar, BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences , https://arxiv.org/abs/2403.09347 14 Mar, LocalMamba: Visual State Space Model with Windowed Selective Scan , https://arxiv.org/abs/2403.09338 14 Mar, GiT: Towards Generalist Vision Transformer through Universal Language Interface , https://arxiv.org/abs/2403.09394 14 Mar, MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training , https://arxiv.org/abs/2403.09611 15 Mar, RAFT: Adapting Language Model to Domain Specific RAG , https://arxiv.org/abs/2403.10131 18 Mar, TnT-LLM: Text Mining at Scale with Large Language Models , https://arxiv.org/abs/2403.12173 18 Mar, Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression , https://arxiv.org/abs/2403.15447 19 Mar, PERL: Parameter Efficient Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2403.10704 20 Mar, RewardBench: Evaluating Reward Models for Language Modeling , https://arxiv.org/abs/2403.13787 20 Mar, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , https://arxiv.org/abs/2403.13372 21 Mar, RakutenAI-7B: Extending Large Language Models for Japanese , https://arxiv.org/abs/2403.15484 22 Mar, SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series , https://arxiv.org/abs/2403.15360 22 Mar, Can Large Language Models Explore In-Context? , https://arxiv.org/abs/2403.15371 22 Mar, LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement , https://arxiv.org/abs/2403.15042 25 Mar, LLM Agent Operating System , https://arxiv.org/abs/2403.16971 26 Mar, The Unreasonable Ineffectiveness of the Deeper Layers , https://arxiv.org/abs/2403.17887 27 Mar, BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text , https://arxiv.org/abs/2403.18421 27 Mar, ViTAR: Vision Transformer with Any Resolution , https://arxiv.org/abs/2403.18361 27 Mar, Long-form Factuality in Large Language Models , https://arxiv.org/abs/2403.18802 27 Mar, Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models , https://arxiv.org/abs/2403.18814 26 Mar, LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning , https://arxiv.org/abs/2403.17919 26 Mar, Mechanistic Design and Scaling of Hybrid Architectures , https://arxiv.org/abs/2403.17844 28 Mar, MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions , https://arxiv.org/abs/2403.19651 28 Mar, Model Stock: All We Need Is Just a Few Fine-Tuned Models , https://arxiv.org/abs/2403.19522 1 Apr, Do Language Models Plan Ahead for Future Tokens? , https://arxiv.org/abs/2404.00859 1 Apr, Bigger is not Always Better: Scaling Properties of Latent Diffusion Models , https://arxiv.org/abs/2404.01367 1 Apr, The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis , https://arxiv.org/abs/2404.01204 1 Apr, Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models , https://arxiv.org/abs/2404.04478 2 Apr, Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models , https://arxiv.org/abs/2404.02258 2 Apr, Long-context LLMs Struggle with Long In-context Learning , https://arxiv.org/abs/2404.02060 2 Apr, Emergent Abilities in Reduced-Scale Generative Language Models , https://arxiv.org/abs/2404.02204 2 Apr, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks , https://arxiv.org/abs/2404.02151 3 Apr, On the Scalability of Diffusion-based Text-to-Image Generation , https://arxiv.org/abs/2404.02883 3 Apr, BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models , https://arxiv.org/abs/2404.02827 3 Apr, Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models , https://arxiv.org/abs/2404.02747 4 Apr, Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences , https://arxiv.org/abs/2404.02151 4 Apr, Training LLMs over Neurally Compressed Text , https://arxiv.org/abs/2404.03626 4 Apr, CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues , https://arxiv.org/abs/2404.03820 5 Apr, ReFT: Representation Finetuning for Language Models , https://arxiv.org/abs/2404.03592 5 Apr, Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data , https://arxiv.org/abs/2404.03862 5 Apr, Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation , https://arxiv.org/abs/2404.04256 8 Apr, AutoCodeRover: Autonomous Program Improvement , https://arxiv.org/abs/2404.05427 8 Apr, Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence , https://arxiv.org/abs/2404.05892 8 Apr, CodecLM: Aligning Language Models with Tailored Synthetic Data , https://arxiv.org/abs/2404.05875 9 Apr, MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies , https://arxiv.org/abs/2404.06395 9 Apr, Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models , https://arxiv.org/abs/2404.06209 9 Apr, LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders , https://arxiv.org/abs/2404.05961 10 Apr, Adapting LLaMA Decoder to Vision Transformer , https://arxiv.org/abs/2404.06773 10 Apr, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention , https://arxiv.org/abs/2404.07143 11 Apr, LLoCO: Learning Long Contexts Offline , https://arxiv.org/abs/2404.07979 11 Apr, JetMoE: Reaching Llama2 Performance with 0.1M Dollars , https://arxiv.org/abs/2404.07413 11 Apr, Best Practices and Lessons Learned on Synthetic Data for Language Models , https://arxiv.org/abs/2404.07503 11 Apr, Rho-1: Not All Tokens Are What You Need , https://arxiv.org/abs/2404.07965 12 Apr, Pre-training Small Base LMs with Fewer Tokens , https://arxiv.org/abs/2404.08634 12 Apr, Dataset Reset Policy Optimization for RLHF , https://arxiv.org/abs/2404.08495 13 Apr, LLM In-Context Recall is Prompt Dependent , https://arxiv.org/abs/2404.08865 15 Apr, State Space Model for New-Generation Network Alternative to Transformers: A Survey , https://arxiv.org/abs/2404.09516 15 Apr, Chinchilla Scaling: A Replication Attempt , https://arxiv.org/abs/2404.10102 15 Apr, Learn Your Reference Model for Real Good Alignment , https://arxiv.org/abs/2404.09656 16 Apr, Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study , https://arxiv.org/abs/2404.10719 16 Apr, Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies , https://arxiv.org/abs/2404.08197 16 Apr, How Faithful Are RAG Models? Quantifying the Tug-of-War Between RAG and LLMs' Internal Prior , https://arxiv.org/abs/2404.10198 17 Apr, A Survey on Retrieval-Augmented Text Generation for Large Language Models , https://arxiv.org/abs/2404.10981 18 Apr, When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes , https://arxiv.org/abs/2404.12365 18 Apr, Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing , https://arxiv.org/abs/2404.12253 18 Apr, OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data , https://arxiv.org/abs/2404.12195 19 Apr, The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , https://arxiv.org/abs/2404.13208 22 Apr, How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study , https://arxiv.org/abs/2404.14047 22 Apr, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , https://arxiv.org/abs/2404.14219 22 Apr, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework , https://arxiv.org/abs/2404.14619 22 Apr, A Survey on Self-Evolution of Large Language Models , https://arxiv.org/abs/2404.14662 23 Apr, Multi-Head Mixture-of-Experts , https://arxiv.org/abs/2404.15045 23 Apr, NExT: Teaching Large Language Models to Reason about Code Execution , https://arxiv.org/abs/2404.14662 23 Apr, Graph Machine Learning in the Era of Large Language Models (LLMs) , https://arxiv.org/abs/2404.14928 24 Apr, Retrieval Head Mechanistically Explains Long-Context Factuality , https://arxiv.org/abs/2404.15574 25 Apr, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding , https://arxiv.org/abs/2404.16710 25 Apr, Make Your LLM Fully Utilize the Context , https://arxiv.org/abs/2404.16811 28 Apr, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report , https://arxiv.org/abs/2405.00732 30 Apr, Better & Faster Large Language Models via Multi-token Prediction , https://arxiv.org/abs/2404.19737 30 Apr, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing , https://arxiv.org/abs/2404.19543 30 Apr, A Primer on the Inner Workings of Transformer-based Language Models , https://arxiv.org/abs/2405.00208 30 Apr, When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively , https://arxiv.org/abs/2404.19705 30 Apr, KAN: Kolmogorov–Arnold Networks , https://arxiv.org/abs/2404.19756 1 May, Is Bigger Edit Batch Size Always Better? An Empirical Study on Model Editing with Llama-3 , https://arxiv.org/abs/2405.00664 1 May, Self-Play Preference Optimization for Language Model Alignment , https://arxiv.org/abs/2405.00675 1 May, A Careful Examination of Large Language Model Performance on Grade School Arithmetic , https://arxiv.org/abs/2405.00332 2 May, Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models , https://arxiv.org/abs/2405.01535 3 May, What Matters When Building Vision-Language Models? , https://arxiv.org/abs/2405.02246 5 May, Is Flash Attention Stable? , https://arxiv.org/abs/2405.02803 7 May, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention , https://arxiv.org/abs/2405.04437 7 May, xLSTM: Extended Long Short-Term Memory , https://arxiv.org/abs/2405.04517 8 May, You Only Cache Once: Decoder-Decoder Architectures for Language Models , https://arxiv.org/abs/2405.05254 8 May, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model , https://arxiv.org/abs/2405.04434 8 May, Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models , https://arxiv.org/abs/2405.05417 9 May, Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? , https://arxiv.org/abs/2405.05904 10 May, Value Augmented Sampling for Language Model Alignment and Personalization , https://arxiv.org/abs/2405.06639 12 May, PHUDGE: Phi-3 as Scalable Judge , https://arxiv.org/abs/2405.08029 13 May, RLHF Workflow: From Reward Modeling to Online RLHF , https://arxiv.org/abs/2405.07863 15 May, LoRA Learns Less and Forgets Less , https://arxiv.org/abs/2405.09673 15 May, Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model , https://arxiv.org/abs/2405.09215 16 May, Chameleon: Mixed-Modal Early-Fusion Foundation Models , https://arxiv.org/abs/2405.09818 17 May, Towards Modular LLMs by Building and Reusing a Library of LoRAs , https://arxiv.org/abs/2405.11157 19 May, SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization , https://arxiv.org/abs/2405.11582 20 May, MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning , https://arxiv.org/abs/2405.12130 22 May, Attention as an RNN , https://arxiv.org/abs/2405.13956 22 May, Dense Connector for MLLMs , https://arxiv.org/abs/2405.13800 23 May, AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability , https://arxiv.org/abs/2405.14129 23 May, SimPO: Simple Preference Optimization with a Reference-Free Reward , https://arxiv.org/abs/2405.14734 23 May, Instruction Tuning With Loss Over Instructions , https://arxiv.org/abs/2405.14394 24 May, The Road Less Scheduled , https://arxiv.org/abs/2405.15682 26 May, Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training , https://arxiv.org/abs/2405.15319 26 May, gzip Predicts Data-dependent Scaling Laws , https://arxiv.org/abs/2405.16684 27 May, Trans-LoRA: Towards Data-free Transferable Parameter Efficient Finetuning , https://arxiv.org/abs/2405.17258 28 May, VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections , https://arxiv.org/abs/2405.17991 28 May, LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models , https://arxiv.org/abs/2405.18377 29 May, Contextual Position Encoding: Learning to Count What's Important , https://arxiv.org/abs/2405.18719 2 Jun, Show, Don't Tell: Aligning Language Models with Demonstrated Feedback , https://arxiv.org/abs/2406.00888 3 Jun, Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models , https://arxiv.org/abs/2406.06563 3 Jun, OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models , https://arxiv.org/abs/2406.01775 3 Jun, The Geometry of Categorical and Hierarchical Concepts in Large Language Models , https://arxiv.org/abs/2406.01506 3 Jun, Towards Scalable Automated Alignment of LLMs: A Survey , https://arxiv.org/abs/2406.01252 4 Jun, Scalable MatMul-free Language Modeling , https://arxiv.org/abs/2406.02528 4 Jun, Block Transformer: Global-to-Local Language Modeling for Fast Inference , https://arxiv.org/abs/2406.02657 6 Jun, Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models , https://arxiv.org/abs/2406.04271 6 Jun, The Prompt Report: A Systematic Survey of Prompting Techniques , https://arxiv.org/abs/2406.06608 6 Jun, Transformers Need Glasses! Information Over-Squashing in Language Tasks , https://arxiv.org/abs/2406.04267 6 Jun, Are We Done with MMLU? , https://arxiv.org/abs/2406.04127 6 Jun, Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step , https://arxiv.org/abs/2406.04314 7 Jun, Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach , https://arxiv.org/abs/2406.04594 7 Jun, CRAG -- Comprehensive RAG Benchmark , https://arxiv.org/abs/2406.04744 7 Jun, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild , https://arxiv.org/abs/2406.04770 7 Jun, Mixture-of-Agents Enhances Large Language Model Capabilities , https://arxiv.org/abs/2406.04692 7 Jun, BERTs are Generative In-Context Learners , https://arxiv.org/abs/2406.04823 7 Jun, 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination , https://arxiv.org/abs/2406.05132 8 Jun, Creativity Has Left the Chat: The Price of Debiasing Language Models , https://arxiv.org/abs/2406.05587 10 Jun, Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation , https://arxiv.org/abs/2406.06525 10 Jun, Margin-aware Preference Optimization for Aligning Diffusion Models Without Reference , https://arxiv.org/abs/2406.06424 10 Jun, Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning , https://arxiv.org/abs/2406.06469 10 Jun, Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters , https://arxiv.org/abs/2406.05955 10 Jun, Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching , https://arxiv.org/abs/2406.06326 11 Jun, An Image is Worth 32 Tokens for Reconstruction and Generation , https://arxiv.org/abs/2406.07550 11 Jun, TextGrad: Automatic "Differentiation" via Text , https://arxiv.org/abs/2406.07496 11 Jun, Simple and Effective Masked Diffusion Language Models , https://arxiv.org/abs/2406.07524 11 Jun, Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement , https://arxiv.org/abs/2406.07138 11 Jun, Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling , https://arxiv.org/abs/2406.07522 12 Jun, Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing , https://arxiv.org/abs/2406.08464 12 Jun, What If We Recaption Billions of Web Images with LLaMA-3? , https://arxiv.org/abs/2406.08478 12 Jun, Large Language Model Unlearning via Embedding-Corrupted Prompts , https://arxiv.org/abs/2406.07933 12 Jun, Large Language Models Must Be Taught to Know What They Don't Know , https://arxiv.org/abs/2406.08391 12 Jun, An Empirical Study of Mamba-based Language Models , https://arxiv.org/abs/2406.07887 12 Jun, Discovering Preference Optimization Algorithms with and for Large Language Models , https://arxiv.org/abs/2406.08414 13 Jun, Transformers Meet Neural Algorithmic Reasoners , https://arxiv.org/abs/2406.09308 13 Jun, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding , https://arxiv.org/abs/2406.09297 13 Jun, An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels , https://arxiv.org/abs/2406.09415 13 Jun, FouRA: Fourier Low Rank Adaptation , https://arxiv.org/abs/2406.08798 14 Jun, Bootstrapping Language Models with DPO Implicit Rewards , https://arxiv.org/abs/2406.09760 14 Jun, Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs , https://arxiv.org/abs/2406.10209 14 Jun, Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs , https://arxiv.org/abs/2406.10216 16 Jun, THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation , https://arxiv.org/abs/2406.10996 17 Jun, Task Me Anything , https://arxiv.org/abs/2406.11775 17 Jun, How Do Large Language Models Acquire Factual Knowledge During Pretraining? , https://arxiv.org/abs/2406.11813 17 Jun, mDPO: Conditional Preference Optimization for Multimodal Large Language Models , https://arxiv.org/abs/2406.11839 17 Jun, Nemotron-4 340B Technical Report , https://arxiv.org/abs/2406.11704 17 Jun, DataComp-LM: In Search of the Next Generation of Training Sets for Language Models , https://arxiv.org/abs/2406.11794 17 Jun, Tokenization Falling Short: The Curse of Tokenization , https://arxiv.org/abs/2406.11687 17 Jun, DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence , https://arxiv.org/abs/2406.11931 17 Jun, Unveiling Encoder-Free Vision-Language Models , https://arxiv.org/abs/2406.11832 17 Jun, Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level , https://arxiv.org/abs/2406.11817 17 Jun, HARE: HumAn pRiors, a key to small language model Efficiency , https://arxiv.org/abs/2406.11410 17 Jun, Measuring memorization in RLHF for code completion , https://arxiv.org/abs/2406.11715 17 Jun, Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts , https://arxiv.org/abs/2406.12034 18 Jun, From RAGs to Rich Parameters: Probing How Language Models Utilize External Knowledge Over Parametric Information for Factual Queries , https://arxiv.org/abs/2406.12824 18 Jun, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges , https://arxiv.org/abs/2406.12624 19 Jun, Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? , https://arxiv.org/abs/2406.13121 20 Jun, Instruction Pre-Training: Language Models are Supervised Multitask Learners , https://arxiv.org/abs/2406.14491 20 Jun, Can LLMs Learn by Teaching? A Preliminary Study , https://arxiv.org/abs/2406.14629 21 Jun, A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems , https://arxiv.org/abs/2406.14972 21 Jun, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs , https://arxiv.org/abs/2406.15319 21 Jun, MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression , https://arxiv.org/abs/2406.14909 21 Jun, Efficient Continual Pre-training by Mitigating the Stability Gap , https://arxiv.org/abs/2406.14833 24 Jun, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers , https://arxiv.org/abs/2406.16747 24 Jun, WARP: On the Benefits of Weight Averaged Rewarded Policies , https://arxiv.org/abs/2406.16768 24 Jun, Adam-mini: Use Fewer Learning Rates To Gain More , https://arxiv.org/abs/2406.16793 25 Jun, The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , https://arxiv.org/abs/2406.17557 25 Jun, LongIns: A Challenging Long-context Instruction-based Exam for LLMs , https://arxiv.org/abs/2406.17588 25 Jun, Following Length Constraints in Instructions , https://arxiv.org/abs/2406.17744 26 Jun, A Closer Look into Mixture-of-Experts in Large Language Models , https://arxiv.org/abs/2406.18219 26 Jun, RouteLLM: Learning to Route LLMs with Preference Data , https://arxiv.org/abs/2406.18665 26 Jun, Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs , https://arxiv.org/abs/2406.18629 27 Jun, Dataset Size Recovery from LoRA Weights , https://arxiv.org/abs/2406.19395 27 Jun, From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data , https://arxiv.org/abs/2406.19292 27 Jun, Changing Answer Order Can Decrease MMLU Accuracy , https://arxiv.org/abs/2406.19470 28 Jun, Direct Preference Knowledge Distillation for Large Language Models , https://arxiv.org/abs/2406.19774 28 Jun, LLM Critics Help Catch LLM Bugs , https://arxiv.org/abs/2407.00215 28 Jun, Scaling Synthetic Data Creation with 1,000,000,000 Personas , https://arxiv.org/abs/2406.20094 1 Jul, LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives , https://arxiv.org/abs/2407.01490 1 Jul, Searching for Best Practices in Retrieval-Augmented Generation , https://arxiv.org/abs/2407.01219 1 Jul, Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models , https://arxiv.org/abs/2407.01906 1 Jul, Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion , https://arxiv.org/abs/2407.01392 1 Jul, Eliminating Position Bias of Language Models: A Mechanistic Approach , https://arxiv.org/abs/2407.01100 2 Jul, JMInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention , https://arxiv.org/abs/2407.02490 2 Jul, TokenPacker: Efficient Visual Projector for Multimodal LLM , https://arxiv.org/abs/2407.02392 2 Jul, Reasoning in Large Language Models: A Geometric Perspective , https://arxiv.org/abs/2407.02678 2 Jul, RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs , https://arxiv.org/abs/2407.02485 3 Jul, AgentInstruct: Toward Generative Teaching with Agentic Flows , https://arxiv.org/abs/2407.03502 3 Jul, HEMM: Holistic Evaluation of Multimodal Foundation Models , https://arxiv.org/abs/2407.03418 4 Jul, Mixture of A Million Experts , https://arxiv.org/abs/2407.04153 5 Jul, Learning to (Learn at Test Time): RNNs with Expressive Hidden States , https://arxiv.org/abs/2407.04620 9 Jul, Vision Language Models Are Blind , https://arxiv.org/abs/2407.06581 9 Jul, Self-Recognition in Language Models , https://arxiv.org/abs/2407.06946 10 Jul, Inference Performance Optimization for Large Language Models on CPUs , https://arxiv.org/abs/2407.07304 11 Jul, Gradient Boosting Reinforcement Learning , https://arxiv.org/abs/2407.08250 11 Jul, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision , https://arxiv.org/abs/2407.08608 12 Jul, SpreadsheetLLM: Encoding Spreadsheets for Large Language Models , https://arxiv.org/abs/2407.09025 12 Jul, New Desiderata for Direct Preference Optimization , https://arxiv.org/abs/2407.09072 12 Jul, Context Embeddings for Efficient Answer Generation in RAG , https://arxiv.org/abs/2407.09252 15 Jul, Qwen2 Technical Report , https://arxiv.org/abs/2407.10671 15 Jul, The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism , https://arxiv.org/abs/2407.10457 15 Jul, From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients , https://arxiv.org/abs/2407.11239 16 Jul, GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression , https://arxiv.org/abs/2407.12077 16 Jul, Scaling Diffusion Transformers to 16 Billion Parameters , https://arxiv.org/abs/2407.11633 16 Jul, NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? , https://arxiv.org/abs/2407.11963 17 Jul, Patch-Level Training for Large Language Models , https://arxiv.org/abs/2407.12665 17 Jul, LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models , https://arxiv.org/abs/2407.12772 17 Jul, A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks , https://arxiv.org/abs/2407.12994 17 Jul, Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models , https://arxiv.org/abs/2407.12327 18 Jul, Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation , https://arxiv.org/abs/2407.13481 18 Jul, Weak-to-Strong Reasoning , https://arxiv.org/abs/2407.13647 18 Jul, Understanding Reference Policies in Direct Preference Optimization , https://arxiv.org/abs/2407.13709 18 Jul, Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies , https://arxiv.org/abs/2407.13623 19 Jul, BOND: Aligning LLMs with Best-of-N Distillation , https://arxiv.org/abs/2407.14622 19 Jul, Compact Language Models via Pruning and Knowledge Distillation , https://arxiv.org/abs/2407.14679 19 Jul, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference , https://arxiv.org/abs/2407.14057 22 Jul, Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training , https://arxiv.org/abs/2407.15892 22 Jul, DDK: Distilling Domain Knowledge for Efficient Large Language Models , https://arxiv.org/abs/2407.16154 23 Jul, Generation Constraint Scaling Can Mitigate Hallucination , https://arxiv.org/abs/2407.16908 23 Jul, Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach , https://arxiv.org/abs/2407.16833 23 Jul, Course-Correction: Safety Alignment Using Synthetic Preferences , https://arxiv.org/abs/2407.16637 26 Jul, Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? , https://arxiv.org/abs/2407.16607 28 Jul, Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge , https://arxiv.org/abs/2407.19594 29 Jul, Improving Retrieval Augmented Language Model with Self-Reasoning , https://arxiv.org/abs/2407.19813 29 Jul, Apple Intelligence Foundation Language Models , https://arxiv.org/abs/2407.21075 30 Jul, ThinK: Thinner Key Cache by Query-Driven Pruning , https://arxiv.org/abs/2407.21018 31 Jul, The Llama 3 Herd of Models , https://arxiv.org/abs/2407.21783 31 Jul, Gemma 2: Improving Open Language Models at a Practical Size , https://arxiv.org/abs/2408.00118 1 Aug, S AM 2: Segment Anything in Images and Videos, https://arxiv.org/abs/2408.00714 2 Aug, POA: Pre-training Once for Models of All Sizes, https://arxiv.org/abs/2408.01031 2 Aug, RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework, https://arxiv.org/abs/2408.01262 2 Aug, A Survey of Mamba, https://arxiv.org/abs/2408.01129 3 Aug, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 5 Aug, RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, https://arxiv.org/abs/2408.02545 5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666 5 Aug, BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba, https://arxiv.org/abs/2408.02600 5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666 7 Aug, EXAONE 3.0 7.8B Instruction Tuned Language Model, https://arxiv.org/abs/2408.03541 7 Aug, 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, https://arxiv.org/abs/2408.03506 8 Aug, Conversational Prompt Engineering, https://arxiv.org/abs/2408.04560 8 Aug, Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP, https://arxiv.org/abs/2408.04303 12 Aug, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, https://arxiv.org/abs/2408.06292 15 Aug, Hermes 3 Technical Report, https://arxiv.org/abs/2408.12570 19 Aug, Customizing Language Models with Instance-wise LoRA for Sequential Recommendation, https://arxiv.org/abs/2408.10159 20 Aug , Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information, https://arxiv.org/abs/2408.10615 20 Aug, To Code, or Not To Code? Exploring Impact of Code in Pre-training, https://arxiv.org/abs/2408.10914 21 Aug , LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796 22 Aug, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale, https://arxiv.org/abs/2408.12570 22 Aug, Controllable Text Generation for Large Language Models: A Survey, https://arxiv.org/abs/2408.12599 23 Aug, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 26 Aug, A Practitioner's Guide to Continual Multimodal Pretraining, https://arxiv.org/abs/2408.14471 26 Aug, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637 26 Aug, CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation, https://arxiv.org/abs/2408.14572 27 Aug, The Mamba in the Llama: Distilling and Accelerating Hybrid Models, https://arxiv.org/abs/2408.15237 28 Aug, ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496 29 Aug, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737 31 Aug, LongRecipe: Recipe for Efficient Long Context Generalization in Large Languge Models, https://arxiv.org/abs/2409.00509 3 Sep, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666 5 Sep, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 5 Sep, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA , https://arxiv.org/abs/2409.02897 5 Sep, How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data, https://arxiv.org/abs/2409.03810 6 Sep, T heory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 10 Sep, LLaMA-Omni: Seamless Speech Interaction with Large Language Models, https://arxiv.org/abs/2409.06666 10 Sep, What is the Role of Small Models in the LLM Era: A Survey, https://arxiv.org/abs/2409.06857 11 Sep, Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, https://arxiv.org/abs/2409.06957 16 Sep, RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval , https://arxiv.org/abs/2409.10516 18 Sep, Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement , https://arxiv.org/abs/2409.12122 18 Sep, Qwen2.5-Coder Technical Report , https://arxiv.org/abs/2409.12186 21 Sep, Instruction Following without Instruction Tuning, https://arxiv.org/abs/2409.14254 30 Sep, I s Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis, https://arxiv.org/abs/2409.20059 30 Sep, The Perfect Blend: Redefining RLHF with Mixture of Judges, https://arxiv.org/abs/2409.20370 (New paper by Meta on how they did RLHF for Llama 3) 1 Oct, Addition is All You Need for Energy-efficient Language Models, https://arxiv.org/abs/2410.00907 2 Oct Quantifying Generalization Complexity for Large Language Models, https://arxiv.org/abs/2410.01769 2 Oct, When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1 , https://arxiv.org/abs/2410.01792 2 Oct, W ere RNNs All We Needed? , https://arxiv.org/abs/2410.01201 3 Oct, Selective Attention Improves Transformer , https://arxiv.org/abs/2410.02703 3 Oct, LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations , https://arxiv.org/abs/2410.02707 3 Oct, LLaVA-Critic: Learning to Evaluate Multimodal Models , https://arxiv.org/abs/2410.02712 7 Oct, Differential Transformer , https://arxiv.org/abs/2410.05258 7 Oct, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , https://arxiv.org/abs/2410.05229 8 Oct, ARIA: An Open Multimodal Native Mixture-of-Experts Model , https://arxiv.org/abs/2410.05993 8 Oct, O1 Replication Journey: A Strategic Progress Report -- Part 1 , https://arxiv.org/abs/2410.18982 8 Oct, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983 9 Oct, From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning , https://arxiv.org/abs/2410.06456 10 Oct, KV Prediction for Improved Time to First Token , https://arxiv.org/abs/2410.08391 11 Oct, Baichuan-Omni Technical Report , https://arxiv.org/abs/2410.08565 13 Oct, MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models , https://arxiv.org/abs/2410.10139 13 Oct, LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models , https://arxiv.org/abs/2410.09732 15 Oct, AFlow: Automating Agentic Workflow Generation , https://arxiv.org/abs/2410.10762 15 Oct, Toward General Instruction-Following Alignment for Retrieval-Augmented Generation , https://arxiv.org/abs/2410.09584 21 Oct, Pre-training Distillation for Large Language Models: A Design Space Exploration , https://arxiv.org/abs/2410.16215 23 Oct, MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models , https://arxiv.org/abs/2410.17637 23 Oct, Scalable Ranked Preference Optimization for Text-to-Image Generation , https://arxiv.org/abs/2410.18013 23 Oct, Scaling Diffusion Language Models via Adaptation from Autoregressive Models , https://arxiv.org/abs/2410.17891 24 Oct, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback , https://arxiv.org/abs/2410.19133 25 Oct, Counting Ability of Large Language Models and Impact of Tokenization , https://arxiv.org/abs/2410.19730 25 Oct, A Survey of Small Language Models , https://arxiv.org/abs/2410.20011 26 Oct, Accelerating Direct Preference Optimization with Prefix Sharing , https://arxiv.org/abs/2410.20305 27 Oct, Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse , https://arxiv.org/abs/2410.21333 28 Oct, LongReward: Improving Long-context Large Language Models with AI Feedback , https://arxiv.org/abs/2410.21252 28 Oct, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference , https://arxiv.org/abs/2410.21465 29 Oct, Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications , https://arxiv.org/abs/2410.21943 30 Oct, CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation , https://arxiv.org/abs/2410.23090 31 Oct, What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective , https://arxiv.org/abs/2410.23743 31 Oct, GPT or BERT: why not both? , https://arxiv.org/abs/2410.24159 31 Oct, Language Models can Self-Lengthen to Generate Long Texts , https://arxiv.org/abs/2410.23933 1 Nov, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations , https://arxiv.org/abs/2411.00640 1 Nov 2024, Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation , https://arxiv.org/abs/2411.00412 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models , https://arxiv.org/abs/2411.00492 3 Nov, S ample-Efficient Alignment for LLMs , https://arxiv.org/abs/2411.01493 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness , https://arxiv.org/abs/2411.03350 4 Nov, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization , https://arxiv.org/abs/2411.02355 4 Nov, Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study , https://arxiv.org/abs/2411.02462 5 Nov, HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems , https://arxiv.org/abs/2411.02959 6 Nov, Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination , https://arxiv.org/abs/2411.03823 6 Nov, Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding , https://arxiv.org/abs/2411.04282 6 Nov, Number Cookbook: Number Understanding of Language Models and How to Improve It , https://arxiv.org/abs/2411.03766 7 Nov, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models , https://arxiv.org/abs/2411.04996 7 Nov, BitNet a4.8: 4-bit Activations for 1-bit LLMs , https://arxiv.org/abs/2411.04965 7 Nov, Scaling Laws for Precision , https://arxiv.org/abs/2411.04330 8 Nov, Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation , https://arxiv.org/abs/2411.05966 8 Nov, Balancing Pipeline Parallelism with Vocabulary Parallelism , https://arxiv.org/abs/2411.05288 11 Nov, Toward Optimal Search and Retrieval for RAG , https://arxiv.org/abs/2411.07396 12 Nov, Large Language Models Can Self-Improve in Long-context Reasoning , https://arxiv.org/abs/2411.08147 12 Nov, Stronger Models are NOT Stronger Teachers for Instruction Tuning , https://arxiv.org/abs/2411.07133 12 Nov, Direct Preference Optimization Using Sparse Feature-Level Constraints , https://arxiv.org/abs/2411.07618 13 Nov, Cut Your Losses in Large-Vocabulary Language Models , https://arxiv.org/abs/2411.09009 15 Nov, Does Prompt Formatting Have Any Impact on LLM Performance? , https://arxiv.org/abs/2411.10541 17 Nov, SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization , https://arxiv.org/abs/2411.11909 17 Nov, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration , https://arxiv.org/abs/2411.10958 18 Nov, Bi-Mamba: Towards Accurate 1-Bit State Space Models , https://arxiv.org/abs/2411.11843 19 Nov, RedPajama: an Open Dataset for Training Large Language Models, https://arxiv.org/abs/2411.12372 20 Nov, Hymba: A Hybrid-head Architecture for Small Language Models , https://arxiv.org/abs/2411.13676 20 Nov, Loss-to-Loss Prediction: Scaling Laws for All Datasets , https://arxiv.org/abs/2411.12925 21 Nov, When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training , https://arxiv.org/abs/2411.13476 21 Nov, Multimodal Autoregressive Pre-training of Large Vision Encoders , https://arxiv.org/abs/2411.14402 21 Nov, Natural Language Reinforcement Learning , https://arxiv.org/abs/2411.14251 22 Nov, Large Multi-modal Models Can Interpret Features in Large Multi-modal Models , https://arxiv.org/abs/2411.14982 22 Nov, TÜLU 3: Pushing Frontiers in Open Language Model Post-Training , https://arxiv.org/abs/2411.15124 23 Nov, MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs , https://arxiv.org/abs/2411.15296 24 Nov, LLMs Do Not Think Step-by-step In Implicit Reasoning , https://arxiv.org/abs/2411.15862 25 Nov, O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? , https://arxiv.org/abs/2411.16489 26 Nov, Star Attention: Efficient LLM Inference over Long Sequences , https://arxiv.org/abs/2411.17116 27 Nov, Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens , https://arxiv.org/abs/2411.17691 27 Nov, Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration , https://arxiv.org/abs/2411.17686 29 Nov, Reverse Thinking Makes LLMs Stronger Reasoners , https://arxiv.org/abs/2411.19865 29 Nov, Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability , https://arxiv.org/abs/2411.19943 2 Dec, Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis , https://arxiv.org/abs/2412.01819 2 Dec, X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models , https://arxiv.org/abs/2412.01824 2 Dec, Free Process Rewards without Process Labels , https://arxiv.org/abs/2412.01981 3 Dec, Scaling Image Tokenizers with Grouped Spherical Quantization , https://arxiv.org/abs/2412.02632 3 Dec, RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models , https://arxiv.org/abs/2412.02830 4 Dec, Perception Tokens Enhance Visual Reasoning in Multimodal Language Models , https://arxiv.org/abs/2412.03548 4 Dec, Evaluating Language Models as Synthetic Data Generators , https://arxiv.org/abs/2412.03679 4 Dec, Best-of-N Jailbreaking , https://arxiv.org/abs/2412.03556 4 Dec, PaliGemma 2: A Family of Versatile VLMs for Transfer , https://arxiv.org/abs/2412.03555 5 Dec, VisionZip: Longer is Better but Not Necessary in Vision Language Models , https://arxiv.org/abs/2412.04467 5 Dec, Evaluating and Aligning CodeLLMs on Human Preference , https://arxiv.org/abs/2412.05210 6 Dec, MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale , https://arxiv.org/abs/2412.05237 6 Dec, Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , https://arxiv.org/abs/2412.05271 7 Dec, LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , https://arxiv.org/abs/2412.05579 8 Dec, Does RLHF Scale? Exploring the Impacts From Data, Model, and Method , https://arxiv.org/abs/2412.06000 9 Dec, Unraveling the Complexity of Memory in RL Agents: An Approach for Classification and Evaluation , https://arxiv.org/abs/2412.06531 9 Dec, Training Large Language Models to Reason in a Continuous Latent Space , https://arxiv.org/abs/2412.06769 9 Dec, AutoReason: Automatic Few-Shot Reasoning Decomposition , https://arxiv.org/abs/2412.06975 11 Dec, Large Concept Models: Language Modeling in a Sentence Representation Space , https://arxiv.org/abs/2412.08821 12 Dec, Phi-4 Technical Report , https://arxiv.org/abs/2412.08905 13 Dec, Byte Latent Transformer: Patches Scale Better Than Tokens , https://arxiv.org/abs/2412.09871 13 Dec, SCBench: A KV Cache-Centric Analysis of Long-Context Methods , https://arxiv.org/abs/2412.10319 13 Dec, Cultural Evolution of Cooperation among LLM Agents , https://arxiv.org/abs/2412.10270 13 Dec, DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding , https://arxiv.org/abs/2412.10302 16 Dec, No More Adam: Learning Rate Scaling at Initialization is All You Need , https://arxiv.org/abs/2412.11768 16 Dec, Precise Length Control in Large Language Models , https://arxiv.org/abs/2412.11937 16 Dec, The Open Source Advantage in Large Language Models (LLMs) , https://arxiv.org/abs/2412.12004 16 Dec, A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges , https://arxiv.org/abs/2412.11936 17 Dec, Are Your LLMs Capable of Stable Reasoning? , https://arxiv.org/abs/2412.13147 18 Dec, LLM Post-Training Recipes, Improving Reasoning in LLMs , https://arxiv.org/abs/2412.14135 18 Dec, Hansel: Output Length Controlling Framework for Large Language Models , https://arxiv.org/abs/2412.14033 18 Dec, Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning , https://arxiv.org/abs/2412.13631 18 Dec, Alignment Faking in Large Language Models , https://arxiv.org/abs/2412.14093 18 Dec, SCOPE: Optimizing Key-Value Cache Compression in Long-Context Generation , https://arxiv.org/abs/2412.13649 19 Dec, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks , https://arxiv.org/abs/2412.15204 20 Dec, Offline Reinforcement Learning for LLM Multi-Step Reasoning , https://arxiv.org/abs/2412.16145 24 Dec, Mulberry: Empowering MLLM with O1-like Reasoning and Reflection via Collective Monte Carlo Tree Search , https://arxiv.org/abs/2412.18319 31 Dec, Titans: Learning to Memorize at Test Time , https://arxiv.org/abs/2501.00663 This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Bonus materials in the GitHub repository (stars highlight my personal favorites) Thanks for your understanding and support, and I hope to make a full recovery soon and be back with the Research Highlights 2024 article in a few weeks! Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. January 2024 1 Jan, Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models , https://arxiv.org/abs/2401.00788 2 Jan, A Comprehensive Study of Knowledge Editing for Large Language Models , https://arxiv.org/abs/2401.01286 2 Jan, LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning , https://arxiv.org/abs/2401.01325 2 Jan, Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models , https://arxiv.org/abs/2401.01335 2 Jan, LLaMA Beyond English: An Empirical Study on Language Capability Transfer , https://arxiv.org/abs/2401.01055 3 Jan, A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity , https://arxiv.org/abs/2401.01967 4 Jan, LLaMA Pro: Progressive LLaMA with Block Expansion , https://arxiv.org/abs/2401.02415 4 Jan, LLM Augmented LLMs: Expanding Capabilities through Composition , https://arxiv.org/abs/2401.02412 4 Jan, Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM , https://arxiv.org/abs/2401.02994 5 Jan, DeepSeek LLM: Scaling Open-Source Language Models with Longtermism , https://arxiv.org/abs/2401.02954 5 Jan, Denoising Vision Transformers , https://arxiv.org/abs/2401.02957 7 Jan, Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon , https://arxiv.org/abs/2401.03462 8 Jan, Mixtral of Experts , https://arxiv.org/abs/2401.04088 8 Jan, MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts , https://arxiv.org/abs/2401.04081 8 Jan, A Minimaximalist Approach to Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2401.04056 9 Jan, RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation , https://arxiv.org/abs/2401.04679 10 Jan, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , https://arxiv.org/abs/2401.05566 11 Jan, Transformers are Multi-State RNNs , https://arxiv.org/abs/2401.06104 11 Jan, A Closer Look at AUROC and AUPRC under Class Imbalance , https://arxiv.org/abs/2401.06091 12 Jan, An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models , https://arxiv.org/abs/2401.06692 16 Jan, Tuning Language Models by Proxy , https://arxiv.org/abs/2401.08565 16 Jan, Scalable Pre-training of Large Autoregressive Image Models , https://arxiv.org/abs/2401.08541 16 Jan, Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering , https://arxiv.org/abs/2401.08500 16 Jan, RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture , https://arxiv.org/abs/2401.08406 17 Jan, ReFT: Reasoning with Reinforced Fine-Tuning , https://arxiv.org/abs/2401.08967 18 Jan, DiffusionGPT: LLM-Driven Text-to-Image Generation System , https://arxiv.org/abs/2401.10061 18 Jan, Self-Rewarding Language Models , https://arxiv.org/abs/2401.10020 18 Jan, VMamba: Visual State Space Model , https://arxiv.org/abs/2401.10166 19 Jan, Knowledge Fusion of Large Language Models , https://arxiv.org/abs/2401.10491 22 Jan, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities , https://arxiv.org/abs/2401.12168 22 Jan, WARM: On the Benefits of Weight Averaged Reward Models , https://arxiv.org/abs/2401.12187 22 Jan, Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text , https://arxiv.org/abs/2401.12070 24 Jan, MambaByte: Token-free Selective State Space Model , https://arxiv.org/abs/2401.13660 24 Jan, SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection , https://arxiv.org/abs/2401.13160 25 Jan, Rethinking Patch Dependence for Masked Autoencoders , https://arxiv.org/abs/2401.14391 25 Jan, Pix2gestalt: Amodal Segmentation by Synthesizing Wholes , https://arxiv.org/abs/2401.14398 25 Jan, Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities , https://arxiv.org/abs/2401.14405 26 Jan, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty , https://arxiv.org/abs/2401.15077 29 Jan, MoE-LLaVA: Mixture of Experts for Large Vision-Language Models , https://arxiv.org/abs/2401.15947 29 Jan, Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling , https://arxiv.org/abs/2401.16380 31 Jan, KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization , https://arxiv.org/abs/2401.18079 1 Feb, Efficient Exploration for LLMs , https://arxiv.org/abs/2402.00396 1 Feb, OLMo: Accelerating the Science of Language Models , https://arxiv.org/abs/2402.00838 1 Feb, Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization? , https://arxiv.org/abs/2402.00841 1 Feb, Repeat After Me: Transformers are Better than State Space Models at Copying , https://arxiv.org/abs/2402.01032 2 Feb, LiPO: Listwise Preference Optimization through Learning-to-Rank , https://arxiv.org/abs/2402.01878 2 Feb, FindingEmo: An Image Dataset for Emotion Recognition in the Wild , https://arxiv.org/abs/2402.01355 3 Feb, More Agents Is All You Need , https://arxiv.org/abs/2402.05120 5 Feb, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , https://arxiv.org/abs/2402.03300 6 Feb, MobileVLM V2: Faster and Stronger Baseline for Vision Language Model , https://arxiv.org/abs/2402.03766 6 Feb, A Phase Transition Between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention , https://arxiv.org/abs/2402.03902 6 Feb, Scaling Laws for Downstream Task Performance of Large Language Models , https://arxiv.org/abs/2402.04177 6 Feb, MOMENT: A Family of Open Time-series Foundation Models , https://arxiv.org/abs/2402.03885 6 Feb, Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models , https://arxiv.org/abs/2402.03749 6 Feb, Self-Discover: Large Language Models Self-Compose Reasoning Structures , https://arxiv.org/abs/2402.03620 7 Feb, Grandmaster-Level Chess Without Search , https://arxiv.org/abs/2402.04494 7 Feb, Direct Language Model Alignment from Online AI Feedback , https://arxiv.org/abs/2402.04792 8 Feb, Buffer Overflow in Mixture of Experts , https://arxiv.org/abs/2402.05526 9 Feb, The Boundary of Neural Network Trainability is Fractal , https://arxiv.org/abs/2402.06184 11 Feb, ODIN: Disentangled Reward Mitigates Hacking in RLHF , https://arxiv.org/abs/2402.07319 12 Feb, Policy Improvement using Language Feedback Models , https://arxiv.org/abs/2402.07876 12 Feb, Scaling Laws for Fine-Grained Mixture of Experts , https://arxiv.org/abs/2402.07871 12 Feb, Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model , https://arxiv.org/abs/2402.07610 12 Feb, Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping , https://arxiv.org/abs/2402.07610 12 Feb, Suppressing Pink Elephants with Direct Principle Feedback , https://arxiv.org/abs/2402.07896 13 Feb, World Model on Million-Length Video And Language With RingAttention , https://arxiv.org/abs/2402.08268 13 Feb, Mixtures of Experts Unlock Parameter Scaling for Deep RL , https://arxiv.org/abs/2402.08609 14 Feb, DoRA: Weight-Decomposed Low-Rank Adaptation , https://arxiv.org/abs/2402.09353 14 Feb, Transformers Can Achieve Length Generalization But Not Robustly , https://arxiv.org/abs/2402.09371 15 Feb, BASE TTS: Lessons From Building a Billion-Parameter Text-to-Speech Model on 100K Hours of Data , https://arxiv.org/abs/2402.08093 15 Feb, Recovering the Pre-Fine-Tuning Weights of Generative Models , https://arxiv.org/abs/2402.10208 15 Feb, Generative Representational Instruction Tuning , https://arxiv.org/abs/2402.09906 16 Feb, FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models , https://arxiv.org/abs/2402.10986 17 Feb, OneBit: Towards Extremely Low-bit Large Language Models , https://arxiv.org/abs/2402.11295 18 Feb, LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration , https://arxiv.org/abs/2402.11550 19 Feb, Reformatted Alignment , https://arxiv.org/abs/2402.12219 19 Feb, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling , https://arxiv.org/abs/2402.12226 19 Feb, Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs , https://arxiv.org/abs/2402.12030 19 Feb, LoRA+: Efficient Low Rank Adaptation of Large Models , https://arxiv.org/abs/2402.12354 20 Feb, Neural Network Diffusion , https://arxiv.org/abs/2402.13144 21 Feb, YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information , https://arxiv.org/abs/2402.13616 21 Feb, LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens , https://arxiv.org/abs/2402.13753 21 Feb, Large Language Models for Data Annotation: A Survey , https://arxiv.org/abs/2402.13446 22 Feb, TinyLLaVA: A Framework of Small-scale Large Multimodal Models , https://arxiv.org/abs/2402.14289 22 Feb, Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs , https://arxiv.org/abs/2402.14740 23 Feb, Genie: Generative Interactive Environments , https://arxiv.org/abs/2402.15391 26 Feb, CARTE: Pretraining and Transfer for Tabular Learning , https://arxiv.org/abs/2402.16785 27 Feb, The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits , https://arxiv.org/abs/2402.17764 27 Feb, Sora Generates Videos with Stunning Geometrical Consistency , https://arxiv.org/abs/2402.17403 27 Feb, When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method , https://arxiv.org/abs/2402.17193 29 Feb, Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models , https://arxiv.org/abs/2402.19427 1 Mar, Learning and Leveraging World Models in Visual Representation Learning , https://arxiv.org/abs/2403.00504 3 Mar, Improving LLM Code Generation with Grammar Augmentation , https://arxiv.org/abs/2403.01632 3 Mar, The Hidden Attention of Mamba Models , https://arxiv.org/abs/2403.01590 4 Mar, Training-Free Pretrained Model Merging , https://arxiv.org/abs/2403.01753 4 Mar, Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures , https://arxiv.org/abs/2403.02308 5 Mar, The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning , https://arxiv.org/abs/2403.03218 5 Mar, Evolution Transformer: In-Context Evolutionary Optimization , https://arxiv.org/abs/2403.02985 5 Mar, Enhancing Vision-Language Pre-training with Rich Supervisions , https://arxiv.org/abs/2403.03346 5 Mar, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis , https://arxiv.org/abs/2403.03206 5 Mar, Design2Code: How Far Are We From Automating Front-End Engineering? , https://arxiv.org/abs/2403.03163 6 Mar, ShortGPT: Layers in Large Language Models are More Redundant Than You Expect , https://arxiv.org/abs/2403.03853 6 Mar, Backtracing: Retrieving the Cause of the Query , https://arxiv.org/abs/2403.03956 6 Mar, Learning to Decode Collaboratively with Multiple Language Models , https://arxiv.org/abs/2403.03870 6 Mar, SaulLM-7B: A pioneering Large Language Model for Law , https://arxiv.org/abs/2403.03883 6 Mar, Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning , https://arxiv.org/abs/2403.03864 6 Mar, 3D Diffusion Policy , https://arxiv.org/abs/2403.03954 6 Mar, MedMamba: Vision Mamba for Medical Image Classification , https://arxiv.org/abs/2403.03849 6 Mar, GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection , https://arxiv.org/abs/2403.03507 6 Mar, Stop Regressing: Training Value Functions via Classification for Scalable Deep RL , https://arxiv.org/abs/2403.03950 7 Mar, How Far Are We from Intelligent Visual Deductive Reasoning? , https://arxiv.org/abs/2403.04732 7 Mar, Common 7B Language Models Already Possess Strong Math Capabilities , https://arxiv.org/abs/2403.04706 8 Mar, Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context , https://arxiv.org/abs/2403.05530 8 Mar, Is Cosine-Similarity of Embeddings Really About Similarity? , https://arxiv.org/abs/2403.05440 8 Mar, LLM4Decompile: Decompiling Binary Code with Large Language Models , https://arxiv.org/abs/2403.05286 9 Mar, Algorithmic Progress in Language Models , https://arxiv.org/abs/2403.05812 11 Mar, Stealing Part of a Production Language Model , https://arxiv.org/abs/2403.06634 12 Mar, Chronos: Learning the Language of Time Series , https://arxiv.org/abs/2403.07815 13 Mar, Simple and Scalable Strategies to Continually Pre-train Large Language Models , https://arxiv.org/abs/2403.08763 13 Mar, Language Models Scale Reliably With Over-Training and on Downstream Tasks , https://arxiv.org/abs/2403.08540 14 Mar, BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences , https://arxiv.org/abs/2403.09347 14 Mar, LocalMamba: Visual State Space Model with Windowed Selective Scan , https://arxiv.org/abs/2403.09338 14 Mar, GiT: Towards Generalist Vision Transformer through Universal Language Interface , https://arxiv.org/abs/2403.09394 14 Mar, MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training , https://arxiv.org/abs/2403.09611 15 Mar, RAFT: Adapting Language Model to Domain Specific RAG , https://arxiv.org/abs/2403.10131 18 Mar, TnT-LLM: Text Mining at Scale with Large Language Models , https://arxiv.org/abs/2403.12173 18 Mar, Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression , https://arxiv.org/abs/2403.15447 19 Mar, PERL: Parameter Efficient Reinforcement Learning from Human Feedback , https://arxiv.org/abs/2403.10704 20 Mar, RewardBench: Evaluating Reward Models for Language Modeling , https://arxiv.org/abs/2403.13787 20 Mar, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , https://arxiv.org/abs/2403.13372 21 Mar, RakutenAI-7B: Extending Large Language Models for Japanese , https://arxiv.org/abs/2403.15484 22 Mar, SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series , https://arxiv.org/abs/2403.15360 22 Mar, Can Large Language Models Explore In-Context? , https://arxiv.org/abs/2403.15371 22 Mar, LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement , https://arxiv.org/abs/2403.15042 25 Mar, LLM Agent Operating System , https://arxiv.org/abs/2403.16971 26 Mar, The Unreasonable Ineffectiveness of the Deeper Layers , https://arxiv.org/abs/2403.17887 27 Mar, BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text , https://arxiv.org/abs/2403.18421 27 Mar, ViTAR: Vision Transformer with Any Resolution , https://arxiv.org/abs/2403.18361 27 Mar, Long-form Factuality in Large Language Models , https://arxiv.org/abs/2403.18802 27 Mar, Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models , https://arxiv.org/abs/2403.18814 26 Mar, LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning , https://arxiv.org/abs/2403.17919 26 Mar, Mechanistic Design and Scaling of Hybrid Architectures , https://arxiv.org/abs/2403.17844 28 Mar, MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions , https://arxiv.org/abs/2403.19651 28 Mar, Model Stock: All We Need Is Just a Few Fine-Tuned Models , https://arxiv.org/abs/2403.19522 1 Apr, Do Language Models Plan Ahead for Future Tokens? , https://arxiv.org/abs/2404.00859 1 Apr, Bigger is not Always Better: Scaling Properties of Latent Diffusion Models , https://arxiv.org/abs/2404.01367 1 Apr, The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis , https://arxiv.org/abs/2404.01204 1 Apr, Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models , https://arxiv.org/abs/2404.04478 2 Apr, Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models , https://arxiv.org/abs/2404.02258 2 Apr, Long-context LLMs Struggle with Long In-context Learning , https://arxiv.org/abs/2404.02060 2 Apr, Emergent Abilities in Reduced-Scale Generative Language Models , https://arxiv.org/abs/2404.02204 2 Apr, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks , https://arxiv.org/abs/2404.02151 3 Apr, On the Scalability of Diffusion-based Text-to-Image Generation , https://arxiv.org/abs/2404.02883 3 Apr, BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models , https://arxiv.org/abs/2404.02827 3 Apr, Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models , https://arxiv.org/abs/2404.02747 4 Apr, Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences , https://arxiv.org/abs/2404.02151 4 Apr, Training LLMs over Neurally Compressed Text , https://arxiv.org/abs/2404.03626 4 Apr, CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues , https://arxiv.org/abs/2404.03820 5 Apr, ReFT: Representation Finetuning for Language Models , https://arxiv.org/abs/2404.03592 5 Apr, Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data , https://arxiv.org/abs/2404.03862 5 Apr, Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation , https://arxiv.org/abs/2404.04256 8 Apr, AutoCodeRover: Autonomous Program Improvement , https://arxiv.org/abs/2404.05427 8 Apr, Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence , https://arxiv.org/abs/2404.05892 8 Apr, CodecLM: Aligning Language Models with Tailored Synthetic Data , https://arxiv.org/abs/2404.05875 9 Apr, MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies , https://arxiv.org/abs/2404.06395 9 Apr, Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models , https://arxiv.org/abs/2404.06209 9 Apr, LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders , https://arxiv.org/abs/2404.05961 10 Apr, Adapting LLaMA Decoder to Vision Transformer , https://arxiv.org/abs/2404.06773 10 Apr, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention , https://arxiv.org/abs/2404.07143 11 Apr, LLoCO: Learning Long Contexts Offline , https://arxiv.org/abs/2404.07979 11 Apr, JetMoE: Reaching Llama2 Performance with 0.1M Dollars , https://arxiv.org/abs/2404.07413 11 Apr, Best Practices and Lessons Learned on Synthetic Data for Language Models , https://arxiv.org/abs/2404.07503 11 Apr, Rho-1: Not All Tokens Are What You Need , https://arxiv.org/abs/2404.07965 12 Apr, Pre-training Small Base LMs with Fewer Tokens , https://arxiv.org/abs/2404.08634 12 Apr, Dataset Reset Policy Optimization for RLHF , https://arxiv.org/abs/2404.08495 13 Apr, LLM In-Context Recall is Prompt Dependent , https://arxiv.org/abs/2404.08865 15 Apr, State Space Model for New-Generation Network Alternative to Transformers: A Survey , https://arxiv.org/abs/2404.09516 15 Apr, Chinchilla Scaling: A Replication Attempt , https://arxiv.org/abs/2404.10102 15 Apr, Learn Your Reference Model for Real Good Alignment , https://arxiv.org/abs/2404.09656 16 Apr, Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study , https://arxiv.org/abs/2404.10719 16 Apr, Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies , https://arxiv.org/abs/2404.08197 16 Apr, How Faithful Are RAG Models? Quantifying the Tug-of-War Between RAG and LLMs' Internal Prior , https://arxiv.org/abs/2404.10198 17 Apr, A Survey on Retrieval-Augmented Text Generation for Large Language Models , https://arxiv.org/abs/2404.10981 18 Apr, When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes , https://arxiv.org/abs/2404.12365 18 Apr, Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing , https://arxiv.org/abs/2404.12253 18 Apr, OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data , https://arxiv.org/abs/2404.12195 19 Apr, The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions , https://arxiv.org/abs/2404.13208 22 Apr, How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study , https://arxiv.org/abs/2404.14047 22 Apr, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , https://arxiv.org/abs/2404.14219 22 Apr, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework , https://arxiv.org/abs/2404.14619 22 Apr, A Survey on Self-Evolution of Large Language Models , https://arxiv.org/abs/2404.14662 23 Apr, Multi-Head Mixture-of-Experts , https://arxiv.org/abs/2404.15045 23 Apr, NExT: Teaching Large Language Models to Reason about Code Execution , https://arxiv.org/abs/2404.14662 23 Apr, Graph Machine Learning in the Era of Large Language Models (LLMs) , https://arxiv.org/abs/2404.14928 24 Apr, Retrieval Head Mechanistically Explains Long-Context Factuality , https://arxiv.org/abs/2404.15574 25 Apr, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding , https://arxiv.org/abs/2404.16710 25 Apr, Make Your LLM Fully Utilize the Context , https://arxiv.org/abs/2404.16811 28 Apr, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report , https://arxiv.org/abs/2405.00732 30 Apr, Better & Faster Large Language Models via Multi-token Prediction , https://arxiv.org/abs/2404.19737 30 Apr, RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing , https://arxiv.org/abs/2404.19543 30 Apr, A Primer on the Inner Workings of Transformer-based Language Models , https://arxiv.org/abs/2405.00208 30 Apr, When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively , https://arxiv.org/abs/2404.19705 30 Apr, KAN: Kolmogorov–Arnold Networks , https://arxiv.org/abs/2404.19756 1 May, Is Bigger Edit Batch Size Always Better? An Empirical Study on Model Editing with Llama-3 , https://arxiv.org/abs/2405.00664 1 May, Self-Play Preference Optimization for Language Model Alignment , https://arxiv.org/abs/2405.00675 1 May, A Careful Examination of Large Language Model Performance on Grade School Arithmetic , https://arxiv.org/abs/2405.00332 2 May, Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models , https://arxiv.org/abs/2405.01535 3 May, What Matters When Building Vision-Language Models? , https://arxiv.org/abs/2405.02246 5 May, Is Flash Attention Stable? , https://arxiv.org/abs/2405.02803 7 May, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention , https://arxiv.org/abs/2405.04437 7 May, xLSTM: Extended Long Short-Term Memory , https://arxiv.org/abs/2405.04517 8 May, You Only Cache Once: Decoder-Decoder Architectures for Language Models , https://arxiv.org/abs/2405.05254 8 May, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model , https://arxiv.org/abs/2405.04434 8 May, Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models , https://arxiv.org/abs/2405.05417 9 May, Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? , https://arxiv.org/abs/2405.05904 10 May, Value Augmented Sampling for Language Model Alignment and Personalization , https://arxiv.org/abs/2405.06639 12 May, PHUDGE: Phi-3 as Scalable Judge , https://arxiv.org/abs/2405.08029 13 May, RLHF Workflow: From Reward Modeling to Online RLHF , https://arxiv.org/abs/2405.07863 15 May, LoRA Learns Less and Forgets Less , https://arxiv.org/abs/2405.09673 15 May, Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model , https://arxiv.org/abs/2405.09215 16 May, Chameleon: Mixed-Modal Early-Fusion Foundation Models , https://arxiv.org/abs/2405.09818 17 May, Towards Modular LLMs by Building and Reusing a Library of LoRAs , https://arxiv.org/abs/2405.11157 19 May, SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization , https://arxiv.org/abs/2405.11582 20 May, MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning , https://arxiv.org/abs/2405.12130 22 May, Attention as an RNN , https://arxiv.org/abs/2405.13956 22 May, Dense Connector for MLLMs , https://arxiv.org/abs/2405.13800 23 May, AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability , https://arxiv.org/abs/2405.14129 23 May, SimPO: Simple Preference Optimization with a Reference-Free Reward , https://arxiv.org/abs/2405.14734 23 May, Instruction Tuning With Loss Over Instructions , https://arxiv.org/abs/2405.14394 24 May, The Road Less Scheduled , https://arxiv.org/abs/2405.15682 26 May, Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training , https://arxiv.org/abs/2405.15319 26 May, gzip Predicts Data-dependent Scaling Laws , https://arxiv.org/abs/2405.16684 27 May, Trans-LoRA: Towards Data-free Transferable Parameter Efficient Finetuning , https://arxiv.org/abs/2405.17258 28 May, VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections , https://arxiv.org/abs/2405.17991 28 May, LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models , https://arxiv.org/abs/2405.18377 29 May, Contextual Position Encoding: Learning to Count What's Important , https://arxiv.org/abs/2405.18719 2 Jun, Show, Don't Tell: Aligning Language Models with Demonstrated Feedback , https://arxiv.org/abs/2406.00888 3 Jun, Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models , https://arxiv.org/abs/2406.06563 3 Jun, OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models , https://arxiv.org/abs/2406.01775 3 Jun, The Geometry of Categorical and Hierarchical Concepts in Large Language Models , https://arxiv.org/abs/2406.01506 3 Jun, Towards Scalable Automated Alignment of LLMs: A Survey , https://arxiv.org/abs/2406.01252 4 Jun, Scalable MatMul-free Language Modeling , https://arxiv.org/abs/2406.02528 4 Jun, Block Transformer: Global-to-Local Language Modeling for Fast Inference , https://arxiv.org/abs/2406.02657 6 Jun, Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models , https://arxiv.org/abs/2406.04271 6 Jun, The Prompt Report: A Systematic Survey of Prompting Techniques , https://arxiv.org/abs/2406.06608 6 Jun, Transformers Need Glasses! Information Over-Squashing in Language Tasks , https://arxiv.org/abs/2406.04267 6 Jun, Are We Done with MMLU? , https://arxiv.org/abs/2406.04127 6 Jun, Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step , https://arxiv.org/abs/2406.04314 7 Jun, Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach , https://arxiv.org/abs/2406.04594 7 Jun, CRAG -- Comprehensive RAG Benchmark , https://arxiv.org/abs/2406.04744 7 Jun, WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild , https://arxiv.org/abs/2406.04770 7 Jun, Mixture-of-Agents Enhances Large Language Model Capabilities , https://arxiv.org/abs/2406.04692 7 Jun, BERTs are Generative In-Context Learners , https://arxiv.org/abs/2406.04823 7 Jun, 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination , https://arxiv.org/abs/2406.05132 8 Jun, Creativity Has Left the Chat: The Price of Debiasing Language Models , https://arxiv.org/abs/2406.05587 10 Jun, Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation , https://arxiv.org/abs/2406.06525 10 Jun, Margin-aware Preference Optimization for Aligning Diffusion Models Without Reference , https://arxiv.org/abs/2406.06424 10 Jun, Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning , https://arxiv.org/abs/2406.06469 10 Jun, Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters , https://arxiv.org/abs/2406.05955 10 Jun, Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching , https://arxiv.org/abs/2406.06326 11 Jun, An Image is Worth 32 Tokens for Reconstruction and Generation , https://arxiv.org/abs/2406.07550 11 Jun, TextGrad: Automatic "Differentiation" via Text , https://arxiv.org/abs/2406.07496 11 Jun, Simple and Effective Masked Diffusion Language Models , https://arxiv.org/abs/2406.07524 11 Jun, Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement , https://arxiv.org/abs/2406.07138 11 Jun, Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling , https://arxiv.org/abs/2406.07522 12 Jun, Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing , https://arxiv.org/abs/2406.08464 12 Jun, What If We Recaption Billions of Web Images with LLaMA-3? , https://arxiv.org/abs/2406.08478 12 Jun, Large Language Model Unlearning via Embedding-Corrupted Prompts , https://arxiv.org/abs/2406.07933 12 Jun, Large Language Models Must Be Taught to Know What They Don't Know , https://arxiv.org/abs/2406.08391 12 Jun, An Empirical Study of Mamba-based Language Models , https://arxiv.org/abs/2406.07887 12 Jun, Discovering Preference Optimization Algorithms with and for Large Language Models , https://arxiv.org/abs/2406.08414 13 Jun, Transformers Meet Neural Algorithmic Reasoners , https://arxiv.org/abs/2406.09308 13 Jun, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding , https://arxiv.org/abs/2406.09297 13 Jun, An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels , https://arxiv.org/abs/2406.09415 13 Jun, FouRA: Fourier Low Rank Adaptation , https://arxiv.org/abs/2406.08798 14 Jun, Bootstrapping Language Models with DPO Implicit Rewards , https://arxiv.org/abs/2406.09760 14 Jun, Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs , https://arxiv.org/abs/2406.10209 14 Jun, Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs , https://arxiv.org/abs/2406.10216 16 Jun, THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation , https://arxiv.org/abs/2406.10996 17 Jun, Task Me Anything , https://arxiv.org/abs/2406.11775 17 Jun, How Do Large Language Models Acquire Factual Knowledge During Pretraining? , https://arxiv.org/abs/2406.11813 17 Jun, mDPO: Conditional Preference Optimization for Multimodal Large Language Models , https://arxiv.org/abs/2406.11839 17 Jun, Nemotron-4 340B Technical Report , https://arxiv.org/abs/2406.11704 17 Jun, DataComp-LM: In Search of the Next Generation of Training Sets for Language Models , https://arxiv.org/abs/2406.11794 17 Jun, Tokenization Falling Short: The Curse of Tokenization , https://arxiv.org/abs/2406.11687 17 Jun, DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence , https://arxiv.org/abs/2406.11931 17 Jun, Unveiling Encoder-Free Vision-Language Models , https://arxiv.org/abs/2406.11832 17 Jun, Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level , https://arxiv.org/abs/2406.11817 17 Jun, HARE: HumAn pRiors, a key to small language model Efficiency , https://arxiv.org/abs/2406.11410 17 Jun, Measuring memorization in RLHF for code completion , https://arxiv.org/abs/2406.11715 17 Jun, Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts , https://arxiv.org/abs/2406.12034 18 Jun, From RAGs to Rich Parameters: Probing How Language Models Utilize External Knowledge Over Parametric Information for Factual Queries , https://arxiv.org/abs/2406.12824 18 Jun, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges , https://arxiv.org/abs/2406.12624 19 Jun, Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? , https://arxiv.org/abs/2406.13121 20 Jun, Instruction Pre-Training: Language Models are Supervised Multitask Learners , https://arxiv.org/abs/2406.14491 20 Jun, Can LLMs Learn by Teaching? A Preliminary Study , https://arxiv.org/abs/2406.14629 21 Jun, A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems , https://arxiv.org/abs/2406.14972 21 Jun, LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs , https://arxiv.org/abs/2406.15319 21 Jun, MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression , https://arxiv.org/abs/2406.14909 21 Jun, Efficient Continual Pre-training by Mitigating the Stability Gap , https://arxiv.org/abs/2406.14833 24 Jun, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers , https://arxiv.org/abs/2406.16747 24 Jun, WARP: On the Benefits of Weight Averaged Rewarded Policies , https://arxiv.org/abs/2406.16768 24 Jun, Adam-mini: Use Fewer Learning Rates To Gain More , https://arxiv.org/abs/2406.16793 25 Jun, The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , https://arxiv.org/abs/2406.17557 25 Jun, LongIns: A Challenging Long-context Instruction-based Exam for LLMs , https://arxiv.org/abs/2406.17588 25 Jun, Following Length Constraints in Instructions , https://arxiv.org/abs/2406.17744 26 Jun, A Closer Look into Mixture-of-Experts in Large Language Models , https://arxiv.org/abs/2406.18219 26 Jun, RouteLLM: Learning to Route LLMs with Preference Data , https://arxiv.org/abs/2406.18665 26 Jun, Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs , https://arxiv.org/abs/2406.18629 27 Jun, Dataset Size Recovery from LoRA Weights , https://arxiv.org/abs/2406.19395 27 Jun, From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data , https://arxiv.org/abs/2406.19292 27 Jun, Changing Answer Order Can Decrease MMLU Accuracy , https://arxiv.org/abs/2406.19470 28 Jun, Direct Preference Knowledge Distillation for Large Language Models , https://arxiv.org/abs/2406.19774 28 Jun, LLM Critics Help Catch LLM Bugs , https://arxiv.org/abs/2407.00215 28 Jun, Scaling Synthetic Data Creation with 1,000,000,000 Personas , https://arxiv.org/abs/2406.20094 1 Jul, LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives , https://arxiv.org/abs/2407.01490 1 Jul, Searching for Best Practices in Retrieval-Augmented Generation , https://arxiv.org/abs/2407.01219 1 Jul, Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models , https://arxiv.org/abs/2407.01906 1 Jul, Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion , https://arxiv.org/abs/2407.01392 1 Jul, Eliminating Position Bias of Language Models: A Mechanistic Approach , https://arxiv.org/abs/2407.01100 2 Jul, JMInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention , https://arxiv.org/abs/2407.02490 2 Jul, TokenPacker: Efficient Visual Projector for Multimodal LLM , https://arxiv.org/abs/2407.02392 2 Jul, Reasoning in Large Language Models: A Geometric Perspective , https://arxiv.org/abs/2407.02678 2 Jul, RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs , https://arxiv.org/abs/2407.02485 3 Jul, AgentInstruct: Toward Generative Teaching with Agentic Flows , https://arxiv.org/abs/2407.03502 3 Jul, HEMM: Holistic Evaluation of Multimodal Foundation Models , https://arxiv.org/abs/2407.03418 4 Jul, Mixture of A Million Experts , https://arxiv.org/abs/2407.04153 5 Jul, Learning to (Learn at Test Time): RNNs with Expressive Hidden States , https://arxiv.org/abs/2407.04620 9 Jul, Vision Language Models Are Blind , https://arxiv.org/abs/2407.06581 9 Jul, Self-Recognition in Language Models , https://arxiv.org/abs/2407.06946 10 Jul, Inference Performance Optimization for Large Language Models on CPUs , https://arxiv.org/abs/2407.07304 11 Jul, Gradient Boosting Reinforcement Learning , https://arxiv.org/abs/2407.08250 11 Jul, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision , https://arxiv.org/abs/2407.08608 12 Jul, SpreadsheetLLM: Encoding Spreadsheets for Large Language Models , https://arxiv.org/abs/2407.09025 12 Jul, New Desiderata for Direct Preference Optimization , https://arxiv.org/abs/2407.09072 12 Jul, Context Embeddings for Efficient Answer Generation in RAG , https://arxiv.org/abs/2407.09252 15 Jul, Qwen2 Technical Report , https://arxiv.org/abs/2407.10671 15 Jul, The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism , https://arxiv.org/abs/2407.10457 15 Jul, From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients , https://arxiv.org/abs/2407.11239 16 Jul, GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression , https://arxiv.org/abs/2407.12077 16 Jul, Scaling Diffusion Transformers to 16 Billion Parameters , https://arxiv.org/abs/2407.11633 16 Jul, NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? , https://arxiv.org/abs/2407.11963 17 Jul, Patch-Level Training for Large Language Models , https://arxiv.org/abs/2407.12665 17 Jul, LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models , https://arxiv.org/abs/2407.12772 17 Jul, A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks , https://arxiv.org/abs/2407.12994 17 Jul, Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models , https://arxiv.org/abs/2407.12327 18 Jul, Attention Overflow: Language Model Input Blur during Long-Context Missing Items Recommendation , https://arxiv.org/abs/2407.13481 18 Jul, Weak-to-Strong Reasoning , https://arxiv.org/abs/2407.13647 18 Jul, Understanding Reference Policies in Direct Preference Optimization , https://arxiv.org/abs/2407.13709 18 Jul, Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies , https://arxiv.org/abs/2407.13623 19 Jul, BOND: Aligning LLMs with Best-of-N Distillation , https://arxiv.org/abs/2407.14622 19 Jul, Compact Language Models via Pruning and Knowledge Distillation , https://arxiv.org/abs/2407.14679 19 Jul, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference , https://arxiv.org/abs/2407.14057 22 Jul, Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training , https://arxiv.org/abs/2407.15892 22 Jul, DDK: Distilling Domain Knowledge for Efficient Large Language Models , https://arxiv.org/abs/2407.16154 23 Jul, Generation Constraint Scaling Can Mitigate Hallucination , https://arxiv.org/abs/2407.16908 23 Jul, Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach , https://arxiv.org/abs/2407.16833 23 Jul, Course-Correction: Safety Alignment Using Synthetic Preferences , https://arxiv.org/abs/2407.16637 26 Jul, Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? , https://arxiv.org/abs/2407.16607 28 Jul, Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge , https://arxiv.org/abs/2407.19594 29 Jul, Improving Retrieval Augmented Language Model with Self-Reasoning , https://arxiv.org/abs/2407.19813 29 Jul, Apple Intelligence Foundation Language Models , https://arxiv.org/abs/2407.21075 30 Jul, ThinK: Thinner Key Cache by Query-Driven Pruning , https://arxiv.org/abs/2407.21018 31 Jul, The Llama 3 Herd of Models , https://arxiv.org/abs/2407.21783 31 Jul, Gemma 2: Improving Open Language Models at a Practical Size , https://arxiv.org/abs/2408.00118 1 Aug, S AM 2: Segment Anything in Images and Videos, https://arxiv.org/abs/2408.00714 2 Aug, POA: Pre-training Once for Models of All Sizes, https://arxiv.org/abs/2408.01031 2 Aug, RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework, https://arxiv.org/abs/2408.01262 2 Aug, A Survey of Mamba, https://arxiv.org/abs/2408.01129 3 Aug, MiniCPM-V: A GPT-4V Level MLLM on Your Phone, https://arxiv.org/abs/2408.01800 5 Aug, RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation, https://arxiv.org/abs/2408.02545 5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666 5 Aug, BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba, https://arxiv.org/abs/2408.02600 5 Aug, Self-Taught Evaluators, https://arxiv.org/abs/2408.02666 7 Aug, EXAONE 3.0 7.8B Instruction Tuned Language Model, https://arxiv.org/abs/2408.03541 7 Aug, 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data, https://arxiv.org/abs/2408.03506 8 Aug, Conversational Prompt Engineering, https://arxiv.org/abs/2408.04560 8 Aug, Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP, https://arxiv.org/abs/2408.04303 12 Aug, The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, https://arxiv.org/abs/2408.06292 15 Aug, Hermes 3 Technical Report, https://arxiv.org/abs/2408.12570 19 Aug, Customizing Language Models with Instance-wise LoRA for Sequential Recommendation, https://arxiv.org/abs/2408.10159 20 Aug , Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information, https://arxiv.org/abs/2408.10615 20 Aug, To Code, or Not To Code? Exploring Impact of Code in Pre-training, https://arxiv.org/abs/2408.10914 21 Aug , LLM Pruning and Distillation in Practice: The Minitron Approach, https://arxiv.org/abs/2408.11796 22 Aug, Jamba-1.5: Hybrid Transformer-Mamba Models at Scale, https://arxiv.org/abs/2408.12570 22 Aug, Controllable Text Generation for Large Language Models: A Survey, https://arxiv.org/abs/2408.12599 23 Aug, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 26 Aug, A Practitioner's Guide to Continual Multimodal Pretraining, https://arxiv.org/abs/2408.14471 26 Aug, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637 26 Aug, CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation, https://arxiv.org/abs/2408.14572 27 Aug, The Mamba in the Llama: Distilling and Accelerating Hybrid Models, https://arxiv.org/abs/2408.15237 28 Aug, ReMamba: Equip Mamba with Effective Long-Sequence Modeling, https://arxiv.org/abs/2408.15496 29 Aug, Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, https://arxiv.org/abs/2408.16737 31 Aug, LongRecipe: Recipe for Efficient Long Context Generalization in Large Languge Models, https://arxiv.org/abs/2409.00509 3 Sep, OLMoE: Open Mixture-of-Experts Language Models, https://arxiv.org/abs/2409.02060 3 Sep 2024, In Defense of RAG in the Era of Long-Context Language Models, https://arxiv.org/abs/2409.01666 5 Sep, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 5 Sep, LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA , https://arxiv.org/abs/2409.02897 5 Sep, How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data, https://arxiv.org/abs/2409.03810 6 Sep, T heory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 10 Sep, LLaMA-Omni: Seamless Speech Interaction with Large Language Models, https://arxiv.org/abs/2409.06666 10 Sep, What is the Role of Small Models in the LLM Era: A Survey, https://arxiv.org/abs/2409.06857 11 Sep, Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, https://arxiv.org/abs/2409.06957 16 Sep, RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval , https://arxiv.org/abs/2409.10516 18 Sep, Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement , https://arxiv.org/abs/2409.12122 18 Sep, Qwen2.5-Coder Technical Report , https://arxiv.org/abs/2409.12186 21 Sep, Instruction Following without Instruction Tuning, https://arxiv.org/abs/2409.14254 30 Sep, I s Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis, https://arxiv.org/abs/2409.20059 30 Sep, The Perfect Blend: Redefining RLHF with Mixture of Judges, https://arxiv.org/abs/2409.20370 (New paper by Meta on how they did RLHF for Llama 3) 1 Oct, Addition is All You Need for Energy-efficient Language Models, https://arxiv.org/abs/2410.00907 2 Oct Quantifying Generalization Complexity for Large Language Models, https://arxiv.org/abs/2410.01769 2 Oct, When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1 , https://arxiv.org/abs/2410.01792 2 Oct, W ere RNNs All We Needed? , https://arxiv.org/abs/2410.01201 3 Oct, Selective Attention Improves Transformer , https://arxiv.org/abs/2410.02703 3 Oct, LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations , https://arxiv.org/abs/2410.02707 3 Oct, LLaVA-Critic: Learning to Evaluate Multimodal Models , https://arxiv.org/abs/2410.02712 7 Oct, Differential Transformer , https://arxiv.org/abs/2410.05258 7 Oct, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models , https://arxiv.org/abs/2410.05229 8 Oct, ARIA: An Open Multimodal Native Mixture-of-Experts Model , https://arxiv.org/abs/2410.05993 8 Oct, O1 Replication Journey: A Strategic Progress Report -- Part 1 , https://arxiv.org/abs/2410.18982 8 Oct, Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG, https://arxiv.org/abs/2410.05983 9 Oct, From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning , https://arxiv.org/abs/2410.06456 10 Oct, KV Prediction for Improved Time to First Token , https://arxiv.org/abs/2410.08391 11 Oct, Baichuan-Omni Technical Report , https://arxiv.org/abs/2410.08565 13 Oct, MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models , https://arxiv.org/abs/2410.10139 13 Oct, LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models , https://arxiv.org/abs/2410.09732 15 Oct, AFlow: Automating Agentic Workflow Generation , https://arxiv.org/abs/2410.10762 15 Oct, Toward General Instruction-Following Alignment for Retrieval-Augmented Generation , https://arxiv.org/abs/2410.09584 21 Oct, Pre-training Distillation for Large Language Models: A Design Space Exploration , https://arxiv.org/abs/2410.16215 23 Oct, MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models , https://arxiv.org/abs/2410.17637 23 Oct, Scalable Ranked Preference Optimization for Text-to-Image Generation , https://arxiv.org/abs/2410.18013 23 Oct, Scaling Diffusion Language Models via Adaptation from Autoregressive Models , https://arxiv.org/abs/2410.17891 24 Oct, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback , https://arxiv.org/abs/2410.19133 25 Oct, Counting Ability of Large Language Models and Impact of Tokenization , https://arxiv.org/abs/2410.19730 25 Oct, A Survey of Small Language Models , https://arxiv.org/abs/2410.20011 26 Oct, Accelerating Direct Preference Optimization with Prefix Sharing , https://arxiv.org/abs/2410.20305 27 Oct, Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse , https://arxiv.org/abs/2410.21333 28 Oct, LongReward: Improving Long-context Large Language Models with AI Feedback , https://arxiv.org/abs/2410.21252 28 Oct, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference , https://arxiv.org/abs/2410.21465 29 Oct, Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications , https://arxiv.org/abs/2410.21943 30 Oct, CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation , https://arxiv.org/abs/2410.23090 31 Oct, What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective , https://arxiv.org/abs/2410.23743 31 Oct, GPT or BERT: why not both? , https://arxiv.org/abs/2410.24159 31 Oct, Language Models can Self-Lengthen to Generate Long Texts , https://arxiv.org/abs/2410.23933 1 Nov, Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations , https://arxiv.org/abs/2411.00640 1 Nov 2024, Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation , https://arxiv.org/abs/2411.00412 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models , https://arxiv.org/abs/2411.00492 3 Nov, S ample-Efficient Alignment for LLMs , https://arxiv.org/abs/2411.01493 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness , https://arxiv.org/abs/2411.03350 4 Nov, "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization , https://arxiv.org/abs/2411.02355 4 Nov, Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study , https://arxiv.org/abs/2411.02462 5 Nov, HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems , https://arxiv.org/abs/2411.02959 6 Nov, Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination , https://arxiv.org/abs/2411.03823 6 Nov, Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding , https://arxiv.org/abs/2411.04282 6 Nov, Number Cookbook: Number Understanding of Language Models and How to Improve It , https://arxiv.org/abs/2411.03766 7 Nov, Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models , https://arxiv.org/abs/2411.04996 7 Nov, BitNet a4.8: 4-bit Activations for 1-bit LLMs , https://arxiv.org/abs/2411.04965 7 Nov, Scaling Laws for Precision , https://arxiv.org/abs/2411.04330 8 Nov, Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation , https://arxiv.org/abs/2411.05966 8 Nov, Balancing Pipeline Parallelism with Vocabulary Parallelism , https://arxiv.org/abs/2411.05288 11 Nov, Toward Optimal Search and Retrieval for RAG , https://arxiv.org/abs/2411.07396 12 Nov, Large Language Models Can Self-Improve in Long-context Reasoning , https://arxiv.org/abs/2411.08147 12 Nov, Stronger Models are NOT Stronger Teachers for Instruction Tuning , https://arxiv.org/abs/2411.07133 12 Nov, Direct Preference Optimization Using Sparse Feature-Level Constraints , https://arxiv.org/abs/2411.07618 13 Nov, Cut Your Losses in Large-Vocabulary Language Models , https://arxiv.org/abs/2411.09009 15 Nov, Does Prompt Formatting Have Any Impact on LLM Performance? , https://arxiv.org/abs/2411.10541 17 Nov, SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization , https://arxiv.org/abs/2411.11909 17 Nov, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration , https://arxiv.org/abs/2411.10958 18 Nov, Bi-Mamba: Towards Accurate 1-Bit State Space Models , https://arxiv.org/abs/2411.11843 19 Nov, RedPajama: an Open Dataset for Training Large Language Models, https://arxiv.org/abs/2411.12372 20 Nov, Hymba: A Hybrid-head Architecture for Small Language Models , https://arxiv.org/abs/2411.13676 20 Nov, Loss-to-Loss Prediction: Scaling Laws for All Datasets , https://arxiv.org/abs/2411.12925 21 Nov, When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training , https://arxiv.org/abs/2411.13476 21 Nov, Multimodal Autoregressive Pre-training of Large Vision Encoders , https://arxiv.org/abs/2411.14402 21 Nov, Natural Language Reinforcement Learning , https://arxiv.org/abs/2411.14251 22 Nov, Large Multi-modal Models Can Interpret Features in Large Multi-modal Models , https://arxiv.org/abs/2411.14982 22 Nov, TÜLU 3: Pushing Frontiers in Open Language Model Post-Training , https://arxiv.org/abs/2411.15124 23 Nov, MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs , https://arxiv.org/abs/2411.15296 24 Nov, LLMs Do Not Think Step-by-step In Implicit Reasoning , https://arxiv.org/abs/2411.15862 25 Nov, O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? , https://arxiv.org/abs/2411.16489 26 Nov, Star Attention: Efficient LLM Inference over Long Sequences , https://arxiv.org/abs/2411.17116 27 Nov, Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens , https://arxiv.org/abs/2411.17691 27 Nov, Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration , https://arxiv.org/abs/2411.17686 29 Nov, Reverse Thinking Makes LLMs Stronger Reasoners , https://arxiv.org/abs/2411.19865 29 Nov, Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability , https://arxiv.org/abs/2411.19943 2 Dec, Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis , https://arxiv.org/abs/2412.01819 2 Dec, X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models , https://arxiv.org/abs/2412.01824 2 Dec, Free Process Rewards without Process Labels , https://arxiv.org/abs/2412.01981 3 Dec, Scaling Image Tokenizers with Grouped Spherical Quantization , https://arxiv.org/abs/2412.02632 3 Dec, RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models , https://arxiv.org/abs/2412.02830 4 Dec, Perception Tokens Enhance Visual Reasoning in Multimodal Language Models , https://arxiv.org/abs/2412.03548 4 Dec, Evaluating Language Models as Synthetic Data Generators , https://arxiv.org/abs/2412.03679 4 Dec, Best-of-N Jailbreaking , https://arxiv.org/abs/2412.03556 4 Dec, PaliGemma 2: A Family of Versatile VLMs for Transfer , https://arxiv.org/abs/2412.03555 5 Dec, VisionZip: Longer is Better but Not Necessary in Vision Language Models , https://arxiv.org/abs/2412.04467 5 Dec, Evaluating and Aligning CodeLLMs on Human Preference , https://arxiv.org/abs/2412.05210 6 Dec, MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale , https://arxiv.org/abs/2412.05237 6 Dec, Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling , https://arxiv.org/abs/2412.05271 7 Dec, LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods , https://arxiv.org/abs/2412.05579 8 Dec, Does RLHF Scale? Exploring the Impacts From Data, Model, and Method , https://arxiv.org/abs/2412.06000 9 Dec, Unraveling the Complexity of Memory in RL Agents: An Approach for Classification and Evaluation , https://arxiv.org/abs/2412.06531 9 Dec, Training Large Language Models to Reason in a Continuous Latent Space , https://arxiv.org/abs/2412.06769 9 Dec, AutoReason: Automatic Few-Shot Reasoning Decomposition , https://arxiv.org/abs/2412.06975 11 Dec, Large Concept Models: Language Modeling in a Sentence Representation Space , https://arxiv.org/abs/2412.08821 12 Dec, Phi-4 Technical Report , https://arxiv.org/abs/2412.08905 13 Dec, Byte Latent Transformer: Patches Scale Better Than Tokens , https://arxiv.org/abs/2412.09871 13 Dec, SCBench: A KV Cache-Centric Analysis of Long-Context Methods , https://arxiv.org/abs/2412.10319 13 Dec, Cultural Evolution of Cooperation among LLM Agents , https://arxiv.org/abs/2412.10270 13 Dec, DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding , https://arxiv.org/abs/2412.10302 16 Dec, No More Adam: Learning Rate Scaling at Initialization is All You Need , https://arxiv.org/abs/2412.11768 16 Dec, Precise Length Control in Large Language Models , https://arxiv.org/abs/2412.11937 16 Dec, The Open Source Advantage in Large Language Models (LLMs) , https://arxiv.org/abs/2412.12004 16 Dec, A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges , https://arxiv.org/abs/2412.11936 17 Dec, Are Your LLMs Capable of Stable Reasoning? , https://arxiv.org/abs/2412.13147 18 Dec, LLM Post-Training Recipes, Improving Reasoning in LLMs , https://arxiv.org/abs/2412.14135 18 Dec, Hansel: Output Length Controlling Framework for Large Language Models , https://arxiv.org/abs/2412.14033 18 Dec, Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning , https://arxiv.org/abs/2412.13631 18 Dec, Alignment Faking in Large Language Models , https://arxiv.org/abs/2412.14093 18 Dec, SCOPE: Optimizing Key-Value Cache Compression in Long-Context Generation , https://arxiv.org/abs/2412.13649 19 Dec, LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks , https://arxiv.org/abs/2412.15204 20 Dec, Offline Reinforcement Learning for LLM Multi-Step Reasoning , https://arxiv.org/abs/2412.16145 24 Dec, Mulberry: Empowering MLLM with O1-like Reasoning and Reflection via Collective Monte Carlo Tree Search , https://arxiv.org/abs/2412.18319 31 Dec, Titans: Learning to Memorize at Test Time , https://arxiv.org/abs/2501.00663

Odin

HTML Perl

0 views

Ahead of AI 1 years ago

Understanding Multimodal LLMs

It was a wild two months. There have once again been many developments in AI research, with two Nobel Prizes awarded to AI and several interesting research papers published. Among others, Meta AI released their latest Llama 3.2 models, which include open-weight versions for the 1B and 3B large language models and two multimodal models. In this article, I aim to explain how multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks (including Llama 3.2) to compare their approaches. (To see a table of contents menu, click on the stack of lines on the left-hand side.) An illustration of a multimodal LLM that can accept different input modalities (audio, text, images, and videos) and returns text as the output modality. But before we begin, I also have some exciting news to share on the personal front! My book, "Build A Large Language Model (From Scratch)" , is now finally available on Amazon ! Build a Large Language Model (From Scratch) now available on Amazon Writing this book was a tremendous effort, and I’m incredibly grateful for all the support and motivating feedback over the past two years—especially in these last couple of months, as so many kind readers have shared their feedback. Thank you all, and as an author, there is nothing more motivating than to hear that the book makes a difference in your careers! For those who have finished the book and are eager for more, stay tuned! I’ll be adding some bonus content to the GitHub repository in the coming months. P.S. If you have read the book, I'd really appreciate it if you could leave a brief review ; it truly helps us authors! What are multimodal LLMs? As hinted at in the introduction, multimodal LLMs are large language models capable of processing multiple types of inputs, where each "modality" refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more. For simplicity, we will primarily focus on the image modality alongside text inputs. A classic and intuitive application of multimodal LLMs is image captioning: you provide an input image, and the model generates a description of the image, as shown in the figure below. Example use of a multimodal LLM explaining a meme . Of course, there are many other use cases. For example, one of my favorites is extracting information from a PDF table and converting it into LaTeX or Markdown. There are two main approaches to building multimodal LLMs: Method A: Unified Embedding Decoder Architecture approach; Method B: Cross-modality Attention Architecture approach. (By the way, I don’t believe official terms for these techniques exist yet, but let me know if you’ve come across any. For instance, briefer descriptions may be "decoder-only" and "cross-attention-based" approaches.) The two main approaches to developing multimodal LLM architectures. As shown in the figure above, the Unified Embedding-Decoder Architecture utilizes a single decoder model, much like an unmodified LLM architecture such as GPT-2 or Llama 3.2. In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both text and image input tokens together after concatenation. The Cross-Modality Attention Architecture employs a cross-attention mechanism to integrate image and text embeddings directly within the attention layer. In the following sections, we will explore how these methods work on a conceptual level. Then, we will look at recent research papers on multimodal LLMs to see how they are applied in practice. Let’s begin with the unified embedding decoder architecture, illustrated again in the figure below. Illustration of the unified embedding decoder architecture, which is an unmodified decoder-style LLM (like GPT-2, Phi-3, Gemma, or Llama 3.2) that receives inputs consisting of image token and text token embeddings. In the unified embedding-decoder architecture, an image is converted into embedding vectors, similar to how input text is converted into embeddings in a standard text-only LLM. For a typical text-only LLM that processes text, the text input is usually tokenized (e.g., using Byte-Pair Encoding) and then passed through an embedding layer, as shown in the figure below. Illustration of the standard process for tokenizing text and converting it into token embedding vectors, which are subsequently passed to an LLM during training and inference. Analogous to the tokenization and embedding of text, image embeddings are generated using an image encoder module (instead of a tokenizer), as shown in the figure below. Illustration of the process for encoding an image into image patch embeddings. What happens inside the image encoder shown above? To process an image, we first divide it into smaller patches, much like breaking words into subwords during tokenization. These patches are then encoded by a pretrained vision transformer (ViT), as shown in the figure below. Illustration of a classic vision transformer (ViT) setup, similar to the model proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020). Note that ViTs are often used for classification tasks, so I included the classification head in the figure above. However, in this case, we only need the image encoder part. The "linear projection" shown in the previous figure consists of a single linear layer (i.e., a fully connected layer). The purpose of this layer is to project the image patches, which are flattened into a vector, into an embedding size compatible with the transformer encoder. This linear projection is illustrated in the figure below. An image patch, flattened into a 256-dimensional vector, is up-projected to a 768-dimensional vector. Illustration of a linear projection layer that projects flattened image patches from a 256-dimensional into a 768-dimensional embedding space. For those who prefer seeing a code example, In PyTorch code, we could implement the linear projection for the image patches as follows: If you have read my Machine Learning Q and AI book by chance, you may know there are ways to replace linear layers with convolution operations that can be implemented to be mathematically equivalent. Here, this can be especially handy as we can combine the creation of patches and projection into two lines of code: Now that we briefly discussed the purpose of the image encoder (and the linear projection that is part of the encoder), let's return to the text tokenization analogy from earlier and look at text and image tokenization and embedding side by side, as depicted in the figure below. Image tokenization and embedding (left) and text tokenization and embedding (right) side by side. As you can see in the figure above, I included an additional projector module that follows the image encoder. This projector is usually just another linear projection layer that is similar to the one explained earlier. The purpose is to project the image encoder outputs into a dimension that matches the dimensions of the embedded text tokens, as illustrated in the figure below. (As we will see later, the projector is sometimes also called adapter, adaptor, or connector.) Another side-by-side comparison between image tokenization and text tokenization, where the role of the projector is to match the text token embedding dimensions. Now that the image patch embeddings have the same embedding dimension as the text token embeddings, we can simply concatenate them as input to the LLM, as shown in the figure at the beginning of this section. Below is the same figure again for easier reference. After projecting the image patch tokens into the same dimension as the text token embeddings, we can simply concatenate them as input to a standard LLM. By the way, the image encoder we discussed in this section is usually a pretrained vision transformer. A popular choice is CLIP or OpenCLIP . However, there are also versions of Method A that operate directly on patches, such as Fuyu , which is shown in the figure below. Annotated figure of the Fuyu multimodal LLM that operates directly on the image patches without image encoder. (Annotated figure from https://www.adept.ai/blog/fuyu-8b .) As illustrated in the figure above, Fuyu passes the input patches directly into a linear projection (or embedding layer) to learn its own image patch embeddings rather than relying on an additional pretrained image encoder like other models and methods do. This greatly simplifies the architecture and training setup. Now that we have discussed the unified embedding decoder architecture approach to building multimodal LLMs and understand the basic concept behind image encoding, let's talk about an alternative way of implementing multimodal LLMs via cross-attention, as summarized in the figure below. An illustration of the Cross-Modality Attention Architecture approach to building multimodal LLMs. In the Cross-Modality Attention Architecture method depicted in the figure above, we still use the same image encoder setup we discussed previously. However, instead of encoding the patches as input to the LLM, we connect the input patches in the multi-head attention layer via a cross-attention mechanism. The idea is related and goes back to the original transformer architecture from the 2017 Attention Is All You Need paper, highlighted in the figure below. High-level illustration of the cross-attention mechanism used in the original transformer architecture. (Annotated figure from the "Attention Is All You Need" paper: https://arxiv.org/abs/1706.03762.) Note that the original "Attention Is All You Need" transformer depicted in the figure above was originally developed for language translation. So, it consists of a text en coder (left part of the figure) that takes the sentence to be translated and generates the translation via a text de coder (right part of the figure). In the context of multimodal LLM, the encoder is an image encoder instead of a text encoder, but the same idea applies. How does cross-attention work? Let's have a look at a conceptual drawing of what happens inside the regular self-attention mechanism. Outline of the regular self-attention mechanism. (This flow depicts one of the heads in a regular multi-head attention module.) In the figure above, x is the input, and W q is a weight matrix used to generate the queries ( Q ). Similarly, K stands for keys, and V stands for values. A represents the attention scores matrix, and Z are the inputs (x) transformed into the output context vectors. (If this seems confusing, you may find a comprehensive introduction in Chapter 3 of my Build a Large Language Model from Scratch book helpful; alternatively, you may also find my article, Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs helpful here.) In cross-attention, in contrast to self-attention, we have two different input sources, as illustrated in the following figure. Illustration of cross attention, where there can be two different inputs x 1 and x 2 As illustrated in the previous two figures, in self-attention, we work with the same input sequence. In cross-attention, we mix or combine two different input sequences. In the case of the original transformer architecture in the Attention Is All You Need paper, the two inputs x 1 and x 2 correspond to the sequence returned by the encoder module on the left ( x 2 ) and the input sequence being processed by the decoder part on the right ( x 1 ). In the context of a multimodal LLM, x 2 is the output of an image encoder. (Note that the queries usually come from the decoder, and the keys and values typically come from the encoder.) Note that in cross-attention, the two input sequences x 1 and x 2 can have different numbers of elements. However, their embedding dimensions must match. If we set x 1 = x 2 , this is equivalent to self-attention. Now that we have talked a bit about the two major multimodal design choices, let's briefly talk about how we deal with the three major components during model training, which are summarized in the figure below. An overview of the different components in a multimodal LLM. The components numbered 1-3 can be frozen or unfrozen during the multimodal training process. Similar to the development of traditional text-only LLMs, the training of multimodal LLMs also involves two phases: pretraining and instruction finetuning. However, unlike starting from scratch, multimodal LLM training typically begins with a pretrained, instruction-finetuned text-only LLM as the base model. For the image encoder, CLIP is commonly used and often remains unchanged during the entire training process, though there are exceptions, as we will explore later. Keeping the LLM part frozen during the pretraining phase is also usual, focusing only on training the projector—a linear layer or a small multi-layer perceptron. Given the projector's limited learning capacity, usually comprising just one or two layers, the LLM is often unfrozen during multimodal instruction finetuning (stage 2) to allow for more comprehensive updates. However, note that in the cross-attention-based models (Method B), the cross-attention layers are unfrozen throughout the entire training process. After introducing the two primary approaches (Method A: Unified Embedding Decoder Architecture and Method B: Cross-modality Attention Architecture), you might be wondering which is more effective. The answer depends on specific trade-offs. The Unified Embedding Decoder Architecture (Method A) is typically easier to implement since it doesn't require any modifications to the LLM architecture itself. The Cross-modality Attention Architecture (Method B) is often considered more computationally efficient because it doesn't overload the input context with additional image tokens, introducing them later in the cross-attention layers instead. Additionally, this approach maintains the text-only performance of the original LLM if the LLM parameters are kept frozen during training. We will revisit the discussion on modeling performance and response quality in a later section, where we will discuss NVIDIA's NVLM paper. This marks the end of what turned out to be a rather extensive introduction to multimodal LLMs. As I write this, I realize that the discussion has become lengthier than initially planned, which probably makes this a good place to conclude the article. However, to provide a practical perspective, it would be nice to examine a few recent research papers that implement these approaches. So, we will explore these papers in the remaining sections of this article. For the remainder of this article, I will review recent literature concerning multimodal LLMs, focusing specifically on works published in the last few weeks to maintain a reasonable scope. Thus, this is not a historical overview or comprehensive review of multimodal LLMs but rather a brief look at the latest developments. I will also try to keep these summaries short and without too much fluff as there are 10 of them. The conclusion section at the end of this has an overview that compares the methods used in these papers. The Llama 3 Herd of Models paper (July 31, 2024) by Meta AI came out earlier this summer, which feels like ages ago in LLM terms. However, given that they only described but did not release their multimodal models until much later, I think it's fair to include Llama 3 in this list. (Llama 3.2 models were officially announced and made available on September 25.) The multimodal Llama 3.2 models, which come in an 11-billion and 90-billion parameter version, are image-text models that use the previously described cross-attention-based approach, which is illustrated in the figure below. Illustration of the multimodal LLM approach used by Llama 3.2. (Annotated figure from the Llama 3 paper: https://arxiv.org/abs/2407.21783.The video and speech parts are visually occluded to focus the attention on the image part.) Note that while the figure also depicts video and speech as possible modalities, the models that were released as of this writing focus only on image and text. Llama 3.2 uses the cross-attention-based approach. However, it differs a bit from what I wrote about earlier, namely that in multimodal LLM development, we usually freeze the image encoder and only update the LLM parameters during pretraining. Here, the researchers almost take the opposite approach: they update the image encoder but do not update the language model's parameters. They write that this is intentional and done to preserve the text-only capabilities so that the 11B and 90B multimodal models can be used as drop-in replacements for the Llama 3.1 8B and 70B text-only model on text tasks. The training itself is done in multiple iterations, starting with the Llama 3.1 text models. After adding the image encoder and projection (here called "adapter") layers, they pretrain the model on image-text data. Then, similar to the Llama 3 model text-only training (I wrote about it in an earlier article ), they follow up with instruction and preference finetuning. Instead of adopting a pretrained model such as CLIP as an image encoder, the researchers used a vision transformer that they pretrained from scratch. Specifically, they adopted the ViT-H/14 variant (630 million parameters) of the classic vision transformer architecture ( Dosovitskiy et al., 2020 ). They then pretrained the ViT on a dataset of 2.5 billion image-text pairs over five epochs; this was done before connecting the image encoder to the LLM. (The image encoder takes 224×224 resolution images and divides them into a 14×14 grid of patches, with each patch sized at 16×16 pixels.) As the cross-attention layers add a substantial amount of parameters, they are only added in every fourth transformer block. (For the 8B model, this adds 3B parameters, and for the 70B model, this adds 20 billion parameters.) The Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper (September 25, 2024) is notable because it promises to open source not only the model weights but also the dataset and source code similar to the language-only OLMo LLM. (This is great for LLM research as it allows us to take a look at the exact training procedure and code and also lets us run ablation studies and reproduce results on the same dataset.) If you are wondering why there are two names in the paper title, Molmo refers to the model (Multimodal Open Language Model), and PixMo (Pixels for Molmo) is the dataset. Illustration of the Molmo decoder-only approach (Method A). Annotated figure adapted from the Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper: https://www.arxiv.org/abs/2409.17146. As illustrated in the figure above, the image encoder employs an off-the-shelf vision transformer, specifically CLIP. The term "connector" here refers to a "projector" that aligns image features with the language model. Molmo streamlines the training process by avoiding multiple pretraining stages, choosing instead a simple pipeline that updates all parameters in a unified approach—including those of the base LLM, the connector, and the image encoder. The Molmo team offers several options for the base LLM: OLMo-7B-1024 (a fully open model backbone), OLMoE-1B-7B (a mixture-of-experts architecture; the most efficient model), Qwen2 7B (an open-weight model that performs better than OLMo-7B-1024), Qwen2 72B (an open-weight model and the best-performing model) NVIDIA's NVLM: Open Frontier-Class Multimodal LLMs paper (September 17, 2024) is particularly interesting because, rather than focusing on a single approach, it explores both methods: Method A, the Unified Embedding Decoder Architecture ("decoder-only architecture," NVLM-D), and Method B, the Cross-Modality Attention Architecture ("cross-attention-based architecture," NVLM-X). Additionally, they develop a hybrid approach (NVLM-H) and provide an apples-to-apples comparison of all three methods. Overview of the three multimodal approaches. (Annotated figure from the NVLM: Open Frontier-Class Multimodal LLMs paper: https://arxiv.org/abs/2409.11402) As summarized in the figure below, NVLM-D corresponds to Method A, and NVLM-X corresponds to Method B, as discussed earlier. The concept behind the hybrid model (NVLM-H) is to combine the strengths of both methods: an image thumbnail is provided as input, followed by a dynamic number of patches passed through cross-attention to capture finer high-resolution details. In short, the research team find that: NVLM-X demonstrates superior computational efficiency for high-resolution images. NVLM-D achieves higher accuracy in OCR-related tasks. NVLM-H combines the advantages of both methods. Similar to Molmo and other approaches, they begin with a text-only LLM rather than pretraining a multimodal model from scratch (as this generally performs better). Additionally, they use an instruction-tuned LLM instead of a base LLM. Specifically, the backbone LLM is Qwen2-72B-Instruct (to my knowledge, Molmo used the Qwen2-72B base model). While training all LLM parameters in the NVLM-D approach, they found that for NVLM-X, it works well to freeze the original LLM parameters and train only the cross-attention layers during both pretraining and instruction finetuning. For the image encoder, instead of using a typical CLIP model, they use InternViT-6B , which remains frozen throughout all stages. The projector is a multilayer perceptron rather than a single linear layer. The previous two papers and models, Molmo and NVLM, were based on Qwen2-72B LLM. In this paper, the Qwen research team itself announces a multimodal LLM, Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (October 3rd, 2024). At the core of this work is their so-called "Naive Dynamic Resolution" mechanism (the term "naive" is intentional and not a typo for "native," though "native" could also be fitting). This mechanism allows the model to handle images of varying resolutions without simple downsampling, enabling the input of images in their original resolution. An overview of the multimodal Qwen model, which can process input images with various different resolutions natively. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The native resolution input is implemented via a modified ViT by removing the original absolute position embeddings and introducing 2D-RoPE. They used a classic vision encoder with 675M parameters and LLM backbones of varying sizes, as shown in the table below. The components of the different Qwen2-VL models. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The training itself consists of 3 stages: (1) pretraining only the image encoder, (2) unfreezing all parameters (including LLM), and (3) freezing the image encoder and instruction-finetuning only the LLM. Pixtral 12B (September 17, 2024), which uses the Method A: Unified Embedding Decoder Architecture approach, is the first multimodal model from Mistral AI. Unfortunately, there is no technical paper or report available, but the Mistral team shared a few interesting tidbits in their blog post . Interestingly, they chose not to use a pretrained image encoder, instead training one with 400 million parameters from scratch. For the LLM backbone, they used the 12-billion-parameter Mistral NeMo model. Similar to Qwen2-VL, Pixtral also supports variable image sizes natively, as illustrated in the figure below. Illustration of how Pixtral processes images of different sizes. (Annotated figure from the Pixtral blog post: https://mistral.ai/news/pixtral-12b/) The MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning paper (September 30, 2024) provides practical tips and introduces a mixture-of-experts multimodal model alongside a dense model similar to Molmo. The models span a wide size range, from 1 billion to 30 billion parameters. The models described in this paper focuse on Method A, a Unified Embedding Transformer Architecture, which structures inputs effectively for multimodal learning. In addition, the paper has a series of interesting ablation studies looking into data mixtures and the effects of using coordinate tokens. Illustration of the MM1.5 approach, which includes additional coordinate tokens to denote bounding boxes. (Annotated figure from the MM1.5 paper: https://arxiv.org/abs/2409.20566.) The Aria: An Open Multimodal Native Mixture-of-Experts Model paper (October 8, 2024) introduces another mixture-of-experts model approach, similar to one of the variants in the Molmo and MM1.5 lineups. The Aria model has 24.9 billion parameters, with 3.5 billion parameters allocated per text token. The image encoder ( SigLIP ) has 438-million-parameters. This model is based on a cross-attention approach with the following overall training procedure: Training the LLM backbone entirely from scratch. Pretraining both the LLM backbone and the vision encoder. The Baichuan-Omni Technical Report (October 11, 2024) introduces Baichuan-Omni, a 7-billion-parameter multimodal LLM based on Method A: the Unified Embedding Decoder Architecture approach, as shown in the figure below. An overview of the Baichuan-Omni model, which can handle various input modalities. (Annotated figure from the Baichuan-Omni paper: https://arxiv.org/abs/2410.08565) The training process for Baichuan-Omni involves a three-stage approach: Projector training : Initially, only the projector is trained, while both the vision encoder and the language model (LLM) remain frozen. Vision encoder training : Next, the vision encoder is unfrozen and trained, with the LLM still frozen. Full model training : Finally, the LLM is unfrozen, allowing the entire model to be trained end-to-end. The model utilizes the SigLIP vision encoder and incorporates the AnyRes module to handle high-resolution images through down-sampling techniques. While the report does not explicitly specify the LLM backbone, it is likely based on the Baichuan 7B LLM, given the model's parameter size and the naming convention. The Emu3: Next-Token Prediction is All You Need paper (September 27, 2024) presents a compelling alternative to diffusion models for image generation, which is solely based on a transformer-based decoder architecture. Although it's not a multimodal LLM in the classic sense (i.e., models focused on image understanding rather than generation), Emu3 is super interesting as it demonstrates that it's possible to use transformer decoders for image generation, which is a task typically dominated by diffusion methods. (However, note that there have been other similar approaches before, such as Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation .) Emu3 is primarily an LLM for image generation as an alternative to diffusion models. (Annotated figure from the Emu3 paper: https://arxiv.org/abs/2409.18869) The researchers trained Emu3 from scratch and then used Direct Preference Optimization (DPO) to align the model with human preferences. The architecture includes a vision tokenizer inspired by SBER-MoVQGAN . The core LLM architecture is based on Llama 2, yet it is trained entirely from scratch. We previously focused on multimodal LLMs for image understanding and just saw one example for image generation with Emu 3 above. Now, the Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation paper (October 17, 2024) introduces a framework that unifies multimodal understanding and generation tasks within a single LLM backbone. A key feature of Janus is the decoupling of visual encoding pathways to address the distinct requirements of understanding and generation tasks. The researchers argue that image understanding tasks require high-dimensional semantic representations, while generation tasks require detailed local information and global consistency in images. By separating these pathways, Janus effectively manages these differing needs. The model employs the SigLIP vision encoder, similar to that used in Baichuan-Omni, for processing visual inputs. For image generation, it utilizes a Vector Quantized (VQ) tokenizer to handle the generation process. The base LLM in Janus is the DeepSeek-LLM with 1.3 billion parameters. An overview of the unified decoder-only framework used in Janus. (Annotated figure from the Janus paper: https://arxiv.org/abs/2410.13848.) The training process for the model in this image follows three stages, as shown in the figure below. Illustration of the 3-stage training process of the Janus model. (Annotated figure from the Janus paper: https://arxiv.org/abs/2410.13848) In Stage I, only the projector layers and image output layer are trained while the LLM, understanding, and generation encoders remain frozen. In Stage II, the LLM backbone and text output layer are unfrozen, allowing for unified pretraining across understanding and generation tasks. Finally, in Stage III, the entire model, including the SigLIP image encoder, is unfrozen for supervised fine-tuning, enabling the model to fully integrate and refine its multimodal capabilities. As you may have noticed, I almost entirely skipped both the modeling and the computational performance comparisons. First, comparing the performance of LLMs and multimodal LLMs on public benchmarks is challenging due to prevalent data contamination, meaning that the test data may have been included in the training data. Additionally, the architectural components vary so much that making an apples-to-apples comparison is difficult. So, big kudos to the NVIDIA team for developing NVLM in different flavors, which allowed for a comparison between the decoder-only and cross-attention approaches at least. In any case, the main takeaway from this article is that multimodal LLMs can be built successfully in many different ways. Below is a figure that summarizes the different components of the models covered in this article. An overview of the different models covered in this article along with their subcomponents and training approaches. I hope you found reading this article educational and now have a better understanding of how multimodal LLMs work! This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly. Your support means a great deal! Thank you! An illustration of a multimodal LLM that can accept different input modalities (audio, text, images, and videos) and returns text as the output modality. But before we begin, I also have some exciting news to share on the personal front! My book, "Build A Large Language Model (From Scratch)" , is now finally available on Amazon ! Build a Large Language Model (From Scratch) now available on Amazon Writing this book was a tremendous effort, and I’m incredibly grateful for all the support and motivating feedback over the past two years—especially in these last couple of months, as so many kind readers have shared their feedback. Thank you all, and as an author, there is nothing more motivating than to hear that the book makes a difference in your careers! For those who have finished the book and are eager for more, stay tuned! I’ll be adding some bonus content to the GitHub repository in the coming months. P.S. If you have read the book, I'd really appreciate it if you could leave a brief review ; it truly helps us authors! 1. Use cases of multimodal LLMs What are multimodal LLMs? As hinted at in the introduction, multimodal LLMs are large language models capable of processing multiple types of inputs, where each "modality" refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more. For simplicity, we will primarily focus on the image modality alongside text inputs. A classic and intuitive application of multimodal LLMs is image captioning: you provide an input image, and the model generates a description of the image, as shown in the figure below. Example use of a multimodal LLM explaining a meme . Of course, there are many other use cases. For example, one of my favorites is extracting information from a PDF table and converting it into LaTeX or Markdown. 2. Common approaches to building multimodal LLMs There are two main approaches to building multimodal LLMs: Method A: Unified Embedding Decoder Architecture approach; Method B: Cross-modality Attention Architecture approach. The two main approaches to developing multimodal LLM architectures. As shown in the figure above, the Unified Embedding-Decoder Architecture utilizes a single decoder model, much like an unmodified LLM architecture such as GPT-2 or Llama 3.2. In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both text and image input tokens together after concatenation. The Cross-Modality Attention Architecture employs a cross-attention mechanism to integrate image and text embeddings directly within the attention layer. In the following sections, we will explore how these methods work on a conceptual level. Then, we will look at recent research papers on multimodal LLMs to see how they are applied in practice. 2.1 Method A: Unified Embedding Decoder Architecture Let’s begin with the unified embedding decoder architecture, illustrated again in the figure below. Illustration of the unified embedding decoder architecture, which is an unmodified decoder-style LLM (like GPT-2, Phi-3, Gemma, or Llama 3.2) that receives inputs consisting of image token and text token embeddings. In the unified embedding-decoder architecture, an image is converted into embedding vectors, similar to how input text is converted into embeddings in a standard text-only LLM. For a typical text-only LLM that processes text, the text input is usually tokenized (e.g., using Byte-Pair Encoding) and then passed through an embedding layer, as shown in the figure below. Illustration of the standard process for tokenizing text and converting it into token embedding vectors, which are subsequently passed to an LLM during training and inference. 2.1.1 Understanding Image encoders Analogous to the tokenization and embedding of text, image embeddings are generated using an image encoder module (instead of a tokenizer), as shown in the figure below. Illustration of the process for encoding an image into image patch embeddings. What happens inside the image encoder shown above? To process an image, we first divide it into smaller patches, much like breaking words into subwords during tokenization. These patches are then encoded by a pretrained vision transformer (ViT), as shown in the figure below. Illustration of a classic vision transformer (ViT) setup, similar to the model proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020). Note that ViTs are often used for classification tasks, so I included the classification head in the figure above. However, in this case, we only need the image encoder part. 2.1.2 The role of the linear projection module The "linear projection" shown in the previous figure consists of a single linear layer (i.e., a fully connected layer). The purpose of this layer is to project the image patches, which are flattened into a vector, into an embedding size compatible with the transformer encoder. This linear projection is illustrated in the figure below. An image patch, flattened into a 256-dimensional vector, is up-projected to a 768-dimensional vector. Illustration of a linear projection layer that projects flattened image patches from a 256-dimensional into a 768-dimensional embedding space. For those who prefer seeing a code example, In PyTorch code, we could implement the linear projection for the image patches as follows: If you have read my Machine Learning Q and AI book by chance, you may know there are ways to replace linear layers with convolution operations that can be implemented to be mathematically equivalent. Here, this can be especially handy as we can combine the creation of patches and projection into two lines of code: 2.1.3 Image vs text tokenization Now that we briefly discussed the purpose of the image encoder (and the linear projection that is part of the encoder), let's return to the text tokenization analogy from earlier and look at text and image tokenization and embedding side by side, as depicted in the figure below. Image tokenization and embedding (left) and text tokenization and embedding (right) side by side. As you can see in the figure above, I included an additional projector module that follows the image encoder. This projector is usually just another linear projection layer that is similar to the one explained earlier. The purpose is to project the image encoder outputs into a dimension that matches the dimensions of the embedded text tokens, as illustrated in the figure below. (As we will see later, the projector is sometimes also called adapter, adaptor, or connector.) Another side-by-side comparison between image tokenization and text tokenization, where the role of the projector is to match the text token embedding dimensions. Now that the image patch embeddings have the same embedding dimension as the text token embeddings, we can simply concatenate them as input to the LLM, as shown in the figure at the beginning of this section. Below is the same figure again for easier reference. After projecting the image patch tokens into the same dimension as the text token embeddings, we can simply concatenate them as input to a standard LLM. By the way, the image encoder we discussed in this section is usually a pretrained vision transformer. A popular choice is CLIP or OpenCLIP . However, there are also versions of Method A that operate directly on patches, such as Fuyu , which is shown in the figure below. Annotated figure of the Fuyu multimodal LLM that operates directly on the image patches without image encoder. (Annotated figure from https://www.adept.ai/blog/fuyu-8b .) As illustrated in the figure above, Fuyu passes the input patches directly into a linear projection (or embedding layer) to learn its own image patch embeddings rather than relying on an additional pretrained image encoder like other models and methods do. This greatly simplifies the architecture and training setup. 2.2 Method B: Cross-Modality Attention Architecture Now that we have discussed the unified embedding decoder architecture approach to building multimodal LLMs and understand the basic concept behind image encoding, let's talk about an alternative way of implementing multimodal LLMs via cross-attention, as summarized in the figure below. An illustration of the Cross-Modality Attention Architecture approach to building multimodal LLMs. In the Cross-Modality Attention Architecture method depicted in the figure above, we still use the same image encoder setup we discussed previously. However, instead of encoding the patches as input to the LLM, we connect the input patches in the multi-head attention layer via a cross-attention mechanism. The idea is related and goes back to the original transformer architecture from the 2017 Attention Is All You Need paper, highlighted in the figure below. High-level illustration of the cross-attention mechanism used in the original transformer architecture. (Annotated figure from the "Attention Is All You Need" paper: https://arxiv.org/abs/1706.03762.) Note that the original "Attention Is All You Need" transformer depicted in the figure above was originally developed for language translation. So, it consists of a text en coder (left part of the figure) that takes the sentence to be translated and generates the translation via a text de coder (right part of the figure). In the context of multimodal LLM, the encoder is an image encoder instead of a text encoder, but the same idea applies. How does cross-attention work? Let's have a look at a conceptual drawing of what happens inside the regular self-attention mechanism. Outline of the regular self-attention mechanism. (This flow depicts one of the heads in a regular multi-head attention module.) In the figure above, x is the input, and W q is a weight matrix used to generate the queries ( Q ). Similarly, K stands for keys, and V stands for values. A represents the attention scores matrix, and Z are the inputs (x) transformed into the output context vectors. (If this seems confusing, you may find a comprehensive introduction in Chapter 3 of my Build a Large Language Model from Scratch book helpful; alternatively, you may also find my article, Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs helpful here.) In cross-attention, in contrast to self-attention, we have two different input sources, as illustrated in the following figure. Illustration of cross attention, where there can be two different inputs x 1 and x 2 As illustrated in the previous two figures, in self-attention, we work with the same input sequence. In cross-attention, we mix or combine two different input sequences. In the case of the original transformer architecture in the Attention Is All You Need paper, the two inputs x 1 and x 2 correspond to the sequence returned by the encoder module on the left ( x 2 ) and the input sequence being processed by the decoder part on the right ( x 1 ). In the context of a multimodal LLM, x 2 is the output of an image encoder. (Note that the queries usually come from the decoder, and the keys and values typically come from the encoder.) Note that in cross-attention, the two input sequences x 1 and x 2 can have different numbers of elements. However, their embedding dimensions must match. If we set x 1 = x 2 , this is equivalent to self-attention. 3. Unified decoder and cross-attention model training Now that we have talked a bit about the two major multimodal design choices, let's briefly talk about how we deal with the three major components during model training, which are summarized in the figure below. An overview of the different components in a multimodal LLM. The components numbered 1-3 can be frozen or unfrozen during the multimodal training process. Similar to the development of traditional text-only LLMs, the training of multimodal LLMs also involves two phases: pretraining and instruction finetuning. However, unlike starting from scratch, multimodal LLM training typically begins with a pretrained, instruction-finetuned text-only LLM as the base model. For the image encoder, CLIP is commonly used and often remains unchanged during the entire training process, though there are exceptions, as we will explore later. Keeping the LLM part frozen during the pretraining phase is also usual, focusing only on training the projector—a linear layer or a small multi-layer perceptron. Given the projector's limited learning capacity, usually comprising just one or two layers, the LLM is often unfrozen during multimodal instruction finetuning (stage 2) to allow for more comprehensive updates. However, note that in the cross-attention-based models (Method B), the cross-attention layers are unfrozen throughout the entire training process. After introducing the two primary approaches (Method A: Unified Embedding Decoder Architecture and Method B: Cross-modality Attention Architecture), you might be wondering which is more effective. The answer depends on specific trade-offs. The Unified Embedding Decoder Architecture (Method A) is typically easier to implement since it doesn't require any modifications to the LLM architecture itself. The Cross-modality Attention Architecture (Method B) is often considered more computationally efficient because it doesn't overload the input context with additional image tokens, introducing them later in the cross-attention layers instead. Additionally, this approach maintains the text-only performance of the original LLM if the LLM parameters are kept frozen during training. We will revisit the discussion on modeling performance and response quality in a later section, where we will discuss NVIDIA's NVLM paper. This marks the end of what turned out to be a rather extensive introduction to multimodal LLMs. As I write this, I realize that the discussion has become lengthier than initially planned, which probably makes this a good place to conclude the article. However, to provide a practical perspective, it would be nice to examine a few recent research papers that implement these approaches. So, we will explore these papers in the remaining sections of this article. 4. Recent multimodal models and methods For the remainder of this article, I will review recent literature concerning multimodal LLMs, focusing specifically on works published in the last few weeks to maintain a reasonable scope. Thus, this is not a historical overview or comprehensive review of multimodal LLMs but rather a brief look at the latest developments. I will also try to keep these summaries short and without too much fluff as there are 10 of them. The conclusion section at the end of this has an overview that compares the methods used in these papers. 4.1 The Llama 3 Herd of Models The Llama 3 Herd of Models paper (July 31, 2024) by Meta AI came out earlier this summer, which feels like ages ago in LLM terms. However, given that they only described but did not release their multimodal models until much later, I think it's fair to include Llama 3 in this list. (Llama 3.2 models were officially announced and made available on September 25.) The multimodal Llama 3.2 models, which come in an 11-billion and 90-billion parameter version, are image-text models that use the previously described cross-attention-based approach, which is illustrated in the figure below. Illustration of the multimodal LLM approach used by Llama 3.2. (Annotated figure from the Llama 3 paper: https://arxiv.org/abs/2407.21783.The video and speech parts are visually occluded to focus the attention on the image part.) Note that while the figure also depicts video and speech as possible modalities, the models that were released as of this writing focus only on image and text. Llama 3.2 uses the cross-attention-based approach. However, it differs a bit from what I wrote about earlier, namely that in multimodal LLM development, we usually freeze the image encoder and only update the LLM parameters during pretraining. Here, the researchers almost take the opposite approach: they update the image encoder but do not update the language model's parameters. They write that this is intentional and done to preserve the text-only capabilities so that the 11B and 90B multimodal models can be used as drop-in replacements for the Llama 3.1 8B and 70B text-only model on text tasks. The training itself is done in multiple iterations, starting with the Llama 3.1 text models. After adding the image encoder and projection (here called "adapter") layers, they pretrain the model on image-text data. Then, similar to the Llama 3 model text-only training (I wrote about it in an earlier article ), they follow up with instruction and preference finetuning. Instead of adopting a pretrained model such as CLIP as an image encoder, the researchers used a vision transformer that they pretrained from scratch. Specifically, they adopted the ViT-H/14 variant (630 million parameters) of the classic vision transformer architecture ( Dosovitskiy et al., 2020 ). They then pretrained the ViT on a dataset of 2.5 billion image-text pairs over five epochs; this was done before connecting the image encoder to the LLM. (The image encoder takes 224×224 resolution images and divides them into a 14×14 grid of patches, with each patch sized at 16×16 pixels.) As the cross-attention layers add a substantial amount of parameters, they are only added in every fourth transformer block. (For the 8B model, this adds 3B parameters, and for the 70B model, this adds 20 billion parameters.) 4.2 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models The Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper (September 25, 2024) is notable because it promises to open source not only the model weights but also the dataset and source code similar to the language-only OLMo LLM. (This is great for LLM research as it allows us to take a look at the exact training procedure and code and also lets us run ablation studies and reproduce results on the same dataset.) If you are wondering why there are two names in the paper title, Molmo refers to the model (Multimodal Open Language Model), and PixMo (Pixels for Molmo) is the dataset. Illustration of the Molmo decoder-only approach (Method A). Annotated figure adapted from the Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper: https://www.arxiv.org/abs/2409.17146. As illustrated in the figure above, the image encoder employs an off-the-shelf vision transformer, specifically CLIP. The term "connector" here refers to a "projector" that aligns image features with the language model. Molmo streamlines the training process by avoiding multiple pretraining stages, choosing instead a simple pipeline that updates all parameters in a unified approach—including those of the base LLM, the connector, and the image encoder. The Molmo team offers several options for the base LLM: OLMo-7B-1024 (a fully open model backbone), OLMoE-1B-7B (a mixture-of-experts architecture; the most efficient model), Qwen2 7B (an open-weight model that performs better than OLMo-7B-1024), Qwen2 72B (an open-weight model and the best-performing model) Method A, the Unified Embedding Decoder Architecture ("decoder-only architecture," NVLM-D), and Method B, the Cross-Modality Attention Architecture ("cross-attention-based architecture," NVLM-X). Overview of the three multimodal approaches. (Annotated figure from the NVLM: Open Frontier-Class Multimodal LLMs paper: https://arxiv.org/abs/2409.11402) As summarized in the figure below, NVLM-D corresponds to Method A, and NVLM-X corresponds to Method B, as discussed earlier. The concept behind the hybrid model (NVLM-H) is to combine the strengths of both methods: an image thumbnail is provided as input, followed by a dynamic number of patches passed through cross-attention to capture finer high-resolution details. In short, the research team find that: NVLM-X demonstrates superior computational efficiency for high-resolution images. NVLM-D achieves higher accuracy in OCR-related tasks. NVLM-H combines the advantages of both methods. An overview of the multimodal Qwen model, which can process input images with various different resolutions natively. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The native resolution input is implemented via a modified ViT by removing the original absolute position embeddings and introducing 2D-RoPE. They used a classic vision encoder with 675M parameters and LLM backbones of varying sizes, as shown in the table below. The components of the different Qwen2-VL models. (Annotated figure from the Qwen2-VL paper: https://arxiv.org/abs/2409.12191) The training itself consists of 3 stages: (1) pretraining only the image encoder, (2) unfreezing all parameters (including LLM), and (3) freezing the image encoder and instruction-finetuning only the LLM. 4.5 Pixtral 12B Pixtral 12B (September 17, 2024), which uses the Method A: Unified Embedding Decoder Architecture approach, is the first multimodal model from Mistral AI. Unfortunately, there is no technical paper or report available, but the Mistral team shared a few interesting tidbits in their blog post . Interestingly, they chose not to use a pretrained image encoder, instead training one with 400 million parameters from scratch. For the LLM backbone, they used the 12-billion-parameter Mistral NeMo model. Similar to Qwen2-VL, Pixtral also supports variable image sizes natively, as illustrated in the figure below. Illustration of how Pixtral processes images of different sizes. (Annotated figure from the Pixtral blog post: https://mistral.ai/news/pixtral-12b/) 4.6 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning The MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning paper (September 30, 2024) provides practical tips and introduces a mixture-of-experts multimodal model alongside a dense model similar to Molmo. The models span a wide size range, from 1 billion to 30 billion parameters. The models described in this paper focuse on Method A, a Unified Embedding Transformer Architecture, which structures inputs effectively for multimodal learning. In addition, the paper has a series of interesting ablation studies looking into data mixtures and the effects of using coordinate tokens. Illustration of the MM1.5 approach, which includes additional coordinate tokens to denote bounding boxes. (Annotated figure from the MM1.5 paper: https://arxiv.org/abs/2409.20566.) 4.7 Aria: An Open Multimodal Native Mixture-of-Experts Model The Aria: An Open Multimodal Native Mixture-of-Experts Model paper (October 8, 2024) introduces another mixture-of-experts model approach, similar to one of the variants in the Molmo and MM1.5 lineups. The Aria model has 24.9 billion parameters, with 3.5 billion parameters allocated per text token. The image encoder ( SigLIP ) has 438-million-parameters. This model is based on a cross-attention approach with the following overall training procedure: Training the LLM backbone entirely from scratch. Pretraining both the LLM backbone and the vision encoder. An overview of the Baichuan-Omni model, which can handle various input modalities. (Annotated figure from the Baichuan-Omni paper: https://arxiv.org/abs/2410.08565) The training process for Baichuan-Omni involves a three-stage approach: Projector training : Initially, only the projector is trained, while both the vision encoder and the language model (LLM) remain frozen. Vision encoder training : Next, the vision encoder is unfrozen and trained, with the LLM still frozen. Full model training : Finally, the LLM is unfrozen, allowing the entire model to be trained end-to-end.

Nlp

Machine Learning

0 views

Ahead of AI 1 years ago

Building A GPT-Style LLM Classifier From Scratch

In this article, I want to show you how to transform pretrained large language models (LLMs) into strong text classifiers. But why focus on classification? First, finetuning a pretrained model for classification offers a gentle yet effective introduction to model finetuning. Second, many real-world and business challenges revolve around text classification: spam detection, sentiment analysis, customer feedback categorization, topic labeling, and more. Turning a GPT model into a text classifier To celebrate the book’s release, I’m sharing an excerpt from one of the chapters that walks you through how to finetune a pretrained LLM as a spam classifier. Important Note The chapter on classification finetuning is 35 pages long—too long for a single article. So, in this post, I’ll focus on a ~10-page subset that introduces the context and core concepts behind classification finetuning. Additionally, I’ll share insights from some extra experiments that aren’t included in the book and address common questions readers might have. (Please note that the excerpt below is based on my personal draft before Manning’s professional text editing and final figure design.) The full code for this excerpt can be found here on GitHub . In addition, I'll also answer 7 questions you might have regarding training LLM classifiers: 1) Do we need to train all layers? 2) Why finetuning the last token, not the first token? 3) How does BERT compare to GPT performance-wise? 4) Should we disable the causal mask? 5) What impact does increasing the model size have? 6) What improvements can we expect from LoRA? 7) Padding or no padding? Happy reading! The most common ways to finetune language models are instruction finetuning and classification finetuning . Instruction finetuning involves training a language model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described in natural language prompts, as illustrated in Figure 1 below. Figure 1: Illustration of two different instruction finetuning scenarios. At the top, the model is tasked with determining whether a given text is spam. At the bottom, the model is given an instruction on how to translate an English sentence into German. The next chapter will discuss instruction finetuning, as illustrated in Figure 1 above. Meanwhile, this chapter is centered on classification finetuning, a concept you might already be acquainted with if you have a background in machine learning. In classification finetuning, the model is trained to recognize a specific set of class labels, such as "spam" and "not spam." Examples of classification tasks extend beyond large language models and email filtering; they include identifying different species of plants from images, categorizing news articles into topics like sports, politics, or technology, and distinguishing between benign and malignant tumors in medical imaging. The key point is that a classification-finetuned model is restricted to predicting classes it has encountered during its training—for instance, it can determine whether something is "spam" or "not spam," as illustrated in Figure 2 below, but it can't say anything else about the input text. Figure 2: Illustration of a text classification scenario using an LLM. A model finetuned for spam classification does not require additional instructions alongside the input. However, in contrast to an instruction-finetuned model, it can only respond with "spam" and "not spam." In contrast to the classification-finetuned model depicted in Figure 2, an instruction-finetuned model typically has the capability to undertake a broader range of tasks. We can view a classification-finetuned model as highly specialized, and generally, it is easier to develop a specialized model than a generalist model that works well across various tasks. Choosing the right approach Instruction finetuning improves a model's ability to understand and generate responses based on specific user instructions. Instruction finetuning is best suited for models that need to handle a variety of tasks based on complex user instructions, improving flexibility and interaction quality. Classification finetuning, on the other hand, is ideal for projects requiring precise categorization of data into predefined classes, such as sentiment analysis or spam detection. While instruction finetuning is more versatile, it demands larger datasets and greater computational resources to develop models proficient in various tasks. In contrast, classification finetuning requires less data and compute power, but its use is confined to the specific classes on which the model has been trained.

Tutorial

0 views

Ahead of AI 1 years ago

Building LLMs from the Ground Up: A 3-hour Coding Workshop

If you’d like to spend a few hours this weekend to dive into Large Language Models (LLMs) and understand how they work, I've prepared a 3-hour coding workshop presentation on implementing, training, and using LLMs. Below, you'll find a table of contents to get an idea of what this video covers (the video itself has clickable chapter marks, allowing you to jump directly to topics of interest): 0:00 – Workshop overview 2:17 – Part 1: Intro to LLMs 9:14 – Workshop materials 10:48 – Part 2: Understanding LLM input data 23:25 – A simple tokenizer class 41:03 – Part 3: Coding an LLM architecture 45:01 – GPT-2 and Llama 2 1:07:11 – Part 4: Pretraining 1:29:37 – Part 5.1: Loading pretrained weights 1:45:12 – Part 5.2: Pretrained weights via LitGPT 1:53:09 – Part 6.1: Instruction finetuning 2:08:21 – Part 6.2: Instruction finetuning via LitGPT 02:26:45 – Part 6.3: Benchmark evaluation 02:36:55 – Part 6.4: Evaluating conversational performance 02:42:40 – Conclusion It's a slight departure from my usual text-based content, but the last time I did this a few months ago, it was so well-received that I thought it might be nice to do another one! Happy viewing! Build an LLM from Scratch book Build an LLM from Scratch GitHub repository GitHub repository with workshop code Lightning Studio for this workshop LitGPT GitHub repository This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Build an LLM from Scratch book Build an LLM from Scratch GitHub repository GitHub repository with workshop code Lightning Studio for this workshop LitGPT GitHub repository

Tutorial

0 views

Ahead of AI 1 years ago

New LLM Pre-training and Post-training Paradigms

The development of large language models (LLMs) has come a long way, from the early GPT models to the sophisticated open-weight LLMs we have today. Initially, the LLM training process focused solely on pre-training, but it has since expanded to include both pre-training and post-training. Post-training typically encompasses supervised instruction fine-tuning and alignment, which was popularized by ChatGPT. Training methodologies have evolved since ChatGPT was first released. In this article, I review the latest advancements in both pre-training and post-training methodologies, particularly those made in recent months. An overview of the LLM development and training pipeline, with a focus on new pre-training and post-training methodologies discussed in this article There are hundreds of LLM papers each month proposing new techniques and approaches. However, one of the best ways to see what actually works well in practice is to look at the pre-training and post-training pipelines of the most recent state-of-the-art models. Luckily, four major new LLMs have been released in the last months, accompanied by relatively detailed technical reports. In this article, I focus on the pre-training and post-training pipelines of the following models: Alibaba's Qwen 2 Apple Intelligence Foundation Language Models Google's Gemma 2 Meta AI's Llama 3.1 These models are presented in order based on the publication dates of their respective technical papers on arXiv.org, which also happens to align with their alphabetical order. This article is a passion project that I created in my free time and over the weekends. If you find it valuable and would like to support my work, please consider purchasing a copy of my books and recommending them to your colleagues. Your review on Amazon would also be greatly appreciated! Build a Large Language Model (from Scratch) , Machine Learning Q and AI , and Machine Learning with PyTorch and Scikit-Learn Build a Large Language Model (from Scratch) is a highly focused book dedicated to coding LLMs from the ground up in PyTorch, covering everything from pre-training to post-training—arguably the best way to truly understand LLMs. Machine Learning Q and AI is a great book for those who are already familiar with the basics; it dives into intermediate and advanced concepts covering deep neural networks, vision transformers, multi-GPU training paradigms, LLMs, and many more. Machine Learning with PyTorch and Scikit-Learn is a comprehensive guide to machine learning, deep learning, and AI, offering a well-balanced mix of theory and practical code. It's the ideal starting point for anyone new to the field. Let's begin with Qwen 2 , a really strong LLM model family that is competitive with other major LLMs. However, for some reason, it's less popular than the open-weight models from Meta AI, Microsoft, and Google. Before looking at the pre-training and post-training methods discussed in the Qwen 2 Technical Report , let's briefly summarize some core specifications. Qwen 2 models come in 5 flavors. There are 4 regular (dense) LLMs with sizes 0.5 billion, 1.5 billion, 7 billion, and 72 billion parameters. In addition, there is a Mixture-of-Experts model with 57 billion parameters, where 14 billion parameters are activated at the same time. (Since architecture details are not the focus this time, I won't go too much into the Mixture-of-Experts model; however, in a nutshell, this is similar to Mixtral by Mistral AI, except that it has more active experts. For a high-level overview, see the Mixtral Architecture section in my Model Merging, Mixtures of Experts, and Towards Smaller LLMs article .) One of the stand-out features of Qwen 2 LLMs are their good multilingual capabilities in 30 languages. They also have a surprisingly large 151,642 token vocabulary (for reference, Llama 2 uses a 32k vocabulary, and Llama 3.1 uses a 128k token vocabulary); as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so we can fit more text into the same input. Also it especially helps with multilingual data and coding to cover words outside the standard English vocabulary. Below is a brief MMLU benchmark comparison with other LLMs covered later. (Note that MMLU is a multiple-choice benchmark and thus has its limitations; however, it still is one of the most popular methods for reporting LLM performance.) MMLU benchmark scores for the latest open-weight models (higher values are better). I collected the scores for this plot from the official research papers of each model. (If you are new to MMLU, I briefly discussed it in my recent talk at minute 46:05 .) The Qwen 2 team trained the 1.5 billion, 7 billion, and 72 billion parameter models on 7 trillion training tokens, which is a reasonable size. For comparison, Llama 2 models were trained on 2 trillion tokens, and Llama 3.1 models were trained on 15 trillion tokens. Interestingly, the 0.5 billion parameter model was trained on 12 trillion tokens. However, the researchers did not train the other models on the larger 12 trillion token dataset because they did not observe any improvements during training, and the additional computational costs were not justified. One of the focus areas has been improving the data filtering pipeline to remove low-quality data and enhancing data mixing to increase data diversity— a theme we will revisit when examining other models later. Interestingly, they also used Qwen models (although they didn't specify details, I assume they mean previous generation Qwen models) to synthesize additional pre-training data. And the pre-training involved “multi-task instruction data… to enhance in-context learning and instruction-following abilities.” Furthermore, they performed training in two stages: regular pre-training followed by long-context training. The latter increased the context length from 4,096 to 32,768 tokens at the end phase of pre-training using "high-quality, lengthy data." Summary of techniques for Qwen 2 pre-training. "Continued pre-training" refers to the 2-stage pre-training, where the researchers started with regular pre-training and followed up with a long-context continued pre-training. (Unfortunately, another theme of the technical reports is that details about the dataset are scarce, so if my write-up does not appear very detailed, it's due to the lack of publicly available information.) The Qwen 2 team employed the popular two-phase post-training methodology, starting with supervised instruction fine-tuning (SFT), which was applied across 500,000 examples for 2 epochs. This phase aimed to refine the model’s response accuracy in predetermined scenarios. A typical LLM development flow. After SFT, they used direct preference optimization (DPO) to align the LLM with human preferences. (Interestingly referred to in their terminology as reinforcement learning from human feedback, RLHF.) As I discussed in my Tips for LLM Pretraining and Evaluating Reward Models article a few weeks ago, the SFT+DPO approach seems to be the most popular preference tuning strategy at the moment due to the ease of use compared to other methods, such as RLHF with PPO. (If you want to learn how DPO works, I recently implemented it from scratch here .) The alignment phase itself was also done in 2 stages. First using DPO on an existing dataset (offline stage). Second, using a reward model to form the preference pair (online). Here, the model generates multiple responses during training, and a reward model selects the preferred response for the optimization step in "real-time" (that is, during training). This is also often referred to as "rejection sampling." For the construction of the dataset, they used existing corpora complemented by human labeling to determine target responses for SFT and identify preferred and rejected responses essential for DPO. The researchers also synthesized artificially annotated data. Moreover, the team used LLMs to generate instruction-response pairs specifically tailored for "high-quality literary data," to create high-quality Q&A pairs for training. Summary of techniques for Qwen 2 post-training. Qwen 2 is a relatively capable model, and similar to earlier generations of Qwen. When attending the NeurIPS LLM efficiency challenge in December 2023, I remember that most of the winning approaches involved a Qwen model. Regarding the training pipeline of Qwen 2, what stands out is that synthetic data has been used for both pre-training and post-training. Also, the focus on dataset filtering (rather than collecting as much data as possible) is one of the notable trends in LLM training. Here, I would say, more is better, but only if it meets certain quality standards. Direct Preference Optimization (DPO) has become one of the go-to methods to align LLMs more closely with user preferences, and it's something you will read a lot in this article. If you want to learn how it works, I coded it from scratch here: Direct Preference Optimization (DPO) for LLM Alignment (From Scratch) . An overview of DPO for LLM alignment Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. I was really delighted to see another technical paper by Apple on arXiv.org that outlines their model training. An unexpected but definitely positive surprise! In the Apple Intelligence Foundation Language Models paper, available at, the research team outlines the development of two primary models designed for use in the "Apple Intelligence" context on Apple devices. For brevity, these models will be abbreviated as AFM for "Apple Foundation Models" throughout this section. Specifically, the paper describes two versions of the AFM: a 3-billion-parameter on-device model intended for deployment on phones, tablets, or laptops, and a more capable server model of unspecified size. These models are developed for chat, math, and coding tasks, although the paper does not discuss any of the coding-specific training and capabilities. Like the Qwen 2, the AFMs are dense LLMs and do not utilize a mixture-of-experts approach. I'd like to extend two big kudos to the researchers. First, besides using publicly available data and data licensed by publishers, they respected the robots.txt files on websites and refrained from crawling these. Second, they also mentioned that they performed decontamination with benchmark data. To reinforce one of the takeaways of the Qwen 2 paper, the researchers mentioned that quality was much more important than quantity. (With a vocabulary size of 49k tokens for the device model and 100k tokens for the server model, the vocabulary sizes were noticeably smaller than those of the Qwen 2 models, which used 150k token vocabulary.) Interestingly, the pre-training was not done in 2 but 3 stages! Core (regular) pre-training Continued pre-training where web-crawl (lower-quality) data was down-weighted; math and code was up-weighted Context-lengthening with longer sequence data and synthetic data Overview of the 3-step pre-training process that the AFM models underwent. Let's take a look at these 3 steps in a bit more detail. Core pre-training describes the first pre-training stage in Apple's pre-training pipeline. This is akin to regular pre-training, where the AFM-server model was trained on 6.3 trillion tokens, a batch size of 4096 batch size and a 4096-token sequence length. This is very similar to Qwen 2 models, which were trained in 7 trillion tokens. However, it gets more interesting for the AFM-on-device model, which is distilled and pruned from a larger 6.4-billion-parameter model (trained from scratch like the AFM-server model described in the previous paragraph. There's not much detail on the distillation process besides "a distillation loss is used by replacing the target labels with a convex combination of the true labels and the teacher model's top-1 predictions (with 0.9 weight assigned to the teacher labels)." I feel that knowledge distillation is becoming increasingly prevalent and useful for LLM pre-training (Gemma-2 uses it, too). I plan to cover it in more detail one day. For now, here's a brief overview of how this process would work on a high level. An overview of knowledge distillation, where a small model (here, the AFM-device 3B model) is trained on the original training tokens plus the outputs from a larger teacher model (here, a 6.4B model). Note that the cross entropy loss in a) is the regular training loss used for pre-training LLMs (see chapter 5 in my " Build a Large Language Model from Scratch " book for more details on how the regular pre-training step is implemented). Knowledge distillation, as illustrated above, still involves training on the original dataset. However, in addition to the training tokens in the dataset, the model to be trained (referred to as the student) receives information from the larger (teacher) model, which provides a richer signal compared to training without knowledge distillation. The downside is that you must: 1) train the larger teacher model first, and 2) compute predictions on all training tokens using the larger teacher model. These predictions can be computed ahead of time (which requires substantial storage space) or during training (which may slow down the training process). The continued pre-training stage includes a small context lengthening step from 4,096 to 8,192 tokens on a dataset consisting of 1 trillion tokens (the core pre-training set was five times larger). The primary focus, however, is on training with a high-quality data mix, with an emphasis on math and code. Interestingly, the researchers found that the distillation loss was not beneficial in this context. The third pre-training stage involves only 100 billion tokens (10% of the tokens used in the second stage) but represents a more significant context lengthening to 32,768 tokens. To achieve this, the researchers augmented the dataset with synthetic long-context Q&A data. Summary of techniques for AFM pre-training. 2.3 AFM Post-training Apple appears to have taken a similarly comprehensive approach to their post-training process as they did with pre-training. They leveraged both human-annotated and synthetic data, emphasizing that data quality was prioritized over quantity. Interestingly, they did not rely on predetermined data ratios; instead, they fine-tuned the data mixture through multiple experiments to achieve the optimal balance. The post-training phase involved a two-step process: supervised instruction fine-tuning followed by several rounds of reinforcement learning with human feedback (RLHF). A particularly noteworthy aspect of this process is Apple’s introduction of two new algorithms for the RLHF stage: Rejection Sampling Fine-tuning with Teacher Committee (iTeC) RLHF with Mirror Descent Policy Optimization Given the length of this article, I won’t go into the technical details of these methods, but here’s a brief overview: The iTeC algorithm combines rejection sampling with multiple preference tuning techniques—specifically, SFT, DPO, IPO, and online RL. Rather than relying on a single algorithm, Apple trained models using each approach independently. These models then generated responses, which were evaluated by humans who provided preference labels. This preference data was used to iteratively train a reward model in an RLHF framework. During the rejection sampling phase, a committee of models generated multiple responses, with the reward model selecting the best one. This committee-based approach is quite complex but should be relatively feasible, particularly given the relatively small size of the models involved (around 3 billion parameters). Implementing such a committee with much larger models, like the 70B or 405B parameter models in Llama 3.1, would definitely be more challenging. As for the second algorithm, RLHF with Mirror Descent, it was chosen because it proved more effective than the commonly used PPO (Proximal Policy Optimization). Summary of techniques for AFM post-training. Apple's approach to pre-training and post-training is relatively comprehensive, likely because the stakes are very high (the model is deployed on millions, if not billions, of devices). However, given the small nature of these models, a vast array of techniques also becomes feasible, since a 3B model is less than half the size of the smallest Llama 3.1 model. One of the highlights is that it’s not a simple choice between RLHF and DPO; instead, they used multiple preference-tuning algorithms in the form of a committee. It’s also interesting that they explicitly used Q&A data as part of the pre-training—something I discussed in my previous article, Instruction Pretraining LLMs . All in all, it's a refreshing and delightful technical report. Google's Gemma models were recently described in Gemma 2: Improving Open Language Models at a Practical Size . I'll provide an overview of some of key facts in the following overview section before discussing the pre-training and post-training processes. The Gemma 2 models are available in three sizes: 2 billion, 9 billion, and 27 billion parameters. The primary focus is on exploring techniques that do not necessarily require increasing the size of training datasets but rather on developing relatively small and efficient LLMs. Notably, Gemma 2 features a substantial vocabulary size of 256k tokens. For comparison, Llama 2 uses a 32k token vocabulary, and Llama 3 has a 128k token vocabulary. Additionally, Gemma 2 employs sliding window attention, similar to Mistral's early models, likely to reduce memory costs. For more details on the Gemma 2 architecture, please refer to the Gemma 2 section in my previous article . The Gemma researchers argue that even small models are often undertrained. However, rather than simply increasing the size of the training dataset, they focus on maintaining quality and achieve improvements through alternative methods, such as knowledge distillation, similar to Apple's approach. While the 27B Gemma 2 model was trained from scratch, the smaller models were trained using knowledge distillation similar to Apple's approach explained previously. The 27B model was trained on 13 trillion tokens, the 9B model on 8 trillion tokens, and the 2B model on 2 trillion tokens. Additionally, similar to Apple's approach, the Gemma team optimized the data mixture to improve performance. Summary of techniques for Gemma 2 pre-training. 3.3 Gemma 2 Post-training The post-training process for the Gemma models involved the typical supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) steps. The instruction data involved using English-only prompt pairs, which were a mix of human-generated and synthetic-generated content. Specifically, and interestingly, the responses were primarily generated by teacher models, and knowledge distillation was also applied during the SFT phase. An interesting aspect of their RLHF approach, following SFT, is that the reward model used for RLHF is ten times larger than the policy (target) model. The RLHF algorithm employed by Gemma is fairly standard, but with a unique twist: they average the policy models through a method called WARP, a successor to WARM (weight-averaged reward models). I previously discussed this method in detail in my article "Model Merging, Mixtures of Experts, and Towards Smaller LLMs" . Summary of techniques for Gemma 2 post-training. 3.4 Conclusion The Gemma team seems to really double down on knowledge distillation, which they use during both pre-training and post-training similar to Apple. Interestingly, they didn't use a multi-stage pre-training approach though, or at least, they didn't detail it in their paper. I am excited to be invited to give a keynote talk at the upcoming PyTorch conference . It will be my first PyTorch conference, and I look forward to meeting the community and chatting about the latest AI and LLM developments! 4. Meta AI's Llama 3.1 New releases of Meta's Llama LLMs are always a big thing. This time, the release was accompanied by a 92-page technical report: The Llama 3 Herd of Models . Last but not least, in this section, we will look at the fourth big model paper released last month. Along with releasing a huge 405 billion parameter model, Meta updated their previous 8 billion and 70 billion parameter models, giving them a slight MMLU performance boost. MMLU benchmark performance of different models. While Llama 3 uses group query attention like other recent LLMs, surprisingly, Meta AI said no to sliding window attention and Mixture-of-Experts approaches. In other words, the Llama 3.1 looks very traditional, and the focus was clearly on the pre-training and post-training rather than architecture innovations. Similar to previous Llama releases, the weights are openly available. Moreover, Meta said that they updated the Llama 3 license so that it's now finally possible (allowed) to use Llama 3 for synthetic data generation or knowledge distillation to improve other models. Llama 3 was trained on a massive 15.6 trillion tokens dataset, which is a substantial increase from Llama 2's 1.8 trillion tokens. The researchers say that it supports at least eight languages, (whereas Qwen 2 is capable of handling 20). An interesting aspect of Llama 3 is its vocabulary size of 128,000, which was developed using OpenAI's tiktoken tokenizer. (For those interested in tokenizer performance, I did a simple benchmark comparison here .) In terms of pre-training data quality control, Llama 3 employs heuristic-based filtering alongside model-based quality filtering, utilizing fast classifiers like Meta AI's fastText and RoBERTa-based classifiers. These classifiers also help in determining the context categories for the data mix used during training. The pre-training for Llama 3 is divided into three stages. The first stage involves standard initial pre-training using the 15.6 trillion tokens with an 8k context window. The second stage continues with the pre-training but extends the context length to 128k. The final stage involves annealing, which further enhances the model's performance. Let's look into these stages in more detail below. In their training setup, they began with batches consisting of 4 million tokens, each with a sequence length of 4096. This implies a batch size of approximately 1024 tokens, assuming that the 4 million figure is rounded to the nearest digit. After processing the first 252 million tokens, they doubled the sequence length to 8192. Further into the training process, after 2.87 trillion tokens, they doubled the batch size again. Additionally, the researchers did not keep the data mix constant throughout the training. Instead, they adjusted the mix of data being used during the training process to optimize model learning and performance. This dynamic approach to data handling likely helped in improving the model's ability to generalize across different types of data. Compared to other models that increased their context window in a single step, the Llama 3.1 context lengthening was a more gradual approach: Here, the researchers increased the context length through six distinct stages from 8,000 to 128,000 tokens. This stepwise increment likelely allowed the model to adapt more smoothly to larger contexts. The training set utilized for this process was involved 800 billion tokens, about 5% of the total dataset size. For the third pre-training stage, the researchers trained the model on a small but high-quality mix, which they found helps improve the performance on benchmark datasets. For example, annealing on the GSM8K and MATH training sets provided a significant boost on the respective GSM8K and MATH validation sets. In section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size); this 40B annealing dataset was used to assess data quality. In section 3.4.3, they state that the actual annealing was done only on 40 million tokens (0.1% of the annealing data). Summary of techniques for Llama 3.1 pre-training. For their post-training process, the Meta AI team employed a relatively straightforward method that included supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO). They observed that reinforcement learning algorithms like RLHF with PPO were less stable and more challenging to scale compared to these techniques. It's worth noting that the SFT and DPO steps were iteratively repeated over multiple rounds, incorporating both human-generated and synthetic data. Before describing the further details, their workflow is illustrated in the figure below. Annotated figure from the Llama 3.1 paper describing the post-training procedure Note that even though they used DPO, they also developed a reward model as you'd do in RLHF. Initially, they trained the reward model using a checkpoint from the pre-training phase, utilizing human-annotated data. This reward model was then used for the rejection sampling process, helping to select appropriate prompts for further training. In each training round, they applied model averaging techniques not only to the reward model but also to the SFT and DPO models. This averaging involved merging the parameters from recent and previous models to stabilize (and improve) performance over time. For those interested in the technical specifics of model averaging, I discussed this topic in the section "Understanding Model Merging and Weight Averaging" of my earlier article Model Merging, Mixtures of Experts, and Towards Smaller LLMs . To sum it up, at the core, it's a relatively standard SFT + DPO stage. However, this stage is repeated over multiple rounds. Then, they sprinkled in a reward model for rejection sampling (like Qwen 2 and AFM). They also used model averaging like Gemma; however, it's not just for the reward models but all models involved. Summary of techniques for Llama 3.1 post-training. 4.4 Conclusion The Llama 3 models remain fairly standard and similar to the earlier Llama 2 models but with some interesting approaches. Notably, the large 15 trillion token training set distinguishes Llama 3 from other models. Interestingly, like Apple's AFM model, Llama 3 also implemented a 3-stage pre-training process. In contrast to other recent large language models, Llama 3 did not employ knowledge distillation techniques, opting instead for a more straightforward model development path. For post-training, the model utilized Direct Preference Optimization (DPO) instead of the more complex reinforcement learning strategies that have been popular in other models. Overall, this choice is interesting as it indicates a focus on refining LLM performance through simpler (but proven) methods. What can we learn from these four models discussed in this article: Alibaba's Qwen 2, Apple's foundational models (AFM), Google's Gemma 2, and Meta's Llama 3? All four models take somewhat different approaches to pre-training and post-training. Of course, methodologies overlap, but no training pipeline is quite the same. For pre-training, a shared feature seems to be that all methods use a multi-stage pre-training pipeline, where a general core pre-training is followed by a context lengthening and sometimes high-quality annealing step. The figure below shows again the different methods employed in pre-training at a glance. Overview of the techniques used for pre-training When it comes to post-training, also none of the pipelines was exactly the same. It seems that rejection sampling is now a common staple in the post-training process. However, when it comes to DPO or RLHF, there's no consensus or preference (no pun intended) yet. Overview of the techniques used for post-training So, in all, there is no single recipe but many paths to developing highly-performant LLMs. Lastly, the four models perform in the same ballpark. Unfortunately, several of these models have not made it into the LMSYS and AlpacaEval leaderboards, so we have no direct comparison yet, except for the scores on multiple-choice benchmarks like MMLU and others. This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Alternatively, I also recently enabled the paid subscription option on Substack to support this magazine directly. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. An overview of the LLM development and training pipeline, with a focus on new pre-training and post-training methodologies discussed in this article There are hundreds of LLM papers each month proposing new techniques and approaches. However, one of the best ways to see what actually works well in practice is to look at the pre-training and post-training pipelines of the most recent state-of-the-art models. Luckily, four major new LLMs have been released in the last months, accompanied by relatively detailed technical reports. In this article, I focus on the pre-training and post-training pipelines of the following models: Alibaba's Qwen 2 Apple Intelligence Foundation Language Models Google's Gemma 2 Meta AI's Llama 3.1 Build a Large Language Model (from Scratch) is a highly focused book dedicated to coding LLMs from the ground up in PyTorch, covering everything from pre-training to post-training—arguably the best way to truly understand LLMs. Machine Learning Q and AI is a great book for those who are already familiar with the basics; it dives into intermediate and advanced concepts covering deep neural networks, vision transformers, multi-GPU training paradigms, LLMs, and many more. Machine Learning with PyTorch and Scikit-Learn is a comprehensive guide to machine learning, deep learning, and AI, offering a well-balanced mix of theory and practical code. It's the ideal starting point for anyone new to the field. MMLU benchmark scores for the latest open-weight models (higher values are better). I collected the scores for this plot from the official research papers of each model. (If you are new to MMLU, I briefly discussed it in my recent talk at minute 46:05 .) 1.2 Qwen 2 Pre-training The Qwen 2 team trained the 1.5 billion, 7 billion, and 72 billion parameter models on 7 trillion training tokens, which is a reasonable size. For comparison, Llama 2 models were trained on 2 trillion tokens, and Llama 3.1 models were trained on 15 trillion tokens. Interestingly, the 0.5 billion parameter model was trained on 12 trillion tokens. However, the researchers did not train the other models on the larger 12 trillion token dataset because they did not observe any improvements during training, and the additional computational costs were not justified. One of the focus areas has been improving the data filtering pipeline to remove low-quality data and enhancing data mixing to increase data diversity— a theme we will revisit when examining other models later. Interestingly, they also used Qwen models (although they didn't specify details, I assume they mean previous generation Qwen models) to synthesize additional pre-training data. And the pre-training involved “multi-task instruction data… to enhance in-context learning and instruction-following abilities.” Furthermore, they performed training in two stages: regular pre-training followed by long-context training. The latter increased the context length from 4,096 to 32,768 tokens at the end phase of pre-training using "high-quality, lengthy data." Summary of techniques for Qwen 2 pre-training. "Continued pre-training" refers to the 2-stage pre-training, where the researchers started with regular pre-training and followed up with a long-context continued pre-training. (Unfortunately, another theme of the technical reports is that details about the dataset are scarce, so if my write-up does not appear very detailed, it's due to the lack of publicly available information.) 1.3 Qwen 2 Post-training The Qwen 2 team employed the popular two-phase post-training methodology, starting with supervised instruction fine-tuning (SFT), which was applied across 500,000 examples for 2 epochs. This phase aimed to refine the model’s response accuracy in predetermined scenarios. A typical LLM development flow. After SFT, they used direct preference optimization (DPO) to align the LLM with human preferences. (Interestingly referred to in their terminology as reinforcement learning from human feedback, RLHF.) As I discussed in my Tips for LLM Pretraining and Evaluating Reward Models article a few weeks ago, the SFT+DPO approach seems to be the most popular preference tuning strategy at the moment due to the ease of use compared to other methods, such as RLHF with PPO. (If you want to learn how DPO works, I recently implemented it from scratch here .) The alignment phase itself was also done in 2 stages. First using DPO on an existing dataset (offline stage). Second, using a reward model to form the preference pair (online). Here, the model generates multiple responses during training, and a reward model selects the preferred response for the optimization step in "real-time" (that is, during training). This is also often referred to as "rejection sampling." For the construction of the dataset, they used existing corpora complemented by human labeling to determine target responses for SFT and identify preferred and rejected responses essential for DPO. The researchers also synthesized artificially annotated data. Moreover, the team used LLMs to generate instruction-response pairs specifically tailored for "high-quality literary data," to create high-quality Q&A pairs for training. Summary of techniques for Qwen 2 post-training. 1.4 Conclusion Qwen 2 is a relatively capable model, and similar to earlier generations of Qwen. When attending the NeurIPS LLM efficiency challenge in December 2023, I remember that most of the winning approaches involved a Qwen model. Regarding the training pipeline of Qwen 2, what stands out is that synthetic data has been used for both pre-training and post-training. Also, the focus on dataset filtering (rather than collecting as much data as possible) is one of the notable trends in LLM training. Here, I would say, more is better, but only if it meets certain quality standards. Aligning LLMs with Direct Preference Optimization from Scratch Direct Preference Optimization (DPO) has become one of the go-to methods to align LLMs more closely with user preferences, and it's something you will read a lot in this article. If you want to learn how it works, I coded it from scratch here: Direct Preference Optimization (DPO) for LLM Alignment (From Scratch) . An overview of DPO for LLM alignment Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 2. Apple's Apple Intelligence Foundation Language Models (AFM) I was really delighted to see another technical paper by Apple on arXiv.org that outlines their model training. An unexpected but definitely positive surprise! 2.1 AFM Overview In the Apple Intelligence Foundation Language Models paper, available at, the research team outlines the development of two primary models designed for use in the "Apple Intelligence" context on Apple devices. For brevity, these models will be abbreviated as AFM for "Apple Foundation Models" throughout this section. Specifically, the paper describes two versions of the AFM: a 3-billion-parameter on-device model intended for deployment on phones, tablets, or laptops, and a more capable server model of unspecified size. These models are developed for chat, math, and coding tasks, although the paper does not discuss any of the coding-specific training and capabilities. Like the Qwen 2, the AFMs are dense LLMs and do not utilize a mixture-of-experts approach. 2.2 AFM Pre-training I'd like to extend two big kudos to the researchers. First, besides using publicly available data and data licensed by publishers, they respected the robots.txt files on websites and refrained from crawling these. Second, they also mentioned that they performed decontamination with benchmark data. To reinforce one of the takeaways of the Qwen 2 paper, the researchers mentioned that quality was much more important than quantity. (With a vocabulary size of 49k tokens for the device model and 100k tokens for the server model, the vocabulary sizes were noticeably smaller than those of the Qwen 2 models, which used 150k token vocabulary.) Interestingly, the pre-training was not done in 2 but 3 stages! Core (regular) pre-training Continued pre-training where web-crawl (lower-quality) data was down-weighted; math and code was up-weighted Context-lengthening with longer sequence data and synthetic data An overview of knowledge distillation, where a small model (here, the AFM-device 3B model) is trained on the original training tokens plus the outputs from a larger teacher model (here, a 6.4B model). Note that the cross entropy loss in a) is the regular training loss used for pre-training LLMs (see chapter 5 in my " Build a Large Language Model from Scratch " book for more details on how the regular pre-training step is implemented). Knowledge distillation, as illustrated above, still involves training on the original dataset. However, in addition to the training tokens in the dataset, the model to be trained (referred to as the student) receives information from the larger (teacher) model, which provides a richer signal compared to training without knowledge distillation. The downside is that you must: 1) train the larger teacher model first, and 2) compute predictions on all training tokens using the larger teacher model. These predictions can be computed ahead of time (which requires substantial storage space) or during training (which may slow down the training process). 2.2.2 Pre-training II: Continued Pre-training The continued pre-training stage includes a small context lengthening step from 4,096 to 8,192 tokens on a dataset consisting of 1 trillion tokens (the core pre-training set was five times larger). The primary focus, however, is on training with a high-quality data mix, with an emphasis on math and code. Interestingly, the researchers found that the distillation loss was not beneficial in this context. 2.2.3 Pre-training III: Context Lengthening The third pre-training stage involves only 100 billion tokens (10% of the tokens used in the second stage) but represents a more significant context lengthening to 32,768 tokens. To achieve this, the researchers augmented the dataset with synthetic long-context Q&A data. Summary of techniques for AFM pre-training. 2.3 AFM Post-training Apple appears to have taken a similarly comprehensive approach to their post-training process as they did with pre-training. They leveraged both human-annotated and synthetic data, emphasizing that data quality was prioritized over quantity. Interestingly, they did not rely on predetermined data ratios; instead, they fine-tuned the data mixture through multiple experiments to achieve the optimal balance. The post-training phase involved a two-step process: supervised instruction fine-tuning followed by several rounds of reinforcement learning with human feedback (RLHF). A particularly noteworthy aspect of this process is Apple’s introduction of two new algorithms for the RLHF stage: Rejection Sampling Fine-tuning with Teacher Committee (iTeC) RLHF with Mirror Descent Policy Optimization

Machine Learning

0 views

Ahead of AI 1 years ago

Instruction Pretraining LLMs

A lot has happened last month: Apple announced the integration of on-device LLMs, Nvidia shared their large Nemotron model, FlashAttention-3 was announced, Google's Gemma 2 came out, and much more. You've probably already read about it all in various news outlets. So, in this article, I want to focus on recent research centered on instruction finetuning, a fundamental technique for training LLMs. What I am going to cover in this article: A new, cost-effective method for generating data for instruction finetuning Instruction finetuning from scratch Pretraining LLMs with instruction data An overview of what's new in Gemma 2 An overview of all the other interesting research papers that came out in June Happy reading! The Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing paper shares a fascinating hack to generate a high-quality dataset for LLM instruction finetuning. While this doesn't offer any particularly recent research insights, it's one of those interesting, practical exploits that seems super useful. What distinguishes this instruction-data-generating method from others is that it can be fully automated and doesn't require any initial questions or instructions. As the paper title suggests, it enables the creation of an instruction dataset from "Nothing" – the only thing we need is a locally running Llama 3 8B model. The figure below summarizes how this method works. Annotated illustration of the Magpie method for generating a synthetic dataset for instruction finetuning. The figure is based on illustrations from the Magpie paper: https://arxiv.org/abs/2406.08464 Essentially, as shown in the figure above, we just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for us. Then, we feed that instruction back to the LLM, and it will generate a response. If we repeat this procedure a couple of thousand times, we obtain a dataset for instruction finetuning. (Optionally, we can apply an LLM to filter the instruction-response pairs by quality.) What's fascinating is that with the resulting instruction dataset, the authors found that finetuning a Llama 3 8B base model with just instruction finetuning (no preference finetuning via RLHF and DPO) beats the original Llama 2 8B Instruct model by Meta AI, as shown in the figure below. A Llama 3 8B base model finetuned on the Magpie-generated instruction dataset beats the original Llama 3 8B Instruct model. Based on an annotated illustration from the Magpie paper: https://arxiv.org/abs/2406.08464 The Magpie results shown in the figure above wer achieved with 300 thousand samples only. In comparison, The original Llama 3 Instruct model was finetuned and aligned on 100 million samples! I was skeptical at first, so I tried to implement this myself. It really works! Here , you can find my reimplementation using Ollama, which even runs fine locally on a MacBook Air. Code screenshot from a reimplementation of the Magpie method that runs locally. The code is available here . The authors created two sets of datasets: A "Pro" version using the Llama 3 70B Instruct model and an "Air" version using the Llama 3 8B Instruct model. As an earlier figure showed, the Magpie-Pro-generated dataset results in slightly stronger models compared to the Magpie-Air dataset when using it to instruction-finetune a Llama 3 8B base model. The figure below shows an additional comparison of the dataset qualities and difficulties as rated via an LLM. Annotated plots from the Magpie paper showing the dataset quality and difficulty of the Air and Pro datasets relative to each other. As the figure above shows, the quality of the Air and Pro datasets is roughly on par. In addition, it would have been interesting to see how the Alpaca dataset compares to these. (The assumption is that the Magpie data is of much higher quality than Alpaca, but a reference point would be interesting.) Furthermore, the paper contains an analysis showing that the breadth or diversity in this dataset is much larger than that of other popular datasets for instruction finetuning, such as Alpaca, Evol Instruct, and UltraChat. In addition, when compared to models trained with other instruction finetuning datasets, the Magpie-Pro finetuned model also compares very favorably. Overall, I think that Magpie is an interesting exploit that is, on the one hand, fascinating in its effectiveness and, on the other hand, has a lot of practical utility. I will certainly consider it as an interesting, simple, and cost-effective candidate for constructing general-purpose instruction datasets in the future. If you are looking for a resource to understand the instruction finetuning process in LLMs, I am happy to share that Chapter 7 on instruction finetuning LLMs is now finally live on the Manning website . This is the longest chapter in the book and takes a from-scratch approach to implementing the instruction finetuning pipeline. This includes everything from input formatting to batching with a custom collate function, masking padding tokens, the training loop itself, and scoring the response quality of the finetuned LLM on a custom test set. (The exercises include changing prompt styles, instruction masking, and adding LoRA.) Happy coding! An overview of chapter 7 in my "Build a Large Language Model From Scratch" book. Supplementary code materials are available here on GitHub . PS: it's also the last chapter, and the publisher is currently preparing the layouts for the print version. In the paper "Instruction Pre-Training: Language Models are Supervised Multitask Learners" ( https://arxiv.org/abs/2406.14491 ), researchers investigate whether LLM pretraining can be made more efficient by including synthetic instruction-response pairs instead of just raw text. (Here, "raw text" means text from books, websites, papers, and so forth that has not been reprocessed into a specific format.) A comparison between regular pretraining (top) and the proposed instruction pretraining approach (bottom) via an annotated figure from https://arxiv.org/abs/2406.14491 Specifically, the researchers experiment with generating instruction-response data from the raw training corpus itself via an "instruction synthesizer," an LLM specifically finetuned for this task. (Note that this is not the first paper proposing the formatting of raw text as instruction data. Another work that comes to mind is "Genie: Achieving Human Parity in Content-Grounded Datasets Generation" ( https://arxiv.org/abs/2401.14367 ). I also recall seeing another paper or blog post using instruction data during pretraining a few months ago—I discussed this method with some of my colleagues—but unfortunately, I couldn't find the reference. Nonetheless, the paper discussed here is particularly intriguing since it builds on openly available LLMs that run locally and covers both pretraining and continual pretraining.) Before we dive into the pretraining and continual pretraining results, let's talk about the core component of this method: the instruction synthesizer. This is an openly available Mistral 7B v0.1 LLM (which I wrote about last year here: https://magazine.sebastianraschka.com/i/138555764/mistral-b ) that has been finetuned to generate instruction-response pairs from raw text. To finetune this synthesizer, the researchers use datasets such as HotpotQA ( https://arxiv.org/abs/1809.09600 ), which consists of passages from Wikipedia associated with questions and answers. For this, the authors also ensure that a variety of tasks, like commonsense reasoning, sentiment analysis, math problems, etc., are covered. The input and output data of the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491 Once this instruction synthesizer is developed (i.e., finetuned), it can be used to generate the input data for pretraining the target LLMs. One last noteworthy detail regarding the instruction synthesizer is that multiple raw texts (T n ) and instruction-response pairs (I n ⊕ R n ) are concatenated as few-shot examples, as shown in the figure below. The formatting of the instruction-data for finetuning (and using) the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491 Now that we have discussed the method to generate the instruction-response pairs, let's get to the interesting part: how well do models train on this augmented dataset. The first set of results looks at two small models trained from scratch: 500M parameters and 1.3B parameters (both are based on the Mistral architecture). A comparison of 3 different pretraining approaches used to train models from scratch (annotated table from https://arxiv.org/abs/2406.14491) As we can see in the table above, the model trained via the proposed instruction pretraining approach ( Instruct PT ) performs best on most benchmark tasks (higher values are better). Note, though, that it has seen more tokens than the Vanilla PT approach since it included the synthesized instruction-response pairs. Hence, the authors included the Mix PT comparison, which is a model that has been trained on a data mix containing both the raw text and the instruction data used to train the synthesizer. From this comparison, we can see that not simply using any instruction data makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) makes the difference. (The authors conducted all experiments using the same number of tokens.) In addition, it's worth noting that the Instruct PT pretrained models have another advantage: They improve a more when they are instruction-finetuned afterwards, as the figure below shows. Finetuning LLMs that have been pretrained with either the traditional pretraining pardigm (Vanilla PT) or instruction pretraining (annotated figure from https://arxiv.org/abs/2406.14491) Pretraining from scratch is interesting because that's how LLMs are created in the first place. However, I'd say that practitioners care more about continual pretraining and finetuning. Continual pretraining here means that we take an existing pretrained model and pretrain it further on new domain data. For instance, think of a Llama 3 8B base model that has been trained on a general text corpus and that you want to adapt for finance, medical, legal, or other domains. The table below summarizes the results the researchers obtained when applying the instruction pretraining method to a pretrained Llama 3 8B base model. Specifically, they conducted continual pretraining with both biomedical texts and finance texts. A comparison of 3 different pretraining approaches used for continual pretraining (annotated table from https://arxiv.org/abs/2406.14491) Looking at the table above, we can see that the instruction pretraining approach ( Instruct PT ) clearly outperforms the vanilla pretraining ( Vanilla PT ) approach (here, this means regular continual pretraining of the base model). The Llama 3 70B base model is included as a reference; I suppose to showcase that small specialized models can beat larger general models. Almost every time I explain the LLM pretraining pipeline to someone, they are surprised by its simplicity and the fact that this is still what's commonly used to train LLMs today. The instruction pretraining approach is quite refreshing in that sense. One caveat is that for large pretraining corpora, it might still be expensive to create the instruction-augmented corpora. However, the nice thing about generated data is that it can be reused in many different projects once created. I cannot write this article without mentioning Google's new Gemma 2 models , which are arguably the biggest model release last month. However, when it comes to pure size, Nvidia's Nemotron-4 340B takes the crown (https://arxiv.org/abs/2406.11704). The Gemma 2 models come in 2.6B, 9B, and 27B parameter versions. Since this article is already quite lengthy, and you're likely familiar with Gemma 2 from other sources, let's cut to the chase. What are the main highlights and noteworthy updates in Google's newly released Gemma 2 LLMs? The main theme is exploring techniques without necessarily increasing the size of training datasets but rather focusing on developing relatively small and efficient LLMs. Specifically, they blend three main architectural and training choices to create the 2.6B and 9B parameter models: sliding window attention, grouped-query attention, and knowledge distillation. Sliding window attention (e.g., as popularized by Mistral) is a technique using a fixed-sized attention block that allows a current token to attend to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below. Annotated figure from https://arxiv.org/abs/2310.06825 explaining sliding window attention. In the case of Gemma 2, the authors alternated between regular attention and sliding window attention layers. The sliding attention block size was 4096 tokens, spanning a total block size of 8192 tokens. Sliding window attention is mainly used to improve computational performance, and the researchers also included a small ablation study showing that there's a barely noticeable difference in perplexity when shrinking the block size during inference. An ablation study from the Gemma 2 technical report showing that a decreased block size for the sliding window barely impacts the modeling performance of the 9B parameter model during inference. (It would have been interesting to see the GPU memory improvement side-by-side.) Group-query attention (like in Llama 2 and 3) can be regarded as a more generalized form of multi-query attention. The motivation behind this is to reduce the number of trainable parameters by sharing the same Keys and Values heads for multiple Query heads, thereby lowering computational requirements. Annotated figure from Ainslie et al. 2023 The general idea of Knowledge distillation (as in MiniLLM, https://arxiv.org/abs/2306.08543) is to transfer knowledge from a larger model (the teacher) to a smaller model (the student). Here, they trained a 27B (teacher) model from scratch and then trained the smaller 2B and 9B (student) models on the outputs of the larger teacher model. The 27B model doesn't use knowledge distillation but was trained from scratch to serve as a "teacher" for the smaller models. An overview of knowledge distillation from my Machine Learning Q and AI book in the context of computer vision. In an LLM-context, think of text instead of images and predicted tokens instead of class labels. The paper contains many other interesting tidbits. For instance, one hallmark of Gemma 2 is its relatively large vocabulary size: 256,000 tokens. This is similar to the first Gemma model, but it's still worth noting since it's twice the size of the Llama 3 vocabulary (128,000) and eight times the size of the Phi-3 vocabulary (32,000). The vocabulary size of an LLM refers to the number of unique tokens (words, subwords, or characters) that the model can recognize and generate. A large vocabulary size in LLMs allows for better coverage of words and concepts, improved handling of multilingual content, and reduced tokenization artifacts. However, a large vocabulary size also comes with trade-offs, such as increased model size and potentially slower inference due to the larger embedding and output layers. (That's where the sliding window attention and multi-query attention mechanism are important to offset this.) There's also an interesting section on "logit capping," a technique I haven't seen used before. Essentially, it is a form of min-max normalizing and clipping of the logit values to keep them within a certain range. I presume this is to improve stability and gradient flow during training. logits ← soft_cap ∗ tanh(logits/soft_cap). Additionally, they leverage model merging techniques to combine models from multiple runs with different hyperparameters, although the paper doesn't provide much detail about that. (However, interested readers can read more about this in WARP: On the Benefits of Weight Averaged Rewarded Policies , which Gemma 2 uses for this.) In terms of modeling performance, Gemma 2 is almost as good as the 3x larger Llama 3 70B, and it beats the old Qwen 1.5 32B model. It would be interesting to see a comparison with the more recent Qwen 2 model. A comparison between two other popular models with openly available weights: Llama 3 and Qwen 1.5. (Annotated table from the Gemma 2 technical report ). Personally, a highlight is that the Gemma 2 report includes ablation studies for some of its architectural choices. This was once a given in academic research but is increasingly rare for LLM research. An example of one of the ablation studies included in the Gemma 2 technical report . Here "wide" refers to a model with 28 layers and an intermediate size of 24,576, and "deep" refers to a architecture with 42 layers with an intermediate size of 14,336. It's refreshing to see such a relatively detailed technical report from Google. When it comes to the model itself, based on public consensus, Gemma 2 is likely the most capable model for single-GPU use cases today. For larger models, Llama 3 70B and Qwen 2 72B remain strong contenders. Ahead of AI is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of my books . If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. If you have a few moments, a review of Machine Learning Q and AI or Machine Learning with PyTorch and Scikit-Learn on Amazon would really help, too! Your support means a great deal and is tremendously helpful in continuing this journey. Thank you! Below is a selection of other interesting papers I stumbled upon this month. Given the length of this list, I highlighted those 20 I found particularly interesting with an asterisk (*). However, please note that this list and its annotations are purely based on my interests and relevance to my own projects. Scaling Synthetic Data Creation with 1,000,000,000 Personas by Chan, Wang, Yu, et al. (28 June), https://arxiv.org/abs/2406.20094 The research proposes a persona-driven data synthesis methodology that utilizes an LLM to create diverse synthetic data by leveraging a vast collection of automatically curated personas, called Persona Hub, which represents about 13% of the world's population. LLM Critics Help Catch LLM Bugs by McAleese, Pokorny, Ceron Uribe, et al. (28 June), https://arxiv.org/abs/2407.00215 This study develops "critic" models using RLHF to assist humans in evaluating model-generated code, training LLMs to write natural language feedback on code errors, and demonstrating their effectiveness in catching bugs across various tasks. Direct Preference Knowledge Distillation for Large Language Models by Li, Gu, Dong, et al. (28 June), https://arxiv.org/abs/2406.19774 DPKD reformulates Knowledge Distillation for LLMs into a two-stage process: first optimizing an objective combining implicit reward and reverse KL divergence, then improving the preference probability of teacher outputs over student outputs. Changing Answer Order Can Decrease MMLU Accuracy by Gupta, Pantoja, Ross, et al. (27 June), https://arxiv.org/abs/2406.19470 This study investigates the robustness of accuracy measurements on the MMLU benchmark for LLMs, revealing that shuffling answer label contents leads to decreased accuracy across models, with varying sensitivity. From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data by Xiong, Papageorgiou, Lee, and Papailiopoulos (27 June), https://arxiv.org/abs/2406.19292 This study proposes a finetuning approach using a synthetic dataset of numerical key-value retrieval tasks to improve LLM's long-context information retrieval and reasoning capabilities. Dataset Size Recovery from LoRA Weights by Salama, Kahana, Horwitz, and Hoshen (27 June), https://arxiv.org/abs/2406.19395 This study introduces a method for recovering the number of images used to finetune a vision model using LoRA, by analyzing the norm and spectrum of LoRA matrices. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs by Azerbayev, Shao, Lin, et al. (26 June), https://arxiv.org/abs/2406.18629 This paper introduces Step-DPO, a method that optimizes individual reasoning steps in mathematical problem-solving for LLMs using a custom 10K step-wise preference pair dataset. RouteLLM: Learning to Route LLMs with Preference Data by Ong, Amjad, et al. (26 June), https://arxiv.org/abs/2406.18665 This study proposes efficient router models that dynamically select between stronger and weaker LLMs during inference to optimize cost-performance trade-offs. * A Closer Look into Mixture-of-Experts in Large Language Models by Zhang, Liu, Patel, et al. (26 June), https://arxiv.org/abs/2406.18219 This study looks at the inner workings of Mixture-of-Experts (MoE) LLMs to share insights about neuron behavior, expert selection criteria, and expert diversity across layers, while providing practical suggestions for MoE design and implementation based on these observations. * Following Length Constraints in Instructions by Yuan, Kulikov, Yu, et al. (25 June), https://arxiv.org/abs/2406.17744 This study introduces a method to train LLMs that can follow user-specified length constraints at inference time, addressing the length bias in model evaluation and outperforming standard instruction-following models in length-controlled tasks. LongIns: A Challenging Long-context Instruction-based Exam for LLMs by Shaham, Bai, An, et al. (25 June), https://arxiv.org/abs/2406.17588 LongIns is a new benchmark for evaluating LLM's long-context capabilities, using three settings to assess retrieval and reasoning abilities. * The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale by He, Wang, Shen, et al. (25 June), https://arxiv.org/abs/2406.17557 This report introduces FineWeb, a 15-trillion token dataset derived from Common Crawl, and FineWeb-Edu, a 1.3-trillion token educational subset. Adam-mini: Use Fewer Learning Rates To Gain More by Zhang, Chen, Li, et al . (24 June), https://arxiv.org/abs/2406.16793 Adam-mini is a proposed optimizer that achieves comparable or better performance than AdamW while using 45-50% less memory by strategically reducing learning rate resources, partitioning parameters based on Hessian structure, and assigning optimized single learning rates to parameter blocks. WARP: On the Benefits of Weight Averaged Rewarded Policies by Ramé, Ferret, Vieillard, et al. (24 June), https://arxiv.org/abs/2406.16768 The paper introduces a new alignment strategy for LLMs that merges policies at three stages: using exponential moving average for dynamic KL regularization, spherical interpolation of independently fine-tuned policies, and linear interpolation with initialization. Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers by Lou, Jia, Zheng, and Tu (24 June), https://arxiv.org/abs/2406.16747 The authors propose a new sparse attention mechanism for autoregressive Transformers, using a scoring network and differentiable top-k mask operator to select a constant number of KV pairs per query to achieve linear time complexity and a constant memory footprint. Efficient Continual Pre-training by Mitigating the Stability Gap by Wang, Hu, Xiong, et al. (21 June), https://arxiv.org/abs/2406.14833 This study proposes three strategies to improve continual pretraining of LLMs: multiple epochs on a subset, focusing on high-quality data, and using a mixture similar to pretraining data. MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression by Fu, Huang, Ning, et al. (21 June), https://arxiv.org/abs/2406.14909 Mixture of Attention (MoA) automatically optimizes sparse attention patterns for different model components and input lengths in LLMs, improving context length, accuracy, and efficiency over uniform sparse attention approaches. LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs by Jiang, Ma, Chen, et al. (21 June), https://arxiv.org/abs/2406.15319 LongRAG introduces a new RAG framework using 4K-token retrieval units and long-context LLMs for answer extraction, which improves retrieval performance and achieving state-of-the-art results on question-answering tasks without additional training. * A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems by Cuconasu, Trappolini, Tonellotto, et al. (21 June), https://arxiv.org/abs/2406.14972 This study challenges conventional wisdom by demonstrating that base LLMs outperform instruction-tuned models in Retrieval Augmented Generation (RAG) tasks. Can LLMs Learn by Teaching? A Preliminary Study by Ning, Wang, Li, Lin, et al. (20 June), https://arxiv.org/abs/2406.14629 The authors develop and test three methods for implementing "Learning by Teaching" in LLMs, mimicking human teaching processes at different levels: observing student feedback, learning from feedback, and iterative learning, to improve model performance without relying on additional human-produced data or stronger models. * Instruction Pre-Training: Language Models are Supervised Multitask Learners by Cheng, Gu, Huang, et al. (20 June), https://arxiv.org/abs/2406.14491 This study introduces a framework for supervised multitask pretraining of LLMs that augments raw corpora with synthetically generated instruction-response pairs. * Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? by Wu, Zhang, Johnson, et al. (19 June), https://arxiv.org/abs/2406.13121 This study introduces a benchmark for evaluating long-context LLMs on tasks requiring up to millions of tokens, demonstrating that these long-context LLMs can compete with specialized retrieval and RAG systems in in-context retrieval and reasoning tasks. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges by Ye, Turpin, Li, He, et al. (18 June), https://arxiv.org/abs/2406.12624 This paper evaluates the LLM-as-a-judge paradigm using TriviaQA as a benchmark, comparing 9 judge models and 9 exam taker models against human annotations, revealing that models with high human alignment may not necessarily be the best at ranking exam taker models. From RAGs to Rich Parameters: Probing How Language Models Utilize External Knowledge Over Parametric Information for Factual Queries by Wadhwa, Seetharaman, Aggarwal, et al. (18 June), https://arxiv.org/abs/2406.12824 The authors investigate the mechanics of Retrieval Augmented Generation (RAG) in LLMs to reveal that models predominantly rely on retrieved context information rather than their parametric memory when answering questions, exhibiting a shortcut behavior across different model families. Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts by Kang, Karlinsky, and Luo, et al. (17 June), https://arxiv.org/abs/2406.12034 This paper introduces a method that transforms a monolithic (LLM into a modular system called MiXSE (MiXture of Self-specialized Experts), using self-generated synthetic data to create specialized expert modules with shared base LLM and self-optimized routing. Measuring memorization in RLHF for code completion by Pappu, Porter, Shumailov, and Hayes (17 June), https://arxiv.org/abs/2406.11715 This study investigates the impact of Reinforcement Learning with Human Feedback (RLHF) on data memorization in LLMs, focusing on code completion tasks, and finds that while RLHF reduces memorization of data used in reward modeling and reinforcement learning compared to direct finetuning, it largely preserves memorization from the initial finetuning stage. HARE: HumAn pRiors, a key to small language model Efficiency by Zhang, Jin, Ge, et al. (17 June), https://arxiv.org/abs/2406.11410 This study proposes a principle for leveraging human priors in data construction for Small Language Models (SLMs) that focuses on semantic diversity and data quality consistency while avoiding benchmark data leakage. Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level by Kim, Lee, Park, et al. (17 June), https://arxiv.org/abs/2406.11817 This study introduces iterative length-regularized Direct Preference Optimization (iLR-DPO), a method that improves LLM alignment with human preferences while controlling response verbosity. Unveiling Encoder-Free Vision-Language Models by Choi, Yoon, Lee, et al . (17 June), https://arxiv.org/abs/2406.11832 This study presents an encoder-free vision-language model (VLM) that directly processes both visual and textual inputs in a unified decoder. * DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence by Zhu, Wang, Lee, et al. (17 June), https://arxiv.org/abs/2406.11931 DeepSeek-Coder-V2 is an open-source Mixture-of-Experts code LLM that achieves GPT4-Turbo-level performance on coding tasks through continued pretraining on 6 trillion additional tokens. Tokenization Falling Short: The Curse of Tokenization by Nguyen, Kim, Patel, et al. (17 June), https://arxiv.org/abs/2406.11687 This study investigates the "curse of tokenization" in LLMs by examining their performance on complex problem solving, token structure probing, and resilience to typographical variation, revealing that while scaling model size helps, LLMs remain susceptible to tokenization-induced biases DataComp-LM: In Search of the Next Generation of Training Sets for Language Models by Li, Fang, Smyrnis, et al. (17 June), https://arxiv.org/abs/2406.11794 The authors provide a standardized testbed for experimenting with dataset curation strategies in language model training, including a 240T token corpus, pretraining recipes, and 53 downstream evaluations, * Nemotron-4 340B Technical Report by Unknown Authors at NVIDIA (17 June), https://arxiv.org/abs/2406.11704 This technical report accompanies NVIDIA release of the Nemotron-4 340B model family, which performs competitively on various benchmarks and excels in synthetic data generation, along with open-sourcing its data generation pipeline for further research and development. mDPO: Conditional Preference Optimization for Multimodal Large Language Models by Wang, Zhou, Huang, et al. (17 June), https://arxiv.org/abs/2406.11839 mDPO addresses the unconditional preference problem in multimodal DPO by optimizing image preference alongside language preferences and introducing a reward anchor to prevent likelihood decrease for chosen responses. * How Do Large Language Models Acquire Factual Knowledge During Pretraining? by Chang, Park, Ye, et al. (17 June), https://arxiv.org/abs/2406.11813 Task Me Anything by Zhang, Huang, Ma, et al. (17 June), https://arxiv.org/abs/2406.11775 Task-Me-Anything is a benchmark generation engine that creates tailored benchmarks for multimodal language models by programmatically generating task instances from a vast taxonomy of images and videos. THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation by Kim, Ong, Kwon, et al. (16 June), https://arxiv.org/abs/2406.10996 Theanine augments LLMs' response generation by using memory timelines (series of memories showing the development and causality of past events) to improve the model's ability to recall and utilize information from lengthy dialogue histories. Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs by Yang, Ding, Lin, et al. (14 June) https://arxiv.org/abs/2406.10216 This study proposes enhancing reward model generalization in RLHF by regularizing hidden states through retaining the base model's language model head and incorporating text-generation losses, while simultaneously learning a reward head, thus improving out-of-distribution task performance and mitigating reward over-optimization. Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs by Hans, Wen, Jain, et al. (14 Jun) , https://arxiv.org/abs/2406.10209 The "goldfish loss" technique reduces model memorization in LLMs by randomly excluding a subset of tokens from the loss computation during training, preventing the model from learning complete verbatim sequences from the training data. Bootstrapping Language Models with DPO Implicit Rewards by Chen, Liu, Du, et al. (14 June), https://arxiv.org/abs/2406.09760 Researchers find that using the aligned model, an implicit reward model, generated during direct preference optimization (DPO) can itself be used to generate a preference dataset to further substantially improve itself. FouRA: Fourier Low Rank Adaptation by Borse, Kadambi, Pandey, et al. (13 June), https://arxiv.org/abs/2406.08798 This research introduces FouRA, a new low-rank adaptation (LoRA) method that operates in the Fourier domain and uses adaptive rank selection, addressing issues of data copying and distribution collapse in LoRA-fine-tuned text-to-image diffusion models while improving image quality and generalization. * An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels by Nguyen, Mahmoud Assran, Jain, et al. (13 June), https://arxiv.org/abs/2406.09415 This research reveals that vanilla Transformers can achieve high performance in various computer vision tasks by treating individual pixels as tokens, which challenges the assumed necessity of locality-based inductive bias in modern vision architectures and suggests new possibilities for future neural network designs in computer vision. MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding by Zuhri, Adilazuarda,Purwarianti, and Aji (13 June), https://arxiv.org/abs/2406.09297 This research introduces Multi-Layer Key-Value (MLKV) sharing, a new technique that extends Key-Value (KV) caching across transformer layers, which substantially reduces memory usage during auto-regressive inference beyond existing methods like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), while maintaining performance on NLP tasks. Transformers Meet Neural Algorithmic Reasoners by Bounsi, Ibarz, Dudzik, et al. (13 June), https://arxiv.org/abs/2406.09308 TransNAR is a hybrid architecture combining Transformers with graph neural network-based neural algorithmic reasoners (NARs), which enables improved performance on algorithmic reasoning tasks by allowing the Transformer to leverage the robust computational capabilities of NARs while maintaining strong natural language understanding. Discovering Preference Optimization Algorithms with and for Large Language Models by Lu, Holt, Fanconi, et al. (12 June), https://arxiv.org/abs/2406.08414 The proposed Discovered Preference Optimization method uses LLMs to automatically discover and implement new preference optimization algorithms for improving LLM outputs. * An Empirical Study of Mamba-based Language Models by Waleffe, Byeon, Riach, et al. (12 June), https://arxiv.org/abs/2406.07887 This research compares 8B-parameter state-space models (Mamba, Mamba-2) and Transformer models trained on large datasets, finding that while pure state-space models match or exceed Transformers on many tasks, they lag behind on tasks requiring strong copying, in-context learning, or long-context reasoning; however, hybrids seem to offer the best of both worlds. * Large Language Models Must Be Taught to Know What They Don't Know by Kapoor, Gruver, Roberts, et al. (12 June), https://arxiv.org/abs/2406.08391 This research demonstrates that finetuning LLMs on a small dataset of graded examples can produce more reliable uncertainty estimates than prompting alone, with the resulting models capable of estimating uncertainty for themselves and other models. Large Language Model Unlearning via Embedding-Corrupted Prompts by Liu, Flannigan, and Liu (12 June), https://arxiv.org/abs/2406.07933 This research introduces embedding-corrupted prompts, a method for selective knowledge unlearning in LLMs that uses prompt classification and embedding corruption to achieve targeted forgetting with minimal side effects across a wide range of model sizes. What If We Recaption Billions of Web Images with LLaMA-3? by Li, Tu, Hui, et al. (12 June) https://arxiv.org/abs/2406.08478 This research demonstrates that using a finetuned Llama 3-powered LLaVA-1.5 multimodal LLM to recaption 1.3 billion images from the DataComp-1B dataset significantly improves the performance of vision-language models in various tasks. * Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing by Xu, Jiang, Niu et al. (12 June), https://arxiv.org/abs/2406.08464 Researchers propose a synthetic instruction data generation method that generates 300,000 high-quality instruction-response pairs from Llama-3-Instruct; this data can be used for supervised instruction fine-tuning to rival the performance of aligned LLMs without requiring an actual alignment step. * Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling by (11 June), https://arxiv.org/abs/2406.07522 Samba is a hybrid model combining selective state space models (think Mamba) with sliding window attention that scales efficiently to 3.8B parameters. * Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement (11 June) by Wu, Zhao, and Zheng, https://arxiv.org/abs/2406.07138 CREAM is a training-efficient method for extending the context length of LLMs by interpolating positional encodings and using a truncated Gaussian to prioritize middle-context information. Simple and Effective Masked Diffusion Language Models by Sahoo, Arriola, Schiff, et al. (11 June), https://arxiv.org/abs/2406.07524 This work demonstrates that masked discrete diffusion models, when trained with an effective recipe and a simplified objective, can substantially narrow the performance gap with autoregressive methods in language modeling. TextGrad: Automatic "Differentiation" via Text by Yuksekgonul, Bianchi, Boen, et al. (11 June), https://arxiv.org/abs/2406.07496 TextGrad is a framework that leverages LLMs to "backpropagate" textual feedback for optimizing building blocks (such as "tool caller", "search engine", etc.) in compound AI systems. An Image is Worth 32 Tokens for Reconstruction and Generation by Yu, Weber, Deng, et al. (11 June), https://arxiv.org/abs/2406.07550 The authors propose a transformer-based 1-dimensional tokenizer for image generation that reduces a 256x256x3 image to just 32 discrete tokens. * Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching by Zhang, Peng, Zhou, et al. , (10 June), https://arxiv.org/abs/2406.06326 The Self-Tuning framework improves the knowledge acquisition of LLMs from raw documents through self-teaching tasks focused on memorization, comprehension, and self-reflection. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters by Song, Xie, Zhang, et al. (10 June), https://arxiv.org/abs/2406.05955 This paper proposes the dReLU activation function and an optimized training data mixture to improve activation sparsity in LLMs. Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning by Kim, Paranjape, Khot, and Hajishirzi (10 June), https://arxiv.org/abs/2406.06469 Husky is an open-source language agent that learns to reason over a unified action space to tackle diverse tasks involving numerical, tabular, and knowledge-based reasoning by iterating between generating and executing actions with expert models. Margin-aware Preference Optimization for Aligning Diffusion Models Without Reference by Hong, Paul, Lee, et al. (10 June), https://arxiv.org/abs/2406.06424 To address limitations of traditional alignment techniques like RLHF and DPO, the authors propose Margin-Aware Preference Optimization (MaPO) for text-to-image diffusion models, which maximizes the likelihood margin between preferred and dispreferred image sets without using a reference model. * Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation by Sun, Jian, Chen, et al. (10 June), https://arxiv.org/abs/2406.06525 The authors propose LlamaGen, which involves applying the "next-token prediction" paradigm of large language models to image generation. Creativity Has Left the Chat: The Price of Debiasing Language Models by Mohammidi (8 June), https://arxiv.org/abs/2406.05587 This research reveals that while alignment techniques like RLHF mitigate biases in LLMs, they can diminish the models' creative capabilities, impacting syntactic and semantic diversity, which is crucial for tasks requiring creative output. 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination by Yang, Chen, Madaan, et al. (7 June), https://arxiv.org/abs/2406.05132 This research introduces 3D-GRAND, a dataset of 40,087 household scenes paired with 6.2 million scene-language instructions, and utilizes instruction tuning and the 3D-POPE benchmark to enhance grounding capabilities and reduce hallucinations in 3D-LLMs BERTs are Generative In-Context Learners by Samuel (7 June), https://arxiv.org/abs/2406.04823 This paper demonstrates that masked language models, like DeBERTa, can perform in-context learning using a simple inference technique that reformats the sequence of input tokens with mask tokens that resemble the structure of a causal attention mask. June 7, Mixture-of-Agents Enhances Large Language Model Capabilities, https://arxiv.org/abs/2406.04692 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild by Lin, Deng, Chandu, et al. (7 June), https://arxiv.org/abs/2406.04770 The authors introduce an automated evaluation framework for benchmarking LLMs using real-world user queries, featuring 1,024 tasks and two advanced metrics, WB-Reward and WB-Score, which provide reliable and interpretable automatic judgments by employing task-specific checklists and structured explanations. CRAG -- Comprehensive RAG Benchmark by Yang, Sun, Xin, et al. (7 June), https://arxiv.org/abs/2406.04744 This research introduces a factual question answering dataset of 4,409 question-answer pairs with mock APIs simulating web and Knowledge Graph searches, designed to reflect diverse, dynamic real-world QA tasks. Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach by Dong, Luo, Zhang, et al. (7 June), https://arxiv.org/abs/2406.04594 This research introduces C4, a communication-driven solution for parallel training of LLMs, which rapidly identifies and isolates hardware faults and optimizes traffic planning to reduce network congestion, which can cut error-induced overhead by up to 30% and improving runtime performance by up to 15%. Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step by Liang, Yuan, Gu, et al. (6 June), https://arxiv.org/abs/2406.04314 This research introduces Step-aware Preference Optimization, a post-training approach that independently evaluates and adjusts denoising performance at each step in text-to-image diffusion models, which outperforms Diffusion-DPO in image alignment and aesthetics while offering 20x faster training efficiency. * Are We Done with MMLU? by Gema, Leang, Hong, et al. (6 June), https://arxiv.org/abs/2406.04127 This study identifies numerous errors in the widely-used MMLU benchmark, creates a re-annotated subset called MMLU-Redux that reveals significant discrepancies in reported model performance, and advocates for revising MMLU to improve its reliability. * Transformers Need Glasses! Information Over-Squashing in Language Tasks by Barbero, Banino, Kapturowski, et al. (6 June), https://arxiv.org/abs/2406.04267 The study analyzes information propagation in LLMs (specifically: decoder-only transformers), revealing a representational collapse phenomenon where distinct input sequences can yield arbitrarily close final token representations, leading to errors in tasks like counting or copying and loss of sensitivity to specific input tokens. The Prompt Report: A Systematic Survey of Prompting Techniques by Schulhoff, Ilie, Balepur, et al. (6 June), https://arxiv.org/abs/2406.06608 This 76-page paper aims to provide a clear and organized framework for understanding prompts and prompting techniques. Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models by Yang, Yu, Zhang, et al. (6 June), https://arxiv.org/abs/2406.04271 This Buffer of Thoughts approach improves LLMs by retrieving and instantiating thought-templates, which are generic problem-solving blueprints, for reasoning across various domains. Block Transformer: Global-to-Local Language Modeling for Fast Inference (4 June) by Ho, Bae, Kim, et al. , https://arxiv.org/abs/2406.02657 The proposed Block Transformer improves inference throughput 10-20x by isolating expensive global attention to lower layers on fixed-size token blocks and applying fast local attention in upper layers. * Scalable MatMul-free Language Modeling by Zhu, Zhang, Sifferman, et al. (4 June), https://arxiv.org/abs/2406.02528 This paper presents a scalable MatMul-free language model architecture that replaces matrix multiplications with element-wise products and accumulations using ternary weights that works well even in billion-parameter scales. Towards Scalable Automated Alignment of LLMs: A Survey , (3 June) by Cao, Lu, Lu, et al. https://arxiv.org/abs/2406.01252 This paper reviews the recent and emerging automated alignment methods for LLMs that typically follow the instruction finetuning step in an LLM development pipeline. The Geometry of Categorical and Hierarchical Concepts in Large Language Models by by Park, Choe, Jiang, and Veitch (3 June), https://arxiv.org/abs/2406.01506 Using a Gemma LLM, this paper extends the linear representation hypothesis, showing that categorical concepts are simplices, hierarchical relations are orthogonal, and complex concepts are polytopes, validated with 957 WordNet concepts. OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models by Büyükakyüz (3 June), https://arxiv.org/abs/2406.01775 OLoRA is an enhancement of Low-Rank Adaptation (LoRA) using orthonormal matrix initialization via QR decomposition, which accelerates the convergence of LLM training compared to regular LoRA. Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models by Wei, Zhu, Zhao et al. (3 June), https://arxiv.org/abs/2406.06563 A report that describes some of the approaches and methods behind developing a 146B parameter mixture-of-experts LLM from an existing 13B parameter dense (non-mixture-of-experts) model. Show, Don't Tell: Aligning Language Models with Demonstrated Feedback by Shaikh, Lam, Hejna, et al. (2 June), https://arxiv.org/abs/2406.00888 The proposed method aligns method aligns LLM outputs to specific user behaviors using fewer than 10 demonstrations as feedback, leveraging imitation learning. This magazine is personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of one of my books . If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. (Sharing your feedback with others via a book review on Amazon helps a lot, too!) Build A Large Language Model (From Scratch) , Machine Learning Q And AI , and Machine Learning with PyTorch and Scikit-Learn Your support means a great deal! Thank you! A new, cost-effective method for generating data for instruction finetuning Instruction finetuning from scratch Pretraining LLMs with instruction data An overview of what's new in Gemma 2 An overview of all the other interesting research papers that came out in June Annotated illustration of the Magpie method for generating a synthetic dataset for instruction finetuning. The figure is based on illustrations from the Magpie paper: https://arxiv.org/abs/2406.08464 Essentially, as shown in the figure above, we just have to prompt the Llama 3 8B Instruct model with a pre-query template, and it will generate an instruction for us. Then, we feed that instruction back to the LLM, and it will generate a response. If we repeat this procedure a couple of thousand times, we obtain a dataset for instruction finetuning. (Optionally, we can apply an LLM to filter the instruction-response pairs by quality.) 1.2 Dataset quality What's fascinating is that with the resulting instruction dataset, the authors found that finetuning a Llama 3 8B base model with just instruction finetuning (no preference finetuning via RLHF and DPO) beats the original Llama 2 8B Instruct model by Meta AI, as shown in the figure below. A Llama 3 8B base model finetuned on the Magpie-generated instruction dataset beats the original Llama 3 8B Instruct model. Based on an annotated illustration from the Magpie paper: https://arxiv.org/abs/2406.08464 The Magpie results shown in the figure above wer achieved with 300 thousand samples only. In comparison, The original Llama 3 Instruct model was finetuned and aligned on 100 million samples! 1.3 Running the Dataset Generation Locally I was skeptical at first, so I tried to implement this myself. It really works! Here , you can find my reimplementation using Ollama, which even runs fine locally on a MacBook Air. Code screenshot from a reimplementation of the Magpie method that runs locally. The code is available here . 1.4 Additional Details The authors created two sets of datasets: A "Pro" version using the Llama 3 70B Instruct model and an "Air" version using the Llama 3 8B Instruct model. As an earlier figure showed, the Magpie-Pro-generated dataset results in slightly stronger models compared to the Magpie-Air dataset when using it to instruction-finetune a Llama 3 8B base model. The figure below shows an additional comparison of the dataset qualities and difficulties as rated via an LLM. Annotated plots from the Magpie paper showing the dataset quality and difficulty of the Air and Pro datasets relative to each other. As the figure above shows, the quality of the Air and Pro datasets is roughly on par. In addition, it would have been interesting to see how the Alpaca dataset compares to these. (The assumption is that the Magpie data is of much higher quality than Alpaca, but a reference point would be interesting.) Furthermore, the paper contains an analysis showing that the breadth or diversity in this dataset is much larger than that of other popular datasets for instruction finetuning, such as Alpaca, Evol Instruct, and UltraChat. In addition, when compared to models trained with other instruction finetuning datasets, the Magpie-Pro finetuned model also compares very favorably. 1.5 Conclusion Overall, I think that Magpie is an interesting exploit that is, on the one hand, fascinating in its effectiveness and, on the other hand, has a lot of practical utility. I will certainly consider it as an interesting, simple, and cost-effective candidate for constructing general-purpose instruction datasets in the future. 2. Instruction Finetuning from Scratch If you are looking for a resource to understand the instruction finetuning process in LLMs, I am happy to share that Chapter 7 on instruction finetuning LLMs is now finally live on the Manning website . This is the longest chapter in the book and takes a from-scratch approach to implementing the instruction finetuning pipeline. This includes everything from input formatting to batching with a custom collate function, masking padding tokens, the training loop itself, and scoring the response quality of the finetuned LLM on a custom test set. (The exercises include changing prompt styles, instruction masking, and adding LoRA.) Happy coding! An overview of chapter 7 in my "Build a Large Language Model From Scratch" book. Supplementary code materials are available here on GitHub . PS: it's also the last chapter, and the publisher is currently preparing the layouts for the print version. 3. Instruction Pretraining LLMs In the paper "Instruction Pre-Training: Language Models are Supervised Multitask Learners" ( https://arxiv.org/abs/2406.14491 ), researchers investigate whether LLM pretraining can be made more efficient by including synthetic instruction-response pairs instead of just raw text. (Here, "raw text" means text from books, websites, papers, and so forth that has not been reprocessed into a specific format.) A comparison between regular pretraining (top) and the proposed instruction pretraining approach (bottom) via an annotated figure from https://arxiv.org/abs/2406.14491 Specifically, the researchers experiment with generating instruction-response data from the raw training corpus itself via an "instruction synthesizer," an LLM specifically finetuned for this task. (Note that this is not the first paper proposing the formatting of raw text as instruction data. Another work that comes to mind is "Genie: Achieving Human Parity in Content-Grounded Datasets Generation" ( https://arxiv.org/abs/2401.14367 ). I also recall seeing another paper or blog post using instruction data during pretraining a few months ago—I discussed this method with some of my colleagues—but unfortunately, I couldn't find the reference. Nonetheless, the paper discussed here is particularly intriguing since it builds on openly available LLMs that run locally and covers both pretraining and continual pretraining.) 3.1 Instruction Synthesizer Before we dive into the pretraining and continual pretraining results, let's talk about the core component of this method: the instruction synthesizer. This is an openly available Mistral 7B v0.1 LLM (which I wrote about last year here: https://magazine.sebastianraschka.com/i/138555764/mistral-b ) that has been finetuned to generate instruction-response pairs from raw text. To finetune this synthesizer, the researchers use datasets such as HotpotQA ( https://arxiv.org/abs/1809.09600 ), which consists of passages from Wikipedia associated with questions and answers. For this, the authors also ensure that a variety of tasks, like commonsense reasoning, sentiment analysis, math problems, etc., are covered. The input and output data of the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491 Once this instruction synthesizer is developed (i.e., finetuned), it can be used to generate the input data for pretraining the target LLMs. One last noteworthy detail regarding the instruction synthesizer is that multiple raw texts (T n ) and instruction-response pairs (I n ⊕ R n ) are concatenated as few-shot examples, as shown in the figure below. The formatting of the instruction-data for finetuning (and using) the instruction synthesizer via an annotated figure from https://arxiv.org/abs/2406.14491 3.2 Pretraining with Instruction Data Now that we have discussed the method to generate the instruction-response pairs, let's get to the interesting part: how well do models train on this augmented dataset. The first set of results looks at two small models trained from scratch: 500M parameters and 1.3B parameters (both are based on the Mistral architecture). A comparison of 3 different pretraining approaches used to train models from scratch (annotated table from https://arxiv.org/abs/2406.14491) As we can see in the table above, the model trained via the proposed instruction pretraining approach ( Instruct PT ) performs best on most benchmark tasks (higher values are better). Note, though, that it has seen more tokens than the Vanilla PT approach since it included the synthesized instruction-response pairs. Hence, the authors included the Mix PT comparison, which is a model that has been trained on a data mix containing both the raw text and the instruction data used to train the synthesizer. From this comparison, we can see that not simply using any instruction data makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) makes the difference. (The authors conducted all experiments using the same number of tokens.) In addition, it's worth noting that the Instruct PT pretrained models have another advantage: They improve a more when they are instruction-finetuned afterwards, as the figure below shows. Finetuning LLMs that have been pretrained with either the traditional pretraining pardigm (Vanilla PT) or instruction pretraining (annotated figure from https://arxiv.org/abs/2406.14491) 3.3 Continual Pretraining with Instruction Data Pretraining from scratch is interesting because that's how LLMs are created in the first place. However, I'd say that practitioners care more about continual pretraining and finetuning. Continual pretraining here means that we take an existing pretrained model and pretrain it further on new domain data. For instance, think of a Llama 3 8B base model that has been trained on a general text corpus and that you want to adapt for finance, medical, legal, or other domains. The table below summarizes the results the researchers obtained when applying the instruction pretraining method to a pretrained Llama 3 8B base model. Specifically, they conducted continual pretraining with both biomedical texts and finance texts. A comparison of 3 different pretraining approaches used for continual pretraining (annotated table from https://arxiv.org/abs/2406.14491) Looking at the table above, we can see that the instruction pretraining approach ( Instruct PT ) clearly outperforms the vanilla pretraining ( Vanilla PT ) approach (here, this means regular continual pretraining of the base model). The Llama 3 70B base model is included as a reference; I suppose to showcase that small specialized models can beat larger general models. 3.4 Conclusion Almost every time I explain the LLM pretraining pipeline to someone, they are surprised by its simplicity and the fact that this is still what's commonly used to train LLMs today. The instruction pretraining approach is quite refreshing in that sense. One caveat is that for large pretraining corpora, it might still be expensive to create the instruction-augmented corpora. However, the nice thing about generated data is that it can be reused in many different projects once created. 4. Gemma 2 I cannot write this article without mentioning Google's new Gemma 2 models , which are arguably the biggest model release last month. However, when it comes to pure size, Nvidia's Nemotron-4 340B takes the crown (https://arxiv.org/abs/2406.11704). The Gemma 2 models come in 2.6B, 9B, and 27B parameter versions. Since this article is already quite lengthy, and you're likely familiar with Gemma 2 from other sources, let's cut to the chase. What are the main highlights and noteworthy updates in Google's newly released Gemma 2 LLMs? The main theme is exploring techniques without necessarily increasing the size of training datasets but rather focusing on developing relatively small and efficient LLMs. Specifically, they blend three main architectural and training choices to create the 2.6B and 9B parameter models: sliding window attention, grouped-query attention, and knowledge distillation. 4.1 Sliding window attention Sliding window attention (e.g., as popularized by Mistral) is a technique using a fixed-sized attention block that allows a current token to attend to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below. Annotated figure from https://arxiv.org/abs/2310.06825 explaining sliding window attention. In the case of Gemma 2, the authors alternated between regular attention and sliding window attention layers. The sliding attention block size was 4096 tokens, spanning a total block size of 8192 tokens. Sliding window attention is mainly used to improve computational performance, and the researchers also included a small ablation study showing that there's a barely noticeable difference in perplexity when shrinking the block size during inference. An ablation study from the Gemma 2 technical report showing that a decreased block size for the sliding window barely impacts the modeling performance of the 9B parameter model during inference. (It would have been interesting to see the GPU memory improvement side-by-side.) 4.2 Group-query attention Group-query attention (like in Llama 2 and 3) can be regarded as a more generalized form of multi-query attention. The motivation behind this is to reduce the number of trainable parameters by sharing the same Keys and Values heads for multiple Query heads, thereby lowering computational requirements. Annotated figure from Ainslie et al. 2023 4.3 Knowledge distillation The general idea of Knowledge distillation (as in MiniLLM, https://arxiv.org/abs/2306.08543) is to transfer knowledge from a larger model (the teacher) to a smaller model (the student). Here, they trained a 27B (teacher) model from scratch and then trained the smaller 2B and 9B (student) models on the outputs of the larger teacher model. The 27B model doesn't use knowledge distillation but was trained from scratch to serve as a "teacher" for the smaller models. An overview of knowledge distillation from my Machine Learning Q and AI book in the context of computer vision. In an LLM-context, think of text instead of images and predicted tokens instead of class labels. 4.4 Other interesting architecture details The paper contains many other interesting tidbits. For instance, one hallmark of Gemma 2 is its relatively large vocabulary size: 256,000 tokens. This is similar to the first Gemma model, but it's still worth noting since it's twice the size of the Llama 3 vocabulary (128,000) and eight times the size of the Phi-3 vocabulary (32,000). The vocabulary size of an LLM refers to the number of unique tokens (words, subwords, or characters) that the model can recognize and generate. A large vocabulary size in LLMs allows for better coverage of words and concepts, improved handling of multilingual content, and reduced tokenization artifacts. However, a large vocabulary size also comes with trade-offs, such as increased model size and potentially slower inference due to the larger embedding and output layers. (That's where the sliding window attention and multi-query attention mechanism are important to offset this.) There's also an interesting section on "logit capping," a technique I haven't seen used before. Essentially, it is a form of min-max normalizing and clipping of the logit values to keep them within a certain range. I presume this is to improve stability and gradient flow during training. logits ← soft_cap ∗ tanh(logits/soft_cap). Additionally, they leverage model merging techniques to combine models from multiple runs with different hyperparameters, although the paper doesn't provide much detail about that. (However, interested readers can read more about this in WARP: On the Benefits of Weight Averaged Rewarded Policies , which Gemma 2 uses for this.) In terms of modeling performance, Gemma 2 is almost as good as the 3x larger Llama 3 70B, and it beats the old Qwen 1.5 32B model. It would be interesting to see a comparison with the more recent Qwen 2 model. A comparison between two other popular models with openly available weights: Llama 3 and Qwen 1.5. (Annotated table from the Gemma 2 technical report ). Personally, a highlight is that the Gemma 2 report includes ablation studies for some of its architectural choices. This was once a given in academic research but is increasingly rare for LLM research. An example of one of the ablation studies included in the Gemma 2 technical report . Here "wide" refers to a model with 28 layers and an intermediate size of 24,576, and "deep" refers to a architecture with 42 layers with an intermediate size of 14,336. 4.5 Conclusion It's refreshing to see such a relatively detailed technical report from Google. When it comes to the model itself, based on public consensus, Gemma 2 is likely the most capable model for single-GPU use cases today. For larger models, Llama 3 70B and Qwen 2 72B remain strong contenders. Supporting Ahead of AI Ahead of AI is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of my books . If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. If you have a few moments, a review of Machine Learning Q and AI or Machine Learning with PyTorch and Scikit-Learn on Amazon would really help, too! Your support means a great deal and is tremendously helpful in continuing this journey. Thank you! 5. Other Interesting Research Papers In April Below is a selection of other interesting papers I stumbled upon this month. Given the length of this list, I highlighted those 20 I found particularly interesting with an asterisk (*). However, please note that this list and its annotations are purely based on my interests and relevance to my own projects. Scaling Synthetic Data Creation with 1,000,000,000 Personas by Chan, Wang, Yu, et al. (28 June), https://arxiv.org/abs/2406.20094 The research proposes a persona-driven data synthesis methodology that utilizes an LLM to create diverse synthetic data by leveraging a vast collection of automatically curated personas, called Persona Hub, which represents about 13% of the world's population. This study develops "critic" models using RLHF to assist humans in evaluating model-generated code, training LLMs to write natural language feedback on code errors, and demonstrating their effectiveness in catching bugs across various tasks. DPKD reformulates Knowledge Distillation for LLMs into a two-stage process: first optimizing an objective combining implicit reward and reverse KL divergence, then improving the preference probability of teacher outputs over student outputs. This study investigates the robustness of accuracy measurements on the MMLU benchmark for LLMs, revealing that shuffling answer label contents leads to decreased accuracy across models, with varying sensitivity. This study proposes a finetuning approach using a synthetic dataset of numerical key-value retrieval tasks to improve LLM's long-context information retrieval and reasoning capabilities. This study proposes efficient router models that dynamically select between stronger and weaker LLMs during inference to optimize cost-performance trade-offs. This study introduces a method to train LLMs that can follow user-specified length constraints at inference time, addressing the length bias in model evaluation and outperforming standard instruction-following models in length-controlled tasks. LongIns is a new benchmark for evaluating LLM's long-context capabilities, using three settings to assess retrieval and reasoning abilities. This report introduces FineWeb, a 15-trillion token dataset derived from Common Crawl, and FineWeb-Edu, a 1.3-trillion token educational subset. Adam-mini is a proposed optimizer that achieves comparable or better performance than AdamW while using 45-50% less memory by strategically reducing learning rate resources, partitioning parameters based on Hessian structure, and assigning optimized single learning rates to parameter blocks. The paper introduces a new alignment strategy for LLMs that merges policies at three stages: using exponential moving average for dynamic KL regularization, spherical interpolation of independently fine-tuned policies, and linear interpolation with initialization. The authors propose a new sparse attention mechanism for autoregressive Transformers, using a scoring network and differentiable top-k mask operator to select a constant number of KV pairs per query to achieve linear time complexity and a constant memory footprint. This study proposes three strategies to improve continual pretraining of LLMs: multiple epochs on a subset, focusing on high-quality data, and using a mixture similar to pretraining data. Mixture of Attention (MoA) automatically optimizes sparse attention patterns for different model components and input lengths in LLMs, improving context length, accuracy, and efficiency over uniform sparse attention approaches. LongRAG introduces a new RAG framework using 4K-token retrieval units and long-context LLMs for answer extraction, which improves retrieval performance and achieving state-of-the-art results on question-answering tasks without additional training. This study challenges conventional wisdom by demonstrating that base LLMs outperform instruction-tuned models in Retrieval Augmented Generation (RAG) tasks. The authors develop and test three methods for implementing "Learning by Teaching" in LLMs, mimicking human teaching processes at different levels: observing student feedback, learning from feedback, and iterative learning, to improve model performance without relying on additional human-produced data or stronger models. This study introduces a framework for supervised multitask pretraining of LLMs that augments raw corpora with synthetically generated instruction-response pairs. This study introduces a benchmark for evaluating long-context LLMs on tasks requiring up to millions of tokens, demonstrating that these long-context LLMs can compete with specialized retrieval and RAG systems in in-context retrieval and reasoning tasks. This paper evaluates the LLM-as-a-judge paradigm using TriviaQA as a benchmark, comparing 9 judge models and 9 exam taker models against human annotations, revealing that models with high human alignment may not necessarily be the best at ranking exam taker models. The authors investigate the mechanics of Retrieval Augmented Generation (RAG) in LLMs to reveal that models predominantly rely on retrieved context information rather than their parametric memory when answering questions, exhibiting a shortcut behavior across different model families. This paper introduces a method that transforms a monolithic (LLM into a modular system called MiXSE (MiXture of Self-specialized Experts), using self-generated synthetic data to create specialized expert modules with shared base LLM and self-optimized routing. This study investigates the impact of Reinforcement Learning with Human Feedback (RLHF) on data memorization in LLMs, focusing on code completion tasks, and finds that while RLHF reduces memorization of data used in reward modeling and reinforcement learning compared to direct finetuning, it largely preserves memorization from the initial finetuning stage. This study proposes a principle for leveraging human priors in data construction for Small Language Models (SLMs) that focuses on semantic diversity and data quality consistency while avoiding benchmark data leakage. This study introduces iterative length-regularized Direct Preference Optimization (iLR-DPO), a method that improves LLM alignment with human preferences while controlling response verbosity. This study presents an encoder-free vision-language model (VLM) that directly processes both visual and textual inputs in a unified decoder. DeepSeek-Coder-V2 is an open-source Mixture-of-Experts code LLM that achieves GPT4-Turbo-level performance on coding tasks through continued pretraining on 6 trillion additional tokens. This study investigates the "curse of tokenization" in LLMs by examining their performance on complex problem solving, token structure probing, and resilience to typographical variation, revealing that while scaling model size helps, LLMs remain susceptible to tokenization-induced biases The authors provide a standardized testbed for experimenting with dataset curation strategies in language model training, including a 240T token corpus, pretraining recipes, and 53 downstream evaluations, This technical report accompanies NVIDIA release of the Nemotron-4 340B model family, which performs competitively on various benchmarks and excels in synthetic data generation, along with open-sourcing its data generation pipeline for further research and development. mDPO addresses the unconditional preference problem in multimodal DPO by optimizing image preference alongside language preferences and introducing a reward anchor to prevent likelihood decrease for chosen responses. Task-Me-Anything is a benchmark generation engine that creates tailored benchmarks for multimodal language models by programmatically generating task instances from a vast taxonomy of images and videos. Theanine augments LLMs' response generation by using memory timelines (series of memories showing the development and causality of past events) to improve the model's ability to recall and utilize information from lengthy dialogue histories. This study proposes enhancing reward model generalization in RLHF by regularizing hidden states through retaining the base model's language model head and incorporating text-generation losses, while simultaneously learning a reward head, thus improving out-of-distribution task performance and mitigating reward over-optimization. The "goldfish loss" technique reduces model memorization in LLMs by randomly excluding a subset of tokens from the loss computation during training, preventing the model from learning complete verbatim sequences from the training data. Researchers find that using the aligned model, an implicit reward model, generated during direct preference optimization (DPO) can itself be used to generate a preference dataset to further substantially improve itself. This research introduces FouRA, a new low-rank adaptation (LoRA) method that operates in the Fourier domain and uses adaptive rank selection, addressing issues of data copying and distribution collapse in LoRA-fine-tuned text-to-image diffusion models while improving image quality and generalization. This research reveals that vanilla Transformers can achieve high performance in various computer vision tasks by treating individual pixels as tokens, which challenges the assumed necessity of locality-based inductive bias in modern vision architectures and suggests new possibilities for future neural network designs in computer vision. This research introduces Multi-Layer Key-Value (MLKV) sharing, a new technique that extends Key-Value (KV) caching across transformer layers, which substantially reduces memory usage during auto-regressive inference beyond existing methods like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), while maintaining performance on NLP tasks. TransNAR is a hybrid architecture combining Transformers with graph neural network-based neural algorithmic reasoners (NARs), which enables improved performance on algorithmic reasoning tasks by allowing the Transformer to leverage the robust computational capabilities of NARs while maintaining strong natural language understanding. The proposed Discovered Preference Optimization method uses LLMs to automatically discover and implement new preference optimization algorithms for improving LLM outputs. This research compares 8B-parameter state-space models (Mamba, Mamba-2) and Transformer models trained on large datasets, finding that while pure state-space models match or exceed Transformers on many tasks, they lag behind on tasks requiring strong copying, in-context learning, or long-context reasoning; however, hybrids seem to offer the best of both worlds. This research demonstrates that finetuning LLMs on a small dataset of graded examples can produce more reliable uncertainty estimates than prompting alone, with the resulting models capable of estimating uncertainty for themselves and other models. This research introduces embedding-corrupted prompts, a method for selective knowledge unlearning in LLMs that uses prompt classification and embedding corruption to achieve targeted forgetting with minimal side effects across a wide range of model sizes. This research demonstrates that using a finetuned Llama 3-powered LLaVA-1.5 multimodal LLM to recaption 1.3 billion images from the DataComp-1B dataset significantly improves the performance of vision-language models in various tasks. Researchers propose a synthetic instruction data generation method that generates 300,000 high-quality instruction-response pairs from Llama-3-Instruct; this data can be used for supervised instruction fine-tuning to rival the performance of aligned LLMs without requiring an actual alignment step. Samba is a hybrid model combining selective state space models (think Mamba) with sliding window attention that scales efficiently to 3.8B parameters. CREAM is a training-efficient method for extending the context length of LLMs by interpolating positional encodings and using a truncated Gaussian to prioritize middle-context information. This work demonstrates that masked discrete diffusion models, when trained with an effective recipe and a simplified objective, can substantially narrow the performance gap with autoregressive methods in language modeling. TextGrad is a framework that leverages LLMs to "backpropagate" textual feedback for optimizing building blocks (such as "tool caller", "search engine", etc.) in compound AI systems. The authors propose a transformer-based 1-dimensional tokenizer for image generation that reduces a 256x256x3 image to just 32 discrete tokens. The Self-Tuning framework improves the knowledge acquisition of LLMs from raw documents through self-teaching tasks focused on memorization, comprehension, and self-reflection. This paper proposes the dReLU activation function and an optimized training data mixture to improve activation sparsity in LLMs. Husky is an open-source language agent that learns to reason over a unified action space to tackle diverse tasks involving numerical, tabular, and knowledge-based reasoning by iterating between generating and executing actions with expert models. To address limitations of traditional alignment techniques like RLHF and DPO, the authors propose Margin-Aware Preference Optimization (MaPO) for text-to-image diffusion models, which maximizes the likelihood margin between preferred and dispreferred image sets without using a reference model. The authors propose LlamaGen, which involves applying the "next-token prediction" paradigm of large language models to image generation. This research reveals that while alignment techniques like RLHF mitigate biases in LLMs, they can diminish the models' creative capabilities, impacting syntactic and semantic diversity, which is crucial for tasks requiring creative output. This research introduces 3D-GRAND, a dataset of 40,087 household scenes paired with 6.2 million scene-language instructions, and utilizes instruction tuning and the 3D-POPE benchmark to enhance grounding capabilities and reduce hallucinations in 3D-LLMs This paper demonstrates that masked language models, like DeBERTa, can perform in-context learning using a simple inference technique that reformats the sequence of input tokens with mask tokens that resemble the structure of a causal attention mask. The authors introduce an automated evaluation framework for benchmarking LLMs using real-world user queries, featuring 1,024 tasks and two advanced metrics, WB-Reward and WB-Score, which provide reliable and interpretable automatic judgments by employing task-specific checklists and structured explanations. This research introduces a factual question answering dataset of 4,409 question-answer pairs with mock APIs simulating web and Knowledge Graph searches, designed to reflect diverse, dynamic real-world QA tasks. This research introduces C4, a communication-driven solution for parallel training of LLMs, which rapidly identifies and isolates hardware faults and optimizes traffic planning to reduce network congestion, which can cut error-induced overhead by up to 30% and improving runtime performance by up to 15%. This research introduces Step-aware Preference Optimization, a post-training approach that independently evaluates and adjusts denoising performance at each step in text-to-image diffusion models, which outperforms Diffusion-DPO in image alignment and aesthetics while offering 20x faster training efficiency. This study identifies numerous errors in the widely-used MMLU benchmark, creates a re-annotated subset called MMLU-Redux that reveals significant discrepancies in reported model performance, and advocates for revising MMLU to improve its reliability. The study analyzes information propagation in LLMs (specifically: decoder-only transformers), revealing a representational collapse phenomenon where distinct input sequences can yield arbitrarily close final token representations, leading to errors in tasks like counting or copying and loss of sensitivity to specific input tokens. This 76-page paper aims to provide a clear and organized framework for understanding prompts and prompting techniques. This Buffer of Thoughts approach improves LLMs by retrieving and instantiating thought-templates, which are generic problem-solving blueprints, for reasoning across various domains. The proposed Block Transformer improves inference throughput 10-20x by isolating expensive global attention to lower layers on fixed-size token blocks and applying fast local attention in upper layers. This paper presents a scalable MatMul-free language model architecture that replaces matrix multiplications with element-wise products and accumulations using ternary weights that works well even in billion-parameter scales. This paper reviews the recent and emerging automated alignment methods for LLMs that typically follow the instruction finetuning step in an LLM development pipeline. Using a Gemma LLM, this paper extends the linear representation hypothesis, showing that categorical concepts are simplices, hierarchical relations are orthogonal, and complex concepts are polytopes, validated with 957 WordNet concepts. OLoRA is an enhancement of Low-Rank Adaptation (LoRA) using orthonormal matrix initialization via QR decomposition, which accelerates the convergence of LLM training compared to regular LoRA. A report that describes some of the approaches and methods behind developing a 146B parameter mixture-of-experts LLM from an existing 13B parameter dense (non-mixture-of-experts) model. The proposed method aligns method aligns LLM outputs to specific user behaviors using fewer than 10 demonstrations as feedback, leveraging imitation learning.

Machine Learning

Sql

Nlp

0 views