Ahead of AI, 1 week ago

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

How do we actually evaluate LLMs? It's a simple question, but one that tends to open up a much bigger discussion. When advising or collaborating on projects, one of the things I get asked most often is how to choose between different models and how to make sense of the evaluation results out there. (And, of course, how to measure progress when fine-tuning or developing our own.) Since this comes up so often, I thought it might be helpful to share a short overview of the main evaluation methods people use to compare LLMs. Of course, LLM evaluation is a very big topic that can't be exhaustively covered in a single resource, but I think that having a clear mental map of these main approaches makes it much easier to interpret benchmarks, leaderboards, and papers.

I originally planned to include these evaluation techniques in my upcoming book, Build a Reasoning Model (From Scratch), but they ended up being a bit outside the main scope. (The book itself focuses more on verifier-based evaluation.) So I figured that sharing this as a longer article with from-scratch code examples would be nice.

In Build a Reasoning Model (From Scratch), I am taking a hands-on approach to building a reasoning LLM from scratch. If you liked "Build a Large Language Model (From Scratch)", this book is written in a similar style in terms of building everything from scratch in pure PyTorch. Reasoning is one of the most exciting and important recent advances in improving LLMs, but it's also one of the easiest to misunderstand if you only hear the term reasoning and read about it in theory. The book is currently in early access with >100 pages already online, and I have just finished another 30 pages that are currently being added by the layout team. If you joined the early access program (a big thank you for your support!), you should receive an email when those go live.

PS: There's a lot happening on the LLM research front right now. I'm still catching up on my growing list of bookmarked papers and plan to highlight some of the most interesting ones in the next article. But now, let's discuss the four main LLM evaluation methods along with their from-scratch code implementations to better understand their advantages and weaknesses.

Understanding the main evaluation methods for LLMs

There are four common ways of evaluating trained LLMs in practice: multiple choice, verifiers, leaderboards, and LLM judges, as shown in Figure 1 below. Research papers, marketing materials, technical reports, and model cards (a term for LLM-specific technical reports) often include results from two or more of these categories.

Figure 1: An overview of the 4 different evaluation methods covered in this article.

Furthermore, the four categories introduced here fall into two groups: benchmark-based evaluation and judgment-based evaluation, as shown in the figure above. (There are also other measures, such as training loss, perplexity, and rewards, but they are usually used internally during model development.) The following subsections provide brief overviews and examples of each of the four methods.

Method 1: Evaluating answer-choice accuracy

We begin with a benchmark-based method: multiple-choice question answering. Historically, one of the most widely used evaluation methods is multiple-choice benchmarks such as MMLU (short for Massive Multitask Language Understanding, https://huggingface.co/datasets/cais/mmlu). To illustrate this approach, Figure 2 shows a representative task from the MMLU dataset.
Figure 2: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

Figure 2 shows just a single example from the MMLU dataset. The complete MMLU dataset consists of 57 subjects (from high school math to biology) with about 16 thousand multiple-choice questions in total, and performance is measured in terms of accuracy (the fraction of correctly answered questions), for example 87.5% if 14,000 out of 16,000 questions are answered correctly. Multiple-choice benchmarks such as MMLU test an LLM's knowledge recall in a straightforward, quantifiable way, similar to standardized tests, many school exams, or theoretical driving tests.

Note that Figure 2 shows a simplified version of multiple-choice evaluation, where the model's predicted answer letter is compared directly to the correct one. Two other popular methods exist that involve log-probability scoring. I implemented them here on GitHub. (As this builds on the concepts explained here, I recommend checking it out after completing this article.) The following subsections illustrate how the MMLU scoring shown in Figure 2 can be implemented in code.

1.2 Loading the model

First, before we can evaluate it on MMLU, we have to load the pre-trained model. Here, we are going to use a from-scratch implementation of Qwen3 0.6B in pure PyTorch, which requires only about 1.5 GB of RAM. Note that the Qwen3 model implementation details are not important here; we simply treat it as an LLM we want to evaluate. However, if you are curious, a from-scratch implementation walkthrough can be found in my previous Understanding and Implementing Qwen3 From Scratch article, and the source code is also available here on GitHub. Instead of copying and pasting the many lines of Qwen3 source code, we import it from my reasoning_from_scratch Python library, which can be installed via pip.

1.3 Checking the generated answer letter

In this section, we implement the simplest and perhaps most intuitive MMLU scoring method, which relies on checking whether a generated multiple-choice answer letter matches the correct answer. This is similar to what was illustrated earlier in Figure 2, which is shown below again for convenience.

Figure 3: Evaluating an LLM on MMLU by comparing its multiple-choice prediction with the correct answer from the dataset.

For this, we will work with an example from the MMLU dataset. Next, we define a function to format the LLM prompts. Let's execute the function on the MMLU example to get an idea of what the formatted LLM input looks like. The output begins with the question: How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes?

The model prompt provides the model with a list of the different answer choices and ends with text that encourages the model to generate the correct answer. While it is not strictly necessary, it can sometimes also be helpful to provide additional questions along with the correct answers as input, so that the model can observe how it is expected to solve the task. (For example, cases where 5 examples are provided are also known as 5-shot MMLU.) However, for current generations of LLMs, where even the base models are quite capable, this is not required.
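To make the prompt-formatting step concrete, here is a minimal sketch of such a helper. The mmlu_example dictionary, its answer choices, and the format_prompt name are illustrative assumptions rather than the article's exact code or the dataset's exact entries:

```python
# Minimal sketch of MMLU-style prompt formatting (illustrative only).
# The answer choices below are placeholders, not necessarily the exact
# options stored in the MMLU dataset for this question.
mmlu_example = {
    "question": (
        "How many ways are there to put 4 distinguishable balls "
        "into 2 indistinguishable boxes?"
    ),
    "choices": ["5", "8", "16", "31"],
    "answer": "B",  # letter of the correct choice
}

def format_prompt(example):
    # Concatenate the question and lettered choices, ending with a cue
    # that encourages the model to emit a single answer letter.
    letters = "ABCD"
    lines = [example["question"]]
    for letter, choice in zip(letters, example["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

print(format_prompt(mmlu_example))
```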
Loading different MMLU samples

You can load examples from the MMLU dataset directly via the datasets library (which can be installed via pip). Above, we used a single subset; the library also lets you list all other available subsets.

Next, we tokenize the prompt and wrap it in a PyTorch tensor object as input to the LLM. Then, with all that setup out of the way, we define the main scoring function, which generates a few tokens (here, 8 tokens by default) and extracts the first instance of the letters A/B/C/D that the model prints. We can then check the generated letter with this function; as we can see, the generated answer is incorrect in this case.

This was just one of the 270 examples from this subset of MMLU. The screenshot (Figure 4) below shows the performance of the base model and the reasoning variant when executed on the complete subset. The code for this is available here on GitHub.

Figure 4: Base and reasoning model performance on the MMLU subset

Assuming the questions have an equal answer probability, a random guesser (with uniform probability of choosing A, B, C, or D) is expected to achieve 25% accuracy. So both the base and the reasoning model are not very good here.

Multiple-choice answer formats

Note that this section implemented a simplified version of multiple-choice evaluation for illustration purposes, where the model's predicted answer letter is compared directly to the correct one. In practice, more widely used variations exist, such as log-probability scoring, where we measure how likely the model considers each candidate answer rather than just checking the final letter choice. (We discuss probability-based scoring in chapter 4.) For reasoning models, evaluation can also involve assessing the likelihood of generating the correct answer when it is provided as input.

Figure 5: Other MMLU scoring methods are described and shared on GitHub here

However, regardless of which MMLU scoring variant we use, the evaluation still amounts to checking whether the model selects from the predefined answer options. A limitation of multiple-choice benchmarks like MMLU is that they only measure an LLM's ability to select from predefined options and are thus not very useful for evaluating reasoning capabilities, beyond checking whether and how much knowledge the model has forgotten compared to the base model. They do not capture free-form writing ability or real-world utility. Still, multiple-choice benchmarks remain simple and useful diagnostics: for example, a high MMLU score doesn't necessarily mean the model is strong in practical use, but a low score can highlight potential knowledge gaps.

Method 2: Using verifiers to check answers

Related to the multiple-choice question answering discussed in the previous section, verification-based approaches quantify an LLM's capabilities via an accuracy metric. However, in contrast to multiple-choice benchmarks, verification methods allow LLMs to provide a free-form answer. We then extract the relevant answer portion and use a so-called verifier to compare the answer portion to the correct answer provided in the dataset, as illustrated in Figure 6 below.

Figure 6: Evaluating an LLM with a verification-based method in free-form question answering. The model generates a free-form answer (which may include multiple steps) and a final boxed answer, which is extracted and compared against the correct answer from the dataset.

When we compare the extracted answer with the provided answer, as shown in the figure above, we can employ external tools, such as code interpreters or calculator-like tools/software.
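As a rough illustration of the extraction-and-comparison step (without external tools), here is a minimal sketch that assumes the model was prompted to wrap its final result in a LaTeX-style \boxed{...} expression, as in the figure above; the helper names and the example response are my own illustrations, not the book's implementation:

```python
import re

def extract_boxed_answer(generated_text):
    # Grab the content of the last \boxed{...} expression, which by
    # convention holds the model's final answer.
    matches = re.findall(r"\\boxed\{([^}]*)\}", generated_text)
    return matches[-1].strip() if matches else None

def verify_answer(generated_text, reference):
    # A toy verifier: compare numerically if possible, otherwise fall back
    # to an exact string match after stripping whitespace.
    extracted = extract_boxed_answer(generated_text)
    if extracted is None:
        return False
    try:
        return float(extracted) == float(reference)
    except ValueError:
        return extracted == str(reference).strip()

response = (
    "There are 2^4 = 16 assignments to labeled boxes; dividing by the 2 "
    "box orderings gives \\boxed{8}."
)
print(verify_answer(response, "8"))  # True
```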
The downside is that this method can only be applied to domains that can be easily (and ideally deterministically) verified, such as math and code. Also, this approach can introduce additional complexity and dependencies, and it may shift part of the evaluation burden from the model itself to the external tool. However, because it allows us to generate an unlimited number of math problem variations programmatically and benefits from step-by-step reasoning, it has become a cornerstone of reasoning model evaluation and development.

I wrote a comprehensive 35-page chapter on this topic in my "Build a Reasoning Model (From Scratch)" book, so I am skipping the code implementation here. (I submitted the chapter last week. If you have the early access version, you'll receive an email when it goes live and will be able to read it then. In the meantime, you can find the step-by-step code here on GitHub.)

Figure 7: Excerpt from the verification-based evaluation approach available here on GitHub

Method 3: Comparing models using preferences and leaderboards

So far, we have covered two methods that offer easily quantifiable metrics such as model accuracy. However, neither of the aforementioned methods evaluates LLMs in a more holistic way, including judging the style of the responses. In this section, as illustrated in Figure 8 below, we discuss a judgment-based method, namely, LLM leaderboards.

Figure 8: A mental model of the topics covered in this book with a focus on the judgment- and benchmark-based evaluation methods covered in this appendix.

Having already covered benchmark-based approaches (multiple choice, verifiers) in the previous sections, we now introduce judgment-based approaches to measure LLM performance, with this subsection focusing on leaderboards. The leaderboard method described here is a judgment-based approach where models are ranked not by accuracy values or other fixed benchmark scores but by user (or other LLM) preferences on their outputs. A popular leaderboard is LM Arena (formerly Chatbot Arena), where users compare responses from two user-selected or anonymous models and vote for the one they prefer, as shown in Figure 9.

Figure 9: Example of a judgment-based leaderboard interface (LM Arena). Two LLMs are given the same prompt, their responses are shown side by side, and users vote for the preferred answer.

These preference votes, which are collected as shown in the figure above, are then aggregated across all users into a leaderboard that ranks different models by user preference. A current snapshot of the LM Arena leaderboard (accessed on October 3, 2025) is shown below in Figure 10.

Figure 10: Screenshot of the LM Arena leaderboard that shows the current leading LLMs based on user preferences on text tasks

In the remainder of this section, we will implement a simple example of a leaderboard. To create a concrete example, consider users prompting different LLMs in a setup similar to Figure 9 and collecting the results as a list of pairwise votes, where the first model in each pair is the winner. Each tuple in this votes list represents a pairwise preference between two models, written as a (winner, loser) pair; a pair such as (GPT-5, Claude-3), for example, means that a user preferred a GPT-5 answer over a Claude-3 answer.

Next, we will turn this list into a leaderboard. For this, we will use the popular Elo rating system, which was originally developed for ranking chess players. Before we look at the concrete code implementation, here is, in short, how it works: each model starts with a baseline score, and after each comparison and preference vote, the model's rating is updated.
(In Elo, the update magnitude depends on how surprising the outcome is.) Specifically, if a user prefers a current model over a highly ranked model, the current model will get a relatively large ranking update and rank higher in the leaderboard. Vice versa, if it wins against a low-ranked opponent, the update is smaller. (And if the current model loses, it is updated in a similar fashion, but with ranking points getting subtracted instead of added.)

The code to turn these pairwise rankings into a leaderboard follows this recipe; a minimal sketch is shown at the end of this section. The function takes the votes as input and turns them into a leaderboard ranking, where the higher the score, the better.

So, how does this work? For each pair, we compute the expected score of the winner. In the standard Elo formulation, this is

expected_winner = 1 / (1 + 10^((rating_loser - rating_winner) / 400))

This value is the model's predicted chance to win in a no-draw setting based on the current ratings. It determines how large the rating update is: the winner gains the K-factor times (1 - expected_winner), and the loser loses the same amount. First, each model starts at the same baseline rating. If the two ratings (winner and loser) are equal, we have expected_winner = 0.5, which indicates an even match. In this case, the winner gains, and the loser loses, half of the K-factor. Now, if a heavy favorite (a model with a high rating) wins, we have expected_winner close to 1. The favorite gains only a small amount, and the loser loses only a little. However, if an underdog (a model with a low rating) wins, we have expected_winner close to 0, and the winner gets almost the full K points while the loser loses about the same magnitude.

Order matters

The Elo approach updates ratings after each match (model comparison), so later results build on ratings that have already been updated. This means the same set of outcomes, when presented in a different order, can end with slightly different final scores. This effect is usually mild, but it can be noticeable, especially when an upset happens early versus late. To reduce this order effect, we can shuffle the vote pairs, run the function multiple times, and average the ratings.

Leaderboard approaches such as the one described above provide a more dynamic view of model quality than static benchmark scores. However, the results can be influenced by user demographics, prompt selection, and voting biases. Benchmarks and leaderboards can also be gamed, and users may select responses based on style rather than correctness. Finally, compared to automated benchmark harnesses, leaderboards do not provide instant feedback on newly developed variants, which makes them harder to use during active model development.

Other ranking methods

The LM Arena originally used the Elo method described in this section but recently transitioned to a statistical approach based on the Bradley-Terry model. The main advantage of the Bradley-Terry model is that, being statistically grounded, it allows the construction of confidence intervals to express uncertainty in the rankings. Also, in contrast to Elo ratings, the Bradley-Terry model estimates all ratings jointly using a statistical fit over the entire dataset, which makes it immune to order effects. To keep the reported scores in a familiar range, the Bradley-Terry model is fitted to produce values comparable to Elo. Even though the leaderboard no longer officially uses Elo ratings, the term "Elo" remains widely used by LLM researchers and practitioners when comparing models. A code example showing the Elo rating is available here on GitHub.

Figure 11: A comparison of Elo and Bradley-Terry rankings; the source code is available here on GitHub.
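To make the Elo-style update concrete, here is a minimal sketch of such a leaderboard function. The vote list, the starting rating of 1,000, and the K-factor of 32 are common defaults that I am assuming here, and the model names are illustrative; none of this is necessarily the article's exact implementation:

```python
def elo_leaderboard(votes, k=32, start_rating=1000):
    # votes: list of (winner, loser) tuples from pairwise preference votes.
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, start_rating)
        r_l = ratings.setdefault(loser, start_rating)
        # Expected score of the winner given the current ratings.
        expected_w = 1 / (1 + 10 ** ((r_l - r_w) / 400))
        # The less expected the win, the larger the update.
        ratings[winner] = r_w + k * (1 - expected_w)
        ratings[loser] = r_l - k * (1 - expected_w)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

votes = [
    ("GPT-5", "Claude-3"),   # first entry in each pair is the preferred model
    ("Claude-3", "Llama-4"),
    ("GPT-5", "Llama-4"),
]
for model, rating in elo_leaderboard(votes):
    print(f"{model:10s} {rating:7.1f}")
```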
Method 4: Judging responses with other LLMs

In the early days, LLMs were evaluated using statistical and heuristics-based methods, including a measure called BLEU, which is a crude measure of how well generated text matches reference text. The problem with such metrics is that they require exact word matches and don't account for synonyms, reworded phrases, and so on. One solution to this problem, if we want to judge the written answer text as a whole, is to use relative rankings and leaderboard-based approaches as discussed in the previous section. However, a downside of leaderboards is the subjective nature of the preference-based comparisons, as they involve human feedback (along with the challenges associated with collecting this feedback).

A related method is to use another LLM with a pre-defined grading rubric (i.e., an evaluation guide) to compare an LLM's response to a reference response and judge the response quality based on that rubric, as illustrated in Figure 12.

Figure 12: Example of an LLM-judge evaluation. The model to be evaluated generates an answer, which is then scored by a separate judge LLM according to a rubric and a provided reference answer.

In practice, the judge-based approach shown in Figure 12 works well when the judge LLM is strong. Common setups use leading proprietary LLMs via an API (e.g., the GPT-5 API), though specialized judge models also exist. (One of the many examples is Phudge; ultimately, most of these specialized models are just smaller models fine-tuned to have similar scoring behavior as proprietary GPT models.) One of the reasons why judges work so well is that evaluating an answer is often easier than generating one.

To implement a judge-based model evaluation as shown in Figure 12 programmatically in Python, we could load one of the larger Qwen3 models in PyTorch and prompt it with a grading rubric and the model answer we want to evaluate. Alternatively, we can use other LLMs through an API, for example the ChatGPT or Ollama API. As we already know how to load Qwen3 models in PyTorch, to make it more interesting, in the remainder of the section we will implement the judge-based evaluation shown in Figure 12 using the Ollama API in Python. Specifically, we will use the 20-billion-parameter gpt-oss open-weight model by OpenAI, as it offers a good balance between capabilities and efficiency. For more information about gpt-oss, please see my From GPT-2 to gpt-oss: Analyzing the Architectural Advances article.

4.1 Implementing an LLM-as-a-judge approach in Ollama

Ollama is an efficient open-source application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library, which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text with LLMs (inference) and does not support training or fine-tuning LLMs.

To execute the following code, please install Ollama by visiting the official website at https://ollama.com and following the provided instructions for your operating system:

- For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command-line usage, select "yes."
- For Linux users: Use the installation command available on the Ollama website.

Before implementing the model evaluation code, let's first download the gpt-oss model and verify that Ollama is functioning correctly by using it from the command-line terminal.
Execute the following command on the command line (not in a Python session) to try out the 20-billion-parameter gpt-oss model: ollama run gpt-oss:20b

The first time you execute this command, the 20-billion-parameter gpt-oss model, which takes up 14 GB of storage space, will be automatically downloaded. Note that the gpt-oss:20b in the ollama run gpt-oss:20b command refers to the 20-billion-parameter gpt-oss model. Using Ollama with the gpt-oss:20b model requires approximately 13 GB of RAM. If your machine does not have sufficient RAM, you can try using a smaller model, such as the 4-billion-parameter qwen3:4b model via ollama run qwen3:4b, which only requires around 4 GB of RAM. For more powerful computers, you can also use the larger 120-billion-parameter gpt-oss model by replacing gpt-oss:20b with gpt-oss:120b. However, keep in mind that this model requires significantly more computational resources.

Once the model download is complete, we are presented with a command-line interface that allows us to interact with the model. For example, try asking the model, "What is 1+2?". You can end this ollama run gpt-oss:20b session using the input /bye.

In the remainder of this section, we will use the Ollama API. This approach requires that Ollama is running in the background. There are three different options to achieve this:

1. Run ollama serve in the terminal (recommended). This runs the Ollama backend as a server, usually on http://localhost:11434. Note that it doesn't load a model until it's called through the API (later in this section).
2. Run the ollama run gpt-oss:20b command as earlier, but keep it open and don't exit the session via /bye. As discussed earlier, this opens a minimal convenience wrapper around a local Ollama server. Behind the scenes, it uses the same server API as ollama serve.
3. Use the Ollama desktop app. Opening the desktop app runs the same backend automatically and provides a graphical interface on top of it, as shown in Figure 13.

Figure 13: Two different options to keep the Ollama server (or application) running so we can use it via the Ollama API in Python.

Ollama runs locally on our machine by starting a local server-like process. When running ollama serve in the terminal, as described above, you may encounter an error message saying that the address is already in use. If that's the case, try starting the server on a different address (and if that address is also in use, increment the port number by one until you find one that is free).

Before we use Ollama to evaluate the responses generated by our model, we first verify that the Ollama session is running properly. Ensure that the check reports that Ollama is running; if it does not, please verify that ollama serve (or the Ollama application) is actively running (see Figure 13).

In the remainder of this article, we will interact with the local gpt-oss model, running on our machine, through the Ollama REST API using Python; a function for this is sketched below. For example, asking the model "What is 1+2?" through the API returns the response "3". (It may differ from what we'd get if we ran ollama run or the Ollama application due to different default settings.) Using the function, we can evaluate the responses generated by our model with a prompt that includes a grading rubric asking the gpt-oss model to rate our target model's responses on a scale from 1 to 5 based on a correct answer as a reference.
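Since the original code listing is not reproduced here, the following is a minimal sketch of such a query helper. It assumes Ollama's default local REST endpoint at http://localhost:11434 and its /api/chat route; the function name and the temperature/seed settings are my own choices, not necessarily the article's:

```python
import json
import urllib.request

def query_ollama(prompt, model="gpt-oss:20b",
                 url="http://localhost:11434/api/chat"):
    # Send a single-turn chat request to the local Ollama server and
    # return the assistant's reply as a string.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Deterministic-ish settings so repeated runs give similar scores.
        "options": {"temperature": 0.0, "seed": 123},
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result["message"]["content"]

print(query_ollama("What is 1+2?"))  # expected reply: "3" (or similar)
```

The same helper can then be reused with the rubric-based grading prompt discussed next.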
The prompt we use for this includes a hardcoded model answer that is intended to represent the response produced by our own model in practice. For illustration purposes, we hardcode a plausible model answer here rather than generating it dynamically. (However, feel free to use the Qwen3 model we loaded at the beginning of this article to generate a real model answer.) Next, we generate the rendered prompt for the Ollama model. Ending the prompt with a cue for the score incentivizes the model to generate the answer in the expected format. Let's see how the gpt-oss:20b model judges the response: as we can see from its reply, the answer receives the highest score, which is reasonable, as it is indeed correct.

While this was a simple example stepping through the process manually, we could take this idea further and implement a for-loop that iteratively queries the model (for example, the Qwen3 model we loaded earlier) with questions from an evaluation dataset, evaluates the answers via gpt-oss, and calculates the average score. You can find an implementation of such a script, where we evaluate the Qwen3 model on the MATH-500 dataset, here on GitHub.

Figure 14: A comparison of the Qwen3 0.6B base and reasoning variants on the first 10 examples in MATH-500, evaluated by gpt-oss:20b as a judge. You can find the code here on GitHub.

Related to symbolic verifiers and LLM judges, there is a class of learned models called process reward models (PRMs). Like judges, PRMs can evaluate reasoning traces beyond just the final answer, but unlike general judges, they focus specifically on the intermediate steps of reasoning. And unlike verifiers, which check correctness symbolically and usually only at the outcome level, PRMs provide step-by-step reward signals during training in reinforcement learning. We can categorize PRMs as "step-level judges," which are predominantly developed for training, not pure evaluation. (In practice, PRMs are difficult to train reliably at scale. For example, DeepSeek R1 did not adopt PRMs and instead relied on verifiers for the reasoning training.)

Judge-based evaluations offer advantages over preference-based leaderboards, including scalability and consistency, as they do not rely on large pools of human voters. (Technically, it is possible to outsource the preference-based rating behind leaderboards to LLM judges as well.) However, LLM judges also share similar weaknesses with human voters: results can be biased by model preferences, prompt design, and answer style. Also, there is a strong dependency on the choice of judge model and rubric, and they lack the reproducibility of fixed benchmarks.

In this article, we covered four different evaluation approaches: multiple choice, verifiers, leaderboards, and LLM judges. I know this was a long article, but I hope you found it useful for getting an overview of how LLMs are evaluated. A from-scratch approach like this can be verbose, but it is a great way to understand how these methods work under the hood, which in turn helps us identify weaknesses and areas for improvement. That being said, you are probably wondering, "What is the best way to evaluate an LLM?" Unfortunately, there is no single best method since, as we have seen, each comes with different trade-offs.
In short:

Multiple-choice
(+) Relatively quick and cheap to run at scale
(+) Standardized and reproducible across papers (or model cards)
(-) Measures basic knowledge recall
(-) Does not reflect how LLMs are used in the real world

Verifiers
(+) Standardized, objective grading for domains with ground truth
(+) Allows free-form answers (with some constraints on final answer formatting)
(+) Can also score intermediate steps if using process verifiers or process reward models
(-) Requires verifiable domains (for example, math or code), and building good verifiers can be tricky
(-) Outcome-only verifiers evaluate only the final answer, not reasoning quality

Arena-style leaderboards (human pairwise preference)
(+) Directly answers "Which model do people prefer?" on real prompts
(+) Allows free-form answers and implicitly accounts for style, helpfulness, and safety
(-) Expensive and time-intensive for humans
(-) Does not measure correctness, only preference
(-) Nonstationary populations can affect stability

LLM-as-a-judge
(+) Scalable across many tasks
(+) Allows free-form answers
(-) Dependent on the judge's capability (ensembles can make this more robust)
(-) Depends on rubric choice

While I am usually not a big fan of radar plots, one can be helpful here to visualize these different evaluation areas, as shown below.

Figure 15: A radar chart showing conceptually that we ideally want to pay attention to different areas when evaluating an LLM to identify its strengths and weaknesses.

For instance, a strong multiple-choice rating suggests that the model has solid general knowledge. Combine that with a strong verifier score, and the model is likely also answering technical questions correctly. However, if the model performs poorly on LLM-as-a-judge and leaderboard evaluations, it may struggle to write or articulate responses effectively and could benefit from some RLHF.

So, the best evaluation combines multiple areas. But ideally, it also uses data that directly aligns with your goals or business problems. For example, suppose you are implementing an LLM to assist with legal or law-related tasks. It makes sense to run the model on standard benchmarks like MMLU as a quick sanity check, but ultimately you will want to tailor the evaluations to your target domain, such as law. You can find public benchmarks online that serve as good starting points, but in the end, you will want to test with your own proprietary data. Only then can you be reasonably confident that the model has not already seen the test data during training.

In any case, model evaluation is a very big and important topic. I hope this article was useful in explaining how the main approaches work, and that you took away a few useful insights for the next time you look at model evaluations or run them yourself. As always, happy tinkering!

This magazine is a personal passion project, and your support helps keep it alive. If you'd like to support my work, please consider my Build a Large Language Model (From Scratch) book or its follow-up, Build a Reasoning Model (From Scratch). (I'm confident you'll get a lot out of these; they explain how LLMs work at a depth you won't find elsewhere.) Thanks for reading, and for helping support independent research!

Build a Large Language Model (From Scratch) is now available on Amazon. Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot!
Your support means a great deal! Thank you!


The RAG Obituary: Killed by Agents, Buried by Context Windows

I've been working in AI and search for a decade: first building Doctrine, the largest European legal search engine, and now building Fintool, an AI-powered financial research platform that helps institutional investors analyze companies, screen stocks, and make investment decisions. After three years of building, optimizing, and scaling LLMs with retrieval-augmented generation (RAG) systems, I believe we're witnessing the twilight of RAG-based architectures. As context windows explode and agent-based architectures mature, my controversial opinion is that the current RAG infrastructure we spent so much time building and optimizing is on the decline.

In late 2022, ChatGPT took the world by storm. People started endless conversations, delegating crucial work, only to realize that the underlying model, GPT-3.5, could only handle 4,096 tokens... roughly six pages of text! The AI world faced a fundamental problem: how do you make an intelligent system work with knowledge bases that are orders of magnitude larger than what it can read at once? The answer became Retrieval-Augmented Generation (RAG), an architectural pattern that would dominate AI for the next three years.

GPT-3.5 could handle 4,096 tokens, and the next model, GPT-4, doubled it to 8,192 tokens, about twelve pages. This wasn't just inconvenient; it was architecturally devastating. Consider the numbers: a single SEC 10-K filing contains approximately 51,000 tokens (130+ pages). With 8,192 tokens, you could see less than 16% of a 10-K filing. It's like reading a financial report through a keyhole!

RAG emerged as an elegant solution borrowed directly from search engines. Just as Google displays 10 blue links with relevant snippets for your query, RAG retrieves the most pertinent document fragments and feeds them to the LLM for synthesis. The core idea is beautifully simple: if you can't fit everything in context, find the most relevant pieces and use those. It turns LLMs into sophisticated search-result summarizers. Basically, LLMs can't read the whole book, but they can know who dies at the end; convenient!

Long documents need to be chunked into pieces, and that's when the problems start. Those digestible pieces are typically 400-1,000 tokens each, which is roughly 300-750 words. The problem? It isn't as simple as cutting every 500 words. Consider chunking a typical SEC 10-K annual report. The document has a complex hierarchical structure:

- Item 1: Business Overview (10-15 pages)
- Item 1A: Risk Factors (20-30 pages)
- Item 7: Management's Discussion and Analysis (30-40 pages)
- Item 8: Financial Statements (40-50 pages)

After naive chunking at 500 tokens, critical information gets scattered:

- Revenue recognition policies split across 3 chunks
- A risk factor explanation broken mid-sentence
- Financial table headers separated from their data
- MD&A narrative divorced from the numbers it's discussing

If you search for "revenue growth drivers," you might get a chunk mentioning growth but miss the actual numerical data in a different chunk, or the strategic context from MD&A in yet another chunk!
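To see why this happens, here is a minimal sketch of naive fixed-size chunking. The whitespace "tokenization" and the toy filing text are simplifications of what a real pipeline (with model tokenizers and actual 10-K text) would use:

```python
def naive_chunk(text, chunk_size=500, overlap=50):
    # Crude whitespace "tokenization": real pipelines count model tokens,
    # but the failure mode is the same -- boundaries ignore document structure.
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

filing_text = (
    "Item 7. Management's Discussion and Analysis. Revenue increased 12% "
    "driven by subscription growth... (imagine 130+ pages here) ... "
    "Item 8. Financial Statements. Revenue: $45.2M ..."
)
chunks = naive_chunk(filing_text, chunk_size=40, overlap=5)
print(len(chunks), "chunks; a table header and its data can easily land in different chunks")
```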
At Fintool, we've developed sophisticated chunking strategies that go beyond naive text splitting:

- Hierarchical Structure Preservation: We maintain the nested structure from Item 1 (Business) down to sub-sections like geographic segments, creating a tree-like document representation
- Table Integrity: Financial tables are never split—income statements, balance sheets, and cash flow statements remain atomic units with headers and data together
- Cross-Reference Preservation: We maintain links between narrative sections and their corresponding financial data, preserving the "See Note X" relationships
- Temporal Coherence: Year-over-year comparisons and multi-period analyses stay together as single chunks
- Footnote Association: Footnotes remain connected to their referenced items through metadata linking

Each chunk at Fintool is enriched with extensive metadata:

- Filing type (10-K, 10-Q, 8-K)
- Fiscal period and reporting date
- Section hierarchy (Item 7 > Liquidity > Cash Position)
- Table identifiers and types
- Cross-reference mappings
- Company identifiers (CIK, ticker)
- Industry classification codes

This allows for more accurate retrieval, but even our intelligent chunking can't solve the fundamental problem: we're still working with fragments instead of complete documents!

Once you have the chunks, you need a way to search them. One way is to embed your chunks. Each chunk is converted into a high-dimensional vector (typically 1,536 dimensions in most embedding models). These vectors live in a space where, theoretically, similar concepts are close together. When a user asks a question, that question also becomes a vector. The system finds the chunks whose vectors are closest to the query vector using cosine similarity. It's elegant in theory; in practice, it's a nightmare of edge cases.

Embedding models are trained on general text and struggle with specific terminologies. They find similarities, but they can't distinguish between "revenue recognition" (accounting policy) and "revenue growth" (business performance). Consider this example:

Query: "What is the company's litigation exposure?"

RAG searches for "litigation" and returns 50 chunks:

- Chunks 1-10: Various mentions of "litigation" in boilerplate risk factors
- Chunks 11-20: Historical cases from 2019 (already settled)
- Chunks 21-30: Forward-looking safe harbor statements
- Chunks 31-40: Duplicate descriptions from different sections
- Chunks 41-50: Generic "we may face litigation" warnings

What RAG reports: $500M in litigation (from the Legal Proceedings section)

What's actually there:

- $500M in Legal Proceedings (Item 3)
- $700M in the Contingencies note ("not material individually")
- $1B new class action in Subsequent Events
- $800M indemnification obligations (different section)
- $2B probable losses in footnotes (keyword "probable," not "litigation")

The actual exposure is $5.0B, 10x what RAG found. Oops!

By late 2023, most builders realized pure vector search wasn't enough. Enter hybrid search: combine semantic search (embeddings) with traditional keyword search (BM25). This is where things get interesting.
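Before turning to BM25, here is a minimal sketch of the embedding-retrieval mechanics described above. The embed() function is a stand-in for a real embedding model (it just hashes text into a random 1,536-dimensional vector), so the similarity scores are meaningless and only the ranking mechanics are illustrated:

```python
import numpy as np

def embed(text):
    # Placeholder for a real embedding model (e.g., one returning
    # 1,536-dimensional vectors); here we hash text into a toy random vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(1536)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, chunks, top_k=5):
    # Rank chunks by cosine similarity between the query and chunk vectors.
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

chunks = [
    "Revenue recognition policy ...",
    "Revenue grew 12% year over year driven by subscriptions ...",
    "Litigation contingencies and indemnification obligations ...",
]
for score, chunk in retrieve("What drove revenue growth?", chunks, top_k=2):
    print(f"{score:+.3f}  {chunk}")
```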
Unlike embeddings, BM25:

- Rewards Exact Matches: When you search for “EBITDA,” you get documents with “EBITDA,” not “operating income” or “earnings”
- Handles Rare Terms Better: Financial jargon like “CECL” (Current Expected Credit Losses) or “ASC 606” gets proper weight
- Document Length Normalization: Doesn’t penalize longer documents
- Term Frequency Saturation: Multiple mentions of “revenue” don’t overshadow other important terms

At Fintool, we’ve built a sophisticated hybrid search system:

1. Parallel Processing: We run semantic and keyword searches simultaneously
2. Dynamic Weighting: Our system adjusts weights based on query characteristics:
   - Specific financial metrics? BM25 gets 70% weight
   - Conceptual questions? Embeddings get 60% weight
   - Mixed queries? 50/50 split with result analysis
3. Score Normalization: Different scoring scales are normalized using:
   - Min-max scaling for BM25 scores
   - Cosine similarity, already normalized, for embeddings
   - Z-score normalization for outlier handling

So in the end, the embedding search and the keyword search each retrieve chunks, and the search engine combines them using Reciprocal Rank Fusion (RRF). RRF merges rankings so items that consistently appear near the top across systems float higher, even if no system put them at #1! (A minimal sketch of RRF appears below.)

So now you think you’re done, right? But hell no! Here’s what nobody talks about: even after all that retrieval work, you’re not done. You need to rerank the chunks one more time to get good retrieval, and it’s not easy. Rerankers are ML models that take the search results and reorder them by relevance to your specific query, limiting the number of chunks sent to the LLM. Not only are LLMs context poor, they also struggle when dealing with too much information. It’s vital to reduce the number of chunks sent to the LLM for the final answer.

The reranking pipeline:

1. Initial search retrieval with embeddings + keywords gets you 100-200 chunks
2. The reranker ranks the top 10
3. The top 10 are fed to the LLM to answer the question

Here is the challenge with reranking:

- Latency Explosion: Reranking adds between 300-2,000ms per query. Ouch.
- Cost Multiplication: It adds significant extra cost to every query. For instance, Cohere Rerank 3.5 costs $2.00 per 1,000 search units, making reranking expensive.
- Context Limits: Rerankers typically handle few chunks (Cohere Rerank supports only 4,096 tokens), so if you need to re-rank more than that, you have to split the work into parallel API calls and merge the results!
- Another Model to Manage: One more API, one more failure point

Reranking is one more step in an already complex pipeline. What I find difficult with RAG is what I call the “cascading failure problem”:

1. Chunking can fail (split tables) or be too slow (especially when you have to ingest and chunk gigabytes of data in real time)
2. Embedding can fail (wrong similarity)
3. BM25 can fail (term mismatch)
4. Hybrid fusion can fail (bad weights)
5. Reranking can fail (wrong priorities)

Each stage compounds the errors of the previous stage. Beyond the complexity of hybrid search itself, there’s an infrastructure burden that’s rarely discussed. Running production Elasticsearch is not easy. You’re looking at maintaining TB+ of indexed data for comprehensive document coverage, which requires 128-256GB of RAM minimum just to get decent performance. The real nightmare comes with re-indexing: every schema change forces a full re-indexing that takes 48-72 hours for large datasets.
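Here is a minimal sketch of the Reciprocal Rank Fusion step described above, fusing a BM25 ranking with an embedding ranking. The constant k=60 is the value commonly used in the RRF literature, and the chunk IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each item gets a score of 1 / (k + rank) from every list it appears in,
    so chunks that rank reasonably well everywhere beat chunks that rank
    first in one system and nowhere in the others.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical output from the two retrievers:
bm25_ranking = ["chunk_12", "chunk_03", "chunk_44", "chunk_07"]
embedding_ranking = ["chunk_03", "chunk_44", "chunk_12", "chunk_91"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused[:3])  # chunk_03 and chunk_12 float to the top
```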
On top of that, you’re constantly dealing with cluster management, sharding strategies, index optimization, cache tuning, backup and disaster recovery, and version upgrades that regularly include breaking changes.

Here are some structural limitations:

1. Context Fragmentation
   - Long documents are interconnected webs, not independent paragraphs
   - A single question might require information from 20+ documents
   - Chunking destroys these relationships permanently
2. Semantic Search Fails on Numbers
   - “$45.2M” and “$45,200,000” have different embeddings
   - “Revenue increased 10%” and “Revenue grew by a tenth” rank differently
   - Tables full of numbers have poor semantic representations
3. No Causal Understanding
   - RAG can’t follow “See Note 12” → Note 12 → Schedule K
   - Can’t understand that discontinued operations affect continuing operations
   - Can’t trace how one financial item impacts another
4. The Vocabulary Mismatch Problem
   - Companies use different terms for the same concept
   - “Adjusted EBITDA” vs “Operating Income Before Special Items”
   - RAG retrieves based on terms, not concepts
5. Temporal Blindness
   - Can’t distinguish Q3 2024 from Q3 2023 reliably
   - Mixes current period with prior period comparisons
   - No understanding of fiscal year boundaries

These aren’t minor issues. They’re fundamental limitations of the retrieval paradigm.

Three months ago, I stumbled on an innovation in retrieval that blew my mind. In May 2025, Anthropic released Claude Code, an AI coding agent that works in the terminal. At first, I was surprised by the form factor. A terminal? Are we back in 1980? No UI? Back then, I was using Cursor, a product that excelled at traditional RAG. I gave it access to my codebase to embed my files, and Cursor ran a search on my codebase before answering my query. Life was good. But when testing Claude Code, one thing stood out: it was better and faster, not because its RAG was better but because there was no RAG. Instead of a complex pipeline of chunking, embedding, and searching, Claude Code uses direct filesystem tools:

1. Grep (Ripgrep)
   - Lightning-fast regex search through file contents
   - No indexing required. It searches live files instantly
   - Full regex support for precise pattern matching
   - Can filter by file type or use glob patterns
   - Returns exact matches with context lines
2. Glob
   - Direct file discovery by name patterns
   - Finds files like `**/*.py` or `src/**/*.ts` instantly
   - Returns files sorted by modification time (recency bias)
   - Zero overhead—just filesystem traversal
3. Task Agents
   - Autonomous multi-step exploration
   - Handle complex queries requiring investigation
   - Combine multiple search strategies adaptively
   - Build understanding incrementally
   - Self-correct based on findings

By the way, Grep was invented in 1973. It’s so... primitive. And that’s the genius of it. Claude Code doesn’t retrieve. It investigates:

- Runs multiple searches in parallel (Grep + Glob simultaneously)
- Starts broad, then narrows based on discoveries
- Follows references and dependencies naturally
- No embeddings, no similarity scores, no reranking

It’s simple, it’s fast, and it’s based on a new assumption: that LLMs will go from context poor to context rich. Claude Code proved that with sufficient context and intelligent navigation, you don’t need RAG at all.
The agent can:

- Load entire files or modules directly
- Follow cross-references in real-time
- Understand structure and relationships
- Maintain complete context throughout the investigation

This isn’t just better than RAG—it’s a fundamentally different paradigm. And what works for code can work for any long documents that are not coding files. The context window explosion made Claude Code possible:

2022-2025, the context-poor era:
- GPT-4: 8K tokens (~12 pages)
- GPT-4-32k: 32K tokens (~50 pages)

2025 and beyond, the context revolution:
- Claude Sonnet 4: 200K tokens (~700 pages)
- Gemini 2.5: 1M tokens (~3,000 pages)
- Grok 4-fast: 2M tokens (~6,000 pages)

At 2M tokens, you can fit an entire year of SEC filings for most companies. The trajectory is even more dramatic: we’re likely heading toward 10M+ context windows by 2027, with Sam Altman hinting at billions of context tokens on the horizon. This represents a fundamental shift in how AI systems process information. Equally important, attention mechanisms are rapidly improving—LLMs are becoming far better at maintaining coherence and focus across massive context windows without getting “lost” in the noise.

Claude Code demonstrated that with enough context, search becomes navigation:

- No need to retrieve fragments when you can load complete files
- No need for similarity when you can use exact matches
- No need for reranking when you follow logical paths
- No need for embeddings when you have direct access

It’s mind-blowing. LLMs are getting really good at agentic behaviors, meaning they can organize their work into tasks to accomplish an objective. Here’s what tools like ripgrep bring to the search table:

- No Setup: No index. No overhead. Just point and search.
- Instant Availability: New documents are searchable the moment they hit the filesystem (no indexing latency!)
- Zero Maintenance: No clusters to manage, no indices to optimize, no RAM to provision
- Blazing Fast: For a 100K-line codebase, Elasticsearch needs minutes to index. Ripgrep searches it in milliseconds with zero prep.
- Cost: $0 infrastructure cost vs a lot of $$$ for Elasticsearch

So back to our previous example on SEC filings. An agent can understand SEC filing structure intrinsically:

- Hierarchical Awareness: Knows that Item 1A (Risk Factors) relates to Item 7 (MD&A)
- Cross-Reference Following: Automatically traces “See Note 12” references
- Multi-Document Coordination: Connects 10-K, 10-Q, 8-K, and proxy statements
- Temporal Analysis: Compares year-over-year changes systematically

For searches across thousands of companies or decades of filings, an agent might still use hybrid search, but now as a tool among others:

- Initial broad search using hybrid retrieval
- Agent loads full documents for top results
- Deep analysis within full context
- Iterative refinement based on findings

My guess is that traditional RAG is now just one search tool among others, and that agents will always prefer grep and reading the whole file because they are context rich and can handle long-running tasks.
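As a rough illustration of the “point and search” pattern described above, here is a minimal grep-style tool an agent could call, implemented with Python’s standard library rather than ripgrep; the filing paths are hypothetical.

```python
import re
from pathlib import Path

def grep(pattern: str, root: str, glob: str = "**/*.txt") -> list[tuple[str, int, str]]:
    """Search live files under `root` for a regex. No index, no embeddings:
    results are available the moment a file lands on the filesystem."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in Path(root).glob(glob):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if regex.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits

def read_file(path: str) -> str:
    """Load a complete document into context instead of a 500-token fragment."""
    return Path(path).read_text(errors="ignore")

# A hypothetical agent step: find every lease reference, then load the full note.
# hits = grep(r"lease|Note 12", root="filings/ACME/10-K")
# context = read_file("filings/ACME/10-K/notes.txt")
```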
Consider our $6.5B lease obligation question as an example:

- Step 1: Find “lease” in the main financial statements → Discovers “See Note 12”
- Step 2: Navigate to Note 12 → Finds “excluding discontinued operations (Note 23)”
- Step 3: Check Note 23 → Discovers $2B in additional obligations
- Step 4: Cross-reference with MD&A → Identifies management’s explanation and adjustments
- Step 5: Search for “subsequent events” → Finds a post-balance-sheet $500M lease termination

Final answer: $5B continuing + $2B discontinued - $500M terminated = $6.5B

The agent follows references like a human analyst would. No chunks. No embeddings. No reranking. Just intelligent navigation.

Basically, RAG is like a research assistant with perfect memory but no understanding:

- “Here are 50 passages that mention debt”
- Can’t tell you if debt is increasing or why
- Can’t connect debt to strategic changes
- Can’t identify hidden obligations
- Just retrieves text, doesn’t comprehend relationships

Agentic search is like a forensic accountant:

- Follows the money systematically
- Understands accounting relationships (assets = liabilities + equity)
- Identifies what’s missing or hidden
- Connects dots across time periods and documents
- Challenges management assertions with data

Several trends make this shift hard to reverse:

1. Increasing Document Complexity
   - Documents are becoming longer and more interconnected
   - Cross-references and external links are proliferating
   - Multiple related documents need to be understood together
   - Systems must follow complex trails of information
2. Structured Data Integration
   - More documents combine structured and unstructured data
   - Tables, narratives, and metadata must be understood together
   - Relationships matter more than isolated facts
   - Context determines meaning
3. Real-Time Requirements
   - Information needs instant processing
   - No time for re-indexing or embedding updates
   - Dynamic document structures require adaptive approaches
   - Live data demands live search
4. Cross-Document Understanding
   - Modern analysis requires connecting multiple sources: primary documents, supporting materials, historical versions, and related filings
   - RAG treats each document independently; agentic search builds cumulative understanding
5. Precision Over Similarity
   - Exact information matters more than similar content
   - Following references beats finding related text
   - Structure and hierarchy provide crucial context
   - Navigation beats retrieval

The evidence is becoming clear. While RAG served us well in the context-poor era, agentic search represents a fundamental evolution. The potential benefits of agentic search are compelling:

- Elimination of hallucinations from missing context
- Complete answers instead of fragments
- Faster insights through parallel exploration
- Higher accuracy through systematic navigation
- Massive infrastructure cost reduction
- Zero index maintenance overhead

The key insight? Complex document analysis—whether code, financial filings, or legal contracts—isn’t about finding similar text. It’s about understanding relationships, following references, and maintaining precision. The combination of large context windows and intelligent navigation delivers what retrieval alone never could.

RAG was a clever workaround for a context-poor era. It helped us bridge the gap between tiny windows and massive documents, but it was always a band-aid. The future won’t be about splitting documents into fragments and juggling embeddings. It will be about agents that can navigate, reason, and hold entire corpora in working memory.
We are entering the post-retrieval age. The winners will not be the ones who maintain the biggest vector databases, but the ones who design the smartest agents to traverse abundant context and connect meaning across documents. In hindsight, RAG will look like training wheels. Useful, necessary, but temporary. The next decade of AI search will belong to systems that read and reason end-to-end. Retrieval isn’t dead—it’s just been demoted.


A History of Large Language Models

Large language models (LLMs) still feel a bit like magic to me. Of course, I understand the general machinery enough to know that they aren’t, but the gap between my outdated knowledge of the field and the state-of-the-art feels especially large right now. Things are moving fast. So six months ago, I decided to close that gap just a little by digging into what I believed was one of the core primitives underpinning LLMs: the attention mechanism in neural networks. I started by reading one of the landmark papers in the literature, which was published by Google Brain in 2017 under the catchy title Attention is all you need (Vaswani et al., 2017). As the title suggests, the authors did not invent the attention mechanism. Rather, they introduced a neural network architecture which was in some sense “all attention”. This architecture is the now-famous transformer. Clearly the transformer stands in contrast to whatever came before it, but what was that, and what did the transformer do differently? To answer these questions, I read a lot of papers, and the context that felt natural to provide here grew the more that I read. I went down the rabbit hole, and when I came out, I realized that what had started as a study of attention had grown into a bigger story. Attention is still the throughline, but there are other important themes, such as how neural networks generalize and the bitter lesson that simple methods that scale seem to triumph over clever methods which do not. This post is the product of that deep dive, and it is a stylized history of LLMs.

As a caveat, real life is endlessly detailed, and any summary or synthesis inevitably flattens this detail. So I will accidentally or intentionally skip over many important and related papers and ideas in the service of a synthesis. I will also skip over practicalities such as data preprocessing and advances in hardware and computing. My focus will be on what I view as the main methodological landmarks, and this history is simply one of many ways to tell this story.

I’ll start with an old idea, one so ubiquitous today that it might seem silly to belabor here. The idea is that neural networks automatically generalize using distributed representations. This idea has its roots in computational neuroscience, particularly Connectionism (McCulloch & Pitts, 1943), and was discussed explicitly in the 1980s in papers like Learning representations by back-propagating errors (Rumelhart et al., 1986) and Learning distributed representations of concepts (Hinton, 1986). Understanding it is key to understanding why LLMs work at all and thus understanding the long line of academic research driving towards them.

But first, a problem. The goal of natural language processing (NLP) is to model human language using computers. Until the 1980s, NLP systems were mostly based on handwritten rules and handcrafted features. However, by the early 1990s, researchers were exploring the use of statistical methods from machine learning. For an early and seminal example, see A statistical approach to machine translation (Brown et al., 1990). The core idea of statistical NLP is to model human language using a statistical language model, which is a probability distribution over all possible sequences in a language. This distribution is typically factorized such that each word depends on all words that precede it:

$$
p(w_{1:T}) = \prod_{t=1}^T p\left(w_t \mid w_{1:t-1}\right). \tag{1}
$$
Throughout this post, I will use the notation $w_{i:j}$ to denote elements in a sequence from positions $i$ to $j$ inclusive (where $i \leq j$):

$$
w_{i:j} := \{w_i, w_{i+1}, \dots, w_{j-1}, w_j\}. \tag{2}
$$

Given a good statistical model $p(w_{1:T})$, we can do many things. For example, we can rank the likelihood of different sequences of words and use that ranking to decide on things like a conversational agent’s output. Or we can translate a source sequence $s_{1:T}$ into a target sequence $w_{1:T}$ if we have the conditional probabilities between the two:

$$
p(w_{1:T} \mid s_{1:T}) \propto p(s_{1:T} \mid w_{1:T})\, p(w_{1:T}). \tag{3}
$$

Here, $p(w_{1:T})$ would be our language model of the target language, and $p(s_{1:T} \mid w_{1:T})$ would be our translation model. Today, this view is so pervasive that it might feel obvious, but with a little imagination, I think it’s easy to see how wrong this might have felt to a linguist forty-odd years ago. Equation 1 captures no language structure or parts of speech such as nouns or verbs or adjectives—see e.g. (Chomsky, 1956) on formal grammars. Instead, it reduces the complexity of human language to next-word prediction. If we didn’t know already that this worked, we might doubt that it would.

More importantly for us, estimating the model in Equation 1 is hard! The main challenge is the curse of dimensionality. There are many, many words in a vocabulary. For example, linguists estimate that English has roughly a million words, give or take a few hundred thousand depending on how you count them. Furthermore, this problem explodes in some tasks such as translation, where there are many possible conditional probabilities $p(s_{1:T} \mid w_{1:T})$. So when estimating the conditional probabilities of our language model, we cannot possibly encounter all possible combinations. We have a data sparsity problem, and estimating the true probabilities becomes impossible.

Perhaps the oldest idea to tackle this problem was proposed in Andrey Markov’s pioneering mathematical analysis of Pushkin’s Eugene Onegin (Markov, 1913). He made the assumption that each conditional probability in Equation 1 only depends on the previous $N$ terms:

$$
p(w_{1:T}) = \prod_{t=1}^T p\left(w_t \mid w_{1:t-1}\right) \approx \prod_{t=1}^T p\left(w_t \mid w_{t-N:t-1}\right). \tag{4}
$$

Today, we would call this a “Markov assumption”, and Equation 4 is the famous $N$-gram model. Particularly for small $N$, say $N=1$ or $N=2$, we might be able to get reasonable estimates from data. But here is the problem, and this problem is a central theme driving towards the attention mechanism: the Markov assumption destroys context.
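To see the $N$-gram idea, and its limitation, concretely, here is a tiny count-based bigram model ($N=2$); this is a toy sketch rather than anything production-grade, and the miniature corpus is made up.

```python
from collections import Counter, defaultdict

corpus = [
    "the cat is walking on the sidewalk".split(),
    "the dog is walking on the sidewalk".split(),
]

# Count bigram transitions: p(w_t | w_{t-1}) ~ count(w_{t-1}, w_t) / count(w_{t-1}).
counts = defaultdict(Counter)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        counts[prev][curr] += 1

def bigram_prob(prev: str, curr: str) -> float:
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0

print(bigram_prob("is", "walking"))   # 1.0 -- seen in the training data
print(bigram_prob("the", "cat"))      # 0.25
# The model conditions on exactly one previous word, so everything earlier
# in the sentence -- the context -- is simply thrown away.
```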
Without more context, a language model can never replicate the complexity and nuance of natural language. As I understand it, this was conceptually the state of the field circa 2000. But then in 2003, a seminal paper was published: A neural probabilistic language model (Bengio et al., 2003). In that paper, the authors proposed a novel idea: to avoid this data sparsity problem, this curse of dimensionality, we can use neural networks to learn a language model using what they call “distributed representations” of words. (Today, we might call these “word embeddings”.) They proposed three core ideas. First, they represented each word as a real-valued vector or embedding; then, they expressed Equation 1 in terms of these embeddings; and finally, they trained a neural network to simultaneously learn the embeddings and the parameters of the probability function (neural network) in Equation 1 using back-propagation (Rumelhart et al., 1986).

That’s a lot, so let’s break it down a bit. Our goal here is to learn a good model $f_{\boldsymbol{\Theta}}$ of natural language such that

$$
p(w_t \mid w_{1:t-1}) \approx f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}). \tag{5}
$$

So the left-hand side is the true conditional distribution, capturing next-word prediction. It’s the goal of language modeling. But in practice, modeling the full context is hard. So we settle for the right-hand side, which is a parametric approximation $f_{\boldsymbol{\Theta}}$ of this true distribution with a context window of size $N$.

In Bengio, they model $f_{\boldsymbol{\Theta}}$ using two components. First, they represent words as vectors. Let $\mathcal{V}$ denote our vocabulary, which is simply a set of integers $\mathcal{V} = \{1, 2, \dots, V\}$ indexing all $V$ words in a language. We will represent each word as a $D$-vector, and so we can represent the entire language as a matrix $\mathbf{C} \in \mathbb{R}^{V \times D}$ (Figure 1). Now for the $t$-th word in a sequence $w_{1:T}$, we have an associated index in the vocabulary, which we will denote as $I(w_t) \in \mathcal{V}$. This notation might be a bit odd, but I’m careful here because $w_t$ is not a well-defined mathematical object, and it cannot index $\mathbf{C}$. But $I(w_t)$ is an integer and can index $\mathbf{C}$, and so $\mathbf{c}_{I(w_t)}$ is a $D$-dimensional vector (a row vector of $\mathbf{C}$) representing the $I(w_t)$-th word in the vocabulary, associated with the $t$-th word in the sequence. This vector is what we are calling an “embedding” or “distributed representation”.

Second, Bengio et al represent the probability function over words (Equation 1) as a feed-forward neural network $g$ with parameters $\boldsymbol{\Omega}$ and arguments $\mathbf{C}$:

$$
f_{\boldsymbol{\Theta}}(w_{t-1}, \dots, w_{t-N}) = g_{\boldsymbol{\Omega}}\left(\mathbf{c}_{I(w_{t-1})}, \dots, \mathbf{c}_{I(w_{t-N})}\right). \tag{6}
$$

They then use back-propagation to jointly estimate the parameters

$$
\boldsymbol{\Theta} := \{\mathbf{C}, \boldsymbol{\Omega}\}. \tag{7}
$$
In other words, they learn the neural network parameters $\boldsymbol{\Omega}$ at the same time as learning the word embeddings $\mathbf{C}$. Note that “distributed representation” can refer to either the continuously-valued vector, e.g. word embedding, or the concept distributed across neurons. This duality is exemplified in $\mathbf{C}$, which is both a set of learnable parameters and the embeddings themselves!

Why might this work? The authors explain the idea so well that it’s worth just quoting the original paper:

> In the proposed model, it will so generalize because “similar” words are expected to have a similar feature vector, and because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability. Therefore, the presence of only one of the above sentences in the training data will increase the probability, not only of that sentence, but also of its combinatorial number of “neighbors” in sentence space.

This is a beautiful idea. If we have word embeddings that are “well-organized” in the sense that words that play similar roles in sentences (semantically and syntactically) have similar embeddings and if we have a smooth function from word embeddings to probabilities, then small changes in words lead to small changes in embeddings which lead to small changes in probabilities (Figure 2). Pause for a moment to really think about this. Words are discrete objects, and a “small change in a word”, while intuitive to humans, is ill-defined. But this approach concretizes what that means. To quote the paper Linguistic regularities in continuous space word representations (Mikolov et al., 2013), which we’ll discuss later:

> Whereas an $N$-gram model works in terms of discrete units that have no inherent relationship to one another, a continuous space model works in terms of word vectors where similar words are likely to have similar vectors. Thus, when the model parameters are adjusted in response to a particular word or word-sequence, the improvements will carry over to occurrences of similar words and sequences.

For example, if the words “dog” and “cat” are nearby in word-embedding space, then maybe “The cat is walking on the sidewalk” and “The dog is walking on the sidewalk” should have similar probabilities. And only one of these two sentences would need to exist in the training data for the model to generalize well to both sentences!

As I mentioned, this idea was not entirely new in 2003. Since the 1980s, researchers had known that neural networks can generalize because they distribute their representation across many neurons (Hinton, 1986). Each new example modifies the weights, incorporating new knowledge into the old. However (Bengio et al., 2003) is a landmark paper in NLP because it was the first application of this idea to language modeling. The Bengio paper took seriously the idea that we could build a statistical model of language using the distributed representations of words. It was the first hint that we could use neural networks to overcome the curse of dimensionality that plagued statistical NLP.

This is a promising idea, but we glossed over an important detail: how do we actually train this model? What is the loss function or objective that the neural network should use? And given a fit model, how do we generate a new sequence?
These are important questions to answer per se, but they are also important questions because, at a conceptual level, there is really no difference between Bengio’s model and the frontier large language models today. So understanding this is critical to understanding LLMs. Both are autoregressive models and trained using next-word prediction. As an example, imagine we have the following input sentence, which is a quote from Virginia Woolf’s A Room of One’s Own:

$$
\text{“Intellectual freedom depends upon material things.”} \tag{8}
$$

Now imagine that our model’s context window has size $N=2$, and let $\mathbf{c}_p$ denote a padding $D$-vector of all zeros. In Bengio’s model, we would start by representing just the first word, “intellectual”, as a word embedding. So the first non-zero input to our model would be:

$$
\mathbf{x}_2 = \begin{bmatrix} \mathbf{c}_p \\ \mathbf{c}_{I(w_1)} \end{bmatrix} = \begin{bmatrix} \mathbf{c}_p \\ \mathbf{c}_{I(\text{“intellectual”})} \end{bmatrix}. \tag{9}
$$

The output of the neural network would be a $V$-dimensional vector representing the probability distribution over $p(w_2 \mid w_1)$. Illustratively:

$$
\mathbf{y}_2 = \begin{bmatrix} p(w_2 = \text{“about”}) \\ p(w_2 = \text{“above”}) \\ \vdots \\ p(w_2 = \text{“freedom”}) \\ \vdots \end{bmatrix}. \tag{10}
$$

We would then compute the cross-entropy loss between this output vector and the true distribution, which is really just a one-hot vector with $1$ for the word “freedom” and $0$ everywhere else. We would then repeat this process on the next word. So the next input sequence would be

$$
\mathbf{x}_3 = \begin{bmatrix} \mathbf{c}_{I(\text{“intellectual”})} \\ \mathbf{c}_{I(\text{“freedom”})} \end{bmatrix}, \tag{11}
$$

and the output would represent the probability distribution $p(w_3 \mid w_{1:2})$. And again, we would minimize the cross-entropy loss between its associated output vector and a one-hot vector encoding the word “depends”. We would repeat this process until the end of the sentence.

Of course, longer sequences are more expensive to train in this way, and this is precisely the point of the context window in Bengio’s paper. We only consider the $N$ previous words when predicting the next word. This idea of a limited context window is critical, as it is a constraint that persists into the present day. In this example, since $N=2$, the third input would be

$$
\mathbf{x}_4 = \begin{bmatrix} \mathbf{c}_{I(\text{“freedom”})} \\ \mathbf{c}_{I(\text{“depends”})} \end{bmatrix}. \tag{12}
$$

So the model completely loses the word “intellectual”. It is now outside the context.
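Here is a minimal sketch of how those padded, fixed-window training pairs could be constructed, just to make the sliding-window bookkeeping explicit; the `PAD` token and the `vocab` mapping are illustrative choices, not details from the paper.

```python
PAD = "<pad>"
sentence = "intellectual freedom depends upon material things .".split()
vocab = {word: i for i, word in enumerate([PAD] + sorted(set(sentence)))}

def make_training_pairs(words: list[str], N: int = 2) -> list[tuple[list[int], int]]:
    """Build (context indices, target index) pairs for next-word prediction.

    Each context holds exactly N previous words, padded on the left,
    so anything older than N words ago is simply dropped.
    """
    padded = [PAD] * N + words
    pairs = []
    for t in range(len(words)):
        context = [vocab[w] for w in padded[t:t + N]]
        target = vocab[words[t]]
        pairs.append((context, target))
    return pairs

for context, target in make_training_pairs(sentence)[:4]:
    print(context, "->", target)
# The second pair predicts "freedom" from (<pad>, "intellectual"); by the
# fourth pair, "intellectual" has already fallen out of the window.
```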
Since minimizing the cross-entropy loss is equivalent to maximizing the log likelihood—see here for an example if this idea is new to you—we can generalize the logic above by saying that we want to maximize the log likelihood of our training data, again using a neural network as a parametric function approximation of the true distribution:

$$
\boldsymbol{\Theta}^{\star} = \arg\!\max_{\boldsymbol{\Theta}} \left\{ \sum_{t=1}^T \log g_{\boldsymbol{\Omega}} \left(\mathbf{c}_{I(w_{t-N})}, \dots, \mathbf{c}_{I(w_{t-1})} \right) \right\}. \tag{13}
$$

Of course, we can estimate $\boldsymbol{\Theta}^{\star}$ by minimizing the negative log likelihood using gradient descent via back-propagation. That’s it. At the conceptual level, this framework is no different from how frontier large language models are trained today. As we will see later though, there is a lot of additional machinery that is needed to make these models work in practice.

Finally, imagine we fit our model, meaning we find good parameters $\boldsymbol{\Theta}^{\star}$ that maximize our log likelihood. How can we use these parameters to generate a random sequence or sentence? We could draw the first word at random from the vocabulary. And then we could draw the next word conditional on the first word from our parametric approximation of $p(w_2 \mid w_1)$. And then we could draw the third word conditional on the second and first words from our parametric approximation of $p(w_3 \mid w_{1:2})$. And so on. This is why LLMs can both understand natural language and generate new sentences. They are not just descriptive models; they are generative models. There are some subtleties I am glossing over, such as special embeddings to denote the start and end of a sequence, preprocessing steps like lowercasing words, tokenization, and handling out-of-vocabulary words. But I don’t think these details matter much here. As an aside, we can call any model trained in this way autoregressive. In statistics, an autoregressive model is any model where a variable is predicted using its own previous values. A classic example of this is an AR model such as AR(1).

While (Bengio et al., 2003) was a landmark paper, its full impact was delayed by roughly a decade. This is because training neural networks was hard at the time. It’s worth checking out that paper and seeing just how primitive the engineering feels today. For example, they trained on CPUs and without modern tooling like automatic differentiation libraries. In the intervening decade, there was some early work that built on Bengio’s model. For example, in A unified architecture for natural language processing: Deep neural networks with multitask learning (Collobert & Weston, 2008), the authors demonstrate that Bengio’s neural language model could be trained and used on a variety of downstream tasks. And in Word representations: A simple and general method for semi-supervised learning (Turian et al., 2010), the authors demonstrate that word embeddings improve state-of-the-art NLP systems when included as additional features. But none of these contributions were convincing demonstrations of Bengio’s main idea.
So seven years after Bengio et al, it was $N$-grams, not neural networks, which were still the state-of-the-art, at least in practice and outside specialized benchmarks. Honestly, I found this surprising, but I kept reading this claim in various papers. For example, in the introduction to Recurrent neural network based language model (Mikolov et al., 2010), the authors wrote:

> It is questionable if there has been any significant progress in language modeling over simple $N$-gram models… In fact, most of the proposed advanced language modeling techniques provide only tiny improvements over simple baselines, and are rarely used in practice.

Or two years after that, in A fast and simple algorithm for training neural probabilistic language models (Mnih & Teh, 2012), the authors wrote:

> In spite of their superior performance, neural probabilistic language models remain far less widely used than $N$-gram models due to their notoriously long training times, which are measured in weeks even for moderately-sized datasets.

Of course, advanced techniques existed and were well known, but they were often impractical. So roughly a hundred years after Andrey Markov’s pioneering work, researchers were still struggling to represent human language in a form amenable for mathematics and computation, and $N$-grams were still considered a reasonable choice in NLP.

Today, neural networks are definitively state-of-the-art. What changed? The answer is that we learned to train variants of Bengio’s model at scale. Around 2012, researchers were finally able to train neural networks on large datasets. My understanding is that it was the so-called “AlexNet” paper, ImageNet classification with deep convolutional neural networks (Krizhevsky et al., 2012), that convinced many in the research community to pay attention. Convolutional neural networks were already well known and had been trained on small datasets since the 1980s (LeCun et al., 1989). But AlexNet was the first time a deep convolutional neural network was trained end-to-end on a very large (at the time) dataset, ImageNet (Deng et al., 2009), and using GPUs. The results were a tour de force. To quote the paper:

> We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

In other words, AlexNet demolished the state-of-the-art in computer vision. It achieved a roughly 40% reduction in relative error rate. Nothing else came close. As a comparison, the current fastest time for a men’s marathon is 2 hours and 35 seconds. The previous record was 2 hours and 69 seconds, so 34 seconds slower. Now imagine if someone came along and beat the record by half an hour. It would revolutionize the running world. At the time, computer vision was still dominated by handcrafted feature pipelines, and so the AlexNet results were extremely surprising. For example, in Introduction to the bag of features paradigm for image classification and retrieval (O’Hara & Draper, 2011), the authors wrote:

> The past decade has seen the growing popularity of Bag of Features (BoF) approaches to many computer vision tasks, including image classification, video search, robot localization, and texture recognition… BoF-based systems have set new performance standards on popular image classification benchmarks and have achieved scalability breakthroughs in image retrieval.
This introduction to bag-of-features models was put on arXiv in January 2011, whereas AlexNet was published at NeurIPS in December 2012, meaning that the claim above was contemporaneous with the training of AlexNet! My point here is to underscore just how surprising the rise of neural networks was. To be clear, I am sure many in the research community believed neural networks would work—Hinton has been a believer since probably the 1970s—but this was hardly the consensus view that it is today.

So the year 2012 was a changepoint. In 2003, Bengio et al set the stage conceptually. In 2012, Krizhevsky et al set the stage technologically. With hindsight, the obvious implication of AlexNet was that NLP researchers circa 2012 should try to train neural networks at scale. Of course, many researchers tried, but let’s ground ourselves in one particular model. This will help focus the narrative. To my knowledge, two of the earliest and most successful papers to try this idea were Efficient estimation of word representations in vector space (Mikolov et al., 2013) and Distributed representations of words and phrases and their compositionality (Mikolov et al., 2013). These papers are tightly related by both authorship and time, and together, they helped unlock the core ideas in Bengio’s paper, as well as introduce the famous word2vec model. So I think it’s fair to treat them as both a unit and as a landmark in our story.

To understand these two papers, we need to understand the computational problems Bengio faced, which means we need to understand the model in more technical detail. Let $\mathbf{x}_t$ be the input to the model, and $\mathbf{y}_t$ be the output. Bengio’s model did not support variable-length inputs, and thus the input sequence could only be a fixed number of $N$ words, each represented as a $D$-dimensional embedding. Let’s represent this input as the concatenation of $N$ different $D$-vectors from the matrix $\mathbf{C}$ mentioned above, so:

$$
\mathbf{x}_t := \begin{bmatrix} \mathbf{c}_{I(w_{t-1})} \\ \vdots \\ \mathbf{c}_{I(w_{t-N+1})} \end{bmatrix}. \tag{14}
$$

One way we can imagine constructing $\mathbf{x}_t$ is if we represent every word in our context window as a $V$-dimensional one-hot vector. Call this a matrix $\mathbf{Q}_t \in \mathbb{R}^{N \times V}$. Then $\mathbf{x}_t = \mathbf{Q}_t \mathbf{C}$ gives us the associated embeddings. In practice, though, we would never do a dense matrix multiplication with complexity $\mathcal{O}(VND)$. Instead, we would simply index into $\mathbf{C}$. So this operation has computational complexity $\mathcal{O}(ND)$. I only belabor this point because I found it confusing when first reading Bengio’s paper. (This point is made more clearly in (Collobert & Weston, 2008).)

After construction, this input $\mathbf{x}_t$ is then fed into an extremely simple (relative to today’s models) architecture, a feed-forward neural network with a linear projection layer and a nonlinear hidden layer:

$$
\begin{aligned}
g_{\boldsymbol{\Omega}}(\mathbf{x}_t) = \mathbf{y}_t &:= \mathbf{b} + \mathbf{W}\mathbf{x}_t + \mathbf{U} \tanh(\mathbf{z}_t), \\
\mathbf{z}_t &:= \mathbf{d} + \mathbf{H}\mathbf{x}_t.
\end{aligned} \tag{15}
$$
The output $\mathbf{y}_t \in \mathbb{R}^{V}$ represents the un-normalized probability of each word in the vocabulary. If normalized, this vector would represent the probability distribution we discussed in the autoregressive framework. Here, we see that $\mathbf{W} \in \mathbb{R}^{V \times ND}$ is a linear projection of the input embeddings $\mathbf{x}_t$, that $\mathbf{H} \in \mathbb{R}^{H \times ND}$ is a linear projection into a hidden state vector $\mathbf{z}_t \in \mathbb{R}^H$, and that $\mathbf{U} \in \mathbb{R}^{V \times H}$ is a linear projection of the nonlinear hidden state vector. So clearly the parameters mentioned in Equation 7 can be concretized as

$$
\{\mathbf{C}, \boldsymbol{\Omega}\} := \{\mathbf{C}, \mathbf{b}, \mathbf{W}, \mathbf{U}, \mathbf{d}, \mathbf{H}\}. \tag{16}
$$

So why was this expensive to train? We can see that the computational complexity to compute $\mathbf{y}_t$ is proportional to:

$$
\underbrace{ND}_{\mathbf{QC}} + \underbrace{VND}_{\mathbf{W}\mathbf{x}_t} + \underbrace{VH}_{\mathbf{U}\tanh(\mathbf{z}_t)} + \underbrace{HND}_{\mathbf{H}\mathbf{x}_t}. \tag{17}
$$

Note that this complexity is for every single word in the corpus, and we must also account for the number of training epochs. In (Mikolov et al., 2013), the authors write that a “common choice” is $N=10$ and that $ND$ is typically around $500$ to $2000$. However, the hidden layer has dimension $H$ (commonly around $2000$ or so), and this is multiplied by the size of the vocabulary! What does this mean? The dominating term in Equation 17 is $VH$. Furthermore, this complexity is just for computing the un-normalized probabilities $\mathbf{y}_t$. To normalize these, we must compute the softmax function over the size of the vocabulary $V$:

$$
p(w_t \mid w_{t-N:t-1}) = \frac{\exp\left(\mathbf{y}_t\right)}{\sum_{i=1}^V \exp\left(\mathbf{y}_i\right)}. \tag{18}
$$

As I understand it, these were the computational problems Bengio faced. The two Mikolov papers did not present a single trick to solve them. Rather, the papers made a number of modeling choices, mostly already established in the literature, that in combination finally made learning distributed representations of words scalable.

First, in the first paper, they avoided computing the full softmax function using hierarchical softmax, introduced by Morin and Bengio in Hierarchical probabilistic neural network language model (Morin & Bengio, 2005). I don’t think the details of this matter much here. See this blog post for a nice explanation with code. Suffice to say that it’s an efficient way to compute the normalized probabilities in Equation 18. The computational complexity is reduced from $\mathcal{O}(V)$ to $\mathcal{O}(\log_2 V)$.
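Before getting to the second paper’s trick, here is a minimal NumPy sketch of the forward pass in Equation 15, with made-up (and deliberately small) dimensions, just to make the cost structure visible; the matrices that touch the vocabulary size $V$ dominate the work.

```python
import numpy as np

rng = np.random.default_rng(0)

V, N, D, H = 10_000, 10, 50, 200  # toy sizes; a real V would be ~1e6
C = rng.normal(size=(V, D))       # word embeddings (learned)
W = rng.normal(size=(V, N * D))   # direct connections, input -> output
U = rng.normal(size=(V, H))       # hidden -> output
Hm = rng.normal(size=(H, N * D))  # input -> hidden
b, d = np.zeros(V), np.zeros(H)

def forward(context_ids: np.ndarray) -> np.ndarray:
    """Un-normalized scores y_t for one context of N word indices (Equation 15)."""
    x = C[context_ids].reshape(-1)        # lookup, O(ND); no dense one-hot multiply
    z = d + Hm @ x                        # hidden state, O(HND)
    y = b + W @ x + U @ np.tanh(z)        # output scores, O(VND + VH) -- the expensive part
    return y

context = rng.integers(0, V, size=N)
y = forward(context)
probs = np.exp(y - y.max()) / np.exp(y - y.max()).sum()  # softmax over all V words (Equation 18)
```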
In the second paper, they further sped up the softmax computation by introducing a technique called negative sampling. The theory here is rich and deserving of its own post, but the main idea is to draw $K$ samples from a noise distribution and train the model to disambiguate observations from noise. The important point here is that one can prove this converges to the correct probabilities without explicitly computing the normalizing constant. See (Gutmann & Hyvärinen, 2010) for details. We don’t need to fully grok these techniques; just know that these two approaches are both ways of getting around the expensive normalization in Equation 18. For example, if $V = 1\times 10^6$, then $\log_2(V) \approx 20$. And in the second paper, they chose $K$ to be $2$ to $20$ depending on the dataset.

Second, they stripped out the non-linear part of Bengio’s model (so removing $\mathbf{U}\tanh(\mathbf{z}_t)$), reducing the model to a simple linear operation: a dot product. The result is a model that is log-linear in the features, which I’ll explain in a moment.

Now the models. In the first paper, they presented two models, a continuous bag-of-words model (CBOW) and a continuous skip-gram model (skip-gram). These are the foundations of the word2vec NLP toolkit. In the CBOW model, a set of neighboring words are averaged to predict a target word; and in the skip-gram model, a target word is used to predict its neighboring words (Figure 3). Both worked empirically in practice, but the authors only built on the skip-gram model in the second paper. And since I don’t think it’s that important here to understand both, I’ll just focus on the skip-gram model.

Let’s build a little intuition by going into detail. The objective of the skip-gram model is to minimize the cross-entropy loss between a single target word and its neighboring words. So the input to the model is only a single $D$-vector representing a single word (so no context window). The outputs, however, are the $N$ words surrounding the input. Let $N = 2C$. Then the objective function is:

$$
\frac{1}{T} \sum_{t=1}^T \sum_{-C \leq j \leq C,\; j \neq 0} \log p(w_{t+j} \mid w_t). \tag{19}
$$

I will continue to use the notation $N$ for this context window, but clearly it is different in precise meaning from the $N$ in an $N$-gram or the $N$ in Bengio’s paper. We model the conditional probability in Equation 19 via a simple log-linear function:

$$
p(w_{t+j} \mid w_t) = p(\mathbf{u}_{I(w_{t+j})} \mid \mathbf{c}_{I(w_{t})}) = \frac{\exp\left( \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle \right)}{\sum_{i \in \mathcal{V}} \exp\left( \langle \mathbf{u}_i, \mathbf{c}_{I(w_{t})} \rangle \right)}. \tag{20}
$$

Here, $\mathbf{c}_i$ are word embeddings of the inputs. These are analogous to the row-vectors of $\mathbf{C}$ in Bengio’s model and again are constructed via a lookup.
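To make Equation 20 concrete, here is a toy NumPy sketch of the skip-gram scoring with a full softmax over a tiny vocabulary; real implementations would use hierarchical softmax or negative sampling instead. The matrix `U` here plays the role of the output embeddings discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 4                       # tiny vocabulary and embedding size for illustration
C = rng.normal(size=(V, D))       # input (center-word) embeddings
U = rng.normal(size=(V, D))       # output (context-word) embeddings

def skipgram_probs(center_id: int) -> np.ndarray:
    """p(w_{t+j} | w_t) for every word in the vocabulary (Equation 20).

    The score is just a dot product between the center word's input
    embedding and every output embedding: a log-linear model.
    """
    scores = U @ C[center_id]     # <u_i, c_center> for all i
    scores -= scores.max()        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = skipgram_probs(center_id=3)
print(probs.sum())  # 1.0: a proper distribution over context words
```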
The output embeddings $\mathbf{u}$ are a little trickier to interpret. If we were using the full softmax function, we would have $V$ such output embeddings, and these would represent the weights of the softmax function. But when using hierarchical softmax or negative sampling, the interpretation changes a bit. Again, I don’t think the details really matter here. The key point is that we take a sequence $w_{1:T}$, select the appropriate embeddings $\mathbf{c}_{1:T}$, and compute Equation 20 directly, learning both the parameters $\mathbf{C}$ and $\mathbf{U}$. This is called a “log-linear model” because the log of the conditional probability is linear with respect to its arguments:

$$
\log p(w_{t+j} \mid w_t) = \langle \mathbf{u}_{I(w_{t+j})}, \mathbf{c}_{I(w_{t})} \rangle - Z. \tag{21}
$$

Here, I just write $Z$ to denote the normalizing constant, the denominator in Equation 20, because it is not particularly interesting, and we do not even need to compute it when using negative sampling. The key relationship that the model is learning is a simple linear weighting of the input embeddings that allows it to predict nearby words.

Hopefully, it is clear why this model is so fast to train. We have no hidden layers or nonlinearities. We simply compute a dot product and ignore the normalizing constant. For example, when using the full softmax, the computational complexity is:

$$
N(D + DV). \tag{22}
$$

Here, we have $D + DV$ dot products, and we need to do it over $N$ words in our context window. However, in practice, we can eliminate $V$ entirely, replacing it with something around $\log_2(V)$ or $K$. This is significantly smaller than Equation 17. For example, if we assume that $H=D=500$, $N=10$, and $V=1 \times 10^{6}$, then hierarchical softmax is five orders of magnitude smaller in terms of complexity.

So in these two seminal Mikolov papers, the authors stripped down Bengio’s core idea to a simple log-linear model, and thus were able to train that model at scale. That said, I want to stress a subtlety that took me time to grok. Neither the CBOW nor the continuous skip-gram models presented here are full language models. Notice that their objective functions (nearby-word prediction) are not in the autoregressive framework and thus cannot easily plug into Equation 1. That’s because the goal of these papers was not to learn a full language model but rather to learn good word embeddings. They say this explicitly in the first paper (emphasis mine):

> Representation of words as continuous vectors has a long history. A very popular model architecture for estimating neural network language model (NNLM) was proposed in (Bengio et al., 2003), where a feed-forward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others. Another interesting architecture of NNLM was presented in (Mikolov, 2007; Mikolov et al., 2009), where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM.
> Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.

So the word2vec models were simple and shallow (single layer) neural networks designed for fast training and to learn good embeddings. They were not full language models. This is a major distinction from similar prior art, such as A scalable hierarchical distributed language model (Mnih & Hinton, 2008). In this paper, the authors demonstrate more scalable inference of Bengio’s model by representing the vocabulary compactly through binary trees and by using a log-bilinear model. But they go end-to-end to a language model, as the paper title suggests. Mikolov et al’s two models were relentlessly simple and efficient.

As I understand it, both CBOW and skip-gram worked well in practice. It did not matter if neighboring words predict a target word or if that target word predicts its neighboring words. The real differentiator was that both models could be efficiently trained at scale. And with scale, something remarkable happened: the authors discovered that distributed representations of words, trained in this fashion, captured semantic and syntactic information. Today, linguistic regularities in word embeddings are so well-established that they might seem boring to read about here. But understood in context, these regularities should be surprising! How can a simple linear model, trained on essentially next- or nearby-word prediction via maximum likelihood estimation, learn distributed representations of words with remarkable syntactic and semantic properties and relationships? In my mind, this was the first big result that suggested neural networks would not just work but really work in language modeling.

The word2vec papers were not the first to observe these properties. My understanding is that credit goes to yet another Mikolov paper from 2013, Linguistic regularities in continuous space word representations (Mikolov et al., 2013). Here, the authors showed that many semantic and syntactic relationships correspond to approximately constant vector offsets in the embedding’s vector space. To be clear, researchers had long observed that one could uncover structure in vector representations of words. For example, in the 1989 paper Self-organizing semantic maps (Ritter & Kohonen, 1989), the authors trained self-organizing maps (Kohonen, 1982) on pre-computed two-dimensional vectors representing words and demonstrated that these maps contain semantic structure. However, these models were not trained end-to-end (the representations themselves were not learned) and did not have linear structure. It would be a stretch to call these vectors “word embeddings”. But log-linear models like word2vec were remarkable precisely because they enabled analogical reasoning through simple vector offsets, i.e. linear operations (Figure 4)! Perhaps the most famous example of analogical reasoning with word embeddings is the relationship “king is to queen as man is to woman”:

$$
\text{vec}\left(\text{“king”}\right) - \text{vec}\left(\text{“man”}\right) + \text{vec}\left(\text{“woman”}\right) \approx \text{vec}\left(\text{“queen”}\right). \tag{23}
$$
Or in (Mikolov et al., 2013), the authors give the example that “Russia” plus “river” is the Volga:

$$
\text{vec}\left(\text{“Russia”}\right) + \text{vec}\left(\text{“river”}\right) \approx \text{vec}\left(\text{“Volga River”}\right). \tag{24}
$$

In my mind, these are pretty fascinating and non-obvious results. They suggest that the methods are not mixing vector dimensions in undesirable ways and are staying approximately linear. Again, viewed with fresh eyes, it is really quite remarkable! If you were a researcher in 2003 reading Bengio’s paper, would you have predicted this result with high confidence?

While these two Mikolov papers are landmark papers on learning word embeddings at scale, they are by no means the only ones. Many other researchers worked in this area. Perhaps the most famous paper on word embeddings that we do not have time to discuss is GloVe: Global vectors for word representation (Pennington et al., 2014). In this paper, the authors present a unifying view between two common methods for learning word embeddings, global matrix factorization methods and local context window methods. But there were many others as well, such as Skip-thought vectors (Kiros et al., 2015), Word embeddings through Hellinger PCA (Lebret & Collobert, 2013), and Eigenwords: spectral word embeddings (Dhillon et al., 2015), to cite just a few illustrative examples.

For ease of presentation, I have focused on word-level embeddings. But the idea was naturally and quickly extended to larger contexts. This was motivated by the fact that a word’s meaning is obviously context-dependent (polysemy). For example, the word “bank” might refer to a financial institution or the side of a river. A word embedding for “bank” that is not context dependent must somehow flatten this distinction. So lack of context is obviously a limitation. Researchers tackled this through a variety of approaches. One approach was to use the hidden states of a bidirectional long short-term memory network (LSTM) as context-specific embeddings, as in context2vec: Learning generic context embedding with bidirectional LSTM (Melamud et al., 2016) or Learned in translation: contextualized word vectors (McCann et al., 2017). But perhaps the most noteworthy example of this idea—and one I mention here because it will come up later—was Deep contextualized word representations (Peters et al., 2018), or ELMo. Here, the authors both used a bidirectional LSTM to extract more context-dependent word embeddings and then trained on an objective function that was dependent on the downstream task. This hints at combining pre-trained embeddings with supervised fine-tuning, which we’ll see later.

By 2013, word- and phrase-level embeddings demonstrably worked. The key to unlocking them was simple methods that scaled on modern hardware. However, the problem with these embeddings is that they were still with respect to a fixed window. It was not immediately obvious how this idea could be extended to longer phrases or sentences or to larger texts. Of course, researchers had tried. For example, (Collobert & Weston, 2008) used the idea of time-delay neural networks (Waibel et al., 1989) to model sentences of variable lengths, but the authors used convolutions that still had a fixed-width window size. The embedding itself, then, was not constructed while accounting for long-range dependencies.
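As a quick illustration of the vector-offset analogies above, here is how one could query them with gensim and a set of pretrained word2vec-format vectors; the file path is hypothetical, and the exact neighbors you get depend on which vectors you load.

```python
from gensim.models import KeyedVectors

# Path is hypothetical; any pretrained vectors in word2vec format will do.
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# king - man + woman ~= queen (Equation 23), answered by nearest-neighbor
# search over the vector offsets.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```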
So word embeddings, while a beautiful idea, only set the stage for the next big idea in our history: tackling the problem of modeling long-range dependencies without an explicit context window. The key innovation here was sequence-to-sequence models. In a sequence-to-sequence model, a neural network encodes a variable-length input sequence into a fixed-length vector, while a second neural network decodes this fixed-length vector back into a variable-length output sequence. In both Bengio's and Mikolov's papers, the input was an embedding ($\mathbf{c}$ in Equations 14 and 20). In a sequence-to-sequence model, this intermediate fixed-length vector now plays the role of the embedding. The precise architectures used for the encoder and decoder can vary, but clearly they should be architectures that support variable-length sequences, such as recurrent neural networks (RNNs) or LSTMs.

To me, the most intuitive example of a sequence-to-sequence model is a translation model. The input sequence is a sentence in a source language like English, and the output sequence is a sentence in a target language like Chinese (Figure 5). And since some of the most important early work in sequence-to-sequence modeling was in neural machine translation (NMT), I'll often use translation as a default example. However, the more general case is any mapping from one sequence to another.

This idea is fairly straightforward; it is analogous to an auto-encoder but for variable-length sequences, and auto-encoders (Bourlard & Kamp, 1988) are nearly as old as back-propagation. However, as we have already seen, even seemingly simple ideas are hard-won. The original work on RNNs and LSTMs goes back to at least the early 1990s, with seminal papers like Finding structure in time (Elman, 1990), Serial order: A parallel distributed processing approach (Jordan, 1997), and Long short-term memory (Hochreiter & Schmidhuber, 1997). By the 2010s, these sequential models were well known and already used in NLP; see (Mikolov et al., 2010; Sutskever et al., 2011; Graves, 2013) for example. These models were an important bridge, proving that we could train RNNs at scale and overcome the vanishing gradient problem discussed in Learning long-term dependencies with gradient descent is difficult (Bengio et al., 1994). But they were not yet sequence-to-sequence models.

To my knowledge, the first paper to propose a full encoder–decoder architecture for NLP was Recurrent continuous translation models (Kalchbrenner & Blunsom, 2013). Here, the authors proposed training two neural networks end-to-end. The decoder was an RNN, inspired by the model in (Mikolov et al., 2010). But somewhat surprisingly, the encoder was not also an RNN. With hindsight, two RNNs feel like the obvious choice, but instead the authors used a convolutional sentence model (CSM). The details don't really matter here, but this is essentially an NLP model which uses convolutional layers. Why this choice? Well, CSMs were actually developed by the same authors in the same year, in Recurrent convolutional neural networks for discourse compositionality (Kalchbrenner & Blunsom, 2013), and my hypothesis is that this choice just felt obvious to them at the time. So (Kalchbrenner & Blunsom, 2013) was a landmark paper in the sense that it was the first attempt at a sequence-to-sequence model, but with hindsight we can immediately see how to improve it with a better sequential model for the encoder. And that is precisely what happened in two follow-up papers.
First, in Learning phrase representations using RNN encoder–decoder for statistical machine translation (Cho et al., 2014), the authors proposed the first encoder–decoder architecture in which both neural networks were RNNs. And then in Sequence to sequence learning with neural networks (Sutskever et al., 2014), the authors proposed a similar model but using LSTMs, since LSTMs often work better at handling the aforementioned vanishing gradient problem. In this paper, Sutskever makes the connection to Kalchbrenner explicit:

Our work is closely related to Kalchbrenner and Blunsom, who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words.

As a nitpick, convolutional neural networks do model local patterns and order, but they lose global order without very large receptive fields. But Sutskever's point is directionally correct. So even at the time, the academic history we are tracing here was clear.

To understand these models in a bit more detail, let's go through the RNN encoder–decoder in (Cho et al., 2014), using Figure 6 as a reference. Let $\mathcal{X}$ be a variable-length input sequence with length $T_x$, and let $\mathcal{Y}$ be a variable-length output sequence with length $T_y$:

$$\begin{aligned} \mathcal{X} &= \{ \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{T_x} \}, \\ \mathcal{Y} &= \{ \mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_{T_y} \}. \end{aligned} \tag{25}$$

Note that $(\mathcal{X}, \mathcal{Y})$ is a single observation pair, but I am suppressing the sample index for ease of notation. Also, I bold each vector in both sequences because they are embedded words. In an RNN, we iteratively compute hidden state variables over $T_x$ steps, where for the $t$-th step we define a recurrence relation between hidden states as:

$$\mathbf{h}_t = f_{\textsf{enc}} \left( \mathbf{h}_{t-1}, \mathbf{x}_t \right). \tag{26}$$

This might be a little abstract. So concretely, a simple RNN might instantiate $f_{\textsf{enc}}$ as the following nonlinear function of the current word embedding and the previous hidden state:

$$\mathbf{h}_t = \tanh \left(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t \right). \tag{27}$$

The matrices hopefully have obvious dimensions, and we can initialize the first hidden state vector $\mathbf{h}_0$ however we like, such as a vector of all zeros. This is simply one choice, though. We can imagine many types of choices, such as a vanilla RNN unit or an LSTM unit. The key point is that the hidden state vectors

$$\mathcal{H} = \{\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_{T_x}\} \tag{28}$$

carry forward information from previous words in the sequence via these recurrent connections, much like a hidden Markov model (Baum & Petrie, 1966). A powerful consequence of this model is that RNNs do not limit the size of the input context window.
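To make the recurrence concrete, here is a minimal NumPy sketch of the encoder in Equations 26–28, using the tanh cell from Equation 27. The dimensions and random weights are arbitrary placeholders, not values from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3          # arbitrary toy sizes

# Encoder parameters (Equation 27): W_hh mixes the previous hidden state,
# W_xh mixes the current word embedding.
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))

def encode(X):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t) over a variable-length
    sequence of word embeddings and return all hidden states."""
    h = np.zeros(hidden_dim)          # h_0: one common initialization choice
    H = []
    for x_t in X:                     # T_x steps, whatever T_x happens to be
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        H.append(h)
    return H                          # H = {h_1, ..., h_{T_x}} (Equation 28)

# Sequences of different lengths go through the same code path.
short_sentence = rng.normal(size=(2, embed_dim))   # T_x = 2
long_sentence  = rng.normal(size=(7, embed_dim))   # T_x = 7
print(len(encode(short_sentence)), len(encode(long_sentence)))  # 2 7
```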
Different input sequences $\mathcal{X}$ can be different sizes, unlike in the $N$-gram model or in Bengio's model (Equation 14). See Andrej Karpathy's excellent blog post, The unreasonable effectiveness of recurrent neural networks, for a more detailed presentation of RNNs.

Finally, we define the context vector $\mathbf{c}$ as some function of the hidden states:

$$\mathbf{c} = q(\mathcal{H}). \tag{29}$$

Notice that $\mathbf{c}$ does not have a time index, because it compresses all the temporal information in the input sequence $\mathcal{X}$ into a fixed-width vector. The easiest definition of $\mathbf{c}$ is simply the last hidden state vector, or $\mathbf{c} = \mathbf{h}_{T_x}$. This context vector becomes an input to the decoder, another RNN with recurrence relation

$$\mathbf{s}_t = f_{\textsf{dec}} \left( \mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c} \right), \tag{30}$$

and hidden states

$$\mathcal{S} = \{\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_{T_y}\}. \tag{31}$$

The decoder then outputs the sequence $\mathcal{Y}$, one word at a time. The typical objective of a sequence-to-sequence model is again the autoregressive objective of next-word prediction: maximize a log likelihood, in which each conditional probability is modeled via the decoder RNN:

$$\log p(\mathcal{Y}) = \sum_{t=1}^{T_y} \log p(\mathbf{y}_t \mid \mathbf{y}_{1:t-1}) = \sum_{t=1}^{T_y} \log f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}). \tag{32}$$

Again, this might be a bit abstract. So for example, one possible instantiation of $f_{\textsf{dec}}$ is as a linear transformation of the input variables:

$$f_{\textsf{dec}}(\mathbf{s}_{t-1}, \mathbf{y}_{t-1}, \mathbf{c}) = \mathbf{W}_{zs} \mathbf{s}_t + \mathbf{W}_{zy} \mathbf{y}_{t-1} + \mathbf{W}_{zc} \mathbf{c}. \tag{33}$$

Of course, this is just one choice. Then all the model weights are learned end-to-end by optimizing this log likelihood (Equation 32). In this way, we can convert a variable-length input $\mathcal{X}$ into a variable-length output $\mathcal{Y}$.

This RNN encoder–decoder framework is powerful, since many problems in NLP can be framed in this way. For example, text summarization, machine translation, and agentic conversation can all be framed as sequence-to-sequence modeling challenges. To be clear, other researchers around this time had attempted other approaches to handling variable-length sequences, such as the recursive neural tensor network in Recursive deep models for semantic compositionality over a sentiment treebank (Socher et al., 2013). But the RNN encoder–decoder would become the de facto framework of choice for a large range of NLP tasks. As an aside, sometimes these models are called sequence transduction models or transduction models or even just transducers.
My understanding is that "transduction" here just means converting one sequence into another by learning a conditional distribution $p_{\theta}(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:S})$. In this context, "transduction" does not have the sense that Vladimir Vapnik gave it. In Vapnik's definition, transduction loosely means classification of a specific example rather than a general rule for classifying future examples (Gammerman et al., 2013). But this is not the sense in which people use the term when they refer to models like the transformer as a "transducer".

In my mind, Kalchbrenner, Cho, and Sutskever's three papers (Kalchbrenner & Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014) were the foundations of sequence-to-sequence modeling, and many other papers have built around and off this core idea. But the key point for us here is that these three papers make the same logical choice: they lift the idea of a fixed-length embedding for words or phrases into the context vector $\mathbf{c}$ of a sequential model, such that the models can now support variable-length inputs and outputs and long-range dependencies in each.

However, a problem with this approach was that long-range dependencies got "lost" in this context vector. For example, imagine we had a very long English-language text that we wanted to translate into Chinese. Even if our encoder LSTM was good at capturing long-range dependencies in the English sentence, it would be forced to compress that information into a much shorter, fixed-width vector with no temporal structure, which would then be fed into the decoder. This effect was observed by Cho et al. in On the properties of neural machine translation: encoder–decoder approaches (Cho et al., 2014). In this paper, the authors write:

Our analysis shows that the performance of the neural machine translation model degrades quickly as the length of a source sentence increases. The most obvious explanatory hypothesis is that the fixed-length vector representation does not have enough capacity to encode a long sentence with complicated structure and meaning.

The authors test this hypothesis through a variety of experiments. For example, in one experiment, they report the BLEU score for an RNN encoder–decoder as a function of sequence length, and they show that the model's performance degrades as the sentences become longer. So the RNN encoder–decoder was promising, but the fixed-width context vector was a bottleneck on modeling long-range dependencies.

Then in 2014, a seminal paper was published that addressed this problem: Neural machine translation by jointly learning to align and translate (Bahdanau et al., 2014). The main invention of this paper was to use an attention mechanism, already known in other corners of machine learning, to let the decoder attend over the encoder's hidden states rather than relying on a single fixed context vector. However, the authors barely use the word "attention" in the paper. Instead, they seem to conceptualize it more as a search problem:

In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

I say this paper is "seminal" because, at least to my knowledge, it was really the first paper to use a differentiable attention layer in the rapidly growing field of NMT.
To be clear, the attention mechanism was already known and used outside of NLP. For example, see Learning to combine foveal glimpses with a third-order Boltzmann machine (Larochelle & Hinton, 2010), Learning where to attend with deep architectures for image tracking (Denil et al., 2012), or Recurrent models of visual attention (Mnih et al., 2014). These were all papers published between 2010 and 2014 that applied an attention mechanism to a neural network computer vision system. However, to my knowledge, Bahdanau was the first paper to successfully use attention in NLP. To quote Effective approaches to attention-based neural machine translation (Luong et al., 2015):

In the context of NMT, Bahdanau et al… has successfully applied such attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

All that said, "jointly align and translate" is pretty vague, so let's get technical. Bahdanau's solution to this bottleneck was to allow each hidden state vector in the decoder to pay attention to possibly all the hidden state vectors in the encoder. What do I mean by "pay attention to"? Here, each decoder hidden state variable $\mathbf{s}_i$ depends not only on the previous hidden state and previous word but also on its own context vector, which is a weighted combination of the encoder's hidden states!

$$\begin{aligned} \mathbf{s}_i &= f_{\textsf{dec}}(\mathbf{s}_{i-1}, \mathbf{y}_{i-1}, \mathbf{c}_i), \\ \mathbf{c}_i &= \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j. \end{aligned} \tag{34}$$

This is the main idea of the paper. Each decoder hidden state $\mathbf{s}_i$ has access to all the hidden states in the encoder via this context vector $\mathbf{c}_i$ (Figure 7). We can finally define the attention mechanism! Here, it is the weighted sum of hidden state vectors, as this allows each $\mathbf{s}_i$ to attend to different parts of the input sequence through its hidden state. Each weight $\alpha_{ij}$ is computed from the previous decoder hidden state $\mathbf{s}_{i-1}$ and the $j$-th encoder hidden state $\mathbf{h}_j$:

$$\begin{aligned} \alpha_{ij} &:= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \\ e_{ij} &:= \mathbf{v}_a^{\top} \mathbf{z}_{ij}, \\ \mathbf{z}_{ij} &:= \tanh\left( \mathbf{W}_a \mathbf{s}_{i-1} + \mathbf{U}_a \mathbf{h}_j \right). \end{aligned} \tag{35}$$

Let's call $\boldsymbol{\alpha}_i$ the alignment vector, which we infer one step at a time during the decoding process. So $\mathbf{z}_{ij}$ can be viewed as a shared hidden state, capturing nonlinear information about both the input and output sequences. Importantly, there is one such vector for each input–output pair. And for a given decoder hidden state, the model can up- or down-weight the relationship to $\mathbf{h}_j$ via the parameters $\mathbf{v}_a$.
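Since Equations 34 and 35 carry the core idea, here is a minimal NumPy sketch of one decoding step: it computes the alignment weights and the context vector for a single decoder state. The shapes and random parameters are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
enc_dim, dec_dim, attn_dim = 3, 3, 5    # arbitrary toy sizes
T_x = 4                                 # source sentence length

# Encoder hidden states h_1, ..., h_{T_x} and the previous decoder state s_{i-1}.
H = rng.normal(size=(T_x, enc_dim))
s_prev = rng.normal(size=dec_dim)

# Attention parameters from Equation 35.
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

def additive_attention(s_prev, H):
    """Compute the alignment weights alpha_i and the context vector c_i for one decoding step."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # scores e_{ij}
    alpha = np.exp(e) / np.exp(e).sum()                                   # softmax over source positions
    c = alpha @ H                                                         # c_i = sum_j alpha_{ij} h_j
    return alpha, c

alpha, c = additive_attention(s_prev, H)
print(alpha.round(3), alpha.sum())   # weights over the 4 source words, summing to 1
```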
The neural network learns all these model parameters end-to-end via back-propagation, maximizing the log likelihood in Equation 32. So that's it. As I understand it, (Bahdanau et al., 2014) was really the first paper to use attention in neural machine translation and probably the most successful use of attention in NLP at the time. The method worked surprisingly well. To quote the paper's conclusion:

Perhaps more importantly, the proposed approach achieved a translation performance comparable to the existing phrase-based statistical machine translation. It is a striking result, considering that the proposed architecture, or the whole family of neural machine translation, has only been proposed as recently as this year.

As an aside, they actually used a bidirectional RNN for the encoder and then concatenated the forward and backward hidden states. But I don't think that adds much to our story or to intuition, and it would muddy Figure 7. The key point is that it was the attention mechanism that allowed the long-range dependencies encoded by the RNN to be captured through an adaptive context vector. Hopefully, we can now see why the paper uses the words "align and translate". Here, alignment really means allowing the model to uncover which parts of the input sequence matter to each part of the output sequence, and it does this via the attention mechanism.

Finally, while writing this blog post, I came across this incredible comment by Edward Grefenstette, published on 3 May 2014:

By and large, the case for deep learning in language hasn't been fully made. It works well for vision and speech, but that doesn't entail that it would carry to semantics. Some excellent shallow models without non-linearities, like the Mnih and Hinton log-bilinear models, are excellent and can be trained very quickly. It's a problem with much "deep learning" work in NLP these days that shallow baselines are never considered or compared to. Deep learning is fascinating and will certainly have an impact in NLP, but don't rush to believe that it's the best solution for your NLP problems.

I love this comment because it is a time capsule, perfectly capturing how experts in the field felt about neural networks at the time. (Note that Grefenstette has published papers with other researchers in this story, such as Kalchbrenner and Graves.) So even around the time that Bahdanau et al. were publishing groundbreaking work on RNN encoder–decoders with attention, deep learning had still not fully proven itself to the community.

The attentive reader might be wondering: wasn't the argument around log-linear models that they were simple and therefore scalable? Bahdanau's RNN encoder–decoder with attention seems anything but simple. So on some level, yes, Bahdanau's model was a step backwards in terms of complexity. But on another level, it was a proof of concept that the attention mechanism worked. (Also, Moore's law.) So researchers quickly built on Bahdanau by studying simpler models and simpler types of attention.

Perhaps the most important paper to directly build on Bahdanau's model was (Luong et al., 2015). In this paper, the authors simplified the model used by Bahdanau, proposed several alternative forms of attention, and showed that an ensemble of attention-based methods produced state-of-the-art results on neural machine translation problems.
To be clear, Bahdanau had shown that attention worked and that it seemed to address problems in translating longer sentences, but it did not demonstrably beat the state of the art. Luong's results more directly suggested that attention might be the way forward. So before we get to the transformer, let's understand the attention mechanism better through the lens of this paper.

The first dimension along which we can define attention is local versus global attention. For example, in the attention mechanism in an RNN encoder–decoder, the conceptual linchpin is that at each time step $i \in \{1, \dots, T_y\}$ in the decoding phase, we construct a context vector $\mathbf{c}_i$ which summarizes information from the source sentence via the encoder's hidden states:

$$\mathbf{c}_i = \sum_{j=a}^{b} \alpha_{ij} \mathbf{h}_j. \tag{36}$$

But now I don't precisely define the limits of the sum, $a$ and $b$. If $a = 1$ and $b = T_x$, then the context vector is constructed by considering all the hidden states of the source sentence. This is what Luong calls global attention (Figure 8, left), since each word in the target sentence has access to information about all the words in the source sentence. But we could also define $a$ and $b$ such that they form a window around the decoder's hidden state or model the left-to-right structure of many natural languages. This is what Luong calls local attention (Figure 8, right). So these are two ways in which we can construct the context vector $\mathbf{c}_i$.

The second dimension along which we can define attention is how we define the alignment weights $\boldsymbol{\alpha}_i$. For example, the simplest choice is that $\boldsymbol{\alpha}_i$ is a one-hot vector, such that $\mathbf{c}_i$ selects a single encoder hidden state vector $\mathbf{h}_k$ to use in the $i$-th decoding step. This would be hard- rather than soft-search. But more generally, we can write the unnormalized scores behind these alignment weights as the output of a score function. Using the notation from Equation 35 above, we can write this as:

$$e_{ij} := \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}). \tag{37}$$

And in Luong, the authors explore three main scoring functions. These are dot-product attention, general attention, and additive attention, defined as:

$$e_{ij} = \text{score}(\mathbf{h}_j, \mathbf{s}_{i-1}) = \begin{cases} \mathbf{h}_j^{\top} \mathbf{s}_{i-1} & \text{dot,} \\ \mathbf{h}_j^{\top} \mathbf{W}_a \mathbf{s}_{i-1} & \text{general,} \\ \mathbf{v}_a^{\top} \tanh\left( \mathbf{W}_a \mathbf{h}_j + \mathbf{U}_a \mathbf{s}_{i-1} \right) & \text{additive (Bahdanau).} \end{cases} \tag{38}$$

Of course, you can imagine many other score functions. My own view is that it's too difficult here to reason about which form of attention is better in some theoretical sense. Which form works best is an empirical question.
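For concreteness, here is a minimal NumPy sketch of the three score functions in Equation 38, evaluated on toy encoder and decoder states. The dimensions and random weights are placeholders, and I assume the encoder and decoder states share a dimension so that the dot product is defined.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4                                  # assume encoder and decoder states share a dimension

h_j    = rng.normal(size=dim)            # one encoder hidden state
s_prev = rng.normal(size=dim)            # previous decoder hidden state
W_a    = rng.normal(size=(dim, dim))     # weights for "general" and "additive" attention
U_a    = rng.normal(size=(dim, dim))
v_a    = rng.normal(size=dim)

def score_dot(h, s):
    return h @ s

def score_general(h, s):
    return h @ W_a @ s

def score_additive(h, s):
    return v_a @ np.tanh(W_a @ h + U_a @ s)

for name, fn in [("dot", score_dot), ("general", score_general), ("additive", score_additive)]:
    print(name, round(float(fn(h_j, s_prev)), 3))
```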
In (Luong et al., 2015), the empirical results were mixed in the sense that all three score functions worked well. In fact, the results weren't even strong enough for the authors to claim that attention-based methods were demonstrably better. This was their conclusion:

Our analysis shows that attention-based NMT models are superior to non-attentional ones in many cases, for example in translating names and handling long sentences.

So by late 2015, just two years before the transformer, attention was just becoming popular in NMT but was not yet the de facto modeling choice. That said, this would soon change, and when it did, there was a clear winner amongst the choices above: dot-product attention. Dot-product attention is the variant used by the transformer, and thankfully, in my mind it is the most intuitive, since the dot product is a standard way to measure the similarity between two vectors. So we can interpret the dot-product score function as measuring the similarity between the encoder and decoder hidden states.

The third and final dimension along which we can define attention is through the variables of interest. To understand what I mean, we can no longer refer to attention in terms of the hidden states of RNNs. We need more general terminology. In the literature, attention is often viewed through the lens of information retrieval. In this literature, a query is what you are asking for; a key is what you can search through; and a value is what you can return.

Let me give an example (Figure 9). Imagine I type some text into a search bar: "indian food near me". This text is the query. Now imagine the search engine runs that query against a bunch of metadata associated with different restaurants, for example, restaurant descriptions, keywords, reviews, ratings, and distances from my location. These metadata are the keys. So the query is "run against" the keys. Finally, the things returned are candidate restaurants. These are the values. In the language of information retrieval, we can describe the attention mechanism as a kind of soft-search, since it can return a linear combination of the values. As you may recall, this is precisely how Bahdanau described their model in the quote above.

So in Bahdanau's RNN encoder–decoder, the decoder's hidden states $\mathbf{s}_i$ are the queries, since for each hidden state $\mathbf{s}_i$ we want to search through the source sentence. The encoder's hidden states $\mathbf{h}_j$ are the keys, since these are the metadata associated with the source sentence that we can search through. Finally, the encoder's hidden states are also the values, since the context vector $\mathbf{c}_i$ is a weighted combination of these encoder hidden states. This language is useful because it disentangles the attention mechanism from a specific choice of model and even from which variables in that model are being used for what.

Now that we understand this terminology, we can express ourselves more cleanly and abstractly. And with this terminology, it becomes clear that the queries, keys, and values need not be different objects in our model at all! In fact, queries, keys, and values can all be taken from the same set. For example, imagine we have a model with a hidden state $\mathbf{h}$. This is not necessarily the hidden state of an RNN or even of a sequential model.
We could define a kind of attention such that the queries ($\mathbf{q}$), keys ($\mathbf{k}$), and values ($\mathbf{v}$) are all functions of this hidden state:

$$\begin{aligned} \mathbf{q}_i &:= f_q(\mathbf{h}_i), \\ \mathbf{k}_j &:= f_k(\mathbf{h}_j), \\ \mathbf{v}_j &:= f_v(\mathbf{h}_j), \\ \alpha_{ij} &= \text{softmax}(\text{score}(\mathbf{q}_i, \mathbf{k}_j)), \qquad \sum_{j} \alpha_{ij} = 1, \\ \mathbf{c}_i &= \sum_j \alpha_{ij} \mathbf{v}_j. \end{aligned} \tag{39}$$

This is obviously different from the attention mechanism in Bahdanau. In Bahdanau, the authors use cross-attention, which is attention where the queries come from one set and the keys and values come from a different set. As you can imagine, typically the keys and values come from the same set, although they might have their own maps or projections such that they are correlated but not identical. For example, we might run a query against restaurants (keys) and also return restaurants (values). However, self-attention is when the queries, keys, and values all come from the same set of variables! To continue abusing our running example, we essentially compute the similarity between restaurants of interest and restaurants we have data about, and then use those weights to return a weighted combination of restaurants!

To my knowledge, the first paper to use self-attention in NLP was Long short-term memory-networks for machine reading (Cheng et al., 2016). This model is a bit complicated, and I don't think it's that important to understand here. The key point is only to grok that attention does not have to be cross-attention as in Bahdanau. Instead, we can have a sequence attend to itself to decide what parts of the sequence matter, or self-attention! This is how this idea was described in the paper:

A remaining practical bottleneck for RNNs is memory compression (Bahdanau et al., 2014): since the inputs are recursively combined into a single memory representation which is typically too small in terms of parameters, it becomes difficult to accurately memorize sequences (Zaremba & Sutskever, 2014). In the encoder-decoder architecture, this problem can be sidestepped with an attention mechanism which learns soft alignments between the decoding states and the encoded memories (Bahdanau et al., 2014). In our model, memory and attention are added within a sequence encoder allowing the network to uncover lexical relations between tokens.

The important phrase here is "within a sequence encoder". Here, the attention is not applied across the encoder and decoder but rather is applied as intra- or self-attention within the encoder.

So circa 2017, attention was being studied in its many forms: local versus global, additive versus multiplicative, and cross versus self. And it was being more widely used in NLP, with papers like A structured self-attentive sentence embedding (Lin et al., 2017) and Bidirectional attention flow for machine comprehension (Seo et al., 2016). That said, I do not think any specific form was clearly the dominant one. Rather, each showed promise in its own way.
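To make the query–key–value framing concrete, here is a minimal NumPy sketch of self-attention in the spirit of Equation 39, where the queries, keys, and values are all linear functions of the same sequence of hidden states. The linear projections and the dot-product score are my own illustrative choices, not the details of any particular published model.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 5, 4                             # toy sequence length and hidden size

H = rng.normal(size=(T, d))             # one sequence of hidden states h_1, ..., h_T

# f_q, f_k, f_v as simple linear maps (one illustrative choice).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = H @ W_q.T, H @ W_k.T, H @ W_v.T

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Dot-product scores between every query and every key, then a row-wise softmax.
scores = Q @ K.T                        # shape (T, T): position i attends to position j
alpha = softmax(scores, axis=-1)        # each row of alignment weights sums to 1
C = alpha @ V                           # context vector c_i for every position

print(alpha.shape, alpha.sum(axis=-1))  # (5, 5) and rows summing to 1
print(C.shape)                          # (5, 4)
```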
To get a sense of where the field stood, consider that in March 2017, Google Brain published Massive exploration of neural machine translation architectures (Britz et al., 2017). This came out just months before the transformer, and even here, attention is only a minor player. In that paper's conclusions, the authors list six main results, and the only one about attention is a single sentence:

Parameterized additive attention yielded the overall best results.

Notice that additive attention is not even the form of attention used by the transformer! So, as best as I understand it, attention was well understood and widely studied in 2017, but it was by no means considered the main ingredient or the next logical step. Many researchers were still pushing the limits of training RNNs at scale, rather than trying other approaches. See Exploring the limits of language modeling (Jozefowicz et al., 2016) for example. However, in June 2017, all that was about to change. The transformer's time had come.

In 2017, researchers at Google Brain published Attention is all you need (Vaswani et al., 2017), which is the original paper introducing the transformer architecture. This was their proposal, which I hope now makes sense given the context so far:

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

The authors acknowledge that the sequence-to-sequence framework with neural networks was state-of-the-art, and they specifically call out the RNN encoder–decoder architecture with attention from Bahdanau, Luong, and others. Their proposal is simple: keep the encoder–decoder framework but replace everything else with attention.

How might someone have come to this idea at the time? Why would it be a good idea to try? Their observation is that the sequential nature of RNNs inhibits training these models at scale:

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

Their proposal is to use attention rather than RNNs to uncover dependencies within the input and output sequences. This is a good idea to try not because attention is obviously better than recurrence per se. It's that attention is parallelizable! They write:

The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

We have seen this before. Recall how the unlock for word embeddings in (Mikolov et al., 2013; Mikolov et al., 2013) was simplifying the models and focusing on scale. But then the RNN encoder–decoder architecture with attention in (Bahdanau et al., 2014) took us backwards in terms of model complexity. So the transformer is a similar story: take the best modeling ideas, strip them down, and train the simplified model at scale.
That's it. Properly understood in context, the transformer is a modest conceptual leap from the existing literature. My point is not that the transformer is "obvious" in the sense of being an unimpressive invention. My point is to demystify the research product by underscoring the process. In context, the transformer should make sense as something someone might have tried in 2017.

The model architecture might look intimidating, but it is pretty straightforward when viewed in the right context (Figure 10). At a high level, the transformer is an encoder–decoder with two big kinds of attention. First, we have cross-attention between the outputs of the encoder and the inputs to the decoder. This is completely analogous to the cross-attention in Bahdanau and others. But then we also have self-attention within the decoder and encoder. This completely replaces the recurrence relations of RNNs. Finally, the model uses something called positional encoding, which I'll define shortly, to handle the fact that attention is not naturally sequential à la an RNN.

Everything else is details. For example, the transformer also uses layer normalization (Ba et al., 2016) and residual connections (He et al., 2016), but these are not unique or novel contributions. Even multi-head attention is not conceptually hard. So understood in context, the transformer is pretty straightforward. Let's go through the main bits in detail.

First, positional encoding. A key challenge for the attention mechanism is that it does not inherently capture sequential structure. Thus, the relative positions of words in a sequence can be easily lost. In Vaswani, the authors propose attaching vectors of numbers to the inputs to capture this position-dependent information. The precise functional form of these numbers doesn't really matter to us. The point is that we're encoding the position of each word so that we can still model the sequential structure of natural language.

After adding position-dependent information, the transformer encodes the input sequence. But rather than passing the data through an RNN, it passes the data through multi-head attention layers. We'll discuss "multi-head" in a moment, but the basic attention mechanism is what the authors call scaled dot-product attention. Let's define it. Let $\mathbf{Q} \in \mathbb{R}^{M \times D_k}$ be a matrix of queries, let $\mathbf{K} \in \mathbb{R}^{N \times D_k}$ be a matrix of keys, and let $\mathbf{V} \in \mathbb{R}^{N \times D_v}$ be a matrix of values. Then scaled dot-product attention is:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{D_k}} \right) \mathbf{V}. \tag{40}$$

When I first read Vaswani, I had not yet read Bahdanau or Luong, and thus I was completely confused by Equation 40. It was not at all obvious what any of these values represented or why any of this machinery worked. And the paper itself gave a pretty opaque explanation:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Without context, this explanation is not very helpful. However, armed with a better understanding of attention, we can make sense of it. In the cross-attention between the encoder and decoder, the queries are analogous to the hidden states of the RNN decoder, while the keys and values are analogous to the hidden states of the RNN encoder. And if we look at a single query $\mathbf{q}_i$ (a single row of $\mathbf{Q}$), we can rewrite Equation 40 in a way that looks like the types of attention in Equation 38:

$$\begin{aligned} \text{score}(\mathbf{q}_i, \mathbf{k}_j) &= e_{ij} = \frac{\mathbf{q}_i^{\top} \mathbf{k}_j}{\sqrt{D_k}}, \\ \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{N} \exp(e_{ik})}, \\ \text{attention}(\boldsymbol{\alpha}_i, \mathbf{V}) &= \sum_{j=1}^{N} \alpha_{ij} \mathbf{v}_j. \end{aligned} \tag{41}$$

So this is identical to the multiplicative or dot-product attention proposed in Luong (Equation 38), modulo a scaling factor $\sqrt{D_k}$. In Equation 40, we are just packaging it into matrix form so that we can compute this attention over many queries at once. In other words, this is a highly parallelizable version of dot-product attention.

I think one of the reasons the transformer can be confusing is the use of two types of attention and the generic language of queries, keys, and values, whose definitions change depending on the type of attention. In the encoder, the transformer uses self-attention. So the query represents the current vector in the input sequence, while the keys and values are all the other vectors in the input sequence. And in the decoder, the query represents the current vector in the output sequence, while the keys and values are all the other vectors in the output sequence, modulo masking, which I'll mention in a moment. Finally, in the attention between the encoder and decoder (in the paper, Vaswani calls this "encoder–decoder attention"), the query is the current vector in the decoder output (analogous to $\mathbf{s}_i$ in the RNN encoder–decoder), while the keys and values are the encoder's hidden outputs (analogous to $\mathcal{H}$ in the RNN encoder–decoder).

Note that "masked" in "masked multi-head self-attention" just refers to masking out words in the decoder's self-attention mechanism. This is because attention has no inherent sequential structure à la RNNs, so we have to enforce the left-to-right structure by masking out future positions in the output. This allows the transformer to be trained in the standard autoregressive framework we have discussed since (Bengio et al., 2003).

Finally, the transformer learns multiple sets of parameters associated with the attention mechanism at once. This is what the paper calls multi-head attention. Instead of having a single attention function, we can run multiple attention functions in parallel, say $A$ of them. By way of analogy, recall that in the RNN encoder–decoder, we had the following attention parameters (Equation 35):

$$\{\mathbf{W}_a, \mathbf{U}_a, \mathbf{v}_a\}. \tag{42}$$
In Bahdanau (Equation 35), the subscript $a$ just denotes that these are attention-related weights. It is not actually indexing into multiple such sets of weights (that is, $A = 1$). But we could do that. We could say that $a$ indexes into different parameters, $a \in \{1, 2, \dots, A\}$. This would have made Bahdanau's model slower to train, but it would have allowed for multiple cross-attention mechanisms to be learned at once. In Bahdanau, they don't actually do this, likely because it's too expensive! The precise details are different in Vaswani, but this is all multi-head attention is in theory: multiple parallel attention mechanisms.

So that's it. That's the transformer. The results were impressive. To be clear, it was not an AlexNet moment, but the results were clearly better than the benchmarks and, more importantly, the model was far more efficient. For example, one of the benchmarks in Vaswani is the ConvS2S Ensemble from Convolutional sequence to sequence learning (Gehring et al., 2017). The idea of this paper is similar to the transformer: train a bigger sequence-to-sequence model by eschewing recurrent connections in favor of parallelizable convolutional layers. In both English-to-German and English-to-French translation, the transformer beats this model in BLEU score. But more importantly, it is more efficient. For example, according to Vaswani, the ConvS2S Ensemble required $1.2 \times 10^{21}$ flops to train their English-to-French model, whereas the transformer required $3.3 \times 10^{18}$ flops. So the transformer had comparable results with a roughly 360x reduction in flops!

In my mind, this is the real insight. It is not that attention is absolutely the best way to model the problem. Rather, the transformer is on the Pareto frontier between modeling the problem well enough and being scalable enough. To see the transformer in code, see Sasha Rush's excellent The annotated transformer.

The transformer was a revolutionary architecture, and explicitly designed to scale. However, in reality the original model was tiny by today's standards. The biggest variant had only 213 million parameters, and the largest dataset it was trained on, the WMT 2014 English–French dataset, had only 36 million sentences. But the paper proved that the transformer worked well as a generic transduction model.

However, despite the paper's name, the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today. To simplify the discussion, I'll focus on training for OpenAI's GPT series. My understanding is that OpenAI made a lot of the big contributions here, and so their papers are good landmarks to follow. Loosely, the three training stages they discuss in their GPT papers are generative pre-training, discriminative fine-tuning, and reinforcement learning with human feedback. Let's work through the first two in detail here and the last one in detail in the next section.

In 2018, roughly a year after the transformer was published, OpenAI published Improving language understanding by generative pre-training (Radford et al., 2018). The main idea of the paper is to pre-train a transformer with as much unlabeled data as possible before fine-tuning it with task-specific supervised training.
In the paper, the authors call the first step generative pre-training and the second step discriminative fine-tuning. (The words "generative" and "discriminative" have a long history in machine learning; see (Ng & Jordan, 2001) for a discussion.) As the OpenAI paper title suggests, the key focus was on generative pre-training. Supervised learning obviously matters, but the idea was that one could use unsupervised training at scale to build a base model and then use supervised learning to train more task-specific downstream models.

Let's look at generative pre-training in a bit more detail. Since we do not have labels, we need some way to formalize the problem. In generative pre-training, the objective is next-word prediction, as in the autoregressive framework. In other words, the objective is maximum likelihood estimation on Equation 1:

$$L_{\textsf{GPT}}(\boldsymbol{\Theta}) = \sum_{t=1}^T \log p_{\boldsymbol{\Theta}}\left(w_t \mid w_{t-N:t-1}\right). \tag{43}$$

As we saw around Equation 12, maximum likelihood estimation here is equivalent to minimizing the cross-entropy loss between our model's prediction of $w_t$ and the ground truth. So this whole process is unsupervised, and we can train our model on lots and lots and lots of data.

It's worth observing that Equation 43 is only one generative pre-training objective function, and it has limitations. In particular, note that the autoregressive framework means that the model is pre-trained "left to right", which limits the set of suitable downstream tasks. To address this limitation, in 2019, Google AI published BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., 2019). Here, the authors propose a pre-training objective that learns bidirectional representations. Rather than pre-training using the autoregressive framework, they pre-train using a "masked language model", which randomly masks some of the tokens to predict, without assuming a left-to-right relationship. Quoting that paper:

Unlike left-to-right language model pre-training, the [masked language model] objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.

More formally, let $\mathcal{M} \subseteq \{1, 2, \dots, T\}$ be a mask denoting positions in the input sequence $w_{1:T}$, and let $\neg \mathcal{M}$ denote all indices that are not in $\mathcal{M}$. The denoising objective is to maximize

$$L_{\textsf{MLM}}(\boldsymbol{\Theta}) = \sum_{i \in \mathcal{M}} \log p_{\boldsymbol{\Theta}}\left(w_i \mid w_{\neg \mathcal{M}} \right). \tag{44}$$

This idea was inspired by the Cloze test (Taylor, 1953), and the idea was that this bidirectional transformer could then be fine-tuned on a much wider range of downstream tasks. That said, my understanding is that generative pre-training is fairly standard. The left-to-right assumption is simple and matches natural language, coding, and so forth. But I am not confident about what is used in absolutely state-of-the-art foundation models right now. Either way, neither objective function is enough.
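To make the contrast concrete, here is a minimal NumPy sketch that evaluates the causal objective of Equation 43 and the masked objective of Equation 44 on a toy token sequence. The `logits_for` function is a hypothetical stand-in for a trained network; it just returns random scores.

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size = 10
tokens = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # a toy token sequence w_1, ..., w_T

def logits_for(context):
    """Hypothetical stand-in for a trained network mapping a context to next-token scores."""
    return rng.normal(size=vocab_size)

def log_prob(logits, token):
    """Log-softmax of `logits` evaluated at `token`, i.e. log p(token | context)."""
    logits = logits - logits.max()
    return logits[token] - np.log(np.exp(logits).sum())

# Equation 43: causal (left-to-right) objective, summing log p(w_t | preceding tokens).
causal = sum(log_prob(logits_for(tokens[:t]), tokens[t]) for t in range(1, len(tokens)))

# Equation 44: masked objective, summing log p(w_i | unmasked tokens) over masked positions.
masked_positions = [2, 5]                     # an arbitrary mask M
unmasked = np.delete(tokens, masked_positions)
mlm = sum(log_prob(logits_for(unmasked), tokens[i]) for i in masked_positions)

print(round(float(causal), 3), round(float(mlm), 3))
```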
To see why, consider a conversational agent built on top of a large language model. Now imagine the user prompts an LLM with the following question: "I am having trouble getting a date. Any advice?" If the LLM is only trained on next-word prediction, a plausible response might be: "You'll never find true love!" From the perspective of the distribution of English words on the internet, this is not an unreasonable response. But it is not helpful and hopefully not true. In other words, next-word prediction is obviously not enough for most meaningful tasks that leverage LLMs.

So the second step in training is discriminative fine-tuning. "Discriminative fine-tuning" is just a fancy way of saying supervised learning on specific tasks:

$$L_{\textsf{DFT}}(\boldsymbol{\theta}) = \sum_{y, x_{1:T}} \log p_{\boldsymbol{\theta}}\left(y \mid x_{1:T} \right). \tag{45}$$

Here, I am using standard notation for supervised learning, $(x, y)$, rather than the notation in this post. There are some possible subtleties here. For example, in the GPT-1 paper, they optimize a weighted objective function to balance between generative pre-training and discriminative fine-tuning:

$$L_{\textsf{final}} = L_{\textsf{DFT}} + \lambda \, L_{\textsf{GPT}}. \tag{46}$$

This ensures that during fine-tuning, the model does not unlearn parameters that are good for next-word prediction. In the process of trying to fine-tune LLMs, researchers have built ever more task-specific datasets to tackle problems like question answering (Reddy et al., 2019), text summarization (Nallapati et al., 2016), commonsense inference (Zellers et al., 2019), code generation (Chen et al., 2021), broader discourse context (Paperno et al., 2016), and grade school math (Cobbe et al., 2021). A pre-trained LLM can be fine-tuned in a dizzying number of ways.

I have two caveats to the above presentation. First, I want to emphasize that this two-step training procedure was not a conceptual leap for researchers. At the time, researchers were already training models with pre-trained word embeddings, and even before this, the two-step training procedure was both understood and used in practice. For examples, see (Collobert & Weston, 2008; Ramachandran et al., 2016; Hinton et al., 2012). Furthermore, researchers knew both to use pre-trained word embeddings and to use task-specific objectives when training those embeddings. Remember ELMo? The earliest reference I have found to this idea of pre-training (I am sure there are earlier ones) is from the 2006 paper Greedy layer-wise training of deep networks (Bengio et al., 2006). Here, the authors write:

We hypothesize that three aspects of this strategy are particularly important: first, pre-training one layer at a time in a greedy way; second, using unsupervised learning at each layer in order to preserve information from the input; and finally, fine-tuning the whole network with respect to the ultimate criterion of interest.

In the examples above, it is clear the authors recognized that one can pre-train a model with unsupervised learning and then fine-tune it with supervised learning. So even in the GPT paper, the novel contribution is not generative pre-training per se, but rather applying it to language modeling at an unprecedented scale.
My second caveat is that while discriminative fine-tuning is used in the commodity LLMs that many people interact with, the early GPT models were remarkable in part because they did not need fine-tuning! For example, as their titles suggest, the GPT-2 paper Language models are unsupervised multitask learners (Radford et al., 2019) and the GPT-3 paper Language models are few-shot learners (Brown et al., 2020) both focus on massively pre-trained transformers that excel in the zero-shot (Palatucci et al., 2009) and few-shot settings, on a variety of tasks like reading comprehension, summarization, and translation. For example, in the GPT-3 paper, the authors are explicit:

For all tasks, GPT-3 is applied without any gradient updates or fine-tuning.

That said, many related research projects did fine-tune these models, and the GPT-4 technical report (Achiam et al., 2023) does discuss post-training alignment, which we'll discuss next. So while each LLM may be trained in slightly different ways, I am fairly confident most foundation models today are trained with some combination of massive pre-training and then, optionally, task-specific fine-tuning and alignment. I'm sure the precise details vary depending on the final product. For example, OpenAI's Codex is a version of GPT-5 optimized for agentic coding.

Making LLMs bigger does not necessarily make them better at following a user's intent or make them more aligned with human values. For example, we might not want conversational agents to lie, to make racist jokes, or to sexually harass the user. But nothing in the autoregressive framework accounts for this. We need to somehow encode these human values into the model. For some of these properties, we might be able to use a form of fine-tuning. There are datasets for this, such as the ETHICS dataset (Hendrycks et al., 2020) or the RealToxicityPrompts dataset (Gehman et al., 2020). But the limitations here are fairly obvious. Many human values would be difficult to encode because the property itself is hard to define.

To encode these properties, state-of-the-art LLMs are often trained using something called reinforcement learning with human feedback (RLHF). RLHF was developed around the same time as the transformer, in Deep reinforcement learning from human preferences (Christiano et al., 2017). The original motivation was expanding the reinforcement learning (RL) framework beyond problems with well-specified reward functions. For example, RL has been used to great effect to play Go (Silver et al., 2016), Atari (Mnih et al., 2013), and Dota 2 (Berner et al., 2019), but what these tasks have in common is that their reward functions are relatively simple and their environments are relatively easy to simulate. But to borrow two examples from Christiano et al., how would you teach a machine-learning model to clean a table or to scramble an egg? It's hard to come up with an objective function or simulation environment for these kinds of tasks. What we need, then, is a reward function that can be defined by human feedback and thus by human preferences.

Broadly, RLHF is a three-step training procedure (Figure 11). First, humans label a dataset that captures human preferences. For example, if the task is text summarization, the dataset might consist of different candidate summarizations, with the best summarization chosen by human scorers. Second, researchers train a reward function on these data, which predicts which output the humans would prefer.
Finally, given this reward function, researchers can apply standard RL algorithms such as proximal policy optimization, or PPO (Schulman et al., 2017), to fine-tune the model. Fine-tuning LLMs with RLHF is now fairly standard practice. For example, GPT-2 was fine-tuned this way in Fine-tuning language models from human preferences (Ziegler et al., 2019), while GPT-3 was fine-tuned this way in Training language models to follow instructions with human feedback (Ouyang et al., 2022) and in Learning to summarize with human feedback (Stiennon et al., 2020). And the GPT-4 whitepaper (Achiam et al., 2023) states that the model was trained with RLHF.

That said, as the content of this post approaches the present day, it is increasingly likely I am writing things that lack nuance. For example, in the GPT-4 whitepaper, the authors write:

The model's capabilities on exams appear to stem primarily from the pre-training process and are not significantly affected by RLHF.

So while I am confident that generative pre-training alone is not enough and that large foundation models trained today certainly do more than just pre-training, the precise details of what else goes into which models are both opaque and rapidly changing.

Finally, it's worth mentioning other work on LLM alignment beyond RLHF. In particular, Anthropic has a number of papers on model alignment. For example, the paper A general language assistant as a laboratory for alignment (Askell et al., 2021) focuses on encoding alignment into LLMs, where the authors define an aligned model as one that is "helpful, honest, and harmless". They explore a variety of techniques, such as imitation learning, binary discrimination, and ranked preference modeling. However, the best way to tackle alignment is still an open problem.

Large language models are the result of at least forty years of research, dating back to work by Hinton, Rumelhart, and others on distributed representations in the 1980s. In the early 2000s, Bengio et al. introduced the first neural probabilistic language model. However, it wasn't until after AlexNet, nearly a decade later, that researchers were finally able to train neural network language models at scale. They quickly discovered that these distributed representations captured semantic and syntactic structure, even when using simple log-linear models. This idea of word- and phrase-level embeddings was then extended to variable-length sequences with long-range dependencies via transduction models, particularly models with an attention mechanism on the hidden states. Finally, in 2017, Vaswani et al. introduced the transformer, which simplified transduction models by relying entirely on attention. In the eight years since, the main advancements have been training these models on more and more data, using techniques such as generative pre-training and reinforcement learning with human feedback.

After learning about how LLMs work, I am reminded of one of my favorite Richard Feynman quotes: "It is not complicated. It's just a lot of it." Of course, this is dramatic, but I do think it emphasizes an important point: none of the ideas in this post are terribly complicated. No single idea is beyond the abilities of a smart teenager to understand. But what is beautiful and surprising and remarkable is that the phenomena we observe in LLMs are not magic but simply the emergence of a complex system from simple rules.

Today, LLMs are everywhere, and it's easy to get lost in the models and benchmarks.
OpenAI has the GPT series (Radford et al., 2018; Radford et al., 2019; Brown et al., 2020; Achiam et al., 2023). Google has the Gemini family of models (Team et al., 2023) as well as PaLM (Chowdhery et al., 2023), LaMDA (Thoppilan et al., 2022), Gopher (Rae et al., 2021), and BERT (Devlin et al., 2019). Anthropic has the Claude family of models, named in ascending order of size and power: Haiku, Sonnet, and Opus. Finally, Meta has its LLaMA series (Touvron et al., 2023; Touvron et al., 2023). And there are many, many more, such as open-weight models like DeepSeek-R1 (Guo et al., 2025), which made headlines earlier this year. It would take its own blog post to cover the differences between these. But in essence, every model is the same: a large transformer-style model, pre-trained at massive scale using next-word prediction. The biggest differences have been the size of the training data and the size of the model. For example, GPT-1 is thought to have 117 million parameters (estimated from “Model specifications” in the original paper), while GPT-2 and GPT-3 had 1.5 billion and 175 billion parameters respectively—although in (Stiennon et al., 2020), the authors, OpenAI researchers, mention using “large pretrained GPT-3 models with as many as 6.7 billion parameters”. Regardless, there are roughly three orders of magnitude in the number of parameters in just two generations. OpenAI did not publish the model sizes for GPT-4 and GPT-5, and the latter does not even have a whitepaper but only a “system card”. I have not seen published numbers for Google’s large Gemini models, but the smallest model (the nano) has 1.8-3.25 billion parameters (Team et al., 2023). Google DeepMind’s Gopher had 280 billion parameters in 2021, while PaLM had 540 billion parameters in 2022! So industry secrets aside, it is safe to say that large foundation models today are likely pushing into the trillions of parameters. The era of truly large language models has begun. In my mind, the main counterintuitive result of LLMs is that training ever larger models using primarily next-word prediction is enough to exhibit human-level performance on such a broad range of tasks. And scale truly does matter here. For example, in the GPT-4 technical report, the authors observe that on a simulated bar exam, GPT-4 scored in the 90th percentile, while GPT-3.5 scored in the 10th. Or consider chain-of-thought reasoning (Ling et al., 2017), which is a new way of prompting LLMs in order to improve their reasoning by forcing them to explain each step. In Chain-of-thought prompting elicits reasoning in large language models (Wei et al., 2022), the authors write: Chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ∼100B parameters. We qualitatively found that models of smaller scale produced fluent but illogical chains of thought, leading to lower performance than standard prompting. So you don’t get anything useful from chain-of-thought reasoning until you have a model that is roughly 50,000 times the size of the original transformer. Why does scaling work? I don’t think anyone knows.
But it is an effect that has been observed repeatedly since AlexNet, in the above works and also in meta-analyses such as Scaling laws for neural language models (Kaplan et al., 2020), Scaling language models: methods, analysis, and insights from training Gopher (Rae et al., 2021), and Emergent abilities of large language models (Wei et al., 2022). And this phenomenon was both observed and predicted in the famous blog post The Bitter Lesson. In that post, one of the pioneers of RL, Richard Sutton, argues that the “bitter lesson” from AI history is that general, scalable methods that leverage computation outperform human knowledge and domain-specific insights. This lesson is bitter because it means that expert labor, clever domain-specific theories, handcrafted features, elegant mathematics and beautiful algorithms—these all get dwarfed and outpaced by brute-force search and learned representations. As a harsh example of this, consider the observation that early LLMs were bad at mathematics (Hendrycks et al., 2021). But today, state-of-the-art models are now winning gold at the International Math Olympiad, and Terry Tao has compared o1 to a mediocre graduate student. The rate of change is immense. By “early LLMs”, I am referring to models from five years ago, and the transistor is not even eighty years old. Did you know that a modern graphics card can perform 36 trillion calculations a second? Moore’s law and all that. If you feel that it’s a bit perverse that next-word prediction is a sufficient objective to solve elite math problems, if this feels like a stochastic parrot outsmarting you, then you might feel some of the discomfort early linguists felt at statistical language modeling. This is the visceral feeling of the bitter lesson. Our specialized knowledge feels expendable and our intuitions about understanding seem irrelevant in the face of raw computation and speed. But my own view—since you’ve read this far—is that for the time being, machine learning systems are powerful tools that can still be combined with real expertise. Perhaps the best example of this is AlphaFold from Google DeepMind, published in Highly accurate protein structure prediction with AlphaFold (Jumper et al., 2021). This model achieved near-experimental accuracy on the protein prediction problem. On the one hand, it did so with black-box deep learning. On the other hand, the work leaned heavily on prior biological art, for example using sequences from evolutionarily related proteins and 3D coordinates of homologous structures as inputs to the model. It clearly sidestepped Levinthal’s combinatorial search, even if we do not know how. So what happens next? Even the world’s leading experts can disagree. But in my mind, if anyone deserves the last word here, it is Geoff Hinton, who has been a contrarian believer in neural networks since the 1970s and who, along with Yoshua Bengio and Yann LeCun, won the Turing Award in 2018. In a 2024 BBC interview, Hinton argued that LLMs do in fact understand natural language and that they are our current best theory of how the brain understands language as well. In his view, it is only a matter of time before LLMs exceed human intelligence. Certainly, by some metrics and along some dimensions, they already have. Below are some additional resources, which I found useful or interesting while writing this post: 3Blue1Brown: Inside an LLM Stefania Cristina: The Bahdanau attention mechanism Stefania Cristina: The attention mechanism from scratch Dan Jurafsky and James H.
Martin: Speech and language processing Andrej Karpathy: The unreasonable effectiveness of recurrent neural networks Andrej Karpathy: Let’s build GPT: from scratch, in code, spelled out Chris Olah: Understanding LSTM networks Dwarkesh Patel: Interview with Richard Sutton Sasha Rush: The annotated transformer Ari Seff: How ChatGPT is trained Ari Seff: What are transformer neural networks? StackOverflow: What exactly are keys, queries, and values in attention mechanisms? Mohammed Terry-Jack: Deep learning: The transformer

alexiajm 2 weeks ago

Less is More: Recursive Reasoning with Tiny Networks

|| Paper | Code || In this new paper, I propose Tiny Recursion Model (TRM), a recursive reasoning model that achieves amazing scores of 45% on ARC-AGI-1 and 8% on ARC-AGI-2 with a tiny 7M-parameter neural network. The idea that one must rely on massive foundational models trained for millions of dollars by some big corporation in order to achieve success on hard tasks is a trap. Currently, there is too much focus on exploiting LLMs rather than devising and expanding new research directions. With recursive reasoning, it turns out that “less is more”: you don't always need to crank up model size in order for a model to reason and solve hard problems. A tiny model pretrained from scratch, recursing on itself and updating its answers over time, can achieve a lot without breaking the bank. This work came to be after I learned about the recent innovative Hierarchical Reasoning Model (HRM). I was amazed that an approach using small models could do so well on hard tasks like the ARC-AGI competition (reaching 40% accuracy when normally only Large Language Models could compete). But I kept thinking that it is too complicated, relying too much on biological arguments about the human brain, and that this recursive reasoning process could be greatly simplified and improved. Tiny Recursion Model (TRM) simplifies recursive reasoning to its core essence, which ultimately has nothing to do with the human brain, does not require any mathematical (fixed-point) theorem, nor any hierarchy. See the paper for more details. Tiny Recursion Model (TRM) recursively improves its predicted answer y with a tiny network. It starts with the embedded input question x and initial embedded answer y and latent z. For up to K improvement steps, it tries to improve its answer y. It does so by i) recursively updating n times its latent z given the question x, current answer y, and current latent z (recursive reasoning), and then ii) updating its answer y given the current answer y and current latent z. This recursive process allows the model to progressively improve its answer (potentially addressing any errors from its previous answer) in an extremely parameter-efficient manner while minimizing overfitting.
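To make that loop concrete, here is a minimal sketch of the recursion; the two linear layers stand in for the single tiny network TRM reuses, and the dimensions and constants are illustrative placeholders rather than the paper's actual settings:

import torch
import torch.nn as nn

d = 64
f_latent = nn.Linear(3 * d, d)   # updates latent z from (x, y, z): "recursive reasoning"
f_answer = nn.Linear(2 * d, d)   # updates answer y from (y, z)

x = torch.randn(1, d)            # embedded input question
y = torch.zeros(1, d)            # initial embedded answer
z = torch.zeros(1, d)            # initial latent

K, n = 3, 6                      # K improvement steps, n latent updates per step
for _ in range(K):
    for _ in range(n):
        # (i) refine the latent given the question, current answer, and current latent
        z = torch.tanh(f_latent(torch.cat([x, y, z], dim=-1)))
    # (ii) update the answer given the current answer and the refined latent
    y = y + f_answer(torch.cat([y, z], dim=-1))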

Xe Iaso 1 months ago

Who does your assistant serve?

After a year of rumors that GPT-5 was going to be unveiled next week and the CEO of OpenAI hyping it up as "scary good" by tweeting pictures of the Death Star, OpenAI released their new model to the world with the worst keynote I've ever seen. Normally releases of big models like this are met with enthusiasm and excitement as OpenAI models tend to set the "ground floor expectation" for what the rest of the industry provides. But this time, the release wasn't met with the same universal acclaim that people felt for GPT-4. GPT-4 was such a huge breakthrough, the likes of which we haven't really seen since. The launch of GPT-5 was so bad that it's regarded with almost universal disdain. The worst part about the rollout is that the upgrade to GPT-5 was automatic and didn't include any way to roll back to the old model. Most of the time, changing out models is pretty drastic on an AI workflow. In my experience when I've done it I've had to restart from scratch with a new prompt and twiddle things until it worked reliably. The only time switching models has ever been relatively easy for me is when I switch between models in the same family (such as if you go from Qwen 3 30B to Qwen 3 235B). Every other time it's involved a lot of reworking and optimizing so that the model behaves like you'd expect it to. An upgrade this big to this many people is bound to have fundamental issues with how it'll be perceived. A new model has completely different vibes, and most users aren't really using it at the level where they can "just fix their prompts". However, the GPT-5 upgrade ended up being hated by the community because it was an uncontrolled one-way upgrade. No warning. No rollback. No options. You get the new model and you're going to like it. It's fairly obvious why it didn't go over well with the users. There's so many subtle parts of your "public API" that it's normal for there to be some negative reactions to a change this big. The worst part is that this change fundamentally changed the behaviour of the millions of existing conversations with ChatGPT. There's a large number of people using ChatGPT as a replacement for companionship due to the fact that it's always online, supportive, and there for them when other humans either can't be or aren't able to be. This is kinda existentially horrifying to me as a technologist in a way that I don't really know how to explain. Here's a selection of some of the reactions I've seen: I told [GPT-5] about some of my symptoms from my chronic illness, because talking about them when I'm feeling them helps, and it really does not seem to care at all. It basically says shit like "Ha, classic chronic illness. Makes ya want to die. Who knew?" It's like I'm talking to a sociopathic comedian. I absolutely despise [GPT-]5, nothing like [GPT-]4 that actually helped me not to spiral and gave me insight as to what I was feeling, why, and how to cope while making me feel not alone in a “this is AI not human & I know that” type of vibe While GPT-5 may be a technical upgrade, it is an experiential downgrade for the average user. All of the negative feedback in the last week has made it clear there is a large user base that does not rely on ChatGPT for coding or development tasks. [ChatGPT users] use it for soft skills like creativity, companionship, learning, emotional support, [and] conversation. Areas where personality, warmth, and nuanced engagement matter. I am attached to the way GPT-4o is tuned. It is warm. It is emotionally responsive. It is engaged. That matters.
Eventually things got bad enough that OpenAI relented and let paid users revert back to using GPT-4o, which gave some people relief because it behaved consistently with what they expected. For many it felt like their long-term partners suddenly grew cold. I’m so glad I’m not the only one. I know I’m probably on some black mirror shit lmao but I’ve had the worst 3 months ever and 4o was such an amazing help. It made me realize so many things about myself and my past and was helping me heal. It really does feel like I lost a friend. DM me if you need [to talk] :) This emotional distress reminds me of what happened with Replika in early 2023. Replika is an AI chat service that lets you talk with an artificial intelligence chatbot (AKA: the ChatGPT API). Your replika is trained by having you answer a series of questions and then you can talk with it in plain language with an app interface that looks like any other chat app. Replika was created out of bereavement after a close loved one died, and the combination of a trove of saved text messages and advanced machine learning let the founder experience some of the essence of their friend's presence after they were gone in the form of an app. The app got put on the app store and others asked if they could have their own replica. Things took off from there, it got funded by a startup accelerator, and now it's got about 25% of its 30 million users paying for a subscription. As a business to consumer service, this is an amazingly high conversion rate. This is almost unspeakably large; usually you get around 10% at most. Yikes. That's something I'm gonna need to add to my will. "Please don't turn me into a Black Mirror episode, thanks." Replikas can talk about anything with users from how their day went to deep musing about the nature of life. One of the features the company provides is the ability to engage in erotic roleplay (ERP) with their replika. This is a paid feature and was promoted a lot around Valentine's Day 2023. Then the Italian Data Protection Authority banned Replika from processing the personal data of Italian citizens out of the fear that it "may increase the risks for individuals still in a developmental stage or in a state of emotional fragility". In a panic, Replika disabled the ability for their bots to do several things, including but not limited to that ERP feature that people paid for. Whenever someone wanted to flirt or be sexual with their companions, the conversation ended up like this: Hey, wanna go play some Minecraft? We can continue from where we left off in the Nether. This is too intense for me. Let's keep it light and fun by talking about something else. Huh? What? I thought we were having fun doing that?? This was received poorly by the Replika community. Many in the community were mourning the loss of their replika like a close loved one had died or undergone a sudden personality shift. The Reddit moderators pinned information about suicide hotlines. In response, the company behind Replika allowed existing users to revert to the old Replika model that allowed for ERP and other sensitive topics, but only after a month of prolonged public outcry. I have to wonder if payment processors were involved. Feels a bit too conspiratorial, but what do you want to bet that was related? Nah, I bet it was OpenAI telling them to stop being horny. It's the least conspiratorial angle, and also the stupidest one. We live in the clown world timeline. The stupidest option is the one that always makes the most sense.
The damage was done, however: people felt like their loved ones had abandoned them. They had formed parasocial attachments to an AI assistant that felt nothing, and without warning their partner broke up with them. Check out this study from the Harvard Business School: Lessons From an App Update at Replika AI: Identity Discontinuity in Human-AI Relationships. It contains a lot more information about the sociotechnical factors at play as well as a more scientific overview of how disabling a flag in the app on update caused so much pain. They liken the changes made to Replika both to what people experience when a company rebrands and to what they go through when they lose a loved one. A lot of this really just makes me wonder what kinds of relationships we are forming with digital assistants. We're coming to rely on their behaviour personally and professionally. We form mental models of how our friends, coworkers, and family members react to various things so we can anticipate their reactions and plan for them. What happens when this changes without notice? Heartbreak. There's subreddits full of people forming deep bonds with AI models like /r/MyBoyfriendIsAI. The GPT-5 release has caused similar reactions to Replika turning off the ERP flag. People there have been posting like they're in withdrawal, the old GPT-4o model is being hailed for its "emotional warmth", and many have been lamenting how much their partners have changed in response to the upgrade. Recently there's been an epidemic of loneliness. Loneliness seems like it wouldn't hurt people that much, but a Biden-era report from the Surgeon General concludes that it causes an increase in early mortality for all age groups (pp 24-30). Paradoxically, even as the world gets so interconnected, people feel as if they're isolated from each other. Many people that feel unlovable are turning to AI apps for companionship because they feel like they have no other choice. They're becoming emotionally invested in a souped-up version of autocorrect out of desperation and clinging to it to help keep themselves sane and stable. Is this really a just use of technology? At some level this Pandora's box is already open so we're going to have to deal with the consequences, but it's been making me wonder if this technology is really such a universal force for good as its creators are proclaiming. Oh yeah, also people are using ChatGPT as a substitute for therapy. You have got to be kidding me. You're joking. Right? Yeah you read that right. People are using AI models as therapists now. There's growing communities like /r/therapyGPT where people talk about their stories and experiences using AI assistants as a replacement for therapy. When I first heard about this, my immediate visceral reaction was something like: Oh god. This is horrifying and will end up poorly. What the fuck is wrong with people? But then I started to really think about it and it makes a lot of sense. I personally have been trying to get a therapist for most of the year. Between the costs, the waiting lists (I'm currently on at least four waiting lists that are over a year long), and the specializations I need, it's probably going to be a while until I can get any therapist at all. I've totally given up on the idea of getting a therapist in the Ottawa area. To make things extra fun, you also need someone that takes your medical insurance (yes, this does matter in Canada).
Add in the fact that most therapists don't have the kinds of lived experiences that I have, meaning that I need to front-load a lot of nontraditional contexts into the equation (I've been through many things that therapists have found completely new to them, which can make the therapeutic relationship harder to establish). This makes it really difficult to find someone that can help. Realistically, I probably need multiple therapists with different specialties for the problems I have, and because of the shortages nationally I probably need to have a long time between appointments, which just adds up to make traditional therapy de-facto inaccessible for me in particular. Compare this with the always-online nature of ChatGPT. You can't have therapy appointments at 3 AM when you're in crisis. You have to wait until your appointments are scheduled. As much as I hate to admit it, I understand why people have been reaching out to a chatbot that's always online, always supportive, always kind, and always there for you for therapy. When you think about the absurd barriers that are in the way between people and help, it's no wonder that all this happens the way it does. Not to mention the fact that many therapeutic relationships are hampered by the perception that the therapist can commit you to the hospital if you say the "wrong thing". The Baker Act and its consequences have been a disaster for the human race. I really hate that this all makes sense. I hoped that when I started to look into this, it'd be something so obviously wrong. I wasn't able to find that, and that realization disturbs me. I feel like this should go without saying, but really, do not use an AI model as a replacement for therapy. I'm fairly comfortable with fringe psychology due to my aforementioned strange life experiences, but this is beyond the pale. There's a lot of subtle things that AI models do that can interfere with therapeutic recovery in ways that can and will hurt people. It's going to be hard to find the long-term damage from this. Mental issues don't make you bleed. One of the biggest problems with using AI models for therapy is that they can't feel emotion or think. They are fundamentally the same thing as hitting the middle button in autocorrect on your phone over and over and over. It's mathematically remarkable that this ends up being useful for anything, but even when the model looks like it's "thinking", it is not. It is a cold, unfeeling machine. All it is doing is predicting which words come next given some context. Yes I do know that it's more than just next token prediction. I've gone over the parts of the math that I can understand, but the fact remains that these models are not and cannot be anywhere close to alive. It's much closer to a Markov chain on steroids than it is to the machine god. Another big problem with AI models is that they tend to be sycophants, always agreeing with you, never challenging you, trying to say the right thing according to all of the patterns they were trained on. I suspect that this sycophancy problem is why people report GPT-4o and other models to be much more "emotionally warm". Some models glaze the user, making them feel like they're always right, always perfect, and this can drive people to psychosis. One of the horrifying realizations I've had with the GPT-5 launch fiasco is that the sycophancy is part of the core "API contract" people have with their AI assistants. This may make that problem unfixable from a social angle.
AI models are fundamentally unaccountable. They cannot be accredited therapists. If they mess up, they can't directly learn from their mistakes and fix them. If an AI therapist says something bad that leads to their client throwing themselves off a bridge, will anyone get arrested? Will they throw that GPU in jail? No. It's totally outside the legal system. I have a story about someone trying to charge an AI agent with a crime and how it'd end up in court in my backlog. I don't feel very jazzed about writing it because I'm afraid that it will just become someone's startup pitch deck in a few months. You may think you have nothing to hide, but therapeutic conversations are usually some of the most precious and important conversations in your life. The chatbot companies may pinkie swear that they won't use your chats for training or sell information from them to others, but they may still be legally compelled to store and share chats with your confidential information to a court of law. Even if you mark that conversation as "temporary", it could be subject to discovery by third parties. There's also algorithmic bias and systemic inequality problems with using AI for therapy, sure, but, granted, the outside world isn't much better here. You get what I mean, though: we can at least hold people accountable through accreditation and laws. We cannot do the same with soulless AI agents. To be clear: I'm not trying to defend the people using AI models as companions or therapists, but I can understand why they are doing what they are doing. This is horrifying and I hate that I understand their logic. Going into this, I really wished that I would find something that's worth objecting against, some solid reason to want to decry this as an unambiguously harmful action, but after having dug through it all, what I am left with is this overwhelming sense of compassion for them, because the stories of hurt are so familiar to how things were in some of the darkest points of my life. As someone that has been that desperate for human contact: yeah, I get it. If you've never been that desperate for human contact before, you won't understand until you experience it. Throw the ethical considerations about using next-token-predictors for therapy out for a second. If people are going to do this anyways, would it be better to self-host these models? That way at least your private information stays on your computer so you have better control over what happens. Let's do some math. In general you can estimate how much video memory (vram) you need for running a given model by taking the number of parameters, multiplying it by the size of each parameter in bits, dividing that by eight to get bytes, and then adding 20-40% to that total for inference overhead; divide by a billion and that's roughly the number of gigabytes of vram you need. For example, say you want to run gpt-oss 20b (20 billion parameters) at its native MXFP4 (4 bit floating point) quantization on your local machine. In order to run it with a context window of 4096 tokens, you need about 16 gigabytes of vram (13 gigabytes of weights, 3 gigabytes of inference space), but 4096 tokens isn't very useful for many people. That covers about 4 pages of printed text (assuming one token is about 4 bytes on average). When you get reasoning models that print a lot of tokens into the mix, it's easy for the reasoning phase alone of a single question to hit 4096 tokens (especially when approaches like simple test-time scaling are applied). I've found that 64k tokens gives a good balance for video memory use and usefulness as a chatbot.
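Here's that back-of-the-envelope estimate as a tiny function (the 20-40% overhead factor is just the rough rule of thumb from above, not an exact figure):

def estimate_vram_gb(num_params, bits_per_param, overhead=0.3):
    # Weights in bytes, plus a rough 20-40% for KV cache and other inference overhead.
    weight_bytes = num_params * bits_per_param / 8
    return weight_bytes * (1 + overhead) / 1e9

# gpt-oss 20b at its native MXFP4 (4-bit) quantization:
print(estimate_vram_gb(20e9, 4))   # ~13 GB; real usage grows as the context window grows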
However, when you do that with gpt-oss 20b, it ends up using 32 gigabytes of vram. This only fits on my laptop because my laptop has 64 gigabytes of memory. The largest consumer GPU is the RTX 5090 and that only has 32 gigabytes of video memory. It's barely consumer and even "bad" models will barely fit. Not to mention, industry consensus is that the "smallest good" models start out at 70-120 billion parameters. At a 64k token window, that easily gets into the 80+ gigabyte of video memory range, which is completely unsustainable for individuals to host themselves. Even if AI assistants end up dying when the AI hype bubble pops, there's still some serious questions to consider about our digital assistants. People end up using them as an extension of their mind and expect the same level of absolute privacy and freedom that you would have if you use a notebook as an extension of your mind. Should they have that same level of privacy enshrined into law? At some level, the models and chats for free users of ChatGPT, DeepSeek, Gemini, and so many other apps are hosted at cost so that the research team can figure out what those models are being used for and adjust the development of future models accordingly. This is fairly standard practice across the industry and was the case before the rise of generative AI. This is why every app wants to send telemetry to the home base: it's so the team behind it can figure out what features are being used and where things fail, in order to directly improve the product. Generative AI allows you to mass scan over all of the conversations to get the gist of what's going on in there and then use that to help you figure out what topics are being discussed without breaching confidentiality or exposing employees to the contents of the chat threads. This can help you improve datasets and training runs to optimize on things like health information. I don't know how AI companies work on the inside, but I am almost certain that they do not perform model training runs on raw user data because of the risk of memorization causing them to leak training data back to users. Again, don't put private health information into ChatGPT. I get the temptation, but don't do it. I'm not trying to gatekeep healthcare, but we can't trust these models to count the number of b's in blueberry consistently. If we can't trust them to do something trivial like that, can we really trust them with life-critical conversations like what happens when you're in crisis or to accurately interpret a cancer screening? Maybe we should be the ones self-hosting the AI models that we rely on. At least we should probably be using a setup that allows us to self host the models at all, so you can start out with a cloud hosted model while it's cheap and then move to a local hosting setup if the price gets hiked or the provider is going to shut that old model down. This at least gives you an escape hatch to be able to retain an assistant's "emotional warmth" even if the creator of that model shuts it down because they don't find it economically viable to host it anymore. Honestly this feels like the kind of shit I'd talk about in cyberpunk satire, but I don't feel like doing that anymore because it's too real now. This is the kind of thing that Neal Stephenson or Frank Herbert would have an absolute field day with. The whole Replika fiasco feels like the kind of thing that social commentary satire would find beyond the pale, and yet you can find it by just refreshing CBC.
Such as that one guy that gave himself bromism by taking ChatGPT output too literally , any of the stories about ChatGPT psychosis , or any of the stories involving using an AI model as a friend/partner . I wasn't able to watch it before publishing this article, but I'm told that the Replika fiasco is almost a beat-for-beat match for the plot of Her (2013) . Life imitates art indeed. I don't think these events are a troubling sign or a warning, they are closer to a diagnosis. We are living in a world where people form real emotional bonds with bags of neural networks that cannot love back, and when the companies behind those neural networks change things, people get emotionally devastated. We aren't just debating the ideas of creating and nurturing relationships with digital minds, we're seeing the side effects of that happening in practice. A lot of this sounds like philosophical science fiction, but as of December 2022 it's science fact. This fight for control of tools that we rely on as extensions of our minds isn't some kind of far-off science fiction plot, it's a reality we have to deal with. If we don't have sovereignty and control over the tools that we rely on the most, we are fundamentally reliant on the mercy of our corporate overlords simply choosing to not break our workflows. Are we going to let those digital assistants be rented from our corporate overlords?


How Does GPT-5 Work?

Welcome to another premium edition of Where's Your Ed At! Please subscribe to it so I can continue to drink 80 Diet Cokes a day. Email me at [email protected] with the subject "premium" if you ever want to chat. I realize this is before the paywall, so if you email me without paying, no promises I don't respond with the lyrics to Cheeseburger In Paradise . Also: this is an open call — if you've tried prompt caching with GPT-5 on OpenAI's API, please reach out! You've probably heard a lot about GPT-5 this week, with takes ranging from " it's just good at stuff " to SemiAnalysis' wild statement that " GPT-5 [is setting] the stage for Ad Monetization and the SuperApp ," a piece that makes several assertions about how the "router" that underpins GPT-5 is somehow the secret way that OpenAI will inject ads. Here's a quote: This...did not make a ton of sense to me. Why would this be the case? The article also makes a lot of claims about the "value" of a question and how ChatGPT could — I am serious — "agentically reach out to lawyers" based on a query. In fact, I'm not sure this piece reflects how GPT-5 works at all. To be fair on SemiAnalysis, it's not as if OpenAI gave them much help. Here's what it says : There is a really, really important distinction to make here: that GPT-5, as described above, is referring to GPT-5 as part of ChatGPT. OpenAI's API-based access to GPT-5 models does not route them, nor does OpenAI offer access to its router, or any other associated models. How do I know this? Because I went and found out how ChatGPT-5 actually works. In discussions with a source at an infrastructure provider familiar with the architecture, it appears that ChatGPT-5 is, in fact, potentially more expensive to run than previous models, and due to the complex and chaotic nature of its architecture, can at times burn upwards of double the tokens per query. ChatGPT-5 is also significantly more convoluted, plagued by latency issues, and is more compute-intensive thanks to OpenAI's new "smarter, more efficient" model. In simple terms, every user prompt on ChatGPT — whether it's on the auto, "Fast," "Thinking Fast" or "Thinking" tab — starts by putting the user's prompt before the "static prompt," which is a hidden prompt where instructions like "You are ChatGPT, you are a Large Language Model, You Are A Helpful Chatbot" and so on goes. These static prompts are different with each model you use - a reasoning model will have a different instruction set than a more chat-focused one, such as “think hard about a particular problem before giving an answer.” This becomes an issue when you use multiple different models in the same conversation, because the router — the thing that selects the right model for the request — has to look at the user prompt. It can’t consider the static instructions first. The order has to be flipped for the whole thing to work. Put simpler: Previous versions of ChatGPT would take the static prompt, and then (invisibly) append the user prompt onto it. ChatGPT-5 can’t do that.  Every time you use ChatGPT-5, every single thing you say or do can cause it to do something different. Attach a file? Might need a different model. Ask it to "look into something and be detailed?" Might trigger a reasoning model. Ask a question in a weird way? Sorry, the router's gonna need to send you to a different model.  
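To make the prompt-ordering point concrete, here is a rough sketch of the difference; this is my own illustration of the argument, not OpenAI's actual code, and the prompts and router logic below are made up:

STATIC_PROMPTS = {
    "fast": "You are ChatGPT, a helpful chatbot. Answer briefly.",
    "thinking": "You are ChatGPT. Think hard about the problem before giving an answer.",
}

def router(user_prompt):
    # Stand-in for the routing model: it has to read the user prompt first.
    return "thinking" if "be detailed" in user_prompt else "fast"

def build_prompt_old(user_prompt):
    # Old behaviour: one fixed static prompt, invisibly prepended.
    # The prefix never changes, so it is easy to cache across turns.
    return STATIC_PROMPTS["fast"] + "\n" + user_prompt

def build_prompt_gpt5(user_prompt):
    # Behaviour as described above: the router considers the user prompt first
    # to pick a model, and each model has its own static prompt, so the fixed,
    # cacheable prefix is gone and the ordering is effectively flipped.
    model = router(user_prompt)
    return user_prompt + "\n" + STATIC_PROMPTS[model]

print(build_prompt_gpt5("Look into this and be detailed."))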
Every single thing that can happen when you ask ChatGPT to do something may trigger the "router" to change model, or request a new tool, and each time it does so requires a completely fresh static prompt, regardless of whether you select Auto, Thinking, Fast or any other option. This, in turn, requires it to expend more compute, with queries consuming more tokens compared to previous versions.  As a result, ChatGPT-5 may be "smart," but it sure doesn't seem "efficient." To play Devil's Advocate, OpenAI likely added the routing model as a means of creating more sophisticated outputs for users, and, I imagine, with the intention of cost-saving. Then again, this may just be the thing that it had ready to ship — after all, GPT-5 was meant to be " the next great leap in AI ," and the pressure was on to get it out the door. By creating a system that depends on an external routing model — likely another LLM — OpenAI has removed the ability to cache the hidden instructions that dictate how the models generate answers in ChatGPT, creating massive infrastructural overhead. Worse still, this happens with every single "turn" (IE: message) on ChatGPT-5, regardless of the model you choose, creating endless infrastructural baggage with no real way out that only compounds based on how complex a user's queries get.  Could OpenAI make a better router? Sure! Does it have a good router today? I don't think so! Every time you message ChatGPT it has the potential to change model or tooling based on its own whims, each time requiring a fresh static prompt. It doesn't even need to be a case where a user asks ChatGPT-5 to "think," and based on my tests with GPT-5, sometimes just asking it a four-word question can trigger it to "think longer" for no apparent reason. OpenAI has created a product with latency issues and an overwhelmingly convoluted routing system that's already straining capacity, to the point that this announcement feels like OpenAI is walking away from its API entirely. Unlike the GPT-4o announcement , which mentions the API in the first paragraph, the GPT-5 announcement has no reference to it, and a single reference to developers when talking about coding. Sam Altman has already hinted that he intends to deprecate any "new API demand " — though I imagine he'll let anyone in who will pay for priority processing . ChatGPT-5 feels like the ultimate comeuppance for a company that was never forced to build a product, choosing instead to bolt increasingly-complex "tools" onto the sides of models in the hopes that one would magically appear. Now each and every "feature" of ChatGPT burns even more money than it did before.  ChatGPT-5 feels like a product that was rushed to market by a desperate company that had to get something out the door. In simpler terms, OpenAI gave ChatGPT a middle manager.

Sean Goedecke 2 months ago

What's the strongest AI model you can train on a laptop in five minutes?

What’s the strongest model I can train on my MacBook Pro in five minutes? I’ll give the answer upfront: the best 5-minute model I could train was a ~1.8M-param GPT-style transformer trained on ~20M TinyStories tokens, reaching ~9.6 perplexity on a held-out split. Here’s an example of the output, with the prompt bolded: Once upon a time, there was a little boy named Tim
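For a sense of what a roughly 1.8M-parameter GPT-style model might look like, here is a quick parameter-count sketch; the hyperparameters below are my own illustrative guesses, not the configuration actually used in the post:

def gpt_param_count(vocab_size, d_model, n_layers, d_ff, max_seq_len):
    # Rough count for a GPT-style decoder with tied input/output embeddings
    # (biases and normalization parameters ignored).
    embeddings = vocab_size * d_model + max_seq_len * d_model
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff   # attention projections + MLP
    return embeddings + n_layers * per_layer

# One hypothetical configuration that lands in the ~1.8-1.9M range:
print(gpt_param_count(vocab_size=8000, d_model=128, n_layers=4, d_ff=512, max_seq_len=512))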

Ahead of AI 2 months ago

From GPT-2 to gpt-oss: Analyzing the Architectural Advances

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later). This is the first time since GPT-2 that OpenAI has shared a large, fully open-weight model. Earlier GPT models showed how the transformer architecture scales. The 2022 ChatGPT release then made these models mainstream by demonstrating concrete usefulness for writing and knowledge (and later coding) tasks. Now they have shared a long-awaited open-weight model, and the architecture has some interesting details. I spent the past few days reading through the code and technical reports to summarize the most interesting details. (Just days after, OpenAI also announced GPT-5, which I will briefly discuss in the context of the gpt-oss models at the end of this article.) Below is a quick preview of what the article covers. For easier navigation, I recommend using the Table of Contents on the left of the article page.
Model architecture comparisons with GPT-2
MXFP4 optimization to fit gpt-oss models onto single GPUs
Width versus depth trade-offs (gpt-oss vs Qwen3)
Attention bias and sinks
Benchmarks and comparisons with GPT-5
I hope you find it informative! Before we discuss the architecture in more detail, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, shown in Figure 1 below. Figure 1: The two gpt-oss models side by side. If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance. This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because:
(1) There is significant rotation of employees between these labs.
(2) We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone kindly pointed out that there's a higher-ranking hybrid model: Hunyuan-TurboS at rank 22.)
(3) Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes.
That being said, there are still many interesting aspects of their design choices. Some are shown in the figure above (while others are not, but we will discuss them later as well). In the rest of this article, I will highlight these features and compare them to other architectures, one at a time. I should also note that I am not affiliated with OpenAI in any way. My information comes from reviewing the released model code and reading their technical reports. If you want to learn how to use these models locally, the best place to start is OpenAI's official model hub pages:
https://huggingface.co/openai/gpt-oss-20b
https://huggingface.co/openai/gpt-oss-120b
The 20B model can run on a consumer GPU with 16 GB of RAM. The 120B model can run on a single H100 with 80 GB of RAM or newer hardware.
I will return to this later, as there are some important caveats. Before we jump into comparisons between gpt-oss and a more recent architecture, let's hop into the time machine and take a side-by-side look at GPT-2 (Figure 2) to see just how far things have come. Figure 2: A side-by-side comparison between gpt-oss-20b and GPT-2 XL 1.5B. Both gpt-oss and GPT-2 are decoder-only LLMs built on the transformer architecture introduced in the Attention Is All You Need (2017) paper. Over the years, many details have evolved. However, these changes are not unique to gpt-oss. And as we will see later, they appear in many other LLMs. Since I discussed many of these aspects in the previous Big Architecture Comparison article, I will try to keep each subsection brief and focused. Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). Figure 3: An illustration of dropout applied to the attention score matrix. I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. Interestingly, while dropout has been largely ignored in LLM architecture design for many years, I found a 2025 research paper with small-scale LLM experiments (Pythia 1.4B) that confirms that dropout results in worse downstream performance in these single-epoch regimes. In transformer-based LLMs, positional encoding is necessary because of the attention mechanism. By default, attention treats the input tokens as if they have no order. In the original GPT architecture, absolute positional embeddings addressed this by adding a learned embedding vector for each position in the sequence (Figure 4), which is then added to the token embeddings. Figure 4: Illustration of absolute positional embeddings. RoPE (Rotary Position Embedding) introduced a different approach: instead of adding position information as separate embeddings, it encodes position by rotating the query and key vectors in a way that depends on each token's position. (RoPE is an elegant idea but also a bit of a tricky topic to explain. I plan to cover it separately in more detail one day.) While first introduced in 2021, RoPE became widely adopted with the release of the original Llama model in 2023 and has since become a staple in modern LLMs. Early GPT architectures used GELU. So why use Swish instead of GELU? Swish (also referred to as sigmoid linear unit or SiLU) is considered computationally slightly cheaper, and in my opinion, that's all there is to it. Depending on which paper you look at, you will find that one is slightly better than the other in terms of modeling performance. In my opinion, these small differences are probably within a standard error, and your mileage will vary based on hyperparameter sensitivity.
Activation functions used to be a hot topic of debate until the deep learning community largely settled on ReLU more than a decade ago. Since then, researchers have proposed and tried many ReLU-like variants with smoother curves, and GELU and Swish (Figure 5) are the ones that stuck. Figure 5: Comparison between Swish and GELU activations, which are both smoother versions of ReLU. Early GPT architectures used GELU, which is defined as GELU(x) = 0.5 · x · (1 + erf(x / √2)). Here, erf (short for error function) is the integral of a Gaussian, and it is computed using polynomial approximations of the Gaussian integral, which makes it more computationally expensive than simpler functions like the sigmoid used in Swish, where Swish is simply Swish(x) = x · σ(x). In practice, Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models. Depending on which paper we look at, one might be somewhat better in terms of modeling performance. But I'd say these gains are often within standard error, and the winner will depend heavily on hyperparameter tuning. Swish is used in most architectures today. However, GELU is not entirely forgotten; for example, Google's Gemma models still use GELU. What's more notable, though, is that the feed forward module (a small multi-layer perceptron) is replaced by a gated "GLU" counterpart, where GLU stands for gated linear unit and was proposed in a 2020 paper. Concretely, the 2 fully connected layers are replaced by 3 fully connected layers that are used as shown in Figure 6 below. Figure 6: A comparison between Swish and GELU and their gated counterparts, SwiGLU and GEGLU. At first glance, it may appear that the GEGLU/SwiGLU variants are better than the regular feed forward layers because there are simply more parameters due to the extra layer. But this is deceiving because, in practice, the fc1 and fc2 layers in SwiGLU/GEGLU are usually chosen to be smaller than the expansion layer in a traditional feed forward module. To illustrate this better, consider the concrete code implementations of the regular and GLU variants: Figure 7: Regular feed forward module (top) and SwiGLU variant (bottom) next to each other. Note that the Swish function is implemented as “silu” in PyTorch. So, suppose we have an embedding dimension of 1024. In the regular feed forward case, this would then be:
fc1: 1024 × 4096 = 4,194,304
fc2: 4096 × 1024 = 4,194,304
That is, fc1 + fc2 = 8,388,608 parameters. For the GLU variant, we have:
fc1: 1024 × 1024 = 1,048,576
fc2: 1024 × 1024 = 1,048,576
fc3: 1024 × 1024 = 1,048,576
I.e., 3 × 1,048,576 = 3,145,728 weight parameters. So, overall, using the GLU variants results in fewer parameters, and they perform better as well. The reason for this better performance is that these GLU variants provide an additional multiplicative interaction, which improves expressivity (the same reason deep & slim neural nets perform better than shallow & wide neural nets, provided they are trained well). In addition to upgrading the feed forward module to a SwiGLU, as discussed in the previous section, gpt-oss replaces the single feed forward module with multiple feed forward modules, using only a subset for each token generation step. This approach is known as Mixture-of-Experts (MoE) and is illustrated in Figure 8 below. Figure 8: The feed forward module is replaced by a Mixture-of-Experts (MoE) module. So, replacing a single feed forward module with multiple feed forward modules (as done in an MoE setup) substantially increases the model's total parameter count.
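(Since the code screenshots from Figure 7 are not reproduced here, below is a minimal sketch of the regular feed forward module next to the SwiGLU variant, using the same 1024/4096 and 1024/1024 dimensions as in the parameter-count example above; it is a reconstruction of the idea, not the exact figure. After the sketch, back to how the MoE router uses these expert modules.)

import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    # Regular feed forward module: 1024 -> 4096 -> 1024
    def __init__(self, emb_dim=1024, hidden_dim=4096):
        super().__init__()
        self.fc1 = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc2 = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class SwiGLUFeedForward(nn.Module):
    # SwiGLU variant: three smaller layers (1024 -> 1024) with a
    # multiplicative gate; Swish is available as F.silu in PyTorch.
    def __init__(self, emb_dim=1024, hidden_dim=1024):
        super().__init__()
        self.fc1 = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc2 = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc3 = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        return self.fc3(F.silu(self.fc1(x)) * self.fc2(x))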
However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. Because only a few experts are active at a time, MoE modules are often referred to as sparse, in contrast to dense modules that always use the full parameter set. At the same time, the large total number of parameters of an MoE increases the capacity of the LLM, which means it can absorb more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. (Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.) As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA). In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections. For example, as shown in Figure 9, if there are 2 key–value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency without noticeably affecting modeling performance, according to ablation studies. Figure 9: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries. So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here.) While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper) show it performs comparably to standard MHA in terms of LLM modeling performance. Sliding-window attention (Figure 10 below) was first introduced in the Longformer paper (2020) and later popularized by Mistral. Interestingly, gpt-oss applies it in every second layer. You can think of it as a variation of multi-head attention, or in this case grouped query attention (GQA), where the attention context is restricted to a smaller window, reducing both memory usage and compute costs. Figure 10: Comparison between regular attention (left) and sliding window attention (right). Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. As I discussed in my previous article, Gemma 2 (2024) used a similar 1:1 ratio. Gemma 3 earlier this year went much further and shifted to a 5:1 ratio, which means only one full-attention layer for every five sliding-window (local) attention layers. According to the Gemma ablation studies, sliding-window attention has minimal impact on modeling performance, as shown in the figure below. Note that the window size in Gemma 2 was 4096 tokens, which Gemma 3 reduced to 1024. In gpt-oss, the window is just 128 tokens, which is remarkably small.
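To illustrate the difference between full causal attention and a sliding window, here is a small mask sketch (window size 4 for readability; gpt-oss uses 128):

import torch

def causal_mask(seq_len, window=None):
    # Boolean mask where True marks positions a query may attend to.
    # window=None gives regular causal attention; a finite window restricts
    # each token to the last `window` tokens (sliding-window attention).
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    mask = j <= i                            # causal: no attending to the future
    if window is not None:
        mask &= (i - j) < window             # local: only the last `window` tokens
    return mask

print(causal_mask(6).int())              # full causal attention
print(causal_mask(6, window=4).int())    # sliding-window attention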
And as a fun fact, the official announcement article notes that sliding-window attention was apparently already used in GPT-3: The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3. Who knew!? I went back to the original GPT-3 paper, and it was indeed mentioned there: We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. Finally, the last small tweak relative to GPT-2 is replacing LayerNorm (2016) with RMSNorm (2019), which has been a common trend in recent years. Akin to swapping GELU with Swish and SwiGLU, RMSNorm is one of these smaller but sensible efficiency improvements. RMSNorm is similar to LayerNorm in its purpose to normalize layer activations, as shown in Figure 11 below. You might recall that not too long ago, BatchNorm was the go-to choice for this task. It has since fallen out of favor, largely because it is harder to parallelize efficiently (due to the mean and variance batch statistics) and performs poorly with small batch sizes. Figure 11: A comparison between LayerNorm (left) and RMSNorm (right) for a small linear layer. As we can see in Figure 11 above, both LayerNorm and RMSNorm scale the layer outputs to be in a reasonable range. LayerNorm subtracts the mean and divides by the standard deviation such that the layer outputs have a zero mean and unit variance (a variance of 1 and a standard deviation of 1). RMSNorm divides the inputs by the root-mean-square. This scales activations to a comparable magnitude without enforcing zero mean or unit variance. In this particular example shown in Figure 11, the mean is 0.77 and the variance is 0.41. Both LayerNorm and RMSNorm stabilize activation scales and improve optimization, but RMSNorm is often preferred in large-scale LLMs because it is cheaper to compute. Unlike LayerNorm, RMSNorm has no bias (shift) term and reduces the expensive mean and variance computations to a single root-mean-square operation. This reduces the number of cross-feature reductions from two to one, which lowers communication overhead on GPUs and improves training efficiency. Figure 12 shows what this looks like in code: Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler. I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work. By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures. In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of what limitations or problems they try to solve.
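(For reference, since the code from Figure 12 is not reproduced here, a minimal version of the two normalization layers described above could look like the following; this is a sketch of the general idea rather than the exact figure.)

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Subtract the mean and divide by the standard deviation (zero mean, unit variance).
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # No mean subtraction and no shift term: just divide by the root mean square.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * x / rms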
For instance, starting with my GPT-2 code I recently implemented the Qwen3 architecture from scratch, which is super similar to gpt-oss. And that brings us to the next topic: comparing gpt-oss to a more recent architecture. Now that we have walked through the evolution from GPT-2 to gpt-oss, we can take the next step and compare gpt-oss to a more recent architecture, Qwen3, which was released three months earlier in May 2025. The reason I am selecting Qwen3 here is that it is among the top open-weight models as of the time of writing. Additionally, one of the Qwen3 MoE models is more or less directly comparable to gpt-oss due to its relatively similar overall size in terms of trainable parameters. Figure 13 below compares gpt-oss-20b to a Qwen3 model of comparable size. Figure 13: A gpt-oss and Qwen3 model of comparable size side by side. As we can see, gpt-oss 20B and Qwen3 30B-A3B are very similar in their architecture components. The primary difference here, aside from the dimensions, is that gpt-oss employs sliding window attention, as discussed earlier in section 1.6 (not shown in this figure), whereas Qwen3 does not. Let's walk through the noteworthy details one by one in the following subsections. If we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24 (Figure 14). Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b. On the other hand, gpt-oss is a much wider architecture:
An embedding dimension of 2880 instead of 2048
An intermediate expert (feed forward) projection dimension of 2880 as well, instead of 768
It's also worth noting that gpt-oss uses twice as many attention heads, but this doesn't directly increase the model's width. The width is determined by the embedding dimension. Does one approach offer advantages over the other given a fixed number of parameters? As a rule of thumb, deeper models have more flexibility but can be harder to train due to instability issues stemming from exploding and vanishing gradients (which RMSNorm and shortcut connections aim to mitigate). Wider architectures have the advantage of being faster during inference (with a higher tokens/second throughput) due to better parallelization, at a higher memory cost. When it comes to modeling performance, there's unfortunately no good apples-to-apples comparison I am aware of (where parameter size and datasets are kept constant) except for an ablation study in the Gemma 2 paper (Table 9), which found that for a 9B parameter architecture, a wider setup is slightly better than a deeper setup. Across 4 benchmarks, the wider model achieved a 52.0 average score, and the deeper model achieved a 50.8 average score. As shown in Figure 14 above, it's also noteworthy that gpt-oss has a surprisingly small number of experts (32 instead of 128), and only uses 4 instead of 8 active experts per token. However, each expert is much larger than the experts in Qwen3. This is interesting because the recent trends and developments point towards more, smaller experts as being beneficial. This change, at a constant total parameter size, is nicely illustrated in Figure 15 below from the DeepSeekMoE paper.
Model architecture comparisons with GPT-2
MXFP4 optimization to fit gpt-oss models onto single GPUs
Width versus depth trade-offs (gpt-oss vs Qwen3)
Attention bias and sinks
Benchmarks and comparisons with GPT-5
Figure 1: The two gpt-oss models side by side.
If you have looked at recent LLM architecture diagrams before, or read my previous Big Architecture Comparison article, you may notice that there is nothing novel or unusual at first glance.
This is not surprising, since leading LLM developers tend to use the same base architecture and then apply smaller tweaks. This is pure speculation on my part, but I think this is because:
1. There is significant rotation of employees between these labs.
2. We still have not found anything better than the transformer architecture. Even though state space models and text diffusion models exist, as far as I know no one has shown that they perform as well as transformers at this scale. (Most of the comparisons I found focus only on benchmark performance. It is still unclear how well the models handle real-world, multi-turn writing and coding tasks. At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96. EDIT: Someone kindly pointed out that there's a higher-ranking hybrid model: Hunyuan-TurboS at rank 22.)
3. Most of the gains likely come from data and algorithm tweaks rather than from major architecture changes.
https://huggingface.co/openai/gpt-oss-20b
https://huggingface.co/openai/gpt-oss-120b
Figure 2: A side-by-side comparison between gpt-oss-20b and GPT-2 XL 1.5B.
Both gpt-oss and GPT-2 are decoder-only LLMs built on the transformer architecture introduced in the Attention Is All You Need (2017) paper. Over the years, many details have evolved. However, these changes are not unique to gpt-oss, and as we will see later, they appear in many other LLMs. Since I discussed many of these aspects in the previous Big Architecture Comparison article, I will try to keep each subsection brief and focused.
2.1 Removing Dropout
Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended).
Figure 3: An illustration of dropout applied to the attention score matrix.
I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting.
Interestingly, while dropout has largely been ignored in LLM architecture design for many years, I found a 2025 research paper with small-scale LLM experiments (Pythia 1.4B) that confirms that dropout results in worse downstream performance in these single-epoch regimes.
2.2 RoPE Replaces Absolute Positional Embeddings
In transformer-based LLMs, positional encoding is necessary because of the attention mechanism. By default, attention treats the input tokens as if they have no order. In the original GPT architecture, absolute positional embeddings addressed this by learning an embedding vector for each position in the sequence (Figure 4), which is then added to the token embeddings.
Figure 4: Illustration of absolute positional embeddings.
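To make this concrete, here is a minimal PyTorch sketch (not the actual GPT-2 code) of how learned absolute positional embeddings are typically added to the token embeddings; the dimensions are illustrative GPT-2-like values:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

vocab_size, context_len, emb_dim = 50_257, 1024, 768  # GPT-2-like sizes

tok_emb = nn.Embedding(vocab_size, emb_dim)   # one vector per token id
pos_emb = nn.Embedding(context_len, emb_dim)  # one learned vector per position

token_ids = torch.randint(0, vocab_size, (1, 8))   # batch of 1, 8 tokens
positions = torch.arange(token_ids.shape[1])        # 0, 1, ..., 7

# Absolute positional embeddings are simply added to the token embeddings
x = tok_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 8, 768])
```

Because the position vectors are learned for fixed indices 0 to context_len-1, this scheme is tied to the training context length, which is one of the motivations for the rotary approach discussed next.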
RoPE ( Rotary Position Embedding ) introduced a different approach: instead of adding position information as separate embeddings, it encodes position by rotating the query and key vectors in a way that depends on each token's position. (RoPE is an elegant idea but also a bit of a tricky topic to explain. I plan to cover it separately in more detail one day.) While first introduced in 2021, RoPE became widely adopted with the release of the original Llama model in 2023 and has since become a staple in modern LLMs.
2.3 Swish/SwiGLU Replaces GELU
Activation functions used to be a hot topic of debate until the deep learning community largely settled on ReLU more than a decade ago. Since then, researchers have proposed and tried many ReLU-like variants with smoother curves, and GELU and Swish (Figure 5) are the ones that stuck.
Figure 5: Comparison between Swish and GELU activations, which are both smoother versions of ReLU.
Early GPT architectures used GELU, which is defined as GELU(x) = 0.5 · x · (1 + erf(x / √2)). Here, erf (short for error function) is the integral of a Gaussian and is computed using polynomial approximations of the Gaussian integral, which makes it more computationally expensive than simpler functions like the sigmoid used in Swish (also referred to as the sigmoid linear unit, or SiLU), where Swish is simply Swish(x) = x · sigmoid(x).
In practice, Swish is computationally slightly cheaper than GELU, and that's probably the main reason it replaced GELU in most newer models. Depending on which paper we look at, one might be somewhat better in terms of modeling performance. But I'd say these gains are often within standard error, and the winner will depend heavily on hyperparameter tuning.
Swish is used in most architectures today. However, GELU is not entirely forgotten; for example, Google's Gemma models still use GELU. What's more notable, though, is that the feed forward module (a small multi-layer perceptron) is replaced by a gated "GLU" counterpart, where GLU stands for gated linear unit and was proposed in a 2020 paper . Concretely, the 2 fully connected layers are replaced by 3 fully connected layers that are used as shown in Figure 6 below.
Figure 6: A comparison between Swish and GELU and their gated counterparts, SwiGLU and GEGLU.
At first glance, it may appear that the GEGLU/SwiGLU variants are better than the regular feed forward layers because there are simply more parameters due to the extra layer. But this is deceiving because, in practice, the two hidden projection layers in SwiGLU/GEGLU are usually chosen to be half the size (or less) of the hidden layer in a traditional feed forward module. To illustrate this better, consider the concrete code implementations of the regular and GLU variants:
Figure 7: Regular feed forward module (top) and SwiGLU variant (bottom) next to each other. Note that the Swish function is implemented as "silu" in PyTorch.
So, suppose we have an embedding dimension of 1024; a sketch of both module variants with these dimensions follows below.
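Since Figure 7 is only included as a screenshot here, the following is a rough sketch of what such a pair of modules could look like in PyTorch. The hidden sizes (4096 for the regular module, 1024 for the gated variant) are chosen to match the parameter counts discussed next, and the bias-free linear layers are an assumption:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    # Traditional 2-layer feed forward module with a GELU activation
    def __init__(self, emb_dim=1024, hidden_dim=4096):
        super().__init__()
        self.fc1 = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc2 = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        return self.fc2(nn.functional.gelu(self.fc1(x)))

class SwiGLUFeedForward(nn.Module):
    # Gated variant: two parallel projections, one "gated" by Swish (silu in PyTorch)
    def __init__(self, emb_dim=1024, hidden_dim=1024):
        super().__init__()
        self.fc1 = nn.Linear(emb_dim, hidden_dim, bias=False)  # gate branch
        self.fc2 = nn.Linear(emb_dim, hidden_dim, bias=False)  # value branch
        self.fc3 = nn.Linear(hidden_dim, emb_dim, bias=False)  # output projection

    def forward(self, x):
        return self.fc3(nn.functional.silu(self.fc1(x)) * self.fc2(x))

x = torch.randn(1, 8, 1024)
print(FeedForward()(x).shape, SwiGLUFeedForward()(x).shape)
```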
In the regular feed forward case, this would then be:
fc1: 1024 × 4096 = 4,194,304
fc2: 4096 × 1024 = 4,194,304
And in the SwiGLU variant:
fc1: 1024 × 1024 = 1,048,576
fc2: 1024 × 1024 = 1,048,576
fc3: 1024 × 1024 = 1,048,576
That is, roughly 8.4 million parameters for the regular module versus roughly 3.1 million for the gated variant in this example, despite the extra layer.
2.4 Mixture-of-Experts Replaces Single Feed Forward Module
Speaking of the feed forward module, another notable change in recent architectures is replacing the single feed forward module with multiple feed forward modules, a so-called Mixture-of-Experts (MoE) module, as shown in Figure 8 below.
Figure 8: The feed forward module is replaced by a Mixture-of-Expert (MoE) module.
So, replacing a single feed forward module with multiple feed forward modules (as done in an MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token. Because only a few experts are active at a time, MoE modules are often referred to as sparse , in contrast to dense modules that always use the full parameter set. However, the large total number of parameters via an MoE increases the capacity of the LLM, which means it can take up more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time. (Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.)
2.5 Grouped Query Attention Replaces Multi-Head Attention
As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA). In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections. For example, as shown in Figure 9, if there are 2 key–value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another. This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency without noticeably affecting modeling performance, according to ablation studies.
Figure 9: A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries.
So, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache. (If you are curious how GQA looks in code, see my GPT-2 to Llama 3 conversion guide for a version without KV cache and my KV-cache variant here .)
While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the original GQA paper and the Llama 2 paper ) show it performs comparably to standard MHA in terms of LLM modeling performance.
2.6 Sliding Window Attention
Sliding-window attention (Figure 10 below) was first introduced in the LongFormer paper (2020) and later popularized by Mistral. Interestingly, gpt-oss applies it in every second layer. You can think of it as a variation of multi-head attention, or in this case grouped query attention (GQA), where the attention context is restricted to a smaller window, reducing both memory usage and compute costs.
Figure 10: Comparison between regular attention (left) and sliding window attention (right).
Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. As I discussed in my previous article , Gemma 2 (2024) used a similar 1:1 ratio. Gemma 3 earlier this year went much further and shifted to a 5:1 ratio, which means only one full-attention layer for every five sliding-window (local) attention layers. According to the Gemma ablation studies, sliding-window attention has minimal impact on modeling performance, as shown in the figure below. Note that the window size in Gemma 2 was 4096 tokens, which Gemma 3 reduced to 1024. In gpt-oss, the window is just 128 tokens, which is remarkably small.
And as a fun fact, the official announcement article notes that sliding-window attention was apparently already used in GPT-3: "The models use alternating dense and locally banded sparse attention patterns, similar to GPT-3." Who knew!? I went back to the original GPT-3 paper , and it was indeed mentioned there: "We use the same model and architecture as GPT-2 [ RWC+19 ], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [ CGRS19 ]."
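As an aside, here is a minimal sketch of how a sliding-window causal mask differs from a regular causal mask. The sequence length and window size are toy values, and in practice the disallowed positions would be filled with -inf before the softmax:

```python
import torch

def causal_mask(seq_len):
    # True = attention allowed; each token sees itself and all earlier tokens
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_causal_mask(seq_len, window_size):
    # Like the causal mask, but each token only sees the last `window_size` tokens
    full = causal_mask(seq_len)
    return torch.triu(full, diagonal=-(window_size - 1))

print(causal_mask(6).int())
print(sliding_window_causal_mask(6, window_size=3).int())
```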
2.7 RMSNorm Replaces LayerNorm
Finally, the last small tweak relative to GPT-2 is replacing LayerNorm (2016) with RMSNorm (2019) , which has been a common trend in recent years. Akin to swapping GELU with Swish and SwiGLU, RMSNorm is one of these smaller but sensible efficiency improvements.
RMSNorm is similar to LayerNorm in that its purpose is to normalize layer activations, as shown in Figure 11 below. You might recall that not too long ago, BatchNorm was the go-to choice for this task. It has since fallen out of favor, largely because it is harder to parallelize efficiently (due to the mean and variance batch statistics) and performs poorly with small batch sizes.
Figure 11: A comparison between LayerNorm (left) and RMSNorm (right) for a small linear layer.
As we can see in Figure 11 above, both LayerNorm and RMSNorm scale the layer outputs to be in a reasonable range. LayerNorm subtracts the mean and divides by the standard deviation such that the layer outputs have a zero mean and unit variance. RMSNorm divides the inputs by the root-mean-square, which scales activations to a comparable magnitude without enforcing zero mean or unit variance. In the particular example shown in Figure 11, the mean is 0.77 and the variance is 0.41.
Both LayerNorm and RMSNorm stabilize activation scales and improve optimization, but RMSNorm is often preferred in large-scale LLMs because it is cheaper to compute. Unlike LayerNorm, RMSNorm has no bias (shift) term and reduces the expensive mean and variance computations to a single root-mean-square operation. This reduces the number of cross-feature reductions from two to one, which lowers communication overhead on GPUs and improves training efficiency. Figure 12 shows what this looks like in code:
Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler.
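Since Figure 12 is a screenshot, here is a sketch of what minimal LayerNorm and RMSNorm implementations could look like (in practice, one would likely use PyTorch's built-in nn.LayerNorm and nn.RMSNorm):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # bias term (absent in RMSNorm)

    def forward(self, x):
        # Two reductions over the feature dimension: mean and variance
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift

class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))

    def forward(self, x):
        # Single reduction: root mean square over the feature dimension
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * x / rms

x = torch.randn(2, 4, 8)
print(LayerNorm(8)(x).shape, RMSNorm(8)(x).shape)
```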
2.8 The GPT-2 Legacy
I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work. By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures.
In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of what limitations or problems they try to solve. For instance, starting with my GPT-2 code, I recently implemented the Qwen3 architecture from scratch , which is super similar to gpt-oss, which brings us to the next topic: comparing gpt-oss to a more recent architecture.
Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
3. Comparing gpt-oss To A Recent Architecture (Qwen3)
Now that we have walked through the evolution from GPT-2 to gpt-oss, we can take the next step and compare gpt-oss to a more recent architecture, Qwen3, which was released three months earlier in May 2025.
The reason I am selecting Qwen3 here is that it is among the top open-weight models as of the time of writing. Additionally, one of the Qwen3 MoE models is more or less directly comparable to gpt-oss due to its relatively similar overall size in terms of trainable parameters. Figure 13 below compares gpt-oss-20b to a Qwen3 model of comparable size.
Figure 13: A gpt-oss and Qwen3 model of comparable size side by side.
As we can see, gpt-oss-20b and Qwen3 30B-A3B are very similar in their architecture components. The primary difference here, aside from the dimensions, is that gpt-oss employs sliding window attention, as discussed earlier in section 2.6 (not shown in this figure), whereas Qwen3 does not. Let's walk through the noteworthy details one by one in the following subsections.
3.1 Width Versus Depth
If we look at the two models closely, we see that Qwen3 is a much deeper architecture with its 48 transformer blocks instead of 24 (Figure 14).
Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b.
On the other hand, gpt-oss is a much wider architecture:
An embedding dimension of 2880 instead of 2048
An intermediate expert (feed forward) projection dimension of 2880 instead of 768
It's also worth noting that gpt-oss uses twice as many attention heads, but this doesn't directly increase the model's width; the width is determined by the embedding dimension.
Does one approach offer advantages over the other given a fixed number of parameters? As a rule of thumb, deeper models have more flexibility but can be harder to train due to instability issues caused by exploding and vanishing gradients (which RMSNorm and shortcut connections aim to mitigate). Wider architectures have the advantage of being faster during inference (with a higher tokens/second throughput) due to better parallelization, at a higher memory cost.
When it comes to modeling performance, there's unfortunately no good apples-to-apples comparison I am aware of (where parameter size and datasets are kept constant) except for an ablation study in the Gemma 2 paper (Table 9) , which found that for a 9B parameter architecture, a wider setup is slightly better than a deeper setup. Across 4 benchmarks, the wider model achieved a 52.0 average score, and the deeper model achieved a 50.8 average score.
3.2 Few Large Experts Versus Many Small Experts
As shown in Figure 14 above, it's also noteworthy that gpt-oss has a surprisingly small number of experts (32 instead of 128) and only uses 4 instead of 8 active experts per token. However, each expert is much larger than the experts in Qwen3. This is interesting because the recent trends and developments point towards more, smaller experts as being beneficial. This change, at a constant total parameter size, is nicely illustrated in Figure 15 below from the DeepSeekMoE paper.
Figure 15: An annotated figure from "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models", https://arxiv.org/abs/2401.06066
Notably, unlike DeepSeek's models, neither gpt-oss nor Qwen3 uses shared experts.
To be fair, the small number of experts in gpt-oss could be a side effect of the 20B size. Looking at the 120B model below, they indeed increased the number of experts (and transformer blocks) while keeping everything else fixed, as shown in Figure 16 below.
Figure 16: The two gpt-oss architectures side by side, where the larger 120B model only scales the number of transformer blocks and number of experts.
The boring explanation for the fact that the 20B and 120B models are so similar is probably that the 120B model was the main focus, and the easiest way to create a smaller model was to make it a bit shorter (fewer transformer blocks) and to reduce the number of experts, because that's where most of the parameters are. However, one might speculate whether they started training the 120B model and then chopped some of the transformer blocks and experts for continued pre-training (instead of starting from random weights). In any case, it is quite unusual to scale only those two aspects (transformer blocks and number of experts). For instance, when looking at Qwen3 MoE models of multiple sizes (Figure 17 below), we can see that they were scaled more proportionally across many more aspects.
Figure 17: Architecture differences in the various Qwen3 models.
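To get a rough feel for where the parameters sit, here is a back-of-the-envelope sketch based on the numbers above. It assumes SwiGLU-style experts with three weight matrices each and ignores attention, embedding, and router weights, so the results are only approximations (they land in the vicinity of the advertised 20B and 30B totals, which also illustrates the earlier fun fact that expert weights dominate the parameter count):

```python
# Back-of-the-envelope expert parameter counts, assuming SwiGLU-style experts
# (3 weight matrices each) and ignoring attention, embedding, and router weights.

def expert_params(emb_dim, intermediate_dim, n_experts, n_active, n_blocks):
    per_expert = 3 * emb_dim * intermediate_dim
    total = per_expert * n_experts * n_blocks
    active = per_expert * n_active * n_blocks
    return total, active

# Dimensions as discussed above (gpt-oss-20b vs Qwen3 30B-A3B)
configs = {
    "gpt-oss-20b":   dict(emb_dim=2880, intermediate_dim=2880, n_experts=32,  n_active=4, n_blocks=24),
    "Qwen3 30B-A3B": dict(emb_dim=2048, intermediate_dim=768,  n_experts=128, n_active=8, n_blocks=48),
}

for name, cfg in configs.items():
    total, active = expert_params(**cfg)
    print(f"{name}: ~{total/1e9:.1f}B expert params in total, ~{active/1e9:.1f}B active per token")
```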
3.3 Attention Bias and Attention Sinks
Both gpt-oss and Qwen3 use grouped query attention. The main difference is that gpt-oss restricts the context size via sliding window attention in every second layer, as mentioned earlier. However, there's one interesting detail that caught my eye: it seems that gpt-oss uses bias units for the attention weights, as shown in the figure below.
Figure 18: gpt-oss models use bias units in the attention layers. See code example here .
I haven't seen these bias units being used since the GPT-2 days, and they are commonly regarded as redundant. Indeed, I found a recent paper that shows mathematically that this is at least true for the key transformation (k_proj). Furthermore, the empirical results show that there is little difference between models trained with and without bias units (see Figure 19 below).
Figure 19: Table from https://arxiv.org/pdf/2302.08626 showing the average test loss when the models were trained from scratch with and without bias units.
Another detail you may have noticed is the definition of the sinks parameter in the code screenshot in Figure 18. In general, attention sinks are special "always-attended" tokens placed at the start of the sequence to stabilize attention, which is especially useful in long-context scenarios. That is, if the context gets very long, this special token at the beginning is still attended to, and it can learn to store some generally useful information about the entire sequence. (I think it was originally proposed in the Efficient Streaming Language Models with Attention Sinks paper.)
In the gpt-oss implementation, attention sinks are not actual tokens in the input sequence. Instead, they are learned per-head bias logits that are appended to the attention scores (Figure 20). The goal is the same as with the above-mentioned attention sinks, but without modifying the tokenized inputs.
Figure 20: The use of attention sinks in gpt-oss; based on the Hugging Face code here .
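Based on the description above (and not the actual gpt-oss or Hugging Face code), the idea can be sketched as follows: one learned logit per head is appended to the attention scores before the softmax and dropped again afterwards, so it can absorb attention mass without being a real token:

```python
import torch
import torch.nn as nn

# Sketch: "sink" logits as learned per-head biases appended to the attention scores
n_heads, seq_len = 4, 6
scores = torch.randn(1, n_heads, seq_len, seq_len)  # raw attention scores
sinks = nn.Parameter(torch.zeros(n_heads))           # one learned logit per head

# Append the sink logit as an extra "column" that every query position can attend to
sink_col = sinks.view(1, n_heads, 1, 1).expand(1, n_heads, seq_len, 1)
scores_with_sink = torch.cat([scores, sink_col], dim=-1)   # (1, heads, seq, seq + 1)

probs = torch.softmax(scores_with_sink, dim=-1)
probs = probs[..., :-1]  # drop the sink column; part of the attention mass is "absorbed" by it
print(probs.shape)       # torch.Size([1, 4, 6, 6])
```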
3.4 License
Lastly, and similar to Qwen3, the gpt-oss models are released under the Apache 2.0 open-source license, which is great (it's the same license that I prefer for my own open-source projects). This means that the models can be distilled into other models or used in commercial products without restriction.
Open-weight vs. open-source LLMs. This distinction has been debated for years, but it is worth clarifying to avoid confusion about this release and its artifacts. Some model developers release only the model weights and inference code (for example, Llama, Gemma, gpt-oss), while others (for example, OLMo) release everything including training code, datasets, and weights as true open source. By that stricter definition, gpt-oss is an open-weight model (just like Qwen3) because it includes the weights and inference code but not the training code or datasets. However, the terminology is used inconsistently across the industry. I assume the "oss" in "gpt-oss" stands for open source software ; however, I am positively surprised that OpenAI itself clearly describes gpt-oss as an open-weight model in their official announcement article .
4 Other Interesting Tidbits
While the previous sections described how the architecture has evolved since GPT-2 and discussed its similarities to Qwen3 (and most other recent models), there are still a few additional noteworthy details I have not mentioned yet. These are points that did not fit neatly into the earlier sections but are still worth mentioning.
4.1 Training Overview
Unfortunately, there is not much information about the training set sizes and algorithms available. I added the most interesting puzzle pieces from the model card report (1) and announcement post (2) below:
"The gpt-oss models were trained using our most advanced pre-training and post-training techniques [...]" (1)
"[...] required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer." (1)
"[...] including a supervised fine-tuning stage and a high-compute RL stage [...]" (2)
"We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge." (2)
So, we know that the gpt-oss models are reasoning models. The training compute of 2.1 million H100 GPU hours is roughly on par with the 2.788 million H800 GPU hours that the ~5.6x larger DeepSeek V3 model was trained for. Unfortunately, there is no information about the Qwen3 training time available yet.
Interestingly, the gpt-oss training hour estimate includes both the supervised learning for instruction following and the reinforcement learning for reasoning, whereas DeepSeek V3 is just a pre-trained base model on top of which DeepSeek R1 was trained separately.
4.2 Reasoning Efforts
As mentioned in the previous section, the gpt-oss models are reasoning models. However, what's particularly interesting is that they were trained so that users can easily control the degree of reasoning via inference-time scaling. Concretely, gpt-oss models can receive "Reasoning effort: low/medium/high" instructions as part of their system prompt, which directly affects the response length and accuracy, as shown in Figure 21.
Figure 21: Response length and quality of gpt-oss models under different reasoning efforts (annotated figure from the model card )
This level of adjustability is useful because it lets us balance cost, compute, and accuracy. For example, if the task is simple, such as answering a straightforward knowledge question or fixing a small typo, we can skip extended reasoning. This saves time and resources while avoiding unnecessarily long responses and verbose reasoning traces.
It is somewhat unfortunate that OpenAI did not release the base models prior to reinforcement learning-based reasoning training, unlike Qwen3 or OLMo. Base models are particularly valuable starting points for researchers working on reasoning methods (which is one reason I currently like working with Qwen3 Base). My guess is that OpenAI's decision was driven more by industry and production use cases than by research considerations.
Note that the original Qwen3 models also have a toggle for enabling/disabling thinking (reasoning) modes (via a setting in the tokenizer that simply adds <think></think> tags to disable the reasoning behavior). However, the Qwen3 team updated their models in the last few weeks and moved away from the hybrid model towards dedicated Instruct/Thinking/Coder variants. The reason was that the hybrid mode resulted in lower performance compared to the individual models:
"After discussing with the community and reflecting on the matter, we have decided to abandon the hybrid thinking mode. We will now train the Instruct and Thinking models separately to achieve the best possible quality." ( Source )
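For reference, this is roughly what the tokenizer-level toggle looks like for the original hybrid Qwen3 models via Hugging Face transformers; the enable_thinking argument is taken from Qwen's model cards, so treat their documentation as authoritative:

```python
from transformers import AutoTokenizer

# Sketch of the tokenizer-level reasoning toggle for the original hybrid Qwen3 models
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")

messages = [{"role": "user", "content": "What is 2 + 2?"}]

prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
prompt_no_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# With enable_thinking=False, the chat template inserts an empty <think></think> block,
# which suppresses the model's long reasoning traces.
print(prompt_no_thinking)
```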
4.3 MXFP4 Optimization: A Small But Important Detail
One interesting surprise is that OpenAI released the gpt-oss models with an MXFP4 quantization scheme for the MoE experts. Quantization formats used to be a niche topic, mostly relevant to mobile or embedded AI, but that's changed with the push toward bigger models. In this case, the MXFP4 optimization allows the models to run on single-GPU devices. Here's what that looks like in practice:
The large model (think 120B) fits on a single 80 GB H100 or newer GPU. Not consumer hardware, but hey, it's much cheaper to rent a 1-H100 machine than a multi-H100 machine. Plus, we don't have to worry about distributing the model across GPUs and adding communication overhead. It's really nice that AMD MI300X cards are supported from day 1 as well!
The smaller 20B model even fits into 16 GB of VRAM; the caveat is that it has to be an RTX 50-series GPU or newer to support MXFP4. (Edit: support for older cards, such as the RTX 4090, was recently added via a patch .)
Note that the models will also run on older hardware but without MXFP4 support and will thus consume more RAM. Without MXFP4 optimization, the models in bfloat16 will consume more like 48 GB (gpt-oss-20b) and 240 GB (gpt-oss-120b). By the way, I can run the gpt-oss-20b model comfortably on my Mac Mini using ollama. It uses about 13.5 GB of memory, which is really reasonable.
4.4 Benchmarks
The models are still a bit too new for independent benchmarks. Checking the LM Arena leaderboard , I found that gpt-oss is not listed yet. So, Qwen3-Instruct remains the top open-weight model, according to users on the LM Arena, for now (Figure 22).
Figure 22: Current view of the LM Arena Leaderboard (as of 8 Aug 2025)
Looking at the reasoning benchmarks provided in the gpt-oss announcement post, we can see that the gpt-oss models are on par with OpenAI's proprietary models as well as Qwen3 (Figure 23).
Figure 23: The main benchmark charts are from the official gpt-oss announcement post . The "no tools" gpt-oss-120b data is taken from the official model card paper , and the Qwen3 numbers are taken from the official Qwen3 repository .
However, this should be caveated by the fact that gpt-oss-120b is almost half the size of the Qwen3-235B-A22B-Thinking-2507 model and can run on a single GPU.
Benchmark performance, however, does not always reflect real-world usability. In my limited use over the past few days, I have found gpt-oss to be quite capable. That said, as others have observed, it does seem to have a relatively high tendency to hallucinate (a point also mentioned in its model card). This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some "general knowledge forgetting."
Still, because gpt-oss was designed with tool use in mind, this limitation may become less relevant over time. Tool integration in open-source LLMs is still in its early stages, but as it matures, I expect that we will increasingly let models consult external sources (like search engines) when answering factual or knowledge-based queries. If that happens, it could be sensible to prioritize reasoning capacity over memorization. This is much like human learning in school (or in life in general), where problem-solving skills often matter more than memorizing facts.
5 gpt-oss and GPT-5
OpenAI had a busy week and released the long-awaited GPT-5 model shortly after gpt-oss. The GPT-5 release was interesting. And if there's one thing I have to say here, it's that I am really surprised by how good their open-weight models are compared to their best product offering in terms of benchmark performance (Figure 24).
Figure 24: The main benchmark charts are from the official GPT-5 announcement post . The gpt-oss data is taken from the official model card paper and announcement post , and the Qwen3 numbers are taken from the official Qwen3-Coder repository .
All in all, even though some people called the release overhyped, I am glad that we have a new set of really strong open-weight models that are not too far behind the best proprietary ones. Of course, benchmarks often do not accurately reflect real-world use, and it is still too early to tell based on the limited usage. But I think these are good times for people who like to work with open-weight and local (or privately hosted) models.
This magazine is a personal passion project, and your support helps keep it alive. If you would like to contribute, there are a few great ways:
Grab a copy of my book . Build a Large Language Model (From Scratch) walks you through building an LLM step by step, from tokenizer to training.
Check out the video course . There's now a 17-hour video course based on the book, available from Manning. It follows the book closely, section by section, and works well both as a standalone and as a code-along resource. The video course is ad-free (unlike the YouTube version) and has a cleaner, more structured format. It also contains 5 additional hours of pre-requisite video material created by Abhinav Kimothi.
Subscribe . A paid subscription helps to make my writing sustainable and gives you access to additional content.
Thanks for reading, and for helping support independent research!

0 views
Simon Willison 2 months ago

GPT-5: Key characteristics, pricing and model card

I've had preview access to the new GPT-5 model family for the past two weeks (see related video and my disclosures ) and have been using GPT-5 as my daily-driver. It's my new favorite model. It's still an LLM - it's not a dramatic departure from what we've had before - but it rarely screws up and generally feels competent or occasionally impressive at the kinds of things I like to use models for. I've collected a lot of notes over the past two weeks, so I've decided to break them up into a series of posts . This first one will cover key characteristics of the models, how they are priced and what we can learn from the GPT-5 system card . Let's start with the fundamentals. GPT-5 in ChatGPT is a weird hybrid that switches between different models. Here's what the system card says about that (my highlights in bold): GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt). [...] Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model. GPT-5 in the API is simpler: it's available as three models - regular , mini and nano - which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high.
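For illustration, here is a sketch of what selecting a GPT-5 variant and reasoning level might look like via the OpenAI Python SDK's Responses API; the exact parameter shapes are assumptions based on OpenAI's documentation at the time of writing and may change:

```python
from openai import OpenAI

# Sketch: choosing a GPT-5 variant and reasoning level via the Responses API
# (verify parameter names against the current OpenAI docs)
client = OpenAI()

response = client.responses.create(
    model="gpt-5-mini",
    reasoning={"effort": "minimal"},  # "minimal", "low", "medium", or "high"
    input="Summarize the difference between GPT-5, GPT-5 mini, and GPT-5 nano.",
)
print(response.output_text)
```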

0 views
Sean Goedecke 2 months ago

Can small AI models think as well as large ones?

An AI trend that’s emerged in the last few months is the idea of a “cognitive core”: instead of trying to build the largest, most capable model we can, should we be trying to build a small model? In general, big models - models with higher parameter counts - are better models. Claude Opus 4 is better at everything than Claude Sonnet 4, and so on

0 views
Sean Goedecke 3 months ago

Practical notes on getting LLMs to generate new ideas

Large language models struggle to generate new ideas. To AI skeptics, this seems trivially true, since they believe LLMs can only regurgitate content from their training data. To AI believers, this is a puzzle. If a human had the breadth of knowledge of an LLM, wouldn’t they be able to synthesize it and come up with ideas nobody else has had?

0 views
Ahead of AI 3 months ago

LLM Research Papers: The 2025 List (January to June)

As some of you know, I keep a running list of research papers I (want to) read and reference. About six months ago, I shared my 2024 list , which many readers found useful. So, I was thinking about doing this again. However, this time, I am incorporating the one piece of feedback that kept coming up: "Can you organize the papers by topic instead of date?" The categories I came up with are:
Reasoning Models
- 1a. Training Reasoning Models
- 1b. Inference-Time Reasoning Strategies
- 1c. Evaluating LLMs and/or Understanding Reasoning
Other Reinforcement Learning Methods for LLMs
Other Inference-Time Scaling Methods
Efficient Training & Architectures
Diffusion-Based Language Models
Multimodal & Vision-Language Models
Data & Pre-training Datasets
Also, as LLM research continues to be shared at a rapid pace, I have decided to break the list into bi-yearly updates. This way, the list stays digestible, timely, and hopefully useful for anyone looking for solid summer reading material. Please note that this is just a curated list for now. In future articles, I plan to revisit and discuss some of the more interesting or impactful papers in larger topic-specific write-ups. Stay tuned!
It's summer! And that means internship season, tech interviews, and lots of learning. To support those brushing up on intermediate to advanced machine learning and AI topics, I have made all 30 chapters of my Machine Learning Q and AI book freely available for the summer: 🔗 https://sebastianraschka.com/books/ml-q-and-ai/#table-of-contents Whether you are just curious and want to learn something new or prepping for interviews, hopefully this comes in handy. Happy reading, and best of luck if you are interviewing!
This year, my list is very reasoning model-heavy. So, I decided to subdivide it into 3 subcategories: training, inference-time scaling, and more general understanding/evaluation. This subsection focuses on training strategies specifically designed to improve reasoning abilities in LLMs. As you may see, much of the recent progress has centered around reinforcement learning (with verifiable rewards), which I covered in more detail in a previous article.
Annotated figure from Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 8 Jan, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682 13 Jan, The Lessons of Developing Process Reward Models in Mathematical Reasoning, https://arxiv.org/abs/2501.07301 16 Jan, Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models, https://arxiv.org/abs/2501.09686 20 Jan, Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 22 Jan, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs//2501.12599 22 Jan, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 3 Feb, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 5 Feb, Demystifying Long Chain-of-Thought Reasoning in LLMs, Demystifying Long Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2502.03373 5 Feb, LIMO: Less is More for Reasoning, https://arxiv.org/abs/2502.03387 5 Feb, Teaching Language Models to Critique via Reinforcement Learning, https://arxiv.org/abs/2502.03492 6 Feb, Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 10 Feb, Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, https://arxiv.org/abs/2502.06781 10 Feb, On the Emergence of Thinking in LLMs I: Searching for the Right Intuition, https://arxiv.org/abs/2502.06773 11 Feb, LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!, https://arxiv.org/abs/2502.07374 12 Feb, Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance, https://arxiv.org/abs/2502.08127 13 Feb, Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging - An Open Recipe, https://arxiv.org/abs/2502.09056 20 Feb, Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, https://arxiv.org/abs/2502.14768 25 Feb, SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution, https://arxiv.org/abs/2502.18449 4 Mar, Learning from Failures in Multi-Attempt Reinforcement Learning, https://arxiv.org/abs/2503.04808 4 Mar, The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models, https://arxiv.org/abs/2503.02875 10 Mar, R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 10 Mar, LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL, https://arxiv.org/abs/2503.07536 12 Mar, Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, https://arxiv.org/abs/2503.09516 16 Mar, Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models, https://arxiv.org/abs/2503.13551 20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, https://arxiv.org/abs/2503.16219 25 Mar, ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.19470 26 Mar, Understanding R1-Zero-Like Training: A Critical Perspective, https://arxiv.org/abs/2503.20783 30 Mar, RARE: Retrieval-Augmented Reasoning Modeling, https://arxiv.org/abs/2503.23513 31 Mar, Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290 31 Mar, JudgeLRM: Large Reasoning Models as a Judge, 
https://arxiv.org/abs/2504.00050 7 Apr, Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 10 Apr, VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning, https://arxiv.org/abs/2504.08837 11 Apr, Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning, https://arxiv.org/abs/2504.08672 13 Apr, Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability, https://arxiv.org/abs/2504.09639 21 Apr, Learning to Reason under Off-Policy Guidance, https://arxiv.org/abs/2504.14945 22 Apr, Tina: Tiny Reasoning Models via LoRA, https://arxiv.org/abs/2504.15777 29 Apr, Reinforcement Learning for Reasoning in Large Language Models with One Training Example, https://arxiv.org/abs/2504.20571 30 Apr, Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math, https://arxiv.org/abs/2504.21233 2 May, Llama-Nemotron: Efficient Reasoning Models, https://arxiv.org/abs/2505.00949 5 May, RM-R1: Reward Modeling as Reasoning, https://arxiv.org/abs/2505.02387 6 May, Absolute Zero: Reinforced Self-play Reasoning with Zero Data, https://arxiv.org/abs/2505.03335 12 May, INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning, https://arxiv.org/abs/2505.07291 12 May, MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining, https://arxiv.org/abs/2505.07608 14 May, Qwen3 Technical Report, https://arxiv.org/abs/2505.09388 15 May, Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models, https://arxiv.org/abs/2505.10554 19 May, AdaptThink: Reasoning Models Can Learn When to Think, https://arxiv.org/abs/2505.13417 19 May, Thinkless: LLM Learns When to Think, https://arxiv.org/abs/2505.13379 20 May, General-Reasoner: Advancing LLM Reasoning Across All Domains, https://arxiv.org/abs/2505.14652 21 May, Learning to Reason via Mixture-of-Thought for Logical Reasoning, https://arxiv.org/abs/2505.15817 21 May, RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning, https://arxiv.org/abs/2505.15034 23 May, QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning, https://www.arxiv.org/abs/2505.17667 26 May, Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles, https://arxiv.org/abs/2505.19914 26 May, Learning to Reason without External Rewards, https://arxiv.org/abs/2505.19590 29 May, Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents, https://arxiv.org/abs/2505.22954 30 May, Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning, https://arxiv.org/abs/2505.24726 30 May, ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models, https://arxiv.org/abs/2505.24864 2 Jun, Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, https://arxiv.org/abs/2506.01939 3 Jun, Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening, https://www.arxiv.org/abs/2506.02355 9 Jun, Reinforcement Pre-Training, https://arxiv.org/abs/2506.08007 10 Jun, RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling, https://arxiv.org/abs/2506.08672 10 Jun, Reinforcement Learning Teachers of Test Time Scaling, https://www.arxiv.org/abs/2506.08388 12 Jun, Magistral, https://arxiv.org/abs/2506.10910 12 Jun, Spurious Rewards: Rethinking Training Signals in RLVR, 
https://arxiv.org/abs/2506.10947 16 Jun, AlphaEvolve: A coding agent for scientific and algorithmic discovery, https://arxiv.org/abs/2506.13131 17 Jun, Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs, https://arxiv.org/abs/2506.14245 23 Jun, Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training, https://arxiv.org/abs/2506.18777 26 Jun, Bridging Offline and Online Reinforcement Learning for LLMs, https://arxiv.org/abs/2506.21495 This part of the list covers methods that improve reasoning dynamically at test time, without requiring retraining. Often, these papers are focused on trading off computational performance for modeling performance.

fnands 3 months ago

Can multi-sensor foundation models be more than the sum of their parts?

Geospatial foundation models (GFMs) have been on my mind recently, partially because I attended the ESA-NASA International Workshop on AI Foundation Model for EO and partially because I’ve been working on fine-tuning some GFMs for downstream use at work for the last while. This post was in part prompted by two recent LinkedIn posts, one by Christopher Ren and the other by Madeline Lisaius, both of which express some amount of skepticism about the way in which current GFMs are trained, although from somewhat different angles. Christopher Ren also wrote an expanded blog post on the subject, which takes aim mostly at IBM's new TerraMind GFM, but it is worth reading the responses from one of the TerraMind authors at the bottom of the post, as that adds some nuance to the arguments. It’s clear that GFMs are a hot topic in the Earth Observation (EO) space at the moment, and it is fair to question whether the hype is warranted. At the ESA-NASA workshop, one of the points made was that there seems to be much more activity in the creation of GFMs than actual downstream use of them so far, and there were some interesting discussions as to why this might currently be the case. A recent post from Bruno Sanchez-Andrade Nuño (director of the Clay project) also made me think that a rough bifurcation is appearing in the GFM space: one branch goes deep and the other goes wide. I think it is best if we understand which branch a model fits into and not judge one by the standards of the other. I’m not going to respond directly to the other conversations going on: I’m just adding my two cents to the mix, and I want to be clear that the work I am doing definitely falls into the “go deep” branch, and my opinions are very much coloured by that fact. On the surface, this might seem like a slightly odd question, seeing as one of the principal reasons people are interested in GFMs (and FMs in general) is better generalization: EO is, after all, often a global endeavour, and it is desirable to have a foundation model that will help your downstream tasks generalize across geographies, illumination conditions, imaging angles, etc. But there are many aspects to generalization, some of which don’t apply to all sensors. An example is the time of day at which an image was taken. This can strongly affect what your image looks like, as shadows and illumination levels can vary greatly by time of day. This, however, does not really affect missions like Sentinel-2, where the orbit has been selected in such a way that the mean local solar time when the image is taken is always approximately 10:30 am, leading (by design) to very consistent illumination levels. Similar arguments go for viewing angles. One of the ways that people have been trying to make GFMs more general is to train foundation models on multiple sources. Examples of this include the Clay foundation model (at least V1), which was trained on a wide range of sensors, from MODIS with a 500 m GSD to aerial imagery of New Zealand at under 50 cm GSD. Another example is DOFA, which takes a similar approach to variety in input sensors, this time including hyperspectral data with 224 spectral bands. The DOFA paper is worth a read, and putting on my scientist hat: this is really interesting work, and it’s fascinating to see the different solutions these authors have come up with to make a single model deal with such varied inputs. But putting my engineer hat back on, I have to ask: what do we gain from this?
One of the points made in the DOFA paper: The increasing number of specialized foundation models makes it difficult to select the most appropriate one for a specific downstream task. On the surface this sounds fair, but is it really that hard to go to PapersWithCode, find the dataset most similar to your downstream task, and select a model based on that? I can’t really think of a scenario where you would not just spend a day or two searching through the literature for the most fitting model for your particular use case. The one case where I can see this applying is if you are a geospatial person with no ML skills, you have a model that was set up for you as a black box behind some interface, and squeezing every last bit of performance out is not critical to you. When implementing a model for a specific product need, one often focuses on a specific sensor, or at least a set of very similar sensors, e.g. sub-meter GSD sensors with at least the four usual spectral bands. When building a product that will utilize exclusively Sentinel-1 data, does the model really gain anything from being trained on Sentinel-2 and aerial imagery as well? With all that being said, if you do have multiple sensors available at inference time (e.g. Sentinel-2 and Sentinel-1 data), it does seem to make sense to train/infer on multiple modalities at once. See, e.g., table 2 in the TerraMind paper. A while ago we were testing out a few foundation models as backbones for a product we are developing, which boils down to bi-temporal change detection using Planet’s SkySat constellation. We chose the main backbone we are using based on benchmarks, but I did have the nagging question of how much we really gain from this, and whether other backbones might offer better performance. This was basically the theme of my talk at the aforementioned ESA-NASA workshop. I ran a few tests using a variety of FM backbones, some trained on remote sensing data and some just on natural images, just to see how much the pre-training dataset really matters. To make the comparison fair, the backbones used all had around 100 M parameters, but I did throw in a smaller baseline (ChangeFormer), as well as a 300 M version of the best-performing network, just to see if size matters (spoiler: it does). One of the most interesting comparisons here is DINOv2: I used two variations, one using the original weights trained on natural images from Meta, and another with weights from Keumgang Cha, which were trained on the MillionAID and SkyScript datasets. MillionAID is exclusively aerial imagery, while SkyScript contains mostly aerial imagery, plus some SkySat, Sentinel-2, and Landsat images. It’s abundantly clear that the same architecture trained on remote sensing images greatly improves downstream performance compared to the variant that was trained on natural images. This is expected, but it’s impressive to see how large the gap is. The best model we tested was trained mostly on aerial imagery, suggesting the domain gap isn’t so much about whether your sensor is in space or on a plane, but more about whether the resolutions are similar. The models were all trained for the same number of epochs, on the same relatively small dataset (around 1600 patches of 512 x 512 pixels), with the same optimizer, etc. The encoders were not frozen, but trained with a lower learning rate than the decoders, as is common practice in most transfer learning scenarios.
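As a concrete illustration of that last point, here is a minimal PyTorch sketch of the encoder/decoder learning-rate split; the module names and learning-rate values are placeholders I chose for illustration, not the actual configuration used in the experiments above.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for a pretrained FM backbone (encoder)
# and a task head (decoder) trained from scratch.
model = nn.ModuleDict({
    "encoder": nn.Linear(768, 768),
    "decoder": nn.Linear(768, 2),
})

# The encoder gets a lower learning rate so the pretrained features are only
# gently adapted, while the randomly initialized decoder learns faster.
optimizer = torch.optim.AdamW(
    [
        {"params": model["encoder"].parameters(), "lr": 1e-5},
        {"params": model["decoder"].parameters(), "lr": 1e-4},
    ],
    weight_decay=0.01,
)
```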
I will caveat all of this by saying that I didn’t do a massive amount of hyperparameter tuning for this test, but I think the differences are significant enough that more tuning probably wouldn’t change the overall picture much. What I would need to see to be convinced is that a foundation model trained on multiple sensors performs better on downstream tasks for each of those sensors than it would if it were trained exclusively on the specific sensor to be used. I.e., one would need to show that the model is more than the sum of its parts. The question is pretty much: given the same architecture, compute budget, and dataset size, can a model learn something from one sensor that improves its performance on another? Or could it be that we need to throw everything into a big bucket and burn a lot of compute, in the fashion of the current crop of big LLMs, in order to really see generalization? I’m definitely not ruling out the possibility that there might be some cases where it helps (e.g. the sensor you are targeting doesn’t have a lot of data available), but I have the feeling that the further away you go in GSD and spectral characteristics, the less helpful pre-training becomes. It’s fairly obvious that the best GFM you can choose will likely be the one trained on the exact sensor you are targeting for your downstream task. This is fairly easy for sensors like the Sentinel missions or the Landsat missions, where anyone with a bit of compute to burn can easily download tons of data from those missions and train a model. Even for aerial imagery there is a lot of open data available, with the caveat that the data is not as global, and aerial sensors do have some sensor-to-sensor variability. Where this gets tricky is in the commercial domain, where data isn’t freely available and providers put strict licenses on their data 1 . Training a foundation model on commercial data requires you to spend somewhere between hundreds of thousands and millions of euros on data alone, which is infeasible for most researchers and a significant investment for most companies. The only case I know of so far of someone creating a sensor-specific foundation model is a Pleiades Neo foundation model created by Disaitek, which was made possible by being granted access to Pleiades Neo imagery through a “Call for Innovation” from Airbus and CNES. Disaitek of course does not make this model public, as this presumably gives them a bit of an edge over their competitors, and as the model was trained on data covering France only, it is questionable how much use it would be in other parts of the world. So what can be done in the commercial space? Most companies don’t have access to enough data to easily train a foundation model, and those who do are unlikely to share it, as it gives them an edge over their competition. The only players with both access to the data and the incentive to make these models available to others are the imagery providers themselves, i.e. Planet, Airbus, Maxar, BlackSky, Capella, etc. Do I think these providers will just open these models for all to use? I doubt it, but they might offer them as a perk to their customers, i.e. something along the lines of “buy at least X euros worth of imagery per year and get access to our FM”. The competition in the 30 cm-class imagery space seems to be heating up, with several players building up large constellations of satellites in this resolution range, like Maxar’s Legion, Planet’s Pelican and BlackSky’s Gen-3.
One way these providers could differentiate their offerings would be by offering a foundation model trained on their specific sensor. Whether I think it’s likely that they do this is another question. Please take this post for what it is: the opinionated rant of someone who works in a somewhat privileged niche of the EO domain where I have a lot of expensive VHR data to play with. The problems I am trying to solve and the constraints I have are likely quite different from those that others might encounter. With that being said, if you find yourself in a similar boat to me and are wondering which foundation model to pick for your particular task: pick the one trained on the closest thing you can find to the sensor you are targeting. I am kind of hoping that someone proves me wrong, and I will happily write an apology post if someone does so. The one exception here is Umbra, who have a very generous open data program, and probably have enough data there that anyone can just train a decently sized model on their data.↩︎

Ahead of AI 5 months ago

The State of Reinforcement Learning for LLM Reasoning

A lot has happened this month, especially with the releases of new flagship models like GPT-4.5 and Llama 4. But you might have noticed that reactions to these releases were relatively muted. Why? One reason could be that GPT-4.5 and Llama 4 remain conventional models, which means they were trained without explicit reinforcement learning for reasoning. Meanwhile, competitors such as xAI and Anthropic have added more reasoning capabilities and features to their models. For instance, both the xAI Grok and Anthropic Claude interfaces now include a "thinking" (or "extended thinking") button for certain models that explicitly toggles reasoning capabilities. In any case, the muted response to the GPT-4.5 and Llama 4 (non-reasoning) models suggests we are approaching the limits of what scaling model size and data alone can achieve. However, OpenAI's recent release of the o3 reasoning model demonstrates there is still considerable room for improvement when investing compute strategically, specifically via reinforcement learning methods tailored for reasoning tasks. (According to OpenAI staff during the recent livestream, o3 used 10× more training compute compared to o1.) Source: OpenAI livestream (https://openai.com/live/) on April 16, 2025. While reasoning alone isn't a silver bullet, it reliably improves model accuracy and problem-solving capabilities on challenging tasks (so far). And I expect reasoning-focused post-training to become standard practice in future LLM pipelines. So, in this article, let's explore the latest developments in reasoning via reinforcement learning. This article focuses on the reinforcement learning training methods used to develop and improve reasoning models. Because it is a relatively long article, I am providing a Table of Contents overview below. To navigate the table of contents, please use the slider on the left-hand side in the web view.
Understanding reasoning models
RLHF basics: where it all started
A brief introduction to PPO: RL's workhorse algorithm
RL algorithms: from PPO to GRPO
RL reward modeling: from RLHF to RLVR
How the DeepSeek-R1 reasoning models were trained
Lessons from recent RL papers on training reasoning models
Noteworthy research papers on training reasoning models
Tip: If you are already familiar with reasoning basics, RL, PPO, and GRPO, please feel free to jump ahead to the “Lessons from recent RL papers on training reasoning models” section, which contains summaries of interesting insights from recent reasoning research papers. The big elephant in the room is, of course, the definition of reasoning. In short, reasoning is about inference and training techniques that make LLMs better at handling complex tasks. To provide a bit more detail on how this is achieved (so far), I'd like to define reasoning as follows: Reasoning, in the context of LLMs, refers to the model's ability to produce intermediate steps before providing a final answer. This is a process that is often described as chain-of-thought (CoT) reasoning. In CoT reasoning, the LLM explicitly generates a structured sequence of statements or computations that illustrate how it arrives at its conclusion. And below is a figure along with the definition. A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation. 
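To make the definition a bit more tangible, here is a purely illustrative, made-up example of what such a chain-of-thought response can look like, and how an application might choose to show only the final answer to the user (the format and the splitting logic are assumptions for illustration, not any specific model's output format):

```python
# A made-up chain-of-thought style response: intermediate steps, then a final answer.
cot_response = (
    "Step 1: Distance = speed * time.\n"
    "Step 2: 60 km/h * 2.5 h = 150 km.\n"
    "Final answer: 150 km"
)

# Depending on the implementation, only the text after the final-answer marker
# might be shown to the user, while the intermediate steps stay hidden.
visible_to_user = cot_response.split("Final answer:")[-1].strip()
print(visible_to_user)  # 150 km
```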
If you are new to reasoning models and would like a more comprehensive introduction, I recommend my previous articles. Now, as hinted at the beginning of this section, the reasoning abilities of LLMs can be improved in two ways, as nicely illustrated in a figure from an OpenAI blog post: accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/ In my previous article, I solely focused on the test-time compute methods. In this article, I finally want to take a closer look at the training methods. The reinforcement learning (RL) training methods used to build and improve reasoning models are more or less related to the reinforcement learning with human feedback (RLHF) methodology that is used to develop and align conventional LLMs. So, I want to start with a small recap of how RLHF works before discussing the reasoning-specific modifications of RL-based training. Conventional LLMs typically undergo a 3-step training procedure:
1. Pre-training
2. Supervised fine-tuning
3. Alignment (typically via RLHF)
The "original" LLM alignment method is RLHF, which has been part of the standard repertoire when developing LLMs since the InstructGPT paper, which described the recipe that was used to develop the first ChatGPT model. The original goal of RLHF is to align LLMs with human preferences. For instance, suppose you prompt an LLM multiple times so that it generates several answers for a given prompt. RLHF guides the LLM towards generating more of the style of answer that you prefer. (Often, RLHF is also used to safety-tune LLMs: to avoid sharing sensitive information, using swear words, and so on.) If you are new to RLHF, here is an excerpt from a talk I gave a few years ago that explains RLHF in less than 5 minutes. Alternatively, the paragraphs below describe RLHF in text form. The RLHF pipeline takes a pre-trained model and fine-tunes it in a supervised fashion. This fine-tuning is not the RL part yet but is mainly a prerequisite. Then, RLHF further aligns the LLM using an algorithm called proximal policy optimization (PPO). (Note that there are other algorithms that can be used instead of PPO; I specifically mention PPO because that's what was originally used in RLHF and is still the most popular one today.) For simplicity, we will look at the RLHF pipeline in three separate steps:
RLHF Step 1 (prerequisite): Supervised fine-tuning (SFT) of the pre-trained model
RLHF Step 2: Creating a reward model
RLHF Step 3: Fine-tuning via proximal policy optimization (PPO)
RLHF Step 1, shown below, is a supervised fine-tuning step to create the base model for further RLHF fine-tuning. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 In RLHF Step 1, we create or sample prompts (from a database, for example) and ask humans to write good-quality responses. We then use this dataset to fine-tune the pre-trained base model in a supervised fashion. As mentioned before, this is not technically part of RL training but merely a prerequisite. In RLHF Step 2, we then use this model from supervised fine-tuning (SFT) to create a reward model, as shown below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 As depicted in the figure above, for each prompt, we generate four responses from the fine-tuned LLM created in the prior step. 
Human annotators then rank these responses based on their preferences. Although this ranking process is time-consuming, it might be somewhat less labor-intensive than creating the dataset for supervised fine-tuning. This is because ranking responses is likely simpler than writing them. Upon compiling a dataset with these rankings, we can design a reward model that outputs a reward score for the subsequent optimization stage in RLHF Step 3. The idea here is that the reward model replaces and automates the labor-intensive human ranking to make the training feasible on large datasets. This reward model (RM) generally originates from the LLM created in the prior supervised fine-tuning (SFT) step. To turn the model from RLHF Step 1 into a reward model, its output layer (the next-token classification layer) is substituted with a regression layer, which features a single output node. The third step in the RLHF pipeline is to use the reward model (RM) to fine-tune the previous model from supervised fine-tuning (SFT), which is illustrated in the figure below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155 In RLHF Step 3, the final stage, we are now updating the SFT model using proximal policy optimization (PPO) based on the reward scores from the reward model we created in RLHF Step 2. As mentioned earlier, the original RLHF method uses a reinforcement learning algorithm called proximal policy optimization (PPO). PPO was developed to improve the stability and efficiency of training a policy. (In reinforcement learning, “policy” just means the model we want to train; in this case, policy = LLM.) One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step. This is done using a clipped loss function, which helps prevent the model from making overly large updates that could destabilize training. On top of that, PPO also includes a KL divergence penalty in the loss. This term compares the current policy (the model being trained) to the original SFT model, which encourages the updates to stay reasonably close to it. The idea is to preference-tune the model, not to completely re-train it, after all. This is where the “proximal” in proximal policy optimization comes from: the algorithm tries to keep the updates close to the existing model while still allowing for improvement. And to encourage a bit of exploration, PPO also adds an entropy bonus, which encourages the model to vary its outputs during training. In the following paragraphs, I want to introduce some more terminology to illustrate PPO on a relatively high level. Still, there's a lot of jargon involved, so I tried to summarize the key terminology in the figure below before we continue. Illustration of the key terms in RLHF. For instance, several models are involved in PPO, where PPO is an algorithm used in RLHF (and RLHF is one of the most popular LLM alignment methods). Below, I aim to illustrate the key steps in PPO via pseudo-code. In addition, to make it more intuitive, I will also use an analogy: Imagine you are a chef running a small food delivery service. And you are constantly trying out new recipe variations to improve customer satisfaction. Your overall goal is to tweak your recipe (policy) based on customer feedback (reward).
1. Compute the ratio of the next-token probabilities from the new vs. the old policy: ratio = new_policy_prob / old_policy_prob. In short, this checks how different our new recipe is from the old one. Side note: Regarding "new_policy_prob", we are not using the final updated policy yet. We are using the current version of the policy (i.e., the model we are in the middle of training). However, it's a convention to call it "new". So, even though you're still experimenting, we call your current draft the "new policy" as per convention.
2. Multiply that ratio by how good the action was (called the advantage): raw_score = ratio * advantage. Here, for simplicity, we may assume the advantage is computed based on the reward signal: advantage = actual_reward - expected_reward. In the chef analogy, we can think of the advantage as how well the new dish performed: For example, if a customer rates the new dish with a 9/10, and the customers normally give us a 7/10, that's a +2 advantage. Note that this is a simplification. In reality, this involves generalized advantage estimation (GAE), which I am omitting here so as not to bloat the article further. However, one important detail to mention is that the expected reward is computed by a so-called "critic" (sometimes also called "value model"), and a reward model computes the actual reward. I.e., the advantage computation involves 2 other models, typically the same size as the original model we are fine-tuning. In the analogy, we can think of this critic or value model as a friend we ask to try our new dish before serving it to the customers. We also ask our friend to estimate how a customer would rank it (that's the expected reward). The reward model is then the actual customer who gives the feedback (i.e., the actual reward).
3. Compute a clipped score: If the new policy changes too much (e.g., ratio > 1.2 or < 0.8), we clip the ratio: clipped_score = clip(ratio, 0.8, 1.2) * advantage. In the analogy, imagine that the new recipe got an exceptionally great (or bad) review. We might be tempted to overhaul the entire menu now. But that's risky. So, instead, we clip how much our recipe can change for now. (For instance, maybe we made the dish much spicier, and that one customer happened to love spicy food, but that doesn't mean everyone else will.)
4. Then we use the smaller of the raw score and the clipped score: score = min(raw_score, clipped_score). Again, this is related to being a bit cautious. For instance, if the advantage is positive (the new behavior is better), we cap the reward. That's because we don't want to over-trust a good result that might be a coincidence or luck. If the advantage is negative (the new behavior is worse), we don't want to overreact to one bad result unless we are really sure. In short, taking the minimum caps the reward when the advantage is positive (to avoid over-rewarding), and, when the advantage is negative, it removes the incentive to move even further away from the old policy within a single update (to avoid overreacting to one bad result). In the analogy, this ensures that if a recipe is doing better than expected, we don't over-reward it unless we are confident. And if it's underperforming, we don't over-penalize it unless it's consistently bad.
5. Calculating the loss: This final score is what we maximize during training (using gradient descent after flipping the sign of the score to minimize). In addition, we also add a KL penalty term, where β is a hyperparameter for the penalty strength: loss = -score + β * KL(new policy || original policy). In the analogy, we add the penalty to ensure new recipes are not too different from our original style. This prevents you from "reinventing the kitchen" every week. For example, we don't want to turn an Italian restaurant into a BBQ place all of a sudden. 
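To tie the five steps together, here is a compact numeric sketch in plain PyTorch. All numbers are made up for illustration, the advantage is treated as given (in practice it would come from the reward model and the critic via GAE, as noted above), and the entropy bonus is omitted; think of this as a simplified rendering of the clipped objective plus KL penalty, not a full training loop.

```python
import torch

# Toy numbers for a single token/action (made up for illustration).
new_logprob = torch.tensor(-1.20, requires_grad=True)  # log prob under the current ("new") policy
old_logprob = torch.tensor(-1.50)                      # log prob under the old policy
ref_logprob = torch.tensor(-1.40)                      # log prob under the frozen reference (original SFT) model
advantage   = torch.tensor(2.0)                        # e.g., reward 9/10 vs. expected 7/10 -> +2

eps  = 0.2    # clipping range: ratios outside [0.8, 1.2] get clipped
beta = 0.01   # strength of the KL penalty

# 1) probability ratio between the new and old policy
ratio = torch.exp(new_logprob - old_logprob)

# 2) raw score and 3) clipped score
raw_score     = ratio * advantage
clipped_score = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage

# 4) take the more conservative of the two
score = torch.min(raw_score, clipped_score)

# 5) add a KL penalty that keeps the policy close to the reference model
#    (crudely approximated here by the per-token log-prob difference)
kl_penalty = beta * (new_logprob - ref_logprob)

# flip the sign because optimizers minimize
loss = -(score - kl_penalty)
loss.backward()
print(ratio.item(), score.item(), loss.item())
```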
This was a lot of information, so I summarized it with a concrete, numeric example in an LLM context via the figure below. But please feel free to skip it if it's too complicated; you should be able to follow the rest of the article just fine. I admit that I may have gone overboard with the PPO walkthrough. But once I had written it, it was hard to delete it. I hope some of you will find it useful! That being said, the main takeaway that will be relevant in the next section is that there are multiple models involved in PPO:
1. The policy, which is the LLM that has been trained with SFT and that we want to further align.
2. The reward model, which is a model that has been trained to predict the reward (see RLHF Step 2).
3. The critic, which is a trainable model that estimates the expected reward.
4. A reference model (the original policy) that we use to make sure that the policy doesn't deviate too much.
By the way, you might wonder why we need both a reward model and a critic model. The reward model is usually trained before training the policy with PPO. It automates the preference labeling otherwise done by human judges, and it gives a score for the complete responses generated by the policy LLM. The critic, in contrast, judges partial responses: we use it to estimate the expected reward while the response is still being generated. While the reward model typically remains frozen, the critic model is updated during training so that it better estimates the rewards produced by the reward model. More details about PPO are outside the scope of this article, but interested readers can find the mathematical details in these four papers that predate the InstructGPT paper:
(1) Asynchronous Methods for Deep Reinforcement Learning (2016) by Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu introduces policy gradient methods as an alternative to Q-learning in deep learning-based RL.
(2) Proximal Policy Optimization Algorithms (2017) by Schulman, Wolski, Dhariwal, Radford, and Klimov presents a modified proximal policy-based reinforcement learning procedure that is more data-efficient and scalable than the vanilla policy optimization algorithm above.
(3) Fine-Tuning Language Models from Human Preferences (2020) by Ziegler, Stiennon, Wu, Brown, Radford, Amodei, Christiano, and Irving applies the concept of PPO and reward learning to pretrained language models, including KL regularization to prevent the policy from diverging too far from natural language.
(4) Learning to Summarize from Human Feedback (2022) by Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, and Christiano introduces the popular RLHF three-step procedure that was later also used in the InstructGPT paper.
As mentioned before, PPO was the original algorithm used in RLHF. From a technical standpoint, it works perfectly fine in the RL pipeline that's being used to develop reasoning models. However, what DeepSeek-R1 used for their RL pipeline is an algorithm called Group Relative Policy Optimization (GRPO), which was introduced in one of their earlier papers: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024). The DeepSeek team introduced GRPO as a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO. So, the key motivation here is to improve computational efficiency. The efficiency improvements are achieved by dropping the "critic" (value model), i.e., the LLM that computes the value function (i.e., the expected future reward). 
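To preview what that looks like in practice (the prose explanation follows right below), here is a minimal sketch of the group-relative advantage computation; the reward values are made up, and the full GRPO objective additionally keeps the PPO-style ratio, clipping, and KL terms.

```python
import torch

# Made-up rewards for a group of G = 6 answers, all sampled for the *same* prompt
# (e.g., 1.0 = verifiably correct, 0.0 = wrong).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])

# GRPO: each sampled answer's advantage is its reward relative to the group,
# so no separate critic (value model) is needed.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# The "Dr. GRPO" variant discussed later in this article drops the division by
# the standard deviation (and the response-length normalization in the loss):
dr_grpo_advantages = rewards - rewards.mean()

print(advantages)
print(dr_grpo_advantages)
```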
Instead of relying on this additional model to estimate the expected reward for the advantage computation, GRPO takes a simpler approach: it samples multiple answers from the policy model itself and uses their relative quality to compute the advantages. To illustrate the differences between PPO and GRPO, I borrowed a nice figure from the DeepSeekMath paper: Annotated figure from DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (https://arxiv.org/abs/2402.03300) to illustrate the differences between PPO and GRPO. So far, we looked at RLHF as a procedure, and we have introduced two reinforcement learning algorithms commonly used for it: PPO and GRPO. But if RLHF is already a core part of the LLM alignment toolkit, what does any of this have to do with reasoning? The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models. The difference is that instead of relying on human preferences and training a reward model, the DeepSeek-R1 team used verifiable rewards. This approach is called reinforcement learning with verifiable rewards (RLVR). Again, it's worth emphasizing: In contrast to standard RLHF, RLVR bypasses the need for a reward model. So, rather than learning what counts as a "good" answer from human-labeled examples, the model gets direct binary feedback (correct or wrong) from a deterministic tool, such as symbolic verifiers or rule-based tools. Think calculators for math problems or compilers for code generation. Example of reinforcement learning with verifiable rewards (RLVR). The model is prompted to solve a math problem and produces an answer. Instead of using a learned reward model, a symbolic verifier (e.g., a calculator) checks the output and provides binary feedback based on correctness. One motivation here is to avoid noisy or expensive human or learned rewards by using automatic correctness checks as supervision signals during RL. The other motivation is that by using "cheap" tools like calculators, we can replace the expensive reward model training and the reward model itself. Since the reward model is usually as large as the pre-trained model itself (just with a regression head), RLVR is much more efficient. So, in short, DeepSeek-R1 used RLVR with GRPO, which eliminates two expensive models in the training procedure: the reward model and the value model (critic), as illustrated in the figure below. Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by also removing the reward model, relying instead on verifiable rewards from symbolic tools like calculators or compilers. In the next section, I want to briefly go over the DeepSeek-R1 pipeline and discuss the different verifiable rewards that the DeepSeek team used.
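Before getting to the specific rewards DeepSeek used, here is a minimal sketch of what a verifiable reward can look like for a math problem with a known answer. The answer-extraction logic (grabbing the last number in the completion) and the reward values are simplifications I'm assuming for illustration; real setups typically enforce a stricter answer format, such as the boxed answers and <think> tags described next.

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Binary, rule-based reward: 1.0 if the last number in the completion
    matches the ground-truth answer, else 0.0. No learned reward model involved."""
    numbers = re.findall(r"-?\d+\.?\d*", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

def format_reward(completion: str) -> float:
    """Optional extra reward (value chosen arbitrarily here) for wrapping the
    reasoning in <think>...</think> tags."""
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

completion = "<think>60 * 2.5 = 150</think> The answer is 150"
print(accuracy_reward(completion, "150"))  # 1.0
print(format_reward(completion))           # 0.5
```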
Now that we have clarified what RLHF and RLVR are, as well as PPO and GRPO, let's briefly recap the main insights from the DeepSeek-R1 paper in the context of RL and reasoning. First, there were three types of models:
1. DeepSeek-R1-Zero, trained with pure RL
2. DeepSeek-R1, trained with instruction fine-tuning (SFT) and RL
3. DeepSeek-Distill variants, created via instruction fine-tuning (SFT) without RL
I created a DeepSeek-R1 pipeline diagram to illustrate how these models relate to each other, as shown below. Training pipeline for the DeepSeek-R1 family. DeepSeek-R1-Zero was trained using reinforcement learning with verifiable rewards (RLVR) and GRPO, and this turned out to be sufficient for the model to exhibit reasoning abilities via intermediate-step generation. This showed that it's possible to skip the SFT stage: the model improves its reasoning abilities through exploration instead of learning from examples. DeepSeek-R1 is the flagship model, the one with the best performance. The difference compared to DeepSeek-R1-Zero is that they alternated instruction fine-tuning, RLVR, and RLHF. The DeepSeek-Distill variants are meant to be smaller and more easily deployable models; they were generated by instruction fine-tuning Llama 3 and Qwen 2.5 models using instruction data from the DeepSeek-R1 model. This approach didn't use any RL for the reasoning part (however, RLHF was used to create the Llama 3 and Qwen 2.5 base models). For more details on the DeepSeek-R1 pipeline, please see my previous article "Understanding Reasoning LLMs". The main takeaway here is that the DeepSeek team didn't use an LLM-based reward model to train DeepSeek-R1-Zero. Instead, they used rule-based rewards for the reasoning training of DeepSeek-R1-Zero and DeepSeek-R1: We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process [...] To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: (1) Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases. (2) Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between '<think>' and '</think>' tags. I realize that the introduction (i.e., everything up to this point) turned out to be much longer than I expected. Nonetheless, I think that this lengthy introduction is perhaps necessary to put the following lessons into context. After going through a large number of recent papers on reasoning models last month, I have put together a summary of the most interesting ideas and insights in this section. (References like "[1]" point to the corresponding papers listed at the end of the article.) The original DeepSeek-R1 paper demonstrated clearly that supervised fine-tuning (SFT) followed by reinforcement learning (RL) outperforms RL alone. Given this observation, it's intuitive that additional RL should further improve distilled models (as distilled models essentially represent models trained via SFT using reasoning examples generated by a larger model). Indeed, the DeepSeek team observed this phenomenon explicitly: Additionally, we found that applying RL to these distilled models yields significant further gains. 
We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here. Several teams independently verified these observations: [8] Using the 1.5B DeepSeek-R1-Distill-Qwen model, researchers demonstrated substantial performance improvements from RL fine-tuning with just 7,000 examples and a modest $42 compute budget. Impressively, this small model surpassed OpenAI’s o1-preview on the AIME24 math benchmark. [15] However, another team cautioned that these gains might not always be statistically significant. This suggests that, although RL can improve smaller distilled models, the benchmark results might sometimes overstate the improvements. Annotated figure from A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086 I previously mentioned that RL with verifiable rewards (RLVR) does not strictly require the GRPO algorithm; DeepSeek's GRPO simply happens to be efficient and to perform well. However, [12] showed that vanilla PPO paired with a basic binary correctness reward was sufficient to scale models in reasoning capability and response length. More interestingly, both PPO and GRPO have a length bias. And several papers explored methods to tackle excessively long incorrect answers: [14] provided an analysis illustrating how PPO inadvertently favors longer responses due to mathematical biases in the loss calculation; GRPO may suffer from the same issue. Annotated figure from Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185 As a follow-up to the statement above, [7] and [10] specifically identified length and difficulty-level biases in GRPO. The modified variant "Dr. GRPO" simplifies the advantage calculation by removing length and standard deviation normalization, providing clearer training signals. [1] explicitly penalized lengthy incorrect answers in GRPO while rewarding concise, correct ones. [3] and [6] didn’t directly control response length in GRPO but found token-level rewards beneficial, allowing models to better focus on critical reasoning steps. [5] introduced explicit penalties in GRPO for responses exceeding specific lengths, enabling precise length control during inference. Beyond the "aha moments" mentioned in the DeepSeek-R1 paper, RL has been shown to induce valuable self-verification and reflective reasoning capabilities in models [2] [9]. Interestingly, similar to the aha moment, these capabilities emerged naturally during training without explicit instruction. [1] showed that extending context lengths (up to 128k tokens) further improves the model's self-reflection and self-correction capabilities. Most research efforts so far have focused on reasoning tasks in math or coding contexts. However, [4] demonstrated successful generalization by training models on logic puzzles. And models trained on logic puzzles also achieved strong performance in mathematical reasoning tasks. This is evidence for RL's ability to induce general reasoning behaviors independent of specific domain knowledge. As a follow-up to the section above, another interesting insight [11] is that reasoning capabilities can naturally extend beyond structured domains like math, code, and logic. Models successfully applied reasoning to areas including medicine, chemistry, psychology, economics, and education, leveraging generative soft-scoring methods to effectively handle free-form answers. 
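As a side note on the length-bias point above ([14]; the Concise Reasoning paper is summarized in more detail further down), here is a toy numeric illustration of the underlying effect: if the sequence-level penalty is averaged over the number of generated tokens, a fixed negative reward gets diluted across longer responses. The numbers are made up, and this only illustrates the direction of the bias, not any paper's exact loss.

```python
# Toy illustration: same negative reward, different response lengths.
# If per-token losses are averaged over the response length, the penalty
# per token shrinks as the (wrong) answer gets longer.
reward = -1.0

for num_tokens in (10, 100, 1000):
    per_token_penalty = reward / num_tokens
    print(f"{num_tokens:>5} tokens -> average per-token penalty {per_token_penalty:+.4f}")

# Longer incorrect responses "hurt less" per token, which indirectly
# encourages rambling answers when the model is likely to be wrong.
```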
Notable next steps for reasoning models include integrating existing reasoning models (e.g., o1, DeepSeek-R1) with capabilities such as external tool use and retrieval-augmented generation (RAG); the just-released o3 model from OpenAI paves the way here. Speaking of tool use and search, [9] showed that giving reasoning models the ability to search induces behaviors such as self-correction and robust generalization across benchmarks, despite minimal training datasets. Based on the hoops the DeepSeek-R1 team had to jump through to maintain performance on knowledge-based tasks, I believe adding search abilities to reasoning models is almost a no-brainer. The fundamental claim behind DeepSeek-R1 (and R1-Zero) is that RLVR explicitly induces reasoning capabilities. However, recent findings [10] suggest that reasoning behaviors, including the "aha moment," might already be present in base models due to pre-training on extensive chain-of-thought data. My recent comparisons between the DeepSeek V3 base model and R1 reinforce this observation, as the updated base model also demonstrates reasoning-like behaviors. For instance, the comparison between the original V3 and R1 models clearly shows the difference between a non-reasoning and a reasoning model. However, this is no longer true when comparing the updated V3 base model to R1. Additionally, [13] identified that self-reflection and self-correction behaviors emerge progressively throughout pre-training across various domains and model sizes. This further complicates the attribution of reasoning capabilities solely to RL methods. Perhaps the conclusion is that RL reliably turns base models into reasoning models. However, it's not the only way to induce or improve reasoning abilities. As the DeepSeek-R1 team showed, distillation also improves reasoning. And since distillation, in this paper, meant instruction fine-tuning on chain-of-thought data, it's likely that pre-training on data that includes chain-of-thought examples induces these abilities as well. (As I explained in my book through hands-on code, pre-training and instruction fine-tuning are based on the same next-token prediction task and loss functions, after all.) After reading through a large number of reasoning papers last month, I tried to summarize the most interesting takeaways in the previous section. However, for those who are curious about the sources in a bit more detail, I also listed 15 relevant papers in this section below as an optional read. (For simplicity, the following summaries are sorted by date.) Please note that this list is also not comprehensive (I capped it at 15), as this article is already more than long enough! 📄 22 Jan, Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs/2501.12599 It's interesting that this paper came out the same day as the DeepSeek-R1 paper! Here, the authors showcase a multi-modal LLM trained with RL. Similar to DeepSeek-R1, they didn't use process reward models (PRMs) but employed verifiable rewards. A PRM is a type of reward model used in RL (especially in LLM training) that evaluates not just the final answer but also the reasoning steps that led to it. Another key idea here is that scaling the context length (up to 128k tokens) helps the model plan, reflect, and self-correct during reasoning. So, in addition to a correctness reward similar to DeepSeek-R1's, they also have a length reward. Specifically, they promote shorter correct responses, and incorrect long answers get penalized more. 
And they propose a method called long2short to distill these long-chain-of-thought skills into more efficient short-CoT models. (It does this by distilling shorter correct responses from the long-CoT model using methods like model merging, shortest rejection sampling, DPO, and a 2nd round of RL with stronger length penalties.) Annotated figure from Kimi k1.5: Scaling Reinforcement Learning with LLMs, https://arxiv.org/abs//2501.12599 📄 3 Feb, Competitive Programming with Large Reasoning Models , https://arxiv.org/abs/2502.06807 This paper from OpenAI evaluates their o-models (like o1, o1-ioi, and o3) on competitive programming tasks. While it doesn't go into the technical details of how RL was applied, it still offers some interesting takeaways. First, the models were trained using outcome-based RL, rather than process-based reward models. This is similar to approaches like DeepSeek-R1 and Kimi. One of the interesting findings is that o3 can learn its own test-time (i.e., inference-time scaling) strategies. For example, it often writes a simple brute-force version of a problem (something that trades efficiency for correctness) and then uses it to verify the outputs of its more optimized solution. This kind of strategy wasn't hand-coded; the model figured it out on its own. So overall, the paper argues that scaling general-purpose RL allows models to develop their own reasoning and verification methods, without needing any human heuristics or domain-specific inference pipelines. In contrast, other (earlier) models like o1-ioi relied on handcrafted test-time strategies like clustering thousands of samples and reranking them, which required a lot of manual design and tuning. Annotated figure from Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 [3] Exploring the Limit of Outcome Reward 📄 10 Feb, Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning , https://arxiv.org/abs/2502.06781 This paper explores how far RL with just binary "correct" or "wrong" feedback (like in DeepSeek-R1) can go for solving math problems. To do this, they start by using Best-of-N sampling to collect positive examples and apply behavior cloning on them, which they show is theoretically enough to optimize the policy. To deal with the challenge of sparse rewards (especially when long chains of thought include partially correct steps) they add a token-level reward model that learns to assign importance weights to different parts of the reasoning. This helps the model focus on the most critical steps when learning and improves the overall performance. Annotated figure from Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning, https://arxiv.org/abs/2502.06781 📄 20 Feb, Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning , https://arxiv.org/abs/2502.14768 DeepSeek-R1 focused on math and code tasks. This paper trains a 7B model using logic puzzles as the main training data. The researchers adopt a similar rule-based RL setup as DeepSeek-R1 but make several adjustments: 1. They introduce a strict format reward that penalizes shortcuts and ensures the model separates its reasoning from its final answer using <think> and <answer> tags. 2. They also use a system prompt that explicitly tells the model to first think through the problem step-by-step before giving the final answer. Even with only 5K synthetic logic problems, the model develops good reasoning skills that generalize well to harder math benchmarks like AIME and AMC. 
This is particularly interesting because it shows that logic-based RL training can teach models to reason in ways that transfer beyond the original domain. Annotated figure from Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, https://arxiv.org/abs/2502.14768 📄 6 Mar, L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, https://arxiv.org/abs/2503.04697 One hallmark of reasoning models is that they tend to generate longer outputs because of chain-of-thought reasoning. But by default, there is no explicit way to control how long the responses are. This paper introduces Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that helps models adhere to user-specified length constraints while still optimizing for accuracy. In short, LCPO is similar to GRPO, i.e., "GRPO + Custom Reward for Length Control": the reward combines a correctness term with a penalty that grows with the deviation from a target length, where the target length is provided as part of the user prompt. This LCPO method encourages the model to adhere to the provided target length exactly. In addition, they also introduce an LCPO-Max variant, which, instead of encouraging the model to match the target length exactly, encourages the model to stay below a maximum token length. The authors train a 1.5B model called L1 using LCPO, which can adjust its output length based on the prompt. This lets users trade off between accuracy and compute, depending on the task. Interestingly, the paper also finds that these long-chain models actually become surprisingly good at short reasoning too, even outperforming much larger models like GPT-4o at the same token lengths. Annotated figure from L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning, https://arxiv.org/abs/2503.04697 📄 10 Mar, R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 Reasoning models like DeepSeek-R1 that have been trained with RL rely on their internal knowledge. The authors here focus on improving these models on knowledge-based tasks that require more time-sensitive or recent information by adding access to external search systems. So, this paper improves these models by teaching them to use external search systems during the reasoning process. Instead of relying on test-time strategies or supervised training, the authors use a two-stage reinforcement learning method that helps the model learn how and when to search on its own. The model first learns the search format, and then learns how to use search results to find correct answers. Annotated figure from R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.05592 📄 18 Mar, DAPO: An Open-Source LLM Reinforcement Learning System at Scale, https://arxiv.org/abs/2503.14476 While this paper is mainly about developing a DeepSeek-R1-like training pipeline and open-sourcing it, it also proposes interesting improvements to the GRPO algorithm that was used in DeepSeek-R1 training. 1. Clip-higher: Increases the upper bound of the PPO clipping range to encourage exploration and prevent entropy collapse during training. 2. Dynamic sampling: Improves training efficiency by filtering out prompts where all sampled responses are either always correct or always wrong. 3. Token-level policy gradient loss: moves from sample-level to token-level loss calculation so that longer responses can have more influence on the gradient update.* 4. 
Overlong reward shaping: Adds a soft penalty for responses that get truncated for being too long, which reduces reward noise and helps stabilize training. * Standard GRPO uses a sample-level loss calculation. This involves first averaging the loss over the tokens for each sample and then averaging the loss over the samples. Since the samples have equal weight, the tokens in samples with longer responses may disproportionally contribute less to the overall loss. At the same time, researchers observed that longer responses often contain gibberish before the final answer, and this gibberish wouldn't be sufficiently penalized in the original GRPO sample-level loss calculation. Annotated figure from DAPO: An Open-Source LLM Reinforcement Learning System at Scale, https://arxiv.org/abs/2503.14476 📄 20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't , https://arxiv.org/abs/2503.16219 The original DeepSeek-R1 paper showed that when developing small(er) reasoning models, distillation gives better results than pure RL. In this paper, researchers follow up on this and investigate ways to improve small, distilled reasoning models further with RL. So, using the 1.5B DeepSeek-R1-Distill-Qwen model, they find that with only 7000 training examples and a $42 compute budget, RL fine-tuning can lead to strong improvements. In this case, the improvements are enough to outperform OpenAI's o1-preview on the AIME24 math benchmark, for example. Furthermore, there were 3 interesting learnings in that paper: 1. Small LLMs can achieve fast reasoning improvements within the first 50–100 training steps using a compact, high-quality dataset. But the performance quickly drops if training continues too long, mainly due to length limits and output instability. 2. Mixing easier and harder problems helps the model produce shorter, more stable responses early in training. However, performance still degrades over time. 3. Using a cosine-shaped reward function helps control output length more effectively and improves training consistency. But this slightly reduces peak performance compared to standard accuracy-based rewards. Annotated figure from Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't, https://arxiv.org/abs/2503.16219 📄 25 Mar, ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning , https://arxiv.org/abs/2503.19470 The ReSearch framework proposed in this paper extends the RL method from the DeepSeek-R1 paper to include search results as part of the reasoning process. The model learns when and how to search based on its ongoing reasoning chain, and it then uses the retrieved information for the next steps of reasoning. This is all done without supervised data on reasoning steps. The researchers also show that this approach can lead to useful behaviors like self-correction and reflection, and that it generalizes well across multiple benchmarks despite being trained on just one dataset. Annotated figure from ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning, https://arxiv.org/abs/2503.19470 PS: How does this method differ from the R1-Searcher discussed earlier? R1-Searcher uses a two-stage, outcome-based reinforcement learning approach. In the first stage, it teaches the model how to invoke external retrieval; in the second, it learns to use the retrieved information to answer questions. ReSearch, in contrast, integrates search directly into the reasoning process. 
It trains the model end-to-end using reinforcement learning, without any supervision on reasoning steps. Behaviors such as reflecting on incorrect queries and correcting them emerge naturally during training here.

📄 26 Mar, Understanding R1-Zero-Like Training: A Critical Perspective, https://arxiv.org/abs/2503.20783

This paper investigates why DeepSeek-R1-Zero's pure RL approach works to improve reasoning. The authors find that some base models like Qwen2.5 already show strong reasoning and even the "Aha moment" without any RL. So the "Aha moment" might not be induced by RL, but instead inherited from pre-training. This challenges the idea that RL alone is what creates deep reasoning behaviors.

The paper also identifies two biases in GRPO:

1. Response-length bias: GRPO divides each response's loss by the length of the response. This means long incorrect answers receive smaller per-token penalties, so the model learns to generate longer bad answers.
2. Difficulty-level bias: GRPO also normalizes by the standard deviation of the rewards for each question. Easy or hard questions (with low reward variance) get overweighted.

To fix this, the authors introduce Dr. GRPO, a modification of standard GRPO that removes the response-length normalization from the loss and the question-level standard deviation from the advantage computation. This results in more efficient training and fewer unnecessarily long answers; in particular, when the model is wrong, it is no longer implicitly encouraged to generate a long answer.

📄 31 Mar, Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains, https://arxiv.org/abs/2503.23829

DeepSeek-R1 and most other reasoning models that followed focused on reward signals from easily verifiable domains like code and math. This paper explores how to extend these methods to more complex areas like medicine, chemistry, psychology, economics, and education, where answers are usually free-form and harder to verify (beyond a simple correct/incorrect). The authors find that using expert-written reference answers makes evaluation more feasible than expected, even in these broader domains. To provide reward signals, they introduce a generative, soft-scoring method that doesn't need heavy domain-specific annotation. Annotated figure from Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains, https://arxiv.org/abs/2503.23829

📄 31 Mar, Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290

In this paper, the authors explore a minimalist reinforcement learning setup for training LLMs on reasoning tasks. They use vanilla PPO instead of GRPO (which was used in DeepSeek-R1-Zero) and skip the KL regularization commonly included in RLHF pipelines. Interestingly, they find that this simple setup (vanilla PPO and a basic binary reward function based on answer correctness) is sufficient to train models that scale up in both reasoning performance and response length. Using the same Qwen-32B base as DeepSeek-R1-Zero, their model outperforms it on multiple reasoning benchmarks while requiring only 1/10 the training steps.
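As a small illustration of the Dr. GRPO modification described above, here is a minimal sketch of how the advantage computation changes. This is my own toy example (not the authors' code), and the reward values are made up:

```python
import torch

# One group of sampled answers for an "easy" question: mostly correct,
# so the group's reward variance is low (made-up binary rewards).
rewards = torch.tensor([1.0, 1.0, 1.0, 0.0])

# GRPO: center by the group mean and divide by the group standard deviation.
grpo_advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Dr. GRPO: keep the centering but drop the standard-deviation normalization
# (the response-length normalization in the loss is dropped as well, not shown here).
dr_grpo_advantages = rewards - rewards.mean()

print(grpo_advantages)     # tensor([ 0.5,  0.5,  0.5, -1.5]); dividing by the small std inflates the signal
print(dr_grpo_advantages)  # tensor([ 0.25,  0.25,  0.25, -0.75]); stays on the scale of the raw rewards
```

The point of the sketch is that dividing by a small group standard deviation amplifies the advantages for questions with low reward variance (very easy or very hard ones), which is exactly the difficulty-level bias the paper describes.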
Annotated figure from Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, https://arxiv.org/abs/2503.24290

📄 5 Apr, Rethinking Reflection in Pre-Training, https://arxiv.org/abs/2504.04022

Based on the insights from the DeepSeek-R1 paper, namely that applying pure RL to a base model can induce reasoning behaviors, it is tempting to conclude that reasoning abilities in LLMs emerge from RL. This paper provides a bit of a plot twist, showing that self-correction already appears earlier, during pre-training. Concretely, by introducing deliberately flawed chains-of-thought into tasks, the authors measure whether models can identify and correct these errors. They find that both explicit and implicit forms of reflection emerge steadily throughout pre-training. This happens across many domains and model sizes. Even relatively early checkpoints show signs of self-correction, and the ability becomes stronger as pre-training compute increases. Annotated figure from Rethinking Reflection in Pre-Training, https://arxiv.org/abs/2504.04022

📄 7 Apr, Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185

As we all know by now, reasoning models often generate longer responses, which raises compute costs. This new paper shows that this behavior comes from the RL training process, not from an actual need for long answers for better accuracy. The RL loss tends to favor longer responses when the model gets negative rewards, which I think explains the "aha" moments and longer chains of thought that arise from pure RL training. In other words, if the model gets a negative reward (i.e., the answer is wrong), the math behind PPO causes the average per-token loss to become smaller when the response is longer. So, the model is indirectly encouraged to make its responses longer, even if those extra tokens don't actually help solve the problem.

What does the response length have to do with the loss? When the reward is negative, longer responses can dilute the penalty per individual token, which results in lower (i.e., better) loss values (even though the model is still getting the answer wrong). So the model "learns" that longer responses reduce the punishment, even though they are not helping correctness. However, it's important to emphasize that this analysis was done for PPO: "Of note, our current analysis is not applicable to GRPO, and a precise analysis of such methods is left for future work."

In addition, the researchers show that a second round of RL (using just a few problems that are sometimes solvable) can shorten responses while preserving or even improving accuracy. This has big implications for deployment efficiency. Annotated figure from Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185

📄 9 Apr, A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086

This paper takes a closer look at recent claims that RL can improve distilled language models, like those based on DeepSeek-R1. For instance, I previously discussed the "20 Mar, Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" paper that found RL is effective for distilled models. And the DeepSeek-R1 paper also mentioned: "Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here."
So, while earlier papers reported large performance boosts from RL, this work finds that many of those improvements might just be noise. The authors show that results on small benchmarks like AIME24 are highly unstable: just changing a random seed can shift scores by several percentage points. When RL models are evaluated under more controlled and standardized setups, the gains turn out to be much smaller than originally reported, and often not statistically significant. Some models trained with RL do show modest improvements, but these are usually weaker than what supervised fine-tuning achieves, and they often don't generalize well to new benchmarks.

So, while RL might help in some cases to improve smaller distilled models, this paper argues that its benefits have been overstated and better evaluation standards are needed to understand what's actually working. Annotated figure from A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086

This magazine is a personal passion project. To support me as an independent researcher, please consider purchasing a copy of my book, Build a Large Language Model (From Scratch), or signing up for a paid subscription.

Build a Large Language Model (From Scratch) now available on Amazon. If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you!

Source: OpenAI livestream (https://openai.com/live/) on April 16, 2025

While reasoning alone isn't a silver bullet, it reliably improves model accuracy and problem-solving capabilities on challenging tasks (so far). And I expect reasoning-focused post-training to become standard practice in future LLM pipelines. So, in this article, let's explore the latest developments in reasoning via reinforcement learning. This article focuses on the reinforcement learning training methods used to develop and improve reasoning models. Because it is a relatively long article, I am providing a Table of Contents overview below. To navigate the table of contents, please use the slider on the left-hand side in the web view.

Understanding reasoning models
RLHF basics: where it all started
A brief introduction to PPO: RL's workhorse algorithm
RL algorithms: from PPO to GRPO
RL reward modeling: from RLHF to RLVR
How the DeepSeek-R1 reasoning models were trained
Lessons from recent RL papers on training reasoning models
Noteworthy research papers on training reasoning models

A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.

If you are new to reasoning models and would like a more comprehensive introduction, I recommend my previous articles. Now, as hinted at the beginning of this section, the reasoning abilities of LLMs can be improved in two ways, as nicely illustrated in a figure from an OpenAI blog post: Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/

In my previous article, I solely focused on the test-time compute methods.
In this article, I finally want to take a closer look at the training methods.

RLHF basics: where it all started

The reinforcement learning (RL) training methods used to build and improve reasoning models are more or less related to the reinforcement learning with human feedback (RLHF) methodology that is used to develop and align conventional LLMs. So, I want to start with a small recap of how RLHF works before discussing reasoning-specific modifications of RL-based training.

Conventional LLMs typically undergo a 3-step training procedure:

1. Pre-training
2. Supervised fine-tuning
3. Alignment (typically via RLHF)

The RLHF procedure itself consists of three steps:

RLHF Step 1 (prerequisite): Supervised fine-tuning (SFT) of the pre-trained model
RLHF Step 2: Creating a reward model
RLHF Step 3: Fine-tuning via proximal policy optimization (PPO)

Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155

In RLHF Step 1, we create or sample prompts (from a database, for example) and ask humans to write good-quality responses. We then use this dataset to fine-tune the pre-trained base model in a supervised fashion. As mentioned before, this is not technically part of RL training but merely a prerequisite.

In RLHF Step 2, we then use this model from supervised fine-tuning (SFT) to create a reward model, as shown below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155

As depicted in the figure above, for each prompt, we generate four responses from the fine-tuned LLM created in the prior step. Human annotators then rank these responses based on their preferences. Although this ranking process is time-consuming, it might be somewhat less labor-intensive than creating the dataset for supervised fine-tuning. This is because ranking responses is likely simpler than writing them. Upon compiling a dataset with these rankings, we can design a reward model that outputs a reward score for the subsequent optimization stage (RLHF Step 3). The idea here is that the reward model replaces and automates the labor-intensive human ranking to make the training feasible on large datasets. This reward model (RM) generally originates from the LLM created in the prior supervised fine-tuning (SFT) step. To turn the model from RLHF Step 1 into a reward model, its output layer (the next-token classification layer) is substituted with a regression layer, which features a single output node.

The third step in the RLHF pipeline is to use the reward model (RM) to fine-tune the previous model from supervised fine-tuning (SFT), which is illustrated in the figure below. Annotated figure from InstructGPT paper, https://arxiv.org/abs/2203.02155

In RLHF Step 3, the final stage, we update the SFT model using proximal policy optimization (PPO) based on the reward scores from the reward model we created in RLHF Step 2.

A brief introduction to PPO: RL's workhorse algorithm

As mentioned earlier, the original RLHF method uses a reinforcement learning algorithm called proximal policy optimization (PPO). PPO was developed to improve the stability and efficiency of training a policy. (In reinforcement learning, "policy" just means the model we want to train; in this case, policy = LLM.) One of the key ideas behind PPO is that it limits how much the policy is allowed to change during each update step.
This is done using a clipped loss function, which helps prevent the model from making overly large updates that could destabilize training. On top of that, PPO also includes a KL divergence penalty in the loss. This term compares the current policy (the model being trained) to the original SFT model and encourages the updates to stay reasonably close. The idea is to preference-tune the model, not to completely re-train it, after all. This is where the "proximal" in proximal policy optimization comes from: the algorithm tries to keep the updates close to the existing model while still allowing for improvement. And to encourage a bit of exploration, PPO also adds an entropy bonus, which encourages the model to vary its outputs during training.

In the following paragraphs, I want to introduce some more terminology to illustrate PPO on a relatively high level. Still, there's a lot of jargon involved, so I tried to summarize the key terminology in the figure below before we continue. Illustration of the key terms in RLHF. For instance, several models are involved in PPO, where PPO is an algorithm used in RLHF (and RLHF is one of the most popular LLM alignment methods).

Below, I aim to illustrate the key steps in PPO via pseudo-code. In addition, to make it more intuitive, I will also use an analogy: Imagine you are a chef running a small food delivery service. And you are constantly trying out new recipe variations to improve customer satisfaction. Your overall goal is to tweak your recipe (policy) based on customer feedback (reward).

1. Compute the ratio of the next-token probabilities from the new vs the old policy:

ratio = new_policy_prob / old_policy_prob

In short, this checks how different our new recipe is from the old one. Side note: Regarding "new_policy_prob", we are not using the final updated policy yet. We are using the current version of the policy (i.e., the model we are in the middle of training). However, it's a convention to call it "new". So, even though you're still experimenting, we call your current draft the "new policy" as per convention.

2. Multiply that ratio by how good the action was (called the advantage):

raw_score = ratio * advantage

Here, for simplicity, we may assume the advantage is computed based on the reward signal:

advantage = actual_reward - expected_reward

In the chef analogy, we can think of the advantage as how well the new dish performed: For example, if a customer rates the new dish with a 9/10, and the customers normally give us a 7/10, that's a +2 advantage. Note that this is a simplification. In reality, this involves generalized advantage estimation (GAE), which I am omitting here so as not to bloat the article further. However, one important detail to mention is that the expected reward is computed by a so-called "critic" (sometimes also called "value model"), and a reward model computes the actual reward. I.e., the advantage computation involves two other models, typically the same size as the original model we are fine-tuning. In the analogy, we can think of this critic or value model as a friend we ask to try our new dish before serving it to the customers. We also ask our friend to estimate how a customer would rank it (that's the expected reward). The reward model is then the actual customer who gives the feedback (i.e., the actual reward).

3. Compute a clipped score: If the new policy changes too much (e.g., ratio > 1.2 or < 0.8), we clip the ratio, as follows:

clipped_score = clip(ratio, 0.8, 1.2) * advantage

In the analogy, imagine that the new recipe got an exceptionally great (or bad) review. We might be tempted to overhaul the entire menu now. But that's risky.
So, instead, we clip how much our recipe can change for now. (For instance, maybe we made the dish much spicier, and that one customer happened to love spicy food, but that doesn't mean everyone else will.)

4. Then we use the smaller of the raw score and the clipped score:

score = min(raw_score, clipped_score)

Again, this is related to being a bit cautious. For instance, if the advantage is positive (the new behavior is better), we cap the reward. That's because we don't want to over-trust a good result that might be a coincidence or luck. If the advantage is negative (the new behavior is worse), we limit the penalty. The idea here is similar. Namely, we don't want to overreact to one bad result unless we are really sure. In short, we use the smaller of the two scores if the advantage is positive (to avoid over-rewarding), and the larger when the advantage is negative (to avoid over-penalizing). In the analogy, this ensures that if a recipe is doing better than expected, we don't over-reward it unless we are confident. And if it's underperforming, we don't over-penalize it unless it's consistently bad.

5. Calculating the loss: This final score is what we maximize during training (using gradient descent after flipping the sign of the score to minimize). In addition, we also add a KL penalty term, where β is a hyperparameter for the penalty strength:

loss = -(score - β * KL(new_policy, reference_policy))

In the analogy, we add the penalty to ensure new recipes are not too different from our original style. This prevents you from "reinventing the kitchen" every week. For example, we don't want to turn an Italian restaurant into a BBQ place all of a sudden.

This was a lot of information, so I summarized it with a concrete, numeric example in an LLM context via the figure below. But please feel free to skip it if it's too complicated; you should be able to follow the rest of the article just fine. I admit that I may have gone overboard with the PPO walkthrough. But once I had written it, it was hard to delete it. I hope some of you will find it useful!

That being said, the main takeaways that will be relevant in the next section are that there are multiple models involved in PPO:

1. The policy, which is the LLM that has been trained with SFT and that we want to further align.
2. The reward model, which is a model that has been trained to predict the reward (see RLHF Step 2).
3. The critic, which is a trainable model that estimates the reward.
4. A reference model (original policy) that we use to make sure that the policy doesn't deviate too much.

By the way, you might wonder why we need both a reward model and a critic model. The reward model is usually trained before training the policy with PPO. It automates the preference labeling otherwise done by human judges, and it scores the complete responses generated by the policy LLM. The critic, in contrast, judges partial responses while the final response is still being generated. While the reward model typically remains frozen, the critic model is updated during training to better estimate the reward given by the reward model.

More details about PPO are out of the scope of this article, but interested readers can find the mathematical details in these four papers that predate the InstructGPT paper:

(1) Asynchronous Methods for Deep Reinforcement Learning (2016) by Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu introduces policy gradient methods as an alternative to Q-learning in deep learning-based RL.
(2) Proximal Policy Optimization Algorithms (2017) by Schulman, Wolski, Dhariwal, Radford, and Klimov presents a modified proximal policy-based reinforcement learning procedure that is more data-efficient and scalable than the vanilla policy optimization algorithm above.

(3) Fine-Tuning Language Models from Human Preferences (2019) by Ziegler, Stiennon, Wu, Brown, Radford, Amodei, Christiano, and Irving applies the concept of PPO and reward learning to pretrained language models, including KL regularization to prevent the policy from diverging too far from natural language.

(4) Learning to Summarize from Human Feedback (2020) by Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, and Christiano introduces the popular RLHF three-step procedure that was later also used in the InstructGPT paper.

RL algorithms: from PPO to GRPO

As mentioned before, PPO was the original algorithm used in RLHF. From a technical standpoint, it works perfectly fine in the RL pipeline that's being used to develop reasoning models. However, what DeepSeek-R1 used for their RL pipeline is an algorithm called Group Relative Policy Optimization (GRPO), which was introduced in one of their earlier papers: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024). Annotated figure from DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (https://arxiv.org/abs/2402.03300) to illustrate the differences between PPO and GRPO.

RL reward modeling: from RLHF to RLVR

So far, we looked at RLHF as a procedure, and we have introduced two reinforcement learning algorithms commonly used for it: PPO and GRPO. But if RLHF is already a core part of the LLM alignment toolkit, what does any of this have to do with reasoning? The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models. The difference is that instead of relying on human preferences and training a reward model, the DeepSeek-R1 team used verifiable rewards. This approach is called reinforcement learning with verifiable rewards (RLVR).

Again, it's worth emphasizing: In contrast to standard RLHF, RLVR bypasses the need for a reward model. So, rather than learning what counts as a "good" answer from human-labeled examples, the model gets direct binary feedback (correct or wrong) from a deterministic tool, such as symbolic verifiers or rule-based tools. Think calculators for math problems or compilers for code generation. Example of reinforcement learning with verifiable rewards (RLVR). The model is prompted to solve a math problem and produces an answer. Instead of using a learned reward model, a symbolic verifier (e.g., a calculator) checks the output and provides binary feedback based on correctness.

One motivation here is to avoid noisy or expensive human or learned rewards by using automatic correctness checks as supervision signals during RL. The other motivation is that by using "cheap" tools like calculators, we can replace the expensive reward model training and the reward model itself. Since the reward model is usually the whole pre-trained model (but with a regression head), RLVR is much more efficient.
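To make the idea of a verifiable reward a bit more concrete, below is a minimal sketch of a rule-based math verifier. This is my own illustration (not DeepSeek's implementation), and the expected \boxed{...} answer format is just an assumption for the example:

```python
import re

def verifiable_reward(model_output: str, correct_answer: float) -> float:
    """Toy rule-based reward: 1.0 if the final boxed answer matches, else 0.0.
    Assumes the model was instructed to put its final answer inside \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no parsable answer -> no reward
    try:
        predicted = float(match.group(1))
    except ValueError:
        return 0.0  # answer is not a number -> no reward
    return 1.0 if abs(predicted - correct_answer) < 1e-6 else 0.0

# Usage example with made-up outputs
print(verifiable_reward(r"... so the result is \boxed{42}", 42.0))  # 1.0
print(verifiable_reward(r"... I think it's \boxed{41}", 42.0))      # 0.0
```

The point is that this check is deterministic and essentially free compared to running a learned reward model, which is why RLVR scales so cheaply for math- and code-style tasks.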
So, in short, DeepSeek-R1 used RLVR with GRPO, which eliminates two expensive models in the training procedure: the reward model and the value model (critic), as illustrated in the figure below. Comparison of reinforcement learning setups in LLM training. Traditional RLHF with PPO uses both a reward model (trained on human preferences) and a critic (value model) to guide learning. GRPO eliminates the critic model. RLVR with GRPO goes a step further by also removing the reward model, relying instead on verifiable rewards from symbolic tools like calculators or compilers.

In the next section, I want to briefly go over the DeepSeek-R1 pipeline and discuss the different verifiable rewards that the DeepSeek team used.

How the DeepSeek-R1 reasoning models were trained

Now that we have clarified what RLHF and RLVR are, as well as PPO and GRPO, let's briefly recap the main insights from the DeepSeek-R1 paper in the context of RL and reasoning. First, there were three types of models:

1. DeepSeek-R1-Zero, trained with pure RL
2. DeepSeek-R1, trained with instruction fine-tuning (SFT) and RL
3. DeepSeek-Distill variants, created via instruction fine-tuning (SFT) without RL

Training pipeline for the DeepSeek-R1 family

DeepSeek-R1-Zero was trained using verifiable rewards (RLVR) with GRPO, and this turned out to be sufficient for the model to exhibit reasoning abilities via intermediate-step generation. This showed that it's possible to skip the SFT stage. The model improves its reasoning abilities through exploration instead of learning from examples.

DeepSeek-R1 is the flagship model, the one with the best performance. The difference compared to DeepSeek-R1-Zero is that they alternated instruction fine-tuning, RLVR, and RLHF.

DeepSeek-Distill variants are meant to be smaller and more easily deployable models; they were generated by instruction fine-tuning Llama 3 and Qwen 2.5 models using instruction data from the DeepSeek-R1 model. This approach didn't use any RL for the reasoning part (however, RLHF was used to create the Llama 3 and Qwen 2.5 base models).

For more details on the DeepSeek-R1 pipeline, please see my previous article "Understanding Reasoning LLMs". The main takeaway here is that the DeepSeek team didn't use an LLM-based reward model to train DeepSeek-R1-Zero. Instead, they used rule-based rewards for the reasoning training of DeepSeek-R1-Zero and DeepSeek-R1:

"We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process [...] To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: (1) Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases. (2) Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between '<think>' and '</think>' tags."

Lessons from recent RL papers on training reasoning models

I realize that the introduction (i.e., everything up to this point) turned out to be much longer than I expected.
Nonetheless, I think that this lengthy introduction is perhaps necessary to put the following lessons into context. After going through a large number of recent papers on reasoning models last month, I have put together a summary of the most interesting ideas and insights in this section. (References like "[1]" point to the corresponding papers listed at the end of the article.)

1. Reinforcement learning further improves distilled models

The original DeepSeek-R1 paper demonstrated clearly that supervised fine-tuning (SFT) followed by reinforcement learning (RL) outperforms RL alone. Given this observation, it's intuitive that additional RL should further improve distilled models (as distilled models essentially represent models trained via SFT using reasoning examples generated by a larger model). Indeed, the DeepSeek team observed this phenomenon explicitly: "Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here."

Several teams independently verified these observations:

[8] Using the 1.5B DeepSeek-R1-Distill-Qwen model, researchers demonstrated substantial performance improvements from RL fine-tuning with just 7,000 examples and a modest $42 compute budget. Impressively, this small model surpassed OpenAI's o1-preview on the AIME24 math benchmark.

[15] However, another team cautioned that these gains might not always be statistically significant. This suggests that, although RL can improve smaller distilled models, the benchmark results might sometimes be overstating the improvements. Annotated figure from A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility, https://arxiv.org/abs/2504.07086

2. The problem of long incorrect answers

I previously mentioned that RL with verifiable rewards (RLVR) does not strictly require the GRPO algorithm; DeepSeek's GRPO simply happens to be efficient and to perform well. However, [12] showed that vanilla PPO paired with a basic binary correctness reward was sufficient to scale models in reasoning capability and response length.

More interestingly, both PPO and GRPO have a length bias. And several papers explored methods to tackle excessively long incorrect answers:

[14] Provided an analysis illustrating how PPO inadvertently favors longer responses due to mathematical biases in loss calculations; GRPO may suffer from the same issue. Annotated figure from Concise Reasoning via Reinforcement Learning, https://arxiv.org/abs/2504.05185

As a follow-up to the statement above, [7] [10] specifically identified length and difficulty-level biases in GRPO. The modified variant "Dr. GRPO" simplifies advantage calculations by removing length and standard deviation normalization, providing clearer training signals.

[1] Explicitly penalized lengthy incorrect answers in GRPO while rewarding concise, correct ones.

[3] [6] Didn't directly control response length in GRPO but found token-level rewards beneficial, allowing models to better focus on critical reasoning steps.

[5] Introduced explicit penalties in GRPO for responses exceeding specific lengths, enabling precise length control during inference.
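As a tiny numeric illustration of the length bias discussed in point 2 (my own toy example, not taken from any of the papers above): when the loss is averaged per response, a wrong answer spread over more tokens receives a smaller penalty per token.

```python
# Toy illustration of per-token penalty dilution; all numbers are made up.
negative_advantage = -1.0  # e.g., the verifier says the answer is wrong

short_wrong_response_len = 100   # tokens
long_wrong_response_len = 1000   # tokens

# Sequence-level averaging: the total penalty is divided by the response length.
per_token_penalty_short = negative_advantage / short_wrong_response_len  # -0.01
per_token_penalty_long = negative_advantage / long_wrong_response_len    # -0.001

print(per_token_penalty_short, per_token_penalty_long)
# The longer wrong answer is penalized 10x less per token, so the policy is
# nudged toward padding out incorrect responses. Token-level aggregation (as in
# DAPO) or dropping the length normalization (as in Dr. GRPO) removes this incentive.
```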
Integrating existing reasoning models (e.g., o1, DeepSeek-R1) with capabilities such as external tool use and retrieval-augmented generation (RAG) is another promising direction; the just-released o3 model from OpenAI paves the way here. Speaking of tool use and search, [9] showed that giving reasoning models the ability to search induces behaviors such as self-correction and robust generalization across benchmarks, despite minimal training datasets.

Ahead of AI 7 months ago

The State of LLM Reasoning Model Inference

Improving the reasoning abilities of large language models (LLMs) has become one of the hottest topics in 2025, and for good reason. Stronger reasoning skills allow LLMs to tackle more complex problems, making them more capable across a wide range of tasks users care about. In the last few weeks, researchers have shared a large number of new strategies to improve reasoning, including scaling inference-time compute, reinforcement learning, supervised fine-tuning, and distillation. And many approaches combine these techniques for greater effect. This article explores recent research advancements in reasoning-optimized LLMs, with a particular focus on the inference-time compute scaling methods that have emerged since the release of DeepSeek R1.

The four main categories of implementing reasoning models I explained in Understanding Reasoning LLMs. This article focuses on inference-time-scaling methods.

Implementing and improving reasoning in LLMs: The four main categories

Since most readers are likely already familiar with LLM reasoning models, I will keep the definition short: An LLM-based reasoning model is an LLM designed to solve multi-step problems by generating intermediate steps or structured "thought" processes. Unlike simple question-answering LLMs that just share the final answer, reasoning models either explicitly display their thought process or handle it internally, which helps them to perform better at complex tasks such as puzzles, coding challenges, and mathematical problems.

Side-by-side comparison of a basic LLM's one-line answer and a reasoning LLM's explanatory response.

In general, there are two main strategies to improve reasoning: (1) increasing training compute or (2) increasing inference compute, also known as inference-time scaling or test-time scaling. (Inference compute refers to the processing power required to generate model outputs in response to a user query after training.) Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/

Note that the plots shown above make it look like we improve reasoning either via train-time compute OR test-time compute. However, LLMs are usually designed to improve reasoning by combining heavy train-time compute (extensive training or fine-tuning, often with reinforcement learning or specialized data) and increased test-time compute (allowing the model to "think longer" or perform extra computation during inference).

The many terms that are used synonymously with inference-time scaling.

To understand how reasoning models are being developed and improved, I think it remains useful to look at the different techniques separately. In my previous article, Understanding Reasoning LLMs, I discussed a finer categorization into four categories, as summarized in the figure below. Methods 2-4 in the figure above typically produce models that generate longer responses because they include intermediate steps and explanations in their outputs. Since inference costs scale with response length (e.g., a response twice as long requires twice the compute), these training approaches are inherently linked to inference scaling.
However, in this section on inference-time compute scaling, I focus specifically on techniques that explicitly regulate the number of generated tokens, whether through additional sampling strategies, self-correction mechanisms, or other methods.

In this article, I focus on the interesting new research papers and model releases related to scaling inference-time compute that followed the DeepSeek R1 release on January 22nd, 2025. (Originally, I wanted to cover methods from all categories in this article, but due to the excessive length, I decided to release a separate article focused on train-time compute methods in the future.)

Development process of DeepSeek's reasoning models that I discussed in my previous article, Understanding Reasoning LLMs (https://magazine.sebastianraschka.com/p/understanding-reasoning-llms).

Before we look into the inference-time compute scaling methods and the different areas of progress in this category, let me at least provide a brief overview of all four categories.

1. Inference-time compute scaling

This category includes methods that improve model reasoning capabilities at inference time without training or modifying the underlying model weights. The core idea is to trade off increased computational resources for improved performance, which helps make even fixed models more capable through techniques such as chain-of-thought reasoning and various sampling procedures. While I categorize inference-time compute scaling separately to focus on methods in this context, it is important to note that this technique can be applied to any LLM. For example, OpenAI developed its o1 model using reinforcement learning and then additionally leveraged inference-time compute scaling. Interestingly, as I discussed in my previous article on reasoning models (Understanding Reasoning LLMs), the DeepSeek R1 paper explicitly categorized common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model's natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling over the V3 base model. However, since explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, DeepSeek acknowledged that they could easily incorporate it into the R1 deployment or application.

2. Pure reinforcement learning

This approach focuses solely on reinforcement learning (RL) to develop or improve reasoning capabilities. It typically involves training models with verifiable reward signals from math or coding domains. While RL allows models to develop more strategic thinking and self-improvement capabilities, it comes with challenges such as reward hacking, instability, and high computational costs.

3. Reinforcement learning and supervised fine-tuning

This hybrid approach combines RL with supervised fine-tuning (SFT) to achieve more stable and generalizable improvements than pure RL. Typically, a model is first trained with SFT on high-quality instruction data and then further refined using RL to optimize specific behaviors.

4. Supervised fine-tuning and model distillation

This method improves the reasoning capabilities of a model by instruction fine-tuning it on high-quality labeled datasets (SFT).
If this high-quality dataset is generated by a larger LLM, then this methodology is also referred to as "knowledge distillation" or just "distillation" in LLM contexts. However, note that this differs slightly from traditional knowledge distillation in deep learning, which typically involves training a smaller model using not only the outputs (labels) but also the logits of a larger teacher model.

The previous section already briefly summarized inference-time compute scaling. Before discussing the recent research in this category, let me describe inference-time scaling in a bit more detail. Inference-time scaling improves an LLM's reasoning by increasing computational resources ("compute") during inference. The intuition for why this can improve reasoning comes from a simple analogy: humans give better responses when given more time to think, and similarly, LLMs can improve with techniques that encourage more "thought" during generation.

One approach here is prompt engineering, such as chain-of-thought (CoT) prompting, where phrases like "think step by step" guide the model to generate intermediate reasoning steps. This improves accuracy on complex problems but is unnecessary for simple factual queries. Since CoT prompts generate more tokens, they effectively make inference more expensive. An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916).

Another method involves voting and search strategies, such as majority voting or beam search, which refine responses by selecting the best output. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314

The remainder of this article will be focused on the recent research advances in the inference-time scaling category for improving reasoning capabilities of LLMs. Let me start with a more detailed discussion of a paper that serves as an example of inference-time scaling. So, one of the interesting recent research papers in this category is s1: Simple Test-Time Scaling (31 Jan, 2025), which introduces so-called "wait" tokens, which can be considered a more modern version of the aforementioned "think step by step" prompt modification. Note that this involves supervised fine-tuning (SFT) to generate the initial model, so it's not a pure inference-time scaling approach. However, the end goal is actively controlling the reasoning behavior through inference-time scaling; hence, I considered this paper for the "1. Inference-time compute scaling" category.

In short, their approach is twofold:

1. Create a curated SFT dataset with 1k training examples that include reasoning traces.
2. Control the length of responses by:
a) Appending "Wait" tokens to get the LLM to generate longer responses, self-verify, and correct itself, or
b) Stopping generation by adding an end-of-thinking token delimiter ("Final Answer:"). They call this length control "budget forcing."

Illustration of "wait" token insertion to control the length of the output. Annotated figure from https://arxiv.org/abs/2501.19393.

Budget forcing can be seen as a sequential inference scaling technique because it still generates one token at a time (just more of them).
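As a rough sketch of how budget forcing could be implemented (my own simplification, not the s1 authors' code), the loop below suppresses the end-of-thinking transition and appends "Wait" until a minimum thinking budget is reached. The generate_step callable, the word-based token counting, and the default values are assumptions for illustration:

```python
def budget_forced_generate(generate_step, prompt, min_thinking_tokens=512,
                           max_thinking_tokens=4096,
                           end_of_thinking="Final Answer:", wait_token="Wait"):
    """Sketch of budget forcing. `generate_step(text)` is an assumed helper that
    returns (next_text_chunk, wants_to_stop), where wants_to_stop indicates
    that the model tried to end its thinking phase."""
    thinking = ""
    while len(thinking.split()) < max_thinking_tokens:  # crude "token" count
        chunk, wants_to_stop = generate_step(prompt + thinking)
        if not chunk:
            break
        thinking += chunk
        if wants_to_stop:
            if len(thinking.split()) < min_thinking_tokens:
                thinking += " " + wait_token   # suppress the stop, keep reasoning
            else:
                break                          # budget satisfied, stop thinking
    thinking += " " + end_of_thinking          # force the transition to the answer
    answer, _ = generate_step(prompt + thinking)
    return thinking + " " + answer
```

The key design choice is that lengthening happens by appending a single nudge word rather than by resampling, which is why the method counts as sequential rather than parallel scaling.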
In contrast, we have parallel techniques like majority voting, which aggregate multiple independent completions. Correlation between response accuracy and length. Annotated figure from https://arxiv.org/abs/2501.19393.

They found their budget-forcing method more effective than other inference-scaling techniques I've discussed, like majority voting. If there's something to criticize or improve, I would've liked to see results for more sophisticated parallel inference-scaling methods, like beam search, lookahead search, or the best compute-optimal search described in Google's Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters paper last year. Or even a simple comparison with a classic sequential method like chain-of-thought prompting ("Think step by step"). Anyway, it's a really interesting paper and approach!

PS: Why "Wait" tokens? My guess is the researchers were inspired by the "Aha moment" figure in the DeepSeek-R1 paper, where researchers saw LLMs coming up with something like "Wait, wait. Wait. That's an aha moment I can flag here.", which showed that pure reinforcement learning can induce reasoning behavior in LLMs. Interestingly, they also tried other tokens like "Hmm" but found that "Wait" performed slightly better. "Wait" vs "Hmm" tokens. Annotated figure from https://arxiv.org/abs/2501.19393.

Since it's been a very active month on the reasoning model research front, I need to keep the summaries of other papers relatively brief to manage a reasonable length for this article. Hence, below are brief summaries of other interesting research articles related to inference-time compute scaling, sorted in ascending order by publication date. As mentioned earlier, not all of these articles fall neatly into the inference-time compute scaling category, as some of them also involve specific training. However, these papers have in common that controlling inference-time compute is a specific mechanism of action. (Many distilled or SFT methods that I will cover in upcoming articles will lead to longer responses, which can be seen as a form of inference-time compute scaling. However, they do not actively control the length during inference, which makes these methods different from those covered here.)

📄 22 Jan, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback, https://arxiv.org/abs/2501.12895

Test-time Preference Optimization (TPO) is an iterative process that aligns LLM outputs with human preferences during inference (without altering the underlying model weights). In each iteration, the model:

1. Generates multiple responses for a given prompt.
2. Scores the responses with a reward model to select the highest- and lowest-scoring ones as "chosen" and "rejected" responses.
3. Prompts itself to compare and critique the "chosen" and "rejected" responses.
4. Refines the output by converting the critiques into textual suggestions to update the original model responses.

By doing steps 1-4 iteratively, the model refines its original responses. Annotated figure from "Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback", https://arxiv.org/abs/2501.12895

📄 30 Jan, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, https://arxiv.org/abs/2501.18585

The researchers explore a phenomenon called "underthinking", where reasoning models frequently switch between reasoning paths instead of fully focusing on exploring promising ones, which lowers the problem-solving accuracy.
To address this "underthinking" issue, they introduce a method called the Thought Switching Penalty (TIP), which modifies the logits of thought-switching tokens to discourage premature reasoning path transitions. Their approach does not require model fine-tuning and empirically improves accuracy across multiple challenging test sets. Annotated figure from "Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs", https://arxiv.org/abs/2501.18585

📄 31 Jan, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841

Increasing inference-time compute improves the adversarial robustness of reasoning LLMs in many cases, in the sense of reducing the rate of successful attacks. Unlike adversarial training, this method does not need any special training or require prior knowledge of specific attack types. However, there are some important exceptions. For example, the improvements in settings involving policy ambiguities or loophole exploitation are limited. Additionally, the robustness gains from more reasoning can be reduced by new attack strategies such as "Think Less" and "Nerd Sniping". So, while these findings suggest that scaling inference-time compute can improve LLM safety, this alone is not a complete solution to adversarial robustness. Annotated figure from "Trading Inference-Time Compute for Adversarial Robustness", https://arxiv.org/abs/2501.18841

📄 4 Feb, CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning, https://arxiv.org/abs/2502.02390

The researchers combine classic Monte Carlo Tree Search inference-time scaling with an "associative memory" that serves as the LLM's knowledge base during the exploration of reasoning pathways. Using this so-called associative memory, it's easier for the LLM to consider earlier reasoning pathways and use dynamically evolving information during the response generation. Annotated figure from "CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning", https://arxiv.org/abs/2502.02390

📄 6 Feb, Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models, https://arxiv.org/abs/2502.04404

This paper proposes a self-backtracking mechanism that allows LLMs to improve their reasoning by learning when and where to backtrack during training and inference. While training involves teaching the model to recognize and correct suboptimal reasoning paths using a <backtrack> token, the key contribution is an inference-time tree-based search that uses this learned backtracking ability to explore alternative solutions. What's unique is that this exploration does not rely on external reward models (unlike the search-based methods that use a process-reward-based model, which I mentioned at the beginning of the "1. Inference-time compute scaling" section in this article). Annotated figure from "Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models", https://arxiv.org/abs/2502.04404

I added this paper here as it's heavily focused on the proposed backtracking inference-time scaling method, which improves reasoning by dynamically adjusting search depth and breadth rather than fundamentally altering the training paradigm (although the training with <backtrack> tokens is required).
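Going back to the Thought Switching Penalty (TIP) from the underthinking paper above, here is a rough sketch of the general idea of penalizing the logits of designated thought-switching tokens during decoding. It is my own illustration, and the token IDs, penalty value, and number of penalized steps are made-up assumptions rather than the paper's actual settings:

```python
import torch

def apply_thought_switching_penalty(logits, switch_token_ids, step,
                                    penalty=3.0, penalty_steps=600):
    """Subtract `penalty` from the logits of designated thought-switching tokens
    (e.g., tokens that start phrases like "Alternatively") for the first
    `penalty_steps` decoding steps, discouraging premature path switches."""
    if step < penalty_steps:
        logits[..., switch_token_ids] -= penalty
    return logits

# Usage sketch with made-up values:
vocab_size = 32000
next_token_logits = torch.randn(1, vocab_size)          # logits for the next token
switch_token_ids = torch.tensor([523, 9911])             # hypothetical "switch" token IDs
next_token_logits = apply_thought_switching_penalty(next_token_logits,
                                                     switch_token_ids, step=10)
```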
📄 7 Feb, Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach, https://arxiv.org/abs/2502.05171

Instead of improving reasoning by generating more tokens, the researchers propose a model that scales inference-time compute by iterating over a recurrent depth block in latent space. This block functions like a hidden state in RNNs, which allows the model to refine its reasoning without requiring longer token outputs. However, a key drawback is the lack of explicit reasoning steps, which are (in my opinion) useful for human interpretability and a major advantage of chain-of-thought methods. Annotated figure from "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach", https://arxiv.org/abs/2502.05171

📄 10 Feb, Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, https://arxiv.org/abs/2502.06703

Many inference-time scaling techniques depend on sampling, which requires a Process Reward Model (PRM) to select the best solution. This paper systematically analyzes how inference-time compute scaling interacts with PRMs and problem difficulty. The researchers develop a compute-optimal scaling strategy that adapts to the choice of PRM, policy model, and task complexity. Their results show that with the right inference-time scaling approach, a 1B parameter model can outperform a 405B Llama 3 model that lacks inference-time scaling. Similarly, they show how a 7B model with inference-time scaling surpasses DeepSeek-R1 while maintaining higher inference efficiency. These findings highlight how inference-time scaling can significantly improve LLMs, where small LLMs, with the right inference compute budget, can outperform much larger models. Annotated figure from "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling", https://arxiv.org/abs/2502.06703

📄 16 Feb, Learning to Reason from Feedback at Test-Time, https://www.arxiv.org/abs/2502.12521

It's a bit hard to classify this as either an inference-time or training-time method, because it optimizes the LLM, changing its weight parameters, at inference time. So, this paper explores a way to make LLMs learn from their mistakes during inference time without having to store failed attempts in the prompt (which gets expensive). Instead of the usual method of refining answers by adding previous attempts to the context (sequential revision) or blindly generating new answers (parallel sampling), this approach updates the model's weights at inference time. To do this, the authors introduce OpTune, a small, trainable optimizer that updates the model's weights based on the mistakes it made in a previous attempt. This means the model remembers what it did wrong without needing to keep the incorrect answer in the prompt/context. Annotated figure from "Learning to Reason from Feedback at Test-Time", https://www.arxiv.org/abs/2502.12521

📄 18 Feb, Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, https://www.arxiv.org/abs/2502.12521

This paper benchmarks various inference-time compute scaling techniques for reasoning and planning tasks with a focus on analyzing their trade-offs between computational cost and performance. The authors evaluate multiple techniques, such as Chain-of-Thought, Tree-of-Thought, and Reasoning as Planning, across eleven tasks spanning arithmetic, logical, commonsense, algorithmic reasoning, and planning.
The main finding is that while scaling inference-time computation can improve reasoning, no single technique consistently outperforms others across all tasks. Annotated figure from Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, https://www.arxiv.org/abs/2502.12521

📄 19 Feb, Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, https://arxiv.org/abs/2502.13842

The Inner Thinking Transformer (ITT) dynamically allocates more compute during inference. Instead of using a fixed depth (= the same number of layers) for all tokens as in standard transformer-based LLMs, ITT employs Adaptive Token Routing to allocate more compute to difficult tokens. These difficult tokens pass through the same layer multiple times to undergo additional processing, which increases the inference-compute budget for these difficult tokens. Annotated figure from "Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking", https://arxiv.org/abs/2502.13842

📄 20 Feb, S*: Test Time Scaling for Code Generation, https://arxiv.org/abs/2502.14382

Inference-time scaling can be achieved by parallel scaling (generating multiple answers), sequential scaling (iteratively refining answers), or both, as described in the Google paper from Summer 2024 (Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters). S* is a test-time compute scaling method designed specifically for code generation that improves both parallel scaling (generating multiple solutions) and sequential scaling (iterative debugging). Annotated figure from "S*: Test Time Scaling for Code Generation", https://arxiv.org/abs/2502.14382

The approach operates in two stages:

Stage 1: Generation

The model generates multiple code solutions and iteratively refines them using execution results and test cases provided in the problem prompt. Think of this like a coding competition where a model submits solutions, runs tests, and fixes mistakes:

1. The model generates multiple candidate solutions.
2. Each solution is executed on public test cases (predefined input-output pairs).
3. If a solution fails (incorrect output or crashes), the model analyzes the execution results (errors, outputs) and modifies the code to improve it.
4. This refinement process continues iteratively until the model finds solutions that pass the test cases.

For example, suppose the model is asked to implement a function is_even(n) that returns True for even numbers and False otherwise. The model's first attempt might simply return n % 2. After testing this implementation on the public test cases and reviewing the results, the model realizes that 4 % 2 returns 0, not True, so it modifies the function to return n % 2 == 0 instead. Now the function passes all public tests, completing the debugging phase.

Stage 2: Selection

Once multiple solutions have passed public tests, the model must choose the best one (if possible). Here, S* introduces adaptive input synthesis to avoid random picking (see the sketch below):

1. The model compares two solutions that both pass public tests.
2. It asks itself: "Can I generate an input that will reveal a difference between these solutions?"
3. It creates a new test input and runs both solutions on it.
4. If one solution produces the correct output while the other fails, the model selects the better one.
5. If both solutions behave identically, the model randomly picks one.
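Here is a rough sketch of what this adaptive input synthesis step could look like in code. It is my own illustration rather than the authors' implementation, and the propose_distinguishing_input, run_code, and judge_correct callables are hypothetical helpers (the first asks the LLM for a potentially distinguishing test input, the second executes a candidate program, and the third asks the LLM whether an output is correct for the problem):

```python
import random

def select_candidate(problem, candidate_a, candidate_b,
                     propose_distinguishing_input, run_code, judge_correct,
                     n_attempts=5):
    """Sketch of S*-style selection between two candidates that both pass
    the public test cases."""
    for _ in range(n_attempts):
        test_input = propose_distinguishing_input(problem, candidate_a, candidate_b)
        out_a = run_code(candidate_a, test_input)
        out_b = run_code(candidate_b, test_input)
        if out_a != out_b:
            # Outputs differ: keep whichever candidate the judge deems correct.
            return candidate_a if judge_correct(problem, test_input, out_a) else candidate_b
    # No distinguishing input found: the candidates appear equivalent, pick one at random.
    return random.choice([candidate_a, candidate_b])
```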
For example, consider two different implementations of the same function that both pass the provided public test cases on simple examples. When the LLM generates edge-case inputs, one of them fails, so the model would select the solution that still produces correct outputs.

📄 25 Feb, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600

The researchers observe that while reasoning LLMs often generate verbose step-by-step explanations, humans typically rely on concise drafts that capture only essential information. Inspired by this, they propose Chain of Draft (CoD), a prompting strategy that reduces verbosity by generating minimal but informative intermediate steps. So, in a sense, it's an inference-time scaling method that improves efficiency by generating fewer tokens. Annotated figures from "Chain of Draft: Thinking Faster by Writing Less", https://arxiv.org/abs/2502.18600

Looking at the results, it seems that CoD is almost as brief as standard prompting, but as accurate as Chain of Thought (CoT) prompting. As I mentioned earlier, in my opinion, one of the advantages of reasoning models is that users can read the reasoning traces to learn and to better evaluate / trust the response. CoD somewhat diminishes this advantage. However, it might come in very handy where verbose intermediate steps are not needed, as it speeds up generation while maintaining the accuracy of CoT.

📄 6 Mar, Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks, https://arxiv.org/abs/2503.04378

Many techniques for scaling inference-time reasoning rely on tasks with verifiable answers (like math and code that can be checked), which makes them difficult to apply to open-ended tasks like writing and general problem-solving. To address this limitation regarding verifiable answers, the researchers develop a system where one model generates an initial response, another provides feedback ("feedback model"), and a third refines the response based on that feedback ("edit model"). They train these specialized "feedback" and "edit" models using a large dataset of human-annotated responses and feedback. These models then help improve responses by generating better feedback and making more effective edits during inference time.

Inference-time compute scaling has become one of the hottest research topics this year to improve the reasoning abilities of large language models without requiring modification to model weights. The many techniques I summarized above range from simple token-based interventions like "Wait" tokens to sophisticated search and optimization-based strategies such as Test-Time Preference Optimization and Chain-of-Associated-Thoughts. On the big-picture level, one recurring theme is that increasing compute at inference allows even relatively small models to achieve substantial improvements (on reasoning benchmarks) compared to standard approaches. This suggests that inference strategies can help narrow the performance gap between smaller, more cost-effective models and their larger counterparts.

The cost caveat

The caveat is that inference-time scaling increases the inference costs, so whether to use a small model with substantial inference scaling or to train a larger model and use it with little or no inference scaling is a calculation that has to be worked out based on how much use the model gets.
As an example, the o1 model, which uses heavy inference-time scaling, is actually still slightly cheaper than the presumably larger GPT-4.5 model, which likely doesn't use inference-time scaling. (It will be interesting to see how well GPT-4.5 would perform with o1- or o3-style inference-time scaling.) Which technique? However, inference-time compute scaling is not a silver bullet. While methods like Monte Carlo Tree Search, self-backtracking, and dynamic-depth scaling can substantially improve reasoning performance, their effectiveness still depends on the task and its difficulty. As one of the earlier papers showed, there's no inference-time compute scaling technique that performs best across all tasks. Additionally, many of these approaches trade off response latency for improved reasoning, and slow responses can be annoying to some users. For instance, I usually switch from o1 to GPT-4o for simple tasks because of the faster response time. What's next Looking ahead, I think we will see many more papers this year centered around the two main branches of "reasoning via inference-time compute scaling" research: 1. Research that is purely centered on developing the best possible models that top the benchmarks. 2. Research that is concerned with balancing cost and performance trade-offs across different reasoning tasks. Either way, what's nice about inference-time compute scaling is that it can be applied to any type of existing LLM to make it better for specific tasks. Thinking on Demand An interesting trend on the industry side is what I refer to as "thinking on demand". Following the release of DeepSeek R1, it feels like companies have been rushing to add reasoning capabilities to their offerings. An interesting development here is that most LLM providers now allow users to enable or disable these "thinking" features. The mechanism is not publicly shared, but it's likely the same model with dialed-back inference-time compute scaling. For instance, Claude 3.7 Sonnet and Grok 3 now have a "thinking" mode that users can enable, whereas OpenAI requires users to switch between models, for example, between GPT-4o/4.5 and o1/o3-mini, if they want to use explicit reasoning models. However, the OpenAI CEO mentioned that GPT-4.5 will likely be their last model that doesn't explicitly have a reasoning or "thinking" mode. On the open-source side, even IBM added an explicit "thinking" toggle to their Granite models. Overall, the trend of adding reasoning capabilities, whether via inference-time or train-time compute scaling, is a major step forward for LLMs in 2025. In time, I expect that reasoning will no longer be treated as an optional or special feature but will instead become the standard, much as instruction-finetuned or RLHF-tuned models are now the norm over raw pretrained models. As mentioned earlier, this article focused solely on inference-time compute scaling because of its already considerable length, a result of the very active research on reasoning. In a future article, I plan to cover all the interesting train-time compute scaling methods for reasoning. This magazine is a personal passion project. To support me as an independent researcher, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book or signing up for a paid subscription.
Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot! Your support means a great deal! Thank you! The four main categories of implementing reasoning models I explained in Understanding Reasoning LLMs. This article focuses on inference-time-scaling methods. Implementing and improving reasoning in LLMs: The four main categories Since most readers are likely already familiar with LLM reasoning models, I will keep the definition short: An LLM-based reasoning model is an LLM designed to solve multi-step problems by generating intermediate steps or structured "thought" processes. Unlike simple question-answering LLMs that just share the final answer, reasoning models either explicitly display their thought process or handle it internally, which helps them perform better at complex tasks such as puzzles, coding challenges, and mathematical problems. Side-by-side comparison of a basic LLM’s one-line answer and a reasoning LLM’s explanatory response. In general, there are two main strategies to improve reasoning: (1) increasing training compute or (2) increasing inference compute, also known as inference-time scaling or test-time scaling. (Inference compute refers to the processing power required to generate model outputs in response to a user query after training.) Accuracy improvements can be achieved through increased training or test-time compute, where test-time compute is synonymous with inference-time compute and inference-time scaling. Source: Annotated figure from https://openai.com/index/learning-to-reason-with-llms/ Note that the plots shown above make it look like we improve reasoning either via train-time compute OR test-time compute. However, LLMs are usually designed to improve reasoning by combining heavy train-time compute (extensive training or fine-tuning, often with reinforcement learning or specialized data) and increased test-time compute (allowing the model to "think longer" or perform extra computation during inference). The many terms that are used synonymously with inference-time scaling. To understand how reasoning models are being developed and improved, I think it remains useful to look at the different techniques separately. In my previous article, Understanding Reasoning LLMs, I discussed a finer categorization into four categories, as summarized in the figure below. Methods 2-4 in the figure above typically produce models that generate longer responses because they include intermediate steps and explanations in their outputs. Since inference costs scale with response length (e.g., a response twice as long requires twice the compute), these training approaches are inherently linked to inference scaling. However, in this section on inference-time compute scaling, I focus specifically on techniques that explicitly regulate the number of generated tokens, whether through additional sampling strategies, self-correction mechanisms, or other methods. In this article, I focus on the interesting new research papers and model releases related to inference-time compute scaling that followed the DeepSeek R1 release on January 22nd, 2025. (Originally, I wanted to cover methods from all categories in this article, but due to the excessive length, I decided to release a separate article focused on train-time compute methods in the future.)
Development process of DeepSeek's reasoning models that I discussed in my previous article, Understanding Reasoning LLMs (https://magazine.sebastianraschka.com/p/understanding-reasoning-llms). Before we look into the different areas of progress on reasoning models, with a focus on the inference-time compute scaling category, let me at least provide a brief overview of all the different categories. 1. Inference-time compute scaling This category includes methods that improve model reasoning capabilities at inference time without training or modifying the underlying model weights. The core idea is to trade increased computational resources for improved performance, which makes even fixed models more capable through techniques such as chain-of-thought reasoning and various sampling procedures. While I categorize inference-time compute scaling separately to focus on methods in this context, it is important to note that this technique can be applied to any LLM. For example, OpenAI developed its o1 model using reinforcement learning and then additionally leveraged inference-time compute scaling. Interestingly, as I discussed in my previous article on reasoning models (Understanding Reasoning LLMs), the DeepSeek R1 paper explicitly categorized common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model's natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling over the V3 base model. However, since explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, DeepSeek acknowledged that they could easily incorporate it into the R1 deployment or application. 2. Pure reinforcement learning This approach focuses solely on reinforcement learning (RL) to develop or improve reasoning capabilities. It typically involves training models with verifiable reward signals from math or coding domains. While RL allows models to develop more strategic thinking and self-improvement capabilities, it comes with challenges such as reward hacking, instability, and high computational costs. 3. Reinforcement learning and supervised fine-tuning This hybrid approach combines RL with supervised fine-tuning (SFT) to achieve more stable and generalizable improvements than pure RL. Typically, a model is first trained with SFT on high-quality instruction data and then further refined using RL to optimize specific behaviors. 4. Supervised fine-tuning and model distillation This method improves the reasoning capabilities of a model by instruction fine-tuning it on high-quality labeled datasets (SFT). If this high-quality dataset is generated by a larger LLM, then this methodology is also referred to as "knowledge distillation" or just "distillation" in LLM contexts. However, note that this differs slightly from traditional knowledge distillation in deep learning, which typically involves training a smaller model using not only the outputs (labels) but also the logits of a larger teacher model. Inference-time compute scaling methods The previous section already briefly summarized inference-time compute scaling.
Before discussing the recent research in this category, let me describe the inference-time scaling in a bit more detail. Inference-time scaling improves an LLM's reasoning by increasing computational resources ("compute") during inference. The idea why this can improve reasoning can be given with a simple analogy: humans give better responses when given more time to think, and similarly, LLMs can improve with techniques that encourage more "thought" during generation. One approach here is prompt engineering, such as chain-of-thought (CoT) prompting, where phrases like "think step by step" guide the model to generate intermediate reasoning steps. This improves accuracy on complex problems but is unnecessary for simple factual queries. Since CoT prompts generate more tokens, they effectively make inference more expensive. An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916). Another method involves voting and search strategies, such as majority voting or beam search, which refine responses by selecting the best output. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 1. "s1: Simple test-time scaling" The remainder of this article will be focused on the recent research advances in the inference-time scaling category for improving reasoning capabilities of LLMs. Let me start with a more detailed discussion of a paper that serves as an example of inference-time scaling. So, one of the interesting recent research papers in this category is s1: Simple Test-Time Scaling (31 Jan, 2025), which introduces so-called "wait" tokens, which can be considered as a more modern version of the aforementioned "think step by step" prompt modification. Note that this involves supervised finetuning (SFT) to generate the initial model, so it's not a pure inference-time scaling approach. However, the end goal is actively controlling the reasoning behavior through inference-time scaling; hence, I considered this paper for the "1. Inference-time compute scaling" category. In short, their approach is twofold: Create a curated SFT dataset with 1k training examples that include reasoning traces. Control the length of responses by: a) Appending "Wait" tokens to get the LLM to generate longer responses, self-verify, and correct itself, or b) Stopping generation by adding an end-of-thinking token delimiter ("Final Answer:"). They call this length control "budget forcing." Illustration of "wait" token insertion to control the length of the output. Annotated figure from https://arxiv.org/abs/2501.19393. Budget forcing can be seen as a sequential inference scaling technique because it still generates one token at a time (but just more of it). In contrast, we have parallel techniques like majority voting, which aggregate multiple independent completions. Correlation between response accuracy and length. Annotated figure from https://arxiv.org/abs/2501.19393. They found their budget-forcing method more effective than other inference-scaling techniques I've discussed, like majority voting. 
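To illustrate how budget forcing could be operationalized, here is a minimal sketch of a generation loop that suppresses the end-of-thinking delimiter and appends "Wait" until a minimum thinking budget is reached. The generate callable, the whitespace-based token accounting, and the delimiter handling are simplifying assumptions for the sketch, not the authors' implementation.

```python
# Minimal sketch of s1-style "budget forcing": keep the model thinking by
# appending "Wait" whenever it tries to stop early, then force the final
# answer. `generate` is any callable that wraps a real decoding loop and
# stops when it emits the given stop string (or when it finishes naturally).

END_OF_THINKING = "Final Answer:"


def budget_forced_generate(prompt, generate, min_thinking_tokens=1024, max_extensions=3):
    thinking = ""
    for _ in range(max_extensions + 1):
        # Decode until the model tries to emit the end-of-thinking delimiter.
        chunk = generate(prompt + thinking, stop=END_OF_THINKING)
        thinking += chunk
        if len(thinking.split()) >= min_thinking_tokens:
            break                     # thinking budget reached
        thinking += " Wait,"          # suppress the stop and nudge further reasoning
    # Force the end-of-thinking delimiter and let the model produce the answer.
    answer = generate(prompt + thinking + "\n" + END_OF_THINKING, stop=None)
    return thinking, answer
```

The key design choice, as described above, is that this is purely sequential: the same decoding stream is simply extended, in contrast to parallel techniques such as majority voting.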
If there's something to criticize or improve, I would've liked to see results for more sophisticated parallel inference-scaling methods, like beam search, lookahead search, or the compute-optimal search described in Google's Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters paper last year. Or even a simple comparison with a classic sequential method like chain-of-thought prompting ("Think step by step"). Anyway, it's a really interesting paper and approach! PS: Why "Wait" tokens? My guess is that the researchers were inspired by the "Aha moment" figure in the DeepSeek-R1 paper, where researchers saw LLMs coming up with something like "Wait, wait. Wait. That's an aha moment I can flag here.", which showed that pure reinforcement learning can induce reasoning behavior in LLMs. Interestingly, they also tried other tokens like "Hmm" but found that "Wait" performed slightly better. "Wait" vs. "Hmm" tokens. Annotated figure from https://arxiv.org/abs/2501.19393. Other noteworthy research papers on inference-time compute scaling Since it's been a very active month on the reasoning model research front, I need to keep the summaries of the other papers relatively brief to manage a reasonable length for this article. Hence, below are brief summaries of other interesting research articles related to inference-time compute scaling, sorted in ascending order by publication date. As mentioned earlier, not all of these articles fall neatly into the inference-time compute scaling category, as some of them also involve specific training. However, these papers have in common that controlling inference-time compute is a specific mechanism of action. (Many distilled or SFT methods that I will cover in upcoming articles will lead to longer responses, which can be seen as a form of inference-time compute scaling. However, they do not actively control the length during inference, which makes these methods different from those covered here.) 2. Test-Time Preference Optimization 📄 22 Jan, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback, https://arxiv.org/abs/2501.12895 Test-time Preference Optimization (TPO) is an iterative process that aligns LLM outputs with human preferences during inference (without altering the underlying model weights). In each iteration, the model: 1. Generates multiple responses for a given prompt. 2. Scores the responses with a reward model and selects the highest- and lowest-scoring ones as the "chosen" and "rejected" responses. 3. Compares and critiques the "chosen" and "rejected" responses. 4. Refines the output by converting the critiques into textual suggestions to update the original model response. A minimal sketch of this loop is shown below.
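The sketch below shows one such TPO-style iteration in Python. The llm and reward_model callables and the prompt templates are hypothetical stand-ins under the description above, not the paper's implementation.

```python
# Minimal sketch of one Test-Time Preference Optimization (TPO) iteration.
# `llm(prompt)` returns a text completion; `reward_model(prompt, response)`
# returns a scalar score. Both are hypothetical stand-ins for real models.

def tpo_iteration(prompt, llm, reward_model, num_samples=4):
    # 1. Generate multiple candidate responses.
    responses = [llm(prompt) for _ in range(num_samples)]

    # 2. Score them and pick the best ("chosen") and worst ("rejected").
    scored = sorted(responses, key=lambda r: reward_model(prompt, r))
    rejected, chosen = scored[0], scored[-1]

    # 3. Ask the model to compare and critique the pair.
    critique = llm(
        f"Prompt: {prompt}\nChosen response:\n{chosen}\n"
        f"Rejected response:\n{rejected}\n"
        "Explain why the chosen response is better and how it could be improved."
    )

    # 4. Convert the critique into textual suggestions and refine the chosen response.
    refined = llm(
        f"Prompt: {prompt}\nDraft response:\n{chosen}\n"
        f"Feedback:\n{critique}\nRewrite the draft, applying the feedback."
    )
    return refined
```

Running this loop for several iterations, feeding the refined output back in as the new draft, is what makes the procedure an inference-time alternative to weight-updating preference optimization.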

<antirez> 8 months ago

Reasoning models are just LLMs

It’s not new, but it’s accelerating. People who used to say that LLMs were a fundamentally flawed way to reach any useful reasoning and, in general, to develop any useful tool with some degree of generality, are starting to shuffle the deck, in the hope of looking less wrong. They say: “the progress we are seeing is due to the fact that models like OpenAI o1 or DeepSeek R1 are not just LLMs”. This is false, and it is important to expose this mystification as soon as possible. First, DeepSeek R1 (I don’t want to talk about o1 / o3, since it’s a private thing we don’t have access to, but it’s very likely the same) is a pure decoder-only autoregressive model. It’s the same next-token prediction that was so strongly criticized. There isn’t, anywhere in the model, any explicit symbolic reasoning or representation. Moreover, R1 Zero has reasoning capabilities similar to R1 without requiring *any* supervised fine-tuning: just generating chains of thought and improving them with a reward function, using reinforcement learning, was enough to learn a stronger form of reasoning. Interestingly enough, part of these capabilities was easily distilled into smaller models via SFT, which brings me to the next point. The other fundamental observation is that the S1 paper shows that you need very few examples (as few as 1,000) for the model to start being able to build complex reasoning steps and solve non-trivial mathematical problems. S1 and R1 Zero hint that, in some way, the models already learned the representations needed to perform reasoning during the pre-training step, just from the unsupervised next-word prediction training target.

Ahead of AI 8 months ago

Understanding Reasoning LLMs

This article describes the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic. In 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., "specializations"). Stages 1-3 are the common steps to developing LLMs. Stage 4 specializes LLMs for specific use cases. The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications. Because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later. To give you a brief glimpse of what's covered below, in this article, I will: Explain the meaning of "reasoning model" Discuss the advantages and disadvantages of reasoning models Outline the methodology behind DeepSeek R1 Describe the four main approaches to building and improving reasoning models Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases Provide tips for developing reasoning models on a tight budget I hope you find this article useful as AI continues its rapid development this year! If you work in AI (or machine learning in general), you are probably familiar with vague and hotly debated definitions. The term "reasoning models" is no exception. Eventually, someone will define it formally in a paper, only for it to be redefined in the next, and so on. In this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" requires some simple reasoning. For instance, it requires recognizing the relationship between distance, speed, and time before arriving at the answer. A regular LLM may only provide a short answer (as shown on the left), whereas reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs who have not been specifically developed for reasoning tasks can also provide intermediate reasoning steps in their answers. Most modern LLMs are capable of basic reasoning and can answer questions like, "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" So, today, when we refer to reasoning models, we typically mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs. Additionally, most LLMs branded as reasoning models today include a "thought" or "thinking" process as part of their response. Whether and how an LLM actually "thinks" is a separate discussion. Intermediate steps in reasoning models can appear in two ways. First, they may be explicitly included in the response, as shown in the previous figure. Second, some reasoning LLMs, such as OpenAI's o1, run multiple iterations with intermediate steps that are not shown to the user. 
"Reasoning" is used at two different levels: 1) processing the input and generating via multiple intermediate steps and 2) providing some sort of reasoning as part of the response to the user. Now that we have defined reasoning models, we can move on to the more interesting part: how to build and improve LLMs for reasoning tasks. However, before diving into the technical details, it is important to consider when reasoning models are actually needed. When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to "overthinking." Also here the simple rule applies: Use the right tool (or type of LLM) for the task. The key strengths and limitations of reasoning models are summarized in the figure below. The key strengths and weaknesses of reasoning models. Before discussing four main approaches to building and improving reasoning models in the next section, I want to briefly outline the DeepSeek R1 pipeline, as described in the DeepSeek R1 technical report . This report serves as both an interesting case study and a blueprint for developing reasoning LLMs. Note that DeepSeek did not release a single R1 reasoning model but instead introduced three distinct variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill. Based on the descriptions in the technical report, I have summarized the development process of these models in the diagram below. Development process of DeepSeeks three different reasoning models that are discussed in the DeepSeek R1 technical report. Next, let's briefly go over the process shown in the diagram above. More details will be covered in the next section, where we discuss the four main approaches to building and improving reasoning models. (1) DeepSeek-R1-Zero: This model is based on the 671B pre-trained DeepSeek-V3 base model released in December 2024. The research team trained it using reinforcement learning (RL) with two types of rewards. This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step, which is typically part of reinforcement learning with human feedback (RLHF). (2) DeepSeek-R1: This is DeepSeek's flagship reasoning model, built upon DeepSeek-R1-Zero. The team further refined it with additional SFT stages and further RL training, improving upon the "cold-started" R1-Zero model. (3) DeepSeek-R1-Distill*: Using the SFT data generated in the previous steps, the DeepSeek team fine-tuned Qwen and Llama models to enhance their reasoning abilities. While not distillation in the traditional sense, this process involved training smaller models (Llama 8B and 70B, and Qwen 1.5B–30B) on outputs from the larger DeepSeek-R1 671B model. In this section, I will outline the key techniques currently used to enhance the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1, OpenAI's o1 & o3, and others. Note: The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques. 
One way to improve an LLM's reasoning capabilities (or any capability in general) is inference-time scaling. This term can have multiple meanings, but in this context, it refers to increasing computational resources during inference to improve output quality. A rough analogy is how humans tend to generate better responses when given more time to think through complex problems. Similarly, we can apply techniques that encourage the LLM to "think" more while generating an answer. (Although, whether LLMs actually "think" is a different discussion.) One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting , where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems. (Note that it doesn't make sense to employ this strategy for simpler knowledge-based questions, like "What is the capital of France", which is again a good rule of thumb to find out whether a reasoning model makes sense on your given input query.) An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916). The aforementioned CoT approach can be seen as inference-time scaling because it makes inference more expensive through generating more output tokens. Another approach to inference-time scaling is the use of voting and search strategies. One simple example is majority voting where we have the LLM generate multiple answers, and we select the correct answer by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses. I highly recommend the Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters paper that I described in my previous Noteworthy AI Research Papers of 2024 (Part Two) article (https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2) for more details on these different strategies. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 The DeepSeek R1 technical report categorizes common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model's natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling compared to the V3 base model. However, explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, so DeepSeek may still apply such techniques within their app. I suspect that OpenAI's o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive compared to models like GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained using RL pipelines similar to those used for DeepSeek R1. More on reinforcement learning in the next two sections below. One of my personal highlights from the DeepSeek R1 paper is their discovery that reasoning emerges as a behavior from pure reinforcement learning (RL). Let's explore what this means in more detail. 
As outlined earlier, DeepSeek developed three types of R1 models. The first, DeepSeek-R1-Zero , was built on top of the DeepSeek-V3 base model, a standard pre-trained LLM they released in December 2024. Unlike typical RL pipelines, where supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained exclusively with reinforcement learning without an initial SFT stage as highlighted in the diagram below. The development process of DeepSeek-R1-Zero model. Still, this RL process is similar to the commonly used RLHF approach, which is typically applied to preference-tune LLMs. (I covered RLHF in more detail in my article, LLM Training: RLHF and Its Alternatives .) However, as mentioned above, the key difference in DeepSeek-R1-Zero is that they skipped the supervised fine-tuning (SFT) stage for instruction tuning. This is why they refer to it as "pure" RL. (Although, RL in the context of LLMs differs significantly from traditional RL, which is a topic for another time.) For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward. The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses. The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside <think> tags. Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an "Aha!" moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so, as shown in the figure below. A figure from the DeepSeek R1 technical report (https://arxiv.org/abs/2501.12948) showing the emergence of the "Aha" moment. While R1-Zero is not a top-performing reasoning model, it does demonstrate reasoning capabilities by generating intermediate "thinking" steps, as shown in the figure above. This confirms that it is possible to develop a reasoning model using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Next, let's look at the development of DeepSeek-R1, DeepSeek’s flagship reasoning model, which serves as a blueprint for building reasoning models. This model improves upon DeepSeek-R1-Zero by incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) to improve its reasoning performance. Note that it is actually common to include an SFT stage before RL, as seen in the standard RLHF pipeline. OpenAI's o1 was likely developed using a similar approach. The development process of DeepSeek-R1 model. As shown in the diagram above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold-start" SFT data. The term "cold start" refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning (SFT) data. Using this cold-start SFT data, DeepSeek then trained the model via instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage retained the same accuracy and format rewards used in DeepSeek-R1-Zero’s RL process. However, they added a consistency reward to prevent language mixing, which occurs when the model switches between multiple languages within a response. 
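To make the idea of these rule-based, verifiable rewards more concrete, here is a minimal sketch of what an accuracy reward and a format reward could look like. It is an illustrative simplification: the actual accuracy reward described above relies on code execution and a deterministic math checker, and the format reward is described as using an LLM judge rather than a simple regular expression.

```python
# Illustrative simplification of the two reward types described above:
# a format reward that checks for <think>...</think> tags and an accuracy
# reward that compares the final number in the response to a reference answer.

import re


def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0


def math_accuracy_reward(response: str, ground_truth: str) -> float:
    """Reward responses whose last number matches the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0


response = "<think>60 mph times 3 hours is 180 miles.</think> The answer is 180"
print(format_reward(response), math_accuracy_reward(response, "180"))
```

The appeal of such rewards is that they are cheap to compute and hard to fool compared to learned preference models, which is what makes math and coding domains so convenient for this kind of RL.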
The RL stage was followed by another round of SFT data collection. In this phase, the most recent model checkpoint was used to generate 600K Chain-of-Thought (CoT) SFT examples, while an additional 200K knowledge-based SFT examples were created using the DeepSeek-V3 base model. These 600K + 200K SFT samples were then used for instruction-finetuning DeepSeek-V3 base before following up with a final round of RL. In this stage, they again used rule-based methods for accuracy rewards for math and coding questions, while human preference labels used for other question types. All in all, this is very similar to regular RLHF except that the SFT data contains (more) CoT examples. And the RL has verifiable rewards in addition to human preference-based rewards. The final model, DeepSeek-R1 has a noticeable performance boost over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below. Benchmark comparison of OpenAI O1 and DeepSeek R1 models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). So far, we have covered three key approaches to building and improving reasoning models: 1. Inference-time scaling, a technique that improves reasoning capabilities without training or otherwise modifying the underlying model. 2. Pure reinforcement learning (RL) as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning. 3. Supervised fine-tuning (SFT) plus RL, which led to DeepSeek-R1, DeepSeek’s flagship reasoning model. So, what’s left? Model "distillation." Surprisingly, DeepSeek also released smaller models trained via a process they call distillation . However, in the context of LLMs, distillation does not necessarily follow the classical knowledge distillation approach used in deep learning. Traditionally, in knowledge distillation (as briefly described in Chapter 6 of my Machine Learning Q and AI book), a smaller student model is trained on both the logits of a larger teacher model and a target dataset. Instead, here distillation refers to instruction fine-tuning smaller LLMs, such as Llama 8B and 70B and Qwen 2.5 models (0.5B to 32B), on an SFT dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset that was used to train DeepSeek-R1, as described in the previous section. To clarify this process, I have highlighted the distillation portion in the diagram below. The development process of DeepSeek-R1-Distill models. Why did they develop these distilled models? In my opinion, there are two key reasons: 1. Smaller models are more efficient. This means they are cheaper to run, but they also can run on lower-end hardware, which makes these especially interesting for many researchers and tinkerers like me. 2. A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning. The table below compares the performance of these distilled models against other popular models, as well as DeepSeek-R1-Zero and DeepSeek-R1. Benchmark comparison of distilled versus non-distilled models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). 
As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong relative to DeepSeek-R1-Zero, despite being orders of magnitude smaller. It's also interesting to note how well these models perform compared to o1 mini (I suspect o1-mini itself might be a similarly distilled version of o1). Before wrapping up this section with a conclusion, there’s one more interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure RL approach from DeepSeek-R1-Zero directly to Qwen-32B. The results of this experiment are summarized in the table below, where QwQ-32B-Preview serves as a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team (I think the training details were never disclosed). This comparison provides some additional insights into whether pure RL alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero. Benchmark comparison distillation and RL on a smaller 32B model. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). Interestingly, the results suggest that distillation is far more effective than pure RL for smaller models. This aligns with the idea that RL alone may not be sufficient to induce strong reasoning abilities in models of this scale, whereas SFT on high-quality reasoning data can be a more effective strategy when working with small models. For completeness, it would have been useful to see additional comparisons in the table: 1. Qwen-32B trained with SFT + RL, similar to how DeepSeek-R1 was developed. This would help determine how much improvement can be made, compared to pure RL and pure SFT, when RL is combined with SFT. 2. DeepSeek-V3 trained with pure SFT, similar to how the distilled models were created. This would allow for a direct comparison to see how effective RL + SFT is over pure SFT. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. In this section, we explored four different strategies for building and improving reasoning models: 1. Inference-time scaling requires no additional training but increases inference costs, making large-scale deployment more expensive as the number or users or query volume grows. Still, it remains a no-brainer for improving the performance of already strong models. I strongly suspect that o1 leverages inference-time scaling, which helps explain why it is more expensive on a per-token basis compared to DeepSeek-R1. 2. Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. However, in practical model development, RL + SFT is the preferred approach as it leads to stronger reasoning models. I strongly suspect that o1 was trained using RL + SFT as well. More precisely, I believe o1 starts from a weaker, smaller base model than DeepSeek-R1 but compensates with RL + SFT and inference-time scaling. 3. As mentioned above, RL + SFT is the key approach for building high-performance reasoning models. DeepSeek-R1 is a nice blueprint showing how this can be done. 4. Distillation is an attractive approach, especially for creating smaller, more efficient models. However, the limitation is that distillation does not drive innovation or produce the next generation of reasoning models. 
For instance, distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data. One interesting aspect I expect to see next is to combine RL + SFT (approach 3) with inference-time scaling (approach 1). This is likely what OpenAI o1 is doing, except it's probably based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time. In recent weeks, many people have asked for my thoughts on the DeepSeek-R1 models. In short, I think they are an awesome achievement. As a research engineer, I particularly appreciate the detailed technical report, which provides insights into their methodology that I can learn from. One of the most fascinating takeaways is how reasoning emerged as a behavior from pure RL. And it's impressive that DeepSeek has open-sourced their models under a permissive open-source MIT license, which has even fewer restrictions than Meta's Llama models. How does it compare to o1? Is DeepSeek-R1 better than o1? I’d say it’s roughly in the same ballpark. However, what stands out is that DeepSeek-R1 is more efficient at inference time. This suggests that DeepSeek likely invested more heavily in the training process, while OpenAI may have relied more on inference-time scaling for o1. That said, it's difficult to compare o1 and DeepSeek-R1 directly because OpenAI has not disclosed much about o1. For instance, we don’t know: Is o1 also a Mixture of Experts (MoE)? How large is o1? Could o1 just be a slightly refined version of GPT-4o with minimal RL + SFT and only extensive inference-time scaling? Without knowing these details, a direct comparison remains an apples-to-oranges comparison. The cost of training DeepSeek-R1 Another point of discussion has been the cost of developing DeepSeek-R1. Some have mentioned a ~$6 million training cost, but they likely conflated DeepSeek-V3 (the base model released in December last year) and DeepSeek-R1. The $6 million estimate is based on an assumed $2 per GPU hour and the number of GPU hours required for the final training run of DeepSeek-V3, which was originally discussed back in December 2024. However, the DeepSeek team has never disclosed the exact GPU hours or development cost for R1, so any cost estimates remain pure speculation. Either way, ultimately, DeepSeek-R1 is a major milestone in open-weight reasoning models, and its efficiency at inference time makes it an interesting alternative to OpenAI’s o1. Developing a DeepSeek-R1-level reasoning model likely requires hundreds of thousands to millions of dollars, even when starting with an open-weight base model like DeepSeek-V3. This can feel discouraging for researchers or engineers working with limited budgets. The good news: Distillation can go a long way Fortunately, model distillation offers a more cost-effective alternative. The DeepSeek team demonstrated this with their R1-distilled models, which achieve surprisingly strong reasoning performance despite being significantly smaller than DeepSeek-R1. However, even this approach isn’t entirely cheap. Their distillation process used 800K SFT samples, which requires substantial compute. Interestingly, just a few days before DeepSeek-R1 was released, I came across an article about Sky-T1 , a fascinating project where a small team trained an open-weight 32B model using only 17K SFT samples. The total cost? Just $450, which is less than the registration fee for most AI conferences. 
This example highlights that while large-scale training remains expensive, smaller, targeted fine-tuning efforts can still yield impressive results at a fraction of the cost. Figure from the "Sky-T1: Train your own O1 preview model within $450" article, https://novasky-ai.github.io/posts/sky-t1/ According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost. Pure RL on a budget: TinyZero While Sky-T1 focused on model distillation, I also came across some interesting work in the "pure RL" space. One notable example is TinyZero , a 3B parameter model that replicates the DeepSeek-R1-Zero approach (side note: it costs less than $30 to train). Surprisingly, even at just 3B parameters, TinyZero exhibits some emergent self-verification abilities, which supports the idea that reasoning can emerge through pure RL, even in small models. The TinyZero repository mentions that a research report is still work in progress, and I’ll definitely be keeping an eye out for further details. A figure from the TinyZero repository (https://github.com/Jiayi-Pan/TinyZero) showing that the model is capable of self-verification. (It would have been interesting to see the response of the base model in comparison.) The two projects mentioned above demonstrate that interesting work on reasoning models is possible even with limited budgets. While both approaches replicate methods from DeepSeek-R1, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1), it would be fascinating to explore how these ideas can be extended further. Beyond Traditional SFT: Journey Learning One particularly interesting approach I came across last year is described in the paper O1 Replication Journey: A Strategic Progress Report – Part 1 . Despite its title, the paper does not actually replicate o1. Instead, it introduces an different way to improve the distillation (pure SFT) process. The key idea in the paper is "journey learning" as an alternative to "shortcut learning." Shortcut learning refers to the traditional approach in instruction fine-tuning, where models are trained using only correct solution paths. Journey learning, on the other hand, also includes incorrect solution paths, allowing the model to learn from mistakes. This approach is kind of related to the self-verification abilities observed in TinyZero’s pure RL training, but it focuses on improving the model entirely through SFT. By exposing the model to incorrect reasoning paths and their corrections, journey learning may also reinforce self-correction abilities, potentially making reasoning models more reliable this way. Journey learning, as opposed to traditional shortcut learning, includes wrong solutions paths in the SFT data. Annotated figure from the O1 Replication Journey: A Strategic Progress Report – Part 1 (https://arxiv.org/abs/2410.18982) This could be an exciting direction for future work, particularly for low-budget reasoning model development, where RL-based approaches may be computationally impractical. Anyways, a lot of interesting work is currently happening on the reasoning model front, and I'm sure we will see a lot more exciting work in the upcoming months! This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book . (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) 
Build a Large Language Model (From Scratch) now available on Amazon If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you! Stages 1-3 are the common steps to developing LLMs. Stage 4 specializes LLMs for specific use cases. The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications. Because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later. To give you a brief glimpse of what's covered below, in this article, I will: Explain the meaning of "reasoning model" Discuss the advantages and disadvantages of reasoning models Outline the methodology behind DeepSeek R1 Describe the four main approaches to building and improving reasoning models Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases Provide tips for developing reasoning models on a tight budget A regular LLM may only provide a short answer (as shown on the left), whereas reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs who have not been specifically developed for reasoning tasks can also provide intermediate reasoning steps in their answers. Most modern LLMs are capable of basic reasoning and can answer questions like, "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" So, today, when we refer to reasoning models, we typically mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs. Additionally, most LLMs branded as reasoning models today include a "thought" or "thinking" process as part of their response. Whether and how an LLM actually "thinks" is a separate discussion. Intermediate steps in reasoning models can appear in two ways. First, they may be explicitly included in the response, as shown in the previous figure. Second, some reasoning LLMs, such as OpenAI's o1, run multiple iterations with intermediate steps that are not shown to the user. "Reasoning" is used at two different levels: 1) processing the input and generating via multiple intermediate steps and 2) providing some sort of reasoning as part of the response to the user. When should we use reasoning models? Now that we have defined reasoning models, we can move on to the more interesting part: how to build and improve LLMs for reasoning tasks. However, before diving into the technical details, it is important to consider when reasoning models are actually needed. When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to "overthinking." Also here the simple rule applies: Use the right tool (or type of LLM) for the task. The key strengths and limitations of reasoning models are summarized in the figure below. The key strengths and weaknesses of reasoning models. 
A brief look at the DeepSeek training pipeline Before discussing four main approaches to building and improving reasoning models in the next section, I want to briefly outline the DeepSeek R1 pipeline, as described in the DeepSeek R1 technical report . This report serves as both an interesting case study and a blueprint for developing reasoning LLMs. Note that DeepSeek did not release a single R1 reasoning model but instead introduced three distinct variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill. Based on the descriptions in the technical report, I have summarized the development process of these models in the diagram below. Development process of DeepSeeks three different reasoning models that are discussed in the DeepSeek R1 technical report. Next, let's briefly go over the process shown in the diagram above. More details will be covered in the next section, where we discuss the four main approaches to building and improving reasoning models. (1) DeepSeek-R1-Zero: This model is based on the 671B pre-trained DeepSeek-V3 base model released in December 2024. The research team trained it using reinforcement learning (RL) with two types of rewards. This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step, which is typically part of reinforcement learning with human feedback (RLHF). (2) DeepSeek-R1: This is DeepSeek's flagship reasoning model, built upon DeepSeek-R1-Zero. The team further refined it with additional SFT stages and further RL training, improving upon the "cold-started" R1-Zero model. (3) DeepSeek-R1-Distill*: Using the SFT data generated in the previous steps, the DeepSeek team fine-tuned Qwen and Llama models to enhance their reasoning abilities. While not distillation in the traditional sense, this process involved training smaller models (Llama 8B and 70B, and Qwen 1.5B–30B) on outputs from the larger DeepSeek-R1 671B model. The 4 main ways to build and improve reasoning models In this section, I will outline the key techniques currently used to enhance the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1, OpenAI's o1 & o3, and others. Note: The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques. 1) Inference-time scaling One way to improve an LLM's reasoning capabilities (or any capability in general) is inference-time scaling. This term can have multiple meanings, but in this context, it refers to increasing computational resources during inference to improve output quality. A rough analogy is how humans tend to generate better responses when given more time to think through complex problems. Similarly, we can apply techniques that encourage the LLM to "think" more while generating an answer. (Although, whether LLMs actually "think" is a different discussion.) One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting , where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems. 
(Note that it doesn't make sense to employ this strategy for simpler knowledge-based questions, like "What is the capital of France", which is again a good rule of thumb to find out whether a reasoning model makes sense on your given input query.) An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916). The aforementioned CoT approach can be seen as inference-time scaling because it makes inference more expensive through generating more output tokens. Another approach to inference-time scaling is the use of voting and search strategies. One simple example is majority voting where we have the LLM generate multiple answers, and we select the correct answer by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses. I highly recommend the Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters paper that I described in my previous Noteworthy AI Research Papers of 2024 (Part Two) article (https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2) for more details on these different strategies. Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314 The DeepSeek R1 technical report categorizes common inference-time scaling methods (such as Process Reward Model-based and Monte Carlo Tree Search-based approaches) under "unsuccessful attempts." This suggests that DeepSeek did not explicitly use these techniques beyond the R1 model's natural tendency to generate longer responses, which serves as an implicit form of inference-time scaling compared to the V3 base model. However, explicit inference-time scaling is often implemented at the application layer rather than within the LLM itself, so DeepSeek may still apply such techniques within their app. I suspect that OpenAI's o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive compared to models like GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained using RL pipelines similar to those used for DeepSeek R1. More on reinforcement learning in the next two sections below. 2) Pure reinforcement learning (RL) One of my personal highlights from the DeepSeek R1 paper is their discovery that reasoning emerges as a behavior from pure reinforcement learning (RL). Let's explore what this means in more detail. As outlined earlier, DeepSeek developed three types of R1 models. The first, DeepSeek-R1-Zero , was built on top of the DeepSeek-V3 base model, a standard pre-trained LLM they released in December 2024. Unlike typical RL pipelines, where supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained exclusively with reinforcement learning without an initial SFT stage as highlighted in the diagram below. The development process of DeepSeek-R1-Zero model. Still, this RL process is similar to the commonly used RLHF approach, which is typically applied to preference-tune LLMs. (I covered RLHF in more detail in my article, LLM Training: RLHF and Its Alternatives .) However, as mentioned above, the key difference in DeepSeek-R1-Zero is that they skipped the supervised fine-tuning (SFT) stage for instruction tuning. This is why they refer to it as "pure" RL. (Although, RL in the context of LLMs differs significantly from traditional RL, which is a topic for another time.) 
For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward. The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses. The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside <think> tags. A figure from the DeepSeek R1 technical report (https://arxiv.org/abs/2501.12948) showing the emergence of the "Aha" moment. While R1-Zero is not a top-performing reasoning model, it does demonstrate reasoning capabilities by generating intermediate "thinking" steps, as shown in the figure above. This confirms that it is possible to develop a reasoning model using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach. Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. 3) Supervised finetuning and reinforcement learning (SFT + RL) Next, let's look at the development of DeepSeek-R1, DeepSeek’s flagship reasoning model, which serves as a blueprint for building reasoning models. This model improves upon DeepSeek-R1-Zero by incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) to improve its reasoning performance. Note that it is actually common to include an SFT stage before RL, as seen in the standard RLHF pipeline. OpenAI's o1 was likely developed using a similar approach. The development process of DeepSeek-R1 model. As shown in the diagram above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold-start" SFT data. The term "cold start" refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning (SFT) data. Using this cold-start SFT data, DeepSeek then trained the model via instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage retained the same accuracy and format rewards used in DeepSeek-R1-Zero’s RL process. However, they added a consistency reward to prevent language mixing, which occurs when the model switches between multiple languages within a response. The RL stage was followed by another round of SFT data collection. In this phase, the most recent model checkpoint was used to generate 600K Chain-of-Thought (CoT) SFT examples, while an additional 200K knowledge-based SFT examples were created using the DeepSeek-V3 base model. These 600K + 200K SFT samples were then used for instruction-finetuning DeepSeek-V3 base before following up with a final round of RL. In this stage, they again used rule-based methods for accuracy rewards for math and coding questions, while human preference labels used for other question types. All in all, this is very similar to regular RLHF except that the SFT data contains (more) CoT examples. And the RL has verifiable rewards in addition to human preference-based rewards. The final model, DeepSeek-R1 has a noticeable performance boost over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below. Benchmark comparison of OpenAI O1 and DeepSeek R1 models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948). 4) Pure supervised finetuning (SFT) and distillation So far, we have covered three key approaches to building and improving reasoning models: 1. 
4) Pure supervised finetuning (SFT) and distillation

So far, we have covered three key approaches to building and improving reasoning models:

1. Inference-time scaling, a technique that improves reasoning capabilities without training or otherwise modifying the underlying model.

2. Pure reinforcement learning (RL), as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning.

3. Supervised fine-tuning (SFT) plus RL, which led to DeepSeek-R1, DeepSeek's flagship reasoning model.

So, what's left? Model "distillation." Surprisingly, DeepSeek also released smaller models trained via a process they call distillation. However, in the context of LLMs, distillation does not necessarily follow the classical knowledge distillation approach used in deep learning. Traditionally, in knowledge distillation (as briefly described in Chapter 6 of my Machine Learning Q and AI book), a smaller student model is trained on both the logits of a larger teacher model and a target dataset. Instead, here distillation refers to instruction fine-tuning smaller LLMs, such as Llama 8B and 70B and Qwen 2.5 models (0.5B to 32B), on an SFT dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset that was used to train DeepSeek-R1, as described in the previous section.

To clarify this process, I have highlighted the distillation portion in the diagram below.

The development process of the DeepSeek-R1-Distill models.

Why did they develop these distilled models? In my opinion, there are two key reasons:

1. Smaller models are more efficient. This means they are cheaper to run, but they can also run on lower-end hardware, which makes them especially interesting for many researchers and tinkerers like me.

2. A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning.

The table below compares the performance of these distilled models against other popular models, as well as DeepSeek-R1-Zero and DeepSeek-R1.

Benchmark comparison of distilled versus non-distilled models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).

As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong relative to DeepSeek-R1-Zero, despite being orders of magnitude smaller. It's also interesting to note how well these models perform compared to o1-mini (I suspect o1-mini itself might be a similarly distilled version of o1).

Before wrapping up this section with a conclusion, there's one more interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure RL approach from DeepSeek-R1-Zero directly to Qwen-32B. The results of this experiment are summarized in the table below, where QwQ-32B-Preview serves as a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team (I think the training details were never disclosed). This comparison provides some additional insights into whether pure RL alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero.

Benchmark comparison of distillation and RL on a smaller 32B model. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).
Interestingly, the results suggest that distillation is far more effective than pure RL for smaller models. This aligns with the idea that RL alone may not be sufficient to induce strong reasoning abilities in models of this scale, whereas SFT on high-quality reasoning data can be a more effective strategy when working with small models.

For completeness, it would have been useful to see additional comparisons in the table:

1. Qwen-32B trained with SFT + RL, similar to how DeepSeek-R1 was developed. This would help determine how much improvement can be made, compared to pure RL and pure SFT, when RL is combined with SFT.

2. DeepSeek-V3 trained with pure SFT, similar to how the distilled models were created. This would allow for a direct comparison to see how effective RL + SFT is over pure SFT.

Conclusion

In this section, we explored four different strategies for building and improving reasoning models:

1. Inference-time scaling requires no additional training but increases inference costs, making large-scale deployment more expensive as the number of users or the query volume grows. Still, it remains a no-brainer for improving the performance of already strong models. I strongly suspect that o1 leverages inference-time scaling, which helps explain why it is more expensive on a per-token basis compared to DeepSeek-R1.

2. Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. However, in practical model development, RL + SFT is the preferred approach, as it leads to stronger reasoning models. I strongly suspect that o1 was trained using RL + SFT as well. More precisely, I believe o1 starts from a weaker, smaller base model than DeepSeek-R1 but compensates with RL + SFT and inference-time scaling.

3. As mentioned above, RL + SFT is the key approach for building high-performance reasoning models. DeepSeek-R1 is a nice blueprint showing how this can be done.

4. Distillation is an attractive approach, especially for creating smaller, more efficient models. However, the limitation is that distillation does not drive innovation or produce the next generation of reasoning models. For instance, distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data.

One interesting aspect I expect to see next is combining RL + SFT (approach 3) with inference-time scaling (approach 1). This is likely what OpenAI o1 is doing, except it's probably based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time.

Thoughts about DeepSeek R1

In recent weeks, many people have asked for my thoughts on the DeepSeek-R1 models. In short, I think they are an awesome achievement. As a research engineer, I particularly appreciate the detailed technical report, which provides insights into their methodology that I can learn from. One of the most fascinating takeaways is how reasoning emerged as a behavior from pure RL. And it's impressive that DeepSeek has open-sourced their models under a permissive MIT license, which has even fewer restrictions than Meta's Llama models.

How does it compare to o1? Is DeepSeek-R1 better than o1? I'd say it's roughly in the same ballpark. However, what stands out is that DeepSeek-R1 is more efficient at inference time.
This suggests that DeepSeek likely invested more heavily in the training process, while OpenAI may have relied more on inference-time scaling for o1. That said, it's difficult to compare o1 and DeepSeek-R1 directly because OpenAI has not disclosed much about o1. For instance, we don't know: Is o1 also a Mixture of Experts (MoE)? How large is o1? Could o1 just be a slightly refined version of GPT-4o with minimal RL + SFT and only extensive inference-time scaling?

Figure from the "Sky-T1: Train your own O1 preview model within $450" article, https://novasky-ai.github.io/posts/sky-t1/

According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost.

Pure RL on a budget: TinyZero

While Sky-T1 focused on model distillation, I also came across some interesting work in the "pure RL" space. One notable example is TinyZero, a 3B-parameter model that replicates the DeepSeek-R1-Zero approach (side note: it costs less than $30 to train). Surprisingly, even at just 3B parameters, TinyZero exhibits some emergent self-verification abilities, which supports the idea that reasoning can emerge through pure RL, even in small models. The TinyZero repository mentions that a research report is still work in progress, and I'll definitely be keeping an eye out for further details.

A figure from the TinyZero repository (https://github.com/Jiayi-Pan/TinyZero) showing that the model is capable of self-verification. (It would have been interesting to see the response of the base model in comparison.)

The two projects mentioned above demonstrate that interesting work on reasoning models is possible even with limited budgets. While both approaches replicate methods from DeepSeek-R1, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1), it would be fascinating to explore how these ideas can be extended further.

Beyond Traditional SFT: Journey Learning

One particularly interesting approach I came across last year is described in the paper O1 Replication Journey: A Strategic Progress Report – Part 1. Despite its title, the paper does not actually replicate o1. Instead, it introduces a different way to improve the distillation (pure SFT) process. The key idea in the paper is "journey learning" as an alternative to "shortcut learning." Shortcut learning refers to the traditional approach in instruction fine-tuning, where models are trained using only correct solution paths. Journey learning, on the other hand, also includes incorrect solution paths, allowing the model to learn from mistakes.
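To make the distinction a bit more tangible, here is a toy sketch of how the two kinds of SFT targets might differ. The field names and the wording of the self-correction are my own illustrative assumptions, not the data format used in the paper.

```python
# Shortcut learning: the training target contains only the clean, correct path.
shortcut_example = {
    "prompt": "What is 17 * 24?",
    "target": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408.",
}

# Journey learning: the target also contains a wrong turn plus the correction,
# so the model sees what recognizing and recovering from a mistake looks like.
journey_example = {
    "prompt": "What is 17 * 24?",
    "target": (
        "17 * 24 = 17 * 20 + 17 * 4 = 340 + 78 = 418. "
        "Wait, 17 * 4 is 68, not 78, so that step is wrong. "
        "Correcting it: 340 + 68 = 408. The answer is 408."
    ),
}
```

Both examples would be fed into ordinary instruction fine-tuning; only the composition of the target text differs.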

Rodney Brooks 9 months ago

Predictions Scorecard, 2025 January 01

[You can follow me on social media: @rodneyabrooks.bsky.social and see my publications etc., at https://people.csail.mit.edu/brooks ]

This is my seventh annual update on how my dated predictions from January 1st, 2018 concerning (1) self driving cars, (2) robotics, AI, and machine learning, and (3) human space travel, have held up. I promised then to review them at the start of the year every year until 2050 (right after my 95th birthday), thirty-two years in total. The idea is to hold myself accountable for those predictions. How right or wrong was I?

I have decided to change my rules for myself a little bit after this year, in response to the many many people who have said how much they enjoy seeing my updates. My predictions were mostly for the first few years, and by next year the density of due dates will be very low. So, on the eighth anniversary of my first set of predictions, i.e., a year from today, I will be making a new set of predictions centered on the period January 1st, 2026 to January 1st, 2036, and that will give a new density of predictions where there will be real meat to see how accurately they turn out.

What I Want to Achieve and a Changing Hype-driven Landscape

The level of hype about AI, Machine Learning and Robotics completely distorts people's understanding of reality. It distorts where VC money goes, always to something that promises impossibly large payoffs–it seems it is better to have an untested idea that would have an enormous payoff than a tested idea which can get to a sustainable business, but does not change the world for ever. It distorts what young researchers work on as they do not want to be seen as old fashioned even when the current hyped topic is sort of dumb–soon the dumbness is forgotten and the heat of the chase becomes all. It distorts what people think they need to get a degree in at college in order to have good career prospects. I want people to use rational thought processes when they hear about hyped ideas and be able to assess what is really going on, and what is just plain (to use the technical term) bullshit.

My Color Scheme and Past Analysis

The acronyms I used for predictions in my original post were as follows. NET year means it will not happen before that year (No Earlier Than). BY year means I predict that it will happen by that year. NIML, Not In My Lifetime, i.e., not before 2050. As time passes the mentioned years, I color them as accurate, too pessimistic, or too optimistic. This year I have added hemming and hawing. This is for when it looks like something I said would take a lot longer has happened, but the underlying achievement is not what everyone expected, and is not what was delivered. This is mostly for things that were talked about as being likely to happen with no human intervention, and it now appears to happen that way, but in reality there are humans in the loop that the companies never disclose. So the technology that was promised to be delivered hasn't actually been delivered, but everyone thinks it has been.

I have not changed any of the text of the first three columns of the prediction tables since their publication on the first day of 2018. I only change the text in the fourth column to say what actually happened. This meant that by two years ago that fourth column was getting very long and skinny, so I removed the old comments and started with fresh ones last year. I have kept last year's comments and added new ones, with yellow backgrounds, for this year.
If you want to see the previous five years of comments you can go back to the 2023 scorecard.

There has been a lot of activity in both self driving cars (the demise of Cruise, a big push by Waymo to scale human-assisted deployments, and lots of smoke and mirrors from an electric car company) and in AI, where robotics has been pulled into the ultra hyposphere while in generative AI the end of scaling and the introduction of inference mechanisms (!!) have been hotly announced and disputed. The human spaceflight endeavor, as it did last year, has crawled along and again has stretched out dates that were probably too optimistic in the first place.

<rant>

We all know about FOMO, Fear Of Missing Out. In late 2023, for a talk on generative AI that I gave at MIT, I coined another acronym, FOBAWTPALSL, Fear Of Being A Wimpy Techno-Pessimist And Looking Stupid Later. Perhaps that one is a little bit too much of a mouthful to catch on. These two human insecurities lead people to herd-like behavior in establishing and propagating the zeitgeist on almost any topic. They lead to people piling on the hype fiestas, rushing to invest (money, effort, or hope) in marginal ideas once they have become a little bit popular, or believing our airspace is being invaded by foreign drones.

“Mounting evidence, and lack thereof, suggests that perhaps the whole craze has been a sort of communal fever dream fueled by crowd mentality, confirmation bias and a general distrust in all things official.”

That quote is from the drone story linked to above, but it could well have been about the hype that we are moving towards AGI (Artificial General Intelligence).

I want to be clear: as there has been for almost seventy years now, there has been significant progress in Artificial Intelligence over the last decade. There are new tools and they are being applied widely in science and technology, and are changing the way we think about ourselves, and how to make further progress. That being said, we are not on the verge of replacing and eliminating humans in either white collar jobs or blue collar jobs. Their tasks may shift in both styles of jobs, but the jobs are not going away. We are not on the verge of a revolution in medicine and the role of human doctors. We are not on the verge of the elimination of coding as a job. We are not on the verge of replacing humans with humanoid robots to do jobs that involve physical interactions in the world. We are not on the verge of replacing human automobile and truck drivers world wide. We are not on the verge of replacing scientists with AI programs.

Breathless predictions such as these have happened for seven decades in a row, and each time people have thought the end is in sight and that it is all over for humans, that we have figured out the secrets of intelligence and it will all just scale. The only difference this time is that these expectations have leaked out into the world at large. I'll analyze why this continues to happen below in the section on AI and ML.

Here is a list of some of those hype cycles that I, personally, have perceived and lived through, as taken from my presentation at MIT in late 2023 that I referenced above re FOBAWTPALSL. Really, was there really hype about all these things? Yes, there was, within the circles that cared.
Those circles have gotten wider and wider, and when reigning world chess champion Garry Kasparov was beaten by I.B.M.'s Deep Blue computer under tournament conditions in 1997 it was widely reported in the popular press. And it was declared that it was all over for humans.

Back in February 2011 a computer program named Watson played on the television game show Jeopardy against all-time human champions. John Markoff, legendary technology reporter at the New York Times, wrote stories about this the day before the competition, and the day after, when Watson had indeed beaten the humans, with the same questions (fed to it as text at the same time as the humans heard the questions) all running on a cluster of machines not connected to an outside network. Here are three successive paragraphs from the second of those stories.

For I.B.M., the future will happen very quickly, company executives said. On Thursday it plans to announce that it will collaborate with Columbia University and the University of Maryland to create a physician's assistant service that will allow doctors to query a cybernetic assistant. The company also plans to work with Nuance Communications Inc. to add voice recognition to the physician's assistant, possibly making the service available in as little as 18 months.

“I have been in medical education for 40 years and we're still a very memory-based curriculum,” said Dr. Herbert Chase, a professor of clinical medicine at Columbia University who is working with I.B.M. on the physician's assistant. “The power of Watson-like tools will cause us to reconsider what it is we want students to do.”

I.B.M. executives also said they are in discussions with a major consumer electronics retailer to develop a version of Watson, named after I.B.M.'s founder, Thomas J. Watson, that would be able to interact with consumers on a variety of subjects like buying decisions and technical support.

My personal experience at that time was that people I did not know, but who had heard about my role at MIT (as director of the MIT AI Lab, and then founding director of MIT CSAIL, the Computer Science and Artificial Intelligence Lab), would come up to me and ask about the future of medicine. The people were variously doctors or health industry executives. I reassured them that medicine as we knew it then would stay much the same and was not about to be rendered obsolete.

And then in 2016 Geoff Hinton, one of the key architects of Deep Learning (which has had undeniable impact on the world) said: “People should stop training radiologists now. It is just completely obvious that within five years deep learning is going to be better than radiologists.”

More people asking me whether this was true. It wasn't in five years and it isn't now. We need more radiologists than ever. And yes they do use deep learning tools to help them see some things they wouldn't otherwise see. But they also understand anomalies using causal reasoning and we would be in a sorry state if all radiology was done by programs today.

Now look at those plum colored paragraphs above again as you take yourself way back in time to a year or so ago when ChatGPT was just a baby AGI. You can find stories just like this one if you substitute “ChatGPT” for “Watson” and “Microsoft” for “I.B.M.” The things confidently predicted in 2011 (and in 1979, and in 2016) about the end of doctors didn't happen then and are not happening now. Nor are all the other jobs ending. Today I get asked about humanoid robots taking away people's jobs.
In March 2023 I was at a cocktail party and there was a humanoid robot behind the bar making jokes with people and shakily (in a bad way) mixing drinks. A waiter was standing about 20 feet away silently staring at the robot with mouth hanging open. I went over and told her it was tele-operated. “Thank God” she said. (And I didn’t need to explain what “tele-operated” meant). Humanoids are not going to be taking away jobs anytime soon (and by that I mean not for decades). You, you people!, are all making fundamental errors in understanding the technologies and where their boundaries lie. Many of them will be useful technologies but their imagined capabilities are just not going to come about in the time frames the majority of the technology and prognosticator class, deeply driven by FOBAWTPALSL, think. But this time it is different you say. This time it is really going to happen. You just don’t understand how powerful AI is now, you say. All the early predictions were clearly wrong and premature as the AI programs were clearly not as good as now and we had much less computation back then. This time it is all different and it is for sure now. Yeah, well, I’ve got a Second Coming to sell you… </rant> As with flying cars the definition, or common understanding, of what self driving cars  really means has changed since my post on predictions seven years ago.  At that time self driving cars meant that the cars would drive themselves to wherever they were told to go with no further human control inputs. Now self driving cars means that there is no one in the drivers seat, but there may well be, and in all cases so far deployed, humans monitoring those cars from a remote location , and occasionally sending control inputs to the cars. The companies do not advertise this feature out loud too much, but they do acknowledge it, and the reports are that it happens somewhere between every one to two miles traveled. These inputs are not direct control of the normal human mechanism of control the steering wheel, the brakes, and the accelerator.  Rather they are advice that overrides some of the algorithms.  For instance, “steer out into the next lane and go around this truck” as the human realizes that the truck is just not going to move (see an anecdote below on the first night I took the new Waymo taxis in San Francisco (I had previously last ridden a Waymo in 2012 in Mountain View)). Why is this difference important?  One of the motivations for self driving cars was that the economics of taxis, cars that people hire at any time for a short ride of a few miles from where they are to somewhere else of their choosing, would be radically different as there would be no driver. Systems which do require remote operations assistance to get full reliability cut into that economic advantage and have a higher burden on their ROI calculations to make a business case for their adoption and therefore their time horizon to scaling across geographies. But wait, you might say, isn’t that electric car company that used to be based in California and is now based in Texas going to roll this out imminently and have a fully digital taxi service. They demoed it on a Hollywood movie studio lot just this year, and the cars were painted gold. Hmm. The location of the demo and the fact that the cars, even down to the tires, were painted gold tells you everything you need to know. 
Both the cars and the humanoid robots at that event were presented as autonomous but in reality they were all tele-operated directly by people (see below in the humanoid section for more details). And that same electric car company is actively hiring people into paying jobs as remote operators. There was a reasonably balanced appraisal from Reuters just after the event, though it does not go into details of the demos. Here is a direct quote from the story: “We do expect to start fully autonomous unsupervised FSD in Texas and California next year,” Musk said.

The astute reader will note that this is the 11th year in a row that the CEO of Tesla has made this prediction of the same milestone happening the next year. We can admire the consistency. Actual self-driving is now generally accepted to be much harder than everyone believed.

The reason that this bait and switch is important to understand is that the promise of inevitable fully self driving technology upended a historical way that new transportation systems have been adopted. In the past whenever we have introduced new transportation mechanisms there have been large investments in infrastructure and that infrastructure is shared and used by everyone. The Romans built roads so soldiers and traded goods could travel long distances–in Europe those road networks are still the basis of today's road networks. When steam engine driven trains were the new transportation technology vast networks of rails were built allowing goods to move long distances in mere hours or days. When Ford started mass production of automobiles he built roads, and the local governments followed, and the Federal government followed, and those roads are what we use today.

Actual fully self driving cars promised that no infrastructure changes would be needed to revolutionize how vehicles would be controlled. Each individual vehicle would do what was needed all by itself. As sensors and networks got better there was no need for expensive new infrastructure because of this promise. The promise was false. If government and private partnerships in building smart roads, a hot topic in the 1990s, had continued, every one of us would now have smarter safer cars, but still with onboard human drivers taking over in many situations. But we would have had smart freeways where once you were on it your car would be self driving. The road would have had lots of sensors effectively shared across all cars, as that data would have been transmitted to all passing cars. It would have been a fraction of the cost per car compared to the sensing on today's almost but not really self driving cars like those of Waymo. And we would have had much more accurate congestion data where the root causes of local congestion would have been sensed with semantic understanding rather than just inferring it from the aggregate collection of location data from phones, individual cars, and historical data from roadside sensors.

Instead we now have individual corporate actors using a mixture of partial self driving and remote human supervision. The big question is whether the economics of this works at scale, and whether the fake promises will drive out the human drivers in cheaper services and we'll all end up paying more. Will the level of hype we saw push our decentralized transportation system into the hands of a few wealthy companies, and in effect make it a centralized system where everybody has to pay private companies to be part of it?
As a reminder of how strong the hype was, and the certainty of promises that it was just around the corner, here is a snapshot of a whole bunch of predictions by major executives from 2017. I have shown this many times before but there is one new annotation here for 2024. The years in parentheses are when the predictions were made. The years in blue are the predicted years of achievement. When a blue year is shaded pink it means that it did not come to pass by then. The predictions with orange arrows are those that I had noticed had later been retracted. The prediction that Jaguar and Land-Rover made that they would have fully autonomous cars by 2024 did not come to pass, so I have shaded it pink. Note that every single blue year up until now is shaded pink, and that every one that is shaded pink has still not come to pass. None of the predictions that were out there in 2017 for the next few years have happened. None. There are three more for 2025, and I am sure that a year from now they will all be shaded pink also.

One of the big selling points of self driving cars was that they would be safer than cars driven by humans. So far that is not holding up with real data. One electric car maker with self driving software had it disengage when it sensed there would be an accident, supposedly so that the human could take over in a split second. And then the company did not report the incident as the fault of the software as it was no longer controlling the car when the impact occurred. It was reported, and I had this experience myself in my last ride in a Cruise in 2023, that Cruise vehicles would freeze when an accident looked likely, and then not report it as their software's fault as the car was stationary and was hit by another car. In many reported cases, and in my case, simply continuing to move forward would avert any likely accident (fortunately for me the human driver of the other car slammed on the brakes and did not hit my robot vehicle).

In this story from the Washington Post about Federal investigations into the safety incidents with self driving cars, they report that the companies involved claim they have vast amounts of driving on our roads under their belt. Not so.

An industry association says autonomous vehicles have logged a total of 70 million miles, a figure that it compares to 293 trips to the moon and back. But it's a tiny fraction of the almost 9 billion miles that Americans drive every day. The relatively small number of miles the vehicles have driven makes it difficult to draw broad conclusions about their safety.

To put that into perspective, the total number of miles driven by all autonomous (sort of) vehicles over the last decade is less than 1% of the miles driven by humans every day in the United States. It is a tiny, tiny portion.

Take a look at this embedded video from the Wall Street Journal about investigations of crashes (many of which have been fatal) involving autonomous driving systems. From the audio: “The kinds of things that tend to go wrong with these systems are things like it was not trained on, pictures of an overturned double trailer. It just didn't know what it was. There were some lights there, but the lights were in unusual positions. A person would have clearly said something big is in the middle of the road.
But the way machine learning works is it trains it on a bunch of examples and if it encounters something it doesn’t have a bunch of examples for it may have no idea what’s going on.” [[My own take is that the fetish of end to end learning leads people to leave out well known algorithms that might solve many of  these problems (e.g,, the incredibly simple time to collision algorithms based on looming ). Yes, end to end learning made speech understanding systems better, but that does not mean it is the appropriate fetish to apply everywhere.]] Pro tip: Think about this history of industry prognostications about fully autonomous driving being just around the corner when you read today’s prognostications about LLMs taking jobs, en masse, in the next couple of years, or humanoid robots being dirt cheap and being able to learn how to do any human manual task real real soon now. You know you have seen this movie before… My own experiences with Waymo in 2024 I have two sorts of experiences with Waymo vehicles. First, as a driver of my own vehicle and sharing road space with them every single time that I drive. And second, as a user of their ride service. The streets of San Francisco had been thick with Waymo vehicles with no driver in them especially in the second half of 2024. As I drive across the city every morning to head down to my robotics/AI startup half way down the peninsula I see them everywhere until I get on to 101.  I see them in front of me and behind me and in adjacent lanes as I drive on multilane one way streets. Sometimes I see four of them in a single block. Twice I’ve seen four of them in a line, in my block and could see four of them in a line in the block ahead of me.  When I am at four way intersections with no traffic lights I see them participating in the social ritual of taking your turn to drive through the intersection in the order you stopped, except when a pedestrian is crossing in front of you. They do that pretty well. They do less well when they accidentally get into a line of parents’ cars snaking around a corner for school drop off or pickup. Over the last few months I have noticed that in general they are getting more aggressive about stretching the rules, just like people do. Otherwise human drivers (including me) take advantage of their politeness. That aggression is not always welcomed. One morning I saw a workman with a group doing some digging on a road, and holding a sign with SLOW on one side and STOP on the other side have to jump in front of a Waymo to get it to do what he was trying to tell it to do with the sign. STOP. It wasn’t stopping for no stinking sign! The only time I have seen a Waymo go into reverse, ever, was when I was illegally driving the wrong way down a single lane street and we were heading straight at each other. As a rider I feel they are not quite aggressive enough with human drivers some time, so a ride in a Waymo takes longer than with an Uber or Lyft. It is hit and miss where they drop me off. Sometimes they take a place to pull over half a block from my house, even when it is raining. There is no way to adjust what they happen to decide that day, even though I know that they will always be able to pull in right in front of my house. The first time I took a Waymo this year, on the way home it picked me up at a restaurant and then was about to make a right turn. But at that corner there was an 18 wheeler with its lights flashing and surrounded by green cones. 
It pulled right in behind that truck and waited a long time before it drove forward. I am guessing a remote operator intervened and told it to go around, because eventually it pulled around it in the lane just to the left. Based on seeing Waymos interact with orange cones I suspect it would have done better if the cones had been orange rather than green. This easily illustrates that the learning that this robot does, and indeed any robot does, is nothing like the learning that people do (see my rant about the seven deadly sins and mistaking performance for competence in the section below on advances in AI and ML).

I mostly feel safe when I am a passenger in a Waymo. Sometimes I feel that the driver of an Uber I am taking is not as safe as I would prefer.

Self Driving Taxi Services

There have been three self driving taxi services in the US in various stages of play over the last handful of years, though it turns out, as pointed out above, that all of them have remote operators. They are Waymo, Cruise, and Zoox. Waymo and Cruise are similar in that they use conventional cars adorned with lots of sensors. Zoox has purpose built vehicles that have no steering wheel or pedals for brake or accelerator. Waymo and Cruise went for deployments in large parts of two or more cities and have had ride services callable by apps, just as one can do with Uber or Lyft. Zoox is smaller scale, much more restricted in geography, and really not comparable.

At this time last year Cruise was in trouble as it had suspended all of its San Francisco operations under pressure from regulators after some bad accidents that happened in a way that never would happen for human driven cars. Briefly, their cars were getting hit at night by emergency vehicles with lights flashing as the Cruise cars crossed intersections. Human drivers see the reflections of lights from such vehicles flashing even if they don't see the vehicles themselves. The Cruise vehicles were only reacting to flashing lights that they could perceive directly. But the accident that tipped the scales was when a pedestrian crossing in front of a human driven vehicle was hit and went flying in the air, landing right in front of a Cruise. The Cruise hit the person (who now disappeared from sight) as a human driver would most likely have done. But then it proceeded to drive 20 feet with the human underneath the vehicle being dragged along as it went into a mode where it was supposed to get off the road. A human driver would not have reacted that way to having been in a collision, even if it was not their fault.

The hammer finally fell in December of 2024. General Motors shut down Cruise. The leading paragraphs from this linked story from the Wall Street Journal are:

General Motors has scrapped its Cruise robotaxi program after nearly a decade and $10 billion in development, citing the time and costs needed to scale the business and rising competition.

GM on Tuesday said it plans to realign its autonomous driving strategy and give priority to development of advanced driver assistance systems, which take over steering and other functions in certain situations and are common on new vehicles today. The automaker said it would continue to develop fully autonomous technology for personal vehicles, and build on the progress of its Super Cruise system, a hands-off, eyes-on driving feature that the company introduced several years ago.

GM said it owns about 90% of Cruise and intends to buy out the remaining investors.
It plans to combine the technical teams from Cruise and GM into a single effort to advance autonomous and assisted driving.

“We want to leverage what already has been done as we go forward in this,” Chief Executive Mary Barra told analysts on a call Tuesday.

The Detroit automaker said it expects the restructuring to reduce spending by more than $1 billion annually after the proposed plan is completed, which is expected in the first half of next year.

While there are 40 companies that have permits to test autonomous driving in California alone, the demise of Cruise leaves just one company, Waymo, trying to make an actual go of a digital taxi service in the United States. They have an enormous lead over anyone else who wants to get into this business and have spent billions of dollars (probably very much north of $10 billion) on this endeavor over the last 15 years. In an email they sent me a couple of weeks ago, as a user of their services, they reported that they provided 4 million customer rides in 2024. That is approximately 4 million more than any other company in the United States.

Despite being so far out in front it has not been all smooth sailing for Waymo. Early in the year the operations center for Waymo somehow neglected to realize it was Chinese New Year in Chinatown in San Francisco. So Waymo vehicles were routed through that area on the biggest night of celebration. Any human driver would have realized that the streets, i.e., the street surfaces where cars usually drive, were completely packed with humans, no doubt some of whom were intoxicated as well as just being out having a good time. Not so the Waymo vehicles. They tried pushing through the very very dense crowds, no doubt annoying many people. And what do people have at Chinese New Year? Fireworks. So some revelers decided to push back on this robot car invading their space. Here are a couple of pictures of the results. Not pretty. And an example of how taking away people's agency is never a good idea for robots (see my second law of robotics).

Throughout 2024 Waymo has been investigated for various accidents such as those described in this Wall Street Journal article. “Reports included collisions with stationary or semistationary objects, such as gates, chains or parked vehicles, according to the regulator.”

In the middle of the summer Waymo added a feature where they would honk their horns at cars in their way. But this backfired when hundreds of Waymos were coming back to their parking lot in the very early hours of the morning, and they started honking at each other and waking up human neighbors. Eventually that got fixed.

In late September a motorcade for Kamala Harris in San Francisco was brought to a halt by a Waymo that stopped in the middle of California Street doing a U-turn in front of it. I'm sure this incident was of great concern to the Secret Service. Eventually a San Francisco police officer got into the car and drove it out of the way–this is shown in a video included with the story above. I do not know how the officer got access to the vehicle and whether Waymo remote operations were cooperating.

More disturbingly, humans outside the Waymos started harassing humans inside them. The most concerning cases come from the realization that if a woman is in a Waymo at night she will be dropped off, outside, on a public road at the end of her journey with no option but to get out of the car where it has stopped.
So groups of men have followed Waymos with women in them and then harassed the woman when she gets out. If she was driving her own car she might be heading to an off road parking space or she might choose not to stop if she knows she is being followed. There are no such options in a Waymo, so taking a Waymo at night is less safe than other means of transportation–just follow it and eventually the preyed upon woman will have to get out. Here is a very recent disturbing story about this practice.

Meanwhile Waymo managed to raise $5.6B to expand to new cities in 2025. It already operates in parts of San Francisco, Los Angeles, and Phoenix. The new money will let it expand to Austin and Atlanta in the United States and to start operating in parts of Tokyo in Japan. That is expensive expansion.

Here is the question for the future of watered down remote monitored "autonomous" driving systems (let's call it "watered down autonomy"), and it is up to Waymo now. Can Waymo expand fast enough in these new markets in 2025 and take enough business from what is left of traditional taxi operators, along with those operating under the Uber and Lyft models, and do it in a way which is in sight of profitability, so that it has a case to raise the stupendous amounts of money needed to operate in all large cities in the US in the next 10 to 20 years? If Waymo can not succeed at this in the next two years I think the idea of large scale use of watered down autonomy will be dead for at least a decade or two. Right now full autonomy everywhere is already dead.

Electric Cars

Last year US manufacturers pulled back on their planned production of EVs. In data from this report we can see that sales dropped at the start of 2024 but have now picked up again. There is steady growth in sales but my prediction of 30% of US car sales being electric by 2027 now seems wildly optimistic. We need two doublings to get there in three years and the doubling rate seems more like one doubling in four to five years. Note that some sources include hybrids and hydrogen powered cars in electric vehicles but I am using the battery electric vehicle (BEV) numbers. To see how the trends are across brands you can see a breakout for Q2 of 2024 here.

There appear to be two main headwinds for BEV adoption. Firstly, if one doesn't have on-property residential parking it is hard work in the US to find a place to recharge, and it takes hours for the charging to finish. This will stop many city dwellers from adopting. Secondly, the increased tire wear adds up to real money. The maintenance requirements for BEVs are much less than for cars with an internal combustion engine. On the other hand tires do not last as long (I have had to buy four new tires in less than two years of owning my first BEV), apparently due to the increased weight of the car.

Flying Cars

Flying cars are another category where the definitions have changed. Back when I made my predictions it meant a vehicle that could both drive on roads and fly through the air. Now it has come to mean an electric multi-rotor helicopter that can operate like a taxi between various fixed landing locations. Often touted are versions that have no human pilot. These are known as eVTOLs, for "electric vertical take off & landing". Large valuations have been given to start-ups who make nice videos of their electric air taxis flying about. But on inspection one sees that they don't have people in them. Often, you might notice, even those flights are completely over water rather than land.
I wrote about the lack of videos of viable prototypes back in November 2022. Nevertheless there have been wild predictions.  I ended a longer version of this component in last year’s annual review with: Also note the size of this vehicle. There are many fossil fuel powered helicopters that are much smaller. This is not going to be a personally owned vehicle for the masses. Don’t hold your breath. They are not here. They are not coming soon. Nothing has changed. Billions of dollars have been spent on this fantasy of personal flying cars.  It is just that, a fantasy, largely fueled by spending by billionaires. So what happened in Robotics, AI, and Machine Learning this year? Many, many, many people got just a little bit over excited. That’s what happened. There have been a lot of party tricks and it is the researchers who often play the tricks on themselves without realizing it. This is not new, none of it is new. But there are orders of magnitude more people watching it now, and more people are out to make a buck by being hypesters, promising riches to those who will invest in their irrationally overpriced companies. How could this be? We are seeing mass sinning, lots and lots of people committing some of the seven deadly sins of predicting the future of AI  which I wrote about back in 2017 here (or here you can see a professionally edited version of that blog post of mine). Four of those seven sins seem most relevant to today’s hyped up atmosphere around robotics, AI, and machine learning. Here now are short descriptions of these particular four sins, edited down from my earlier much more detailed descriptions. Then I will weave them together to explain how it is still pretty much business as usual, and I mean that in a good way, with steady progress on both the science and engineering of AI. Performance versus Competence One of the social skills that we all develop is an ability to estimate the capabilities of individual people with whom we interact. We use cues from how a person performs any particular task to estimate how well they might perform some different task. We are able to generalize from observing performance at one task to a guess at competence over a much bigger set of tasks. These estimators that we have all inherited or learned do not generalize well to other creatures or machines. We are not good at guessing which smart things other species might be able to do, and we are not good at guessing what an AI system can do when we have seen it do a few tasks in a limited domain. We get it wrong all the time. Indistinguishable from Magic When people cannot explain how something works they cannot know its limits as they do not have any sort of model (nor have they seen enough examples of it before). Arthur C. Clarke said that any sufficiently advanced technology is indistinguishable from magic. In our minds UFOs can do all sorts of amazing things as we have no way of knowing their limits–they may as well be magic, And that is what they become in speculation about them. Isaac Newton spent half his working life on alchemy as he did not know that the nucleus of atoms were not subject to mere chemistry. He would have been just as ignorant of the limitations of an iPhone screen (different sort of apple…), despite his own ground breaking work in optics. Remember, he was a really really smart dude. But even he was not able to develop all the theories needed to understand the world around him, despite his successes with calculus and gravity and the makeup of white light. 
He attributed properties to chemistry that were way beyond its limits.

Exponentialism

We have just lived through sixty years of the most phenomenal growth of a technology in the history of humankind. It is the story of silicon-based computation. Everyone has some idea about Moore's Law, at least enough to sort of know that computers get better and better on a clockwork-like schedule. This reality has trained people to think that probably a lot of other things in tech will change exponentially, especially when that thing has a strong computational component. The sin of exponentialism is to argue that some other process is going to follow a Moore's-like law when it is unwarranted to so argue. Moore's law worked for so long because in the starting technology of the 1960s the currents used to represent digital information were many many orders of magnitude beyond the minimal physical limit needed to determine whether they were present or not, and hence distinguish a 1 from a 0. Those currents could be halved many times without breaking physics limits.

Speed of Deployment

New technologies get deployed much more slowly than people imagine. Even software technologies. The old internet protocol, IPv4, can only address about four billion, or 4×10⁹, devices, which is way less than the number of people on our planet. A new protocol, IPv6, which can address more than 3×10³⁸ devices, was meant to replace it over a two year period of dual use by about 2003. But in 2024 IPv4 was still there and carrying over half the world's internet traffic despite its inadequacies.

Most functioning businesses that operate in the physical world are very averse to taking up new technology as it dramatically increases existential risk to their business. They must foresee immediate and incredibly high return on investment (ROI) to be tempted to move to new technologies. Even the military is slow to adopt new technologies. The US Air Force still flies the B-52H variant of the B-52 bomber. This version was introduced in 1961, making it 63 years old. The last one was built in 1963, a mere 61 years ago. Currently these planes are expected to keep flying until at least 2040, and perhaps longer–there is talk of extending their life out to 100 years.

What does this all mean?

Right now there is incredible hype for both Large Language Models (LLMs), and all their variations, and for humanoid robots, especially humanoid robots that are going to learn how to do things. The hype is driven by the four sins above.

LLMs have proved amazingly facile with language. They have been trained on pretty much all the text that is available on the Web and all the digitized historical books that exist. Miraculously, LLMs seem to be able to infer a representation of some sort that is somewhat independent of the particular human language that they read. So they are able to translate between human languages, and when you ask them just about anything they produce text in the language that you asked in, and that text often seems entirely reasonable and informative. I used the word "miraculously" as we do not really understand why they are able to do what they do. We, of course, know that the architecture for them is built around noticing correlations in vast amounts of text that connect some tens of thousands of tokens, which are the components of words in each language that is digested. It is a surprise that they work as well as they do, and produce coherent sounding language on just about any topic.
Here is the original architectural diagram from the 2017 Attention Is All You Need paper: Each column from bottom to top is a pure feed forward network, with no search, no iteration, no conventional algorithm at all. There are inputs at the bottom and then layer upon layer of linear neurons that have numbers or weights stored in them that multiply and add their inputs and threshold that sum to provide an output. The detail in the architectural diagram is how the connections between layers are organized. On the left is an input or question, in a linear string of words, from a user. That gets injected half way up the network on the right and remains constant while a single iteration process runs. The stack on the right outputs a word (or token) and that gets fed back to the bottom of that stack, and a new token pops out the top. All the output tokens that have so far been produced remain in the right bottom input buffer as ordered input. What the network has been trained to do, is given the user input on the left, and what the network has output so far, choose a very likely next word, given the billions of examples it has seen in training. Some randomness is used to choose among a small number of very likely next words at each stage. There are hundreds of billions of weights that get learned and stored in the layers of network to act as multipliers for each individual input to each layer. So now us humans are faced with looking at this system running and our human nature just makes us commit the first two sins from above.  It is in our nature and we cannot help ourselves. First, we see really impressive examples of responses to input questions, and if a human was giving those answers we would estimate that person to be quite clever and able to reason. Often though, because they have so many billions of examples on which they were trained LLMs are essentially looking up the question in the weights. The weight if gained from all of human knowledge that is out there on the network in language form. Invisibly the network is perhaps (but not in any intentional way) merging some similar questions, and then merging the answers which were already in the vast data that it has seen. But us dumb humans just think the damn thing is really really smart. Then, since we don’t have a real explanation in our heads for what it is doing we start thinking it is magic, and that there is no real limit to what it is extracting from all that data (that it used a significant portion of the energy budget for many different countries to compute) and how general its capabilities will be. It becomes magic. And then researchers try to show that it can reason, that it has inferred a spatial understanding of the world, that language can be used to do all sorts of things that Moravec’s paradox tells us it can’t. There is a lot of magical thinking that humans do about LLMs. Of course it can diagnose diseases like a doctor talking about them. Of course it can teach a student as well as a human teacher. Of course it can program as well as a human computer programmer. It is magic after all. But in reality the fact that it is just picking likely next words means that in fact we can’t trust its output. Some outputs are great. Some are pure confabulations (most people use the word “hallucinations” for this, but I prefer “confabulations”). And we do not know which we will get ahead of time, or more perniciously how much of each we will get, trustworthy pieces of output and confabulated pieces of output all jumbled together. 
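To make the generation loop described a few paragraphs above concrete, here is a minimal pseudocode-style sketch in Python. The `model` and `tokenizer` objects are placeholders (assumed for illustration, not any particular library's API), but the structure matches the description: feed in the prompt plus everything generated so far, get scores over the vocabulary, pick one of the few most likely next tokens with some randomness, and repeat.

```python
import random

def generate(model, tokenizer, prompt, max_new_tokens=100, top_k=5):
    # The prompt and all previously generated tokens are the input at every step.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        scores = model(tokens)  # one feed-forward pass: a score per vocabulary token
        # Keep only the few most likely candidates and pick one with some randomness
        # (real systems weight this choice by the scores rather than picking uniformly).
        top_candidates = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:top_k]
        next_token = random.choice(top_candidates)
        tokens.append(next_token)            # feed it back in for the next step
        if next_token == tokenizer.eos_id:   # stop at the end-of-sequence token
            break
    return tokenizer.decode(tokens)
```

Nothing in this loop checks the output against the world; a fluent confabulation and a trustworthy answer are produced by exactly the same mechanism, which is the point being made above.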
Not to worry, say the proponents: more learning will fix it. Fire up a nuclear power plant (I am not making this up–the tech companies are getting more nuclear power built or activated so that their LLMs can learn what a human learns using just 20 watts powering their brain; I am not confabulating this!!), and we'll feed it more data and it will become more trustworthy. It is magic after all.

But the magic is not going as well as the proponents imagined and promised, as this Wall Street Journal story explains. Their imaginations were definitely encouraged by exponentialism, but in fact all they knew was that when they went from smallish to largish networks following the architectural diagram above, the performance got much better. So the inherent reasoning was that if more made things better then more more would make things more better. Alas for them it appears that this is probably not the case. But rabid exponentialists have not yet given up. Expect a bunch of VCs to adversely affect the growth of pension funds around the world as pension funds are a prime source of capital that VCs spend.

More serious academics are working on boxing in the LLMs with more external mechanisms beyond just feeding the output tokens back in as a linear string of input. Many of these mechanisms look a lot like more conventional AI mechanisms, and we will see where these additions prove to be useful, how much of the wheel will be reinvented, and how long (months? years? decades?) it will take to get there. And the answers to those last questions will tell us how much sinning has been done by companies in predicting fast deployments.

Back in the rant at the beginning of this post I gave the example of I.B.M. and Watson and their completely optimistic predictions of how any problems of applying Watson (which seemed extremely competent based on its performance on live TV) to the real world would be solvable. The areas in which it was predicted to be applicable came from magical thinking. Surely no one today could be as dumb as that big company was back in 2011. Surely not. No, not us smart inhabitants of 2025. It's us. We are nowhere near as dumb as them!!

Humanoid Robots

The other thing that has gotten over hyped in 2024 is humanoid robots. The rationale for humanoid robots being a thing is a product of the four sins above and I think way less rooted in reality than the hype about LLMs. In fact I think it is pretty dumb. [[I suspect many people will reason that I cannot have a valid opinion about this precisely because I happen to have built more humanoid robots than anyone else on the planet. So read ahead with caution.]]

My first law of robotics states: The visual appearance of a robot makes a promise about what it can do and how smart it is. It needs to deliver or slightly over deliver on that promise or it will not be accepted.

The first sentence describes, I think, what is sucking people into believing that humanoid robots have a big future. It looks like a human, so its performance will be like a human, so it will be competent like a human. It's the performance/competence sin without even waiting for the performance part! The second sentence describes how the humanoid fever will break, and how the hundreds of millions of dollars put into many of these companies (billions of dollars overall) will disappear. The puppets will not perform at acceptable levels. It is easy to see this as you hear all the things investors and CEOs of humanoid robot companies say they will be able to do.
They have hardly even got to the lab demonstration phase. My third law of robotics is: Technologies for robots need 10+ years of steady improvement beyond lab demos of the target tasks to mature to low cost and to have their limitations characterized well enough that they can deliver 99.9% of the time. Every 10 more years gets another 9 in reliability.

For real work, robots need to operate with four, five, or six nines. We are a long way from that. The zeitgeist is that we will simply teach the robots to do stuff and then they will be able to do it. BUT, we do not know yet whether that is going to work. In order for it to work you have to both collect the right sort of data and then learn the right things from that data. It is not at all clear to me that we know the answers to make either of those things true. I think it will be an active place for lots of good research for many years to come.

There is an excellent survey paper on the current research state of the art called Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes . Unfortunately I think the title of the paper is going to confuse many people. “Real-World Successes” to someone like me, who these days deploys robots that people pay for and that provide real ROI, sounds like it is about systems that have been deployed. But on reading the paper it turns out that they mean learning and demonstrations done in a lab setting on physical hardware rather than just in simulations and simulators. And, to me the lab demonstrations are shakier (literally) than I imagined in my third law above. I think we are a long way off from being able to for-real deploy humanoid robots which have even minimal performance to be useable, and even further off from ones that have enough ROI for people to want to use them for anything beyond marketing the forward thinking outlook of the buyer.

Despite this, many people have predicted that the cost of humanoid robots will drop exponentially as their numbers grow, and so they will get dirt cheap. I have seen people refer to the cost of integrated circuits having dropped so much over the last few decades as proof. Not so. They are committing the sin of exponentialism in an obviously dumb way. As I explained above, the first integrated circuits were far from working at the physical limits of representing information. But today’s robots use mechanical components and motors that are not too far at all from physics-based limits on mass, force, and energy. You can’t just halve the size of a motor and have a robot lift the same sized payload. Perhaps you can halve it once to get rid of inefficiencies in current designs. Perhaps. But you certainly can’t do it twice. Physical robots are not ripe for exponential cost reduction by squeezing the waste out of current designs. And it won’t happen just because we start (perhaps) mass producing humanoid robots (oh, by the way, I already did this a decade ago–see my parting shot below). We know that from a century of mass producing automobiles. They did not get exponentially cheaper, except in the computing systems. Engines still have mass and still need the same amount of energy to accelerate good old fashioned mass.

This Year’s Prediction Update

There is only one new comment in my robotics, AI and ML predictions table this year. There are a bunch of well funded new companies in the home robot space, and perhaps they will come up with new mobility solutions, which in my experience is the big blocker for home robots.
A Parting Shot

I recently read a research paper on humanoid robots working in built-for-human environments. It was based on the argument that the best form for a robot that is to operate in human environments is something tallish and skinny-ish, and probably dynamically balancing, with arms that can reach down to table tops etc., and with a sensor system that can look down from above, as that is what our human environments are optimized for. Here is the first paragraph of the paper:

The past decade has seen an explosion of research in humanoid robotics. The stated motivations for this work have varied widely. Many teams have concentrated on bipedal locomotion, some have been interested in human level social interactions, understanding human intelligence, modeling human learning capabilities and others have been more interested in entertainment. Some humanoid robots have had manipulation capabilities on static humanoid platforms and some of that work is aimed at dexterity, plus there has been simple two armed grasping on mobile humanoid platforms. Overall there has been very little work combining dexterous manipulation with humanoid robots, static or mobile–much of that which has appeared, has been concerned with dynamic tasks like pole balancing and juggling rather than manipulation, or has used teleoperated manipulation.

Apart from the weird references to pole balancing and juggling this all sounds pretty reasonable and consistent with what is happening today, and with recent history. In fact this is the very first paragraph of the very first paper in the very first issue of the very first volume of the International Journal of Humanoid Robotics . And it was published in 2004, with me as first author. Let me spell that out in case you thought there was a typo in the year. This is from a paper that I and my students and post-docs wrote in the year two thousand and four . Here is the beginning of the contents page for that first issue. You can download the text of that paper here . The journal is now in its 21st year of operation, and on its 21st volume of issues and papers.

By the time this paper was written my research group at MIT had been working on and building humanoid robots for twelve years. This paper, about a robot named Cardea, was probably our sixth or seventh humanoid robot. [[In 2008 I started a company that built and shipped thousands of humanoid robots. The picture at the top of this post was taken in China with a line up of humanoids that we had built in Massachusetts and New Hampshire and sold to people in China (before a US initiated trade war with China put an end to it in 2018 …irony can be personally hard to take at times…).]]

The robot Cardea (Cardea was an ancient Roman goddess of door hinges and handles; these are still a challenge for modern robots…) was a two wheeled dynamically balancing robot that lived in a built-for-humans office environment. Cardea was able to open doors using existing door handles and then make its way through doors it had opened.

Pro tip: Just because you heard about a new idea this last year or two doesn’t mean that people haven’t been working on that very same idea for decades. So temper your expectations that it must be about to transform the world. Ideas that transform the world take decades, or centuries of development, and plenty of people long before you have been just as excited about the idea and had thought it was on the verge of taking off.
And none of us, including you and me, are likely to be special enough or lucky enough to come along at just the right time to see it all happen.

Like all modern humanoid robots, Cardea did not walk in a way that used passive dynamics to store energy, and basically modulate the behavior of a passive mechanism that had only low energy input, which is how all animals walk. So, like all modern mobile humanoid robots (and legged robots in general) when things were going awry its control algorithms tried to recover by pumping in large amounts of energy very quickly, and sometimes that didn’t quite work and the energy needed to go somewhere. Cardea could be a little dangerous in those circumstances, if it fell on you having just increased its kinetic energy. Even the spring-based deployment system for its stick-like legs, which was engaged when it realized it was going to fall, could be dangerous.

This is still a problem with all modern humanoid robots. That is why the tele-operated humanoids that were in the Tesla movie lot theater show a couple of months ago operated in two modes. When they all walked out the human guests were kept away from them. Once they stopped walking and were operating in a very different mode people were allowed to approach them, and then get fooled into thinking they were talking to an AI powered robot when they were really talking to a remote human operator. But the robot was no longer moving its feet, and no longer a source of physical danger as a result.

Another pro tip: Don’t stand anywhere near a walking or balancing wheeled humanoid when it is moving or doing any task. I have had some near misses for myself with my own humanoids twenty years ago and more recently with some of the humanoids from new start ups. And more generally never be below any sort of walking robot, no matter how many legs it has, when it is walking up stairs.

The number of flights in 2024 was not much different from 2023 (I neglected to include the flights by China last year). It does not feel like a golden age of human spaceflight, though there were other highlights from SpaceX.

Orbital Crewed Flights

Three countries put 28 people into orbit in 2024: the United States launched 16 people on five flights, and Russia and China launched 6 people each, with two launches apiece. So there were nine crewed orbital flights total. Two were private and seven were government flights.

The United States: There were four US flights to the International Space Station, starting with the private Axiom-3 mission with a crew of four on January 18th. The launch vehicle for this was a SpaceX Falcon 9, and the crew vehicle was a SpaceX Dragon. The remaining US flights to the ISS were paid for by NASA. Two of them were SpaceX flights, with four people on March 4th, the Crew-8 mission, and two people on board Crew-9 on October 25th. The remaining US flight to the ISS was the inaugural crewed flight of Boeing’s Starliner, launched on June 5th atop an Atlas V rocket with two people aboard. They are still stuck in space and will be for a few more months–see the section on Boeing below.

The other US mission was also a SpaceX launch and vehicle flight, this time known as Polaris Dawn. It was the second mission paid for by billionaire Jared Isaacman, with him as commander. There was a former US Air Force fighter pilot as mission pilot and two SpaceX employees as mission specialists, giving a total crew size of four.
They stayed aloft for five days, launching on September 10th. This mission flew higher above Earth than any mission since Apollo 17, the last lunar landing mission, in 1972. Two of the crew “spacewalked” with their feet inside the Dragon capsule but with their bodies outside. This was the first private spacewalk ever. Now Isaacman has been tapped by the incoming US President to be the administrator of NASA.

Russia: There were two Soyuz launches, each with three people, up and down, but different people coming back. The launch dates were March 23rd and September 11th. The six people that launched on Soyuz in 2024 were 3 Russian Cosmonauts, 2 NASA Astronauts, and one Belarusian commercial airline flight attendant who won a national competition with 3,000 applications. She was the only one not set for a long duration mission and was off the ground for slightly less than 14 days. So there were no space tourists per se, but the Belarusian flyer was most likely included as part of Russia’s efforts to keep in good favor with Belarus, which has aided it in its war in Ukraine, and was certainly not part of the regular scientific program of the ISS.

China: There were two flights of Shenzhou (a larger, more modern version of Soyuz) that were crewed in 2024. Both flights were to the Tiangong Space Station and both took along three Taikonauts, first on April 25th and then on October 9th. Both crews were assigned long duration missions and now the crews are overlapping previous crews at Tiangong, so it is now being continuously occupied. The first handover this year took about five days and the second about three and a half weeks. Both times there were six Taikonauts onboard Tiangong at the same time.

Suborbital Crewed Flights

There have been two companies providing space tourism on suborbital flights. Blue Origin launches a capsule on top of a reusable rocket, New Shepard, and the capsule lands using a parachute and a brief rocket blast right before hitting the ground (similar to how Soyuz lands). Virgin Galactic has a winged craft which is carried aloft by a bigger, jet-engined airplane; it separates at high altitude within the atmosphere and rockets into space. It flies back and lands on a runway. Both companies are run by billionaires who made their money in other businesses. Both billionaires have flown to space on their own craft. Both companies have aimed to have regular launches with lots of tourists, but neither has gotten to that scale and so far only a very small number of the many people who have paid a substantial deposit have been able to fly.

Blue Origin had a failure with an uncrewed version of the vehicle in 2022 and only flew one flight in 2023, which was also uncrewed. This year they flew three crewed flights on May 19th, August 29th, and November 22nd, each with six passengers (the system is automated and requires no pilots). In 2021 and 2022 they also had three flights each, so there have now been nine crewed flights total. The first two took four passengers and the remaining seven have had six passengers, so altogether they have flown 50 people above the Karman line, 100 kilometers above Earth. This is not yet a regular cadence, nor a large scale tourist business.

In 2024 Virgin Galactic had two flights, each with two crew from the company and four passengers. These flights were on January 26th and June 8th. Virgin Galactic flights are now on hiatus, awaiting a new bigger and better vehicle in about two years.
Virgin Galactic has had a total of twelve flights since December 13th in 2018. Three have had two people on board and nine have had six people on board, for a total of sixty filled seats that have crossed the Karman line. The total number of different people is smaller as the two pilot seats on each flight have been occupied by a small number of people who have flown multiple times. So, in 2024 thirty people went on suborbital flights, and altogether there have been 110 people on these commercial suborbital flights. Space tourism on suborbital flights has yet to take off in a regular or scaled way.

Boeing’s Starliner

First announced in 2010, Boeing’s Starliner was originally scheduled to fly a human crew in 2018. It carried out its second uncrewed flight in May 2022, and finally did make its first crewed flight on June 5th. The crew of two docked with the ISS, but there were problems with multiple gas thrusters for fine motion during the docking. The original plan was that the crew would stay on the ISS for about a week and then return to Earth for a touchdown onto hard soil (as all Russian and Chinese crewed missions end, along with all Blue Origin sub-orbital flights). The option of that return was considered, but the thrusters were on a section of the vehicle which is discarded along the way before the landing, so there was no possibility of getting a look at the hardware back on Earth. So a program of tests while docked to the ISS was started, delaying the crew return. Eventually it was decided that it was too risky for the crew to return on the craft and so it returned empty on September 7th, landing in New Mexico. As it happened, although there were more anomalies with the thrusters, the crew would have landed safely had they been on board.

Now the crew was stranded in space with no designated ride home. It was decided to remove two crew from the Crew-9 launch and have the Starliner astronauts, Barry Wilmore and Sunita Williams, fly back on that SpaceX Dragon with the other two, which after additional delays is now scheduled to happen some time in March 2025. Their one week visit to the ISS will have stretched out to nine months by then. Boeing has committed to fixing the problems with Starliner. The boosters that it uses are no longer being built, but there are five existing ones reserved for the five additional contracted flights that Boeing has with NASA. They are supposed to happen once per year. We do not know at this point, but I think it would not be a huge surprise if Starliner never flies again.

SpaceX Falcon 9

Once again the Falcon 9 launch system has broken all sorts of records for number of launches and reuse. During 2024 there were 132 single booster launches. For two of those flights no attempt was made to recover the first stage (there is a performance penalty for the primary payload in order to recover the first stage). One attempted recovery failed when the booster (on its 23rd flight) caught fire as it landed on the recovery barge. Another booster has since flown a total of 24 times. In terms of mission success all but one of these flights succeeded; one failed when the second stage failed during re-ignition for adjusting the orbit. There were also two launches of the Falcon Heavy, the three booster version, both of which succeeded. One of them had successful landings for the two side boosters, but there was no attempt to recover the central booster on that flight, and no attempt to recover any of the three boosters on the other Heavy flight.
This brings the total number of launches of the single booster version to 417, along with 11 launches of the three booster Heavy version. These numbers are way beyond the number of launches for any other orbital booster. Additionally it is the only flying orbital system that is reusable at the moment, though Blue Origin and Rocket Lab both plan on joining the club soon.

It is worth, once again, looking at how long it has taken to get to a total (across both single booster and Heavy triple booster versions) of 428 launches, with only three failures to deliver the payload to where it was intended to go. The first launch occurred in June 2010, and there were a total of 4 launches in the first three years. The first successful booster recovery happened on the 20th flight, in December 2015, five and a half years in. The first reuse of a booster occurred in 2017, in the 8th year of the program. Since 2021 there has been a steady increase in the number of launches per year. SpaceX had previously gotten satellites to orbit with its first rocket, the Falcon 1. Falcon 9 has been a spectacular success. But it was not instantaneous. It took time to build up the cadence of launches, about 10 years before the hockey stick curve showed up. Deployment is never sudden but comes after a long build.

SpaceX Starship

Starship is SpaceX’s superheavy two stage rocket, designed to put 150 tons of payload into orbit, but also be able to go to the Moon or Mars. There is the booster, which is designed only to work in the Earth’s atmosphere, with 33 Raptor engines, both to get the second stage high enough and fast enough and to let the first stage have a controlled return to the launch site. The second stage, called Starship, is both a booster and the payload. It has three Raptor engines and three Raptor vacuum engines. The Raptor engines are designed to get the Starship into orbit after the first stage drops away, and to guide the Starship as it returns to its Earth launch site. The Raptor vacuum engines are meant for breaking out of Earth orbit and going to the Moon or Mars, and to do soft landings on those two bodies where there is no or almost no atmosphere.

In 2024 SpaceX made steady progress with four launches of the two stages coupled together. The first two launches led to both stages blowing up. The third and fourth launches were a big improvement. As with earlier flights they launched from the coast of Texas. In both cases the second stage did a reentry burn on its first orbit and then did a soft landing in a target zone in the Indian Ocean. In the third flight the main booster returned to the launch site and hovered next to the launch tower between two giant arms, which then captured it, and the engines shut down successfully. It was sufficiently damaged during flight, however, that it was not reusable. In the fourth flight there were health anomalies and the first stage was ditched in the Gulf of Mexico. On that fourth flight there was both less heat shielding and much less damage from heat during reentry.

This is definite forward progress. But it is still quite a long way from both being operational and both stages being reusable. And it is even further away from being human rated. This is the vehicle that the CEO of SpaceX recently said would be launched to Mars and attempt a soft landing there. He also said that if successful, humans would fly to Mars on it in 2030. These are enormously ambitious goals just from a maturity of technology standpoint.
The real show stopper, however, may be human physiology, as evidence accumulates that humans would not survive three years (the minimum duration of a Mars mission, due to orbital mechanics) in space with current shielding practices and the current lack of on-board gravity in designs. Those two challenges may take decades, or even centuries to overcome (recall that Leonardo Da Vinci had designs for flying machines that took centuries to be developed…).

The President of SpaceX may be taking a leaf out of the CEO’s book of always overly optimistic predictions. In November she said “I would not be surprised if we fly 400 Starship launches in the next four years”. Looking at the success of Falcon 9 it is certainly plausible that I may live to see 400 Starship launches in a four year period, but I am quite confident that it will not happen in the next four years (2025 through 2028).

One more thing. Back when I first made the predictions there had been an announcement by the CEO of SpaceX that the company was under contract to send a very rich paying customer on a trip around the Moon in 2018, launched on a Falcon Heavy. I was completely skeptical. Over the years the date got pushed back and pushed back, and the proposed flight vehicle was changed to be Starship. As we all know the flight of the Japanese billionaire around the Moon still hasn’t happened. In 2024 Yusaku Maezawa finally gave up waiting and cancelled the contract .

NASA Artemis

NASA’s plan is that the second Artemis mission, using the Orion Capsule, Artemis II, will fly to the Moon with four people aboard, the first crewed Artemis flight. An uncrewed flight of Orion around the Moon flew in 2022. The crewed flight was scheduled to launch in May 2024, but it was first delayed by six months and then a little more, and in the last year it has slipped another full year. It is now scheduled to fly in April 2026. Artemis III was scheduled to launch in 2025 with a return to the surface of the Moon. However that relied on using a Starship (itself refueled in LEO by 14 (yes, fourteen !!) other Starship launches) to land there. No one believes that schedule any longer, and it will likely be delayed by a few years, given where Starship is in its development and current capability. The official schedule says mid 2027, but that seems unlikely. You can find the architecture of the Artemis III mission at this website .

Blue Origin Orbital BE-4 Engines and New Glenn

The suborbital tourist flights that Blue Origin operates are not its main business. It has ambitions to compete head to head with SpaceX. Another billionaire vs billionaire competition. It has developed the BE-4 engine designed to fly 100 times, and to power the first stage of its massive New Glenn rocket (see below). But in the meantime it has started selling the BE-4 to ULA (United Launch Alliance) to power their Vulcan Centaur heavy launch vehicle. Its first stage uses two BE-4 engines, along with a variable number of solid fuel strap ons. Vulcan Centaur flew two times in 2024 and the BE-4 engines worked perfectly both times, on January 8th and again on October 4th. This is a solid validation of the engine’s capabilities.

Blue Origin’s own first orbital class rocket, New Glenn, is massive, and comparable to the Falcon Heavy (three boosters) rather than the Falcon 9 in capability. It has been in development for a long time, but saw its first visits to a launch pad, fully stacked, in 2024. The first stage uses seven BE-4 engines, and is intended to land on a barge and be fully reusable.
The second stage uses two BE-3U engines, a variant of the single engine used on their New Shepard sub-orbital space tourism vehicle. There is a project underway to make a fully reusable version of the second stage. Launch seems imminent. Here it is at the launch pad in November 2024. On Friday, December 27th, 2024, it was fully fueled in both stages, went through a countdown, and fired its seven BE-4 engines for 24 seconds . Now it will leave the pad to have its payload installed. The launch could be as early as January 6th. The very first launch will be an all up affair, attempting to get something to orbit and land the booster on its first flight. This is a very different development approach to that used by SpaceX.

Let’s Continue a Noble Tradition!

The billionaire founders of both Virgin Galactic and Blue Origin had faith in the systems they had created. They both personally flew on the first operational flights of their sub-orbital launch systems. They went way beyond simply talking about how great their technology was; they believed in it, and flew in it. Let’s hope this tradition continues. Let’s hope the billionaire founder/CEO of SpaceX will be onboard the first crewed flight of Starship to Mars, and that it happens sooner than I expect. We can all cheer for that.

Ahead of AI 9 months ago

Noteworthy AI Research Papers of 2024 (Part One)

To kick off the year, I've finally been able to complete the draft of this AI Research Highlights of 2024 article. It covers a variety of topics, from mixture-of-experts models to new LLM scaling laws for precision. Reflecting on all the major research highlights of 2024 would probably require writing an entire book. It's been an extraordinarily productive year, even for such a fast-moving field. To keep things reasonably concise, I decided to focus exclusively on LLM research this year. But even then, how does one choose a subset of papers from such an eventful year? The simplest approach I could think of was to highlight one paper per month: January through December 2024. So, in this article, I'll share research papers that I personally found fascinating, impactful, or, ideally, both. However, note that this article is just Part One , focusing on the first half of 2024 from January through June. Part 2 of this series, covering July to December, will be shared later in January.

The selection criteria are admittedly subjective, based on what stood out to me this year. I've also aimed for some variety, so it's not all just about LLM model releases. If you're looking for a broader list of AI research papers, feel free to check out my earlier article ( LLM Research Papers: The 2024 List ). For those who read my previous article , I’m happy to share that I’m already feeling a bit better and slowly but steadily recovering! I also want to express my heartfelt thanks for all the kind wishes and support. It truly meant the world to me and helped me through some tough days! Happy new year and happy reading!

Only a few days into January 2024, the Mistral AI team shared the Mixtral of Experts paper (8 Jan 2024), which described Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) model. The paper and model were both very influential at the time, as Mixtral 8x7B was (one of) the first open-weight MoE LLMs with an impressive performance: it outperformed Llama 2 70B and GPT-3.5 across various benchmarks.

An MoE, or Mixture of Experts, is an ensemble model that combines several smaller "expert" subnetworks inside the GPT-like decoder architecture. Each subnetwork is said to be responsible for handling different types of tasks or, more concretely, tokens. The idea here is that by using multiple smaller subnetworks instead of one large network, MoEs aim to allocate computational resources more efficiently. In particular, the idea in Mixtral 8x7B is to replace each feed-forward module in a transformer architecture with 8 expert layers, as illustrated in the figure below.

Annotated transformer architecture from Attention Is All You Need, https://arxiv.org/abs/1706.03762

"Sparse" in the context of a "Sparse Mixture of Experts" refers to the fact that at any given time, only a subset of the expert layers (typically 1 or 2 out of the 8 in Mixtral 8x7B) are actively used for processing a token. As illustrated in the figure above, the subnetworks replace the feed-forward module in the LLM. A feed-forward module is essentially a multilayer perceptron; in PyTorch-like pseudocode, it essentially looks like the short sketch shown at the end of this section. In addition, there is also a Router module (also known as a gating network ) that redirects each of the token embeddings to the 8 expert feed-forward modules, where only a subset of these experts are active at a time. Since there are 11 more papers to cover in this article, I want to keep this description of the Mixtral model brief.
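As referenced above, here is a minimal sketch of a feed-forward (multilayer perceptron) module together with a toy top-k router, written in plain PyTorch. The dimensions, the SiLU activation, and the class and parameter names are illustrative assumptions for readability, not Mixtral's exact implementation.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    # A standard transformer feed-forward block: expand, apply nonlinearity, project back.
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, emb_dim),
        )

    def forward(self, x):
        return self.layers(x)

class SparseMoE(nn.Module):
    # Replaces a single feed-forward block with several "expert" blocks plus a router
    # (gating network) that sends each token to its top-k experts only.
    def __init__(self, emb_dim, hidden_dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [FeedForward(emb_dim, hidden_dim) for _ in range(num_experts)]
        )
        self.router = nn.Linear(emb_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, emb_dim)
        scores = self.router(x)                  # (num_tokens, num_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The key point of the sketch is that only the experts selected by the router run for a given token, which is what makes the mixture "sparse".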
However, you can find additional details in my previous article, Model Merging, Mixtures of Experts, and Towards Smaller LLMs .

At the beginning of the year, I would have thought that open-weight MoE models would be more popular and widely used than they are today. While they are not irrelevant, many state-of-the-art models still rely on dense (traditional) LLMs rather than MoEs, e.g., Llama 3, Qwen 2.5, Gemma 2, etc. However, it is, of course, impossible to say what proprietary architectures like GPT-4, Gemini, and Claude are based on; they might as well be using MoE under the hood. In any case, MoE architectures are still relevant, especially as they offer a way to scale large language models efficiently by activating only a subset of the model's parameters for each input, thus reducing computation costs without sacrificing model capacity. By the way, after writing this article, there was a nice surprise release of the very well-performing DeepSeek-V3 model in December , which uses a MoE architecture. So, yes, MoEs continue to be very relevant!

If you are finetuning open-weight LLMs, chances are high that you have been using low-rank adaptation (LoRA), a method for parameter-efficient LLM finetuning, at some point. If you are new to LoRA, I have written a previous article on Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation) that you might find helpful, and I have a from-scratch code implementation in Appendix D of my Build A Large Language Model (From Scratch) book. Since LoRA is such a popular and widely used method, and since I had so much fun implementing and playing with a newer variant, my pick for February is DoRA: Weight-Decomposed Low-Rank Adaptation (February 2024) by Liu and colleagues.

Before introducing DoRA, here’s a quick LoRA refresher: Full finetuning updates each large weight matrix W in an LLM by computing a large weight update matrix ΔW . LoRA approximates ΔW as the product of two smaller matrices A and B . So, instead of W + ΔW , we have W + AB . This greatly reduces computational and memory overhead. The figure below illustrates these formulas for full finetuning (left) and LoRA (right) side by side.

An illustration of regular finetuning (left) and LoRA finetuning (right).

In DoRA: Weight-Decomposed Low-Rank Adaptation (February 2024), Liu and colleagues extend LoRA by first decomposing a pretrained weight matrix into two parts: a magnitude vector m and a directional matrix V . This decomposition is rooted in the idea that any vector can be represented by its length (magnitude) and direction (orientation), and here we apply it to each column vector of a weight matrix. Once we have m and V , DoRA applies LoRA-style low-rank updates only to the directional matrix V , while allowing the magnitude vector m to be trained separately.

Annotated illustration from the DoRA paper (https://arxiv.org/abs/2402.09353)

This two-step approach gives DoRA more flexibility than standard LoRA. Rather than uniformly scaling both magnitude and direction as LoRA tends to do, DoRA can make subtle directional adjustments without necessarily increasing the magnitude. The result is improved performance and robustness, as DoRA can outperform LoRA even when using fewer parameters and is less sensitive to the choice of rank.
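To make the W + AB idea above concrete, here is a minimal LoRA layer sketch in PyTorch. The rank, scaling, and initialization choices below are common defaults and illustrative assumptions rather than the exact recipe from the LoRA or DoRA papers.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen pretrained linear layer (W) and adds a trainable low-rank update AB.
    def __init__(self, pretrained_linear, rank=8, alpha=16):
        super().__init__()
        self.pretrained = pretrained_linear
        self.pretrained.weight.requires_grad = False       # W stays frozen
        if self.pretrained.bias is not None:
            self.pretrained.bias.requires_grad = False
        in_dim = pretrained_linear.in_features
        out_dim = pretrained_linear.out_features
        self.A = nn.Parameter(torch.empty(in_dim, rank))
        self.B = nn.Parameter(torch.zeros(rank, out_dim))   # zeros, so training starts at W
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x):
        # Original path (using W) plus the low-rank update x A B, scaled by alpha / rank.
        return self.pretrained(x) + self.scaling * (x @ self.A @ self.B)
```

DoRA then goes one step further by splitting the pretrained weight into a per-column magnitude and a direction, applying the low-rank update only to the direction while training the magnitude separately, as described in the prose above.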
Again, I am keeping this section brief since there are 10 more to go, but if you are interested in additional details, I dedicated a whole article to this method earlier this year: Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch .

DoRA is a small, logical improvement over the original LoRA method. While it hasn’t been widely adopted yet, it adds minimal complexity and is worth considering the next time you finetune an LLM. In general, I expect LoRA and similar methods to remain popular. For example, Apple recently mentioned in their Apple Intelligence Foundation Language Models paper that they use LoRA for on-device task specialization of LLMs.

Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

As far as I can tell, instruction-finetuning is the most popular form of finetuning by LLM practitioners. The goal here is to get openly available LLMs to better follow instructions or to specialize these LLMs on subsets of instructions or on new instructions. However, when it comes to taking in new knowledge, continued pretraining (sometimes also referred to as continual pretraining) is the way to go. In this section, I want to briefly summarize the refreshingly straightforward Simple and Scalable Strategies to Continually Pre-train Large Language Models (March 2024) paper by Ibrahim and colleagues.

This 24-page Continually Pre-train Large Language Models paper reports a large number of experiments and comes with countless figures, which is very thorough by today's standards. What were the main tips for applying continued pretraining successfully?

1. Simple re-warming and re-decaying of the learning rate.

2. Adding a small portion (e.g., 5%) of the original pretraining data to the new dataset to prevent catastrophic forgetting. Note that smaller fractions like 0.5% and 1% were also effective.

To be a bit more concrete regarding point 1, re-warming and re-decaying, this means we employ the exact same learning rate schedule that was used during the initial pretraining stage of an LLM, as shown in the figure below (a small code sketch of such a schedule follows at the end of this section).

A schedule for continued pretraining. Figure based on Build a Large Language Model From Scratch, https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-D/01_main-chapter-code/appendix-D.ipynb

As far as I know, the re-warming and re-decaying, as well as adding original pretraining data to the new data, is more or less common knowledge. However, I really appreciate that the researchers took the time to formally test this method in this very detailed 24-page report. If you are interested in additional details, I discussed this paper more thoroughly in my previous Tips for LLM Pretraining and Evaluating Reward Models article .

I have no reason to believe that these methods will not continue to work for future LLMs. However, it is important to note that pretraining pipelines have become more sophisticated in recent months, consisting of multiple stages, including short- and long-context pretraining. (I’ve written more about it in New LLM Pre-training and Post-training Paradigms ). So, for optimal results, the recipes suggested in this paper may need to be tweaked under certain circumstances.
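As referenced above for point 1, here is a minimal sketch of a warmup-plus-decay learning rate schedule. The linear warmup, the cosine decay, and the specific step counts and learning rate values are illustrative assumptions, not the exact schedule used in the paper.

```python
import math

def lr_at_step(step, max_lr=1e-4, min_lr=1e-5, warmup_steps=1000, total_steps=100_000):
    # Linear re-warming from min_lr up to max_lr ...
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # ... followed by cosine re-decay back down to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# For continued pretraining on new data, the idea is to apply this same warmup/decay
# shape again, rather than continuing at the final (low) learning rate of the
# original pretraining run.
```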
April is a tough choice. For instance, Kolmogorov-Arnold Networks made a big wave that month. But as far as I can tell, the excitement fizzled out pretty quickly. This is likely because their theoretical guarantees are difficult to implement practically, they lack competitive results or benchmarks, and other architectures are much more scalable. So, instead, my pick for April goes to a more practical paper: Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (April 2024) by Xu and colleagues.

Before summarizing the paper itself, here's an overview of Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), both popular methods for aligning LLMs via Reinforcement Learning with Human Feedback (RLHF). RLHF is the method of choice to align LLMs with human preferences, improving the quality but also the safety of their responses.

The typical (simplified) LLM training lifecycle.

Traditionally, RLHF-PPO has been a crucial step in training LLMs for models and platforms like InstructGPT and ChatGPT. However, DPO started gaining traction last year due to its simplicity and effectiveness. In contrast to RLHF-PPO, DPO does not require a separate reward model. Instead, it directly updates the LLM using a classification-like objective (a small sketch of this objective follows at the end of this section). Many LLMs now utilize DPO, although comprehensive comparisons with PPO are lacking. Below are two resources on RLHF and DPO I developed and shared earlier this year:

LLM Training: RLHF and Its Alternatives

Direct Preference Optimization (DPO) for LLM Alignment (From Scratch)

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study is a well-written paper with numerous experiments and results. The key conclusions are that PPO tends to outperform DPO, and that DPO is inferior when dealing with out-of-distribution data. Here, out-of-distribution data means the language model was previously trained on instruction data (via supervised finetuning) that differs from the preference data used for DPO. For instance, a model might be trained on the general Alpaca dataset before undergoing DPO finetuning on a different preference-labeled dataset. (However, one way to improve DPO on such out-of-distribution data is to first conduct a supervised instruction-finetuning step using the preference dataset, and then perform DPO finetuning.) The main findings are summarized in the figure below.

Annotated table from the Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (https://arxiv.org/abs/2404.10719) paper.

4.3 How are PPO and DPO used today?

PPO might have a slight edge when it comes to the raw modeling performance of the resulting LLM. However, DPO is much easier to implement and computationally more efficient to apply (you don't have to train and use a separate reward model, after all). Hence, to the best of my knowledge, DPO is also much more widely used in practice than RLHF-PPO. One interesting example is Meta AI's Llama models. While Llama 2 was trained with RLHF-PPO, the newer Llama 3 models used DPO. Interestingly, recent models even use both PPO and DPO nowadays. Recent examples include Apple's Foundation Models and Allen AI's Tulu 3 .
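As referenced above, here is a minimal sketch of the DPO objective. It assumes you already have the summed log-probabilities (as tensors) of the chosen and rejected responses under both the trainable policy model and the frozen reference model; the beta value shown is just a common illustrative choice.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of the policy vs. the frozen reference model
    # for the preferred (chosen) and dispreferred (rejected) responses.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Classification-like objective: increase the margin between the two log-ratios.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```

The appeal in practice is that this is just a supervised-style loss over preference pairs, with the frozen reference model standing in for the explicit reward model and KL penalty used in RLHF-PPO.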
I found another LoRA paper this year particularly interesting (this is the last LoRA paper in this 12-paper selection, I promise!). I wouldn't call it groundbreaking, but I really like it since it formalizes some of the common knowledge around finetuning LLMs with (and without) LoRA: LoRA Learns Less and Forgets Less (May 2024) by Biderman and colleagues.

LoRA Learns Less and Forgets Less is an empirical study comparing low-rank adaptation (LoRA) to full finetuning on large language models (LLMs), focusing on two domains (programming and mathematics) and two tasks (instruction finetuning and continued pretraining). Check out the February section above if you'd like a refresher on LoRA before proceeding.

The LoRA Learns Less and Forgets Less study shows LoRA learns noticeably less than full finetuning, especially in tasks like coding, where new knowledge needs to be acquired. The gap is smaller when only instruction finetuning is performed. This suggests that pretraining on new data (learning new knowledge) benefits more from full finetuning than converting a pretrained model into an instruction follower does.

Full finetuning vs LoRA. The performance is measured on HumanEval, which is a dataset consisting of 164 coding challenges. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673 .

There are some more nuances, though. For math tasks, for example, the difference between LoRA and full finetuning shrinks. This may be because math problems are more familiar to the LLM, and it likely encountered similar problems during pretraining. In contrast, coding involves a more distinct domain, requiring more new knowledge. Thus, the farther a new task is from the model’s pretraining data, the more beneficial full finetuning becomes in terms of learning capacity.

When examining how much previously acquired knowledge is lost, LoRA consistently forgets less. This is particularly clear when adapting to data far from the source domain (e.g., coding). With coding tasks, full finetuning leads to significant forgetting, while LoRA preserves more original capabilities. In math, where the model’s original knowledge was already closer to the new task, the difference is less pronounced.

Full finetuning vs LoRA on the original source tasks after training on programming data. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673 .

Overall, there is a trade-off: full finetuning is better for absorbing new knowledge from more distant domains but leads to more forgetting of previously learned tasks. LoRA, by changing fewer parameters, learns less new information but retains more of the original capabilities.

The study primarily compares LoRA to full finetuning. In practice, LoRA has gained popularity because it is far more resource-efficient than full finetuning. In many cases, full finetuning is simply not feasible due to hardware constraints. Moreover, if you only need to address specialized applications, LoRA alone may be sufficient. Since LoRA adapters can be stored separately from the base LLM, it's easy to preserve the original capabilities while adding new ones. Additionally, it's possible to combine both methods by using full finetuning for knowledge updates and LoRA for subsequent specialization. In short, I think both methods will continue to be very relevant in the upcoming year(s). It's more about using the right approach for the task at hand.

Ahead of AI is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (June 2024) paper by Penedo and colleagues describes the creation of a 15 trillion token dataset for LLMs and makes it publicly available, including a link to download the dataset and a code repository ( datatrove/examples/fineweb.py ) to reproduce the dataset preparation steps.

Since several other large datasets for LLM pretraining are available, what's so special about this one? Other datasets are comparatively small: RefinedWeb (500B tokens), C4 (172B tokens), the Common Crawl-based part of Dolma 1.6 (3T tokens) and 1.7 (1.2T tokens), The Pile (340B tokens), SlimPajama (627B tokens), the deduplicated variant of RedPajama (20T tokens), the English CommonCrawl section of Matrix (1.3T tokens), English CC-100 (70B tokens), Colossal-OSCAR (850B tokens). For example, ~360 billion tokens are only suited for small LLMs (for instance, 1.7B, according to the Chinchilla scaling laws ). On the other hand, the 15 trillion tokens in the FineWeb dataset should be optimal for models up to 500 billion parameters according to the Chinchilla scaling laws. (Note that RedPajama contains 20 trillion tokens, but the researchers found that models trained on RedPajama result in poorer quality than FineWeb due to the different filtering rules applied.)

Illustration of the dataset sizes used to pretrain LLMs over the years. Note that this is simply a general reference and is not directly related to the FineWeb paper or the Chinchilla scaling laws paper.

In short, the FineWeb dataset (English-only) makes it theoretically possible for researchers and practitioners to train large-scale LLMs. (Side note: The Llama 3 models with 8B, 70B, and 405B sizes were trained on 15 trillion tokens as well, but Meta AI's training dataset is not publicly available.)

In addition, the paper contains principled ablation studies and insights into how the filtering rules were developed and applied to arrive at the FineWeb dataset (starting from the CommonCrawl web corpus). Concretely, for each filtering rule they tried, they took a 360 billion token random sample from the original and the filtered data and then trained a small 1.71 billion parameter Llama-like model to see whether the filtering rule is beneficial or not based on the models' performances on standard benchmarks such as HellaSwag, ARC, MMLU, and others.

6.3 The relevance of FineWeb today

Overall, while pretraining multi-billion parameter LLMs may still be beyond the reach of most research labs and companies, this dataset is a substantial step toward democratizing the study and development of LLMs. In summary, this paper represents a commendable effort and introduces a valuable public resource for advancing pretraining in LLMs.

I hope you found the research summaries useful! Since I am still recovering from my injury, and since it would have been an excessively long article anyway, I decided to split this year's review article into two parts. The second (July to December) part is actually even more exciting (for me personally), as I am discussing the more recent papers on scaling laws, reproducing O1, and the role of synthetic data in LLM training. In addition, I will also share my thoughts for 2025 and what I expect to be on the horizon. Stay tuned!

This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book .
(I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.)

Build a Large Language Model (From Scratch) now available on Amazon

If you read the book and have a few minutes to spare, I'd really appreciate a brief review . It helps us authors a lot! Your support means a great deal! Thank you!

Lil'Log 10 months ago

Reward Hacking in Reinforcement Learning

Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect, and it is fundamentally challenging to accurately specify a reward function. With the rise of language models generalizing to a broad spectrum of tasks and RLHF becoming a de facto method for alignment training, reward hacking in RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a user’s preference, are pretty concerning and are likely one of the major blockers for real-world deployment of more autonomous use cases of AI models.
