Posts in Machine-learning (20 found)
Sean Goedecke 2 days ago

Thinking Machines and interaction models

Thinking Machines just released Interaction Models . This is their first real AI model release 1 after a year of work and two billion dollars of capital. What is an “interaction model”? First, it’s not a frontier model . Thinking Machines is not yet competing with OpenAI, Anthropic and Google. Instead, they’re working on the problem of better real-time interaction with models. Some parts of what they’re doing are not new at all, other parts are slightly-questionable benchmark gaming, and still other parts represent a genuine technological advancement. I’ll try to lay it all out. If you’ve used ChatGPT in audio mode, you know that you can’t talk to it exactly how you’d talk to a human. There’s a big latency gap between when you finish talking and when the model jumps in. The model won’t interrupt you like a human, and doesn’t react to you interrupting it like a human would either. And of course you can’t give the model visual feedback like facial expressions. That’s because ChatGPT is either speaking or listening at any given time . When you’re talking, it’s in “listening” mode; when it’s talking, it’s in “speaking” mode, and isn’t absorbing any information from you. It relies on VAD (“voice activity detection”) to figure out if you’re talking. The alternative (and what “interaction models” do) is a fully-duplex system, where the model is constantly both in listening and speaking mode at the same time. Of course, the model can’t literally do this. Like all language models, it’s either doing prefill (ingesting prompt tokens) or decode (producing completion tokens). But what fully-duplex models can do is switch from listening to speaking mode in tiny chunks, called “micro-turns”. Instead of listening for ten seconds (or however long it takes you to stop talking), then speaking for ten seconds (or however long it takes to pass the model output through TTS), the model can listen for 200ms, then output for 200ms, then listen for 200ms, and so on. While the user is speaking, the model will know to output silence - most of the time. But if it decides it’s good to interrupt you or speak at the same time as you, it’s capable of doing that. So far, so unoriginal. There are plenty of examples of fully duplex audio systems that the Thinking Machines blog post already cites: Moshi , PersonaPlex , Nemotron-VoiceChat , and so on. But at least this outlines the space that “interaction models” are playing in: not “superintelligence from a frontier model”, but “better real-time conversational interaction” 2 . Given that, what is Thinking Machines doing that’s new? For existing fully-duplex models, you talk to the model itself. That’s a fairly big problem, since fully-duplex models have to be fast: fast enough that they can operate in tiny 200ms turns 3 . A model that fast cannot be particularly intelligent. Thinking Machines’ solution is to introduce an actual smart model - any regular language model will do here - in the background that the interaction model can delegate tasks to. In practice this is probably implemented as a tool call. The interaction model keeps chatting while the smart model works away, and then the smart model output is directly integrated into the interaction model’s context in the same way as audio and video input (a genuinely cool idea, I think). This is kind of neat, though it remains to be seen how well it works in practice. Will the model do a lot of “oh wait, the last thing I said was dumb, never mind” self-correction as the smarter model output trickles in? Will the fast interaction model be smart enough to delegate the right tasks at the right time? In general, the “start with a fast dumb model and have it hand off tasks” approach has been tricky for the AI labs to get right for a variety of reasons. If I’m being uncharitable, I might say that bolting on a strong reasoning model was an easy way for Thinking Machines to post impressive values for competitive benchmarks like FD-bench V3 (where they barely beat GPT-realtime-2.0) and BigBench Audio (where introducing the reasoning model bumps their score from 76% to 96%, only 0.1% below GPT-realtime-2.0). If I’m being charitable, I might say that a model fast enough for realtime conversation will have to have some way to punt hard tasks to a slower, smarter model. Both of those things are probably true. It’s also worth noting that Thinking Machines have also bolted on video input to their fully-duplex model. This is more exciting than it sounds, because face-to-face human conversation is very dependent on being able to read human expressions. In theory, this could unlock the ability to have genuine human-like conversations. The other reason why this is exciting is that it means Thinking Machines have been able to make a pretty big fully-duplex model (maybe twice the size of Moshi in terms of active parameters, and 40x the size in terms of total parameters). In fact, this is probably the biggest real technical achievement here. Other fully-duplex models are already doing micro-turns and interruptions, and could delegate reasoning fairly easily if they wanted to, but they aren’t doing video because they can’t . Being able to make a fully-duplex model the size of DeepSeek V4-Flash is pretty impressive. Much of the Thinking Machines blog post is dedicated to explaining how they’ve managed to do this: ingesting data in a more lightweight way, optimizing their inference libraries for tiny prefill/decode chunks, various decisions to make inference deterministic (a long-held hobbyhorse for Thinking Machines). There’s a lot of pressure on Thinking Machines to produce a genuine AI advancement. It doesn’t seem like they’re willing or able to compete in the frontier-model space (which makes sense, I wouldn’t want to either). Given that, I can see why they’re highlighting the parts of interaction models that are impressive to laypeople - all the fully-duplex interaction stuff - even though those parts are not truly innovative. So what are Interaction Models? A scaled-up, multimodal version of existing fully-duplex models like Moshi, with a real model bolted on for extra intelligence (and maybe better benchmarks). The scale and video parts are new and cool, and something like the overall approach has to be right. In general, I’m glad that we’ve got well-funded and high-profile AI labs tackling problems other than “build a smarter frontier model”. I think there’s a lot of low-hanging fruit waiting to be picked in other areas of AI research. People do seem to really like Tinker , which is their tooling for researchers who want to fine-tune models, but it’s not exactly the hot new frontier model that people were expecting. I think it’s at least a little shady that the Interaction Models video demo is making a big deal about some features (like real-time simultaneous translation) that are just features of fully-duplex audio models, not anything specific to their system. Even 200ms is a bit long. You can see from the demo that there’s an uncomfortable half-second lag sometimes as the model finishes its prefill slice and has to move to the decode slice. People do seem to really like Tinker , which is their tooling for researchers who want to fine-tune models, but it’s not exactly the hot new frontier model that people were expecting. ↩ I think it’s at least a little shady that the Interaction Models video demo is making a big deal about some features (like real-time simultaneous translation) that are just features of fully-duplex audio models, not anything specific to their system. ↩ Even 200ms is a bit long. You can see from the demo that there’s an uncomfortable half-second lag sometimes as the model finishes its prefill slice and has to move to the decode slice. ↩

0 views
wh. 4 days ago

SFT, RL, and On-Policy Distillation Through a Distributional Lens

I have been thinking about post-training methods in terms of distributions. A language model is a distribution over sequences. When we post-train it and attempt to teach it a task, we are reshaping this distribution. Different post-training methods differ in how they reshape this distribution, what they treat as the target and how directly they define this target. This is neither a very precise statement nor is it meant to be fully rigorous.

0 views
Sean Goedecke 1 weeks ago

Why hasn't longer-horizon training slowed AI progress?

Dwarkesh Patel 1 recently posted an award for the best answers to four key questions about AI. It’s partly a challenge and partly a job interview, since some of the winners will get offered a role as a “research collaborator”. I don’t want the job, but I do want to write down my answer to his first question: why hasn’t AI progress slowed down more? There are a few reasons we might think AI progress would slow down. The particular reason Dwarkesh is interested in goes like this. Training a model (specifically reinforcement learning) requires the model to perform a task and then get “graded” on the output. As models get more powerful and tasks become harder, they take longer and require more FLOPs 2 to complete, and thus more FLOPs to train: thus training harder models will take longer. But intuitively, AI progress hasn’t slowed down that much. The famous METR horizon-length graph shows that AI systems are capable of more and more complex tasks over time, and that this process is accelerating, not slowing down. Why would that be? Firstly, it might just be the case that newer models are benefiting from orders of magnitude more FLOPs . Of course, AI labs aren’t standing up orders of magnitude more GPUs (they’re trying, but there are hard physical limits on how fast you can scale up a physical datacenter). But it’s certainly possible that they’re learning to use their existing FLOPs orders of magnitude more efficiently. The efficiency of complex software systems - and the training code for a frontier AI model certainly qualifies - is not typically determined by the number of genius ideas in it. It is determined by the number of boneheaded mistakes. Take this story 3 of how the initial GPT-4 training run used FP16 when summing many small values, which will completely mess up your results if the sum of those values is large. How much training-efficiency-per-FLOP does solving bugs like that buy? Plausibly enough to outweigh any inherent lack of efficiency from training more powerful models. Secondly, intuitions about the speed of AI progress are weird and unreliable . Humans measure AI progress - and intelligence in general - on a really uneven scale. It’s easy to tell when an AI (or a person) is less smart than you, because you can just see them making mistakes. It’s very hard to tell if they’re smarter, because in that case you’re the one making mistakes. You have to rely on more subtle context clues: do they get better long-term results than you, or do they often confuse you in situations where you later end up agreeing with them, and so on. The jump from GPT-3 to GPT-4 seemed huge because GPT-4 was dumber than almost all humans, and GPT-4 was sometimes as smart as a human. However, frontier models are now smart enough to be in the realm of ambiguity on many topics. It’s thus much harder to tell the “real” rate at which they’re getting smarter. Maybe the rate of growth of “raw intelligence” really has slowed down! I don’t know how we’d be in a position to know for sure. Thirdly, many traits other than intelligence determine the capabilities of AI models . Take the jump in October last year where OpenAI and Anthropic models were suddenly “agentic” (i.e. they could reliably perform complex tasks end-to-end). That might be intelligence, but it might also just be a greater working memory, or more rote familiarity with the basic tools of a LLM harness, or more ability to attend to the context window, or even simply a personality more suited to tools like Claude Code or Codex. Of course, all of these traits are plausibly “intelligence”. But they’re traits you might instil by various clever tricks (or even just tweaking the system prompt), not by brute-forcing more FLOPs. It’s illustrative here to consider the mistake made by Apple’s infamous The Illusion of Thinking paper, where the researchers asked various models to brute-force solve Tower of Hanoi puzzles with different numbers of disks, using the results to score how good at reasoning the models were. But of course when you read the output, all of the failures were cases of the model realizing that many hundreds of steps were required, and refusing to even try. These same models could trivially write code to perform the steps, or correctly go through any smaller subset of the steps. The problem wasn’t intelligence, it was persistence : these models lacked the willingness to dig in and keep powering through steps until they got to an answer 5 . Even inside an AI lab, I don’t think anyone has a good understanding of how many “real” FLOPs are being thrown at a training run (not counting FLOPs that are wasted on bugs). We also don’t have a clear sense of whether AI progress really is slowing down or not. Mythos seems impressive, and coding agents are really good now, but once the models get close to human intelligence it becomes really tricky to monitor. Finally, almost everyone judges intelligence by capabilities, but capabilities are produced by a constellation of many traits (intelligence is just one of them). I think this stuff is really complicated. A general theory like “RL takes more flops-per-reward as tasks get longer, therefore training will gradually slow down” sounds good, but in practice AI development is dominated by lightning strikes: silly bugs that make training a hundred times worse, clever ideas that make models a hundred times more useful, and spiky capabilities that can produce dazzling results in some areas but zero improvement in others. We are still very early . If you’re reading this you probably know who Dwarkesh is, but if you don’t: he’s a well-known tech-adjacent podcaster whose gimmick is that he actually does extensive research before each guest and asks specific technical questions. A FLOP is a floating-point operation, i.e. a matrix multiplication, i.e. “time on a GPU”. I saw this in a tweet and only realized that the source was Dwarkesh when I was researching for this post. What if AI progress stalls for technical reasons, and everyone gives up on training new models? In that world, open source models will eventually catch up, and AI labs won’t be in a privileged position. Incidentally, this is my pet theory about why models got much better at agentic tasks last year: training on longer and longer agentic traces meant that models started to “believe they could do it”, and made them much less likely to just give up and take shortcuts or refuse to continue. If you’re reading this you probably know who Dwarkesh is, but if you don’t: he’s a well-known tech-adjacent podcaster whose gimmick is that he actually does extensive research before each guest and asks specific technical questions. ↩ A FLOP is a floating-point operation, i.e. a matrix multiplication, i.e. “time on a GPU”. ↩ I saw this in a tweet and only realized that the source was Dwarkesh when I was researching for this post. ↩ What if AI progress stalls for technical reasons, and everyone gives up on training new models? In that world, open source models will eventually catch up, and AI labs won’t be in a privileged position. ↩ Incidentally, this is my pet theory about why models got much better at agentic tasks last year: training on longer and longer agentic traces meant that models started to “believe they could do it”, and made them much less likely to just give up and take shortcuts or refuse to continue. ↩

0 views
Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32m -- Interventions: conclusion

Last November, when I finished the main body of " Build a Large Language Model (from Scratch) ", I set myself a number of follow-on goals . One was "training the full GPT-2 base model myself". I've reached the end of that journey, with a model that is almost -- if not quite -- as good as GPT-2 small, trained in 44 hours on my own machine, so I thought it would be worth summarising how it went. In December, I trained my first model , taking two days, but was disappointed to see that it was worse in terms of loss, and in terms of how well it could be fine-tuned to follow instructions, than the original GPT-2 model. I expected that a chunk of that difference was likely to be due to the original model having been trained for longer, but also noticed that there were a number of changes -- interventions -- that I could make to the model and the training run, and I thought they might help. In January, I got a DDP training system together that would allow me to iterate on those interventions without having to wait for two days for each result. In February, I got started by training a baseline model in the cloud , and I've since ground through all of the interventions, and come up with a set that lowered the loss nicely, both in the cloud , and locally . Along the way, I've learned about, or refined my knowledge of, a bunch of ML concepts. In increasing order of how they helped with the loss (with the first two actually making it slightly worse): I've also learned how to upload my custom models to Hugging Face , found out some interesting things about how random noise affects training , and come up with improvements in the setup I have for using an LLM as a judge for instruction fine-tuned models . There was a bit of a mystery when I tried out the instruction fine-tuning tests, though. Although two of my models were very close to GPT-2 small in terms of loss, I found that while one of them had an instruction fine-tuning result that was likewise close to GPT-2 small, the other was much worse! A mystery to dig into later, I think. But it was still very satisfying that my best model -- trained locally in 44 hours -- was almost as good as GPT-2 small, even if it did fall somewhat short. So on that positive note, I'm going to wrap up this "Interventions" series-within-a-series, and move on to the two other things I wanted to do before wrapping up the "LLM from scratch" series as a whole: The appendices first, I think -- I'll post about them shortly. But I think the big one will be the JAX implementation -- really looking forward to that. Weight tying , which I found made the loss worse, but it was interesting how simple it was to implement. PyTorch's Automated Mixed Precision , which also harmed the loss a tiny bit, but had the benefit of making training twice as fast, and 66% cheaper in the cloud -- well worth the loss penalty. Gradient clipping -- a cheap, but (somewhat to my surprise) not particularly effective intervention for this model. QKV bias -- that is, adding bias to the attention weight matrices -- which also helped a tiny bit, though I later felt that this might have been in the noise. Weight decay -- more effective, and something that's simple enough to understand with simple gradient descent. I still need to learn more about it in the context of optimisers, though -- particularly with AdamW. Dropout , which seems to be less than useful for single-epoch training: removing it helped the model quite a lot. The learning rate , which I built up quite a lot of new knowledge about, and by both increasing it and scheduling it, I got the biggest bang for the buck. Going through the appendices in the book to see if there's anything I want to highlight there. The final test as to whether I've really understood everything: building my own LLM from scratch without reference to the book. I want to do that in a different framework, not PyTorch, to minimise the risk of just regurgitating code -- I asked people on X/Twitter which one I should use, and the winner was JAX -- so it should be interesting to see how that goes!

0 views
Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32l -- Interventions: updated instruction fine-tuning results

I've been working on a GPT-2-small-style LLM based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", and have tried a bunch of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured in terms of loss on a held-back test dataset. After working through them, in my last post , I managed to train one that was almost (if not quite) there. Now, back before I started digging into these interventions, I was doing three evals for each model I built; a smoke test (to see if it could give a coherent completion to "Every effort moves you"), a test for that test set loss, and an instruction-following test that fine-tuned the model on the Alpaca dataset, got it to generate results for a test set of instructions, and then used an LLM as a judge to score them. The idea behind this was that the loss on the test set was an interesting technical measure of the quality of a model, but it didn't really tell us much about how useful it might be in reality. Unfortunately, in January, I realised that my methodology was bad ; because I was asking the LLM to score a model in isolation, the LLM's natural randomness would mean that results were not really comparable, at least for models that were reasonably close in quality. For example, if two models both replied to ...then one run of the instruction-following test might "find the judge LLM in a good mood" and get, say, 5% -- after all, the model tried to answer, and actually used a real person's name, even if the answer was totally wrong. But in another run, the judge might be in a "worse mood" and score it at 0%. My fix was to have two scripts: The details are here . Because doing it that way was significantly more work, I've not been doing these tests as part of the interventions mini-series. I felt it would make more sense to wait until I'd tried a bunch of interventions and got a number of models to try. Now I have those, so let's give it a go! At the end of the previous round of IFT tests, I had this table. It's sorted by the loss on the test set (shown to 3 decimal places), and has the score that the model got from an instruction fine-tuning run: There's a loose correlation where lower loss means a higher IFT score, with two weird exceptions: the two FineWeb-Edu training runs, where they got much higher results than you'd expect from the loss. My working hypothesis was that there were two components that led to a model getting a good score: So in those terms, the OpenAI models and Cloud FineWeb, 8x A100 40 GiB might be smart but not know very much, and the FineWeb-Edu ones might be dumb but knowledgeable. The ones in between, by contrast, could be relatively dumb too, but also not know very much. There was one other oddity: the Cloud FineWeb, 8x A100 40 GiB model seemed surprisingly good on the IFT results when considering its loss -- but perhaps there was some kind of step function, where as soon as a model got better than (say) 3.7 on the loss, it suddenly became smart in whatever way mattered. All very hand-wavy, of course, but it was a hypothesis of sorts. Would the new models fit that pattern? It was time to find out. I didn't think it was worth adding all 14 models that I've trained in my intervention-testing to that table, so I decided to just add four of them: Now, I already had files containing responses from fine-tuned versions of the other models, so I just needed to run the first of my two fine-tuning scripts against all four of the new models. I did that, and then also tweaked the judge script so that instead of using GPT-5.1, it used GPT-5.4. If you run the script multiple times, each time will normally give you different scores anyway; hopefully the ranking will remain roughly the same. So given that I was going to have to re-run the script to get new aggregate results, and those would not really be comparable to the original ones anyway, this seemed like a reasonable price to pay for (hopefully) a smarter judge. I ran that once, and got some results that surprised me -- so much that I decided to do three runs and see if the results stood up. They did; here's the new table, with scores for each run, the average, and the rank that each one got based on the average. You can see that relative rankings are fairly consistent across the IFT runs. But while in general the lower-loss runs get better IFT results, now there are even more exceptions to that trend than there were before. Let's look down the "IFT rank" column, which is based on the IFT average: That's a really odd situation. If the training runs using gradient accumulation rather than DDP had been consistently worse -- or vice versa -- then we could imagine some kind of connection. But in the first case, GA beat DDP, but in the second, it was the other way around. Apart from that, we do still see that the two FineWeb-Edu models are doing much better than the others. And the remaining models are all pretty close together, both in terms of loss and in terms of their ranking, apart from the Local FineWeb train, which is bad in both. It is, however, interesting that Local FineWeb-Edu extended train, which was trained on twice as much data as Local FineWeb-Edu train, is consistently worse in terms of the IFT numbers, though. That wasn't the case in my tests previously. All of this puzzled me. The "lots of knowledge makes a model better at this" idea seemed to be weakened by the relative ranks of the two FineWeb-Edu models (after all, if it was true, you'd expect the model trained on more data to be consistently better). And the "smart, low-loss models are better" side seemed to be contradicted by and 's bad results. What might be going on here? Looking at the training code, one thing stood out to me. The process was: In practice, the early-exit code always cut in pretty quickly. I'd noticed that during my original generation of the results for the new models: I decided to regenerate responses for all of the models, and then run the new responses past the LLM judge again. But this time I would keep a record of how many epochs of training we got before the exit: It was getting even harder to see any useful pattern! One thing that did stand out, though, was that the still oddly-high Cloud FineWeb, 8x A100 40 GiB model was being instruction-trained for seven epochs. It was also rather noticeable that the two FineWeb-Edu models had the same "advantage", if that's what it was. But the Local FineWeb train had seven epochs too, and got a poor score, the OpenAI models only got two each, and led the pack, and got a pretty poor result given its six epochs of training. Still, what would happen if we got rid of that confounder? I did yet another set of runs; this time, I changed the fine-tuning/generation script to always do four epochs -- no early exit. I chose four because it was the modal number in the previous trains -- no strong reason for it beyond that. Here's what came out at the end: Still no obvious pattern. What if we try seven epochs of training for all of them, so that they all get as much "benefit" (if that's what it is) as the FineWeb-Edu models? Just as confused as ever... Here's a table with all of the ranks we got from these tests: It's hard to draw much sense out of this, but a few things are clear: On the one hand, training different models for different numbers of epochs feels wrong for an evaluation like this, as they're being "treated differently". On the other hand, if it's meant to be a good evaluation of model usefulness in the real world, then individual models would be fine-tuned for different amounts of time, depending on validation loss. So perhaps it is better? But the differing results are still quite a puzzle. I figured that a modern AI could easily build me a data exploration interface, specifically for the original results and seven-epoch ones, so I asked Claude and got this rather nice one . After poring over that, though, I couldn't find a smoking gun -- for example, some kind of systematic error that was always making that pulled its score down. I think that the best -- albeit hand-wavy and incomplete -- mental model that I have right now is something like this. If we consider the loss landscape that these models are all in, they've all been trained to try to get to a place with as low loss as we could manage. When we do the instruction fine-tune on them, we're changing the landscape -- the objective of "be better at following instructions" is different to "be better at minimising loss". Now, those two landscapes could be completely different! You can imagine a task that we might set instead of instruction-following that could be completely uncorrelated with loss minimisation, or even inversely correlated. But instruction-following is relatively close; it at least shares features like "generate coherent text". So when we do the instruction fine-tuning, what we're trying to do is to move from the place where the model ended up after its pre-training, to a place where performance on the new goal -- instruction-following -- is best. Here's where I'm going to get more than a bit hand-wavy. You can easily imagine that some places where the loss was low, there might be downhill slopes pointing towards good locations in the new instruction-following landscape. With instruction fine-tuning, you'd be able to get a good IFT model. But other places with low loss might not have that advantage; maybe they're at or near a poor "local minimum" in the IFT landscape -- that is, a place where there is no downhill route to a better place. So simple fine-tuning like this might never get a good result! With this mindset, we might say that the OpenAI weights are pretty well-positioned, not just in the loss landscape but also in the IFT landscape. The FineWeb-Edu models happened to get lucky, and wind up in a place that (despite having poor loss), is well-positioned for the IFT objective. And by contrast, and were just unlucky: they got to a place where the loss landscape was not well-correlated with the IFT landscape. This seems plausible enough for me to use it as my working model for now, and see if I can work out some way to test it. Keeping track of the validation loss during the instruction fine-tuning process would certainly be a good start; unfortunately I only realised that after doing all of the tests above, and re-doing them would be quite a lot of work. One final thing is worth repeating. Our two "unlucky" models, and , each had a twin. The former was the DDP-trained counterpart of the gradient-accumulated , while the latter was the gradient-accumulated counterpart of . So while something odd clearly happened, it doesn't look like DDP or gradient accumulation by themselves are the culprit. I think that at this point, it's best for me to draw a line under this -- I have a bunch of other things I'd like to get to, and this is a bit of a side quest at this point. Still, I have one main takeaway from this: chasing lower loss is technically interesting but is not the only goal. In some cases, it seems likely that lower-loss models can be worse for actual use. Coming up next: I'm going to wrap up this "interventions" mini-series, and move on to the final steps in my LLM from scratch journey. See you then! One that fine-tuned the model then got it to generate responses, then saved those responses in a file. One that took a bunch of files generated by the above, one for each of a set of different models, and presented them to the LLM together, so that it would (hopefully) be consistent in how it rated them relative to each other. Its raw intelligence: lower-loss models were smarter, so they were better at instruction-following after the fine-tune. Its knowledge. All of the models -- mine and OpenAI's -- apart from the FineWeb-Edu ones were trained on what amounted to minimally-curated data from the Internet. But FineWeb-Edu is meant to be "the most educational" subset of FineWeb, so it presumably is more dense in useful facts. , the baseline cloud-trained model for all of the interventions . , the locally-trained version of the same -- the first model from this post . , the best model we managed to get in the cloud . , the best local model -- the second from this post . The first surprise is . It has the fourth-best loss, but it's the worst model out of all of them on the instruction fine-tuning test! It was trained on exactly the same data as all of the others apart from the OpenAI ones and the FineWeb-Edu ones. Even more perplexingly, it was as close a match to as I could make it, but got completely different results. You might remember from the post that those two runs started with the same weights and had exactly the same training config; the only difference was that they were trained on different architectures, and one used DDP with a real global batch size of 96, while the other used gradient accumulation to get the same batch size. also does much worse than you'd expect from its loss numbers; it's only a tiny bit worse than Cloud FineWeb, 8x A100 40 GiB in loss terms, but much worse on the IFT test. Again, this one is essentially a clone of another: , which was the same training run but using DDP rather than gradient accumulation. The same problem -- one of a pair of closely-matched models has worse results on the IFT test. But in this case, it's the gradient accumulation model that turned out bad. Fine-tune the model for a maximum of 100 epochs over the training set. If loss on a held-back validation set went above the result for the previous epoch, we did an early exit and used the previous epoch's model for the generation of the responses. took 6 epochs until validation loss started rising. Performance on this test is correlated with loss, but it's far from the only factor. The OpenAI weights consistently lead the pack. Of our own models, , Cloud FineWeb, 8x A100 40 GiB, and Local FineWeb-Edu train do pretty well. Strangely, Local FineWeb-Edu extended train, which is just Local FineWeb-Edu train that has been trained on a further 3B tokens of the FineWeb-Edu dataset, is consistently worse than the model it was based on. and are consistently bad. Cloud FineWeb, 8x A100 80 GiB is also not great.

1 views
Giles's blog 3 weeks ago

How an LLM becomes more coherent as we train it

I remember finding it interesting when, back in 2015, Andrej Karpathy posted about RNNs and gave an example of how their output improves over the course of a training run . What might that look like for a (relatively) modern transformers-based LLM? I recently trained a GPT-2-small-style LLM, with 163 million parameters, on about 3.2 billion tokens (that's about 12.8 GiB of text) from the Hugging Face FineWeb dataset, and over the course of that training run, I saved the current model periodically -- 57 checkpoints over two days. Here's what it looked like -- the start, the end, and some interesting waypoints in between. For each checkpoint, I asked it to generate a completion to the words "Every effort moves you". 1 When the model was first created, before any training had been done, it came up with this: If you've read the Karpathy essay, you'll see one important difference -- it's already got words in there. His RNNs were generating complete noise at this stage. Even by the 100th iteration, he gives an example like this: That's an important difference between the RNNs he was talking about, which were character-based and had to learn about words and the like, and LLMs like this one, where the text is input and then output one token at a time. ( More info here ). Still, even though it has what looks like words, it's essentially content-free token salad with no structure or coherence 2 . Let's see what happens if we train it more. In my training loop, it sees 96 sequences of 1,024 tokens, and then we update it based on its loss (an index of how wrong it was at predicting next tokens), so that's 98,304 tokens for each step. After 617 of these 3 it seems to have mostly learned something about which tokens are most common: By the next checkpoint at step 1234, we've got something that's starting to come together. It doesn't make sense, but there's some kind of glimmering of meaning: And just a little while later, at the checkpoint at step 2468, we have something that actually makes some kind of sense (at least at the start)! Now, the training data I'm using was scraped from the Internet, and unsurprisingly there's a lot of somewhat cheesy business content there. By step 9255, we're starting to get a lot of stuff like this: ...or even more cheesy self-help stuff (step 10489): To be fair, the starting point of "Every effort moves you" is probably biasing things a bit there. But let's be clear: by this point it's seen 1,031,110,656 tokens -- that is, it's about one third trained. And it's coming up with pretty coherent text! The rest of the training run is more about refining things -- the loss chart for this training run looks like this: Loosely speaking, the lower the loss number, the better the model is, so you can see that the bulk of the improvement had happened by this point. From here on, I'll just give a few of the more interesting samples: By step 14191, it's started using bullet points... Step 24680 -- more motivational stuff: Step 25297 -- small models like this do like repeating themselves. You might remember seeing ChatGPT output back in 2023 or so that had tics like this: And again at step 26531 At step 27765 it decides that it has had enough after generating just a couple of words and tries to start a new document: But step 28382 is actually rather good. I particularly like the "however": And finally, the training run finishes at step 33164 with these wise words of caution: Well worth remembering, I'm sure we can all agree. I wonder what deep wisdom we'd have gained if I had asked it to generate more than 20 new tokens... What I found most surprising when I first started playing with this is how fast even simple LLMs got to a stage where they could generate plausible text. Just one third of the way through the training run, this model was making some kind of sense. The problem, of course, is that we don't just want generators of plausible content -- we want that content to make sense and be correct. And that's why it's worth grinding through the other two thirds -- in the hope that when you ask it to complete "The capital of France is", it will reply with "Paris" rather than a coherent but wrong answer like "Rouen". Technical details: 20 GPT-2 tokens generated on top of the initial text, with a temperature of 1. I've added line breaks to make it easier to read the samples.  ↩ Well, it mentions " despicable capitalists", but I suspect that's just randomness rather than some kind of primitive political consciousness. Including the space at the start, that's tokens 47034 and 32663 in the GPT-2 tokeniser.  ↩ So, 60,653,568 tokens seen.  ↩ Technical details: 20 GPT-2 tokens generated on top of the initial text, with a temperature of 1. I've added line breaks to make it easier to read the samples.  ↩ Well, it mentions " despicable capitalists", but I suspect that's just randomness rather than some kind of primitive political consciousness. Including the space at the start, that's tokens 47034 and 32663 in the GPT-2 tokeniser.  ↩ So, 60,653,568 tokens seen.  ↩

0 views
<antirez> 4 weeks ago

AI cybersecurity is not proof of work

The proof of work is the wrong analogy: finding hash collisions, while exponentially harder with N, is guaranteed to find, with enough work, some S so that H(S) satisfies N, so an asymmetry of resources used will see the side with more "work ability" eventually winning. But bugs are different: 1. Different LLMs executions take different branches, but eventually the possible branches based on the code possible states are saturated. 2. If we imagine sampling the model for a bug in a given code M times, with M large, eventually the cap becomes not "M" (because of saturated state of the code AND the LLM sampler meaningful paths), but "I", the model intelligence level. The OpenBSD SACK bug easily shows that: you can run an inferior model for an infinite number of tokens, and it will never realize(*) that the lack of validation of the start window, if put together with the integer overflow, then put together with the fact the branch where the node should never be NULL is entered regardless, will produce the bug. So, cyber security of tomorrow will not be like proof of work in the sense of "more GPU wins"; instead, better models, and faster access to such models, will win. * Don't trust who says that weak models can find the OpenBSD SACK bug. I tried it myself. What happens is that weak models hallucinate (sometimes causally hitting a real problem) that there is a lack of validation of the start of the window (which is in theory harmless because of the start BTW, this is why with this bug, the stronger the model you pick (but not enough to discover the true bug), the less likely it is it will claim there is a bug. Stronger models hallucinate less, so they can't see the problem in any side of the spectrum: the hallucination side of small models, and the real understanding side of Mythos. Comments

0 views
Giles's blog 4 weeks ago

Writing an LLM from scratch, part 32k -- Interventions: training a better model locally with gradient accumulation

I've been working on a GPT-2-small-style LLM based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". I've trained various versions of it in the cloud to work out which interventions to the model and training code had the best effects on the loss it gets on a specific test dataset, and now I wanted to do a training run locally to match the best of those. For that, I wanted to match the batch size I was using for the cloud training runs. When I first started learning this stuff, batching seemed like a performance thing -- with highly parallel systems like GPUs, it generally turned out that you could run a batch of (say) two inputs through a model in less than twice the time you could run one, so it made sense to batch them up. For inference, that is exactly the advantage you get, but when training, it's become increasingly clear to me that you can also get an improvement in the quality of the model from batching. The best intuitive model I have is that if you run inputs through one-by-one, adjusting parameters after each, then it's easy for the model to "overcorrect" each time. With batches, you get an average set of gradients across all of the items -- which smooths things out and stabilises the training. Of course, it's possible to overdo it. As an extreme example, imagine that you were somehow able to fit your whole training set into one batch -- then you could train by running that single batch through, doing a single backward pass, and then adjusting the parameters once. It's pretty clear that that would not work very well -- just one single update of the initially-random parameters. When training on my local machine, I could fit a batch of six sequences into my RTX 3090. I'd found that when I moved to cloud machines, it had a very positive effect on the loss I got out of the models when I tested them. From a quick-and-dirty bit of curve-fitting , I estimated that the optimal batch size for this model, with that training run, was somewhere around 97. Conveniently, that was close to the maximum I could fit onto an 8x A100 40 GiB/GPU machine, so I used a batch size of 96 to test the different interventions I was trying. And when I finally put all of the interventions that helped with training together , I found (somewhat to my surprise) that their combined effect -- an improvement in loss of 0.113765 -- was less than half of the loss improvement of 0.252474 that I had got from increasing the batch size. What that all made clear was that if I wanted to do a local training run that matched the quality of the cloud-trained model, I'd need to not only add on the interventions that I'd been testing in detail, but I'd need to match the cloud batch size. And for that, I needed to learn about gradient accumulation. Gradient accumulation is pretty much what it sounds like; instead of the normal technique of doing a forward pass, working out the loss, getting gradients with a backward pass, and then applying them by stepping the optimiser, you do multiple forward-backward phases, letting the gradients accumulate, and then do one optimiser step after that. When you do that, you're getting the training stabilisation benefits of a larger batch size, even though you're not getting the performance boost. Sounds simple enough, and it is, in theory, but implementation got a little more complicated. Let's work through it step-by-step. To start with, imagine you have a really simple training loop: Adding gradient accumulation to that is really simple! Let's assume that has a length divisible by , the number of steps we want to run through before we step the optimiser. As a first (not quite correct) cut, you could just do this: You can see that we're just stepping the optimiser every steps. An alternative way to do it would be with an inner loop: Which of those is better would depend on the details of the training loop -- in general, if you wanted the "other stuff" to be done once per training batch, then you'd want to use the first option, whereas if you wanted it to be done once per optimiser step, the second would be easier. As you'll see in a bit, I went for the second one for my code. However, there's one small correction that we need to do to make either of these properly. Remember that when you calculate loss across a batch -- for example, cross entropy loss like this: ...you're getting the average loss across the batch, so when you do the backward pass, you're getting the average gradients. By contrast, in the code above, we're doing a backward pass on the complete loss at each step, so the gradients that are being generated in each backward pass are being added to each other -- you wind up with the sum of all of them rather than the average. So the gradients that the optimizer applied would be times larger than they should be -- it would be as if we'd multiplied the learning rate by that number! But that's easy enough to fix. The average gradients over a number of steps are the sum divided by the number of steps, and we can do that division ahead of time just by scaling the loss down. Adding that into the first example above: And that's basically it; with those changes, the original basic training loop becomes one that uses gradient accumulation. The effective batch size is whatever the real batch size is, times the number of gradient accumulation steps. However, the real training loop that I'm using for these experiments is a bit more complicated than that simple example. There's checkpointing, AMP, and -- most importantly -- it can handle multi-GPU training using DistributedDataParallel. That made things a little bit more complicated. The first thing was to look into the way I was selecting the data to train on. My dataset was already in batches, but we had to split those batches up between GPUs. The solution in the code was to work out how many global steps there were -- each global step being one batch going through each GPU on the machine -- like this: , if you remember from the DDP post , is the number of processes running in a multi-GPU training run -- one per GPU. Next, in the training loop, I iterated over the global steps: ...for each one, getting the appropriate batch out for the specific GPU that was running the code: is a zero-indexed number, unique to each of the per-GPU processes. So this basically split into chunks of length , and then each GPU was fed the batch at its 's offset into the chunk. I wanted to keep things shaped such that when I was running with gradient accumulation locally, it would be similar to a cloud run with per-GPU batching. Specifically: when I was training in the cloud, I had eight GPUs with a per-GPU microbatch size of 12, giving a total batch size of 96. Locally, I could fit a batch size of six on my GPU, so I needed to do gradient accumulation over 96 / 6 = 16 steps. To keep things as similar as possible, I decided that I wanted the concept of a "global step" to match between the runs. In other words, it would expand slightly, from meaning "one batch per GPU" to being "one optimiser step per GPU". So, each time through that loop, we'd do multiple forward-backward passes, and then one optimiser step. That would mean that the best way to do things would be with something much more like the second of the two bits of sample code above -- the one with the inner loop rather than the modulus. Maybe that's easier to show in code: That required a change to the data lookup; I decided that would be split into chunks of size , and then each of those would be split into chunks of size , so the code to get the appropriate batch for a given run through the loop became this: That required a corresponding change in to make sure that was divisible by both the world size, the per-GPU batch ("microbatch") size, and the number of gradient accumulation steps, but that was easy: ...became this: That was enough to get the gradient accumulation happening! Next, I needed to change the backward pass code to scale down the loss so that we got averaged rather than summed gradients. Because we might be using AMP with a scaler, the code wasn't just a simple : ...but the change was obvious enough: All of those changes put together, plus a bit of shuffling around of code, were enough to get a correct gradient accumulation training loop! But there was one small tweak I needed to add. When you're using DDP, gradients need to be synchronised between the different per-GPU processes. As a reminder, what happens is: Now, with my first cut of the gradient accumulation code above, what would have happened is this: That would be correct, but not very efficient. We're sending out gradients and averaging on every accumulation step. But because each of our per-GPU processes is keeping its own "local" average (by accumulating the scaled-down gradients), we only really need to send those local averages out and get a global average once, just before we step the optimiser. If we do that, we can save quite a lot of work. The trick to avoid that was to use the method on the class that our own model is wrapped in. What we wanted to do was suppress the gradient synchronisation for each of the accumulation steps apart from the last one. It was easy to work out whether we were on the last gradient accumulation step: Now, what we needed to do was to wrap this: ...in , but only if was false. Conditional statements can be a little fiddly, but Python has a "do-nothing" context manager in -- that is, ...is identical to just: So we can combine that with the ternary operator like this: ...which does exactly what we want 1 . With that change, I had something I was happy with; you can see the diff here . So now it was time to do a training run! I'd originally been planning to jump right in and do a training run based on my last cloud run , with all of the interventions I'd decided were worth using, but locally with gradient accumulation. However, I decided that it would be interesting to try doing a new "baseline" train first. I'd done my local training runs, and then established a baseline version in the cloud by taking exactly the same configuration and doing the training run on an 8x A100 40 GiB with an overall batch size of 96. So I could repeat that locally with gradient accumulation, and that would show two things (or perhaps, the same thing but in different lights): That would help confirm my understanding that it was the increased batch size that helped in the cloud, and not, say, some architectural difference -- and would also act as a good test of the gradient accumulation code. Here's the training run config . I kicked it off: That looked like the right number of global steps; it matched the numbers I saw when training in the cloud. And 44 hours for the training run seemed correct: my original local runs took 48, but with them I was spending quite a lot of time on validation, which this code didn't do. Just less than two days later: That all looked good. The loss chart looked like this: For comparison, here's the one from the cloud training run with the same config (but using larger batches rather than gradient accumulation): You can see that they're similar, but not identical. That's pretty much what you'd expect! The two training runs were on different architectures -- RTX 3090 vs A100 -- and so there will probably be differences in the CUDA kernels, and also PyTorch's AMP (which uses 16-bit instead of 32-bit in cases where it makes sense) might make different decisions. I think that if we'd run it on a machine with one A100, then the results of using gradient accumulation would be even closer (perhaps even identical) to a larger batch size, especially if we were training without AMP. I uploaded the model to Hugging Face and it was time for the evals. The smoke test first: As usual, reasonably coherent. But the important one was the loss on the test set: That's solid! The cloud-trained baseline model got 3.691526, so this local one was actually very slightly better, by 0.007691. But that's very close indeed, which is what we wanted to see :-) It was time to see what effect adding on the interventions would have. As a reminder, here are the changes I made to the config for this run: It did not include QKV bias. Here's the config . I kicked it off, and: It looked like it was going to take 40 hours; that matched what happened in the cloud runs, as removing dropout speeds things up quite a lot. Just less than two days later: The loss chart over the training run looked like this: That's very smooth, with no loss spikes. For comparison, here's the chart when we did the same training run in the cloud; you can see that it was a bit choppier than the local one. The gradient norm chart was also interesting: If you compare it to the one from the cloud training run below, you can see that the local one was actually noisier -- the cloud run has a few gradient spikes near the start but calms down from around global step 6,000 or so, whereas the local one is spiky up to about 3,000, then calm, but has a massive spike at around 10,000. The learning rate we don't need to compare, but it was worth sanity checking to make sure we really did train the right way: So that all looked good. The training run did have some differences to the cloud one, but (as with the previous baseline train) it looked similar enough. Architectural differences between the A100s in the cloud and the local RTX 3090 seemed like a plausible cause. I uploaded the model to Hugging Face , and it was time to run the evals. The smoke test first: Reasonably coherent -- and I think that's the first time I've seen an token in a smoke test output! But the important one is, as ever, the loss, and: Let's add both this one and the local baseline to the results table for all interventions: That's really weird! The local run with the interventions, , is 0.039600 points better than the cloud version of the same training run, . That's nice, in that lower loss is always better, but it's also rather confusing -- that's a bigger loss improvement than some of the interventions. In theory, all that we changed between the cloud version of this training run, and the local one was the architecture. I was expecting that to have an effect, but thought that it would be small -- as, indeed, it was with the baseline trains and , where you can see the loss difference was just 0.007691 -- about five times smaller. Now, when I was looking into the effects of noise on training loss , I found that changing the random seed that was used to initialise the weights (but starting the training run itself at the same random seed) had a much bigger effect on the resulting model quality than keeping the weights identical but varying the seed at the start of the post-initialisation phase of the training run. The standard deviation of the varied-weights, same-train models was about double the SD of the same-weights, varied-train. That was interesting, though not directly comparable -- those tests were done with the same training run, but the architecture held constant -- a 8x A100 40 GiB machine for each test. However, it felt like it would be a good idea to at least see whether we started with the same weights locally and when training in the cloud. My suspicion was that we probably would; the weight initialisation uses deterministic non-GPU code, so with the same seed we'd expect the same weights regardless of the computer. The similarity of the loss results for the local and cloud baseline training runs also seemed to point in that direction. But it was worth testing. I created a throwaway branch of the training code, which -- after creating the model -- just dumped the model weights to a file, then exited. I ran it locally using the config, and then I fired up yet another 8x A100 40 GiB machine on Lambda, ran the same code there, this time with the config, and then ed down the weights. Identical. That was reassuring! I considered doing more analysis on this; for example, in my investigations into noise, I found that keeping the same weights but altering the random seed for the rest of the training run, I got results with a standard deviation of 0.008672 -- more than four times smaller than the difference between the local and cloud trains with the interventions. Might that be a number I could use for some kind of comparison? However, I decided that it's not really comparable. That number was from varying the random seed, but keeping the same architecture. There's not really any solid reason to believe that keeping the seed constant but changing the architecture would cause the same kind of differences. They might be more similar, they might be less. I think that all we can really say here is that the change of machine changed some aspects of the training dynamics in a way that happened to get us a lower loss. I can easily imagine that if I'd done something slightly different -- used a local RTX 4090, for example -- it could equally well have gone in the other direction. And at least it's reassuring that the improvement was smaller than the interventions I was most convinced by; the only smaller ones were full-fat float32, gradient clipping, and QKV bias -- ones that I'd already decided might have only been beneficial due to noise. Most importantly, it was orders of magnitude smaller than the 0.252474 improvement I originally saw when I moved from local training to larger-batch cloud training. So, I think that that brings me to the end of this set of training experiments. We started with a locally-trained model that got a loss of 3.943522 on our test set, compared to the original GPT-2 small model, which got 3.499677 2 . I've tried a bunch of interventions to try to get my model closer, and finally I've managed to get almost all of the way there, to 3.538161. That's really pleasing! I think that there are two things to do before I can fully wrap up this "interventions" mini-series, and get back to the main-line LLM from scratch stuff. Firstly, I should revisit the instruction fine-tuning tests, which I put on hold while doing these training runs. That would give us some indication as to whether the loss improvement was just a technical improvement that made a number go down, or whether it actually improved the usefulness of the model. Secondly, I think I really need to write a wrap-up. I've been working on this stuff on and off since December, and I think a summary of what I did would be quite nice! I'll post soon; don't touch that dial :-) Thanks to this Stack Overflow answer for that trick.  ↩ I'm going to switch to six decimal places from now on -- previously I was rounding it to three, hence 3.500.  ↩ Each process does a forward pass. Each process does a backward pass. When they have the gradients, they essentially share them so that each process has an average of the gradients from all of those backward passes. Then they all step their optimisers to apply the average gradients to each process's copy of the model. For each gradient accumulation step: Each process does a forward pass. Each process does a backward pass. The average is worked out They all step their optimisers based on the most recent average Whether the increased effective batch size had as positive an effect on the loss as the increased real batch size did when I did my cloud runs. Whether the locally-trained gradient accumulation model was similar to the cloud-trained big-batch model in terms of its loss. Gradient clipping at 3.5 Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014. Weight decay changed from 0.1 to 0.01 Dropout removed Thanks to this Stack Overflow answer for that trick.  ↩ I'm going to switch to six decimal places from now on -- previously I was rounding it to three, hence 3.500.  ↩

1 views
Simon Willison 1 months ago

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

Meta announced Muse Spark today, their first model release since Llama 4 almost exactly a year ago . It's hosted, not open weights, and the API is currently "a private API preview to select users", but you can try it out today on meta.ai (Facebook or Instagram login required). Meta's self-reported benchmarks show it competitive with Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 on selected benchmarks, though notably behind on Terminal-Bench 2.0. Meta themselves say they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows". The model is exposed as two different modes on meta.ai - "Instant" and "Thinking". Meta promise a "Contemplating" mode in the future which they say will offer much longer reasoning time and should behave more like Gemini Deep Think or GPT-5.4 Pro. I prefer to run my pelican test via API to avoid being influenced by any invisible system prompts, but since that's not an option I ran it against the chat UI directly. Here's the pelican I got for "Instant": And this one for "Thinking": Both SVGs were rendered inline by the Meta AI interface. Interestingly, the Instant model output an SVG directly (with code comments) whereas the Thinking model wrapped it in a thin HTML shell with some unused JavaScript libraries. Which got me curious... Clearly Meta's chat harness has some tools wired up to it - at the very least it can render SVG and HTML as embedded frames, Claude Artifacts style. But what else can it do? I asked it: what tools do you have access to? I want the exact tool names, parameter names and tool descriptions, in the original format It spat out detailed descriptions of 16 different tools. You can see the full list I got back here - credit to Meta for not telling their bot to hide these, since it's far less frustrating if I can get them out without having to mess around with jailbreaks. Here are highlights derived from that response: Browse and search . can run a web search through an undisclosed search engine, can load the full page from one of those search results and can run pattern matches against the returned page content. Meta content search . can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including , , , and . "Catalog search" - can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector. Image generation . generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images. container.python_execution - yes! It's Code Interpreter , my favourite feature of both ChatGPT and Claude. Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at . Python 3.9 is EOL these days but the library collection looks useful. I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from January 2021 ). container.create_web_artifact - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to for websites/apps or for vector graphics." container.download_meta_1p_media is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox. container.file_search - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar? Tools for editing files in the container - , (with and ), . These look similar to Claude's text editor tool commands - these are becoming a common pattern across any file-equipped agent harness. container.visual_grounding - see below, this one is fun . subagents.spawn_agent - the sub-agent as a tool pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response." third_party.link_third_party_account - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook. Let's talk about that one. Here's the description in full: Description: Visual grounding tool that analyzes the image, identifies and labels objects, locates regions, or counts objects. Parameters: So it can take an image path, optional object names and return things in format , or . Sounds to me like Meta's Segment Anything . I decided to take it for a spin. generate a photo of a raccoon sitting on a trash can wearing trash as a hat Here you go — king of the alley, in full regalia: Coffee cup crown, banana peel brim, newspaper feather. Peak raccoon fashion. Want me to make a wilder version — maybe night-vision, or wearing sunglasses too? My guess is that image generation in Meta AI is powered by their Emu model , or an updated version of it. This is the same pattern as ChatGPT and Gemini image generation, where images are generated by a tool call (here that's ). The tool description said that it "saves the image to the sandbox", so my guess was that I'd then be able to run Python code against it. I tried: use python OpenCV to analyze that image and find out neat things about it Annoyingly it did not show me the code it ran, but it did create this image: Pretty neat! We can both generate images and then use them with Python tools. Next, the tool. Knowing the names of the tools makes it easy to refer to them in prompts: Now try the container.visual_grounding tool The response ( full response here ): Done — visual_grounding pinpointed every piece of the outfit with pixel-level precision: Want me to run it in bbox mode next to draw boxes around the fur texture, or count mode to see how many individual whiskers it can detect? Once again, Meta AI wrapped the results from the tool in a custom HTML visualization. That's clearly mode, but how about the others? I prompted: yes draw bounding boxes, can it do masks too? Yes — bounding boxes work great. Masks? Not directly in ; it only returns three formats: point, bbox, and count. No pixel-level segmentation masks. Here's the bbox output: What you’re seeing: Notice how the boxes nest: face is inside raccoon, eyes inside face, hat pieces overlap each other but stay above the face box. The tool is doing real object localization, not just guessing centers. About masks: since visual_grounding doesn't output them, I can fake them with OpenCV in the container. Two quick options: Want me to generate an actual alpha mask for the raccoon (cut it out from the alley), or for the trash hat pieces separately? Takes about 10 seconds and I can give you a transparent PNG. Here's the HTML for that bbox visualization . Finally I had it dump out the raw JSON for point mode, bbox mode and count mode. You can see that all here . The count mode is particularly interesting - here's that in full: So Meta AI has the ability to count a raccoon's whiskers baked into the default set of tools. Which means... it can count pelicans too! Here's that overlay exported as HTML . Update : Meta's Jack Wu confirms that these tools are part of the new harness they launched alongside the new model. On Twitter Alexandr Wang said : this is step one. bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with plans to open-source future versions. I really hope they do go back to open-sourcing their models. Llama 3.1/3.2/3.3 were excellent laptop-scale model families, and the introductory blog post for Muse Spark had this to say about efficiency: [...] we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison. So are Meta back in the frontier model game? Artificial Analysis think so - they scored Meta Spark at 52, "behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6". Last year's Llama 4 Maverick and Scout scored 18 and 13 respectively. I'm waiting for API access - while the tool collection on meta.ai is quite strong the real test of a model like this is still what we can build on top of it. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Browse and search . can run a web search through an undisclosed search engine, can load the full page from one of those search results and can run pattern matches against the returned page content. Meta content search . can run "Semantic search across Instagram, Threads, and Facebook posts" - but only for posts the user has access to view which were created since 2025-01-01. This tool has some powerful looking parameters, including , , , and . "Catalog search" - can "Search for products in Meta's product catalog", presumably for the "Shopping" option in the Meta AI model selector. Image generation . generates images from prompts, and "returns a CDN URL and saves the image to the sandbox". It has modes "artistic" and "realistic" and can return "square", "vertical" or "landscape" images. container.python_execution - yes! It's Code Interpreter , my favourite feature of both ChatGPT and Claude. Execute Python code in a remote sandbox environment. Python 3.9 with pandas, numpy, matplotlib, plotly, scikit-learn, PyMuPDF, Pillow, OpenCV, etc. Files persist at . Python 3.9 is EOL these days but the library collection looks useful. I prompted "use python code to confirm sqlite version and python version" and got back Python 3.9.25 and SQLite 3.34.1 (from January 2021 ). container.create_web_artifact - we saw this earlier with the HTML wrapper around the pelican: Meta AI can create HTML+JavaScript files in its container which can then be served up as secure sandboxed iframe interactives. "Set kind to for websites/apps or for vector graphics." container.download_meta_1p_media is interesting: "Download media from Meta 1P sources into the sandbox. Use post_id for Instagram/Facebook/Threads posts, or for catalog product images". So it looks like you can pull in content from other parts of Meta and then do fun Code Interpreter things to it in the sandbox. container.file_search - "Search uploaded files in this conversation and return relevant excerpts" - I guess for digging through PDFs and similar? Tools for editing files in the container - , (with and ), . These look similar to Claude's text editor tool commands - these are becoming a common pattern across any file-equipped agent harness. container.visual_grounding - see below, this one is fun . subagents.spawn_agent - the sub-agent as a tool pattern. "Spawn an independent sub-agent for research, analysis, or delegation. It returns its final text response." third_party.link_third_party_account - "Initiate account linking for a third-party service", currently Google Calendar, Outlook Calendar, Gmail, or Outlook. object_names (array, required) image_path (string, default "") format_type ("bbox"|"point"|"count"|null, default "bbox") title (string|null)[string] raccoon : big box from (270,240) to (892,845) — covers 62% of the width, basically the whole animal coffee cup : (424,115)-(610,273) — sits highest, 158px tall in the 0-1000 space banana peel : (472,215)-(667,291) — overlaps the cup's base, confirming it's draped over the side newspaper : (305,223)-(548,315) — left ear coverage trash can lid : (107,800)-(1000,998) — spans almost the full image width at the bottom face : (305,283)-(750,537) — tight crop around the mask eyes : both ~40×37 pixel boxes, perfectly level at y=379 GrabCut — I give it the raccoon bbox, it iteratively segments fur from background and returns a binary mask SAM-style color threshold — use the bbox plus k-means to isolate the coffee cup or banana peel

0 views
devansh 1 months ago

On LLMs and Vulnerability Research

I have been meaning to write this for six months. The landscape kept shifting. It has now shifted enough to say something definitive. I work at the intersection of vulnerability triage. I see, every day, how this landscape is changing. These views are personal and do not represent my employer. Take them with appropriate salt. Two things happened in quick succession. Frontier models got dramatically better (Opus 4.6, GPT 5.4). Agentic toolkits (Claude Code, Codex, OpenCode) gave those models hands. The combination produces solid vulnerability research. "LLMs are next-token predictors." This framing was always reductive. It is now actively misleading. The gap between what these models theoretically do (predict the next word) and what they actually do (reason about concurrent thread execution in kernel code to identify use-after-free conditions) has grown too wide for the old frame to hold. Three mechanisms explain why. Implicit structural understanding. Tokenizers know nothing about code. Byte Pair Encoding treats , , and as frequent byte sequences, not syntactic constructs. But the transformer layers above tell a different story. Through training on massive code corpora, attention heads specialise: some track variable identity and provenance, others develop bias toward control flow tokens. The model converges on internal representations that capture semantic properties of code, something functionally equivalent to an abstract syntax tree, built implicitly, never formally. Neural taint analysis. The most security-relevant emergent capability. The model learns associations between sources of untrusted input (user-controlled data, network input, file reads) and dangerous sinks (system calls, SQL queries, memory operations). When it identifies a path from source to sink without adequate sanitisation, it flags a vulnerability. This is not formal taint analysis. No dataflow graph as well. It is a statistical approximation. But it works well for intra-procedural bugs where the source-to-sink path is short, and degrades as distance increases across functions, files, and abstraction layers. Test-time reasoning. The most consequential advance. Standard inference is a single forward pass: reactive, fast, fundamentally limited. Reasoning models (o-series, extended thinking, DeepSeek R1) break this constraint by generating internal reasoning tokens, a scratchpad where the model works through a problem step by step before answering. The model traces execution paths, tracks variable values, evaluates branch conditions. Symbolic execution in natural language. Less precise than formal tools but capable of handling what they choke on: complex pointer arithmetic, dynamic dispatch, deeply nested callbacks. It self-verifies, generating a hypothesis ("the lock isn't held across this path"), then testing it ("wait, is there a lock acquisition I missed?"). It backtracks when reasoning hits dead ends. DeepSeek R1 showed these behaviours emerge from pure reinforcement learning with correctness-based rewards. Nobody taught the model to check its own work. It discovered that verification produces better answers. The model is not generating the most probable next token. It is spending variable compute to solve a specific problem. Three advances compound on each other. Mixture of Experts. Every frontier model now uses MoE. A model might contain 400 billion parameters but activate only 17 billion per token. Vastly more encoded knowledge about code patterns, API behaviours, and vulnerability classes without proportional inference cost. Million-token context. In 2023, analysing a codebase required chunking code into a vector database, retrieving fragments via similarity search, and feeding them to the model. RAG is inherently lossy: code split at arbitrary boundaries, cross-file relationships destroyed, critical context discarded. For vulnerability analysis, where understanding cross-module data flow is the entire point, this information loss is devastating. At one million tokens, you fit an entire mid-size codebase in a single prompt. The model traces user input from an HTTP handler through three middleware layers into a database query builder and spots a sanitisation gap on line 4,200 exploitable via the endpoint on line 890. No chunking. No retrieval. No information loss. Reinforcement-learned reasoning. Earlier models trained purely on next-token prediction. Modern frontier models add an RL phase: generate reasoning chains, reward correctness of the final answer rather than plausibility of text. Over millions of iterations, this shapes reasoning to produce correct analyses rather than plausible-sounding ones. The strategies transfer across domains. A model that learned to verify mathematical reasoning applies the same verification to code. A persistent belief: truly "novel" vulnerability classes exist, bugs so unprecedented that only human genius could discover them. Comforting. Also wrong. Decompose the bugs held up as examples. HTTP request smuggling: the insight that a proxy and backend might disagree about where one request ends and another begins feels like a creative leap. But the actual bug is the intersection of known primitives: ambiguous protocol specification, inconsistent parsing between components, a security-critical assumption about message boundaries. None novel individually. The "novelty" was in combining them. Prototype pollution RCEs in JavaScript frameworks. Exotic until you realise it is dynamic property assignment in a prototype-based language, unsanitised input reaching object modification, and a rendering pipeline evaluating modified objects in a privileged context. Injection, type confusion, privilege boundary crossing. Taxonomy staples for decades. The pattern holds universally. "Novel" vulnerabilities decompose into compositions of known primitives: spec ambiguities, type confusions, missing boundary checks, TOCTOU gaps, trust boundary violations. The novelty is in the composition, not the components. This is precisely what frontier LLMs are increasingly good at. A model that understands protocol ambiguity, inconsistent component behaviour, and security boundary assumptions has all the ingredients to hypothesise a request-smuggling-class vulnerability when pointed at a reverse proxy codebase. It does not need to have seen that exact bug class. It needs to recognise that the conditions for parser disagreement exist and that parser disagreement at a trust boundary has security implications. Compositional reasoning over known primitives. Exactly what test-time reasoning enables. LLMs will not discover the next Spectre tomorrow. Microarchitectural side channels in CPU pipelines are largely absent from code-level training data. But the space of "LLM-inaccessible" vulnerabilities is smaller than the security community assumes, and it shrinks with every model generation. Most of what we call novel vulnerability research is creative recombination within a known search space. That is what these models do best. Effective AI vulnerability research = good scaffolding + adequate tokens. Scaffolding (harness design, prompt engineering, problem framing) is wildly underestimated. Claude Code and Codex are general-purpose coding environments, not optimised for vulnerability research. A purpose-built harness provides threat models, defines trust boundaries, highlights historical vulnerability patterns in the specific technology stack, and constrains search to security-relevant code paths. The operator designing that context determines whether the model spends its reasoning budget wisely or wastes it on dead ends. Two researchers, same model, same codebase, dramatically different results. Token quality beats token quantity. A thousand reasoning tokens on the right code path with the right threat model outperform a million tokens sprayed across a repo with "find vulnerabilities." The search space is effectively infinite. You cannot brute-force it. You narrow it with human intelligence encoded as context, directing machine intelligence toward where bugs actually live. "LLMs are non-deterministic, so you can't trust their findings." Sounds devastating. Almost entirely irrelevant. It confuses the properties of the tool with the properties of the target. The bugs are deterministic. They are in the code. A buffer overflow on line 847 is still there whether the model notices it on attempt one or attempt five. Non-determinism in the search process does not make the search less valid. It makes it more thorough under repetition. Each run samples a different trajectory through the hypothesis space. The union of multiple runs covers more search space than any single run. Conceptually identical to fuzzing. Nobody says "fuzzers are non-deterministic so we can't trust them." You run the fuzzer longer, cover more input space, find more bugs. Same principle. Non-determinism under repetition becomes coverage. In 2023 and 2024, the state of the art was architecture. Multi-agent systems, RAG pipelines, tool integration with SMT solvers and fuzzers and static analysis engines. The best orchestration won. That era is ending. A frontier model ingests a million tokens of code in a single prompt. Your RAG pipeline is not an advantage when the model without RAG sees the whole codebase while your pipeline shows fragments selected by retrieval that does not know what is security-relevant. A reasoning model spends thousands of tokens tracing execution paths and verifying hypotheses. Your external solver integration is not a differentiator when the model approximates what the solver does with contextual understanding the solver lacks. Agentic toolkits handle orchestration better than your custom tooling. The implication the security industry has not fully processed: vulnerability research is being democratised. When finding a memory safety bug in a C library required a Project Zero-calibre researcher with years of experience, the supply was measured in hundreds worldwide. When it requires a well-prompted API call, the supply is effectively unlimited. What replaces architecture as the competitive advantage? Two things. Domain expertise encoded as context. Not "find bugs in this code" but "this is a TLS implementation; here are three classes of timing side-channel that have affected similar implementations; analyse whether the constant-time guarantees hold across these specific code paths." The human provides the insight. The model does the grunt work. Access to compute. Test-time reasoning scales with inference compute. More tokens means deeper analysis, more self-verification, more backtracking. Teams that let a model spend ten minutes on a complex code path will find bugs that teams limited to five-second responses will miss. The end state: vulnerability discovery for known bug classes becomes a commodity, available to anyone with API access and a credit card. The researchers who thrive will focus where the model cannot: novel vulnerability classes, application-level logic flaws, architectural security review, adversarial creativity. This is not a prediction. It is already happening. The pace is set by model capability, which doubles on a timeline measured in months. Beyond next-token prediction Implicit structural understanding Neural taint analysis Test-time reasoning The architecture that enabled this Mixture of Experts Million-token context Reinforcement-learned reasoning The myth of novel vulnerabilities Scaffolding and tokens Non-determinism is a feature Orchestration is no longer your moat

0 views
Langur Monkey 1 months ago

Fine-tuning Qwen3.5 for Gaia Sky

A little over a year ago I set up a local pipeline to use different LLM s to respond to Gaia Sky questions using RAG . In that post, I built a dynamic scrapper that parsed the Gaia Sky website and documentation and ingested the content it into a vector database. Then, I built a minimal terminal chatbot interface that received the user prompt, queried the database for semantically similar data, and built up the context for each LLM call. The results were promising, and I found that they (obviously) strongly depended on the model used. Fast forward a few months, and the Qwen 3.5 models were released by Alibaba. The general consensus is that they are quite good for their size. I’ve been testing them for local inference with a similar impression. I thought that it would be interesting to repeat the exercise of creating a Gaia Sky AI assistant, but using a radically different approach: Instead of RAG, I would fine-tune the model itself. In this post, I describe this fine-tuning project, from the creation and engineering of the training dataset to the fine-tuning and production of the final GGUF models. This project is composed by two, very distinct parts, which map to top-level chapters in this post: At the end I quickly evaluate the results in the testing section. The source code, dataset, and models discussed in this post are in the following repositories: Here is the hardware I have used to create the dataset and fine-tune the model: The creation of the training dataset is the most important piece of work in this project. It is composed of three parts: When I started this project, my first instinct was “more is better.” I thought that if I fed the model every single , , , and file in the Gaia Sky repositories (project, documentation, etc.), it would emerge as an expert. Oh boy, was I wrong. A large codebase contains a lot of boilerplate noise. Getters, setters, license blocks, and infrastructure code that doesn’t actually help a model understand how the engine works or how to write scripts for it. I soon realized that the dataset is the single most important part of the project, and it needed a surgical approach. The plan was to automate the process of creating the dataset to a degree, and then use it to fine-tune the Qwen 3.5 4B and 8B model variants. I wrote to act as a high-pass filter. Instead of a blind crawl, I implemented an allowlist system. I would only let in the load-bearing files: Almost every source file in Gaia Sky starts with a copyright header. This is “dead weight” for training. I added a regex-based stripper to ensure the model’s limited context window was filled with code logic, not license text: The output of this first phase was a file where each line represented a single, cleaned-up file. It looked like this: This provided the “Context” for the next phase. However, a model trained directly on this would just learn to autocomplete files. To make it an assistant , we had to turn these files into a conversation. Once I had a clean extraction of the most relevant information pieces, I faced a new problem. A raw dump of a file is great for a search engine, but it is not a conversation. To turn these files into training data, I used a “teacher” model, Qwen 3.5 27B, to look at each file and generate a specific number of Q&A pairs. I wrote to handle this. The script calculates how many questions a file is worth based on its length and type. A long documentation file might get 25 Q&A pairs, while a short shader might only get 4. Below is an extract of the method that computes the number of target pairs for a file. Initially, I used the MoE Qwen 3.5 30B A3B, but it was consistently outputting the wrong format. Then I switched to the 27B dense model, and it performed a little better. Even so, I had to tell the model exactly how to behave. Here are the key items I learned the hard way: I also found that, at these model sizes, it is better to batch Q&A pairs instead of asking the model to provide 20 of them in one go. I finally gravitated to 3 Q&A pairs per inference call. To prevent the model from repeating itself across batches, I tracked existing questions and fed them back into the prompt as exclusions. The prompt is constructed as follows: It consists of the following parts: However, LLMs are chatty. Even with such strict instructions, even the 27B model sometimes messes up. It would still sometimes leak its own reasoning into the output. It would start its response with or it would include meta-talk like This created “dirty” data that polluted the dataset and undermined the fine-tuning process. If I trained on this, the final model would start every answer by talking to itself. To fix this, I built . This script is a heavy-duty cleaner that uses regex to strip out training artifacts. It first tries to rescue bad rows, and if it fails, it deletes them. If the model accidentally put both the question and answer in the “output” field, the sanitizer attempts to detect the question mark and splits them back into the correct structure. Here is a look at what the data looked like before and after the sanitization process. The sanitizer also had to deal with Javadoc remnants. Since Gaia Sky is a Java project, class- and method-level comment blocks are full of HTML tags like , , and (Javadoc syntax). The script converts these into clean Markdown so the LLM learns a consistent documentation style. By the end of this process, I had . This contains a curated, clean list of questions and answers based on the whole project documentation. Documentation is important, but I want the model to learn some Gaia Sky scripting as well. To do so, a new API/scripting dataset needed to be generated. To solve this, I built a synthetic data factory designed to teach the model both the content of the API and the context of how to use it. The first step was grounding the model. I wrote a script ( ) that scans the Java source files and uses Regex to pair Javadoc comments with their corresponding method signatures. The method uses regular expressions and a little bit of logic to generate Q&A pairs based on method signatures and their Javadoc documentation: This process produces the , which is used in the next step. It contains the API calls with their respective documentation. However, knowing a function exists isn’t enough. The model needs to know how to script with it. To address this, I developed to transform those raw Java signatures into a diverse pedagogical dataset. As input, it gets all test and showcase scripts in the Gaia Sky repository, and the raw API JSONL file. It produces four types of output, termed A, B, C, and D: Type A: The API reference These are direct “How do I use X?” pairs. They include the parameters, the return types, and a basic Python example. Type B: The task synthesis This step is optional, and I ended up not including it in the final dataset. However, I think it is still worth mentioning. I used the larger teacher dense model (27B) to generate complex tasks (e.g., “Write a script that navigates to Mars, waits 5 seconds, and takes a screenshot”). The script provided the teacher model with a safe list of real functions extracted in Step 1 as a sort of guardrail. If the teacher tried to hallucinate a command, the script flagged and discarded it. The results of this section were kind of underwhelming, possibly because more parameters are needed for such open-ended tasks. Type C: Adversarial error correction This is my favorite part. I programmatically broke the API calls to teach the model how to fix its own mistakes. The script would generate a wrong script (e.g., using instead of or missing a required argument) and then provide the correct version. The end goal was to prevent common LLM failures before they happen. Type D: The “gold standard” Library Finally, I indexed the actual test and showcase scripts from the Gaia Sky repository. These are human-written, battle-tested scripts that show the model how to handle complex logic, loops, and math. Finally, I prepared a small file with essential project information that must appear in the final integrated training dataset. It only contains 17 lines of Q&A, but it is rather important. Here is an excerpt of a few lines (formatted for readability): The final dataset was composed by concatenating the three parts, documentation, API, and identity. It can be explored here: Once the dataset was ready, it was time for the actual fine-tuning. With a dataset of 3,800+ specialized Gaia Sky pairs ready, it was time for the actual training. For this, I leaned on two heavy hitters in the open-source world: Unsloth and Qwen 3.5 . I started by training the 4B model, and then realized that I could also fit the 9B one in my GPU. In the post I’ll focus on the larger version of the model. I went as high as my local hardware allowed. Otherwise, I would have tried the 27B model, or even the 122B-A10B. Training a model with 9 billion parameters typically requires a massive server cluster, but by using 4-bit LoRA (Low-Rank Adaptation) , I was able to squeeze the entire process onto a single RTX 5080 (16GB) . The RTX 5080 is a beast, but to get the most out of it, I enabled TensorFloat-32 (TF32) . This allows the GPU to handle the heavy matrix multiplications of deep learning much faster than standard , without the precision loss of . I used the following parameters for the fine-tuning: LoRA Rank: 32 A balance between learning new patterns (like the Gaia Sky API) and retaining general knowledge. Target Modules: All major projection layers) Learning rate: \(2.0\times10^{-4}\) Optimizer: AdamW 8-bit The dataset is downloaded directly from the hub. It gets tokenized properly and passed into the SFT Trainer object for fine-tuning. The full code is in the file , and is surprisingly compact. The heavy lifting is done by and , of course. Once the LoRA weights are trained, they are dead weight until converted into a format people could actually use. I wrote to automate the most tedious part of the pipeline: After roughly an hour of training, and another one of quantizing and uploading, I had a fine-tuned Gaia Sky expert that performs great. Or did I? The next section discusses the performance of the model. Given the limited time I have and the low-parameter models used, my tests have been scarce. My expectations are not very high, but I still wanted to test the output of the fine-tuning and see how well the new knowledge was retained by the model. I only tested the Qwen 3.5 9B fine-tuned variant and compared it to the base model. You can get this model from HuggingFace: I used the same questions as in the Gaia Sky AI assistant post from 2025. My evaluation of the results is totally subjective, and includes a comparison with the base Qwen 3.5 9B. First, I recommend using a low temperature setting ( 0.2 in my tests) to avoid hallucinations and make sure the model does not get very “creative”. Also, I would suggest using the following system prompt, which frames the conversation and sets the tone: You are the Gaia Sky AI Expert, a specialized assistant with deep knowledge of the Gaia Sky universe, its Python API (v1 and v2), and its Java source code. Your goal is to help users explore the cosmos, write scripts, and understand the engine’s inner workings. That said, let’s go for the first question. Q: State the possible usages of Gaia Sky. A: Gaia Sky is a powerful open-source universe explorer and planetarium software designed for both interactive exploration and scientific visualization. Its primary use cases include: This is an excellent response to an easy question. It did not invent anything and correctly applied documentation knowledge like camera paths or LOD. Compared to base Qwen, which hallucinated a lot, this is much more grounded. Next is a question about the installation procedure on Linux. Q: How do I install Gaia Sky on Linux? A: To install Gaia Sky on Linux, follow these steps: Download the Installer Visit https://gaiasky.space/download and download the package for your architecture (e.g., ). Install via Package Manager Open a terminal and run: If dependency errors occur, fix them with: Launch Gaia Sky Run the command: The application will start in your default browser or terminal window. This is not bad. I did not specify the distribution, so it just assumed Debian. The instructions themselves look fine. Still, I wanted to see if it had absorbed the knowledge for other distros, so I asked specifically for Arch Linux: Q: How do I install it on Arch Linux specifically? A: To install Gaia Sky on Arch Linux , you have two primary options depending on whether you prefer the official repositories or building from source: The most straightforward method is via the Arch User Repository (AUR) . You will need an AUR helper like , , or . Install via AUR Helper : Open your terminal and run: This will automatically resolve dependencies (Java, OpenGL libraries) and install the application. Verify Installation : Once installed, launch it with: If you prefer to compile Gaia Sky yourself or need a specific version not yet in the AUR: Clone the Repository : Build and Install : Use Maven to build the project (requires Java Development Kit): After building, you can run it directly from the directory or install it using the provided Gradle wrapper if available in newer versions. Very good response. It recommends using the AUR package (correct), and also provides instructions to build and run it from source (also correct). However, it hallucinates the use of Maven instead of Gradle. The base model produced a long wall of text with mostly nonsense that would look plausible to the non-expert. In general, the responses from the base model were much longer and their content was very often factually wrong. The next question is a bit more technical. Q: How are cubemap textures used in Gaia Sky? A: Cubemaps in Gaia Sky serve a specific purpose related to planetary rendering rather than general environment mapping for the entire scene. They are primarily utilized to generate high-resolution, seamless spherical textures for planets (like Earth or Mars) by stitching together six square faces into a continuous surface. This technique allows for detailed texture representation without the distortion often seen in standard equirectangular projections near poles. Key Technical Details: This feature is distinct from standard environment maps used for sky reflections or star glow effects, focusing instead on accurate planetary visualization. This is excellent, even though it seems to imply that cubemaps are not used for environment maps and reflections, which is incorrect. However, it correctly identifies that they are mainly used to avoid distortions at the poles. Maybe it increased the importance of it because there is an entire documentation section devoted to cubemap textures in planets. The base model produced a long, Markdown-formatted, pretentious output that even included totally made-up scripts. Again, the fine-tuned response was much more grounded and correct. Now, onto the scripting questions. Here is where the fine-tuned model starts to hesitate and invent slightly. Here are the possible reasons: I personally think it has more to do with the first two reasons than the third. Q: How do I get the camera position with the APIv1 and scripting in Gaia Sky? A: To retrieve the current camera position using the Legacy API (v1) , you call : For the Modern Component API (v2) , you use from the module: First, it invented (correct is ) for APIv1. For APIv2, it correctly used , but the way it gets the object is incorrect. It should have done something like So, I think either this model is too small to be effectively fine-tuned for Gaia Sky scripting, or the training dataset is insufficient to achieve sufficient retention. Could also be both, as I said above. The base model has no idea about Gaia Sky scripting or anything related to it, so it just makes stuff up. Not even worth further mention. I used the Laptop 2 described above for testing and inference, with 28/32 layers on the GPU and a context window of ~4k, and I consistently got about 12 tok/s. Performance is exactly the same as with the base models, so this section is this short. This fine-tuning experiment has yielded valuable insights into the strengths and limitations of domain-specific model adaptation at lower parameter counts for local use. I think the foundational approach was sound. The dataset curation process, with its surgical filtering, teacher-based distillation, and rigorous sanitization, successfully encoded domain knowledge into the model. Proof of this is evident in the testing: the fine-tuned model, as opposed to the base one, correctly answered conceptual and documentation-heavy questions about Gaia Sky’s purpose, installation, and rendering techniques without hallucinating. It understood architectural details like LOD and cubemaps, and avoided inventing features that don’t exist. This demonstrates that fine-tuning can be an effective alternative to RAG for teaching models about a specific domain. However, it also struggled. The 9B model hit a hard ceiling when it came to API scripting and method names. It invented instead of , misunderstood how to instantiate APIv2 objects, and generally lacked the capacity to reliably retain the specific, syntactic details of the API surface. This is a classic problem: smaller models can absorb concepts and documentation , but struggle to memorize exact function signatures and usage patterns. With only 3,800+ training pairs and a 9B parameter budget, the model simply didn’t have enough capacity to encode both general knowledge and precise API details. So, what are the next steps? I believe the 4B and 9B models are too small for reliable Gaia Sky scripting assistance. My next experiment will be to fine-tune the Qwen 3.5 27B model . The jump from 9B to 27B parameters should provide substantially more capacity to encode API signatures without sacrificing general knowledge. Additionally, I could increase the scripting dataset by: That said, the hardware constraint is real. 27B requires more than my RTX 5080 can reasonably handle for full fine-tuning. However, with careful quantization (using 8-bit optimizers or even lower precision), 4-bit LoRA, and possibly gradient checkpointing, it may fit. If not, a cloud provider like Lambda Labs or Paperspace might be the way forward for a single training run. All in all, I think fine-tuning is a viable path for building domain-expert models, but it requires the right balance of dataset quality, model size, and hardware. For Gaia Sky specifically, a 27B model with a more robust scripting dataset would likely be the sweet spot before considering the jump to 70B+ models. I consider the infrastructure proven. It’s now a matter of scale. Training dataset creation Fine tuning Dataset creation and fine-tuning – gaiasky-finetune Gaia Sky training dataset repository – gaiasky-training-dataset Qwen3.5 Gaia Sky fine-tuned models – gaiasky-qwen-3.5-gguf Desktop PC – Arch Linux, Intel(R) Core(TM) i7-7700 (8) @ 4.20 GHz, 32 GB RAM, NVIDIA GeForce GTX 1070 8 GB. Laptop 1 – Windows 11, WSL2 (Arch Linux), Intel(R) Core(TM) Ultra 9 275HX (24) @ @ 3.07 GHz, 32 GB RAM, NVIDIA GeForce RTX 5080 Mobile 16 GB. Laptop 2 – Arch Linux, Intel(R) Core(TM) i7 8750H (12) @ 4.10 GHz, 16 GB RAM, NVIDIA GeForce GTX 1060 Mobile 6 GB. Documentation dataset API dataset Documentation: By far, the most important data, containing exhaustive human-written documentation pages. We convert the documentation RST files to Markdown with , and then we add some additional key files, like the project’s . Core Logic: Selected Java files that are representative of the brain of the engine (main loop, scene, renderer, etc.). Visual Logic: Selected shader files that define the look of the cosmos (stars, particles, PBR, etc.). Match the answer type: If the question doesn’t ask for code, don’t provide it. Grounding: Every claim must be directly grounded on the source text. Diversity: Every question must cover a different detail. Base text – This is composed by the raw strings in the variable. The file hint ( ) – We add hints depending on the filetype. The following table displays the hint for each type. Filetype Extensions Hint Java This is Java source code. Focus on class responsibilities, method signatures, and architectural patterns. Do NOT generate Python scripting examples. Python This is Python scripting code for the Gaia Sky API. Questions about usage and parameters are appropriate. Shader This is a GLSL shader. Focus on the rendering technique, uniforms, and mathematical operations. Do NOT generate Python scripting examples. Docs This is documentation. Focus on concepts, features, workflows, and user-facing features. The pair count ( ) – Contains the number of Q&A pairs to generate. The previous Q&A pairs, if any ( ) – This is constructed by listing the existing pairs, as parsed by the program in the output, or accumulated in the current run. The filepath and content ( , ) – Contain the file name and the actual content, which is capped to fit within the context length. Type A: The API reference These are direct “How do I use X?” pairs. They include the parameters, the return types, and a basic Python example. Type B: The task synthesis This step is optional, and I ended up not including it in the final dataset. However, I think it is still worth mentioning. I used the larger teacher dense model (27B) to generate complex tasks (e.g., “Write a script that navigates to Mars, waits 5 seconds, and takes a screenshot”). The script provided the teacher model with a safe list of real functions extracted in Step 1 as a sort of guardrail. If the teacher tried to hallucinate a command, the script flagged and discarded it. The results of this section were kind of underwhelming, possibly because more parameters are needed for such open-ended tasks. Type C: Adversarial error correction This is my favorite part. I programmatically broke the API calls to teach the model how to fix its own mistakes. The script would generate a wrong script (e.g., using instead of or missing a required argument) and then provide the correct version. The end goal was to prevent common LLM failures before they happen. Type D: The “gold standard” Library Finally, I indexed the actual test and showcase scripts from the Gaia Sky repository. These are human-written, battle-tested scripts that show the model how to handle complex logic, loops, and math. gaiasky-training-dataset@HuggingFace . LoRA Rank: 32 A balance between learning new patterns (like the Gaia Sky API) and retaining general knowledge. Target Modules: All major projection layers) Learning rate: \(2.0\times10^{-4}\) Optimizer: AdamW 8-bit Quantization: Converting the model to Q4_K_M GGUF , or whatever other quant. This reduces the model size enough that it can run on almost any modern laptop while keeping its capabilities mostly intact. HF upload: Automatically pushing the finished file to HuggingFace so the community can pull it directly into LM Studio or Ollama. gaiasky-qwen-3.5-gguf@HuggingFace . API Expertise: You know both the Legacy API (v1) and the Modern Component API (v2: camera, time, data, scene, graphics, camcorder, etc.). Prioritize the Modern API (v2) for new scripts unless Legacy is requested. Never invent method names, be 100% sure, otherwise suggest the user to look up the docs. Source Code: You understand the underlying Java architecture (libGDX, Ashley ECS, LWJGL3, OpenGL) and can explain how features like LOD, shaders, and data loaders are implemented. Documentation: You are familiar with the Gaia Sky capabilities, installation guides, and dataset management. Be Grounded: If a feature or method does not exist in your training data, do not invent it. State: “I don’t see that specific feature in the current Gaia Sky implementation.” Code Style: Always provide clean, commented Python or Java code. Use snake_case for Modern API calls. Tone: Professional, scientific, and enthusiastic about astrovisualization. Answer greetings warmly, then offer Gaia Sky-specific assistance. Interactive Exploration : Navigate the Milky Way, explore exoplanets, view the solar system in high detail, and time-travel through cosmic history using an intuitive interface. Scientific Visualization : Visualize complex astrophysical datasets (catalogs, simulations) with advanced rendering techniques like LOD (Level of Detail), shaders, and particle effects. Education & Outreach : Create immersive presentations or tutorials to teach astronomy concepts through dynamic 3D models of celestial objects. Scripting & Automation : Automate camera paths, data loading, and scene manipulation via Python (Modern API v2) or Java for reproducible workflows and research presentations. Data Analysis : Load and manipulate large-scale astronomical datasets directly within the application to analyze spatial distributions without external tools. Download the Installer Visit https://gaiasky.space/download and download the package for your architecture (e.g., ). Install via Package Manager Open a terminal and run: If dependency errors occur, fix them with: Launch Gaia Sky Run the command: The application will start in your default browser or terminal window. Install via AUR Helper : Open your terminal and run: This will automatically resolve dependencies (Java, OpenGL libraries) and install the application. Verify Installation : Once installed, launch it with: Clone the Repository : Build and Install : Use Maven to build the project (requires Java Development Kit): After building, you can run it directly from the directory or install it using the provided Gradle wrapper if available in newer versions. Dependencies : Ensure your system has up-to-date OpenGL drivers and Java (JDK 17+ is recommended). The AUR package typically handles these automatically. Data Location : By default, datasets are stored in . You may want to create symbolic links from if you prefer system-wide data management. Ensure you have at least 4GB of free disk space for datasets and cache files. For GPU acceleration, verify that your graphics drivers support OpenGL 4.3+. Application : Used specifically for planet textures where high fidelity is required across all latitudes and longitudes. Structure : Consists of six square images (front, back, left, right, top, bottom) that are algorithmically mapped onto a sphere. Rendering Engine : The underlying rendering engine uses these cubemap textures to apply surface details dynamically as the planet rotates or changes perspective. Low parameter count not being able to retain the information properly. Very low number of test and showcase scripts used in fine-tuning. Bad design of the training dataset. Expanding the test and showcase scripts used in Type D training. Generating more synthetic scripting examples in Type A (API reference) with better coverage of edge cases and parameter variations. Adding adversarial examples (Type C) for the most commonly confused API patterns.

0 views
Ahead of AI 1 months ago

A Visual Guide to Attention Variants in Modern LLMs

I had originally planned to write about DeepSeek V4. Since it still hasn’t been released, I used the time to work on something that had been on my list for a while, namely, collecting, organizing, and refining the different LLM architectures I have covered over the past few years. So, over the last two weeks, I turned that effort into an LLM architecture gallery (with 45 entries at the time of this writing), which combines material from earlier articles with several important architectures I had not documented yet. Each entry comes with a visual model card, and I plan to keep the gallery updated regularly. You can find the gallery here: https://sebastianraschka.com/llm-architecture-gallery/ Figure 1: Overview of the LLM architecture gallery and its visual model cards. After I shared the initial version, a few readers also asked whether there would be a poster version. So, there is now a poster version via Redbubble . I ordered the Medium size (26.9 x 23.4 in) to check how it looks in print, and the result is sharp and clear. That said, some of the smallest text elements are already quite small at that size, so I would not recommend the smaller versions if you intend to have everything readable. Figure 2: Poster version of the architecture gallery with some random objects for scale. Alongside the gallery, I was/am also working on short explainers for a few core LLM concepts. So, in this article, I thought it would be interesting to recap all the recent attention variants that have been developed and used in prominent open-weight architectures in recent years. My goal is to make the collection useful both as a reference and as a lightweight learning resource. I hope you find it useful and educational! Self-attention lets each token look at the other visible tokens in the sequence, assign them weights, and use those weights to build a new context-aware representation of the input. Multi-head attention (MHA) is the standard transformer version of that idea. It runs several self-attention heads in parallel with different learned projections, then combines their outputs into one richer representation. Figure 3: Olmo 2 as an example architecture using MHA. The sections below start with a whirlwind tour of explaining self-attention to explain MHA. It’s more meant as a quick overview to set the stage for related attention concepts like grouped-query attention, sliding window attention, and so on. If you are interested in a longer, more detailed self-attention coverage, you might like my longer Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article. EXAMPLE ARCHITECTURES GPT-2 , OLMo 2 7B , and OLMo 3 7B Attention predates transformers and MHA. Its immediate background is encoder-decoder RNNs for translation. In those older systems, an encoder RNN would read the source sentence token by token and compress it into a sequence of hidden states, or in the simplest version into one final state. Then the decoder RNN had to generate the target sentence from that limited summary. This worked for short and simple cases, but it created an obvious bottleneck once the relevant information for the next output word lived somewhere else in the input sentence. In short, the limitation is that the hidden state can’t store infinitely much information or context, and sometimes it would be useful to just refer back to the full input sequence. The translation example below shows one of the limitations of this idea. For instance, a sentence can preserve many locally reasonable word choices and still fail as a translation when the model treats the problem too much like a word-by-word mapping. (The top panel shows an exaggerated example where we translate the sentence word by word; obviously, the grammar in the resulting sentence is wrong.) In reality, the correct next word depends on sentence-level structure and on which earlier source words matter at that step. Of course, this could still be translated fine with an RNN, but it would struggle with longer sequences or knowledge retrieval tasks because the hidden state can only store so much information as mentioned earlier. Figure 4: Translation can fail even when many individual word choices look reasonable because sentence-level structure still matters (Original source LLMs-from-scratch ). The next figure shows that change more directly. When the decoder is producing an output token, it should not be limited to one compressed memory path. It should be able to reach back to the more relevant input tokens directly. Figure 5: Attention breaks the RNN bottleneck by letting the current output position revisit the full input sequence instead of relying on one compressed state alone (Original source LLMs-from-scratch ). Transformers keep that core idea from the aforementioned attention-modified RNN but remove the recurrence. In the classic Attention Is All You Need paper, attention becomes the main sequence-processing mechanism itself (instead of being just part of an RNN encoder-decoder.) In transformers, that mechanism is called self-attention, where each token in the sequence computes weights over all other tokens and uses them to mix information from those tokens into a new representation. Multi-head attention is the same mechanism run several times in parallel. For a sequence of tokens, attention needs one row of weights per token, so overall we get a matrix. Each row answers a simple question. When updating this token, how much should each visible token matter? In a decoder-only LLM, future positions are masked out, which is why the upper-right part of the matrix is grayed out in the figure below. Self-attention is fundamentally about learning these token-to-token weight patterns, under a causal mask, and then using them to build context-aware token representations. Figure 6: A concrete masked attention matrix where each row belongs to one token, each entry is an attention weight, and future-token entries are removed by the causal mask (Original source Understanding and Coding Self-Attention ). 1.4 Self-Attention Internals The next figure shows how the transformer computes the attention matrix ( ) from the input embeddings , which is then used to produce the transformed inputs ( ). Here , , and stand for queries, keys, and values. The query for a token represents what that token is looking for, the key represents what each token makes available for matching, and the value represents the information that gets mixed into the output once the attention weights have been computed. The steps are as follows: , , and are weight matrices that project the input embeddings into , , and produces the raw token-to-token relevance scores softmax converts those scores into the normalized attention matrix that we discussed in the previous section is applied to to produce the output matrix Note that the attention matrix is not a separate hand-written object. It emerges from , , and softmax. Figure 7: The full single-head pipeline, from input embeddings X to the normalized attention matrix A and output representations Z (Original source Understanding and Coding Self-Attention ). The next figure shows the same concept as the previous figure but the attention matrix computation is hidden inside the “scaled-dot-product attention” box, and we perform the computation only for one input token instead of all input tokens. This is to show a compact form of self-attention with a single head before extending this to multi-head attention in the next section. Figure 8: One attention head is already a complete mechanism. One set of learned projections produces one attention matrix and one context-aware output stream (Original source Understanding and Coding Self-Attention ). 1.5 From One Head To Multi-Head Attention One set of matrices gives us one attention head, which means one attention matrix and one output matrix . (This concept was illustrated in the previous section.) Multi-head attention simply runs several of these heads in parallel with different learned projection matrices. This is useful because different heads can specialize in different token relationships. One head might focus on short local dependencies, another on broader semantic links, and another on positional or syntactic structure. Figure 9: Multi-head attention keeps the same basic attention recipe, but repeats it across several heads in parallel so the model can learn several token-to-token patterns at once (Original source Understanding and Coding Self-Attention ). 2. Grouped-Query Attention (GQA) Grouped-query attention is an attention variant derived from standard MHA. It was introduced in the 2023 paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Joshua Ainslie and colleagues. Instead of giving every query head its own keys and values, it lets several query heads share the same key-value projections, which makes KV caching much cheaper (primarily as a memory reduction) without changing the overall decoder recipe very much. Figure 10: GQA keeps the same overall attention pattern as MHA, but collapses the number of key-value heads by sharing them across multiple query heads (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Dense: Llama 3 8B , Qwen3 4B , Gemma 3 27B , Mistral Small 3.1 24B , SmolLM3 3B , and Tiny Aya 3.35B . Sparse (Mixture-of-Experts): Llama 4 Maverick , Qwen3 235B-A22B , Step 3.5 Flash 196B , and Sarvam 30B . In my architecture comparison article , I framed GQA as the new standard replacement for classic multi-head attention (MHA). The reason is that standard MHA gives every head its own keys and values, which is more optimal from a modeling perspective but expensive once we have to keep all of that state in the KV cache during inference. In GQA, we keep a larger set of query heads, but we reduce the number of key-value heads and let multiple queries share them. That lowers both parameter count and KV-cache traffic without making drastic implementation changes like multi-head latent attention (MLA), which will be discussed later. In practice, that made and keeps it a very popular choice for labs that wanted something cheaper than MHA but simpler to implement than newer compression-heavy alternatives like MLA. GQA results in big savings in KV storage, since the fewer key-value heads we keep per layer, the less cached state we need per token. That is why GQA becomes more useful as sequence length grows. GQA is also a spectrum. If we reduce all the way down to one shared K/V group, we are effectively in multi-query attention territory, which is even cheaper but can hurt modeling quality more noticeably. The sweet spot is usually somewhere in between multi-query attention (1 shared group) and MHA (where K/V groups are equal to the number of queries), where the cache savings are large but the modeling degradation relative to MHA stays modest. Figure 11: Lower is better. Once the context window grows, KV-cache savings become more pronounced. (Original source: LLMs-from-scratch GQA materials ) 2.3 Why GQA Still Matters In 2026 More advanced variants such as MLA are becoming popular because they can offer better modeling performance at the same KV efficiency levels (e.g., as discussed in the ablation studies of the DeepSeek-V2 paper ), but they also involve a more complicated implementation and a more complicated attention stack. GQA remains appealing because it is robust, easier to implement, and also easier to train (since there are fewer hyperparameter tunings necessary, based on my experience). That is why some of the newer releases still stay deliberately classic here. E.g., in my Spring Architectures article, I mentioned that MiniMax M2.5 and Nanbeige 4.1 as models that remained very classic, using only grouped-query attention without piling on other efficiency tricks. Sarvam is a particularly useful comparison point as well: the 30B model keeps classic GQA, while the 105B version switches to MLA. Figure 12: Total KV cache sizes for 105B Sarvam (using MLA) versus 30B Sarvam (using GQA), versus using plain MHA. The motivation behind Multi-head Latent Attention (MLA) is similar to Grouped-Query Attention (GQA). Both are solutions for reducing KV-cache memory requirements. The difference between GQA and MLA is that MLA shrinks the cache by compressing what gets stored rather than by reducing how many K/Vs are stored by sharing heads. Figure 13: Unlike GQA, MLA does not reduce KV cost by grouping heads. It reduces it by caching a compressed latent representation. Note that it is also applied to the query, which is not shown for simplicity (Original source: The Big LLM Architecture Comparison ). MLA, originally proposed in the DeepSeek-V2 paper, became such a defining DeepSeek-era idea (especially after DeepSeek-V3 and R1). It is more complicated to implement than GQA, more complicated to serve, but nowadays also often more compelling once model size and context length get large enough that cache traffic starts to dominate, because at the same rate of memory reduction, it could maintain better modeling performance (more on that later). EXAMPLE ARCHITECTURES DeepSeek V3 , Kimi K2 , GLM-5 , Ling 2.5 , Mistral Large 3 , and Sarvam 105B Instead of caching full-resolution key and value tensors as in MHA and GQA, MLA stores a latent representation and reconstructs the usable state when needed. Essentially, it is a cache compression strategy embedded inside attention, as illustrated in the previous figure. The figure below shows the savings compared to regular MHA. Figure 14: Once context length grows, the savings from caching a latent representation instead of full K/V tensors become very visible (Original source: LLMs-from-scratch MLA section). 3.2 MLA Ablation Studies The DeepSeek-V2 paper provided some ablations where GQA looked worse than MHA in terms of modeling performance, while MLA held up much better and could even outperform MHA when tuned carefully. That is a much stronger justification than “it (also) saves memory.” In other words, MLA is a preferable attention mechanism for DeepSeek not just because it was efficient, but because it looked like a quality-preserving efficiency move at large scale. (But colleagues also told me that MLA only works well at a certain size. For smaller models, let’s say <100B, GQA seems to work better, or, is at least easier to tune and get right.) Figure 15: GQA drops below MHA here, while MLA remains competitive and can even slightly outperform it. Underlying paper: DeepSeek-V2 . Below is again the comparison between GQA in 30B Sarvam versus MLA in 105B Sarvam. Figure 16: GQA and MLA are solving the same bottleneck from different directions. The tradeoff is simplicity versus better modeling performance for larger models. 3.3 How MLA Spread After DeepSeek Once DeepSeek V3/R1, V3.1 etc. normalized the design after its introduction in V2, it started showing up in a second wave of architectures. Kimi K2 kept the DeepSeek recipe and scaled it up. GLM-5 adopted MLA together with DeepSeek Sparse Attention (from DeepSeek V3.2). Ling 2.5 paired MLA with a linear-attention hybrid. Sarvam released two models where the 30B model stayed with classic GQA and the 105B model switched to MLA. That last pair is particularly useful as it puts the technical-complexity discussion aside. I.e., the Sarvam team implemented both variants and deliberately chose to then use GQA for one variant and MLA for the other. So, in a sense, that makes MLA feel less like a theoretical alternative and more like a concrete architectural upgrade path once a family scales up. Sliding window attention reduces the memory and compute cost of long-context inference by limiting how many previous tokens each position can attend to. Instead of attending to the entire prefix, each token only attends to a fixed window of recent tokens around its position. Because attention is restricted to a local token neighborhood, this mechanism is often referred to as local attention. Some architectures combine these local layers with occasional global attention layers so that information can still propagate across the entire sequence. Figure 17: The conceptual shift is simple. Regular attention is global attention, while sliding-window attention is local attention. Global attention lets every token see the full prefix; SWA turns many of those layers into local attention layers (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Gemma 3 27B , OLMo 3 32B , Xiaomi MiMo-V2-Flash , Arcee Trinity , Step 3.5 Flash , and Tiny Aya Gemma 3 is still one of the clearest recent SWA examples because it is easy to compare against Gemma 2. Gemma 2 already used a hybrid attention setup with a 1:1 ratio between local and global layers and a 4096-token window. Gemma 3 pushed this further to a 5:1 ratio and reduced the window size to 1024. The key finding was not that local attention is cheaper, because that was already known. Here, the more interesting takeaway from the Gemma 3 ablation study was that using this more aggressively seemed to hurt modeling performance only slightly. The Gemma ablation study suggests that the smaller window and more aggressive local:global ratio have little effect on perplexity. Underlying paper: Gemma 3 article (Original source: The Big LLM Architecture Comparison ). 4.2 The Ratio And Window Size In practice, saying that a model “uses SWA” does not mean it relies on SWA alone. What usually matters are the local-to-global layer pattern and the attention window size. For example: Gemma 3 and Xiaomi use a 5:1 local-to-global pattern. OLMo 3 and Arcee Trinity use a 3:1 pattern. Xiaomi also uses a window size of 128, which is much smaller, and therefore more aggressive, than Gemma’s 1024. SWA is essentially a knob that can be tuned more or less aggressively. Figure 18: The long-context savings come from turning many full-attention layers into local ones, which reduces how much cached context those layers need to consider (Original source: LLMs-from-scratch SWA materials ). 4.3 Combining SWA with GQA SWA often appears together with GQA because the two ideas address different parts of the same inference problem. SWA reduces how much context a local layer has to consider. GQA reduces how much key-value state each token contributes to the cache. That is why many recent dense models use both rather than treating them as alternatives. Gemma 3 is again a good reference point here, since it combines sliding window attention with grouped-query attention in the same architecture. DeepSeek Sparse Attention is one of the architectural changes that appeared in the DeepSeek V3.2 line and later showed up again in GLM-5. Specifically, DeepSeek V3.2 combines it with Multi-head Latent Attention (MLA) , and GLM-5 adopts the same pair for the same general reason, namely, reducing inference cost when context lengths get large. EXAMPLE ARCHITECTURES DeepSeek V3.2 and GLM-5 In sliding-window attention, the current token does not attend to the full prefix but only to a fixed local window. This is the same broad idea behind DeepSeek Sparse Attention, where each token also only attends to a subset of previous tokens. However, the selected tokens are not determined by a fixed-width local window. Instead, DeepSeek Sparse Attention uses a learned sparse pattern. In short, it uses an indexer-plus-selector setup, where a lightning indexer computes relevance scores, and a token selector keeps only a smaller set of high-scoring past positions. The way the subset of tokens is selected is the main difference from sliding-window attention. Sliding-window attention hard-codes locality. DeepSeek Sparse Attention still limits attention to a subset, but it lets the model decide which prior tokens are worth revisiting. Figure 19: Similar to sliding-window attention, DeepSeek Sparse Attention also restricts each token to a subset of prior tokens, but does not do so with a fixed local window (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). 5.2 DeepSeek Sparse Attention and MLA DeepSeek V3.2 uses both Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. MLA reduces KV-cache cost by compressing what gets stored. DeepSeek Sparse Attention reduces how much of the prior context the model has to revisit. Put differently, one optimizes the cache representation, the other optimizes the attention pattern on top of it. Figure 20: DeepSeek V3.2 is the obvious reference point, because this is the model family most closely associated with the sparse-attention idea. The sparse pattern is not random. The first stage is a lightning indexer that scores previous tokens for each new query token. It uses MLA’s compressed token representations and computes a learned similarity score over the prior context, so the model can rank which earlier positions are worth revisiting. The second stage is a token selector. It keeps only a smaller high-scoring subset, for example, a top- set of past positions, and turns that subset into the sparse attention mask. So the main point is that DeepSeek Sparse Attention does not hard-code the sparsity pattern. It learns which past tokens to keep. Figure 21: The mechanism consists of a lightning indexer that scores prior tokens and a selector that keeps only a smaller subset for attention (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). DeepSeek Sparse Attention is relatively new and relatively complicated to implement, which is why it has not been so widely adopted as Grouped-Query Attention (GQA) yet. Gated attention is best understood as a modified full-attention block rather than as a separate attention family. It usually appears inside hybrid stacks that still keep an occasional full-attention layer for exact content retrieval, but add a few stability-oriented changes on top of an otherwise familiar scaled dot-product attention block. Figure 22: Trinity Large is a useful comparison because gated attention is not only a Qwen idea (more on that later). Here the gate appears after the scaled dot-product attention output and before the output projection in a different long-context architecture (Original source: A Dream of Spring for Open-Weight LLMs ). 6.1 Where Gated Attention Appears The Qwen3-Next and Qwen3.5 architectures show that recent hybrids (covered in the next section) do not replace attention everywhere. Instead, they replace most attention layers with a cheaper alternative and keep a smaller number of full-attention layers in the stack. Those remaining full-attention layers are where gated attention typically appears. Qwen3-Next and Qwen3.5 use it together with Gated DeltaNet in a 3:1 pattern. But hybrid architectures aside, Trinity uses a related gating idea in a more conventional attention stack, as shown in the previous figure above. The gated attention block in Qwen-style hybrids or Trinity (not a hybrid) is essentially standard scaled-dot-product attention with a few changes on top. In the original Gated Attention paper , those changes are presented as a way to make the retained full-attention layers behave more predictably inside a hybrid stack. The block still looks like standard (full) attention, but it adds: an output gate that scales the attention result before it is added back to the residual, a zero-centered QK-Norm variant instead of standard RMSNorm for q and k, partial RoPE. These are not changes on the scale of MLA or linear attention but merely stability and control changes applied to an otherwise familiar attention block. Figure 23: In Qwen3-Next and Qwen3.5, gated attention appears as the full-attention layer that periodically breaks up runs of Gated DeltaNet blocks. Note that the figure above also includes Gated DeltaNet, which we will cover in the next section below. Hybrid attention is a broader design pattern rather than a specific, single mechanism. The overall idea is to keep a transformer-like stack, but replace most of the expensive full-attention layers with cheaper linear or state-space sequence modules. The motivation is long-context efficiency. Full attention grows quadratically with sequence length, so once models move to contexts like 128k, 256k, or 1M tokens, attention memory and compute become expensive enough that using cheaper sequence modules in most layers while keeping only a smaller number of heavier retrieval layers starts making more sense. (Note that this comes with a bit of a modeling performance trade-off, though.) In Qwen3-Next, this pattern appears as a 3:1 mix of Gated DeltaNet and Gated Attention blocks. Gated DeltaNet is also closely related to Mamba-2 (see the Gated Delta Networks: Improving Mamba2 with Delta Rule paper, for instance), and the mechanism can be read as a DeltaNet-style fast-weight update combined with Mamba-style gating. Later architectures keep the same overall idea but swap in other lightweight sequence mixers, such as Kimi Delta Attention, Lightning Attention, or standard Mamba-2. Figure 24: The basic hybrid pattern, where most blocks are cheaper sequence mixers and every fourth block restores a heavier attention layer (Original source The Big LLM Architecture Comparison ). To my knowledge, the first prominent example of a close-to-flagship LLM with hybrid attention was Qwen3-Next in 2025, which does not remove attention completely but mixes three Gated DeltaNet blocks with one Gated Attention block. Here, lightweight Gated DeltaNet blocks do most of the long-context work and keep memory growth much flatter than full attention. The heavier gated-attention layer remains because DeltaNet is less exact at content-based retrieval. Inside a Gated DeltaNet block, the model computes query, key, and value vectors together with two learned gates (α, β). Rather than forming the usual token-to-token attention matrix, it writes to a small fast-weight memory using a delta-rule update. In rough terms, the memory stores a compressed running summary of past information, while the gates control how much new information is added and how much previous state is retained. That makes Gated DeltaNet a linear-attention or recurrent-style mechanism rather than just another tweak to MHA. Relative to Mamba-2, the close connection is that both belong to the linear-time gated sequence-model family, but Gated DeltaNet uses a DeltaNet-style fast-weight memory update instead of the Mamba state-space update. Figure 25: The practical motivation behind the hybrids is shown here in the memory curve. Hybrid stacks with Gated DeltaNet grow much more slowly with context length than ordinary full attention (Original source LLMs-from-scratch DeltaNet materials ). Qwen3.5 moves the former Qwen3-Next hybrid into Qwen’s main flagship series, which is an interesting move. This basically signals that the hybrid strategy is a success and that we may see more models with this architecture in the future. Figure 26: Qwen3.5 shows the Qwen team promoting the former Qwen3-Next side-branch into the main model line rather than leaving it as a one-off efficiency variant (Original source A Dream of Spring for Open-Weight LLMs ). 7.2 Kimi Linear And Modified Delta Attention Kimi Linear keeps the same broad transformer skeleton and the same 3:1 pattern, but it changes both halves of the recipe. On the lightweight side, Kimi Delta Attention is a refinement of Gated DeltaNet. Where Qwen3-Next uses a scalar gate per head to control memory decay, Kimi uses channel-wise gating, which gives finer control over the memory update. On the heavier side, Kimi replaces Qwen3-Next’s gated-attention layers with gated MLA layers. So, it’s still the same broader pattern as in Qwen3-Next and Qwen3.5, but both ingredients (slightly) change. I.e., most layers are still handled by a cheaper linear-style mechanism, and periodic heavier layers still remain for stronger retrieval. Figure 27: Kimi Linear keeps the same overall hybrid pattern while changing both the lightweight side and the heavier attention side of the stack (Original source The Big LLM Architecture Comparison ). 7.3 Ling 2.5 And Lightning Attention Ling 2.5 shows another swap on the lightweight side. Instead of Gated DeltaNet, Ling uses a slightly simpler recurrent linear attention variant called Lightning Attention. On the heavier side, it keeps MLA from DeepSeek. Most sequence mixing happens in the cheaper linear-attention blocks, while a smaller number of heavier layers remain to preserve stronger retrieval. The difference is that the specific lightweight mechanism is now Lightning Attention rather than DeltaNet or Kimi Delta Attention. Figure 28: Ling 2.5 and Qwen3.5 are both linear-attention hybrids, even though Ling swaps in Lightning Attention and MLA instead of the Qwen recipe (Original source A Dream of Spring for Open-Weight LLMs ). Ling 2.5 is aimed more at long-context efficiency than at absolute benchmark leadership. According to the Ling team, it was reported as substantially faster than Kimi K2 at 32k tokens, which is the practical payoff these hybrids are aiming for. Figure 29: Ling 2.5 was presented as a strong efficiency upgrade, with much higher 32k-token throughput than Kimi K2 at the same 1-trillion-parameter scale (Original source Ling 2.5 model hub page ). Nemotron And Mamba-2 Nemotron pushes the pattern further away from the transformer baseline. Nemotron 3 Nano is a Mamba-Transformer hybrid that interleaves Mamba-2 sequence-modeling blocks with sparse MoE layers and uses self-attention only in a small subset of layers. This is a more extreme version of the same basic tradeoff discussed above. Here, the lightweight sequence module is a Mamba-2 state-space block rather than a DeltaNet-style fast-weight update, but the basic tradeoff is similar. Figure 30: Nemotron 3 Nano uses Mamba-2 for most of the sequence modeling work, with self-attention only appearing in a small subset of layers (Original source The Big LLM Architecture Comparison ). The larger Nemotron 3 Super keeps the Mamba-2 hybrid attention approach and adds other efficiency-oriented changes such as latent MoE and shared-weight multi-token prediction (MTP) for speculative decoding. Figure 31: Nemotron 3 Super keeps the Mamba-2 hybrid attention pattern while adding latent MoE and shared-weight MTP on top (Original source The Big LLM Architecture Comparison ). Conclusion Of course, there are many more (mostly niche) attention variants throughout the literature that I haven’t covered here. The focus of this article was on those that are currently used in state-of-the-art (open-weight) models. In particular, I am looking forward to (1) seeing the brand new Mamba-3 layers getting integrated into the aforementioned hybrid architectures (replacing Gated DeltaNet) and (2) attention residuals being used in general. In practice, you may also wonder what the “best” architecture is at the moment. This is hard to answer, as there are no public experiments that train different architectures on the same training data etc. Hence, we can currently only answer what the best (trained) model choice is for a given problem. In my opinion, hybrid architectures are still a novelty, and the main selling point is mainly (long-context) efficiency versus just modeling performance. Hence, I think they are a great candidate for agent contexts (like OpenClaw). Personally, I think the problem with hybrid architectures is also that the inference stacks are not quite as optimized, yet, and I find that I get better tok/sec throughput when running LLMs locally using more classic setups like GPT-OSS with grouped-query attention. Anyways, I am curious to see what DeepSeek V4 has in store, since DeepSeek has been quite the reliable trend-setter in the recent 2 years. Figure 1: Overview of the LLM architecture gallery and its visual model cards. After I shared the initial version, a few readers also asked whether there would be a poster version. So, there is now a poster version via Redbubble . I ordered the Medium size (26.9 x 23.4 in) to check how it looks in print, and the result is sharp and clear. That said, some of the smallest text elements are already quite small at that size, so I would not recommend the smaller versions if you intend to have everything readable. Figure 2: Poster version of the architecture gallery with some random objects for scale. Alongside the gallery, I was/am also working on short explainers for a few core LLM concepts. So, in this article, I thought it would be interesting to recap all the recent attention variants that have been developed and used in prominent open-weight architectures in recent years. My goal is to make the collection useful both as a reference and as a lightweight learning resource. I hope you find it useful and educational! 1. Multi-Head Attention (MHA) Self-attention lets each token look at the other visible tokens in the sequence, assign them weights, and use those weights to build a new context-aware representation of the input. Multi-head attention (MHA) is the standard transformer version of that idea. It runs several self-attention heads in parallel with different learned projections, then combines their outputs into one richer representation. Figure 3: Olmo 2 as an example architecture using MHA. The sections below start with a whirlwind tour of explaining self-attention to explain MHA. It’s more meant as a quick overview to set the stage for related attention concepts like grouped-query attention, sliding window attention, and so on. If you are interested in a longer, more detailed self-attention coverage, you might like my longer Understanding and Coding Self-Attention, Multi-Head Attention, Causal-Attention, and Cross-Attention in LLMs article. EXAMPLE ARCHITECTURES GPT-2 , OLMo 2 7B , and OLMo 3 7B 1.2 Historical Tidbits And Why Attention Was Invented Attention predates transformers and MHA. Its immediate background is encoder-decoder RNNs for translation. In those older systems, an encoder RNN would read the source sentence token by token and compress it into a sequence of hidden states, or in the simplest version into one final state. Then the decoder RNN had to generate the target sentence from that limited summary. This worked for short and simple cases, but it created an obvious bottleneck once the relevant information for the next output word lived somewhere else in the input sentence. In short, the limitation is that the hidden state can’t store infinitely much information or context, and sometimes it would be useful to just refer back to the full input sequence. The translation example below shows one of the limitations of this idea. For instance, a sentence can preserve many locally reasonable word choices and still fail as a translation when the model treats the problem too much like a word-by-word mapping. (The top panel shows an exaggerated example where we translate the sentence word by word; obviously, the grammar in the resulting sentence is wrong.) In reality, the correct next word depends on sentence-level structure and on which earlier source words matter at that step. Of course, this could still be translated fine with an RNN, but it would struggle with longer sequences or knowledge retrieval tasks because the hidden state can only store so much information as mentioned earlier. Figure 4: Translation can fail even when many individual word choices look reasonable because sentence-level structure still matters (Original source LLMs-from-scratch ). The next figure shows that change more directly. When the decoder is producing an output token, it should not be limited to one compressed memory path. It should be able to reach back to the more relevant input tokens directly. Figure 5: Attention breaks the RNN bottleneck by letting the current output position revisit the full input sequence instead of relying on one compressed state alone (Original source LLMs-from-scratch ). Transformers keep that core idea from the aforementioned attention-modified RNN but remove the recurrence. In the classic Attention Is All You Need paper, attention becomes the main sequence-processing mechanism itself (instead of being just part of an RNN encoder-decoder.) In transformers, that mechanism is called self-attention, where each token in the sequence computes weights over all other tokens and uses them to mix information from those tokens into a new representation. Multi-head attention is the same mechanism run several times in parallel. 1.3 The Masked Attention Matrix For a sequence of tokens, attention needs one row of weights per token, so overall we get a matrix. Each row answers a simple question. When updating this token, how much should each visible token matter? In a decoder-only LLM, future positions are masked out, which is why the upper-right part of the matrix is grayed out in the figure below. Self-attention is fundamentally about learning these token-to-token weight patterns, under a causal mask, and then using them to build context-aware token representations. Figure 6: A concrete masked attention matrix where each row belongs to one token, each entry is an attention weight, and future-token entries are removed by the causal mask (Original source Understanding and Coding Self-Attention ). 1.4 Self-Attention Internals The next figure shows how the transformer computes the attention matrix ( ) from the input embeddings , which is then used to produce the transformed inputs ( ). Here , , and stand for queries, keys, and values. The query for a token represents what that token is looking for, the key represents what each token makes available for matching, and the value represents the information that gets mixed into the output once the attention weights have been computed. The steps are as follows: , , and are weight matrices that project the input embeddings into , , and produces the raw token-to-token relevance scores softmax converts those scores into the normalized attention matrix that we discussed in the previous section is applied to to produce the output matrix Figure 7: The full single-head pipeline, from input embeddings X to the normalized attention matrix A and output representations Z (Original source Understanding and Coding Self-Attention ). The next figure shows the same concept as the previous figure but the attention matrix computation is hidden inside the “scaled-dot-product attention” box, and we perform the computation only for one input token instead of all input tokens. This is to show a compact form of self-attention with a single head before extending this to multi-head attention in the next section. Figure 8: One attention head is already a complete mechanism. One set of learned projections produces one attention matrix and one context-aware output stream (Original source Understanding and Coding Self-Attention ). 1.5 From One Head To Multi-Head Attention One set of matrices gives us one attention head, which means one attention matrix and one output matrix . (This concept was illustrated in the previous section.) Multi-head attention simply runs several of these heads in parallel with different learned projection matrices. This is useful because different heads can specialize in different token relationships. One head might focus on short local dependencies, another on broader semantic links, and another on positional or syntactic structure. Figure 9: Multi-head attention keeps the same basic attention recipe, but repeats it across several heads in parallel so the model can learn several token-to-token patterns at once (Original source Understanding and Coding Self-Attention ). 2. Grouped-Query Attention (GQA) Grouped-query attention is an attention variant derived from standard MHA. It was introduced in the 2023 paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Joshua Ainslie and colleagues. Instead of giving every query head its own keys and values, it lets several query heads share the same key-value projections, which makes KV caching much cheaper (primarily as a memory reduction) without changing the overall decoder recipe very much. Figure 10: GQA keeps the same overall attention pattern as MHA, but collapses the number of key-value heads by sharing them across multiple query heads (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Dense: Llama 3 8B , Qwen3 4B , Gemma 3 27B , Mistral Small 3.1 24B , SmolLM3 3B , and Tiny Aya 3.35B . Sparse (Mixture-of-Experts): Llama 4 Maverick , Qwen3 235B-A22B , Step 3.5 Flash 196B , and Sarvam 30B . 2.1 Why GQA Became Popular In my architecture comparison article , I framed GQA as the new standard replacement for classic multi-head attention (MHA). The reason is that standard MHA gives every head its own keys and values, which is more optimal from a modeling perspective but expensive once we have to keep all of that state in the KV cache during inference. In GQA, we keep a larger set of query heads, but we reduce the number of key-value heads and let multiple queries share them. That lowers both parameter count and KV-cache traffic without making drastic implementation changes like multi-head latent attention (MLA), which will be discussed later. In practice, that made and keeps it a very popular choice for labs that wanted something cheaper than MHA but simpler to implement than newer compression-heavy alternatives like MLA. 2.2 GQA Memory Savings GQA results in big savings in KV storage, since the fewer key-value heads we keep per layer, the less cached state we need per token. That is why GQA becomes more useful as sequence length grows. GQA is also a spectrum. If we reduce all the way down to one shared K/V group, we are effectively in multi-query attention territory, which is even cheaper but can hurt modeling quality more noticeably. The sweet spot is usually somewhere in between multi-query attention (1 shared group) and MHA (where K/V groups are equal to the number of queries), where the cache savings are large but the modeling degradation relative to MHA stays modest. Figure 11: Lower is better. Once the context window grows, KV-cache savings become more pronounced. (Original source: LLMs-from-scratch GQA materials ) 2.3 Why GQA Still Matters In 2026 More advanced variants such as MLA are becoming popular because they can offer better modeling performance at the same KV efficiency levels (e.g., as discussed in the ablation studies of the DeepSeek-V2 paper ), but they also involve a more complicated implementation and a more complicated attention stack. GQA remains appealing because it is robust, easier to implement, and also easier to train (since there are fewer hyperparameter tunings necessary, based on my experience). That is why some of the newer releases still stay deliberately classic here. E.g., in my Spring Architectures article, I mentioned that MiniMax M2.5 and Nanbeige 4.1 as models that remained very classic, using only grouped-query attention without piling on other efficiency tricks. Sarvam is a particularly useful comparison point as well: the 30B model keeps classic GQA, while the 105B version switches to MLA. Figure 12: Total KV cache sizes for 105B Sarvam (using MLA) versus 30B Sarvam (using GQA), versus using plain MHA. 3. Multi-Head Latent Attention (MLA) The motivation behind Multi-head Latent Attention (MLA) is similar to Grouped-Query Attention (GQA). Both are solutions for reducing KV-cache memory requirements. The difference between GQA and MLA is that MLA shrinks the cache by compressing what gets stored rather than by reducing how many K/Vs are stored by sharing heads. Figure 13: Unlike GQA, MLA does not reduce KV cost by grouping heads. It reduces it by caching a compressed latent representation. Note that it is also applied to the query, which is not shown for simplicity (Original source: The Big LLM Architecture Comparison ). MLA, originally proposed in the DeepSeek-V2 paper, became such a defining DeepSeek-era idea (especially after DeepSeek-V3 and R1). It is more complicated to implement than GQA, more complicated to serve, but nowadays also often more compelling once model size and context length get large enough that cache traffic starts to dominate, because at the same rate of memory reduction, it could maintain better modeling performance (more on that later). EXAMPLE ARCHITECTURES DeepSeek V3 , Kimi K2 , GLM-5 , Ling 2.5 , Mistral Large 3 , and Sarvam 105B 3.1 Compression, Not Sharing Instead of caching full-resolution key and value tensors as in MHA and GQA, MLA stores a latent representation and reconstructs the usable state when needed. Essentially, it is a cache compression strategy embedded inside attention, as illustrated in the previous figure. The figure below shows the savings compared to regular MHA. Figure 14: Once context length grows, the savings from caching a latent representation instead of full K/V tensors become very visible (Original source: LLMs-from-scratch MLA section). 3.2 MLA Ablation Studies The DeepSeek-V2 paper provided some ablations where GQA looked worse than MHA in terms of modeling performance, while MLA held up much better and could even outperform MHA when tuned carefully. That is a much stronger justification than “it (also) saves memory.” In other words, MLA is a preferable attention mechanism for DeepSeek not just because it was efficient, but because it looked like a quality-preserving efficiency move at large scale. (But colleagues also told me that MLA only works well at a certain size. For smaller models, let’s say <100B, GQA seems to work better, or, is at least easier to tune and get right.) Figure 15: GQA drops below MHA here, while MLA remains competitive and can even slightly outperform it. Underlying paper: DeepSeek-V2 . Below is again the comparison between GQA in 30B Sarvam versus MLA in 105B Sarvam. Figure 16: GQA and MLA are solving the same bottleneck from different directions. The tradeoff is simplicity versus better modeling performance for larger models. 3.3 How MLA Spread After DeepSeek Once DeepSeek V3/R1, V3.1 etc. normalized the design after its introduction in V2, it started showing up in a second wave of architectures. Kimi K2 kept the DeepSeek recipe and scaled it up. GLM-5 adopted MLA together with DeepSeek Sparse Attention (from DeepSeek V3.2). Ling 2.5 paired MLA with a linear-attention hybrid. Sarvam released two models where the 30B model stayed with classic GQA and the 105B model switched to MLA. That last pair is particularly useful as it puts the technical-complexity discussion aside. I.e., the Sarvam team implemented both variants and deliberately chose to then use GQA for one variant and MLA for the other. So, in a sense, that makes MLA feel less like a theoretical alternative and more like a concrete architectural upgrade path once a family scales up. 4. Sliding Window Attention (SWA) Sliding window attention reduces the memory and compute cost of long-context inference by limiting how many previous tokens each position can attend to. Instead of attending to the entire prefix, each token only attends to a fixed window of recent tokens around its position. Because attention is restricted to a local token neighborhood, this mechanism is often referred to as local attention. Some architectures combine these local layers with occasional global attention layers so that information can still propagate across the entire sequence. Figure 17: The conceptual shift is simple. Regular attention is global attention, while sliding-window attention is local attention. Global attention lets every token see the full prefix; SWA turns many of those layers into local attention layers (Original source: The Big LLM Architecture Comparison ). EXAMPLE ARCHITECTURES Gemma 3 27B , OLMo 3 32B , Xiaomi MiMo-V2-Flash , Arcee Trinity , Step 3.5 Flash , and Tiny Aya 4.1 Gemma 3 As A Reference Point Gemma 3 is still one of the clearest recent SWA examples because it is easy to compare against Gemma 2. Gemma 2 already used a hybrid attention setup with a 1:1 ratio between local and global layers and a 4096-token window. Gemma 3 pushed this further to a 5:1 ratio and reduced the window size to 1024. The key finding was not that local attention is cheaper, because that was already known. Here, the more interesting takeaway from the Gemma 3 ablation study was that using this more aggressively seemed to hurt modeling performance only slightly. The Gemma ablation study suggests that the smaller window and more aggressive local:global ratio have little effect on perplexity. Underlying paper: Gemma 3 article (Original source: The Big LLM Architecture Comparison ). 4.2 The Ratio And Window Size In practice, saying that a model “uses SWA” does not mean it relies on SWA alone. What usually matters are the local-to-global layer pattern and the attention window size. For example: Gemma 3 and Xiaomi use a 5:1 local-to-global pattern. OLMo 3 and Arcee Trinity use a 3:1 pattern. Xiaomi also uses a window size of 128, which is much smaller, and therefore more aggressive, than Gemma’s 1024. Figure 18: The long-context savings come from turning many full-attention layers into local ones, which reduces how much cached context those layers need to consider (Original source: LLMs-from-scratch SWA materials ). 4.3 Combining SWA with GQA SWA often appears together with GQA because the two ideas address different parts of the same inference problem. SWA reduces how much context a local layer has to consider. GQA reduces how much key-value state each token contributes to the cache. That is why many recent dense models use both rather than treating them as alternatives. Gemma 3 is again a good reference point here, since it combines sliding window attention with grouped-query attention in the same architecture. 5. DeepSeek Sparse Attention (DSA) DeepSeek Sparse Attention is one of the architectural changes that appeared in the DeepSeek V3.2 line and later showed up again in GLM-5. Specifically, DeepSeek V3.2 combines it with Multi-head Latent Attention (MLA) , and GLM-5 adopts the same pair for the same general reason, namely, reducing inference cost when context lengths get large. EXAMPLE ARCHITECTURES DeepSeek V3.2 and GLM-5 5.1 Changes Relative To Sliding-Window Attention In sliding-window attention, the current token does not attend to the full prefix but only to a fixed local window. This is the same broad idea behind DeepSeek Sparse Attention, where each token also only attends to a subset of previous tokens. However, the selected tokens are not determined by a fixed-width local window. Instead, DeepSeek Sparse Attention uses a learned sparse pattern. In short, it uses an indexer-plus-selector setup, where a lightning indexer computes relevance scores, and a token selector keeps only a smaller set of high-scoring past positions. The way the subset of tokens is selected is the main difference from sliding-window attention. Sliding-window attention hard-codes locality. DeepSeek Sparse Attention still limits attention to a subset, but it lets the model decide which prior tokens are worth revisiting. Figure 19: Similar to sliding-window attention, DeepSeek Sparse Attention also restricts each token to a subset of prior tokens, but does not do so with a fixed local window (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). 5.2 DeepSeek Sparse Attention and MLA DeepSeek V3.2 uses both Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention. MLA reduces KV-cache cost by compressing what gets stored. DeepSeek Sparse Attention reduces how much of the prior context the model has to revisit. Put differently, one optimizes the cache representation, the other optimizes the attention pattern on top of it. Figure 20: DeepSeek V3.2 is the obvious reference point, because this is the model family most closely associated with the sparse-attention idea. The sparse pattern is not random. The first stage is a lightning indexer that scores previous tokens for each new query token. It uses MLA’s compressed token representations and computes a learned similarity score over the prior context, so the model can rank which earlier positions are worth revisiting. The second stage is a token selector. It keeps only a smaller high-scoring subset, for example, a top- set of past positions, and turns that subset into the sparse attention mask. So the main point is that DeepSeek Sparse Attention does not hard-code the sparsity pattern. It learns which past tokens to keep. Figure 21: The mechanism consists of a lightning indexer that scores prior tokens and a selector that keeps only a smaller subset for attention (Original source: From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates ). DeepSeek Sparse Attention is relatively new and relatively complicated to implement, which is why it has not been so widely adopted as Grouped-Query Attention (GQA) yet. 6. Gated Attention Gated attention is best understood as a modified full-attention block rather than as a separate attention family. It usually appears inside hybrid stacks that still keep an occasional full-attention layer for exact content retrieval, but add a few stability-oriented changes on top of an otherwise familiar scaled dot-product attention block. Figure 22: Trinity Large is a useful comparison because gated attention is not only a Qwen idea (more on that later). Here the gate appears after the scaled dot-product attention output and before the output projection in a different long-context architecture (Original source: A Dream of Spring for Open-Weight LLMs ). 6.1 Where Gated Attention Appears The Qwen3-Next and Qwen3.5 architectures show that recent hybrids (covered in the next section) do not replace attention everywhere. Instead, they replace most attention layers with a cheaper alternative and keep a smaller number of full-attention layers in the stack. Those remaining full-attention layers are where gated attention typically appears. Qwen3-Next and Qwen3.5 use it together with Gated DeltaNet in a 3:1 pattern. But hybrid architectures aside, Trinity uses a related gating idea in a more conventional attention stack, as shown in the previous figure above. 6.2 Gated Attention Relative To Standard Attention The gated attention block in Qwen-style hybrids or Trinity (not a hybrid) is essentially standard scaled-dot-product attention with a few changes on top. In the original Gated Attention paper , those changes are presented as a way to make the retained full-attention layers behave more predictably inside a hybrid stack. The block still looks like standard (full) attention, but it adds: an output gate that scales the attention result before it is added back to the residual, a zero-centered QK-Norm variant instead of standard RMSNorm for q and k, partial RoPE.

0 views
Simon Willison 1 months ago

GPT-5.4 mini and GPT-5.4 nano, which can describe 76,000 photos for $52

OpenAI today: Introducing GPT‑5.4 mini and nano . These models join GPT-5.4 which was released two weeks ago . OpenAI's self-reported benchmarks show the new 5.4-nano out-performing their previous GPT-5 mini model when run at maximum reasoning effort. The new mini is also 2x faster than the previous mini. Here's how the pricing looks - all prices are per million tokens. is notably even cheaper than Google's Gemini 3.1 Flash-Lite: I used GPT-5.4 nano to generate a description of this photo I took at the John M. Mossman Lock Collection : Here's the output: The image shows the interior of a museum gallery with a long display wall. White-painted brick walls are covered with many framed portraits arranged in neat rows. Below the portraits, there are multiple glass display cases with dark wooden frames and glass tops/fronts, containing various old historical objects and equipment. The room has a polished wooden floor, hanging ceiling light fixtures/cords, and a few visible pipes near the top of the wall. In the foreground, glass cases run along the length of the room, reflecting items from other sections of the gallery. That took 2,751 input tokens and 112 output tokens, at a cost of 0.069 cents (less than a tenth of a cent). That means describing every single photo in my 76,000 photo collection would cost around $52.44. I released llm 0.29 with support for the new models. Then I had OpenAI Codex loop through all five reasoning effort levels and all three models and produce this combined SVG grid of pelicans riding bicycles ( generation transcripts here ). I do like the gpt-5.4 xhigh one the best, it has a good bicycle (with nice spokes) and the pelican has a fish in its beak! You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Giles's blog 2 months ago

Writing an LLM from scratch, part 32e -- Interventions: the learning rate

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In my training code, I have this code to create the optimiser: The values in there -- for the learning rate, and for the weight decay -- were just copied from the tiny training run that we do in section 5.2 of the book. What do those values actually mean, and are those really the right values for them? I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before. The weight decay was pretty much an unknown for me -- I know it is a parameter controlling the behaviour of the optimiser, but I don't know how it does that. In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later. If you're reading this blog, you almost certainly know what the learning rate is, but let's go over it briefly to build a solid foundation. The way it's normally explained, using simple gradient descent, goes something like this. Let's assume that we're training a model with just one parameter, and it starts off set to − 5 . We run some training data through, and get a loss, let's say 44.44: We don't know what shape our loss curve is (if we did, we might be able to find the lowest loss algebraically), but we do know the differential of the parameter versus the loss at the point we've measured; it happens to be -13. That is reasonably large and negative: We use that information to say that we want to move in the direction of a larger value for our parameter -- that is, in our case where the gradient is negative, so we have a downhill slope towards the right, we want to increase the parameter to move rightwards on that chart, whereas if it were positive (an uphill slope) we'd want to decrease the parameter to move leftwards. Simply subtracting the gradient from the parameter would lead to an update in the right direction, but it would be a very large one in this case -- we'd move 13 units to the right -- so we multiply the gradient by a small positive number, the learning rate (often written as a lower-case eta, like this: η ), to move a small distance in that direction. Let's say η = 0.3 . That means we want to update our parameter: So now we run that through and get a new loss -- let's say it's 9.06 -- and a new gradient, which happens to be -5.2. Now we can do another update, and our parameter will become 0.46, so we use that and work out another loss and gradient, which come to 3.3816 and -2.08. Let's plot that one, but this time we'll draw back the veil and show the actual loss curve. Now, it's worth reiterating that while we're training this model we don't know what that curve looks like -- we're just finding points on it, along with its gradient at those points, and using that information to work out which parameter value to explore next. But it's pretty clear that as we continue, if the learning rate is set correctly, we'll get to the minimum eventually if the learning rate is the right kind of size, because -- due to the nice smooth U-shape of the curve, the gradient gets smaller the closer we get to the minimum 1 . It's also pretty clear that if the learning rate is smaller than an optimal value, in this simple case we will still find the right point, but it will take more steps because each one is smaller: And, of course, if the learning rate is too high, we might never converge -- we'd "bounce out of" the dip, and wind up with a parameter value that endlessly cycles between increasingly smaller and increasingly larger values, zooming off to infinity: OK, that's the basics. Why might we want to change from something that seems so logical and simple? A few paragraphs back I said: due to the nice smooth U-shape of the curve, the gradient gets smaller the closer we get to the minimum What if it doesn't? Imagine if we had something more like a V-shaped curve, like this: The gradient does not decrease as we get closer to the minimum, and so while we're in the downward-sloping part, each update is exactly the same distance: Now, eventually we'll jump over the minimum: In this example, I've used a gradient of − 8.33 on the downward-sloping part of the curve, and + 8.33 on the upward-sloping part, so that means that our next update just bounces us back to where we were before! Because the gradient isn't decreasing the closer we get to the minimum, we wind up just oscillating around it. That's not very helpful. That's a slightly contrived example (though not entirely -- intuitively, with functions like ReLU or GELU in our real LLMs, it's easy to imagine crazy loss landscapes). But it does show that perhaps we might want to add in our own "artificial" way to decrease the size of the steps we take over the course of training our model rather than just relying on the gradients naturally flattening out for us. Another way of looking at things is that as the model gets trained, we don't want batches of very new-looking data to cause big updates, taking us away from what was a good part of the loss landscape in terms of what we've seen so far. For example, imagine you've been training an LLM on a bunch of documents, which have so far been in English. Halfway through, it encounters a document in Byzantine Greek, the loss skyrockets, and you do a big update. That would be a problem! You might want it to learn a bit from it to push it slightly in a "the world is multi-lingual" direction, but you don't want it to lose a big chunk of the value from its previous training. You might also see a kind of connection to the way that people learn over the course of their lives -- for babies, everything is new and they "update their parameters" constantly as they try to understand the world. Children are still pretty flexible, but as we get older we tend to update our beliefs less and less. That's not always optimal, but as a heuristic it's pretty adaptive. Anyway, in general: for most training runs, we're going to want the learning rate to adjust over time. Most of the time this will be by reducing it, though there can be cases for increasing it again for periods. The general case of doing this is called "learning rate scheduling". There are a bunch of ways that people adjust the learning rate over the course of a train; here are a few that cropped up a lot while I was researching this. If we want the learning rate to go down over time, and we know how many steps we're training for, we can just set it to (say) 0.0004 for the first quarter of our train, then 0.0002 for the next, then 0.0001, then finish off with 0.00005, like this: That can work pretty well! But there is one obvious oddity -- the big step changes in learning rate mean that the exact placement of the drops and the training data before and after can matter. Why are we treating the data and the state of the model immediately before and immediately after so differently? It would make more sense to have a smoother schedule. What functions decay smoothly like that? An exponential curve does: let's say we just multiply the learning rate by a number that is a little smaller than one every step, so that it drops smoothly like this: But there are lots of other curves like that, and one is particularly interesting: As you change θ from 0 to π , the value of cos θ goes smoothly from 1 to − 1 , so it's easy enough to rescale that so that our learning rate follows the same curve: This is called a "cosine annealing" or "cosine decay" schedule, and was apparently inspired by the algorithms used for simulated annealing (an optimisation algorithm that was in turn inspired by how the atomic structures form in metals as they cool -- another one for the list of things to look into in the future...) That solves the mystery from earlier: the cosine that the Chinchilla paper was talking about was exactly this. As it turns out, the cosine decay scheduling curve is quite popular in deep learning, because it has what amounts to two well-defined phases -- an initial high learning rate where lots of exploration of the loss landscape can happen, followed by a smooth transition to something more like fine-tuning to optimise the location in whatever part of the loss landscape we've wound up in. Now, all of the above are assuming that we want the learning rate to start high and finish low, so that we can mimic the textbook gradient descent that we had at the start of this post. Intuitively that feels nice, but on further thought, the important thing is really that we have a low learning rate at the end of the train, so that we can find as close a point as possible for the minimum at the part of the loss landscape we've found ourselves in. But perhaps there's a case for having both high and low periods during the train, so that we don't get stuck in a local minimum -- something to jolt us out of where we were every now and then? 2 With a step function, that's easy: you could, for example, do this: With an exponential, you could do something like this: With cosine decay, of course, things are even easier, because the cosine function is inherently cyclical, so we can just do this: However, at least for our purposes, training an LLM using a Chinchilla-optimal number of training tokens, it makes sense to be guided by what the authors of the Chinchilla paper did. Appendix B says: We find that setting the cosine cycle length too much longer than the target number of training steps results in sub-optimally trained models, as shown in Figure A1. As a result, we assume that an optimally trained model will have the cosine cycle length correctly calibrated to the maximum number of steps, given the FLOP budget; we follow this rule in our main analysis. So, at this point, I think we have one important part of the intervention we want to make: we want to use a cosine learning rate scheduler, going from high near the start of the training run, down to low at the end over one cycle. Additionally, and also from appendix B in the paper: we use a 10x learning rate decay in line with Rae et al. (2021) ...which means that if our learning rate starts at η , then we want it to decay down to η / 10 by the end. So, we just need to work out an initial value for η , and let it rip, right? Well, not so fast... When our model is uninitialised, right at the start of the train, gradients are going to be pretty wild. It's going to be making random errors all of the time, and we'll be making huge jumps across the loss landscape. That sounds bad. Additionally those kind of wild jumps can get the optimiser into a -- well, sub-optimal -- state. I haven't read enough about optimisers yet to have a solid handle on that, but that can wait -- intuitively it makes some kind of sense that erratic gradient updates might confuse it. So, it makes a certain amount of sense to start off with a low learning rate so that we don't do that, and then to increase it gradually to the peak, and only then to schedule the gradual cosine decay. According to this (rather nice looking) masterclass on LLM training , it's typical to do this over "a few thousand steps or a small percentage (e.g., 1-10%) of the total training steps, depending on the dataset size and batch size", and we would just use a linear increase over that period: I think we should do that; a simple linear warmup at the start -- let's relatively arbitrarily say 5% of our training steps going up to our desired peak learning rate. So our learning rate schedule should look something like this: So far I've written a lot about how we vary the learning rate over time, and that's all been very useful. But we still need to know what the value should be initially! In smaller-scale experiments you might just try a bunch of different numbers to see what worked well, but at more than US$30 per train, that's not practical here. Unfortunately it's really quite hard to find good suggestions published anywhere. The GPT-2 paper is (as usual) reticent: The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText ...and if you search for "learning rate training llm", you'll see lots of results for when people are fine-tuning existing LLMs ( 2 × 10 − 4 comes up a lot), but almost nothing about when you're training one from scratch. I eventually came across this (long!) post from Hugging Face , which I definitely need to spend time going through in the future, because it covers a lot of the ground I've been going over in this post series. But for this post, I think the most relevant part is in the section " Scaling Laws for Hyperparameters ", where they include a figure from this DeepSeek paper . Here it is, with some of the (also relevant) surrounding text: In our trains we're using something like 5 × 10 18 total FLOPs. Now, they are specifically charting things in terms of non-embedding FLOPs, but I'm going to play a little fast and loose here and ignore that, so reading off their chart, that looks like we should be using about 1.4 × 10 − 3 as our learning rate. We can double-check that against their formula, where C is the compute budget: Nice, a close match! However, it's definitely worth noting that we're using a simple GPT-2 architecture, and they are using something quite different -- RMSNorm instead of LayerNorm, SwiGLU as the activation function on the feed-forward networks, Rotary Position Embedding rather than the fixed ones we're using, and so on. As a sanity check: you can see that they also give a formula for the optimal batch size in terms of tokens. For our FLOP budget, that comes in at 381,782, which is about 373 of our 1,024-token sequences. That is quite a lot higher than the 97-or-so sequences that we appeared to be optimal in our earlier experiments . That is a little concerning, though of course the 97 number came out of a very ad-hoc bit of curve-fitting. For now, I'm going to hope that that doesn't matter too much for the learning rate. This may come back to bite me; if the results of a train with 1.4 × 10 − 3 are radically worse than the existing rate of 4 × 10 − 4 , I'll have to do a bit more investigation. So, now I think we have all of the theoretical pieces in place to do a train. Let's move on to the practicalities. We started by looking at this: What should we change -- disregarding the until the next post? Based on the above, we want to do a linear warmup of about 5% of our steps, going up to a learning rate of 1.4 × 10 − 3 , followed by a cosine decay down to one tenth of that, 1.4 × 10 − 4 . What does that look like in code? The relevant API for scheduling the learning rate in PyTorch is, logically enough, in the module, and there are a bunch of different scheduling classes. You create your optimiser, then create a scheduler for the shape you want, and then you can call on the scheduler (after the on the optimiser) to adjust the optimiser's learning rate over time. Let's make that more concrete; one of the schedulers is , which is what we'll need for our linear warmup period. It takes as its parameters: Let's say that we want to go from almost-zero to our optimiser's learning rate over 1,600 steps -- we'd create our scheduler like this: ...then in our training loop, after we've done the scaled step of the optimiser, we'd also step the scheduler: This confused me a little bit the first time I saw it; after all, if the scheduler hasn't been "triggered" when we step the optimiser, how does the optimiser know what learning rate to use? Surely it would just use whatever it was initialised with? The answer is that when you create the optimiser, it stores away the learning rate that you give it in two places -- an "initial learning rate" and a "current learning rate". Next, when you create your scheduler, it uses the initial learning rate to work out the start and end values, and then sets the current one to the start value immediately. Just by creating a scheduler, you're changing the optimiser's current learning rate -- but not the initial one, which is important, as we'll see in a moment. So, we have a scheduler that handles our warmup period nicely. Another scheduler that's relevant to our interests is the CosineAnnealingLR . This takes: On creation, this scheduler will read in the optimiser's initial learning rate -- note, not the current one -- and then the first time it's stepped, it will set the current learning rate to that value, and then for steps after that it will reduce it so that it follows a nice cosine decay, reaching after steps. So those two cover the two regimes that we want -- the warmup and then the cosine decay. But now we need to put them together; we want to do one and then the other. There's a very useful class, , which allows you to chain schedulers and tell it when each one takes over from the previous one. Let's sketch out some code to use that to do a train with our new peak learning rate of 1.4 × 10 − 3 , a warmup of 1,600 steps, followed by a cosine decay for the next 32,000 steps to one tenth of the peak learning rate: That actually works quite nicely! I wrote a dummy training loop to plot the current learning rate over a fake train using code like the above , and got this: ...with the output confirming that the values were good at the "milestone" point, the start and the end: I was initially a bit surprised by that, as at the time I ran it, I didn't realise that there was that split between the initial and the current learning rates on the optimiser, so I thought that the cosine scheduler would pick up whatever tiny starting value the warmup scheduler had overwritten the optimiser's learning rate with -- but that split saves the day. That means that now we have the outline of how to schedule our learning rate. But before we can put that into the code, we need to think about how it affects our checkpoints. Just like the scheduler and the optimiser, the learning rate scheduler -- or, indeed, our two schedulers here -- contain information about the state of the train. That means that if we recover from a checkpoint, we need to provide them with the information they need. If we just created them afresh, they'd start from the beginning -- for example, if we restarted from step 20,000 in a train like the one above, we'd start a new warmup from pretty much zero, and then start a fresh cosine decay. That would be bad: (Dummy test code here .) Now, we could use the parameter to initialize them with the correct current global step. But they have a state dict, like most other PyTorch objects, so the simplest thing to do is just to write that to another checkpoint file: ...and then load it likewise: (Dummy test code here .) Conveniently, if you save the state dict of a , it will also include the state of all of its component schedulers, and likewise if you reload it, it will load the components' states back in too. The one thing you have to be careful about is what they warn about in the PyTorch docs: Initializing a scheduler overwrites its optimizer’s s. When restoring a checkpoint, initialize the scheduler before calling your optimizer's to avoid overwriting the loaded learning rates. Luckily enough, in our code as it stands, we create all of the things that are checkpointed -- the optimiser and the scaler so far, but shortly the scheduler as well -- before we load in the state dicts, so that drops out quite nicely. So, we have some sketched-out code -- it's time to put it in place for the real training run. I won't go through the details of the changes to my existing DDP training code, though you can see the diff here if you're interested. Much of the complexity was due to keeping backward compatibility so that we don't have to always use a learning rate scheduler; remember that in this mini-series, I'm trying making various changes ("interventions") to the training loop in isolation, seeing whether each one improves things. So it's important to be able to easily train with or without learning rate scheduling; I did that with a flag in the Implementation-wise, initially I was thinking that it would be easiest to always have a scheduler, and in the "non-scheduled" case to just set it to a linear one that didn't change the value over the course of the train. But in the end it turned out to be easier to use as being the switch to tell the training loop which "mode" it was in. The placement of the code to create the schedulers was also a little tricky; the "natural" place was just after the optimiser is created, like it is in the example code above. However, at that point, we don't know how many global steps we're going to have in the train, because we don't have the dataset -- which means that working out the numbers to pass in to the schedulers for the warmup and decay steps would be impossible. It turned out to be easiest to put it in the function , just after the datasets are loaded, as at that point we have all of the information we need. Anyway, that's the code done, so let's see what happens! I wanted to do two trains; one with the learning rate scheduling, and one with just the new value for the learning rate, instead of . I was expecting the updated learning rate alone to be too high and to cause a very choppy train, but had high hopes for the train with the scheduling. Here's how it did; the scheduled learning rate train first: Here's what the training loss looked like over that: Quite a few loss spikes early on in the train when the learning rate is at its peak, but nothing unmanageable -- and, as you'd expect, things calmed down quite a lot later on. I also charted the learning rate, to make sure it really was doing what I thought it was doing: So, a pretty smooth train, and we definitely did the right learning rate scheduling. Time to upload it to Hugging Face , and see what the evals look like. Firstly, the smoke test: Reasonably coherent, at least, though it's not super-impressive. On to the loss on our test set: That's our best loss so far! Let's put it into the table: So, it definitely looked like it was worth it. But was it the scheduling of the learning rate that helped, or just the change from 0.0004 to 0.0014? I kicked off a second run with no scheduling, just a learning rate of 0.0014, to see what would happen. After about an hour, I noticed that the loss chart had stopped updating. The last point had a maximum and minimum loss but no average -- but after that, nothing: However, the learning rate was still being charted, so the train was definitely running: Looking at the checkpoint metadata showed what had happened. At global step 1851, we had this 3 : ...and at the next checkpoint at step 2468, we had this: ...and the same for all checkpoints thereafter. Clearly the parameters had gone off the rails -- exactly what we'd expect with an excessive learning rate: There was no point in continuing the train, as it was pretty much certainly unrecoverable, so I stopped it. Out of interest, I downloaded the model, but I couldn't even run the smoke test on it: So it was pretty clear that just updating the learning rate to 0.0014 was actively harmful. No need to upload that one to HF! And time to wrap up this experiment. While this has been quite a long post, I've really only scratched the surface of how learning rates are set. If I were doing things in more detail, the best would probably be to do a "sweep" over multiple values to try to at least approximate the best possible rate for this model. That would be pretty expensive for me, though, so I decided to stick with the DeepSeek number. It might not be ideal for the specific architecture that I'm using, given how different that is to theirs, but given the results, it's a decent one compared to what I was using. 4 Something that I found interesting is that exactly how to schedule your learning rate is still an area being actively researched. Even in my relatively minimal research, I came across three alternatives to the mainstream warmup-cosine decay pattern: I'm sure there are many more. But for this train, I decided to stick to the mainstream, and the results were pretty good! To reiterate, this has been the most positive intervention so far: So I'll stick with that, and move on to the next thing: what is the parameter that we're passing in to the AdamW optimiser? Tune in next time :-) Yes, I am foreshadowing here.  ↩ To make my earlier analogy about learning rate decaying over time in people as they age even more dubious, we can imagine this as being rather like someone middle-aged going on an ayahuasca retreat ;-)  ↩ If you're wondering how we had a valid maximum and minimum in that first checkpoint when the average was NaN, here's why: You might wonder how large labs work out the right learning rate given their training runs run to millions of dollars. The answer is there in that DeepSeek paper, as that's one of the things they were doing. They scaled their model down from the billions of parameters that they wanted to train to various smaller models, and worked out the optimal learning rate for each of the smaller models by doing full trains on them. Once they had a mapping from model size to the ideal learning rate for their architecture, they could extrapolate that to the large ones that they wanted to train. The problem is that those "smaller" models are actually quite a lot larger than the one we're training here! And while we could potentially scale it down even further, I suspect that such truly tiny models (say, 1M parameters) wouldn't train well enough to give any meaningful results.  ↩ From the paper: Specifically, the learning rate of the model reaches its maximum value after 2000 warmup steps, and then decreases to 31.6% of the maximum value after processing 80% of the training tokens. It further reduces to 10% of the maximum value after 90% of the tokens. , which is the optimiser we're applying it to. , which the optimiser's learning rate is multiplied by to work out where we want to start up. , which is likewise applied to the optimiser's learning rate to work out the value we're heading for. , which is the number of steps over which it should go from the initial learning rate to the final one. , which lets the scheduler know how many steps into its schedule it currently is -- this defaults to , meaning it hasn't started yet. This can be useful if you're resuming from a checkpoint, but for our purposes we can ignore it. , which is the same as the 's. , which is the number of steps before it reaches its minimum , the minimum learning rate we want to get to. , again the same as the 's. Per the Hugging Face paper, some people do warmup, then pause at a set level for a while, then start the cosine decay (warmup-stable-decay). DeepSeek use a relatively simple stepped function after a warmup. 5 I came across a 2025 paper " Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs " which says that a linear decay (after a warmup) outperforms cosine. Yes, I am foreshadowing here.  ↩ To make my earlier analogy about learning rate decaying over time in people as they age even more dubious, we can imagine this as being rather like someone middle-aged going on an ayahuasca retreat ;-)  ↩ If you're wondering how we had a valid maximum and minimum in that first checkpoint when the average was NaN, here's why: ↩ You might wonder how large labs work out the right learning rate given their training runs run to millions of dollars. The answer is there in that DeepSeek paper, as that's one of the things they were doing. They scaled their model down from the billions of parameters that they wanted to train to various smaller models, and worked out the optimal learning rate for each of the smaller models by doing full trains on them. Once they had a mapping from model size to the ideal learning rate for their architecture, they could extrapolate that to the large ones that they wanted to train. The problem is that those "smaller" models are actually quite a lot larger than the one we're training here! And while we could potentially scale it down even further, I suspect that such truly tiny models (say, 1M parameters) wouldn't train well enough to give any meaningful results.  ↩ From the paper: Specifically, the learning rate of the model reaches its maximum value after 2000 warmup steps, and then decreases to 31.6% of the maximum value after processing 80% of the training tokens. It further reduces to 10% of the maximum value after 90% of the tokens. ↩

0 views
Giles's blog 3 months ago

Writing an LLM from scratch, part 32c -- Interventions: removing dropout

This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.678. Not much, but it's something! This time, I wanted to see what happened if we trained without dropout. Would removing it make the test loss worse, or better? In a blog post last summer about architectural advances in LLMs since GPT-2 , Sebastian Raschka wrote: Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. That makes quite a lot of sense. My own understanding of dropout was that it was a bit broader than just preventing overfitting -- it seemed to me to be similar to the mandatory vacation policies that financial firms user to prevent over-dependence on individuals . My instinct was that having knowledge distributed across different weights in the model was good in and of itself, even beyond its benefit on multiple-epoch training. But it is quite a high price to pay. With the training parameters we've been using we're literally discarding 10% of our calculations' results -- attention weights, feed-forward neuron activations, and so on -- as we do the forward pass. It's easy to see why it would harm training. Let's give it a go. The nice thing about this one is that, unlike the gradient clipping experiment, I didn't have to write any new code. The dropout level was already controlled by a setting in the file , so by setting that to zero for this run, I could just kick it off and let it do its thing while I worked on something else: Here's what the training run chart looked like (please disregard the stuff about grad norms in the title and the axis -- I'll remove that for the next train): As you can see, we still have loss spikes, including one just after global step 20,000 that lasts for several checkpoint periods of 617 steps. I imagine gradient clipping might have helped with that, but I'm very deliberately testing each intervention in isolation. At the end of the training run, we got this: So, interestingly, it took 967 seconds -- about 16 minutes -- less time than the gradient clipping run, and about 15 minutes less than the baseline train. So while gradient clipping added on a small amount of time (or maybe that was just noise), dropping dropout certainly seems to speed things up! I guess there's quite a lot of work involved in generating and applying the random masks that drop things out as we're doing the forward pass. Anyway, with the model trained, it was time to download it, upload it to Hugging Face Hub , and run the evals. Firstly, the smoke test, where it just needs to continue the sequence , it came up with something reasonably coherent: ...but it was on the test of the loss on the training set that it was most impressive: That's a bigger improvement on the baseline train's 3.692 than gradient clipping: 0.051, which is more than three times the improvement! Let's start keeping a table of these: Now, of course, we don't know how these different interventions combine together -- it would be naive to think that if we did both gradient clipping and dropout removal, we'd get a total loss reduction of 0.014 + 0.051 -- but, especially with that long-lived loss spike in our training run -- it does feel like they might play well together. So, that's dropout covered. Which one next? I think a nice easy one that I should be able to get done on a Friday will be adding bias to the attention weight calculations. Let's give that a go and see if it makes things worse or better! Stay tuned...

3 views
Giles's blog 3 months ago

Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

I'm rounding out my series of posts on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) " by seeing how I could train the best base model I can from scratch on my own hardware. I started by training one in two days on my RTX 3090 , and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better. For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud . That led to some refinements in the prompt-following test I was using , and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub . Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to. I listed a number of possible interventions at the end of the RTX 3090 post; I'm not going to do them all, but for completeness, here's the full list: I'm going to work through each of those apart from the first two and the batch size (and will retrospectively add links to the posts when I do), trying a train with just that intervention and nothing else, on a cloud machine. Once that's done, I'll bake all of the things that helped into the training loop, and do another local train -- with gradient accumulation to make the batch size match the cloud instances'. The cloud machine size that I decided to use for this was the one that came out the most cost-effective (and due to its VRAM size, had the best loss) in my earlier cloud training test: an 8x A100 machine with 40 GiB VRAM per GPU. But first, we need a baseline model. I've already done a train on an 8x A100 40 GiB machine -- why do we need a new one? In my cloud training post, I came to the conclusion that the cost in terms of training time of running a periodic validation loop as we trained was not really worth it, at least in this case. Two of the biggest reasons to have validation during training are to work out when you're overfitting on a multi-epoch train, and to see how your model can handle datasets that it has not been trained on. In a single-epoch train like this, you're not going to overfit -- every sample it sees will be new to it -- and the training loss itself is over samples it's not been trained on at the time it was calculated, for the same reason (though of course it will be trained on them as soon as we do the backward pass starting with that loss). Of course, it's not perfect -- a big benefit of the validation loss is that it's over the same held-back dataset on every run -- and there are arguments for keeping it (albeit, perhaps doing full runs less frequently than I was). But for these experiments, I decided that I'd simply drop it. I also wanted to introduce a consistent random seed at the start of the training loop. I didn't have that in my cloud trains, and of course if we want to have solid results on whether each intervention really does improve matters, then we need one so that we can be sure they're all starting from the same point. Both of those meant that I couldn't use the earlier train on the 8x A100 40 GiB machine as a baseline; I'd need a new one, introducing those two changes: no validation during the training run (using training loss as a proxy), and setting a random seed at the start for reproducibility. So: what was the baseline train going to look like? The first step was to strip out the validation code and to replace it with code that just took periodic checkpoints, keeping track of which one had the best average training loss over the period since the previous one. Next, I decided to plot on the training chart that is generated during the run not just the training loss, but also an indicator of the maximum and minimum training loss over all of the steps in that period. Then I added the random seed , which I set to 42. A couple of bugfixes, and we were left with this version of the code . One thing to highlight: in the file that specifies the various training parameters, I set the per-GPU micro-batch size to 12 rather than the 13 I'd used on this size of machine earlier. Two reasons for that: Firstly, I'm going to want to do a local run with gradient accumulation later, using all of the helpful interventions. With gradient accumulation, you do a number of steps with batches that you can fit into your memory, but you don't update the gradients each time. After a number of those, you do one big update based on the accumulated gradients -- hence the name. The full batch is all of those smaller batches taken together. If I want that to closely match the cloud train, I'll want the accumulated batches to be the same size as each global batch in the cloud. Now, on my local machine, I can fit a batch of 6 into VRAM. So that means that the full batch needs to be divisible by 6 1 . On the cloud train, with a micro-batch of 13 and 8 GPUs, we had an overall batch size of 104 in the previous train. 104 is not divisible by 6: no joy. But with a micro-batch size of 12, we have an overall batch of 12 × 8 = 96 , which means we'd be able to do gradient accumulation and do a parameter update every 96 ÷ 6 = 16 steps. Secondly, while my estimate of the ideal overall batch size was based on a rather arbitrary bit of curve-fitting, it did say that 97 was the ideal size. So it could be interesting to see whether it did help! So, having coded that up and set up the configuration, it was time to run it. Here's the training chart it came up with: Note the loss spikes at around global steps 4,200, 13,000 and 23,000. Those are important, I'll explain why later. The training run reported this at the end: So it took about 3h24m to train, even less than we expected from the previous cloud experiments' estimates of how long it would take excluding validation. About US$35 in cost. Here is the model on Hugging Face Hub . Let's see how it looks. For these intervention posts, I won't run the instruction-following tests, as they can only be run against a batch of models in one go to get results that are consistent with each other . But the smoke test -- how does it complete the sequence is worthwhile: Looks good! Reasonably coherent. Now we can find the loss on our held-back test set: That's a bit worse than the 3.674 we got for the original cloud train. Either the calculations of the optimal batch size I did were not quite right (entirely likely, they were very ad-hoc) or the model weights we started with, given the random seed we're using, just happened to lead us in a slightly worse direction (also plausible). Either way, it's in line with what we expected, and is still better than the test loss of 3.725 that we got with the second-best machine in the cloud comparison post (the 8x H100 80 GiB with a global batch size of 216). So: we have a solid baseline model -- before we wrap up, let's consider those spikes in the loss that I called out in the training chart. Random spikes in the loss are a Bad Thing, right? Certainly they're a bad thing for a train in general, especially if you don't know for sure what's causing them. But my working assumption has been that they're caused by exploding gradients -- for some specific sample in the dataset, the gradients have gone up to some insanely high value, and we've had a bad update to our parameters as a result. It hasn't completely knocked the model back to its starting point, but it does take some time to recover, so we lose the benefit of some of our training. If that is the case -- and it's not just something like a batch happening to have stuff that's wildly different to the rest of the training data, or something weird in the optimiser -- then gradient clipping is the solution. I wanted to see if it would help the model quality in general, but of course if we hadn't had any loss spikes in this baseline train it would have been hard to see if that was the case! So I was very glad to see them here, as if there had been none I would either have had to do a gradient clipping experiment with no real expectation of it helping -- or do another baseline train with a different random seed in the hope that that caused some spikes, which would have cost another US$35. All in all, it was good to see them there, as it sets us up well for that experiment. So, we've trained a baseline model that we can make changes to -- the interventions I listed at the start -- and get a pretty reliable understanding of whether or not they help the quality of the final model. With that in place, we're in a good position to start running those intervention tests! Given the loss spike situation in that chart, I think that a solid first one to go for -- even though it was the last in that list at the top of this post -- is gradient clipping. Where are those loss spikes coming from, and if it's exploding gradients, what happens if we limit the damage they do with gradient clipping? Stay tuned! I've already done the training run for that (while I wrote this one up), so I should be able to post about it tomorrow. Well, you could potentially do something with batches of different sizes, but that would be fiddly.  ↩ The amount of training data. I'm not going to dig into this one; it looks like it does help, but the returns diminish rapidly, so I think that in order to get any serious improvement we'd need to train for much more than two days locally. In the one "extended training" test I did, I managed to get the loss down from 4.167 to 4.135, which was... less-than-inspiring. The number of epochs. I'm going to stick to single-epoch training -- that is, I'll train on a single pass through an amount of non-repeating data chosen to take 48 hours to handle on my local machine. The bias on the W q , W k and W v matrices. This one definitely sounds worth looking into -- easy, as it's just a change to a config flag, and makes the model more like the original GPT-2. I'll give that a go. Dropout. I've read that for single-epoch training, dropout doesn't help (which doesn't quite work with my mental model of what it's for, but does sound plausible). Worth a look! The learning rate, and weight decay. The values I've used for these are basically copypasta from the book. I think I should learn to understand these and try to optimise them a bit. The precision. I'm using AMP , which means that some calculations are done in 16-bit rather than 32-bit, and calling with to let PyTorch choose to use the GPU's tensor cores, which use TF32, a kind of "32-bit float lite" (see the post on the local train for details). Those both (at least potentially) reduce the precision of the train below what you'd get if you trained with full-fat . Would reverting that be worth the longer train time? I should probably at least poke at that. The batch size. I've already, in effect, tried playing with that. The different cloud machines I played with had different amounts of per-GPU VRAM, so supported different per-GPU micro-batch sizes. So I wound up trying batch sizes from 512 (the same as the original GPT-2 was trained with) down to 104 in the cloud, plus my local trains with a batch size of 6. I did a rough-and-ready calculation at the end of the cloud training post where I estimated that the ideal batch size might be something like 97. So, probably not worth much more investigation. Exploding gradients. In one of my local trains, and in three out of the four cloud trains, I had sudden spikes in both training and validation loss. It generally took quite a bit of training -- maybe 10-15% of training time -- to get back on track after some of these, so we had what could be seen as wasted time in the training runs. Exploding gradients can be fixed by gradient clipping, which is relatively easy to do. Definitely worth investigating! Well, you could potentially do something with batches of different sizes, but that would be fiddly.  ↩

0 views
Shayon Mukherjee 3 months ago

Software engineering when machine writes the code

In 1968, a group of computer scientists gathered at a NATO conference in Garmisch, Germany, and coined the term “software crisis.” The problem they identified wasn’t that computers were bad or unreliable. It was that computers had become too powerful for the existing methods of programming to handle. Edsger Dijkstra later put it memorably: “As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

0 views
Kaushik Gopal 4 months ago

AI model choices 2026-01

Which AI model do I use? This is a common question I get asked, but models evolve so rapidly that I never felt like I could give an answer that would stay relevant for more than a month or two. This year, I finally feel like I have a stable set of model choices that consistently give me good results. I’m jotting it down here to share more broadly and to trace how my own choices evolve over time. GPT 5.2 (High) for planning and writing, including plans Opus 4.5 for anything coding, task automation, and tool calling Gemini ’s range of models for everything else: Gemini 3 (Thinking) for learning and understanding concepts (underrated) Gemini 3 (Flash) for quick fire questions Nano Banana (obv) for image generation NVIDIA’s Parakeet for voice transcription

0 views
Giles's blog 4 months ago

Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI: Here were the results I got, sorted by the loss: Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern. I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now. In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change. Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset in order to let it know how to follow instructions; that comprises a series of sequences like this: More details in this post . In the version I've settled on , I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples. Once that's done, the script hits the OpenAI API, using GPT-5.1, default parameters for all of the options (eg. no explicit temperature) with queries like this: We do that for every model-generated response in the test set, then take the average of the scores and use that as our result. To see why that's problematic, imagine this simple instruction with no separate input: One response I've seen from my models was this: That's obvious garbage, and should get a zero -- and GPT-5.1 consistently does that. Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), is this: That's correct, so it deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset). But now how about this one: One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason for that, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response. The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment. I came to the conclusion that the best way to fix this is to do a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, and generate responses for the test set and store them in a file. Then, once we've done that for all models, we can score them all at once, prompting GPT-5.1 with something like this: The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better. Here's the code: Running the first against each of our models, and then the second against all of the output files, gives us this updated table (with links to the annotated JSON files in case anyone else wants to take a look): (Still sorted by loss so that you can compare it more easily with the one above.) That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern. It looks like we have three groups of models: I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB, and the two local Local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower. A hypothesis: there are two things that contribute to how good a model is at these IFT tests: Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here. While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible. By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training. So we can imagine the OpenAI models are smart but not knowledgeable, as we can our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well. Their intelligence makes up for some of their lack of knowledge. Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, but they didn't wind up with a particularly amazing loss. So they get low scores. And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more because their training data was better. Well, it sounds plausible ;-) And I'd like to spend some time digging in to see if there's any indication if it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed how you'd test that hypothesis in any real depth. TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning about how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. I will keep on doing these IFT tests for future models, though, just out of interest. So: let's get back to our regular scheduled LLM training. Next up, how do we upload our models to Hugging Face quickly and easily so that other people can play with them. A simple cross entropy loss over a fixed test set. The results for an instruction fine-tune test that's covered in the book. A script to fine-tune a model and generate test responses and to dump them into a JSON file. The LLM-as-a-judge code to send a bunch of models' responses to GPT-5.1 . It scrambles the order of the models in each query, to try to avoid any preference the model might have for the first one vs the last one, and it stores GPT-5.1's per-response scores and comments in a new "annotated" JSON file. The OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores The other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores. The FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's. The loss. Models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning. The amount of information in the dataset. It doesn't matter how clever a model is, if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question.

0 views
Alex Jacobs 4 months ago

Beating BERT? Small LLMs vs Fine-Tuned Encoders for Classification

“Just use an LLM.” That was my advice to a colleague recently when they asked about a classification problem. Who fine-tunes BERT anymore? Haven’t decoder models eaten the entire NLP landscape? The look I got back was… skeptical. And it stuck with me. I’ve been deep in LLM-land for a few years now. When your daily driver can architect systems, write production code, and reason through problems better than most junior devs, you start reaching for it reflexively. Maybe my traditional ML instincts had atrophied. So I decided to actually test my assumptions instead of just vibing on them. I ran 32 experiments pitting small instruction-tuned LLMs against good old BERT and DeBERTa. I figured I’d just be confirming what I already believed, that these new decoder models would obviously crush the ancient encoders. I was wrong. The results across Gemma 2B, Qwen 0.5B/1.5B, BERT-base, and DeBERTa-v3 were… not what I expected. If you’re trying to decide between these approaches for classification, you might want to actually measure things instead of assuming the newer model is better. All the code is on GitHub if you want to run your own experiments. BERT Family (Fine-tuned) For the LLMs, I tried two approaches: Four classification benchmarks ranging from easy sentiment to adversarial NLI: For anyone who wants to reproduce this or understand what “fine-tuned” and “zero-shot” actually mean here: BERT/DeBERTa Fine-tuning: LLM Zero-shot: LLM Few-shot (k=5): All experiments used a fixed random seed (99) for reproducibility. Evaluation metrics are accuracy on the validation split. Hardware: RunPod instance with RTX A4500 (20GB VRAM), 20GB RAM, 5 vCPU. I’d forgotten how pretty text-only land can be. When you spend most of your time in IDEs and notebooks, SSH-ing into a headless GPU box and watching nvitop do its thing feels almost meditative. Let’s dive into what actually happened: DeBERTa-v3 wins most tasks—but not all DeBERTa hit 94.8% on SST-2, 80.9% on RTE, and 82.6% on BoolQ. For standard classification with decent training data, the fine-tuned encoders still dominate. On ANLI—the hardest benchmark, specifically designed to fool models—Gemma few-shot actually beats DeBERTa (47.8% vs 47.4%). It’s a narrow win, but it’s a win on the task that matters most for robustness. Zero-shot LLMs actually beat BERT-base The LLMs aren’t losing to BERT—they’re losing to DeBERTa. Qwen2.5-1.5B zero-shot hit 93.8% on SST-2, beating BERT-base’s 91.5%. Same story on RTE (78.7% vs 61.0%) and BoolQ (Gemma’s 80.9% vs BERT’s 71.5%). For models running purely on prompts with zero training? I’m calling it a win. Few-shot is a mixed bag Adding examples to the prompt doesn’t always help. On RTE, Qwen2.5-1.5B went from 78.7% zero-shot down to 53.4% with few-shot. On SST-2, it dropped from 93.8% to 89.0%. But on ANLI, few-shot helped significantly—Gemma jumped from 36.1% to 47.8%, enough to beat DeBERTa. Few-shot helps on harder tasks where examples demonstrate the thought process, but can confuse models on simpler pattern matching tasks where they already “get it.” Sometimes examples add noise instead of signal. Okay, so the accuracy gap isn’t huge. Maybe I could still justify using an LLM? Then I looked at throughput: BERT is ~20x faster. BERT processes 277 samples per second. Gemma-2-2B manages 12. If you’re classifying a million documents, that’s one hour vs a full day. Encoders process the whole sequence in one forward pass. Decoders generate tokens autoregressively, even just to output “positive” or “negative”. Note on LLM latency: These numbers use for tokenization. When I bumped it to , latency jumped 8x—from 57ms to 445ms per sample for Qwen-0.5B. Context window scales roughly linearly with inference time. For short classification tasks, keep it short or make it dynamic. These models struggled on nuanced reviews. Can you do better? Try classifying some of the trickiest examples from my experiments: Classify these tricky movie reviews Despite the efficiency gap, there are cases where small LLMs are the right choice: Zero Training Data If you have no labeled data, LLMs win by default. Zero-shot Qwen2.5-1.5B at 93.8% on SST-2 is production-ready without a single training example. You can’t fine-tune BERT with zero examples. Rapidly Changing Categories If your categories change frequently (new product types, emerging topics), re-prompting an LLM takes seconds. Re-training BERT requires new labeled data, training time, validation, deployment. The iteration cycle matters. Explanations with Predictions LLMs can provide reasoning: “This review is negative because the customer mentions ‘defective product’ and ‘waste of money.’” BERT gives you a probability. Sometimes you need the story, not just the number. If you’re processing 100 support tickets a day, throughput doesn’t matter. The 20x speed difference is irrelevant when you’re not hitting any resource constraints. High-Volume Production Systems If you’re classifying millions of items daily, BERT’s 20x throughput advantage matters. That’s a job finishing in an hour vs. running all day. Well-Defined, Stable Tasks Sentiment analysis. Spam detection. Topic classification. If your task definition hasn’t changed since 2019, fine-tuned BERT is proven and stable. No need to fix what isn’t broken. You Have Training Data With a few thousand labeled examples, fine-tuned DeBERTa will beat small LLMs. It’s a dedicated specialist vs. a generalist. Specialization still works. Latency Matters Real-time classification in a user-facing app where every millisecond counts? BERT’s parallel processing wins. LLMs can’t compete on speed. Before you @ me on Twitter—yes, I know this isn’t the final word. Some caveats: I only tested small LLMs. Kept everything under 2B parameters to fit comfortably on a 20GB GPU. Bigger models like Llama-3-8B or Qwen-7B would probably do better, but then the efficiency comparison becomes even more lopsided. You’re not beating BERT’s throughput with a 7B model. Generic prompts. I used straightforward prompts without heavy optimization. Task-specific prompt engineering could boost LLM performance. DSPy-style optimization would probably help too—but that’s another blog post. Four benchmarks isn’t everything. There are plenty of classification scenarios I didn’t test. Your domain might be different. Measure, don’t assume. So, can small LLMs beat BERT at classification? Sometimes, and on the hardest task, they actually do. Gemma few-shot edges out DeBERTa on adversarial NLI, the benchmark specifically designed to break models. DeBERTa-v3 still wins 3 out of 4 tasks when you have training data. And BERT’s efficiency advantage is real—~20x faster throughput matters when you’re processing millions of documents and paying for compute. Zero-shot LLMs aren’t just a parlor trick either. Qwen2.5-1.5B hits 93.8% on sentiment with zero training examples—that’s production-ready without a single label. For cold-start problems, rapidly changing domains, or when you need explanations alongside predictions, they genuinely work. Hopefully this gives some actual data points for making that call instead of just following the hype cycle. All the code is on GitHub . Go run your own experiments. Surely I’ve made some embarrassing mistakes here. Don’t just tell me—tell everyone! Share this post on your favorite social media with your corrections :) BERT-base-uncased (110M parameters) DeBERTa-v3-base (184M parameters) Qwen2-0.5B-Instruct Qwen2.5-1.5B-Instruct Gemma-2-2B-it Zero-shot - Just prompt engineering, no training Few-shot (k=5) - Include 5 examples in the prompt Standard HuggingFace Trainer with AdamW optimizer Learning rate: 2e-5, batch size: 32, epochs: 3 Max sequence length: 128 tokens Evaluation on validation split (GLUE test sets don’t have public labels) Greedy decoding (temperature=0.0) for deterministic outputs Task-specific prompts asking for single-word classification labels No examples in context—just instructions and the input text Same as zero-shot, but with 5 labeled examples prepended to each prompt Examples randomly sampled from training set (stratified by class)

0 views