Giles's blog 6 days ago

Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud

Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090 , using code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that gap, and had a list of potential changes to the training setup, and to the model itself. Which of them would help? I found a list of solid-looking interventions, and in my last post I came to the conclusion that the improvements in loss I had seen with all of them -- with two possible exceptions -- seemed unlikely to be in the noise. What would happen if I tried to put them into a new model? Let's start by looking at the results that we have for the interventions so far -- this is the table I've been using as I go through them, but I've updated it to contain the loss figures for each model to six decimal places instead of three, and made each model name link to the associated post. I've also corrected the loss for the model, which was mistakenly using the training loss at the end of the run rather than the loss on the test set 1 . As I've mentioned before, simply moving to training in the cloud improved things markedly, getting loss down from 3.944 to 3.691526; I suspect this was due to having a closer-to-optimal batch size (more about that in my next post). What to do about the other interventions, though? It seemed clear that two of them were not helping: weight tying, and the one using the figure for weight decay that I'd (I suspect incorrectly) derived from a paper by Cerebras Research. The "no-AMP" run (which would be better described as "full-fat float32") had a small positive effect, but was so costly in terms of both time and money that it wasn't worthwhile. So we had five interventions to try: How would they stack up? 
It seemed pretty unlikely that their independent contributions would just sum up neatly so that we got a total improvement of 0.013209 + 0.022141 + 0.048586 + 0.050244 + 0.089609 = 0.223789 (though that would certainly be nice!). One question to consider was how independent they were. For any set of interventions, you can imagine them being independent and adding up nicely, or pulling in separate directions so that the combined effect is worse than the sum, or pulling in the same direction so that they amplify each other. My intuition was that gradient clipping and removing dropout were pretty independent, at least conceptually. They might affect other interventions indirectly (eg. via changing the training run's use of the random number generator) but they'd be unlikely to have a direct effect. QKV bias I was less sure about, but it seemed -- again, just intuitively -- at least reasonably independent of the others, with one important exception (which I'll get into below). By contrast, weight decay and the learning rate interact together quite strongly, at least in standard gradient descent, and I'd tested them in isolation. The result for changing the weight decay to 0.01 was based on a fixed learning rate of 0.0004, and the result for scheduling the learning rate was based on a weight decay of 0.1. That felt like an issue, and definitely needed some thought. Additionally, there were some issues with which interventions might have not had a real effect, and instead just been the results of the use of randomness. While my analysis of how that might have affected things was somewhat limited by the number of test runs I could afford to do, it did show up two plausible issues: After some thought, I came up with a plan. 
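As a quick check of the arithmetic at the start of this passage, here is a minimal sketch that adds up the five individual improvements and works out the test loss they would imply if they really were independent and additive. The names `improvements`, `BASELINE_LOSS` and `naive_combined_loss` are made up for illustration; the figures are the ones quoted in the post.

```python
# Individual test-loss improvements over the cloud baseline, as quoted above.
improvements = [0.013209, 0.022141, 0.048586, 0.050244, 0.089609]
BASELINE_LOSS = 3.691526  # test loss of the cloud-trained baseline


def naive_combined_loss(baseline, deltas):
    """Loss we would see if every improvement were independent and additive."""
    return baseline - sum(deltas)


total = sum(improvements)
print(f"sum of improvements: {total:.6f}")
print(f"naive combined loss: {naive_combined_loss(BASELINE_LOSS, improvements):.6f}")
```

This is only the best-case bound; any interaction between interventions would move the real result away from it.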
If I were doing this properly and scientifically, I suppose I'd try every combination of interventions, but that would be ruinously expensive 2 , so a sensible minimal set of training runs felt like this: When those completed, I'd find the test set loss for both models. I'd choose the best run, and then do another run with those settings, but with weight decay switched back to the original value of 0.1. I chose to revert weight decay rather than the learning rate stuff because this was the one I was least sure about -- the updated "GPT-2" value of 0.01 is very unusual by today's standards, and I'd come to it via a rather circuitous route -- see the post for more details. The best of the three runs would be the winning combination of interventions. Again, this was not an exhaustive plan 3 . But it seemed to make sense. Let's see how it turned out. Just to recap, this one had these interventions against the baseline: It did not have QKV bias. You can see the config here . Here's the loss chart over the course of the training run: As normal with learning rate scheduling, I also charted that to make sure it was doing the right thing (you can see that it was): And I also tracked the gradient norms -- you can see that there was some clipping happening near the start of the run: At the end of the run, it reported this: That's a slightly lower final train loss than normal, and it took 3h10m, which is faster than usual, but about the same as the other train we did without dropout -- that makes sense, as the process of zeroing out random activations isn't free. I downloaded the model -- here it is -- and then ran the smoke test: ...and got its loss on the test set: Not bad at all -- the best result we've had so far, albeit not quite up to the standard of the original GPT-2 weights. Now the next one, with QKV bias. This one had these interventions: You can see the config here . 
Here's the loss chart: ...the learning rate: ...the gradient norms (note that we had more clipping, about halfway through): ...and the final printout at the end. That final train loss is slightly higher, which is normally an indicator that the test loss will be higher, but we'll have to see. Time to download the model -- here it is -- and on to the smoke test: ...and then the moment of truth -- what was its loss on the test set? As I suspected from the training loss at the end, slightly worse than the run without QKV bias. So, that meant that we should do the next run, with a weight decay of 0.1, with no QKV bias. Given the above results, this one had these interventions vs the baseline: Weight decay was back to the baseline value of 0.1, rather than the value of 0.01 used in the previous two runs, and QKV bias was switched back off. You can see the config here . Here's the loss chart: You can see that it's much choppier than the previous two runs; that initially surprised me, as the higher weight decay means that we're regularising the model more than we were with those, which I thought would "calm things down". But on reflection, I had it backward. Hand-waving a bit, a more regularised model is fitting less closely every detail to the data it has seen, considering the typical stuff more than it does the outliers. That means that when something a bit more out-of-distribution appears, it might not have yet learned how to integrate it into its model of the world. Well, it sounds plausible, anyway :-) On to the learning rate (just to double-check), and it's fine: And again, the gradient norms: ...which similarly to the loss chart show more occasions where gradients spiked and had to be clipped -- even towards the end of the training run this time. The final printout at the end: Once again, although the final train loss is not definitive, it tends to be indicative of the test loss. 
It's in between the last two runs, so we'd expect the test loss to be likewise in between theirs: Time to download the model -- here it is -- and on to the smoke test: Hmm. At least vaguely coherent, though I'm not 100% convinced. It looks like ads for personal injury lawyers have crept into FineWeb somehow... Still, it's time for the test loss (drumroll): As predicted from the train loss, it's in between the two runs above. Let's put these three runs into the results table: As a reminder: You can see that adding on QKV bias actually made the model worse than the learning-rate-only intervention. That pushes me slightly away from the "it's all about the initial weights" direction; perhaps instead the bias adds some kind of stability that the learning rate scheduling also provides, and they fight against each other? Unfortunately I think the only way to pick it apart would be to do a full set of runs, switching each intervention on and off independently, and that would be too costly. The fact that the weight decay change from 0.1 to 0.01 actually did help when combined with the learning rate change and scheduling was a bit of a surprise; because they're both coupled when we think about standard gradient descent, I was expecting them to be too intertwined for my tests of them in isolation to have been valid. Quite pleased that it didn't work out that way, though, because sweeping across values for different parameters is much easier than it would be if they were connected. However, at this point it occurs to me that it might be because we're using the AdamW optimiser. As I understand it, its big difference versus Adam is that it decouples weight decay. I don't have a solid mental model of what that means exactly (will read up and post about it eventually), but it certainly seems pertinent here. Anyway, I have to say, I'm both pleased with and disappointed by these results. 
Pleased because we got a result by putting interventions together that was better than any of them in isolation, but disappointed that the end result wasn't even better. The difference between the baseline's loss, at 3.691526, and original GPT-2 small's, at 3.5, was 0.191526. Our best result was 3.577761, so an improvement of 0.113765. That's about 60% of the way there.

That said, by sheer chance, while trying out the different sizes of cloud machines, I'd got from a loss of 3.944 training locally to the baseline's value of 3.691526 -- I suspect due to the fact that training in the cloud meant that I could use batch sizes of 96. So a different way of looking at it is that we should include that in the calculations too. From 3.944 to 3.5, the gap with GPT-2 small was 0.444. And we went from 3.944 to 3.577761, an improvement of 0.366239. And that means that we managed to get 82% of the improvement we needed.

On the other hand, it means that in terms of my improvements, 0.252474 came from a happy accident, while all of my careful work on interventions only got me 0.113765. :-(

Anyway, I think that for now, I'll have to rest happy with that as a result -- and next time around, let's see if we can get to the same level of improvement locally, using gradient accumulation.

Luckily the difference was small enough that it doesn't change any of the conclusions I'd made about it.  ↩

Because there are five interventions, and each can be on or off, it's equivalent to a 5-digit binary number. So that's 2^5 trains, less the five I'd already done and the baseline, for a total of 32 − 6 = 26. At US$50-odd for a train, that's definitely a no-go.  ↩

I did also consider changing the random seed at the start of the code to 67 rather than 42, given that it seemed to provide better initial weights when I was exploring the effects of random noise on the training. I even started the first two training runs with that in place.
However, on reflection I realised that it would be one step too far away from scientific rigour. I'm not trying to be 100% rigorous in these posts, but it seemed like a step too far to diligently test all of the interventions against one seed, and then YOLO in a different one for the final training runs.  ↩

The five interventions to try:
- Gradient clipping.
- QKV bias (that is, adding bias to the attention weight matrices).
- Changing weight decay to the GPT-2 value (0.01 rather than the 0.1 that is typical nowadays).
- Removing dropout.
- Updating the learning rate from 0.0004 to 0.0014, but also scheduling it so that it varies over the course of the training run.

The two plausible issues with randomness:
- Adding gradient clipping looked like it might have been within the training run noise.
- Adding QKV bias would have had a large effect on the model's initial weights. All of the others would have started with essentially the same weights (apart from weight tying, though even that would have had the same values for the initial weights apart from the tied ones). But adding the bias would have completely changed them, and its effect size was comfortably within the range of differences you might expect from that.

The minimal set of training runs:
- Start a training run with all of the interventions apart from QKV bias.
- In parallel (Lambda instance availability permitting) run another one, with all of the interventions including QKV bias.

Interventions in the first run (no QKV bias):
- Gradient clipping at 3.5
- Weight decay changed from 0.1 to 0.01
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.

Interventions in the second run (with QKV bias):
- Gradient clipping at 3.5
- Weight decay changed from 0.1 to 0.01
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.
- QKV bias switched on.

Interventions in the third run (weight decay back at 0.1, no QKV bias):
- Gradient clipping at 3.5
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.
- The first run was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01, dropout removed, and the learning rate intervention, but no QKV bias.
- The second run was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01, dropout removed, and the learning rate intervention, with QKV bias.
- The third run was gradient clipping at 3.5, dropout removed, and the learning rate intervention, but no QKV bias, and no change to weight decay.
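The learning rate intervention used in all three runs (a warmup over 5% of the run, a peak of 0.0014, then a cosine decay to 0.00014) can be sketched in a few lines of plain Python. This is an illustration rather than the code from the linked configs: the function name `lr_at`, the linear shape of the warmup, and the step count used in the demo are all assumptions.

```python
import math

PEAK_LR = 0.0014        # peak learning rate quoted in the post
MIN_LR = 0.00014        # value the cosine decay ends at
WARMUP_FRACTION = 0.05  # warmup covers the first 5% of the run


def lr_at(step, total_steps, peak=PEAK_LR, floor=MIN_LR,
          warmup_frac=WARMUP_FRACTION):
    """Learning rate for a 0-indexed step: warmup first, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Assumption: the warmup is linear and hits the peak on its last step.
        return peak * (step + 1) / warmup_steps
    # Fraction of the decay phase completed; reaches 1.0 on the final step.
    progress = (step - warmup_steps + 1) / (total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, purely for the demo.
TOTAL_STEPS = 1000
for s in (0, 49, 500, 999):
    print(s, f"{lr_at(s, TOTAL_STEPS):.6f}")
```

Charting `lr_at` over every step is a cheap way to confirm a schedule is doing the right thing before paying for a training run.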

Giles's blog 1 week ago

Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?

Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090 , using code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot more than the original GPT-2's 3.500 on the same dataset. I wanted to see whether I could train a model on my own hardware (or on something that didn't cost too much to rent in the cloud) that got closer to the original model's performance. So over the last few months, I've done a bunch of further training runs, each one testing a specific intervention -- a stand-alone change that I expected to change the loss, either for better or for worse. Specifically: At the end of all of that, I had this table showing the effect of each intervention in terms of loss on the test set. They're sorted from least-effective to most-effective, and you can see the baseline in there too: Winners and losers are reasonably clear: So, for an optimal train, we'd just use the effective interventions, right? Well, not quite. Full-fat float32 I decided wasn't worth the effort, as it meant that the train took more than twice as long, and (because it required a larger machine), cost more than three times as much. The others did look like solid changes, but there was one concern. The effect of each intervention is actually pretty small. For example, gradient clipping reduced the loss by 0.014, from 3.692 to 3.678. That's a 0.3% improvement. Even the best intervention, scheduling the learning rate, only improved things by 2%. Could it be that some or all of these improvements were not real, but just a result of the random nature of training deep neural networks? Could the differences just be in the noise? 
They seemed small enough for that to be possible. I've trained seven more models over the last few days to try to get a feel as to how big an effect noise has for this kind of training run. The results appear to show that variations in the initial weights matter quite a lot, but randomness in the training loop (given the same initial weights) actually has a fairly minimal impact. That surprised me a bit! Let's go through the details. When I did the original baseline training run -- creating the model that was the comparison point for all of the interventions -- I wanted to minimise the amount of random number-induced differences between the training runs in this interventions series. I did this by setting the random seed at the start -- specifically, I had this code: At the time I wrote it, this seemed pretty complete -- the seed is set on Python's own random number generator, on PyTorch's, and on the separate ones it uses for CUDA. However, in a separate project, where I was fine-tuning a Qwen model as a classifier, I'd found that this wasn't enough. In order to get full reproducibility, I'd had to lock things down a bit more, with this additional code: So: was my random number seed code enough for this case? Or would I get a different model if I ran the same code a second time? That was easy enough to do; I spun up a machine, and just ran the "baseline" train again. 3 hours 24 minutes later: Interestingly, that was exactly the same final train loss as the original baseline train. Here's the model . I ran my normal smoke test, asking it to complete "Every effort moves you" ...so that was OK -- the model was generating reasonably coherent text. Then I ran the eval to find its loss on the test set: Exactly the same as the original baseline! That was certainly promising. 
Now, the use of three decimal places for the output from the loss eval is just a formatting thing, so I bumped it up to 6 dps, and the new model got this: Running that against the original baseline model: Again, exactly the same. Finally, more out of idle interest than anything else, I decided to see if the models were at least different: That is, quite frankly, amazing to me. I was expecting pretty close results, but what we're seeing here is that two separate models, trained on the same data, but on different machines more than a month apart, have weights that are bit-wise identical. No random noise at all. That's actually really reassuring! It makes me much more comfortable that we're standing on a stable foundation here. Now it was time to see what effect changing that random seed would have. Let's think about what the random seed does. When we call , we're initialising Python's pseudo-random number generator so that it will start at a particular point -- after we've called it, it will generate the same sequence of "random" numbers each time it's asked for a new one. So the effect of this code: ...is to initialise three separate pseudo-random number generators to be in a known deterministic state, so they'll all generate the same sequence in every run. So, the first thing to do was to see what happened if we changed that number. I decided to do two training runs, each with exactly the same code as the baseline, but with different random seeds. Firstly, I changed it from 42 to 22 1 : That training run completed: Here's the model . Time for the evals; the smoke test: ...and the loss test: So, that's 3.673453 compared to 3.691526, an improvement of 0.018 over the run with a seed of 42. That's more than the 0.014 improvement we got from gradient clipping (and indeed, the 0.013 from full-fat float32 training), and quite close to the 0.023 improvement from adding attention weight bias. Time for another training run: Another 3h24m later: Here's the model . 
The smoke test: ...and the test set loss: A further improvement! That's 0.038 better than our original baseline, which beats adding on attention weight bias (though it's worse than the weight decay update). Now, three data points is rather a small number for any kind of statistical analysis, but just out of interest, let's do the basics. GeeksForGeeks has a good refresher here if you're a bit rusty. Firstly, our mean is ...and our variance 2 is: If we take the square root of that, we get the standard deviation (SD): So, if we assume a normal distribution, what would that say about our results? Here's the results table again. If we assume that the results are on a normal distribution: That seemed a bit saddening -- were all of the results apart from scheduling the learning rate within the noise? Well, so as I said, three data points is too small a number to take those results without a fistful of salt. I was thinking of perhaps trying another few random seeds to see what would happen, and perhaps to tighten those numbers up a bit, but then something occurred to me -- randomness was being used in two different ways in the training run, and perhaps we could separate them? Where do we use the random numbers? Well, immediately after we set the seeds, we create our uninitialised model for training: One of the random number generators -- Python's, PyTorch's, or one of the CUDA ones -- will be used to generate the initial weights that we're going to start training. That means that for the same model setup , we'll always start with exactly the same weights. But if the model settings change such that we initialise different things in a different order, then we'll have different weights. After we've done that, we go into the training loop. That can have randomness in it; although the AdamW optimiser itself is deterministic, we are (in all but one of these training runs) using dropout, which drops a random bunch of activations at various points -- 10% of them with our config. 
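To make the point about ordering concrete, here is a toy sketch using Python's own `random` module as a stand-in for the PyTorch generators. The function `init_then_noise` and its parameters are hypothetical, and the Gaussian "weights" are not the real initialisation; the only thing it shows is that with a fixed seed the same setup reproduces exactly, while drawing a few extra values during initialisation (as extra bias parameters would) shifts every random number used afterwards.

```python
import random


def init_then_noise(seed, n_weights, n_noise=5):
    """Draw toy 'initial weights', then 'training-time' randomness, from one RNG."""
    rng = random.Random(seed)
    # Stand-in for weight initialisation: consumes values from the generator.
    weights = [rng.gauss(0.0, 0.02) for _ in range(n_weights)]
    # Stand-in for later random draws in the training loop (e.g. for dropout).
    noise = [rng.random() for _ in range(n_noise)]
    return weights, noise


w_a, noise_a = init_then_noise(seed=42, n_weights=4)
w_b, noise_b = init_then_noise(seed=42, n_weights=4)
w_c, noise_c = init_then_noise(seed=42, n_weights=6)  # two extra "parameters"

print("same setup reproduces exactly:", (w_a, noise_a) == (w_b, noise_b))
print("shared weights still match:   ", w_c[:4] == w_a)
print("later randomness matches:     ", noise_c == noise_a)
```

In this toy the first four weights are unchanged when more are added; whether that holds in a real model depends on the order in which its layers are created.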
And it seems entirely possible that each of the interventions could change the order of execution of different steps in non-obvious ways, which would lead to dropout being applied in different ways in different runs. So, the question was: what kinds of randomness -- in terms of the initial weights, or in terms of the training run -- did each intervention potentially change vs the baseline? Disregarding the full-fat float32 run: Given that, I wanted to get two measures of how sensitive to noise each phase of the training run was: the initialisation of weights at the start, and the training run itself. I decided to start by nailing down exactly what the training run started with. We already had a baseline training run with a specific state of the random number generator at the start; in our "real" baseline, we seeded with 42 at the start, and then initialised our weights. After that, the random number generator would have reached some specific state based on its initial seed and how many numbers had been generated so far. Now, in theory, we could get the RNG into that specific state by seeding it with some number A at that point. We don't know what A is, of course. But it seems vanishingly unlikely that it would be something we'd come up with -- specifically, we can be pretty sure that A ≠ 23 and A ≠ 67 . So, I put the old initial seed of 42 back in, but re-seeded after the model had been initialised: Firstly, with a re-seed value of 23: I let that run.... ...and got this model . Time for the normal evals: Next, I did another training run, the same as the previous one, but with 67 instead of 23 for the re-seed: That one ran: ...producing this model , which eval'ed like this 3 : Let's bring those together: That's a mean of ~3.684462, with a variance of ~0.0000752 and a standard deviation of ~0.008672. Those are tiny compared to the numbers from the two trains we did with the change of the seed prior to the model initialisation. 
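The summary statistics for these three same-weights runs are easy to recompute with the standard library. Doing so from the three losses listed above reproduces the mean of ~3.684462, but the quoted ~0.0000752 comes out as the sum of the squared deviations; dividing that by n = 3 gives a variance of ~0.0000251 and an SD of ~0.005007. That is an observation from the arithmetic alone, assuming these three figures are the only inputs. If it holds, the spread from training-loop randomness is smaller still, which points in the same direction as the conclusion here. The variable names below are illustrative.

```python
import math
import statistics

# Test-set losses for the three runs that share the seed-42 initial weights.
same_weights_losses = [3.691526, 3.681356, 3.680505]

mean = statistics.mean(same_weights_losses)
sum_sq = sum((x - mean) ** 2 for x in same_weights_losses)

pop_var = statistics.pvariance(same_weights_losses)  # divides by n
pop_sd = statistics.pstdev(same_weights_losses)      # sqrt of the above
sample_sd = statistics.stdev(same_weights_losses)    # divides by n - 1 (Bessel)

print(f"mean:                      {mean:.6f}")
print(f"sum of squared deviations: {sum_sq:.7f}")
print(f"sqrt of that sum:          {math.sqrt(sum_sq):.6f}")
print(f"population variance:       {pop_var:.7f}")
print(f"population SD:             {pop_sd:.6f}")
print(f"sample SD (n - 1):         {sample_sd:.6f}")
```

The ratio of `sample_sd` to `pop_sd` for three samples is sqrt(3/2), the "22% or so" bump mentioned in the footnote on Bessel's correction.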
That actually surprised me a bit; we're using dropout in all of these training runs, and it's dropping a random 10% of activations in every forward training pass. With our different training run starting seeds, they should be getting very different dropout patterns. Hand-wavingly, perhaps over the three million or so sequences we're training on, it averages out? Still a little counterintuitive, though. Anyway, let's take a look at the intervention results again, this time highlighting the ones that we believe will be starting with the same weights: Using the "99.7% should be within three SDs" heuristic, we get a range of 3.658446 - 3.710478. Of the intervention runs with (I believe) stable weights, only the no-AMP and the gradient clipping ones are within that range. That made me feel quite positive. If my beliefs are correct about which runs have the same weights, then noise in the training runs seems unlikely to be causing the differences -- that is, perhaps the results from the interventions for those same-weight training runs are real signal and not just noise. What would happen if instead of pinning the seed for generating the weights and varying the starting seed for the training run, we varied the weight seed and pinned the training one? We'd already done a training run with a seed of 42 before generating the weights and a re-seed to 23 after that: So I decided to see what would happen if I varied the pre-weights initialisation seed. Let that train: ...getting this model . Evals: Next, one with 67 as the weights initialisation seed: That trained: ...getting this model , and 4 : OK, so here we have: Compared to the SD we got when we varied just the initial seed, 0.0154919, it's not too far off. 
Using the 3-SD rule, we get a range of 3.637030 - 3.709400, and looking at the table again, this time with the ones that we don't expect to have the same weights highlighted: ...we can see that the QKV bias is well within that range (as are all of the interventions apart from the two negative-effect ones and scheduling the learning rate). Right, what does all of that tell us? This post obviously isn't even trying to be statistically rigorous. The number of training runs I've done and the amount of data is way too small for that. However, training runs are expensive (Lambda have raised their prices again, so these cost more than US$50 each!), so there's a limit to how much I can do. But even with the limited amount of data, something seems pretty clear: "One of these things is not like the others". Keeping the model weights stable and only allowing variation in randomness across the training run itself meant that almost all of the differences between training runs disappeared. Could this be a result of the small number of samples? I guess conceivably it might, but it seems vanishingly unlikely. So I feel reasonably confident in saying that the bulk of the variation in results that we can chalk up to random noise in these training runs comes from variations in the model weights' initialisation. Additionally, the first training run in this post -- the re-run of the baseline model with no changes -- gave exactly the same numbers as the original baseline run. So we can be confident that all of the models with no changes to the weight initialisation started with the same weights. Of course, I could be wrong about which models really did have the same weights, but given that they were running the same code with the same seed, I'm pretty much sure. That makes me fairly confident that the intervention runs that had the same initial weights gave a real signal about whether or not the intervention in question actually helped. 
The only exception is gradient clipping, which fell within the three-SD range for the same-weights tests -- and it's essentially free, adding just 100 seconds to a three hour training run. That's a really interesting result! As I said earlier, given that dropout is making us ignore a random 10% of activations during the training run, I would have thought that changing which random 10% were being ignored would have a much larger effect. And that's not even considering other sources of random noise in the training run. I was less surprised that model weight initialisation was important, though. It's pretty obvious that your starting position in the loss landscape is going to affect where you end up at the end of the training run. Still, we now have a reasonable level of trust that our interventions gave a real signal, so I think we have everything in place to see how they stack together, and do a best-effort training run. Can we approach the original GPT-2 small weights' performance on our test set loss? It should be fun to find out :-) Numbers chosen based on a misremembering of this XKCD . For some reason (perhaps because it rhymes) I thought that the old-timey funny number thing was "22 skidoo" rather than "23 skidoo".  ↩ On working through this later: with n samples from a dataset, it is (as I understand it) best to use n − 1 as the denominator here (Bessel's correction) for the "sample variance". If we had every possible value, then it would be correct to use n . However, while this changes a few details in the analysis, I don't think it changes the final conclusion of the post meaningfully (it would just bump up the SDs by 22% or so), so I've left it as-is.  ↩ I found it interesting that this model does the "you and I" hypercorrection that so many people do when trying to write formally! Based on the (correct) correction of "me and you move back home" to "you and I move back home", I think as a result of excessive pattern-matching.  
↩ Another grammatical error based on pattern-matching -- it would make sense that the possessive form of "it" in English was "it's", just like the possessive form of "John" is "John's".  ↩ I trained a baseline model on an 8x A100 40 GiB per GPU machine on Lambda (which was better than my original locally-trained model, I believe due to the larger batch size that the larger machine made possible). I tried adding gradient clipping to see if that would help by limiting the effects of loss spikes. I tried removing dropout , given that these days people tend not to use it (because we're doing single-epoch training runs). I tried adding bias to the attention weight matrices -- something that was popular back in the GPT-2 era, and was used by the original weights, but which my code did not use. Instead of just using the learning rate of 0.0004 that was used in the code from the book, I looked into what values people use these days, and learned how to schedule it over the course of the training run . Similarly, I learned more about weight decay and tried some alternative values. Then I tried making my model more like the original GPT-2 one by introducing weight tying to see if that would help. Finally, I decided to try training in "full-fat" float32 instead of using PyTorch's AMP and TF32 matrix multiplication performance enhancements. Weight tying and the number for weight decay I derived from a paper by Cerebras Research (probably without understanding it properly) were negatives. Full-fat float32, gradient clipping, attention biases, the GPT-2 weight decay parameter, removing dropout, and scheduling (and updating) the learning rate were positives. We would expect ~68.2% of results to be within one SD of the mean -- that is, between 3.6573651 and 3.6883489. Interestingly, our actual baseline result is outside that range! But it does include both the gradient clipping and the QKV bias results. 
We would additionally expect ~95.4% of the results to be within two SDs, which is 3.6418732 to 3.7038408. That includes our baseline and our weight decay result (though not our experiment removing dropout -- the six-DP loss number for that is 3.641282). Finally, we'd expect ~99.7% of results to be within three SDs, which is a range from 3.6263813 to 3.7193327. That covers all of our positive results apart from scheduling learning rate! Gradient clipping: randomness only affected the training run -- the weights it started with would have been exactly the same as the baseline model's. Removing dropout: although this is a parameter on the model, I don't think it changes the initial weights. But in the training run, it certainly does affect randomness by removing its use of the random number generator. Adding bias to the attention weights. This will change both the initial weights -- because we have those bias weights, things will be initialised differently -- and as a result, the training run, as the random number generator will have been sampled a different number of times prior to the run. Changing and scheduling the learning rate certainly should not change the initial weights, but it might conceivably have a non-obvious effect on training. Likewise weight decay; no effect I can see on the initial weights, but it could well change training dynamics. Weight-tying. When I added it to the code , I tried to do so in such a way that the other weights would be unaffected -- I created exactly the same weights as I would without weight tying, then threw away the output head and replaced it with a reference to the input embedding weights. So I think that in theory, this one won't have changed the other model weights (apart from ignoring the initialised-but-thrown-away output head), but it could well have changed the training run. 
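The one-, two- and three-SD ranges quoted above follow directly from the reported mean (3.672857) and SD (0.0154919). Here is a minimal sketch of that arithmetic; the function names `sd_range` and `smallest_band` are hypothetical helpers made up for this illustration, not part of the training code.

```python
# Illustrative only: reproduces the SD bands quoted in the post from the
# reported mean and SD. Helper names are hypothetical.

MEAN = 3.672857
SD = 0.0154919


def sd_range(mean, sd, k):
    """Return the (low, high) interval that lies within k SDs of the mean."""
    return (mean - k * sd, mean + k * sd)


def smallest_band(value, mean, sd, max_k=3):
    """Return the smallest k (1..max_k) whose band contains value, else None."""
    for k in range(1, max_k + 1):
        low, high = sd_range(mean, sd, k)
        if low <= value <= high:
            return k
    return None


for k in (1, 2, 3):
    low, high = sd_range(MEAN, SD, k)
    print(f"{k} SD: {low:.7f} to {high:.7f}")

# Two of the six-decimal-place losses mentioned in the post.
baseline_loss = 3.691526
no_dropout_loss = 3.641282
print("baseline band:", smallest_band(baseline_loss, MEAN, SD))
print("no-dropout band:", smallest_band(no_dropout_loss, MEAN, SD))
```

With these inputs the baseline lands in the two-SD band and the no-dropout result only in the three-SD band, which matches the description above.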
Our normal baseline: weights initialised with seed 42, and training run starts with a "seed" of our imaginary A value from above: 3.691526
The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356
The second run above: weights initialised with seed 42, and training run starts with a seed of 67: 3.680505

The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356
Mean: ~3.673215
Variance: ~0.000145
SD: ~0.012062

Varying the random seed at the start, prior to initialising weights, and not constraining the starting point for the training runs, gave a mean of 3.672857, with an SD of 0.0154919.
Keeping the same seed for model weights (so that they all started with the same weights), and varying the seed for the training run, gave a mean of 3.684462, with an SD of 0.008672.
Varying the seed for the model weights (so that they all started with different weights), and keeping the training run seed pinned, gave a mean of 3.673215 and an SD of 0.012062.

Giles's blog 1 week ago

Writing an LLM from scratch, part 32h -- Interventions: full fat float32

This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Back when I did my first training run for a base model, on my local RTX 3090 , I used two optimisations: The first of those boosted training speed from 12,599 tokens per second to 15,402 in my test harness, while AMP on its own boosted it to 19,921 tps (and also allowed me to increase the batch size from 5 to 6). Doing both appeared to hit some kind of diminishing returns -- it maxed out at 19,997 tps, only a little better than AMP on its own. But intuitively, you'd expect that might come at a cost. While I'm sure the PyTorch developers have solid understanding of where switching to 16-bit will have a minimal impact on training quality, it seems too good to be true that it would have no impact at all. Let's see what happens if we switch both of these optimisations off! I added a new flag to the config file for the training harness, with a default of 1 . The core implementation was pretty simple; where we had the call to , we needed to guard it: ...and where we did the forward pass and the loss calculation, we had to not wrap it in a : We also had to avoid unscaling when clipping gradients ; I did that by just not creating a scaler when in non-AMP mode, and then: ...and likewise, instead of using the scaler to step the optimiser, we step it directly if we don't have one: However, there was an issue: non-finite gradients. As I discovered when looking into gradient clipping , the scaler was actually doing something quite useful for us. Somewhat buried in the AMP recipes page is a comment: Now, from the gradient clipping train, I'd come to the conclusion that we were occasionally getting non-finite gradients, and the scaler was saving us from applying junk updates when that happened. 
If our new code was stepping the optimiser directly, we'd not have that safety net. We'd need something to save us from that. My first cut at this was to use the one other API feature I'd seen that handled non-finite gradients for you: has a parameter, so if we were using gradient clipping, we could set that to and use the exception to skip stepping the optimiser if it was raised. To avoid actually doing any gradient clipping when that happened, if we did not have gradient clipping explicitly enabled, we could set the to infinity. Here's the code for that version . I wasn't very happy with it, though. The use of a gradient clipping API just for its side-effect of telling us about non-finite gradients felt a bit ugly, and even worse, the exception it raised was just a generic , not a custom exception type, which meant that I had to distinguish between it and other by looking at the exception message -- not terribly safe, as that's something that could easily change in the future. So I switched to a more explicit, simpler version: scan through the parameters looking for non-finite gradients, and skip the optimiser step if any are found: I did have some concerns about the performance impact of that; on my local machine it took about 0.13 seconds to scan all of the parameters like that for one step. However, it's better than failing to train the model at all due to garbage updates! So with that, it was time to do the training run. It was pretty clear that I would not be able to run this with my normal microbatch size of 12 on the 8x A100 40 GiB machines that I'd been using so far for these intervention tests -- AMP and the lower-precision matrix multiplications save a bit of VRAM, and I was already pretty much at the limit of what would fit in there. 
Changing the batch size would make this a poor test of the effects of removing the FP precision stuff in isolation, so I decided that the safest minimal change was to use a machine with more VRAM -- specifically an 8x A100 80 GiB, as that was the closest to what I was using (switching to eg. H100s would add all kinds of confounding changes). The next problem was getting any kind of machine at all! Lambda (they appear to have rebranded away from "Lambda Labs") very rarely seemed to have any available instances, never mind the specific type that I wanted. Eventually, I put together a system to poll their API and launch an instance when one was available. At 3:25am today 2 , I got a Telegram message from the script saying that it had managed to find and start one. I kicked off the training run, and watched as it got started. I could see it was using 43.8 GiB/GPU, so it definitely did need the larger instance type. And it quickly became clear that this was going to be a long one -- it was estimating 8 hours to do the complete run! In a way that was good news, though, as I could just set an alarm and go to bed. When I woke up, it was done: That's 8h7m. For comparison, the baseline train took 3h24m, so we're taking more than double the time. Cost-wise, things were even worse -- more than US$135 in server costs, because as well as needing the server for much longer, being a larger machine it cost US$16.48/hour rather than $11.84. So that's more than three times as expensive as the US$42 that a typical recent train has cost me (Lambda raised their prices, so it went up from about US$35 in February). Still, at least it looked like a solid run: Very similar to the others we've seen in this series. Time to upload it to Hugging Face Hub , and on to the evals to see if all of this extra cost was worthwhile. Firstly, the smoke test -- how did it complete ? Not bad at all! But the important metric is the loss on the test set, and for that I got 3.679. 
Let's add it to the table to see how that compares to the other training runs: So, a tiny improvement over our baseline. Taking more than twice as long on the training run, and spending three times as much, gained us a loss improvement that's smaller than any other successful intervention. The first question is, did removing AMP and lower-precision matrix multiplications lead to a better model? The answer appears to be "yes" -- but it's a tiny enough difference that it could well be in the noise. But the follow-up has to be, was it worth the extra cost in time and money? And for that I'm certain that the answer is "no". If we'd spent twice the time training with AMP -- on an extra 3B-odd tokens, or on a second epoch with the same 3B -- it seems implausible that the resulting loss would not have been better. And anyway, given that my goal with these interventions is to train the best model I can in two days locally (or 3h30m or so on an 8x A100 40 GiB), it's pretty clear that if we'd cut this run off about halfway through it would have been worse -- and that's not even accounting for it being more memory-hungry. So, I think the takeaway from this is that AMP appears to be a huge win, at least for this model. It has a tiny cost (if any) in model quality, and a huge benefit in training speed, plus a smallish but still useful benefit in training VRAM requirements. 3 And with that, I've reached the end of the interventions that I wanted to try ! Next, I'll need to think through what we need to do to try to stack them up. In particular, is there any easy way to work out whether any of the improvements I've seen might be due to random noise? After all, even though I've been carefully using explicit seeds, each intervention will have changed the way the training run uses the random number stream, and that could easily have an effect. Stay tuned! 
The name of the flag is not quite right, as of course we're switching off not just AMP but the matrix multiplication precision, but it's a decent shorthand. ↩
I'm a night owl, so luckily I was still awake. ↩
I have to admit that I'm very tempted to see what effect even bigger moves in the low-precision direction might have. What if I moved to some kind of 16-bit training, like ? After all, most of the open weights models like Qwen are at least released at that kind of bittedness. But that's one to look into later, I think. ↩

Setting the 32-bit floating point matrix multiplication precision to "high" rather than to "highest", which means that it uses lower-precision (but still technically 32-bit) TF32 for those operations rather than normal float32.
Using PyTorch's Automatic Mixed Precision (AMP), which allows it to use 16-bit calculations rather than 32-bit in places where it makes sense to do so.
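The "scan for non-finite gradients and skip the optimiser step" idea from the post can be sketched without any framework. This is a pure-Python illustration under stated assumptions: parameters and gradients are plain nested lists, the update is a plain SGD step, and the names `has_non_finite` and `guarded_sgd_step` are hypothetical. The post's real code works on PyTorch parameters and its optimiser, and is not reproduced here.

```python
import math


def has_non_finite(gradients):
    """Return True if any gradient value is NaN or +/- infinity."""
    return any(
        not math.isfinite(g)
        for grad_group in gradients
        for g in grad_group
    )


def guarded_sgd_step(params, gradients, lr):
    """Apply a plain SGD update in place, unless any gradient is non-finite.

    Returns True if the update was applied, False if it was skipped.
    """
    if has_non_finite(gradients):
        # Skip the whole step rather than applying a junk update.
        return False
    for param_group, grad_group in zip(params, gradients):
        for i, g in enumerate(grad_group):
            param_group[i] -= lr * g
    return True


params = [[1.0, 2.0], [3.0]]
good_grads = [[0.5, -0.5], [1.0]]
bad_grads = [[0.5, float("nan")], [1.0]]

print(guarded_sgd_step(params, good_grads, lr=0.1), params)
print(guarded_sgd_step(params, bad_grads, lr=0.1), params)
```

The design choice mirrored here is that one bad value anywhere causes the entire step to be skipped, which is the safety net the post describes losing once the scaler was removed.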

Giles's blog 1 week ago

Automating starting Lambda Labs instances

I've been trying to get an 8x A100 instance on Lambda Labs to do a training run for my LLM from scratch series, but they're really busy at the moment, and it's rare to see anything. Thanks to the wonders of agentic coding, I spent an hour today getting something up and running to help, which I've called lambda-manager. It has three commands:

One which prints which kinds of instances are available.
One which prints out all of the possible instance types (available or not) with both their "friendly" names -- what you'd see on the website -- and the instance type names that the API uses.
One which polls the API until it sees a specified type of instance, at which point it starts one and sends a Telegram message.

Let's see if that helps -- though it's been running for six hours now, with no luck...

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32g -- Interventions: weight tying

In Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", he writes that weight tying, while it reduces the parameter count of a model, in his experience makes it worse. As such, apparently people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post. But as I'm trying various interventions to see if I can get my model -- based on Raschka's code, but trained for a fraction of the time that the original GPT-2 model was -- to perform as well as the original in terms of the loss it gets on a test set, I thought it would be worth seeing if it really is a negative for this particular tiny model of 163M parameters. After all, the original weights use weight tying, and I did find that QKV bias appeared to help -- and that's another old-school technique that they used, which has since dropped out of fashion. Might this one help too? Worth a try! Let's give it a go. I'll start with a quick refresher on what weight tying is, and how it works. This is really targeted at people who've been reading along with this series -- if it's all new to you, you might find my post on Maths for LLMs a useful catch-up guide first. In our LLM code, right at the start, we use an embedding layer to take our input token IDs, and turn them into embeddings -- each token becomes a vector in a high-dimensional space (768 in our case), which we see as representing in some manner the "meaning" of the token. A useful way to think about that is that we could start with a one-hot vector for the token -- that is, with our 50,257-token vocabulary, it would be 50,257 items long, and have zeros in every position apart from the position corresponding to the token's ID. We'll treat that as being a vector in a "vocab space". 
The process of converting the token into an embedding turns out to be equivalent to multiplying that vocab space representation by an embedding matrix -- one with one row per possible token, the values in that row being the values for the appropriate embedding. 1 Because matrix multiplications can be seen as projections between different spaces, we can see that as a projection from our vocab space to the embedding space. Once we've projected our sequence of tokens into a sequence of embeddings, we do all of the steps required for the LLM -- we add in positional information, run it through the Transformers layers, normalise it, and then we have a new sequence of embeddings. The embedding at position n in that output sequence, if our model is working well, should be something that represents an appropriate next-token prediction for the portion of the input sequence from zero to position n . What we want as our final output is to map that back to the vocab space. We want logits: a list of numbers that (after being run through softmax) will represent the probability that our next token is a particular one. Just as we mapped from vocab space to embedding space with (conceptually) a matrix multiplication at the start of the process, we can map back with another one. More specifically, if we treat the embedding matrix as having the same number of rows as there are input tokens (which we'll call d vocab ) and columns as there are embedding dimensions ( d emb ), then the original vocab-space-to-embedding-space matrix will have this shape: So it's projecting from a d vocab -dimensional space to a d emb -dimensional one. Similarly, our matrix to do the projection at the end is just a matrix with the numbers of rows and columns swapped around: ...to do a projection in the other direction. The trick with weight tying is to see that these two projections can potentially be just the opposite of each other. 
If we assume that the embedding space on the way in to the LLM is essentially the same as the embedding space on the way out, then we can use one projection to go into it from vocab space, and the opposite to go back. The "opposite" in this case is the transpose -- that is, if we use W emb for our embedding matrix and W out for the output one, we have: That means we can re-use all of the embedding parameters for the output projection matrix, and fewer parameters means not only a smaller model, but hopefully faster training. Sounds like a win! But of course, there's no such thing as a free lunch. By constraining the output head to be the transpose of the input one, we're essentially enforcing that assumption above: we're saying that the embedding space on the way out must be the same as the embedding space on the way in. That limits what the LLM can do -- if it were able to use different embedding spaces at each end, it would have more flexibility, which might help it learn to model things better. That's the theory: what does it mean in practice? Let's take a quick look at the GPT-2 code -- just the for the top level class: For our embedding layer, we use PyTorch's class, and for the output head we use . Now, provides us with access to the underlying matrix with a field: (Tensor) -- the learnable weights of the module of shape ( , ) initialized from 𝒩 ( 0 , 1 ) . So, that's exactly the d vocab × d emb matrix that we'd expect -- it's the input dimension as the rows, and the output dimension as the columns. If we look at , we see something very similar: weight (torch.Tensor) – the learnable weights of the module of shape ( , ) The values are initialized from 𝒰 ( − k , k ) where k = 1 in_features That's actually the other way around, output dimension as the rows and input as the columns. If you're wondering why, remember that we transpose the weights matrix for a neural network before using it . 
But that's actually really convenient in our situation, because if we want to use the same weights for both, they're already "compatible"! And that means that adding weight tying to our code above is as simple as adding two lines at the end: For the model code, it literally is just that! There is a tiny inefficiency in that PyTorch is going to spend a bit of time initialising the weights in to appropriately-sized random values, only to have them all replaced -- but that actually works in our favour, because it means that we'll use up the same amount of the random number stream when creating the LLM in both the weight-tying and non-weight-tying cases, which is a bit better for reproducibility. There is one other change needed, though. I ran a test train with that code, and checkpointing failed like this: Safetensors doesn't like it when you reuse weights like we're doing here. The good news is that the help page the error links to is exactly about this problem with weight tying, and the suggested fix -- to replace ...and similarly for loading -- appears to work fine. Saving and loading checkpoints works, and it's compatible with the old checkpoint files too. So that's good news :-) So, that's how we code it. How much actual saving do we get in terms of the parameter count by doing this? A quick-and-easy way to count the parameters is just to create an instance of the model and see: So, we've gone from a 163M-parameter model to a 124M-parameter one. That's certainly quite some saving -- 38,597,376 fewer parameters, which is a reduction of almost a quarter. We can also sanity check the size of that saving -- our output head was, as we know, a d emb × d vocab matrix, so it should have 50257 × 768 parameters -- which is, indeed, 38,597,376. Excellent. Now, there's one thing we should consider here. We're training on a Chinchilla-optimal number of tokens, 20x our parameter count. Is that what we want to keep stable? 
Or is the total number of training tokens the important bit, so we wind up technically overtraining? My instinct is that the total training tokens is the important thing. Chinchilla optimality is a training heuristic rather than a true aspect of the model, so sticking with it would mean that we're training a model with fewer parameters on less data. It seems very unlikely that would do anything other than produce a worse model! So: we'll keep the same number of training tokens, and just introduce weight tying. How does it train? I kicked it off on the usual 8x A100 40 GiB machine, and after a little while I checked the loss chart. It looked like this: Yikes! It started off with a loss of about 460. Normally, we start with a loss of about 11. The normal loss makes a lot of sense. If you consider it in terms of perplexity, that value of 11 comes out at e 11 ≈ 59 , 874 -- that is, the model is giving pretty much equal probabilities to every one of the 50,257 possible tokens. A loss of 460 means that the model is making incorrect predictions and is very certain about them. How could that be? Well, let's look at the documentation again. (Tensor) -- the learnable weights of the module of shape ( , ) initialized from 𝒩 ( 0 , 1 ) . weight (torch.Tensor) – the learnable weights of the module of shape ( , ) The values are initialized from 𝒰 ( − k , k ) where k = 1 in_features They're initialised completely differently. Embeddings are set to values in a normal distribution (that is, a Gaussian bell curve) with a mean of 0 and a standard deviation of 1. But linear layers are set to random values in a uniform distribution (that is, a completely flat one) within a range based on the number of input features. In particular, those numbers for the linear layer are really small! 
Our output head has set to 768, so that means that the k would be: So instead of getting that kind of "ideal" linear layer initialisation within the range ( − 0.0360 , 0.0360 ) , we're getting numbers which roughly 2/3 of the time will be in the range ( − 1 , 1 ) , and the rest of the time will be even further from zero -- we could be getting -3 or +4, or potentially even crazier numbers! That means that the output logits (coming from a linear layer with higher weights) will be larger, which in turn will push softmax to come up with higher probabilities: I considered changing things to initialise the weights differently, but given that the loss had fallen to 8 or so by the second checkpoint, I decided to just let the run complete. Here's the final loss chart, with the Y axis fixed to run from 0 to 12: That's a nice smooth curve, at least! The output is: Timing-wise, that's about 180 seconds faster than our baseline model training run, only a 1.5% speedup -- clearly the lower number of parameters doesn't actually save us much time. Loss-wise, the final train loss on the baseline model was 3.743, so that's not particularly promising. Still, the proof is, as ever, in the evals. Smoke test first: Borderline coherent, but maybe worse than normal? Let's see what our test set loss looks like. That's bad -- let's see it in our comparison table: Our worst model so far :-( Weight tying certainly didn't help our train. It is worth noting that the GPT-2 small weights -- which do use it -- got 3.500 on the same test set as we're using for that table, so it is possible to get a better model with weight tying. But there was clearly something different about their train, and my suspicion, as I've said before, is that it was trained for many more epochs ( I estimated 40 ), slowly grinding that loss down. But what I'm trying to do in this mini-series of interventions is find tricks that will allow us to approach the original weights' loss without a very long training run. 
And for the purposes of that, I think we can safely say that weight-tying is not one of those. Next time around, our last intervention test! What happens if we switch off the use of automatic mixed precision (AMP)? That is something I added right back at the start as a performance enhancement; it means that PyTorch can do certain calculations in 16-bit rather than 32-bit if it thinks there's no harm in doing so. Might we get better loss by training without it?

In reality we don't multiply a one-hot vector by a matrix, as that would be extremely inefficient -- PyTorch just does a lookup into the embedding matrix. If we get token ID 1234, then it just reads out the contents of row 1234, and that's our embedding. But for the purposes of this post, it's best to see that as more of an (extremely effective) performance tweak rather than what's happening conceptually. ↩
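Several of the numbers in the weight-tying post above are simple arithmetic on the vocabulary size (50,257) and the embedding dimension (768), so they are easy to check. The sketch below uses only the Python standard library; the variable names are made up for this illustration, and nothing here touches the actual model code.

```python
import math

VOCAB_SIZE = 50257  # GPT-2 vocabulary size, as quoted in the post
EMB_DIM = 768       # embedding dimension, as quoted in the post

# Parameters saved by sharing the embedding matrix with the output head.
output_head_params = VOCAB_SIZE * EMB_DIM
print("output head parameters:", output_head_params)

# Cross-entropy loss of a model that gives every token equal probability.
uniform_loss = math.log(VOCAB_SIZE)
print("loss for uniform predictions:", round(uniform_loss, 3))

# Perplexity corresponding to a loss of 11.
print("perplexity at loss 11:", round(math.exp(11)))

# Bound implied by 1 / sqrt(in_features) for a linear layer with 768 inputs.
linear_bound = 1 / math.sqrt(EMB_DIM)
print("linear layer bound:", linear_bound)

# Fraction of standard-normal samples that fall within (-1, 1).
within_one_sd = math.erf(1 / math.sqrt(2))
print("N(0, 1) mass within one SD:", round(within_one_sd, 4))
```

The last two lines are what make the initial loss so high: embedding-style weights drawn from a standard normal are typically tens of times larger in magnitude than the bound a linear layer of this width would normally be initialised within.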

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32f -- Interventions: weight decay

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In my training code, I have this code to create the optimiser: In my last post I looked into the learning rate, the parameter in that code, and found a value for that, plus some extra code to schedule it -- that is, to vary it over time -- which gave better training results. This time I want to go into the weight decay. What is it, what is it for, and is 0.1 really the best value? I was a little concerned going into this that in order to understand this hyperparameter, I'd need to have a good understanding of how the optimiser works; I've been building what I think is a solid mental model of optimisers, but I don't think I understand them well enough to explain them yet, and I've been hoping to delay posting about them to a separate blog post series after this one. The good news is that while weight decay is an important aspect of how optimisers work -- the "W" in AdamW, the thing that makes it different to the older Adam optimiser, is a nod to its different treatment of weight decay -- you don't need to know how the optimiser itself works to understand what weight decay is. Instead, you just need to consider an older and more fundamental aspect of building ML systems -- regularisation. In order to dig into that, let's start with overfitting. Let's imagine a simple classification task: we want to build a model that can -- for any point on this chart -- predict whether a cross or a circle should go there, training it using the sample data points that we already have: Let's say that we train a powerful model on this dataset, and it comes up with this: Now, ab initio we don't know whether that's a good result or not; we need to use our validation set to evaluate it. Let's say that the validation points are these blue ones: We can see that it looks like our powerful model has overfit. 
The training set is all nicely split by the boundary, but the validation points are not. A common solution to how to handle that kind of issue that you might see in introductory ML courses is to try using a less powerful model. A less powerful model in this case might come up with a less "wiggly" line to separate the two categories, perhaps because it didn't have enough parameters to make it wiggle so much, so you might find that it came up with a classifier that looked more like this: So: we use our validation set to detect overfitting, and we can adjust the complexity of our model to try to avoid it. Now, this is all very well, but it does require manual intervention. We had to do a training run, identify that we were overfitting, and then decide on parameters for the new simpler model (how many parameters should it have?). We could, perhaps have gone too far and wound up with something like this: ...and underfit. There's no way when we start out knowing what the right number of parameters is, so we need to try various values and then try to work out the optimum balance. Regularisation techniques are designed to try to automate this -- to prevent overfitting without all that tedious mucking about with the model. We've already looked at Dropout , which is one of the standard ways to do that. Although my own mental model of what it does goes some way beyond just helping to prevent overfitting, I may well be wrong -- and given that our LLM train is never seeing the same training data twice, being a single-epoch run, removing it turned out to improve our model . Another technique is just stopping the training run when you start seeing the validation loss rise, also known as "early stopping". That's such an obvious thing to do that I came up with it independently back when I was doing my early experiments with fine-tuning . 
Now, we don't have a separate validation set for these training runs, but because we're doing a single epoch, the training data it sees is just as "new to it" as a held-back validation set would be, so we could use a similar trick and treat "train loss starts rising" instead of validation loss rising as a reason to stop the train early. It's not exactly the same thing, but perhaps it would be close enough. But in all of the trains in this series, that's never happened -- while sometimes the train loss blips up for a bit, in the longer term it keeps going down. But there are other techniques that rely on a neat trick. Let's think back to the manual, boring way of trying to find how many parameters are appropriate for a modelling task. We tried one number, found that it overfit, then we might try a lower one, find that it underfit, then try something in the middle and find that it's better but still not perfect one way or the other, and rinse and repeat until we find something we're happy with. This kind of searching through a solution space to find an optimum is exactly what we're doing when training a model. It would be really nice to automate it in the same way. One trick is: if we want to minimise the complexity of our model so that it doesn't overfit, we can try adding a measure of the model's complexity to the loss function -- and then our normal process of gradient descent will try to minimise that, just like it will try to minimise the loss from the training results themselves. And that brings us on to weight decay. Regularisation by weight decay starts off with the hypothesis that the "size" of all of the model's weights, taken together, is a measure of the model's complexity. If the model's weights are small, then it's a simpler model than if they're large. 1 The "size" in this sense is the square of the L2 norm -- that's something we came across in gradient clipping . 
The L2 norm is basically all of the weights squared, added together and then the resulting sum square-rooted. You can think of it as the length of the vector that the weights represent -- that is, for our 163M-parameter model, it would be the length of the model's weights considered as a vector in 163-million dimensional space. [2] And by using its square, we get something that penalises larger values more (and we also save the time in calculating a square root).

To me, it's not intuitively obvious that that measure really does express the complexity of the model in any clear sense. After all, you'd think that doubling all parameters would leave it no more complex than it was before, but it would double the L2 norm. [3] But I imagine there is solid maths behind it to say that it does work in a more general way, so in the interests of not disappearing down a mathematical rabbit hole at this stage, I'll take it as given.

So: we're using the squared L2 norm as a measure of model complexity, and we're going to add that on to the training loss as a way to try to minimise both. The next question is, how do we balance between the two -- the training loss and the model complexity penalty? This is, in a somewhat hand-wavy way, similar to the decision of how much of the current loss function's gradient to use when adjusting the weights. For that, we use η, the learning rate, to scale the gradients before applying them:

$$w \leftarrow w - \eta \, \nabla_w \mathcal{L}$$

And the balance between the "real" loss and the model complexity penalty is done in a similar way -- we have a number, the weight decay, normally represented by a lower-case lambda, λ, and we multiply the squared L2 norm by that, something like this:

$$\mathcal{L}' = \mathcal{L} + \lambda \cdot N^2$$

...where I'm using ℒ for the normal loss on the training inputs vs the targets, N² for the squared L2 norm of the weights, and ℒ′ for the combined loss. And ℒ′ is what we -- in theory -- actually try to minimise using our optimiser.
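As a rough illustration of the penalised loss described above, here is a minimal pure-Python sketch. The function names (`l2_norm`, `penalised_loss`) and the toy weights are hypothetical, chosen only for this example; a real training loop would work on tensors rather than lists.

```python
import math


def l2_norm(weights):
    """Length of the weight vector: square, sum, then square-root."""
    return math.sqrt(sum(w * w for w in weights))


def penalised_loss(loss, weights, weight_decay):
    """Combined loss: the normal loss plus lambda times the squared L2 norm."""
    squared_norm = sum(w * w for w in weights)  # no square root needed
    return loss + weight_decay * squared_norm


# Toy weights, purely for illustration.
weights = [0.5, -1.0, 2.0]
doubled = [2 * w for w in weights]

print(l2_norm(weights), l2_norm(doubled))  # doubling the weights doubles the norm
print(penalised_loss(3.9, weights, weight_decay=0.1))
```

Note that doubling every weight doubles the norm but quadruples the squared norm, which is why the squared version penalises large weights more heavily.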
But there's actually a neat simplification that we can apply to make this even easier. Firstly, let's make one small change to the equation above: we'll halve the squared L2 norm before multiplying it by λ. That obviously doesn't change the underlying maths, it just means that we'd need to use larger values for λ to get the same effect. You'll see why that's useful in a bit.

Now let's think about normal gradient descent. Again, we work out the gradient of the loss function for each weight, and subtract that times the learning rate η from the weight's value to update it:

$$w \leftarrow w - \eta \, \nabla_w \mathcal{L}$$

Let's reformulate that a bit. The gradient of the loss function for the weight is its partial derivative against that weight, so we can write the above like this for the version of the loss function including weight decay, ℒ′:

$$w \leftarrow w - \eta \, \frac{\partial \mathcal{L}'}{\partial w}$$

Now, we defined ℒ′ above as ℒ + λ · N²/2, so we can substitute that in there:

$$w \leftarrow w - \eta \, \frac{\partial}{\partial w} \left( \mathcal{L} + \lambda \cdot \frac{N^2}{2} \right)$$

Now, let's think about that L2 norm, N. It's the square root of the sum of all of the weights squared, or equivalently we can square it (like we do in the formula above) and say:

$$N^2 = w_0^2 + w_1^2 + \dots + w_n^2$$

Let's drop that in:

$$w \leftarrow w - \eta \, \frac{\partial}{\partial w} \left( \mathcal{L} + \lambda \cdot \frac{w_0^2 + w_1^2 + \dots + w_n^2}{2} \right)$$

Now, the derivative of a bunch of things added together is just each of them differentiated separately and then added together. Let's apply that to the two terms in the brackets:

$$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\partial}{\partial w} \left( \lambda \cdot \frac{w_0^2 + w_1^2 + \dots + w_n^2}{2} \right) \right)$$

...and now pull the constant λ and the 2 out of the second partial derivative:

$$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\lambda}{2} \cdot \frac{\partial}{\partial w} \left( w_0^2 + w_1^2 + \dots + w_n^2 \right) \right)$$

Then we apply the rule for the derivative of a bunch of things added together again:

$$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\lambda}{2} \left( \frac{\partial w_0^2}{\partial w} + \frac{\partial w_1^2}{\partial w} + \dots + \frac{\partial w_n^2}{\partial w} \right) \right)$$

Now, we're doing a partial derivative versus one specific weight, w, which is one of the w₀, w₁, and so on in there. From that perspective, all of the other weights are constant -- which means that their derivative with respect to w is zero. So we can just get rid of all of them apart from the one that actually is w, and we wind up with this:

$$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\lambda}{2} \cdot \frac{\partial w^2}{\partial w} \right)$$

The derivative of w² with respect to w is just 2w. Thanks to that crafty halving of the N² earlier, that means that we can go to this:

$$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \lambda w \right)$$

Multiplying that −η across the bracketed terms, we get:

$$w \leftarrow w - \eta \, \frac{\partial \mathcal{L}}{\partial w} - \eta \lambda w$$

That's exactly the same as the normal gradient descent update, using the unmodified loss function without weight decay -- except that we're additionally subtracting the weight's original value scaled down by both the learning rate η and the weight decay value λ. Much simpler :-)

(As an aside: the description above is correct for "traditional" simple gradient descent and -- loosely -- for Adam, but AdamW's trick is to do things somewhat differently. That's something I'll go into in more detail when I get round to writing my post on optimisers.)

So: weight decay is a regularisation technique that tries to prevent our model from getting any more complex than it needs to be. We have one number, λ, which determines how much to weight complexity against the normal training loss. And, as we can see from the code: ...right now we're setting λ to 0.1. Is that the right value?

As usual, the GPT-2 paper is light on the details of the hyperparameters they used, but nostalgebraist wrote a really nice post on Tumblr where they dug into what the number might have been. As they say: "It does say it follows the first GPT paper in most respects, and that paper used weight decay of 0.01." Their link for the paper appears to be mistaken, as it's a different (albeit very interesting) paper from 2020, a year after the GPT-2 one, but I believe this is the paper normally called the GPT-1 one. They do indeed use 0.01 there: "We also employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non bias or gain weights."
The link to the GPT-3 paper looks right, though, and as they say, it uses a weight decay of 0.1: All models use weight decay of 0.1 to provide a small amount of regularization They then do a bit of maths to work out whether the GPT-2 weights are likely to have been regularised by something like weight decay, and come to the conclusion that they probably used 0.01, just like the GPT-1 paper. It seems plausible, but of course not certain. But: tentatively, GPT-2 used 0.01, while we're using 0.1, perhaps because the GPT-3 paper does. What other data points do we have? The Hugging Face "Smol training playbook" has some interesting stuff (including not using weight decay on embeddings, which they say they found helped), but the value that they use is 0.1, which they call "a very vanilla setting". And: Interestingly, over the last few years the AdamW hyperparameters have barely moved: The same triplet is reused in Llama 1, 2, and 3 and DeepSeek-V1, V2, and V3-671B, with no changes. Anyway, assuming they're right about weight decay value for the models they mention (and I assume they've done the research -- I had the link to the DeepSeek paper to hand, and that one certainly says 0.1), it looks like 0.1 is pretty much standard these days. And a quick double-check of what a typical value would be -- asking ChatGPT, Claude, Gemini and Grok -- they all recommend 0.1 as a solid sensible default with AdamW (though they all also say that values between 0.01 and 0.1 are reasonable). So on that basis, I think we can say that 0.1 is a reasonable default, and has pretty much become the standard, but it might be worth trying 0.01 just to see if it does help with tiny models like ours. Are there any dissenting voices to the 0.1 orthodoxy? I came across a paper from a team at Cerebras Systems , " Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training ". 
It's essentially a Chinchilla-like attempt to get scaling laws, but rather than looking just at optimal tokens per parameter in order to work out what you should scale up when adding on more compute, they're trying to find optimal batch sizes and values for weight decay. That's certainly relevant to our interests :-) However, it is very dense and in-depth, and fully understanding it at this stage would need quite a lot of work -- very much a side quest. Definitely something to come back to later, but for now, I'll just try to extract the stuff we need. Let's start off with the optimal batch size, as they have it right there on the first page. We're not going to use it, but it will be interesting to compare with what we're using, and what the DeepSeek paper that I looked at in the last post suggested. They fit this formula: ...where D is the total number of tokens that you're training on. That's quite different to the formula in the DeepSeek paper, which was: ...where C is the number of FLOPs 4 . C scales up linearly with the number of tokens D , but also with the number of parameters in the model N , so you can see the DeepSeek formula as a function of N and D -- as your model gets bigger, so does B opt -- whereas this Cerebras paper is saying that it's just a function of D , unaffected by model size. They did train over a number of different sizes (from 111M parameters up to 1.7B) and their formula seems to hold, so it's not just that they didn't treat model size as relevant. Well, let's see what their formula comes up with. We have 3,260,252,160 tokens in our train, so their formula for B opt comes out as: That's much closer to the 97-or-so sequences that appeared to be optimal when I did some rough-and-ready curve-fitting than the 373 that the DeepSeek formula gave for our setup :-) OK, so what about the weight decay? They don't give a direct formula for that, but they do give a formula for the optimal τ , the AdamW timescale. 
Without going into exactly what that means right now (that's one for my optimisers post later), they relate it to other numbers that we do know with this formula:

$$\tau = \frac{B}{\eta \lambda D}$$

...where B is the batch size, D is the amount of data, and of course λ and η are weight decay and learning rate respectively. So if we know the optimal τ we can work out the optimal λ for our training run; solving for λ, we get:

$$\lambda = \frac{B}{\eta \tau D}$$

So let's work out the τ_opt. Their fitted formula is this: ...where TPP is tokens-per-parameter. For us, with our Chinchilla-optimal TPP of 20, we get: Now, we're using a batch size B of 96, and (as before) D is 3,260,252,160. Our learning rate η is 0.0004 for this train -- remember, although in the last post we found that a scheduled learning rate with a peak at 0.0014 was better, in this post we're testing changing weight decay in isolation. [5] So, we just need to plug our τ_opt into this:

$$\lambda_{opt} = \frac{B}{\eta \, \tau_{opt} \, D}$$

Before we do: having a batch size and a number of tokens in the same formula feels like a unit mismatch. In particular, as part of the explanation of that formula, they tie it back to a value S, the total number of optimisation steps, which they define as D / B. For that to work, either both need to be in terms of tokens, or both need to be in terms of sequences. They clearly say that "B is reported in units of sequences". I'm not sure how to explain this, except by saying that perhaps the D is meant to be in terms of sequences too, even though I'm pretty sure that it's meant to be in terms of tokens in the equation for the batch size. [6]

Well, let's assume that is the case, and plug in numbers for sequences. We have 3,260,252,160 training tokens split into 1,024-token sequences, which is 3,183,840 sequences, so that comes out as:

$$\lambda_{opt} = \frac{96}{0.0004 \times \tau_{opt} \times 3{,}183{,}840}$$

(Note that we'd get the same numbers if we plugged in numbers for tokens in both cases, as it would just multiply the top and the bottom by 1,024.) That comes out as 0.33724. Wow!
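The arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: the fitted coefficients (0.0306 and 0.383 for the batch size, 1.084 and −0.527 for τ) are the ones I recall from the Cerebras paper rather than values stated in this post, though they do reproduce the 9.47 and 0.33724 figures quoted here. The function names are hypothetical.

```python
TOKENS = 3_260_252_160       # training tokens, from the post
SEQ_LEN = 1_024              # tokens per sequence, from the post
SEQUENCES = TOKENS // SEQ_LEN  # 3,183,840 sequences


def optimal_batch_size(d):
    """Fitted B_opt as a function of dataset size D (coefficients assumed)."""
    return 0.0306 * d ** 0.383


def optimal_tau(tokens_per_param):
    """Fitted optimal AdamW timescale for a given TPP (coefficients assumed)."""
    return 1.084 * tokens_per_param ** (-0.527)


def optimal_weight_decay(batch_size, learning_rate, tau, dataset_size):
    """Solve tau = B / (eta * lambda * D) for lambda."""
    return batch_size / (learning_rate * tau * dataset_size)


tau_opt = optimal_tau(20)  # Chinchilla-optimal 20 tokens per parameter
weight_decay = optimal_weight_decay(
    batch_size=96, learning_rate=0.0004, tau=tau_opt, dataset_size=SEQUENCES
)

print(f"B_opt with D in tokens:    {optimal_batch_size(TOKENS):.1f}")
print(f"B_opt with D in sequences: {optimal_batch_size(SEQUENCES):.2f}")
print(f"tau_opt:                   {tau_opt:.4f}")
print(f"lambda_opt:                {weight_decay:.5f}")
```

Because B and D appear as a ratio, passing both in tokens instead of sequences gives the same weight decay, and the result scales inversely with the learning rate passed in.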
That's even higher than the "traditional" 0.1, never mind the 0.01 that is the best guess we have for GPT-2. Even if I'm missing something here (I certainly can't say I've read the paper in as much detail as it deserves), that actually gives us a nice number to try out as an experiment. We already have a loss on our test set for a model trained with a weight decay of 0.1, as that was what we used in our baseline train. It looks like it might be worth doing two more, one with the GPT-2 estimate of 0.01, and one with this Cerebras-inspired 0.33724, neatly bracketing it. Let's give them a go! Firstly, the training run with λ = 0.01 : Looks like a nice smooth train -- one small loss spike near the start but it quickly recovered. The output was: That's not a bad final train loss (which does tend to indicate a good model). Let's look at the evals; firstly, the smoke test -- how would it complete "Every effort moves you"? Passably coherent. Let's take a look at the loss it gets on our test set: Not bad at all! Time to upload it to Hugging Face and to add it to the table so that we can compare it to the other interventions we've tried so far. So, it's better than gradient clipping and the QKV bias, but slightly worse than removing dropout and much worse than scheduling (and increasing) the learning rate. Now, that suggests to me that the much-higher Cerebras-inspired weight decay will be worse. My logic is this: if both decreasing it and increasing it improved loss, that would suggest that we have an inverted-U loss curve for weight decay like this: Now, it seems vanishingly unlikely that those downward trends on either side would continue so that you could get arbitrarily low loss by increasing or decreasing weight decay even more. 
So the curve would perhaps look a bit more like this W-shaped one: My intuition is that having multiple minima -- especially ones that just happen to be on either side of the "standard" value for weight decay -- seems less likely than the alternative -- that the higher number will be worse because we're actually on a U-shaped curve more like this: Of course, my intuition could be completely off on this, and it's definitely still worth doing the test! Here's the loss chart with that: You can see right away that it was a much choppier train, with quite a few loss spikes, some quite late on. The output at the end reflected this: ...a significantly worse loss at the end. Still, we should do the evals. Firstly the smoke test: Not too bad, but the loss test is the important one: That's terrible! Our first result for loss on the test set for an intervention that is actually worse than the baseline. Much worse: However, at this point I started wondering. When I was looking at the learning rate, the number I selected based on the DeepSeek paper worked well with learning rate scheduling, but failed to converge without. The weight decay number is multiplied by the current learning rate before it's used to reduce weights' values, so will be affected by both scheduling and η . It seemed likely that Cerebras used a learning rate schedule, and double-checking the paper: We present results with a single (standard) learning rate schedule ... For a given TPP, all models have the exact same warmup phase: a linear warmup of the learning rate from 0 to the maximum value. ... We use the µP-tuned and adjusted peak η , for 111M models. The learning rate increases linearly to the peak for the first 10% of steps, then decreases from the peak to 0 for the remainder of steps. Seems pretty certain. 
Now, I've been following a fairly strict rule of testing interventions in isolation; however, the learning rate and the weight decay parameters are so intertwined that perhaps that's just not reasonable here. I decided to do two more trains, both with learning rate scheduling. I'd use the same schedule as in the last blog post -- a warmup from pretty-much zero to the peak over 10% of the run, followed by a cosine decay to 10% of the peak. In the first, I'd use the same learning rate as our baseline model, 0.0004. In the second, I'd use the one we got from the DeepSeek paper, which did really well when scheduled: 0.0014.

Here's the first of those, with the peak at 0.0004. Well, that's less choppy, at least -- the scheduling calmed down the later parts of the run, as you'd expect given that the learning rate was dropping. The output: Still a kind of high training loss at the end, though. The smoke test: Not too bad, and the test set loss: Unfortunately still worse than the baseline of 3.692, albeit better than the one without learning rate scheduling. I'm not going to add it to the table, as this was more in the way of an exploratory training run.

Let's see how we do with the larger DeepSeek-suggested learning rate. For this one, I kept the weight decay at 0.33724. (This was an error, as I realised later -- more on that shortly.) Ouch, super-choppy loss -- and the loss at the end of the train isn't promising either. Terrible loss at the end. The smoke test gives this: ...which is not too bad, but the test set loss: ...is still pretty terrible (though still a tad better than the one without the learning rate scheduling). Another one to throw away, I think.
But then something occurred to me: the formula to go from the optimal AdamW time horizon τ opt to the optimal weight decay λ opt is this: It has the learning rate η in it -- I even made a footnote saying that I was going to have to remember to recalculate the weight decay value when that changed :-S Luckily, though, running the real numbers through that: ...which is almost exactly the same as the 0.1 that we've been using for all of our other experiments. So that actually suggests that the Cerebras equations come up with a reasonably usable number for weight decay if you use the DeepSeek-optimal level for the learning rate, and schedule it in a normal warmup-cosine decay manner. But it's still not as good -- for this model -- as using the GPT-2 number. 7 With that, I think it's time to wrap this intervention up! Let's look at our results table again: We've found that reducing the weight decay from the now-standard 0.1 to a GPT-2-inspired 0.01 improves the loss our model gets on the test set; it's the third-best intervention so far, after getting rid of dropout and updating our learning rate -- and the difference between it and the dropout intervention is pretty small. It did surprise me that the Cerebras-inspired number did so badly, though. To recap: I think that for now, I should not head any further down this rabbit hole and just take the win -- we have a weight decay parameter that works better than the one we had, and so that's something that can go into our set of working interventions. I can revisit the Cerebras paper later when I've spent more time studying optimisers. As to why this old-fashioned GPT-2 value might work better than the current default of 0.1: I think that could plausibly be due to scale. The 0.1 value appears to come from the GPT-3 paper, which essentially was an experiment in scaling up GPT-2. Perhaps larger models need larger weight decays? And the model we're working with here is really small, at 163M parameters. 
So, that's weight decay done! Of the list of planned interventions I wanted to try, only training in full-fat 32 bits (rather than AMP), and weight-tying remain. I think I'll look into the second of those next. Stay tuned! Here's a link to the next post in this series.

The AdamW hyperparameters listed in the Smol training playbook quote above:
β1 = 0.9, β2 = 0.95
Grad norm clipping = 1.0
Weight decay = 0.1 (Llama 3 405B drops this to 0.01)

The recap of the Cerebras-inspired weight decay results:
With our too-low learning rate of 0.0004, it performed terribly.
When we added scheduling, it was a bit better but still not great.
When we used a DeepSeek-optimal learning rate (and actually did the right calculations to get the real value for weight decay based on that), we got a number which was very close to our baseline train, and seems very unlikely on the face of it to have a significantly different resulting test set loss.

Footnotes:
1. More precisely, from Deep Learning: "Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or that put weight on fewer of the features." ...where J(w) is the loss function we're trying to minimise in our training run, combining the "real" loss and a measure of the model's size.
2. I can't decide whether that makes it easier or harder to understand ;-)
3. Wild speculation: how about something using the Shannon entropy of the weights...?
4. Specifically the non-embedding training FLOPs.
5. Note to self: don't forget to adjust it if we do decide to combine this with the learning rate update. Also: I'm pretty sure from reading the paper that the η that they're using in these formulae is the peak -- they certainly are using learning rate scheduling, albeit with a decay-to-zero rather than the decay-to-10% we used.
6. Plugging in the number of sequences into the batch size formula gives us an optimal value of 9.47, which definitely doesn't look right based on the trains I've done.
7. Assuming that the GPT-2 value for weight decay "stacks up" well with the learning rate update and the scheduling from the last post. There may be some useful tests to do when we try to put this all together.

Giles's blog 1 month ago

Writing an LLM from scratch, part 32e -- Interventions: the learning rate

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In my training code, I have this code to create the optimiser: The values in there -- for the learning rate, and for the weight decay -- were just copied from the tiny training run that we do in section 5.2 of the book. What do those values actually mean, and are those really the right values for them? I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before. The weight decay was pretty much an unknown for me -- I know it is a parameter controlling the behaviour of the optimiser, but I don't know how it does that. In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later. If you're reading this blog, you almost certainly know what the learning rate is, but let's go over it briefly to build a solid foundation. The way it's normally explained, using simple gradient descent, goes something like this. Let's assume that we're training a model with just one parameter, and it starts off set to − 5 . We run some training data through, and get a loss, let's say 44.44: We don't know what shape our loss curve is (if we did, we might be able to find the lowest loss algebraically), but we do know the differential of the parameter versus the loss at the point we've measured; it happens to be -13. 
That is reasonably large and negative: We use that information to say that we want to move in the direction of a larger value for our parameter -- that is, in our case where the gradient is negative, so we have a downhill slope towards the right, we want to increase the parameter to move rightwards on that chart, whereas if it were positive (an uphill slope) we'd want to decrease the parameter to move leftwards. Simply subtracting the gradient from the parameter would lead to an update in the right direction, but it would be a very large one in this case -- we'd move 13 units to the right -- so we multiply the gradient by a small positive number, the learning rate (often written as a lower-case eta, like this: η), to move a small distance in that direction. Let's say η = 0.3. That means we want to update our parameter to −5 − 0.3 × (−13) = −1.1.

So now we run that through and get a new loss -- let's say it's 9.06 -- and a new gradient, which happens to be -5.2. Now we can do another update, and our parameter will become 0.46, so we use that and work out another loss and gradient, which come to 3.3816 and -2.08. Let's plot that one, but this time we'll draw back the veil and show the actual loss curve.

Now, it's worth reiterating that while we're training this model we don't know what that curve looks like -- we're just finding points on it, along with its gradient at those points, and using that information to work out which parameter value to explore next. But it's pretty clear that as we continue, we'll get to the minimum eventually if the learning rate is the right kind of size, because -- due to the nice smooth U-shape of the curve -- the gradient gets smaller the closer we get to the minimum. [1]
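The walkthrough above can be sketched in a few lines of Python. The gradient function here is an assumption: a quadratic bowl with its minimum at 1.5, chosen because it reproduces the gradients quoted in the text (-13, -5.2 and -2.08); the post does not state the actual curve.

```python
def gradient(x):
    """Gradient of an assumed U-shaped loss curve with its minimum at x = 1.5."""
    return 2.0 * (x - 1.5)


def gradient_descent(start, learning_rate, steps):
    """Plain gradient descent: repeatedly subtract learning_rate * gradient."""
    x = start
    history = [x]
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
        history.append(x)
    return history


history = gradient_descent(start=-5.0, learning_rate=0.3, steps=3)
for step, x in enumerate(history):
    print(f"step {step}: parameter = {x:.4f}, gradient = {gradient(x):.4f}")
```

Each step covers less ground than the one before because the gradient shrinks as the parameter approaches the minimum, even though the learning rate stays fixed.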
It's also pretty clear that if the learning rate is smaller than an optimal value, in this simple case we will still find the right point, but it will take more steps because each one is smaller: And, of course, if the learning rate is too high, we might never converge -- we'd "bounce out of" the dip, and wind up with a parameter value that endlessly cycles between increasingly smaller and increasingly larger values, zooming off to infinity: OK, that's the basics. Why might we want to change from something that seems so logical and simple? A few paragraphs back I said: due to the nice smooth U-shape of the curve, the gradient gets smaller the closer we get to the minimum What if it doesn't? Imagine if we had something more like a V-shaped curve, like this: The gradient does not decrease as we get closer to the minimum, and so while we're in the downward-sloping part, each update is exactly the same distance: Now, eventually we'll jump over the minimum: In this example, I've used a gradient of − 8.33 on the downward-sloping part of the curve, and + 8.33 on the upward-sloping part, so that means that our next update just bounces us back to where we were before! Because the gradient isn't decreasing the closer we get to the minimum, we wind up just oscillating around it. That's not very helpful. That's a slightly contrived example (though not entirely -- intuitively, with functions like ReLU or GELU in our real LLMs, it's easy to imagine crazy loss landscapes). But it does show that perhaps we might want to add in our own "artificial" way to decrease the size of the steps we take over the course of training our model rather than just relying on the gradients naturally flattening out for us. Another way of looking at things is that as the model gets trained, we don't want batches of very new-looking data to cause big updates, taking us away from what was a good part of the loss landscape in terms of what we've seen so far. 
For example, imagine you've been training an LLM on a bunch of documents, which have so far been in English. Halfway through, it encounters a document in Byzantine Greek, the loss skyrockets, and you do a big update. That would be a problem! You might want it to learn a bit from it to push it slightly in a "the world is multi-lingual" direction, but you don't want it to lose a big chunk of the value from its previous training. You might also see a kind of connection to the way that people learn over the course of their lives -- for babies, everything is new and they "update their parameters" constantly as they try to understand the world. Children are still pretty flexible, but as we get older we tend to update our beliefs less and less. That's not always optimal, but as a heuristic it's pretty adaptive. Anyway, in general: for most training runs, we're going to want the learning rate to adjust over time. Most of the time this will be by reducing it, though there can be cases for increasing it again for periods. The general case of doing this is called "learning rate scheduling". There are a bunch of ways that people adjust the learning rate over the course of a train; here are a few that cropped up a lot while I was researching this. If we want the learning rate to go down over time, and we know how many steps we're training for, we can just set it to (say) 0.0004 for the first quarter of our train, then 0.0002 for the next, then 0.0001, then finish off with 0.00005, like this: That can work pretty well! But there is one obvious oddity -- the big step changes in learning rate mean that the exact placement of the drops and the training data before and after can matter. Why are we treating the data and the state of the model immediately before and immediately after so differently? It would make more sense to have a smoother schedule. What functions decay smoothly like that? 
An exponential curve does: let's say we just multiply the learning rate by a number that is a little smaller than one every step, so that it drops smoothly like this: But there are lots of other curves like that, and one is particularly interesting: As you change θ from 0 to π , the value of cos θ goes smoothly from 1 to − 1 , so it's easy enough to rescale that so that our learning rate follows the same curve: This is called a "cosine annealing" or "cosine decay" schedule, and was apparently inspired by the algorithms used for simulated annealing (an optimisation algorithm that was in turn inspired by how the atomic structures form in metals as they cool -- another one for the list of things to look into in the future...) That solves the mystery from earlier: the cosine that the Chinchilla paper was talking about was exactly this. As it turns out, the cosine decay scheduling curve is quite popular in deep learning, because it has what amounts to two well-defined phases -- an initial high learning rate where lots of exploration of the loss landscape can happen, followed by a smooth transition to something more like fine-tuning to optimise the location in whatever part of the loss landscape we've wound up in. Now, all of the above are assuming that we want the learning rate to start high and finish low, so that we can mimic the textbook gradient descent that we had at the start of this post. Intuitively that feels nice, but on further thought, the important thing is really that we have a low learning rate at the end of the train, so that we can find as close a point as possible for the minimum at the part of the loss landscape we've found ourselves in. But perhaps there's a case for having both high and low periods during the train, so that we don't get stuck in a local minimum -- something to jolt us out of where we were every now and then? 
2 With a step function, that's easy: you could, for example, do this: With an exponential, you could do something like this: With cosine decay, of course, things are even easier, because the cosine function is inherently cyclical, so we can just do this: However, at least for our purposes, training an LLM using a Chinchilla-optimal number of training tokens, it makes sense to be guided by what the authors of the Chinchilla paper did. Appendix B says: We find that setting the cosine cycle length too much longer than the target number of training steps results in sub-optimally trained models, as shown in Figure A1. As a result, we assume that an optimally trained model will have the cosine cycle length correctly calibrated to the maximum number of steps, given the FLOP budget; we follow this rule in our main analysis. So, at this point, I think we have one important part of the intervention we want to make: we want to use a cosine learning rate scheduler, going from high near the start of the training run, down to low at the end over one cycle. Additionally, and also from appendix B in the paper: we use a 10x learning rate decay in line with Rae et al. (2021) ...which means that if our learning rate starts at η , then we want it to decay down to η / 10 by the end. So, we just need to work out an initial value for η , and let it rip, right? Well, not so fast... When our model is uninitialised, right at the start of the train, gradients are going to be pretty wild. It's going to be making random errors all of the time, and we'll be making huge jumps across the loss landscape. That sounds bad. Additionally those kind of wild jumps can get the optimiser into a -- well, sub-optimal -- state. I haven't read enough about optimisers yet to have a solid handle on that, but that can wait -- intuitively it makes some kind of sense that erratic gradient updates might confuse it. 
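The three single-cycle schedules discussed above (step, exponential and cosine) can be sketched as plain functions of the step number. This is a minimal illustration, not the training code from this series: the function names, the exponential `gamma` and the 1,000-step run length are hypothetical, while the 0.0004 starting rate, the halving per quarter and the decay to one tenth of the peak come from the text.

```python
import math


def step_decay(step, total_steps, initial_lr=0.0004):
    """Halve the learning rate at each quarter of the run."""
    quarter = min(3, (4 * step) // total_steps)
    return initial_lr / (2 ** quarter)


def exponential_decay(step, initial_lr=0.0004, gamma=0.999):
    """Multiply by a number a little smaller than one every step."""
    return initial_lr * gamma ** step


def cosine_decay(step, total_steps, peak_lr=0.0004, final_frac=0.1):
    """One cosine cycle from peak_lr down to peak_lr * final_frac."""
    min_lr = peak_lr * final_frac
    progress = min(step, total_steps) / total_steps
    # cos goes from 1 to -1 as progress goes 0 -> 1; rescale to [min_lr, peak_lr]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


total = 1000  # hypothetical run length
for step in (0, 250, 500, 750, 1000):
    print(step, step_decay(step, total), exponential_decay(step), cosine_decay(step, total))
```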
So, it makes a certain amount of sense to start off with a low learning rate so that we don't do that, and then to increase it gradually to the peak, and only then to schedule the gradual cosine decay. According to this (rather nice looking) masterclass on LLM training , it's typical to do this over "a few thousand steps or a small percentage (e.g., 1-10%) of the total training steps, depending on the dataset size and batch size", and we would just use a linear increase over that period: I think we should do that; a simple linear warmup at the start -- let's relatively arbitrarily say 5% of our training steps going up to our desired peak learning rate. So our learning rate schedule should look something like this: So far I've written a lot about how we vary the learning rate over time, and that's all been very useful. But we still need to know what the value should be initially! In smaller-scale experiments you might just try a bunch of different numbers to see what worked well, but at more than US$30 per train, that's not practical here. Unfortunately it's really quite hard to find good suggestions published anywhere. The GPT-2 paper is (as usual) reticent: The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText ...and if you search for "learning rate training llm", you'll see lots of results for when people are fine-tuning existing LLMs ( 2 × 10 − 4 comes up a lot), but almost nothing about when you're training one from scratch. I eventually came across this (long!) post from Hugging Face , which I definitely need to spend time going through in the future, because it covers a lot of the ground I've been going over in this post series. But for this post, I think the most relevant part is in the section " Scaling Laws for Hyperparameters ", where they include a figure from this DeepSeek paper . 
Here it is, with some of the (also relevant) surrounding text: In our trains we're using something like $5 \times 10^{18}$ total FLOPs. Now, they are specifically charting things in terms of non-embedding FLOPs, but I'm going to play a little fast and loose here and ignore that, so reading off their chart, that looks like we should be using about $1.4 \times 10^{-3}$ as our learning rate. We can double-check that against their formula, where C is the compute budget: Nice, a close match! However, it's definitely worth noting that we're using a simple GPT-2 architecture, and they are using something quite different -- RMSNorm instead of LayerNorm, SwiGLU as the activation function on the feed-forward networks, Rotary Position Embedding rather than the fixed ones we're using, and so on. As a sanity check: you can see that they also give a formula for the optimal batch size in terms of tokens. For our FLOP budget, that comes in at 381,782, which is about 373 of our 1,024-token sequences. That is quite a lot higher than the 97-or-so sequences that appeared to be optimal in our earlier experiments . That is a little concerning, though of course the 97 number came out of a very ad-hoc bit of curve-fitting. For now, I'm going to hope that that doesn't matter too much for the learning rate. This may come back to bite me; if the results of a train with $1.4 \times 10^{-3}$ are radically worse than the existing rate of $4 \times 10^{-4}$, I'll have to do a bit more investigation. So, now I think we have all of the theoretical pieces in place to do a train. Let's move on to the practicalities. We started by looking at this: What should we change -- disregarding the until the next post? Based on the above, we want to do a linear warmup of about 5% of our steps, going up to a learning rate of $1.4 \times 10^{-3}$, followed by a cosine decay down to one tenth of that, $1.4 \times 10^{-4}$. What does that look like in code?
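Before getting into any particular library, the numbers above can be sanity-checked with a minimal plain-Python sketch. It makes a few assumptions: the two power-law constants are the fits I understand the DeepSeek LLM paper to give (worth checking against the paper itself), the function names are hypothetical, the tiny `start_fraction` is just one way of expressing "almost zero", and the 1,600 / 32,000 step counts are illustrative values rather than anything derived here.

```python
import math


def deepseek_optimal_lr(compute_flops):
    # Assumed fit from the DeepSeek LLM paper: eta_opt = 0.3118 * C^-0.1250
    return 0.3118 * compute_flops ** -0.1250


def deepseek_optimal_batch_tokens(compute_flops):
    # Assumed fit from the same paper: B_opt = 0.2920 * C^0.3271 (in tokens)
    return 0.2920 * compute_flops ** 0.3271


def lr_at_step(step, peak_lr, warmup_steps, decay_steps,
               final_fraction=0.1, start_fraction=1e-5):
    """Linear warmup to peak_lr, then one cosine cycle down to peak_lr * final_fraction."""
    if step < warmup_steps:
        # Linear ramp from (almost) zero up to the peak.
        progress = step / warmup_steps
        return peak_lr * (start_fraction + (1.0 - start_fraction) * progress)
    # Cosine decay: progress runs from 0 to 1 over decay_steps, then stays at 1.
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    min_lr = peak_lr * final_fraction
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


budget = 5e18  # rough total FLOPs for one of these trains
print(f"suggested peak LR: {deepseek_optimal_lr(budget):.3e}")
print(f"suggested batch size in tokens: {deepseek_optimal_batch_tokens(budget):,.0f}")
for step in (0, 800, 1_600, 17_600, 33_600):
    print(step, f"{lr_at_step(step, 1.4e-3, 1_600, 32_000):.3e}")
```

Writing the schedule as a pure function of the step number makes it easy to plot or test in isolation, though a real training loop would more likely lean on the framework's own scheduler classes.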
The relevant API for scheduling the learning rate in PyTorch is, logically enough, in the module, and there are a bunch of different scheduling classes. You create your optimiser, then create a scheduler for the shape you want, and then you can call on the scheduler (after the on the optimiser) to adjust the optimiser's learning rate over time. Let's make that more concrete; one of the schedulers is , which is what we'll need for our linear warmup period. It takes as its parameters: Let's say that we want to go from almost-zero to our optimiser's learning rate over 1,600 steps -- we'd create our scheduler like this: ...then in our training loop, after we've done the scaled step of the optimiser, we'd also step the scheduler: This confused me a little bit the first time I saw it; after all, if the scheduler hasn't been "triggered" when we step the optimiser, how does the optimiser know what learning rate to use? Surely it would just use whatever it was initialised with? The answer is that when you create the optimiser, it stores away the learning rate that you give it in two places -- an "initial learning rate" and a "current learning rate". Next, when you create your scheduler, it uses the initial learning rate to work out the start and end values, and then sets the current one to the start value immediately. Just by creating a scheduler, you're changing the optimiser's current learning rate -- but not the initial one, which is important, as we'll see in a moment. So, we have a scheduler that handles our warmup period nicely. Another scheduler that's relevant to our interests is the CosineAnnealingLR . This takes: On creation, this scheduler will read in the optimiser's initial learning rate -- note, not the current one -- and then the first time it's stepped, it will set the current learning rate to that value, and then for steps after that it will reduce it so that it follows a nice cosine decay, reaching after steps. 
So those two cover the two regimes that we want -- the warmup and then the cosine decay. But now we need to put them together; we want to do one and then the other. There's a very useful class, , which allows you to chain schedulers and tell it when each one takes over from the previous one. Let's sketch out some code to use that to do a train with our new peak learning rate of $1.4 \times 10^{-3}$, a warmup of 1,600 steps, followed by a cosine decay for the next 32,000 steps to one tenth of the peak learning rate: That actually works quite nicely! I wrote a dummy training loop to plot the current learning rate over a fake train using code like the above , and got this: ...with the output confirming that the values were good at the "milestone" point, the start and the end: I was initially a bit surprised by that, as at the time I ran it, I didn't realise that there was that split between the initial and the current learning rates on the optimiser, so I thought that the cosine scheduler would pick up whatever tiny starting value the warmup scheduler had overwritten the optimiser's learning rate with -- but that split saves the day. That means that now we have the outline of how to schedule our learning rate. But before we can put that into the code, we need to think about how it affects our checkpoints. Just like the scaler and the optimiser, the learning rate scheduler -- or, indeed, our two schedulers here -- contains information about the state of the train. That means that if we recover from a checkpoint, we need to provide them with the information they need. If we just created them afresh, they'd start from the beginning -- for example, if we restarted from step 20,000 in a train like the one above, we'd start a new warmup from pretty much zero, and then start a fresh cosine decay. That would be bad: (Dummy test code here .) Now, we could use the parameter to initialize them with the correct current global step.
But they have a state dict, like most other PyTorch objects, so the simplest thing to do is just to write that to another checkpoint file: ...and then load it likewise: (Dummy test code here .) Conveniently, if you save the state dict of a , it will also include the state of all of its component schedulers, and likewise if you reload it, it will load the components' states back in too. The one thing you have to be careful about is what they warn about in the PyTorch docs: Initializing a scheduler overwrites its optimizer’s s. When restoring a checkpoint, initialize the scheduler before calling your optimizer's to avoid overwriting the loaded learning rates. Luckily enough, in our code as it stands, we create all of the things that are checkpointed -- the optimiser and the scaler so far, but shortly the scheduler as well -- before we load in the state dicts, so that drops out quite nicely. So, we have some sketched-out code -- it's time to put it in place for the real training run. I won't go through the details of the changes to my existing DDP training code, though you can see the diff here if you're interested. Much of the complexity was due to keeping backward compatibility so that we don't have to always use a learning rate scheduler; remember that in this mini-series, I'm trying out various changes ("interventions") to the training loop in isolation, seeing whether each one improves things. So it's important to be able to easily train with or without learning rate scheduling; I did that with a flag in the Implementation-wise, initially I was thinking that it would be easiest to always have a scheduler, and in the "non-scheduled" case to just set it to a linear one that didn't change the value over the course of the train. But in the end it turned out to be easier to use as being the switch to tell the training loop which "mode" it was in.
The placement of the code to create the schedulers was also a little tricky; the "natural" place was just after the optimiser is created, like it is in the example code above. However, at that point, we don't know how many global steps we're going to have in the train, because we don't have the dataset -- which means that working out the numbers to pass in to the schedulers for the warmup and decay steps would be impossible. It turned out to be easiest to put it in the function , just after the datasets are loaded, as at that point we have all of the information we need. Anyway, that's the code done, so let's see what happens! I wanted to do two trains; one with the learning rate scheduling, and one with just the new value for the learning rate, instead of . I was expecting the updated learning rate alone to be too high and to cause a very choppy train, but had high hopes for the train with the scheduling. Here's how it did; the scheduled learning rate train first: Here's what the training loss looked like over that: Quite a few loss spikes early on in the train when the learning rate is at its peak, but nothing unmanageable -- and, as you'd expect, things calmed down quite a lot later on. I also charted the learning rate, to make sure it really was doing what I thought it was doing: So, a pretty smooth train, and we definitely did the right learning rate scheduling. Time to upload it to Hugging Face , and see what the evals look like. Firstly, the smoke test: Reasonably coherent, at least, though it's not super-impressive. On to the loss on our test set: That's our best loss so far! Let's put it into the table: So, it definitely looked like it was worth it. But was it the scheduling of the learning rate that helped, or just the change from 0.0004 to 0.0014? I kicked off a second run with no scheduling, just a learning rate of 0.0014, to see what would happen. After about an hour, I noticed that the loss chart had stopped updating. 
The last point had a maximum and minimum loss but no average -- but after that, nothing: However, the learning rate was still being charted, so the train was definitely running: Looking at the checkpoint metadata showed what had happened. At global step 1851, we had this 3 : ...and at the next checkpoint at step 2468, we had this: ...and the same for all checkpoints thereafter. Clearly the parameters had gone off the rails -- exactly what we'd expect with an excessive learning rate: There was no point in continuing the train, as it was pretty much certainly unrecoverable, so I stopped it. Out of interest, I downloaded the model, but I couldn't even run the smoke test on it: So it was pretty clear that just updating the learning rate to 0.0014 was actively harmful. No need to upload that one to HF! And time to wrap up this experiment. While this has been quite a long post, I've really only scratched the surface of how learning rates are set. If I were doing things in more detail, the best would probably be to do a "sweep" over multiple values to try to at least approximate the best possible rate for this model. That would be pretty expensive for me, though, so I decided to stick with the DeepSeek number. It might not be ideal for the specific architecture that I'm using, given how different that is to theirs, but given the results, it's a decent one compared to what I was using. 4 Something that I found interesting is that exactly how to schedule your learning rate is still an area being actively researched. Even in my relatively minimal research, I came across three alternatives to the mainstream warmup-cosine decay pattern: I'm sure there are many more. But for this train, I decided to stick to the mainstream, and the results were pretty good! To reiterate, this has been the most positive intervention so far: So I'll stick with that, and move on to the next thing: what is the parameter that we're passing in to the AdamW optimiser? 
Tune in next time :-) Yes, I am foreshadowing here.  ↩ To make my earlier analogy about learning rate decaying over time in people as they age even more dubious, we can imagine this as being rather like someone middle-aged going on an ayahuasca retreat ;-)  ↩ If you're wondering how we had a valid maximum and minimum in that first checkpoint when the average was NaN, here's why: You might wonder how large labs work out the right learning rate given their training runs run to millions of dollars. The answer is there in that DeepSeek paper, as that's one of the things they were doing. They scaled their model down from the billions of parameters that they wanted to train to various smaller models, and worked out the optimal learning rate for each of the smaller models by doing full trains on them. Once they had a mapping from model size to the ideal learning rate for their architecture, they could extrapolate that to the large ones that they wanted to train. The problem is that those "smaller" models are actually quite a lot larger than the one we're training here! And while we could potentially scale it down even further, I suspect that such truly tiny models (say, 1M parameters) wouldn't train well enough to give any meaningful results.  ↩ From the paper: Specifically, the learning rate of the model reaches its maximum value after 2000 warmup steps, and then decreases to 31.6% of the maximum value after processing 80% of the training tokens. It further reduces to 10% of the maximum value after 90% of the tokens. , which is the optimiser we're applying it to. , which the optimiser's learning rate is multiplied by to work out where we want to start up. , which is likewise applied to the optimiser's learning rate to work out the value we're heading for. , which is the number of steps over which it should go from the initial learning rate to the final one. 
, which lets the scheduler know how many steps into its schedule it currently is -- this defaults to , meaning it hasn't started yet. This can be useful if you're resuming from a checkpoint, but for our purposes we can ignore it. , which is the same as the 's. , which is the number of steps before it reaches its minimum , the minimum learning rate we want to get to. , again the same as the 's. Per the Hugging Face paper, some people do warmup, then pause at a set level for a while, then start the cosine decay (warmup-stable-decay). DeepSeek use a relatively simple stepped function after a warmup. 5 I came across a 2025 paper " Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs " which says that a linear decay (after a warmup) outperforms cosine.

Giles's blog 2 months ago

Writing an LLM from scratch, part 32d -- Interventions: adding attention bias

I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". This is the third intervention I'm trying: adding bias to the attention weight matrices. In the code from the book, we have this: So: we initialise the weights W q , W k and W v as linear layers rather than simple matrices of weights, and have a parameter to say whether or not we should add bias to those. In all of our trains so far we've set that to . Why do we have this parameter, and where did it come from? In Raschka's book, the use of the for these weights is introduced in section 3.4.2 with the wording: We can improve the implementation further by utilizing PyTorch's layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using instead of manually implementing is that has an optimized weight initialization scheme, contributing to more stable and effective model training. So, it's presented essentially as a way of getting better weights for our untrained model, which makes good sense in and of itself -- but, if that's the only reason, why don't we just hard-wire it to have ? That would be the sensible thing to do if the initialisation were the only reason, but clearly there's more to it than that. Section 4.1 has a bit more information: determines whether to include a bias vector in the layers of the multi-head attention ... We will initially disable this, following the norms of modern LLMs, but we will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model. That looks like a typo, as the real explanation is in chapter 5, section 5 (page 164 in my copy), where we do indeed load the OpenAI weights: OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key and value matrix computations. 
Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary. So, that all makes sense so far. QKV bias was part of the original GPT-2 models, perhaps just because it was standard at the time, inherited from something else, or perhaps for some other reason -- I can't find any reference to it in the actual paper . But people have found it doesn't help, so no-one uses it these days. But... is there some way in which an LLM of this specific size, or one otherwise similar to the GPT-2 small model that we're training, might benefit from having bias? That's what this experiment is for :-) One thing that occurred to me while setting this up is that we have been training on a Chinchilla-optimal number of tokens, 20x the number of parameters. Without QKV bias, we have 163,009,536 parameters, so we've been training on 3,260,190,720 tokens, rounded up to a whole number of batches, which is 3,260,252,160 in our current setup for these experiments (per-GPU micro-batches of 12, with 8 GPUs, so a total batch size of 96). These extra bias terms will be parameters, though! We're essentially making our model larger by adding them, which changes the Chinchilla calculation. How much? OK, that's essentially nothing -- 27,648 extra total parameters on top of 163 million. I make it less than two hundredths of a percent larger! The correct number of tokens goes up to 3,260,743,680, so if we wanted to be very pedantic, we're under-training. But I feel like training on a larger dataset is worse in terms of comparability between the baseline and our "intervened-on" model with QKV bias. So: we'll train a model with QKV bias on 3,260,252,160 tokens, accepting that it's a tiny bit less than Chinchilla-optimal. Let's see how it goes! Here's the config file for this train. Running it gives this training chart: Pretty standard, though the loss spikes look less prominent than they have been in the other trains.
Might QKV bias actually help with model stability in some way...? The train finished with these stats: Timing-wise, pretty much indistinguishable from the baseline train's 12,243.523 seconds. The final train loss looks a tad better, but we can't rely on that -- the test set loss is the important one. So it was time to download it, upload it to Hugging Face Hub , and then on to the evals. Firstly, our normal "how should you continue ": Not bad at all, borderline coherent! Next, the loss on the test set: Well, crap! Now that's a surprise. Let's look at that in the context of the other interventions to see how surprising that is, given Raschka's comments (which were undoubtedly backed up by serious research): So, adding QKV bias actually improved our test set loss by more than gradient clipping did! The loss spikes in the training chart look smaller than in the other trains 1 , so, speculating wildly, perhaps with a model of this size, the bias stabilises things somehow? Or perhaps what we're seeing is the model become that tiny bit smarter because it has some extra parameters -- albeit less than 0.02 percent more? I'm not going to spend time investigating things now, but this is a really interesting result. One extra thing that does occur to me is that research since GPT-2 has definitely moved in the direction of larger models. The attention weight matrices are sized $d_{\text{emb}} \times d_{\text{emb}}$, so excluding bias they have $d_{\text{emb}}^2$ weights each. Bias adds on another $d_{\text{emb}}$. So, as a model scales up, the attention-related non-bias weights will scale quadratically -- doubling $d_{\text{emb}}$ will quadruple their number -- while the bias weights will scale linearly. So perhaps it's just that the effect -- whatever causes it -- gets rapidly swamped as you scale out of toy-model territory. That, at least, seems pretty plausible.
One final note to self, though: these improvements are small enough that I do find myself wondering whether or not they might be some kind of noise, despite my setting the random seeds: I think that at the end of this, before I do a final train, it would be worth doing another baseline train and measuring the test set loss again, and doing another comparison. If it comes out exactly the same -- and I can bump up the number of significant figures in the output, it's just a formatting parameter -- then I don't need to worry. But if they vary to some degree, perhaps I'll need to update my mental model of what level of finding is significant, and what isn't. I think it goes without saying that QKV bias definitely goes onto the list of interventions we want to add when training our best-possible GPT-2 small-scale model, assuming that the random seed test goes well. That surprises me a bit; I was expecting it to have negligible impact! That, of course, is why it's worth doing these tests. Next up, I think, is trying to understand how we can tweak the learning rate, and its associated parameters like weight decay. This will need a bit of a deep dive, so you can expect the next post late next week, or perhaps even later. I'm sure you can't wait ;-) Note to self: is there some way I could quantitatively measure those?  ↩
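The Chinchilla arithmetic from earlier in this post is easy to reproduce. Here's a minimal sketch, assuming the standard GPT-2 small embedding width of 768 (not stated above, so treat it as an assumption), 12 Transformer layers, and 1,024-token sequences; all of the function and constant names are hypothetical.

```python
D_EMB = 768                # assumed GPT-2 small embedding width
N_LAYERS = 12              # Transformer layers in GPT-2 small
SEQ_LEN = 1_024            # tokens per training sequence
GLOBAL_BATCH = 8 * 12      # 8 GPUs x per-GPU micro-batch of 12 = 96 sequences
BASE_PARAMS = 163_009_536  # parameter count without QKV bias


def qkv_bias_params(d_emb, n_layers):
    # One bias vector of length d_emb for each of W_q, W_k and W_v, per layer.
    return 3 * d_emb * n_layers


def chinchilla_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param


def round_up_to_batches(n_tokens, batch_size, seq_len):
    # Round up to a whole number of global batches, using integer arithmetic.
    tokens_per_batch = batch_size * seq_len
    n_batches = -(-n_tokens // tokens_per_batch)  # ceiling division
    return n_batches * tokens_per_batch


def bias_to_weight_ratio(d_emb):
    # Per attention matrix: d_emb bias terms against d_emb^2 weights.
    return d_emb / d_emb ** 2


extra = qkv_bias_params(D_EMB, N_LAYERS)
print("extra parameters:", extra)
print("relative increase (%):", 100 * extra / BASE_PARAMS)
print("tokens without bias:", chinchilla_tokens(BASE_PARAMS))
print("rounded up to whole batches:",
      round_up_to_batches(chinchilla_tokens(BASE_PARAMS), GLOBAL_BATCH, SEQ_LEN))
print("tokens with bias:", chinchilla_tokens(BASE_PARAMS + extra))
print("bias/weight ratio at 768 vs 1536:",
      bias_to_weight_ratio(768), bias_to_weight_ratio(1536))
```

The last line illustrates the scaling point: doubling the embedding width halves the ratio of bias terms to weights in each attention matrix.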

Giles's blog 2 months ago

Writing an LLM from scratch, part 32c -- Interventions: removing dropout

This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.678. Not much, but it's something! This time, I wanted to see what happened if we trained without dropout. Would removing it make the test loss worse, or better? In a blog post last summer about architectural advances in LLMs since GPT-2 , Sebastian Raschka wrote: Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. That makes quite a lot of sense. My own understanding of dropout was that it was a bit broader than just preventing overfitting -- it seemed to me to be similar to the mandatory vacation policies that financial firms use to prevent over-dependence on individuals . My instinct was that having knowledge distributed across different weights in the model was good in and of itself, even beyond its benefit on multiple-epoch training.
But it is quite a high price to pay. With the training parameters we've been using we're literally discarding 10% of our calculations' results -- attention weights, feed-forward neuron activations, and so on -- as we do the forward pass. It's easy to see why it would harm training. Let's give it a go. The nice thing about this one is that, unlike the gradient clipping experiment, I didn't have to write any new code. The dropout level was already controlled by a setting in the file , so by setting that to zero for this run, I could just kick it off and let it do its thing while I worked on something else: Here's what the training run chart looked like (please disregard the stuff about grad norms in the title and the axis -- I'll remove that for the next train): As you can see, we still have loss spikes, including one just after global step 20,000 that lasts for several checkpoint periods of 617 steps. I imagine gradient clipping might have helped with that, but I'm very deliberately testing each intervention in isolation. At the end of the training run, we got this: So, interestingly, it took 967 seconds -- about 16 minutes -- less time than the gradient clipping run, and about 15 minutes less than the baseline train. So while gradient clipping added on a small amount of time (or maybe that was just noise), dropping dropout certainly seems to speed things up! I guess there's quite a lot of work involved in generating and applying the random masks that drop things out as we're doing the forward pass. Anyway, with the model trained, it was time to download it, upload it to Hugging Face Hub , and run the evals. Firstly, the smoke test, where it just needs to continue the sequence , it came up with something reasonably coherent: ...but it was the loss on the test set that was most impressive: That's a bigger improvement on the baseline train's 3.692 than gradient clipping: 0.051, which is more than three times the improvement!
Let's start keeping a table of these: Now, of course, we don't know how these different interventions combine together -- it would be naive to think that if we did both gradient clipping and dropout removal, we'd get a total loss reduction of 0.014 + 0.051 -- but, especially with that long-lived loss spike in our training run -- it does feel like they might play well together. So, that's dropout covered. Which one next? I think a nice easy one that I should be able to get done on a Friday will be adding bias to the attention weight calculations. Let's give that a go and see if it makes things worse or better! Stay tuned...
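For anyone who wants to see concretely what "discarding 10% of our calculations' results" means, here's a minimal plain-Python sketch of dropout applied to a list of activations. It assumes the common "inverted dropout" convention, where the surviving values are scaled up by 1/(1-p) during training so that the expected value is unchanged; the function name and the explicit `rng` argument are hypothetical choices for illustration, not the training code used in this series.

```python
import random


def dropout(values, p, rng):
    """Zero each element with probability p and scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if p == 0.0:
        # Dropout disabled: nothing is masked and no random numbers are drawn.
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(42)
activations = [1.0] * 20
print(dropout(activations, 0.1, rng))  # roughly one in ten entries zeroed
print(dropout(activations, 0.0, rng))  # unchanged
```

With `p` set to zero the function returns early without generating a mask at all, which is at least consistent with the idea that skipping the random-mask work saves time.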

Giles's blog 2 months ago

Writing an LLM from scratch, part 32b -- Interventions: gradient clipping

I'm still working on training the best GPT-2 small sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the minimal training run in the book, just modified to run using DDP on an 8x A100 40 GiB/GPU machine in the cloud. There are a bunch of "interventions" I want to try to see if they'll make it better, as measured by the loss they get on a test set. I'll do a post for each intervention, and this is the first: gradient clipping. In the training chart for the baseline model, you can see that there are three places where the loss suddenly spiked up, at around global steps 4,200, 13,000, and 23,000: There are a number of things that could cause loss spikes like that: Exploding gradients are common in RNNs, and also happen in LLMs like this one. I spent a bit of time reading around to find out how they happen, and the ah-ha moment came when I came across this post from Wanshun Wong . Not only is the post itself a good intro in terms of how it affects RNNs, but in the "further reading" at the end, there's some gold: Chapter 10.11 of [1] has a good overview of how gradient clipping works. Now, I bought a copy of " Deep Learning " at the same time as I bought Raschka's book, but I'd only glanced through it. Now was the time to get it down from the shelf -- and, indeed, section 10.11.1 is all about clipping to handle exploding gradients. I'll put the explanation of how they happen into my own words, to see if I can clarify things (at least in my mind). Normally, when we learn about gradient descent, it's illustrated with nice smooth loss charts like this imaginary one for a single-parameter model: We're told that we might start at point A. 
The gradient is quite high and negative, so we multiply it by our learning rate and subtract it from our parameter. That gets us to point B. This time around, the gradient is smaller as the curve is flatter there, so when we do the same -- multiply by LR and subtract -- we take a smaller step, and wind up at C. Rinse and repeat and we'll wind up near the minimum. The problem is, what if the loss curve actually looks like this: We start at A, with a small gradient, move a little to the right, and now we're at B halfway down a cliff! The gradient is massive, and when we subtract it, even scaled by the learning rate, we can zoom off somewhere to the right -- maybe not even on the chart. Indeed, you can imagine a cliff that is so steep that it would have vertical portions -- negative infinite gradients in this case -- and no matter what your learning rate is, you'll wind up with an infinite parameter update and everything will break. It's hard to see how a model can continue training in a case like that. Now, what can cause steep cliffs like that? The book says "strongly nonlinear functions, such as those computed by a recurrent neural net over many time steps". If you know about RNNs (I wrote about them if you'd like a summary), you'll remember that a single RNN might be quite shallow -- maybe three or four layers -- but when you're doing backpropagation, you run a number of inputs through, one after the other, work out the overall loss, and then "unroll" it to something similar to a "vanilla" neural net to do the backward pass. To put that in concrete terms, a 3-layer RNN trained with a 100-element sequence would unroll to a 300-layer deep network. Every one of those layers has several operations, including (in the implementation I was looking at in my post above), a $\tanh$. It's not surprising that there are cliffs in the loss landscape -- it's more surprising that there are any smooth bits!
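The cliff scenario is easy to reproduce numerically. Here's a toy sketch with an entirely made-up single-parameter loss: a very steep tanh "cliff" at x = 1 sitting on top of a gentle bowl. The loss function, the learning rate of 0.5, and the function names are all hypothetical and chosen purely so that the effect is visible.

```python
import math


def loss(x):
    # A steep drop around x = 1 (the "cliff") plus a shallow bowl centred on x = 3.
    return -math.tanh(50.0 * (x - 1.0)) + 0.01 * (x - 3.0) ** 2


def grad(x):
    # Analytic derivative of loss(x).
    t = math.tanh(50.0 * (x - 1.0))
    return -50.0 * (1.0 - t * t) + 0.02 * (x - 3.0)


def sgd_step(x, lr):
    # Plain gradient descent: multiply the gradient by the LR and subtract it.
    return x - lr * grad(x)


lr = 0.5
for start in (0.0, 1.0):
    print(f"x = {start}: gradient {grad(start):.2f}, "
          f"next x = {sgd_step(start, lr):.2f}")
```

Starting on the flat section, the step is tiny; starting halfway down the cliff, the same learning rate throws the parameter a long way off to the right, well past the bowl's minimum.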
Now in LLMs, we don't have that unrolling through time -- but our network is deep enough as it is. For the GPT-2 small model, disregarding the embeddings and the final output head, we have 12 Transformer layers, each of which is multiple matrix multiplications for attention, then a softmax, then another layer, and then a feed-forward... mapping precisely to the equivalent vanilla NN is hard, but I think you can treat each one as at least four layers, so we've got 48. And there are GELUs and logs and exps (footnote 1) dotted around, so again -- we should expect cliffs.

So if sometimes we'll get crazy gradients, what can we do about them? We clip them. Clipping gradients simply means that if they get larger than a particular number -- v, which we define -- we reduce them to that number. In other words, we have a cap on how big they can get. "Deep Learning" ("DL" from now on) suggests two ways to do it. Remember that while in the example above, we only had one parameter -- on the X axis -- for the GPT-2 small LLM we're training, we have 163 million of them. So the gradients, instead of being one number, will be a 163M-long vector, one per parameter. The two ways to clip are element-wise, where each element of the gradient vector is capped at v individually, and norm-based, where the whole vector is scaled down so that its norm is at most v. The second feels more elegant -- we're scaling all of the elements of the gradient vector by the same amount, so it still points in the same direction. Interestingly, though, DL says that the two methods "work similarly", which I'll read as "are pretty much the same in practice".

DL then goes on to say how infinite or not-a-number gradients should be handled. With the first way, clearly doing it naively would set every element in the gradient vector to v, which would make the total size (norm) of the update very large. With the second, it would be even worse -- we'd still wind up with completely junk gradients, because the norm would be infinite, and in Python infinity divided by infinity is NaN, so we'd be applying gradients with NaNs in them at best.
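The two clipping methods discussed above can be sketched in plain Python on a three-element list standing in for the 163M-element gradient vector. The function names `l2_norm`, `clip_elementwise` and `clip_by_norm` are hypothetical, and clipping negative values to -v in the element-wise version is an assumption of this sketch; a real training loop would use the framework's own clipping routine.

```python
import math


def l2_norm(grads):
    """Length of the gradient vector: square, sum, square root."""
    return math.sqrt(sum(g * g for g in grads))


def clip_elementwise(grads, v):
    """Cap every element independently to the range [-v, v]."""
    return [max(-v, min(v, g)) for g in grads]


def clip_by_norm(grads, v):
    """Rescale the whole vector so that its norm is at most v."""
    norm = l2_norm(grads)
    if norm <= v:
        return list(grads)
    return [g / norm * v for g in grads]


grads = [3.0, -4.0, 0.0]                      # norm is 5.0
by_element = clip_elementwise(grads, 1.0)     # direction changes
by_norm = clip_by_norm(grads, 1.0)            # direction is preserved
bad = clip_by_norm([math.inf, 1.0], 1.0)      # infinite norm

print(by_element, by_norm, bad)
```

Element-wise clipping turns the vector into one pointing along a different direction, while norm clipping keeps the 3:-4 ratio and only shortens it. The last call shows the non-finite case: with an infinite norm, the rescaled gradient comes out as a NaN alongside a zero.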
That would be likely to knock our model into unrecoverable territory, as any parameter that had that applied to it would be NaN forever. Their suggested solution is that if you get garbage gradients like that, you can take a random step -- that is, create a new gradient to apply that has the norm v but just points in a random direction. The idea is that this will move you away from the cliff-ridden part of the loss landscape where you've found yourself (more about that later), and things will continue nicely. So, anyway, how to do this in practice? PyTorch has a function, , and that's what's referenced in almost every bit of writing I've found about how to clip gradients. So I decided to use that, assuming it would do what was described in DL's second option and that it would do the random updates they suggest for non-finite gradients. (I was half-correct -- see later.) As to how to use it -- if we had a normal training loop, where we were just using a normal optimiser, we would go from: ...to something like ...where is the max value v from above. However, for our training code using Automatic Mixed Precision (AMP), it's a little more complicated -- but luckily, the AMP explainer we've been using has a section explaining what to do . Right now we have this: Per that explainer, we need to move to this: That looks a bit weird; we're "unscaling" the gradients, then clipping them, then using the scaler to step the optimiser. You'd think that you'd need to "re-scale" the scaler after clipping the gradients -- to get back to where you started from before the optimiser step. From the help page I gather it keeps track of whether or not the gradients it has right now are currently scaled and handles them appropriately based on that state in . Anyway, given that we know what the code looks like now, we need to implement it in a way that can be easily switched on for this experiment (and potentially in the future), but which also allows us to not use it if we don't want to. 
The best way with our setup is to make it a training option, so we can do it this way: ...with extracted from the file where we call it in : ...and we can just pass in for it in our function that we use to find the maximum micro-batch size for our current hardware, as all we're testing for there is memory usage -- we don't care if we're doing good updates. Here's the code delta for that , plus a bugfix to allow for files without a in them. But it would also be useful to be able to track when it "fired" -- that is, when we had to clip our gradients. Then we can see two things: Now, the docs for say that it returns the "[t]otal norm of the parameter gradients (viewed as a single vector)". It doesn't say whether that's before or after the clipping, but given that the return value would always be if it was after, I'm going to guess that it returns the pre-clipping norm (ChatGPT agrees). So we can chart that; changes in these diffs: 1 , 2 , 3 , 4 . So we now have code to clip gradients to a given norm size and to chart the gradient norms so that we know what they were before clipping. The question is, what should that clipping norm be? Some googling around suggested that there was no standard way of saying "for such-and-such a kind of model, gradients should be clipped at around x ". For example, on this Reddit thread , says "Common values are 1, 3, 5, 8, 10", and likewise sample code in this tutorial . has 1, as does this one . So my initial thought was, let's just use 1. But then I wondered, what actually are the gradient norms that we're getting in normal training? I decided to run a local short train on 3m tokens (a thousandth of the full training set, taking just less than four minutes) with very frequent checkpointing, and gradient clipping set to 1, and see what happened. You can see that the "grad max" line is almost always above the "grad clip" -- we're almost always clipping. This doesn't sound right. 
It looked like the range of the grad max was generally between 1.1 and a little above 3, so I set the clipping threshold to 3.5 and did another train: Our loss is about the same, but we're no longer clipping -- and that's what we want; there was no evidence of exploding gradients for that short run -- just big updates near the start, as you'd expect. I then ran the same with no gradient clipping at all, and got exactly the same shape for the loss chart as I did with gradient clipping at 3.5, and the same final loss -- that's a good signal that clipping is not affecting the train when we stay inside the limit, which is exactly what we want.

So, it was time to train our model! I kicked off the train, and after a little while, I looked at the training chart, which is updated dynamically as the model trains: You can see the dotted green lines, both the light one and the dark one -- that is, the "grad max" and the "grad avg" -- disappear starting just before global step 4,000, only coming back at about 5,500 -- that is, these were not plotted for global steps 4,319 and 4,936, even though the loss was. What was going on? I took a look at the checkpoint meta file for the first of those to see what the actual numbers were, and saw this: Aha! The PyPlot code I was using could not handle infinite values, which is entirely reasonable. That was easy enough to fix, though -- I just replaced positive infinity by 1,000,000 and negative infinity by -1,000,000, and then (in the interest of getting a proper from-scratch run) kicked everything off from the beginning.

That training run completed with this chart: That's a little hard to read, but if you look closely at the green lines, you can see that there are seven periods where gradients were either very large or infinite. Weirdly, though, out of the seven, two of them were two checkpoint periods long (that is, two periods of 617 global steps).
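The infinity replacement mentioned above for the chart could look something like the following. This is only a sketch: the helper name `cap_for_plot` and the sample values are made up here, and the post's actual plotting code is not shown in this text.

```python
import math

PLOT_CAP = 1_000_000.0  # stand-in value used in place of infinity on the chart


def cap_for_plot(value: float, cap: float = PLOT_CAP) -> float:
    """Replace +/- infinity with a large finite number; leave the rest alone."""
    if value == math.inf:
        return cap
    if value == -math.inf:
        return -cap
    return value


logged_norms = [1.2, 2.9, math.inf, 3.1, -math.inf]  # invented sample data
plottable = [cap_for_plot(v) for v in logged_norms]
print(plottable)
```

Note that this sketch deliberately leaves NaN values untouched, since the text only mentions replacing the two infinities.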
That felt weird, though of course we're looking at the maximum gradient norm and the average gradient norm -- so two single infinite/high-gradient steps in successive 617-step periods would lead to that effect. What was even stranger, though, was that if you look at the training chart for the run with no gradient clipping, we have only three loss spikes rather than seven: ...though it's also very noticeable that the gradient-clipped run had only two small loss spikes, unlike the three larger ones in the unclipped run. The training loss the gradient-clipped run reported at the end was better, too: ...versus 3.743 at the end of the baseline train.

So it was time to download it, and run the sequence-completion smoke test: Coherent enough! Next, we evaluate it against our held-back test set: So, the loss had gone down -- but only from 3.743 to 3.678, a reduction of 0.065, or about 1.7%. That's not actually all that bad! After all, in my initial experiments on my local machine, training for a Chinchilla-optimal number of tokens from FineWeb-Edu (rather than the regular FineWeb I'm using now) got a loss of 4.167 on the same dataset (weirdly worse with the more-curated training set), and training for a further Chinchilla-optimal number of tokens only brought that down to 4.135, for a difference of 0.032, or about 0.8%. It's not strictly comparable due to the different training sets, but speaking very loosely, we could say that gradient clipping for this train had more effect than doubling the training time for the other one. That's pretty nifty.

But the question remained: why those long periods of high gradients, even with gradient clipping? And why were there still loss spikes -- in particular the one just before global step 12,000, which lasted for two checkpoint periods? Remember that when I started the first run of this train, and got the chart with the missing bits, it was because the logged "grad max" and "grad avg" values were infinite.
What happens when gets an infinite gradient -- either one that has an infinity as one of its components, or one that (due to numerical overflow) winds up with a norm of infinity anyway? I'd been kind of assuming that it did what the authors described in "Deep Learning" -- a random update of norm v -- given that the book stated pretty confidently that you "can" do it but then appeared to consider the topic closed. But it doesn't! If you check that link to the docs, you'll see that it has a parameter , which is by default. If it's set to , that will raise an exception if the norm is positive or negative infinity, or if it's not a number -- which catches both the infinite component and the norm overflow cases above. But if it's not set -- and we weren't setting it -- and the norm or the gradients are non-finite, then will essentially return garbage gradients. Depending on the exact cause, elements will either be infinities of one sign or another, or NaNs. And if these are added to parameters, then those parameters will become garbage too. Now that leads to the question, given that we know that somewhere in the period between the checkpoint at global step 4,319 and the previous one at 3,702 there was an infinite norm at some point, how on earth did the model manage to continue training after that? Loss went up at around the same time, but it wasn't completely broken as it would have been with NaNs or infinities in its parameters. Obscurely enough, the answer turned out to be in the AMP explainer , in a comment in one of the bits of example code. Regarding the class we're using: So what was happening was that the scaler -- something we introduced into our code to get a speedup by using 16-bit floats instead of 32-bit whenever PyTorch thought it would make sense -- was protecting us against infinite and NaN gradients as a side-effect. It was skipping updates that would have polluted our weights with bad values from non-finite gradients. 
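The protective side-effect described above, where an optimiser step is skipped when the gradients are non-finite, can be illustrated with a toy class in plain Python. `ToyScaler` is a hypothetical stand-in written for this sketch, not PyTorch's implementation; the initial scale of 1024 and the halving of the scale after a skipped step are assumptions chosen to keep the example small.

```python
import math


class ToyScaler:
    """Hypothetical, simplified loss scaler; for illustration only."""

    def __init__(self, scale: float = 1024.0, backoff: float = 0.5):
        self.scale = scale
        self.backoff = backoff
        self.skipped_steps = 0

    def step(self, params, scaled_grads, lr):
        """Unscale the gradients and apply them, unless any is inf or NaN."""
        grads = [g / self.scale for g in scaled_grads]
        if not all(math.isfinite(g) for g in grads):
            # Leave the parameters untouched and shrink the scale instead.
            self.skipped_steps += 1
            self.scale *= self.backoff
            return list(params)
        return [p - lr * g for p, g in zip(params, grads)]


scaler = ToyScaler()
params = [0.5, -0.25]

params = scaler.step(params, [1024.0, 2048.0], lr=0.1)    # normal update
after_good = list(params)
params = scaler.step(params, [math.inf, 2048.0], lr=0.1)  # skipped update
after_bad = list(params)

print(after_good, after_bad, scaler.skipped_steps, scaler.scale)
```

The parameters come out of the second call unchanged, so the infinite gradient never reaches the weights. Without that check, the second update would have left an infinity in the first parameter.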
If the above comes across as a little frustrated, then it's because I am a bit! From a software engineering viewpoint, this situation really does feel a bit like a rather messy part of the API. There are three things that it's reasonable for a library to do with infinite/NaN gradients: Now, if we look at that , we can see that the first two of those cases are handled there; and the developer can choose which option to follow. It's not where I'd personally put it (the function on the optimiser seems more natural) and I think I'd probably set the default to too, but I can also imagine good reasons for it being the way it is -- backward compatibility for one. But the "skip non-finite gradients" being a (not even optional!) behaviour that is on a class designed for handling mixed-precision training just seems outright bonkers. I would be surprised if there weren't people out there who've spent days trying to work out why their training runs failed catastrophically when they decided to switch from mixed-precision to "full fat" 32-bit floats, not realising that a hardly-even-documented feature of the scaler 3 had been saving them from gradient issues previously. Anyway, rant over. What does this all mean? There are three ways a gradient can explode: With both the baseline code and our new code, the was saving us from the last two of those, by skipping the optimiser steps with non-finite gradients. However, the baseline run was not protected against the first kind -- large but finite gradients with a finite norm -- while this run was protected. What I'm almost certain is happening here is that in all of my training runs so far, there have been all three kinds of issues with exploding gradients. The , which again, we introduced for faster training, happened to be saving us from the infinite gradients/norms. But we were still being bitten by the finite but excessively large ones. 
And that, I think, is why this training run had a positive -- not huge, but certainly worthwhile -- effect on the test set loss. If I had more time, I think I'd do another run, logging all three of those categories of error to see how frequent they are, and charting the result. That might go some way to explaining the final question I had here: why is it that the renowned "Deep Learning" suggests a random update to get away from the cliff where you've found yourself, while we seem to be getting away with just skipping the update, which is much simpler? Well, the book was written in 2016, and I guess rather a lot has changed in the last 10 years :-) My guess is that their solution might have been a solid default in the age of RNNs, but might not make so much sense with the kind of models we're training these days. I think I can see a way in which that makes sense. Think of the illustration of a loss "cliff" in a one-parameter world that we had at the start of this post: If you happen to wind up on that cliff, you're in trouble. But imagine a two-parameter model -- the line of the loss function becomes a surface. Just as in the real world you might be able to walk along the edge at the top of a cliff and find a nice easy slope down next to it, you can imagine that the cliff in the two-parameter case might be less of a problem because you don't need to be lucky enough to jump down it -- you can walk around it. Extrapolating examples like this to higher dimensions is risky, but I think it should hold that the more dimensions you're working with, the less likely it is that a cliff is an issue -- you're more likely to be able to find a way around it. I've heard a very similar argument made for why local minima are less of an issue with lots of parameters. It's certainly worth saying that this is far from a mathematical proof, but I think it's a decent grounding for intuition. Now think about an RNN. 
Although you're doing back-propagation through time over what amounts to a very deep network, there aren't actually all that many parameters, certainly compared to an LLM like this. Each parameter is involved in the back-propagation multiple times. So, thinking of it that way, the gradient vector for the RNNs they were dealing with was of much lower dimensionality than the ones we're dealing with, even for this tiny model. They say that the random step "will typically move away from the numerically unstable configuration". I'm probably playing fast and loose here, but I'll take that as something like: if you wound up on a cliff, you were likely in a very "cliffy" area of the loss landscape. "Teleporting" randomly to somewhere some distance away was a sensible way to handle that. In our situation, even if the area is "cliffy" in the direction that one particular batch might push us, we have so many extra dimensions that it may well be that it won't be so bad with the next one. So just skipping the problematic update -- under all of those assumptions -- seems a perfectly reasonable way to handle it. All of this, BTW, made me think back to validation loss. In our previous training runs, where we were measuring it just before each checkpoint, its spikes were in general correlated with but not identical to spikes in training loss: Now, of course, exploding gradients don't have to be related to high training loss -- there's enough non-linearity in there that we can treat them as being completely uncorrelated, I think. But you definitely would expect them to have an effect on validation loss if applied. Disregarding the infinite ones (which were being filtered out anyway), the very high ones that we are now clipping would, in the unclipped baseline train, seem very likely to have caused validation loss spikes. So: if I hadn't stripped that out, we would likely have been able to see a clear difference in the validation loss line between clipped and unclipped. 
That would have been useful! I'm not going to re-introduce it, though. Best to keep the number of code changes to a minimum if I'm trying to compare like with like over the course of these intervention tests.

I think that's enough for gradient clipping. I may come back and do the experiment another time to see what the relative ratios of the different kinds of problematic gradients are. Are there parts of the train where we get lots of them as a percentage (ie. we're somewhere "cliffy" in the loss landscape)? How many infinite gradient vs infinite norm vs big-but-not-infinite instances do we have relative to each other, and to normal gradient updates? What do we see if we have validation loss? And so on. But for now: gradient clipping definitely helps, and goes on the positive interventions list! I'm thinking I'll see what happens with switching off dropout next. That should at least be a bit easier... Stay tuned!

1. Oh my. ↩
2. Technically the L2 norm -- if you used cubes/cube root it would be L3, and likewise for the power of four and L4 and so on. But the L2 is the one used for gradient clipping. ↩
3. Shades of Douglas Adams, really: "But the plans were on display..." "On display? I eventually had to go down to the cellar to find them." "That's the display department." "With a flashlight." "Ah, well, the lights had probably gone." "So had the stairs." "But look, you found the notice, didn't you?" "Yes," said Arthur, "yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying 'Beware of the Leopard.'" ↩

Possible causes of the loss spikes:
- A "bad batch" -- that is, one batch, or even one sequence in a batch, was massively different in structure to the others that the model had seen, so it just had much worse loss. That doesn't seem likely in this case, though: the numbers on the chart are averages over 617 global steps each, and it would take a truly pathological sequence to move the needle that much.
- Something weird in the optimiser. That's not something I understand well, but according to the various LLMs I'm working with, it's a possibility.
- Exploding gradients. This is my working hypothesis, and so in this post I'll try out gradient clipping, the normal solution to that problem.

[1] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning (2016), MIT Press.

The two ways to clip gradients:
- We clip element-wise. If any one of the gradients in the vector is larger than v, we reduce it to v.
- We clip based on the norm: the length of the gradient vector in -- in our case -- 163M-dimensional space. That sounds harder than it is -- it's really just an extension of the Pythagorean equation that $a^2 + b^2 = c^2$ to multiple dimensions. If you want to work out the length of a vector $(a, b)$ then you can use Pythagoras to work out $c = \sqrt{a^2 + b^2}$, and that generalises to any number of dimensions. So for our model we'd just square all 163M elements of the vector, sum those, and take the square root of the result, and that's the norm (footnote 2). If the norm is greater than v, we just divide every element of the gradient vector by the norm and multiply the result by v, to produce a new gradient vector whose norm is v.

What tracking the clipping lets us see:
- Whether we actually did wind up clipping them and fixing those loss spikes.
- Whether we were clipping at other times -- we don't want to be doing it unnecessarily.

What a library can reasonably do with infinite/NaN gradients:
- Blindly apply them and expect the developer to sanitise their inputs.
- Raise an error.
- Take some kind of default sane action, like skipping the update.

The three ways a gradient can explode:
- It can get very large, still be finite, and have a finite norm.
- It can get very large, still be finite, but have an infinite norm (eg. due to numerical overflow).
- It can become infinite -- that is, at least one of the parameters' gradients is infinite (which of course means an infinite norm regardless of any numerical stuff).
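The three ways a gradient can explode, as listed in this post, could be told apart with a check along these lines. This is a sketch in plain Python with a hypothetical function name, `classify_gradient`, and invented sample vectors. It uses Python's 64-bit floats, so the overflow case needs absurdly large values here; in 16-bit or 32-bit training the overflow threshold would be far lower.

```python
import math


def classify_gradient(grads, max_norm):
    """Sort a gradient vector into one of the categories from the post."""
    if any(not math.isfinite(g) for g in grads):
        return "non-finite element"
    # g * g overflows quietly to inf, whereas g ** 2 would raise OverflowError.
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        return "finite elements, infinite norm"
    if norm > max_norm:
        return "large but finite"
    return "within limit"


examples = {
    "normal": [0.3, 0.4],
    "big": [30.0, 40.0],
    "overflow": [1e200, 1.0],
    "infinite": [math.inf, 1.0],
}
labels = {name: classify_gradient(g, max_norm=3.5) for name, g in examples.items()}
print(labels)
```

Counting these labels per checkpoint period would be one way to run the follow-up experiment the post describes, though the logging itself is left out of this sketch.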

Giles's blog 2 months ago

Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

I'm rounding out my series of posts on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) " by seeing how I could train the best base model I can from scratch on my own hardware. I started by training one in two days on my RTX 3090 , and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better. For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud . That led to some refinements in the prompt-following test I was using , and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub . Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to. I listed a number of possible interventions at the end of the RTX 3090 post; I'm not going to do them all, but for completeness, here's the full list: I'm going to work through each of those apart from the first two and the batch size (and will retrospectively add links to the posts when I do), trying a train with just that intervention and nothing else, on a cloud machine. Once that's done, I'll bake all of the things that helped into the training loop, and do another local train -- with gradient accumulation to make the batch size match the cloud instances'. The cloud machine size that I decided to use for this was the one that came out the most cost-effective (and due to its VRAM size, had the best loss) in my earlier cloud training test: an 8x A100 machine with 40 GiB VRAM per GPU. 
But first, we need a baseline model. I've already done a train on an 8x A100 40 GiB machine -- why do we need a new one? In my cloud training post, I came to the conclusion that the cost in terms of training time of running a periodic validation loop as we trained was not really worth it, at least in this case. Two of the biggest reasons to have validation during training are to work out when you're overfitting on a multi-epoch train, and to see how your model can handle datasets that it has not been trained on. In a single-epoch train like this, you're not going to overfit -- every sample it sees will be new to it -- and the training loss itself is over samples it's not been trained on at the time it was calculated, for the same reason (though of course it will be trained on them as soon as we do the backward pass starting with that loss). Of course, it's not perfect -- a big benefit of the validation loss is that it's over the same held-back dataset on every run -- and there are arguments for keeping it (albeit, perhaps doing full runs less frequently than I was). But for these experiments, I decided that I'd simply drop it. I also wanted to introduce a consistent random seed at the start of the training loop. I didn't have that in my cloud trains, and of course if we want to have solid results on whether each intervention really does improve matters, then we need one so that we can be sure they're all starting from the same point. Both of those meant that I couldn't use the earlier train on the 8x A100 40 GiB machine as a baseline; I'd need a new one, introducing those two changes: no validation during the training run (using training loss as a proxy), and setting a random seed at the start for reproducibility. So: what was the baseline train going to look like? 
The first step was to strip out the validation code and to replace it with code that just took periodic checkpoints, keeping track of which one had the best average training loss over the period since the previous one. Next, I decided to plot on the training chart that is generated during the run not just the training loss, but also an indicator of the maximum and minimum training loss over all of the steps in that period. Then I added the random seed, which I set to 42. A couple of bugfixes, and we were left with this version of the code.

One thing to highlight: in the file that specifies the various training parameters, I set the per-GPU micro-batch size to 12 rather than the 13 I'd used on this size of machine earlier. Two reasons for that:

Firstly, I'm going to want to do a local run with gradient accumulation later, using all of the helpful interventions. With gradient accumulation, you do a number of steps with batches that you can fit into your memory, but you don't update the parameters each time. After a number of those, you do one big update based on the accumulated gradients -- hence the name. The full batch is all of those smaller batches taken together. If I want that to closely match the cloud train, I'll want the accumulated batches to be the same size as each global batch in the cloud. Now, on my local machine, I can fit a batch of 6 into VRAM. So that means that the full batch needs to be divisible by 6 (footnote 1). On the cloud train, with a micro-batch of 13 and 8 GPUs, we had an overall batch size of 104 in the previous train. 104 is not divisible by 6: no joy. But with a micro-batch size of 12, we have an overall batch of 12 × 8 = 96, which means we'd be able to do gradient accumulation and do a parameter update every 96 ÷ 6 = 16 steps.

Secondly, while my estimate of the ideal overall batch size was based on a rather arbitrary bit of curve-fitting, it did say that 97 was the ideal size. So it could be interesting to see whether it did help!
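The batch-size arithmetic and the idea behind gradient accumulation described above can be checked with a small plain-Python sketch. The one-parameter model, the toy data, and the names `mean_grad`, `MICRO_BATCH` and `ACCUM_STEPS` are all assumptions made for this illustration; the only figures taken from the post are the micro-batch of 6, the 12 × 8 cloud batch, and the 104 from the earlier train.

```python
import math

MICRO_BATCH = 6                           # what fits in local VRAM
CLOUD_BATCH = 12 * 8                      # per-GPU micro-batch x number of GPUs
ACCUM_STEPS = CLOUD_BATCH // MICRO_BATCH  # micro-batches per parameter update


def mean_grad(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Deterministic toy data: 96 (x, y) pairs.
data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(CLOUD_BATCH)]
w = 0.5

# One gradient over the whole 96-sample batch, as on the cloud machine.
full_batch_grad = mean_grad(w, data)

# The same gradient built up from 16 micro-batches of 6, with no update between.
accumulated = 0.0
for step in range(ACCUM_STEPS):
    micro = data[step * MICRO_BATCH:(step + 1) * MICRO_BATCH]
    # Dividing by ACCUM_STEPS turns the sum of micro-batch means into a full mean.
    accumulated += mean_grad(w, micro) / ACCUM_STEPS
# Only at this point would the optimiser update w.

print(ACCUM_STEPS, 104 % MICRO_BATCH, full_batch_grad, accumulated)
```

The accumulated gradient matches the full-batch one up to floating-point rounding, which is why 16 local micro-batches of 6 can stand in for one cloud batch of 96. A batch of 104 leaves a remainder of 2 when split into sixes, so it cannot be matched this way.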
So, having coded that up and set up the configuration, it was time to run it. Here's the training chart it came up with: Note the loss spikes at around global steps 4,200, 13,000 and 23,000. Those are important, I'll explain why later. The training run reported this at the end: So it took about 3h24m to train, even less than we expected from the previous cloud experiments' estimates of how long it would take excluding validation. About US$35 in cost. Here is the model on Hugging Face Hub . Let's see how it looks. For these intervention posts, I won't run the instruction-following tests, as they can only be run against a batch of models in one go to get results that are consistent with each other . But the smoke test -- how does it complete the sequence is worthwhile: Looks good! Reasonably coherent. Now we can find the loss on our held-back test set: That's a bit worse than the 3.674 we got for the original cloud train. Either the calculations of the optimal batch size I did were not quite right (entirely likely, they were very ad-hoc) or the model weights we started with, given the random seed we're using, just happened to lead us in a slightly worse direction (also plausible). Either way, it's in line with what we expected, and is still better than the test loss of 3.725 that we got with the second-best machine in the cloud comparison post (the 8x H100 80 GiB with a global batch size of 216). So: we have a solid baseline model -- before we wrap up, let's consider those spikes in the loss that I called out in the training chart. Random spikes in the loss are a Bad Thing, right? Certainly they're a bad thing for a train in general, especially if you don't know for sure what's causing them. But my working assumption has been that they're caused by exploding gradients -- for some specific sample in the dataset, the gradients have gone up to some insanely high value, and we've had a bad update to our parameters as a result. 
It hasn't completely knocked the model back to its starting point, but it does take some time to recover, so we lose the benefit of some of our training. If that is the case -- and it's not just something like a batch happening to have stuff that's wildly different to the rest of the training data, or something weird in the optimiser -- then gradient clipping is the solution. I wanted to see if it would help the model quality in general, but of course if we hadn't had any loss spikes in this baseline train it would have been hard to see if that was the case! So I was very glad to see them here, as if there had been none I would either have had to do a gradient clipping experiment with no real expectation of it helping -- or do another baseline train with a different random seed in the hope that that caused some spikes, which would have cost another US$35. All in all, it was good to see them there, as it sets us up well for that experiment. So, we've trained a baseline model that we can make changes to -- the interventions I listed at the start -- and get a pretty reliable understanding of whether or not they help the quality of the final model. With that in place, we're in a good position to start running those intervention tests! Given the loss spike situation in that chart, I think that a solid first one to go for -- even though it was the last in that list at the top of this post -- is gradient clipping. Where are those loss spikes coming from, and if it's exploding gradients, what happens if we limit the damage they do with gradient clipping? Stay tuned! I've already done the training run for that (while I wrote this one up), so I should be able to post about it tomorrow. Well, you could potentially do something with batches of different sizes, but that would be fiddly.  ↩ The amount of training data. 
I'm not going to dig into this one; it looks like it does help, but the returns diminish rapidly, so I think that in order to get any serious improvement we'd need to train for much more than two days locally. In the one "extended training" test I did, I managed to get the loss down from 4.167 to 4.135, which was... less-than-inspiring. The number of epochs. I'm going to stick to single-epoch training -- that is, I'll train on a single pass through an amount of non-repeating data chosen to take 48 hours to handle on my local machine. The bias on the W q , W k and W v matrices. This one definitely sounds worth looking into -- easy, as it's just a change to a config flag, and makes the model more like the original GPT-2. I'll give that a go. Dropout. I've read that for single-epoch training, dropout doesn't help (which doesn't quite work with my mental model of what it's for, but does sound plausible). Worth a look! The learning rate, and weight decay. The values I've used for these are basically copypasta from the book. I think I should learn to understand these and try to optimise them a bit. The precision. I'm using AMP , which means that some calculations are done in 16-bit rather than 32-bit, and calling with to let PyTorch choose to use the GPU's tensor cores, which use TF32, a kind of "32-bit float lite" (see the post on the local train for details). Those both (at least potentially) reduce the precision of the train below what you'd get if you trained with full-fat . Would reverting that be worth the longer train time? I should probably at least poke at that. The batch size. I've already, in effect, tried playing with that. The different cloud machines I played with had different amounts of per-GPU VRAM, so supported different per-GPU micro-batch sizes. So I wound up trying batch sizes from 512 (the same as the original GPT-2 was trained with) down to 104 in the cloud, plus my local trains with a batch size of 6. 
I did a rough-and-ready calculation at the end of the cloud training post where I estimated that the ideal batch size might be something like 97. So, probably not worth much more investigation. Exploding gradients. In one of my local trains, and in three out of the four cloud trains, I had sudden spikes in both training and validation loss. It generally took quite a bit of training -- maybe 10-15% of training time -- to get back on track after some of these, so we had what could be seen as wasted time in the training runs. Exploding gradients can be fixed by gradient clipping, which is relatively easy to do. Definitely worth investigating!
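To make the gradient clipping idea mentioned above a bit more concrete, here's a minimal, framework-free sketch of clipping by global norm. The function name and the flat list of gradient values are illustrative assumptions, not code from the training runs; in PyTorch the same job is normally done with `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Small enough already: leave the gradients untouched.
        return list(grads), total_norm
    # Too large: shrink every component by the same factor, keeping the direction.
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The point is that a spike in gradient size changes how far the optimiser steps, not which direction it steps in, which is what limits the damage from a single bad batch.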

Giles's blog 2 months ago

Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)

I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone that was interested. I managed to get it done, but it was kind of tricky to get right. The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end. This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-) Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it? You could, of course, just dump the code on GitHub and share the weights somewhere. 
If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on. That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library , using models that had been uploaded to their hub . What it would be nice to do is share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this: ...rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config then -- rather than like this , with its >100-line function. Here's what I had to do to get it working. To make it easier to follow along with this post, I've created a GitHub repo . As a starting point, I recommend you clone that, and then check out the tag: You'll see that there's a file, which contains my version of the GPT-2 style LLM code from Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". There's also a script called , which is some code to run a model and get it to predict the 20 next words after the string , and a config file for the LLM code called , which tells it the number of layers, attention heads, and so on. If you want to use it and see what it comes up with, you can download the model weights from one of my trains, and install the dependencies with (recommended) or by running it in a Python environment with the libraries listed in installed. 
You'll get something like this: Your output will probably vary (for this and the later examples), as you'd expect from sampled LLM output, but it should at least be reasonably coherent. So: let's get it on Hugging Face! Our goal of being able to run inference with Transformers' system relies on a couple of deeper levels of abstraction. The requires that the model be available for download -- complete with all of its code and weights -- using code like this: is the HF abstraction for models that generate text. If that flag is concerning you, it is indeed a bit scary-looking. But remember that our goal here is to share a model on HF that has its own code, and that means that anyone that downloads it will have to opt in to downloading and running the code -- the flag is how they do that opt-in. So it is, unfortunately, necessary. Now, that model will need a tokeniser in order to run. Perhaps not surprisingly, the HF system expects to be able to download that with similar code: With both of those working, appropriate code for our pretrained models, and a bit (well, to be fair, quite a lot of) configuration, we'll be all set. But that's quite a big jump. There is a more general class called ; it's much simpler, just wrapping a generic model that might be doing anything. If we support it, we'll still need to use all of that clunky inference code, but the model's code and weights will be on Hugging Face Hub, and can be downloaded and instantiated easily. So let's get that working first, just to work out the bugs and get the basic process down pat. Our goal is to be able to run this in a Python environment where we just have and installed: ...and then have a model that we can run inference on, just like the code in our repo , but without the hassle of having to download the weights ourselves. Definitely a QoL improvement, even if it's not the endgame. If you're following along with the git repo, the tag to check out for this section is . 
In this version, you'll see a new subdirectory to contain our HF wrapper code (which I've imaginatively called ); you'll see why we need that later. In there, I've added a symlink to the model code itself (also to be explained later), an empty file to make the directory a Python module, and two files with some Transformers code: Let's dig into what's going on in those two. The first thing to understand is that whole thing in the filenames. Transformers is designed to handle all kinds of different models -- for example, Meta's Llama models and Qwen's models have their own codebases. These widely-used public models have code that is already built in to the library, with "model types" like and or respectively -- but we don't have that advantage. Our code is not built in to the library. So we need a distinct name for our type of model, which will let the library know that it has its own code and it shouldn't try to rely on built-in stuff. I chose because my Hugging Face username is my initials, 1 , and this model is the implementation of the GPT-2 architecture I'm playing with. That feels like a solid pattern to me -- it's unlikely to clash with anything built in. But the format appears to be fairly free-form, so you can choose pretty much anything so long as you're consistent throughout your code, and so long as it doesn't clash with any of the built-ins. So, you need two files with those specific names: your-model-type , and your-model-type . Let's look at them now. They're really simple at this stage; here's the configuration one: Now, when Transformers is loading a model with , it's going to need to know how to configure it. At the very least, it will need to know what to pass into the . If you look at the code , it's taking a config dictionary with stuff like the number of layers, the number of attention heads, and what-have-you. That's going to be required to instantiate the model with the right setup so that it can load the weights that we're providing. 
There's other config stuff that will come there later, but that's all we have for now. It does this using the same pattern as the various methods we were looking at earlier: All we're doing here is defining what kind of thing that method will return when it's all set up properly. You can see that we're inheriting from a class -- this provides all of the infrastructure we're going to need to push things to HF. I don't think that the name of the config class technically matters, but it definitely seems like best practice to name it based on the model name -- so, we're using for our model. However, the is important -- it has to match the model type that we've chosen and used for our filenames. Apart from that, we're stashing away the config that we're provided on a field, and then calling our superclass , forwarding on any kwargs we got in our own . Now let's look at : Just as with the config, there's for us to inherit from 2 . We're defining the thing that will return when it's all set up properly. We tell transformers that this should be configured with the that we just defined using that class variable, but apart from that, we're basically just wrapping the that is defined in 3 . That is imported using a relative import using rather than : This is important -- it has to be that way, as we'll discover later. But for now: that's why we had to create the subdirectory and the symlink to -- a relative import in Python can only happen if you're not in the "root" module, so we would not have been able to do that kind of import if the files were at the top of our repo. Now, let's take a look at the . We're calling the superclass , as you'd expect, then we're creating an underlying wrapped . We're expecting a parameter, which has the underlying model's configuration stashed away in its field by its own , so we can pass that down to the wrapped model. 
Finally, we call this special function; that does some extra configuration, and prior to Transformers 5.0.0 you could get away without calling it, but now it's 100% necessary, as otherwise it will not initialise its internal fields relating to whether or not the model uses weight tying. Now let's take a look at how we actually use those to upload the model. That's back at the root of the repo, in the file . Before looking at the code, try running it: So, it takes a model config path -- that file we have to set the number of layers and so on -- and the path of a safetensors file containing the weights. It will then try to upload our HF-friendly wrapped version of the model -- code, weights and config -- to the Hub. Let's see how it works. We do some boilerplate imports, and then import our config and our model classes -- importantly, via the submodule. Don't worry, we're getting close to the explanation of why that is :-) A bit of argument-validating boilerplate and the loading of the model config file into a dictionary so that we can use it, and now we get to the meat of it: What this is doing is telling our to register itself so that it is a thing that will be returned by the call. This only applies locally for now, but by setting things up locally we're telling the library what it will need to push up to the hub later. Next: We're doing exactly the same for our model, saying that it should be returned from . We need to be explicit about which of the various model classes we want to register it for -- the config class can only be loaded from , whereas the model might be something we'd want to have returned from , or if it was a different kind of model, perhaps , or something else entirely. What we want to do here is expose the basic model using , so that's what we do. 
We're creating our config class, passing in that model configuration that we loaded from the file earlier, so that it will stash it on its field, then: ...we create our model wrapper using that config. We now have an instance of our custom model, but with uninitialised weights. So: ...we load in the weights that were specified on the command line. Note that we have to load them into the wrapped model. The file we have is specifically for the custom that we want to publish, not for the wrapped one. But that's easily done by using the field. Finally, the magic: This is where the Transformers library really shows its strength. It will push the model, which means it needs to push the weights that we loaded into its wrapped . Then it will look at the class that defines the model, and will push the file that has the source for that class. It will see that it also has a dependency on , and will push that and its source . It will also spot the setup we did with our two calls to the different methods above to register them for the and and push that too. And when it's pushing the source, it will try to push the source of any dependencies too. This is where we get the final explanation of why we had to put it in a submodule, and have a symlink to . The code doesn't want to upload loads of extra stuff -- for example, any libraries you're using. It wants to be sure that it's only uploading your model code. The logic it uses for deciding whether or not something is part of the uploadable set of files is "was it imported relatively from the or the file" -- that is, with a dot at the start of the module name, rather than . In order to do that kind of import, we needed to create a submodule. And in order to access our file we need a copy of it inside the submodule. I didn't want to have two actual copies of the file -- too easy to let them get out of sync -- so a symlink sorts that out. Hopefully that clears up any mystery about this slightly-strange file layout. 
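To illustrate the "was it imported relatively" rule described above, here's a small sketch that lists the relative imports in a piece of Python source using the standard `ast` module. This is not the Transformers library's actual implementation (I haven't reproduced that here); the function name and the module names in the example source are placeholders.

```python
import ast


def relative_imports(source):
    """Return the modules imported relatively (with leading dots) in `source`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        # `from .foo import Bar` parses as ImportFrom with level == 1;
        # absolute imports have level == 0 and are ignored.
        if isinstance(node, ast.ImportFrom) and node.level > 0:
            found.append("." * node.level + (node.module or ""))
    return found


example_source = """
import torch
from transformers import PreTrainedModel
from .my_model_code import MyModel
"""

print(relative_imports(example_source))
```

Only the dotted import is reported, which matches the behaviour described: third-party libraries are left alone, and only files that sit alongside the wrapper code are candidates for upload.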
Let's give it a go and see what it creates! In order to upload a model to the HF Hub, you'll need an account, of course, so create one if you don't have one. Next, create an access token with write access -- the option is in the "Access Tokens" section of the "Settings". Then you need to authorize your local machine to access the hub using that token; if you're using , then you can just run: If you're not, you'll need to download and install the HF CLI and then run That will store stuff on your machine so that you don't need to log in again in the future -- if you're concerned about security, there's an you can call, and you can completely trash the session by deleting the associated token from the HF website. Now, let's run our upload script! You'll need to change the target HF model name at the end of the command to one with your username before the slash, of course. Once you've done that, take a look at the model on Hugging Face. You'll see a rather ugly default model card, but let's ignore that for now and take a look at the "Files and versions" tab. You should see the following files: Now, let's look into that . It will look like this: The bit is just showing the name of the class that was used in the call. This will become useful later when we get onto the pipeline code, but doesn't matter right now -- the next one is more important. The is essentially saying, if someone does on this model, then use the class from here, and likewise for should use . It's what that stuff we did in the upload script set up. The is just the parameters that we're threading down to our underlying custom class; nothing exciting there. The is, of course, the floating point type we're using for the model, and the is our unique name for this particular architecture. And the is the version of the library used to upload it, presumably used to determine compatibility when downloading models with earlier or later versions. 
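As a rough illustration of how the mapping from auto classes to custom code is laid out, here's a sketch of a cut-down config dictionary and a helper that splits an entry into a module name and a class name. The key names (`architectures`, `auto_map`, `model_type`) are the ones I understand Transformers to write; every value below is a hypothetical placeholder rather than the content of the real uploaded file.

```python
import json

# Placeholder values only: these module and class names are made up for
# illustration and are not the ones used in the post's repo.
example_config = {
    "architectures": ["MyModel"],
    "auto_map": {
        "AutoConfig": "configuration_mymodel.MyModelConfig",
        "AutoModel": "modeling_mymodel.MyModel",
    },
    "model_type": "mymodel",
}


def resolve_auto_class(config, auto_class_name):
    """Split an auto_map entry of the form 'module.ClassName' into its parts."""
    target = config["auto_map"][auto_class_name]
    module_name, _, class_name = target.rpartition(".")
    return module_name, class_name


print(json.dumps(example_config, indent=2))
print(resolve_auto_class(example_config, "AutoModel"))
```

Each entry names a Python file that was uploaded next to the weights and a class inside it, which is why loading the model means downloading and running that code.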
So, it looks like there's enough information across those files on the hub to instantiate and use our model! Let's give that a go. The best way to check it out thoroughly is to create a completely fresh directory, away from our existing ones, and a fresh environment: and then to try to use the model: So we can see where Transformers has put the downloaded code, inside a submodule that appears to have a GUID-like name. Now let's try to run some inference on it: So there we go! We've gone from a situation where we would have to publish the code and the safetensors in some way and tell people how to combine them, to a neatly-packaged model that we can download, fully set up, with just one line: But that inference loop is still a pig; if you've been working with LLM code then it's not too bad -- a basic bit of autoregression with top-k and temperature -- but it's definitely holding us back. What next? One obvious issue with the code above is that we still have that dependency on . If we're going to run inference using the simple HF object, it's going to need to know how to encode the input and decode the outputs. And if you have your own tokeniser (which, if you have a truly custom model, you probably do) then you won't have the luxury of being able to just install it into the target runtime env -- you would still need to copy files around. Now, as I said at the start, I'm not going to go into this in as much detail, because my use case was really simple -- although I was using , the specific tokeniser I was using from that library was the standard GPT-2 one. Transformers has its own version of that installed. So here I'll explain how you do things for models that use a built-in Transformers tokeniser. After that I'll give some pointers that you might find useful if you're using something more custom. The good news if you're using a "standard" tokeniser that is already built into the Transformers library is that you can tell your model to use it. 
The downside is that you can't do it by using the trick that we did above -- that is, you can't just import it: ...and then add this below our previous calls to register the model and config as auto classes: That will essentially do nothing. However, tokenisers do have their own method, and the target that you specify can be your model. So, for my own models, I'm using this: That is, we get the tokeniser for the built-in GPT-2 implementation (specifically the "fast" one, written in Rust), set the padding token to the end-of-sequence one for tidiness (not sure why that's not the case by default), and then push it to the model. If you're following along with the code, you can check out the tag to see that. The code goes immediately after we've pushed the model itself to the hub. So, run the upload again: And now we can do a completely fresh env without tiktoken: In there, we can see that works: (Note that I had to use here -- that appears to be new in Transformers 5.0.0.) And do our inference test: It may not be much shorter than the code we had when we just had the , but it's an important step forward: we can now download and run inference on our custom model with none of the custom code -- neither the model itself nor the tokeniser -- on the machine where we're doing it. Everything is nicely packaged on the HF Hub. Now, what if you're using a tokeniser that's not already in Transformers? There are two possibilities here: As I said, I have not done either of these, but that's the direction I'd explore if I needed it. If you do either and want to share your experiences, then please do leave a comment below! And likewise, if and when I start writing things with custom tokenisers, I'll link to the details of how to upload them then. Anyway, we've got the tokeniser done to the level we need for this walkthrough, so let's do the QoL improvements so that we can run inference on the model using the nice HF abstraction. 
Let's look at our target code for inference again: The version of the code that does this is in the repo on the tag , but I'll explain how it was put in place, with the logic behind each step. In order to run a text-generation pipeline, we're going to need to wrap our model in something that provides the interface for LLMs in the Hugging Face ecosystem: . So, our first step is to put the plumbing in place so that we can use the method on that class to download our wrapped model. IMO it's cleanest to have two separate models, one for "simple" inference that is just a regular model -- the we have right now -- and one supporting the richer interface that supports easy text generation. So we can start off by adding the basic structure to : We can then add code to register that to our script -- the last line in this snippet, just below the two that already exist. That feels like it should be enough, but for reasons I've not been able to pin down, it's not -- you also need to massage the "auto-map" in the object to make it all work properly. So after that code, after we've created the object, we need this: With that in place, we could just upload our model -- would work just fine. But the model that it would return would not be any different to the one we've been using so far. To get that to work, we need to update the model to say that it can generate text. That's actually pretty easy. Firstly, we need it to inherit from a mixin class provided by Transformers: Now, the semantics of the method on this class are a bit different to the ones we had previously; we were just returning the outputs of the last layer of the underlying model, the logits. For this kind of model, we need to put them in a wrapper -- the reasoning behind this will become clearer when we get on to training. So our forward pass needs to change to look like this: Finally, some changes to our config class. For text generation, Transformers needs to know how many hidden layers the model has 4 . 
In the case of the model I'm using to demonstrate, that's the parameter in the underlying configuration, so this can go inside the : Another change in the config that took me a while to puzzle out, and might catch you if you're in the same situation: Transformers, by default, assumes that the model caches previous inputs. So in an autoregressive loop starting with , the first run of the model will get the full input; let's say it returns . The next iteration of the loop, however, won't be passed the full new sequence , but rather just the token that was generated last time around, . So you'll get a series of predicted tokens where the first one might make sense but the rest degenerate into gibberish: All of the tokens generated after had just the previous token as their context. Luckily, you just need to specify that your model doesn't have a cache in the config class as well, after the call to the superclass : We're almost there! At this point, we actually have all of the code that we need for a working . But there's one final tweak. A model on the hub has a "default" model type, which is the one that we use when we do the original . You might remember that it appeared in the in that single-element list keyed on . Previously we had this in our upload script: That means that our default is the model. But when the pipeline creates a model for us, it will just use the default -- even for the text-generation task, it doesn't assume we want to use the . Luckily, that's a small change: we just upload our text-generation model instead of the basic one: With all of that in place, we can run the script, upload the model, and then in a fresh environment: Lovely! Now let's get it training. For this section, check out the tag. You'll see a new file, , which has the training loop from the notebook I linked to at the start of this post. It will train the model on this dataset, which is essentially a bunch of chatbot-style transcripts in the Llama 2 format. 
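Going back to the caching problem described earlier in this section, here's a toy, framework-free sketch of why generation goes wrong when the loop assumes a cache that the model doesn't have. The "model" is just an arbitrary function of its whole input, standing in for a cache-less LLM; all names here are hypothetical, and this is not how Transformers' generation loop is actually written.

```python
def toy_next_token(context):
    """Stand-in for a cache-less model: the prediction depends on the full context."""
    return (3 * sum(context) + len(context)) % 10


def generate(prompt, steps, assume_cache):
    """Generate `steps` tokens, optionally feeding only the last token back in."""
    tokens = list(prompt)
    model_input = list(prompt)  # the first call always sees the whole prompt
    for _ in range(steps):
        next_token = toy_next_token(model_input)
        tokens.append(next_token)
        # A loop that assumes a cache passes just the newest token next time;
        # a cache-less model needs the full sequence instead.
        model_input = [next_token] if assume_cache else list(tokens)
    return tokens[len(prompt):]


full_context = generate([1, 2, 3], steps=3, assume_cache=False)
last_token_only = generate([1, 2, 3], steps=3, assume_cache=True)
print(full_context, last_token_only)
```

The two runs agree on the first generated token, because both saw the whole prompt, and diverge after that, which is the "first one might make sense but the rest degenerate" pattern.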
The dataset's goal is to help fine-tune a base model to become an instruction-following one, though of course the model I'm using here is too tiny for that to work well! It's still a useful way of checking that training works, though. To save time, it only does one training epoch, which should be enough to get the loss down a bit. If you run against one of my other models, you can see it working (you will need to tweak the batch size if you have less than 24 GiB of VRAM). You can see that it's at least trying to answer the question after training, even if its answer is completely wrong -- pretty much what you'd expect from the tiny model in question (163M parameters trained on about 3B tokens). In order to get it working with our custom models, we just need to return the loss as well as the logits from the method of our class: You can see that we're getting the targets for our predictions in , and an attention mask; we have to shift them ourselves (that is, if the inputs are , then the labels will be ), and also apply the attention mask manually, and then we can do the normal PyTorch cross-entropy calculation. This makes some kind of sense. The model on HF does need to package its own loss function somehow -- cross entropy is, of course, going to be the most likely option for a causal LM, but there's no guarantee. And while I think that personally I would have just had return logits and package up the loss calculation elsewhere so as not to muddy the interface, I can see the convenience of having it there. Anyway, having done that, we can upload the model one final time, and then use that training code to run it. We have a working training loop! Once again, it's replying, even if it has no idea what the answer is, and starts looping in a typical small-model fashion. And with that, we're done. We've gone from having a custom model that was hard for other people to discover and work with, to something that plays well with the Hugging Face ecosystem. 
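For the shift-and-mask loss calculation described above, here's a minimal pure-Python sketch for a single sequence. It assumes the labels arrive aligned with the inputs, so that position t's logits are scored against the label at position t+1, and that masked-out positions are simply skipped. The real code would do this with PyTorch tensors and its built-in cross-entropy; the function name and list-based inputs are illustrative assumptions.

```python
import math


def shifted_masked_cross_entropy(logits, labels, attention_mask):
    """Mean cross-entropy for one sequence.

    logits: list of T rows, each a list of V floats
    labels: list of T token ids, aligned with the inputs
    attention_mask: list of T values, 1 for real tokens and 0 for padding
    """
    total, count = 0.0, 0
    # Position t predicts token t+1, so pair logits[:-1] with labels[1:].
    for row, target, keep in zip(logits[:-1], labels[1:], attention_mask[1:]):
        if not keep:
            continue  # don't score predictions of padding tokens
        biggest = max(row)
        # Numerically stable log-sum-exp over the vocabulary.
        log_z = biggest + math.log(sum(math.exp(x - biggest) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = shifted_masked_cross_entropy(uniform_logits, [1, 2, 3], [1, 1, 1])
print(loss, math.log(4))
```

With uniform logits over a four-token vocabulary the loss comes out as log(4), which is a handy sanity check that the shifting and masking haven't changed the basic calculation.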
The final step is to write a decent model card so that people know what to do with it -- that, of course, depends very much on your model. I was uploading a bunch of very similar models in one go, so I wound up writing a Jinja2 template and using the class to upload it, but that's just simple plumbing code -- you can see it here if you're interested. As I said at the start, this isn't a full tutorial -- it's just the code I needed to upload my own models, so it doesn't cover tokenisers that aren't already baked in to Transformers -- and there are probably other gaps too. But hopefully it's useful as-is. If you find gaps that your model needs and work out how to solve them, then please do leave comments here -- if there are useful resources out there, either things I missed or things you've written, I'd be happy to link to them from this post. Thanks for reading! I'll be returning to my normal "LLM from scratch" series shortly... It's a fun coincidence that my initials are so similar to the architecture. Someday I should do something with my domain ...  ↩ I'm not sure why the capitalisation of the "t" is different -- vs -- but it seems very deliberate in the Transformers codebase, at least as of version 4.57.6. Some kind of backward-compatibility cruft, I assume. 5.0.0 provides a alias as well, so it looks like they're making things consistent in the future.  ↩ You might reasonably suggest that we could inherit from rather than wrapping it. I've chosen to wrap it instead because I generally prefer composition to inheritance -- the code generally works out nicer, to my mind. I'd suggest starting this way and then refactoring to use inheritance if you prefer later on.  ↩ No idea why, but it does ¯_(ツ)_/¯  ↩ -- a file telling git (which is used to manage the models on the hub) which file types should use the Large File Support plugin. Big binary files don't play nicely with git, so it uses LFS for them. We don't need to pay much more attention to that for our purposes. 
-- that ugly model card. Updating that is useful, but out of scope for this post. . We'll come back to that one in a moment. -- a copy of the file we created locally with our class. -- again, the same file as the local one, uploaded due to that clever dependency-finding stuff. -- our weights. There should be an icon next to it to say that it's stored using the LFS system. -- once more, a file that was just copied up from our local filesystem. You're using the HF library. With that, you can save your tokeniser to a JSON file, then you could load that into a object, which provides a method to push it like I did with the one above. You've got something completely custom. Just like there is a and a , I believe you can also add a that defines a subclass of , and then you can push that to the Hub just like we did our model wrapper class. Working , , , and helpers. A working text-generation . Support for HF's abstraction for follow-on training and fine-tuning.

Giles's blog 2 months ago

Writing an LLM from scratch, part 31 -- the models are now on Hugging Face

As part of my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally , and four in the cloud . I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank. It makes sense to share these models somewhere, both so that other people can take a look if they like, and also to build the knowledge of how to do it so that if I produce something more interesting in the future, I'll know how to share that too. Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later). From the post where I trained the models locally , we have: Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs , four models (with two checkpoints from one of them): You can see how they compare on my evals at the bottom of this post . I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that. Here's the code I've been using as a smoke test after training a model to make sure it's not complete garbage. There's quite a lot of it. That's a lot of faffing about to generate a continuation of ! 
Disregarding the boilerplate with the argument parsing and validating, we have to load up the model, load up the tokeniser, encode our prompt, and then do a bunch of rather arcane stuff 1 to sample from the model to generate some tokens before we finally print out the result. With the HF Transformers library, there are extra levels of abstraction that allow you to do things much more simply: ...and I wanted what I published to work with that -- and, indeed, to be trainable further using the associated training library, like I did during my fine-tuning experiments. I managed to get that all to work, but it was quite a lot more effort than I expected. But at the end, both the pipeline code above and the training code that you can see in this notebook worked fine. I'll write a follow-up blog post shortly about how to write the code to make a vanilla PyTorch model work within the Hugging Face ecosystem (probably not as part of this LLM from scratch series, as it's a bit of a tangent). But in the meantime, if you're using HF and want to take a look, have fun :-) I've put all of the models in a collection. Of course, if you've been reading the posts in this series carefully I'm sure it's all as clear as day ;-)  ↩ -- the first model in that post, trained on a roughly Chinchilla-optimal number of tokens (20x the number of parameters) from FineWeb. -- the second model, trained on the same number of tokens from FineWeb-Edu. -- the third one, which is the model trained further on another roughly Chinchilla-optimal number of tokens from the same dataset. -- trained on an 8x A100, 40 GiB/GPU machine. -- trained on an 8x B200, 160 GiB/GPU machine. -- trained on an 8x H100, 80 GiB/GPU machine. The best validation loss for this train was not in the last iteration, so this is the checkpoint with the best loss. -- this one is the final checkpoint from the one above. -- trained on an 8x A100, 80 GiB/GPU machine. 

Giles's blog 3 months ago

Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI: Here were the results I got, sorted by the loss: Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern. I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now. In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change. Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset in order to let it know how to follow instructions; that comprises a series of sequences like this: More details in this post . In the version I've settled on , I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples. 
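The split and the "bail out when validation loss rises" rule described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the contiguous, unshuffled split is an assumption rather than something the post specifies.

```python
def split_dataset(samples, train_pct=85, val_pct=5):
    """Split samples into train/validation/test lists (85%/5%/10% by default).

    Assumption: a simple contiguous split with no shuffling.
    """
    samples = list(samples)
    n_train = len(samples) * train_pct // 100
    n_val = len(samples) * val_pct // 100
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test


def pick_best_epoch(val_losses):
    """Return the index of the last epoch before validation loss first rises.

    If the loss never rises, the final epoch is returned.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] > val_losses[epoch - 1]:
            return epoch - 1
    return len(val_losses) - 1


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
print(pick_best_epoch([1.20, 1.05, 0.98, 1.01]))
```

The weights saved at the epoch returned by `pick_best_epoch` would be the ones used to generate responses for the held-out test samples.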
Once that's done, the script hits the OpenAI API, using GPT-5.1, default parameters for all of the options (eg. no explicit temperature) with queries like this: We do that for every model-generated response in the test set, then take the average of the scores and use that as our result. To see why that's problematic, imagine this simple instruction with no separate input: One response I've seen from my models was this: That's obvious garbage, and should get a zero -- and GPT-5.1 consistently does that. Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), is this: That's correct, so it deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset). But now how about this one: One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason for that, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response. The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment. I came to the conclusion that the best way to fix this is to do a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, and generate responses for the test set and store them in a file. 
Then, once we've done that for all models, we can score them all at once, prompting GPT-5.1 with something like this: The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better. Here's the code: Running the first against each of our models, and then the second against all of the output files, gives us this updated table (with links to the annotated JSON files in case anyone else wants to take a look): (Still sorted by loss so that you can compare it more easily with the one above.) That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern. It looks like we have three groups of models: I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB, and the two local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower. A hypothesis: there are two things that contribute to how good a model is at these IFT tests: Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here. While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. 
Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible. By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training. So we can imagine the OpenAI models are smart but not knowledgeable, as we can our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well. Their intelligence makes up for some of their lack of knowledge. Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, but they didn't wind up with a particularly amazing loss. So they get low scores. And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more because their training data was better. Well, it sounds plausible ;-) And I'd like to spend some time digging in to see if there's any indication if it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed how you'd test that hypothesis in any real depth. TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning about how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. 
I will keep on doing these IFT tests for future models, though, just out of interest. So: let's get back to our regularly scheduled LLM training. Next up, how do we upload our models to Hugging Face quickly and easily so that other people can play with them?

The two ways of comparing the models mentioned at the start of the post:
A simple cross entropy loss over a fixed test set.
The results for an instruction fine-tune test that's covered in the book.

The two pieces of code:
A script to fine-tune a model and generate test responses and to dump them into a JSON file.
The LLM-as-a-judge code to send a bunch of models' responses to GPT-5.1. It scrambles the order of the models in each query, to try to avoid any preference the model might have for the first one vs the last one, and it stores GPT-5.1's per-response scores and comments in a new "annotated" JSON file.

The three groups of models:
The OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores.
The other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores.
The FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's.

The two things that contribute to how good a model is at these IFT tests:
The loss. Models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning.
The amount of information in the dataset. It doesn't matter how clever a model is, if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question.
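To make the batch-scoring idea from this post concrete, here is a minimal sketch of how one query might be assembled so that every model's answer to the same instruction is judged together, with the model order scrambled. The prompt wording, the function names and the toy instruction are all hypothetical; the actual scripts are the ones linked from the post, and no API call is made here.

```python
import random


def build_batch_query(instruction, responses_by_model, rng):
    """Return (prompt, order), where order[i] is the model behind 'Response i+1'.

    Model names are deliberately kept out of the prompt so that the judge
    only ever sees anonymous, shuffled responses.
    """
    order = list(responses_by_model)
    rng.shuffle(order)
    lines = [
        f"Instruction: {instruction}",
        "",
        "Score each response below from 0 to 100.",
    ]
    for position, model in enumerate(order, start=1):
        lines.append(f"Response {position}: {responses_by_model[model]}")
    return "\n".join(lines), order


def unscramble(scores, order):
    """Map the judge's per-position scores back to model names."""
    return dict(zip(order, scores))


rng = random.Random(0)
responses = {
    "model_a": "Jane Austen.",
    "model_b": "The book wrote itself.",
}
prompt, order = build_batch_query("Who wrote 'Pride and Prejudice'?", responses, rng)
print(prompt)
print(order)
```

Because every model's answer to a given instruction sits in the same query, a judge in a harsh or generous "mood" affects all of them together, which is the consistency the batch approach is aiming for.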

Giles's blog 3 months ago

Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that: In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way? Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing. So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?" Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. They can train the model in less than four hours, they happen to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation. 
If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results . Back when I was messing around with fine-tuning LLMs using the Hugging Face ecosystem -- their "Transformers" library and so on -- one of the experiments I did was to fine-tune a 0.5B Qwen model on an 8x GPU machine . As part of that, I came across this excellent HF page summarising different kinds of multi-GPU training techniques . The three that are relevant are: Now, from what I understand, due to all of the copying around of models, plus the issues inherent with the GIL in Python, DDP is actually better than DP despite being more complicated -- and more flexible! Per Hugging Face: DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine. It might be a while before I want to try multi-machine training, but it would be awesome to have code that's ready to do that without needing any extra work. Now, how to implement it? Hugging Face have a library called Accelerate , which does everything for you: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! That does sound very useful, but I worry that by using it I won't learn as much. It also rather ties you in to the HF ecosystem. That's not necessarily a bad thing -- I enjoyed using their stuff in my fine-tuning project -- but I'm trying for a somewhat lower-level view in this series. So, let's use the PyTorch-native stuff. There's a "getting started" tutorial , so we can follow that. It has two options for running using DDP, one with a bit of extra setup code -- the first example, under "Basic Use Case" -- and one that uses to make things easier. The second sounds best. 
The code changes actually look really simple; given a normal single-GPU training script, you need to do some setup at the start: ...then wrap the model itself in a object, which is what you actually do the train on: ...and a bit of teardown at the end: The way to look at this is that will spin off one process per GPU, each running exactly the same code. They have a "rank", which is an integer saying which of the per-GPU processes they are -- 0 for GPU 0, 1 for GPU 1, and so on. There's a bit of a gotcha here, though -- you can see that we're looking at an environment variable called at the start, but we then get a (non-"local") variable from a bit later on. This is due to the multi-machine possibilities with DDP -- if you have multiple machines, then the local rank will be "which GPU on the machine does this process relate to", but there will also be a "global" rank, which is unique across all machines. This distinction won't matter that much during this one-machine test, but it's worth keeping in mind if we want to keep the code in a shape where it could potentially scale to multiple machines. Anyway, after the processes are spun up, they will do their training, and the synchronisation and passing around of gradients during the backward pass will all happen invisibly in the background, so when we do our , it will have the full set of gradients. Now that means that we'll presumably also need to use the rank -- that is, which of the n per-GPU processes the current code is running in -- when selecting which dataset items to train on. More about that later. Let's start writing some code! I'll use a new repo , into which I can put just the code needed for this train. I'll also structure it a little better than last time, with separate "runs", each of which has a model config and training parameters, and will later on have its own checkpoints. You can think of these as being one per machine size that I'm trying out -- I'll create a run directory for each one. 
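To make the local-versus-global rank distinction above concrete, here is a small sketch that reads the environment variables `torchrun` sets for each process (`LOCAL_RANK`, `RANK` and `WORLD_SIZE`). The helper names are hypothetical, the post itself gets the global rank from the distributed library after initialisation rather than from the environment, and the "node rank times GPUs per node plus local rank" layout is the typical arrangement for identical machines rather than a guarantee.

```python
import os


def read_ranks(env=None):
    """Return (local_rank, global_rank, world_size) from torchrun-style env vars.

    Falls back to a single-process setup when the variables are missing.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))
    global_rank = int(env.get("RANK", local_rank))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, global_rank, world_size


def expected_global_rank(node_rank, gpus_per_node, local_rank):
    """Typical layout (an assumption): ranks are numbered node by node."""
    return node_rank * gpus_per_node + local_rank


# On one 8-GPU machine the two ranks coincide...
print(expected_global_rank(node_rank=0, gpus_per_node=8, local_rank=3))
# ...but on the second of two machines they do not.
print(expected_global_rank(node_rank=1, gpus_per_node=8, local_rank=3))
```

The local rank is what picks the GPU on a given machine; the global rank is what should be used for anything that has to be unique across the whole job.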
Here's a first cut , simply loading up a model config from a run's directory, using it to create the model, and then doing the wrapping above -- no training at all. Running it with (and , as I'm using that for all new projects): Promising. Now, unfortunately we only have one GPU locally, and the code assumes that it's one process per GPU (I believe that's a hard limitation for PyTorch's DDP), so running with blows up. So we can't do an in-depth test locally. But at least we know that the basic infra is there and working. Now let's move the other training code from the single-GPU script into that file, pretty much blindly. This is the result -- it's doing almost nothing beyond what the last train did, apart from wrapping the model in a object -- the only other changes are to use this "runs" directory that we've introduced. As a quick hack, we should try running it. It does a validation and checkpoint before it starts, and we can make that happen quickly by hacking the validation loop to only do a couple of iterations: (Foreshadowing: that hack will come back to haunt us later!) Running that, then hitting control-C after the validation completes, and it looks OK: ...and we have what look like solid checkpoints: However, loading one of those checkpoints fails: It turns out that the problem is this code when we save it: The that we're saving is the wrapper around our model; my guess is that it does actually include all of the weights for the model, hence the correct-looking size for the checkpoint file, but they're renamed -- the wrapper sees the underlying model as something called , so (for example) would be called . Fixing that, with this diff: ...sorts it out -- we can load our checkpoints again. Here's the updated file . 
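The renaming problem above comes from the fact that PyTorch's DDP wrapper holds the original model in an attribute called `module`, so every key in the wrapper's state dict gains a `module.` prefix. A common fix is to save the wrapped model's own state dict instead of the wrapper's. As a complementary sketch (not the diff from this post), here is one way to rescue a checkpoint that was already saved with the prefix; the function name is hypothetical and plain dicts stand in for real state dicts.

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DDP 'module.' prefix removed.

    Keys that do not start with the prefix are left untouched.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example: strings stand in for the weight tensors.
saved = {
    "module.tok_emb.weight": "tensor-a",
    "module.out_head.weight": "tensor-b",
}
print(strip_ddp_prefix(saved))
```

The cleaned dict can then be passed to the unwrapped model's `load_state_dict` as usual.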
I think we're going to have to revisit checkpointing and validation again; we don't want to do it in all of our processes, probably only on global rank 0, and we'll need to somehow synchronise everything so that the other processes don't carry on training while we're doing it. But before we get on to that, there are a couple of other things to change. At the top of the file we're defining some constants that look wrong: We'll handle the dumbest of these first; it was actually silly that in the old code we had a constant for sequence length. We're using the context length of the model for that, so it's duplicated information. Let's get it from the : ...and here's the updated file . That was nice and simple. The code that we have specifies the batch size for each GPU -- that is, with , we'll have six sequences in each batch on each one. Like I mentioned earlier, that's called a "micro-batch" in distributed training like this 1 -- a per-GPU batch, as opposed to the overall global size across all GPUs -- so we could just rename it, and then we'd have 6 × n gpus as a global batch size. However, it feels to me like this is a useful metaparameter to be able to tweak from outside the code. I can see machines with per-GPU VRAM varying from 40 GiB to 160 GiB on Lambda Labs, and pretty clearly that will mean there will be a varying largest micro-batch size on each type. So this is something we'll want to configure on a per-run basis, so let's add a new file to our run config, load that up, and pass it through. That's a simple enough fix; no need to note the diff, but here's the code . This one we'll need to think about. The size of our validation set is based on what one process running on my local RTX 3090 can validate in five minutes, and the interval (for which I fairly arbitrarily put 2000 in the code when copying it across) was calibrated for roughly every half-hour. Those numbers in turn were aimed at the 44 hours of training time I expected locally. 
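Going back to the per-run micro-batch setting described above, here is a minimal sketch of how it might be loaded and turned into a global batch size. The file name `train.json` and the key `microbatch_size` are hypothetical; the post only says that a new file was added to the run config.

```python
import json
from pathlib import Path


def load_microbatch_size(run_dir, filename="train.json"):
    """Read the per-GPU (micro-)batch size from a run's training config.

    The file name and key are illustrative assumptions.
    """
    with open(Path(run_dir) / filename) as f:
        return int(json.load(f)["microbatch_size"])


def global_batch_size(microbatch_size, world_size):
    """Sequences processed per optimiser step across all GPUs."""
    return microbatch_size * world_size


# With the micro-batch of 6 from the single-GPU code and 8 GPUs:
print(global_batch_size(6, 8))
```

Keeping the micro-batch size outside the code means that a machine with more VRAM per GPU only needs a different config file, not a code change.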
For this train, we'll (hopefully!) be taking significantly less time. We'll have eight GPUs, so naively that's 5.5 hours of train time, and each will have more VRAM, so we should be able to bump up the batch size and potentially get even faster than that. Depending on which kind of cards we're using, they may be faster, too -- I found that an A100 is slower (with the same batch size) than the RTX 3090 in my fine-tuning experiments, but the H100 and B200 are likely faster. I think this is another thing for the train config; we should have the validation interval (in terms of iterations) and the number of batches to do for validation. Here's the updated code . Now, let's move on to the dataset. With the code as it is right now, all of our per-GPU processes are using this code to iterate over the same dataset: That means that they'll all be training on the same data; the synchronisation that is happening "magically" in the background means that they'll all train on the first item, work out gradients, and step their optimiser -- so they'll essentially (modulo randomness) have the same updates. Pretty pointless! What we want is for each of the n per-GPU processes to train on 1 / n of the data. We have two useful helpers in : , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) So, the simplest thing to do is to use the world size as a step, and the rank as an offset: Here's the code with that . Now, remember that the same code is running for every one of our per-GPU processes. That means that all of them will do the training with forward and backward passes, and their own optimiser steps, all synchronised by PyTorch DDP magic. 
But they will also do their own validations -- which is kind of pointless -- and they'll also try to save their own checkpoints, which would be messy because they could quite easily interfere with each other; after all, all of the processes are running on the same machine and would be writing to the same filesystem. So, as a first cut, let's just wrap an around the eval and checkpointing stuff -- we change this: ...to this: That line is getting a bit long, so let's break it apart a bit: That looks OK, but there's an extra wrinkle: all of the processes are running the same code, so while the rank zero one will do the eval, the others will continue through the script, so they will go right back around our loop and start training on the next batches -- which is bad. We want our processes to be proceeding in lockstep, iteration-by-iteration. Luckily, the solution is simple: the function in basically says "stop here until all of our processes have reached this point". So we can use two of those -- one before the eval loop, to make sure that all of the processes have finished their training part of the iteration before we do the eval on rank zero, and one after the eval, so that the non-rank-zero processes will wait. One bit of complexity -- we want to do those barriers only if it's an eval iteration, but we want to do them for all processes. So we have to break up the statement, and we wind up with this: That seems to work OK ( code here ), but it does give a warning: So, we want to pass the device ID in when we call . Let's dig into that a bit. Here's the copypasta that I took from the PyTorch tutorial earlier in this post: Let's dig into what that is doing. The environment variable is being set by to 0, 1, 2, etc as appropriate to tell us which process we are on this machine. So the first line is telling PyTorch to use the device with that index for this process . 
The next line is getting the current accelerator -- that is, an object that represents which acceleration hardware we're using in this process. I think that the best way to see the combination of these two lines is that the first says "use " (or 1, or 2, or...), and then the second says "get the object describing the GPU you're using right now". So it's a slightly indirect way of getting the object containing the details of the GPU in question. Next, we call . A backend in this context is an abstraction of whatever system the device in question is programmed using -- in the case of an Nvidia GPU, it would be some kind of thing that encapsulates CUDA. Once that's done, we call , passing in the backend that we're using. We're saying "initialise the internal data structures for so that they're all set up properly to work with the backend we specified". After that, we can do stuff like getting the global rank with and so on, because has been properly initialized. Presumably at this point we're talking to any other machines in a multi-machine cluster, so we can find out what our world size is and that kind of thing. That extra line at the end, to get the : ...actually looks erroneous to me. All of our code is assuming one process per GPU. So I think we can just use the there as well. Let's rewrite it like this (with some useful comments): That seems to work well! Here's the code . However, I ran it past ChatGPT (largely to validate my understanding of what was going on), and it highlighted something slightly misleading about it. Right now, we're training on a single node, with one process per GPU. But again, one of the neat-o things about this DDP stuff is that it should be able to scale to multiple nodes. Now, remember that is just the rank of the current process on the specific node that it's running on -- hence the name. If we had two machines, each with 8 GPUs, then there would be a process with rank zero on each of them. 
The "real" rank -- that is, across all machines -- is the one that you can get from once it has been initialised. One of the things it does during that initialisation is to talk to all of the other nodes and work that kind of thing out -- which of the local rank zero processes across all of the machines is the global rank zero process. So we need to use the local rank when working out which GPU we should be running on and so on, but we should not treat it as a global rank. That's actually quite fine in this case, as we're calling inside the training loop when we actually need to use the global one (when indexing into the dataset, or when deciding if we're the process that should be doing evals and checkpoints). The only place where we might be confusing matters is in that print, which is not important anyway, as the training loop also prints out its rank. So, let's tweak it a little more for clarity: That seems to work well! Here's the code . Time to run it past ChatGPT to see if I've made any dumb errors. Turns out that (unsurprisingly) I have... Let's go back to our code that decides whether or not it's an iteration where we need to do a validation run and a checkpoint: The problem is that our index is different in the different processes! Remember, we have this in order to pick out the correct training items: So let's think about it; in the first run through the loop, with 8 GPUs, we would have In the next run through the loop, we'd have: So will give different results for each process. That might not sound like the end of the world -- will only be zero for one of them, so long as is larger than the number of GPUs -- but remember that our validation code looks like this: Now, if different processes have different values for , then will only be called in the one(s) for which it is . But means "wait until all processes have reached this barrier". 
So the ones that call it will lock up completely until other processes get there, and everything will at best get out-of-sync, and at worst will lock up completely. I think that the problem here is that I'm conflating two things: the index of the global step -- that is, one iteration across all GPUs -- and the dataset element that we want to use. In the original one-GPU case that made sense; iteration 0 was on dataset element 0, iteration 1 was on element 1, and so on. But now the offset into the dataset, and the global step, are quite different things. This is quite deeply embedded in the code, but we can fix it! Let's start off by changing our checkpoint code, just to rename things. It keeps track of a variable called , our offset into the training dataset, and uses that both to index into the dataset, and to work out how far through the train we are. The latter is a much better thing to store in a checkpoint, so instead of saving , we'll store (and restore) . Basically, just a rename so that the variables and stored JSON match the new reality. Here's the updated code . Now we need to make a number of minor changes to the training loop just to match that rename of the value that we're checkpointing (eg. for the code to generate the training chart) but the most important change is to our loop. Instead of iterating over our dataset with a step and an offset so that we can index into it, we firstly work out how many global steps there will be: ...then we iterate from our initial global step -- zero if we're starting a fresh train, or whatever global step we were on in a loaded checkpoint plus one if we're doing a continued train from a checkpoint -- up to the : That means that we need to use the global step, the world size, and our current rank to work out which dataset item we should be training on for this process at this global step. 
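One way to sketch that mapping, and to see why it fixes the barrier problem, is with three small pure functions. The names are hypothetical, and the floor division used to count global steps is an assumption (it simply drops any leftover items that would not fill a complete step).

```python
def total_global_steps(dataset_len, world_size):
    """Number of complete global steps (assumption: drop any remainder)."""
    return dataset_len // world_size


def dataset_index(global_step, world_size, rank):
    """Dataset item that a given rank trains on at a given global step."""
    return global_step * world_size + rank


def is_eval_step(global_step, validation_interval):
    """Depends only on the global step, so every rank gets the same answer."""
    return global_step % validation_interval == 0


world_size = 8
for step in range(2):
    print(step, [dataset_index(step, world_size, r) for r in range(world_size)])

# The old scheme tested the per-rank dataset index instead, so ranks disagreed:
old_style = [dataset_index(0, world_size, r) % 2000 == 0 for r in range(world_size)]
print(old_style)
```

Because `is_eval_step` never looks at the rank, either every process reaches the barriers on a given iteration or none of them do.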
Let's say that we have eight processes; on the 0th global step, we should have rank 0 training on dataset item 0, rank 1 on item 1, and so on. On the next global step, rank 0 should train on item 8, rank 1 on 9, and so on. So: That's actually much more elegant than the earlier code, and seems to work fine. Here it is . Phew, glad to have caught that before I started spending money on machines -- it would have been confusing if everything locked up. Thanks, ChatGPT! Another thing that was raised by ChatGPT is about the validation. We don't want to validate across all of the validation dataset -- we're using a number from the . I have this code: This looked like a nice, quick way to get the first elements of the validation dataset. But ChatGPT told me it would raise. It didn't, though -- why? The problem is that I had set to in my training config for testing. Stepping through what that slice does, when we run : Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack". Nasty! AI code review certainly helped me dodge a bullet on that one. Let's fix it, it's not a big change: we can just do this: ...and that works! So here's the code now . So, I think we have one final issue, which is the training and validation datasets. 
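Going back to the slicing bug described above, it can be reproduced without PyTorch. In this sketch plain lists stand in for the tensors and the class and function names are hypothetical; the fix shown (indexing item by item) is one straightforward option and not necessarily the exact change made in the post.

```python
class PairDataset:
    """Toy dataset whose __getitem__ returns an (input, target) pair."""

    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, ix):
        # A slice is passed straight through, so ds[:n] returns a 2-tuple
        # of (first n inputs, first n targets), not n (input, target) pairs.
        return self.xs[ix], self.ys[ix]


def first_n_pairs(ds, n):
    """Fetch the first n (input, target) pairs one index at a time."""
    return [ds[i] for i in range(min(n, len(ds)))]


ds = PairDataset([10, 20, 30, 40], [11, 21, 31, 41])

# With n == 2 the buggy loop runs, but binds the wrong things:
for x, y in ds[:2]:
    print(x, y)

# With n == 3 the same loop raises ValueError (too many values to unpack).
print(first_n_pairs(ds, 3))
```

With a slice of exactly two items the unpacking happens to succeed, which is why the bug stayed hidden during testing.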
In our single-GPU train, we worked out ahead of time how much of FineWeb (or FineWeb-Edu) to train on -- the Chinchilla-optimal number -- and generated a dataset that contained a round number of 6-sequence, 1024-token batches that was the smallest such round number that was larger than our target. We also worked out exactly how large (in terms of batches) our validation dataset needed to be so that each validation run would take five minutes. There was one big issue with that system; when I decided to do an "extended" train on more of the FineWeb-Edu dataset, in order to see whether I could get the loss down further, I had to do some nasty hackery in order to generate a new one. So it would be nice to not have that problem this time around. Additionally, we're likely to be tweaking the batch size quite a lot in this experiment while we find what the appropriate level is to fit onto the cloud GPUs, and also varying how much validation we do -- and additionally, we have the world size to worry about. I think that the best way to give us the flexibility we need will be to pre-convert the complete FineWeb and FineWeb-Edu datasets into the format we need -- each sequence in the dataset converted to GPT-2 tokens, and then those sequences concatenated together, with the token 50257 separating them. It would be good to properly nail down the validation dataset at the same time. So we can have a script that loads up the original dataset as downloaded from Hugging Face, splits it into 99% train, 1% validation, does the conversion, and then saves them as safetensors files. If we use for those (which is just large enough for our 50,257-token vocab), we can fit the ~10B tokens in each dataset's train split into 20 GiB of disk. Not too bad. But there will still be the issue of getting them onto our cloud machines. Let's generate the data, and then work out how to handle that. I tried initially with the code I used last time, adapted to run through the entire dataset . 
It does the 99%/1% train/validation split, and then for each of those generates a single massive tensor of tokens like this: It almost worked! To my surprise, it got all the way to the end, and only blew up with an out-of-memory error when it was trying to save the result -- and it did that completely silently, so I thought it had worked right up until I tried to check the file on disk to see how large it was, and it wasn't there. The obvious tweak: set the list to just after the , to free up the memory it's using. Given that it was the save that triggered the OOM, you'd think that that would be enough -- but it turned out not to be so. Rather than mess around with this for much longer, I just decided to add on 128 GiB of swap to my machine temporarily: ...and that was enough to make it run. So I've now generated pre-tokenised, pre-concatenated train and validation sets for both FineWeb and FineWeb-Edu: Now, thinking about how to get it up to the Lambda Labs machines. I have normal 1 Gb residential broadband, so conceivably I could upload 20 GiB in about 200 seconds. But that's assuming that there's no network congestion, so I would expect it to take longer. The LL machines are quite expensive, and I don't want to waste money keeping them up while I'm just uploading data. There are possibilities here: I think the best option is to use option (1), but with the option of also doing (2). The HF dataset will still take time to download to LL, even over the faster network connection. That might not be a problem -- but if it is, I can download it once on a cheap instance and use a persistent disk too. Essentially I'd be using the persistent disk as a "cache", and still get the benefits of the easily-shareable datasets on Hugging Face. So, that decided, let's find out how we can upload a whacking great 20 GiB safetensors file as a dataset on Hugging Face. 
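As a back-of-the-envelope check on the storage and upload figures quoted above, the arithmetic can be replayed in a few lines. This is a sketch under stated assumptions: two bytes per token (an unsigned 16-bit integer, which is my assumption for the storage type), exactly 10 billion tokens, and a link that really sustains 10^9 bits per second with no protocol overhead.

```python
VOCAB_SIZE = 50_257          # GPT-2 vocabulary size, from the post
TOKENS = 10_000_000_000      # "~10B tokens" per train split, from the post
BYTES_PER_TOKEN = 2          # assumption: one unsigned 16-bit integer per token
LINK_BITS_PER_SECOND = 1e9   # assumption: "1 Gb" broadband means 10**9 bits/s

# An unsigned 16-bit integer holds 0..65535, which covers token IDs
# 0..50256; a signed 16-bit integer (max 32767) would not.
max_uint16 = 2**16 - 1
largest_token_id = VOCAB_SIZE - 1
fits_in_uint16 = largest_token_id <= max_uint16
fits_in_int16 = largest_token_id <= 2**15 - 1

size_bytes = TOKENS * BYTES_PER_TOKEN
size_gib = size_bytes / 2**30
upload_seconds = size_bytes * 8 / LINK_BITS_PER_SECOND

print(f"fits in uint16: {fits_in_uint16}, fits in int16: {fits_in_int16}")
print(f"size: {size_gib:.1f} GiB")
print(f"best-case upload: {upload_seconds:.0f} s")
```

Under those assumptions the best case comes out a little under the "about 200 seconds" estimate, which leaves some room for overhead.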
It turns out that resources like datasets on HF are just Git repositories using the LFS (Large File Storage) plugin to be able to handle, well, large files. Conveniently, given that I'm using to manage my project, there's a plugin that allows me to use their CLI tools with minimal effort, so: Both datasets show up on my profile page on Hugging Face, so that's looking good. Now it's time to try to upload the data. We'll need to install Git's LFS support first: Now let's try the FineWeb one first: OK, so we need some kind of extra thing to tell it we can use large files on top of the LFS stuff: Right, now let's try again: Weird that it prompted for the credentials twice, but it did appear to try to do something there -- but obviously it didn't work. Let's see if Git over SSH is any better. ...then the same stuff to copy in the files and create the metadata file, then: Looks like the same error. Odd. Let's try using HF's upload tools rather than Git -- feels like a bit of a cop-out, but maybe it'll work better. That did indeed take about 200 seconds to run, but the upload speed was only about 10 MiB/s -- from the output, I think it must have been compressing it. Anyway, it looks like it succeeded, so let's upload the others! ...and that's done :-) Next, a bit of manual editing of the dataset cards on the Hugging Face website, and we have our two new public datasets: That looks solid. So, the next thing: change our codebase so that we have some quick and easy way to download them (I'm feeling a little wary of using Git for that after the upload issue), and then to use the downloaded files in our training code. We already have the code to download a dataset; the stuff that I wrote to download FineWeb and FineWeb-Edu originally. 
Here's the important bit: ...so we can adapt that to download all files in an arbitrary dataset: ...and call that from our , using a new command-line argument , and a new element in our train config JSON file: I was thinking that we'd need extra guard code to not download the dataset again if it's already there, but it looks like handles that all nicely for us. So we have a way to specify which dataset we should use for a training run, and code to download it. Now we just need to adjust the code that loads our datasets so that instead of looking in the , it looks in the directory returned by : ...and update the directory so that it just blindly uses the directory provided rather than trying to look in a subdirectory: That all works! We successfully download the datasets and try to use them. Here's the code. But now we have a problem; when the tries to reshape the huge tensor that we have as our inputs: ...it craps out: That makes perfect sense. Our original files were carefully sized for a batch size of six, and 1024-token sequences. We need some way to work out an appropriate slice of both the training and the validation data. Most of the trains are likely to be Chinchilla-optimal, or at least use a Chinchilla-optimal number of tokens -- rounded up appropriately to match our micro-batch size, sequence length, and world size. But I'd like it to be more configurable. What I'll do is add a key to the training config dictionary, along with a so that we can (for example) train on the first Chinchilla-optimal tokens, then do an extended train continuing on from there. The idea is that we can use as a base, and train on the smallest number of full batches that contains at least that many tokens. For validation, I think that the key that we already have is actually quite nice. Validation is time-bound, and the number of batches is the easiest lever to pull to handle that. However, a would be nice for symmetry. So, here are some numbers for debugging: Now let's use them. 
Initially, we have this to load the train dataset: Let's work through that one first then make appropriate changes to the validation one. The pieces of information we need to work out which tokens to use are: Let's update our function so that it takes those parameters in that order: ...and now we can write an updated that uses those numbers to get the right number of tokens: Validation is less obvious; I think that the best way to do this (given that the validation dataset is small) is just to have a "magic" value for , which means "just get a round number of full batches starting at . It's also worth remembering that we only do evals on the rank 0 process, so we could in theory pass in a world size of 1 -- but I think that passing in the real world size might be a good idea, because it gives us one fewer thing to change if, in the future, we move towards distributed evals. ...and we change to be able to handle the magic : I also added in a quick sanity check to make sure that we don't get weird behaviour if the is past the end of the original dataset. That all looks good! Running it kicks off training, and validation is running happily every ten global steps, but just with three samples, as configured in the JSON file. Here's the code . One thing that hasn't shown up while running this code locally is that our training loop has this: With one GPU, that's fine, but on a multi-GPU machine, that is going to happen in all of our per-GPU processes -- so they'll all be spamming out progress bars, which will be ugly. So, as a first cut: Now, in order to compare different machines (say, an 8x H100 vs an 8x A100) it would be nice to get tokens-per-second numbers while training. We can do that in the progress bar too! It has a method that adds stuff to the end of the bar, just after the elapsed time and iterations/second numbers. 
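One way to produce a tokens-per-second figure for the progress bar is a running average over the whole run so far. The sketch below is an illustration rather than the post's actual code: `ThroughputMeter` and its `clock` parameter are hypothetical names, and the clock is injectable only so that the example can be checked deterministically. If the progress bar is tqdm, the returned value could be passed to its `set_postfix` method.

```python
import time


class ThroughputMeter:
    """Running average of tokens processed per second since construction."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.total_tokens = 0

    def update(self, tokens_in_step):
        """Record one training step and return the average tokens/second."""
        self.total_tokens += tokens_in_step
        elapsed = self._clock() - self._start
        return self.total_tokens / elapsed if elapsed > 0 else 0.0


# Tokens per global step = micro-batch size * sequence length * world size.
# The numbers here are illustrative: 28 sequences of 1,024 tokens on 2 GPUs.
tokens_per_step = 28 * 1024 * 2

# A fake clock that advances one second per call, standing in for real time.
ticks = iter([0.0, 1.0, 2.0])
meter = ThroughputMeter(clock=lambda: next(ticks))

first = meter.update(tokens_per_step)
second = meter.update(tokens_per_step)
print(first, second)  # 57344.0 57344.0
```

Because this averages from the start of the run, time spent in validation drags the figure down, which is worth bearing in mind when comparing numbers between runs with different eval settings.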
For that, we'll need to have the object available in a variable: ...and now we can count the total tokens seen in the training run, plus keep track of the start time -- just before the start of the training loop: ...then inside, after the training step: That will give us a running average of tokens per second over the train as a whole since the start. Running that, we get a nice progress bar like this (you'll need to scroll to the right): Note that the tokens per second is worse than the just less than 20k that we got when running the single-GPU test previously, but that's due to the testing setup I have -- I'm doing an eval every 10 global steps. Changing that to 1,000,000 so that we just get a single eval when we start, then letting it run for a while to settle down from the initial eval, we get this: ...which is close enough to what we had before. Finally, let's print out some summary information at the end: Ran that on a super-short train with about 50 iterations-worth of tokens, and: Looking good. Here's the code . I think we now have something where it's worth spinning up a Lambda Labs machine to run it. Let's kick off a training run on the cheapest two-GPU machine that they have available right now. That's actually not all that cheap, it's a $6.38/hour 2x H100 80 GiB SXM5. But I'm not planning to do a full train on it yet, this is just a sanity test. I won't attach a filesystem this time, either -- let's see how things go without the caching of the datasets that I was considering. First thing: do we have ? Nope. OK, let's install it: Right, now let's clone our repo and set up our environment: And now I think we can just try running it! It took 18 seconds to download the dataset! I don't think we need to worry about the caching thing with persistent disks, at least at this point. But there are a couple of issues here. I didn't put the number of processes in the command line -- I should be using Also, we don't have the XKCD font family. 
I'll ignore that for now. OK, that's looking good! Let's make our validations happen less often, and see how high we can get the micro-batches with the 80 GiB VRAM we have on each of our two GPUs. Doing a binary chop, I set the micro-batch size to 100 (OOM), then to 50 (OOM), then to 25 (worked), then to 37 (OOM), then 31 (OOM), then 28 (worked), and finally 29 (OOM). So we have a batch size of 28 for our 80 GiB machines. Leaving it for a little while to settle down, and we get to about 142,000 tokens/second. Now, on the 3090, we were training at 20,000 tokens/second. That means that this machine is running at about 7 times the speed. Given that our original train finished in 48 hours, we'd expect the train to finish in about 6, which indeed is the estimated time on the tqdm progress bar. At $6.38 per hour, that comes to $38.28. Not bad! And this instance is actually quite pricey on a per-GPU basis -- it's $3.19 per GPU/hour, whereas there is an 8x H100 that costs $2.99 per GPU/hour. I'm almost tempted to let it run. But the purpose of this run was to work out the bugs. We're going to want to track the training chart -- remember that after every validation run, our training code generates a chart showing the training and validation loss so far, like this one . I ran the normal quick-and-dirty Python webserver command on the instance, inside the directory containing the training chart: My browser didn't connect to it, but looking at the Lambda Labs interface, there's a new "Firewall" section, where you configure rules for allowing incoming connections to your instances. That's sensible, and the default rules are just "allow SSH from any IP" and "allow ping from any IP". Adding one letting anyone access port 8000 fixed the problem, and I saw a directory listing; clicking on the chart showed exactly what I'd expect, but without the XKCD fonts. Nice. Let's work out how to fix that XKCD font thing. 
Looking around, it seems like there are approximately twenty thousand ways to do it. Here's one that seems to work; firstly, install the font on the system: Now, that installs a font that has the family name 'xkcd Script' (with that erratic capitalisation). So we need to change the code to pick up pretty much anything that looks like it's XKCD, so instead of this: ...we can do this: That seems to work OK. So, now, I think we have the beginnings of a script to set up a Lambda Labs machine so that we can use it. Let's write a with this: ...and give it another go on a fresh machine. Shut this one down -- total cost so far $7.28. Now there are no 2-GPU instances available. There is a super-cheap 1x A10 (basically the datacenter version of a 3090), though, so let's use that -- we're as certain as we can be that the multi-GPU stuff works, and the proof of the pudding will be whether we can train a model that works. After spinning up our 1x A10 machine: Looking good! I think we have something that (in theory) should work. That cost $0.05. I think it's time to do our first train on a big instance. There are four 8x instances available on Lambda Labs for me right now: I think I'm going to want to train on all of those, to try to work out some kind of metric (dollars per megatoken?) to compare them. But let's start with something reasonably low-end -- in fact, let's try the cheapest, and see what happens. Spin one up, and first thing; after the setup, we need to work out the micro-batch size. Last time we used 28, but this machine has GPUs with half as much VRAM. I did a binary chop again... it turns out to be 13. Now let's think about validation frequency. Let's try to get a feel for how long it will take. We can set the eval batches to (say) 100, so that we can see how fast evals are, but also set the interval to 10,000,000 so that it never does one after the first. 
It took 11 seconds to run 100 validation batches, and after a few minutes, it settles down at 254,000 tokens/second or so, and is estimating 3h15m to completion. Nice! The cards are an earlier generation to the H100s we used in the two-GPU test, so they're slower, and they have half the VRAM. So eight of them are, working together, about twice as fast as two H100s. Doesn't sound completely crazy. So, in our local train, we spent 5 minutes evaluating every 30 minutes. So our eval time was 16% of our train time. Probably a bit high, but let's run with it. If we're going to take 3 hours training time, then 16% of that is about 28 minutes. Previously we did about 88 evals (44 hours train time, with an eval after each half hour). That seems a bit too high. So let's say that we want to do 50 evals. 28 minutes eval time in total, with 50 of them, means about 30 seconds per eval. If 100 eval batches take 11 seconds, let's approximate it to 300 eval batches. As to the interval between them -- if we want to do 50 over 3h15m, or 195 minutes, then that's one every (let's approximate) 4 minutes. We seem to have settled down to 2.57 iterations per second, so that's about every 617 iterations. Let's bake those in and let it rip. After the run: OK, let's download everything. Looking at the checkpoints, the latest (that is, the last one at the end of the training) and best (the checkpoint that had the lowest validation loss) are the same one, meaning that validation loss kept falling consistently: So let's just download using the "best" symlink to get the weights for that checkpoint: And now we can shut the cloud machine down. Now that the clock is no longer ticking and we aren't spending money on an unused machine, here's the training chart: It looks like we had a couple of gradient spikes there. 
I'm going to add some gradient clipping code at some point, but I think I'll hold off for a little bit -- I want to do a few cloud trains first to work out the best instance sizes to use, and only then start exploring the possibilities for making the models better. Apart from that, it looks pretty normal. Looking at the billing page on Lambda Labs, that machine was up for about 4 hours and 35 minutes, costing US$10.32 per hour, for a total cost of US$47.35. Of that 4h35m, 13,904 seconds, or 3h52 was the actual training run -- somewhat more than the 3h15m that was predicted at the start of the run. The validation will have accounted for most of that -- we did 50 evals, at 30 seconds each, so that's 25 minutes. That means that 3h40m is accounted for, and the remainder can just be chalked up to noise, I guess. That leads to one question: do we actually need to be doing validation for these trains? I've been doing validation loops in these trains largely out of habit -- when you're training an ML model, it's just "what you do". The reason you'd normally hold out a validation set is simple: if you're training over multiple epochs, then eventually your model is going to start overfitting to the training data 2 . You validate as you go along so that you can spot any points where, while the training loss continues to drop, the validation loss -- which is loss on data that the model hasn't been trained on -- starts rising. That's the classic indicator of overfitting. But for these models we're not doing multiple epochs -- we're just training through a stream of constantly new tokens. So, in fact, there's no real difference between the training data and the validation data, apart from the fact that the validation data is constant. From the model's perspective, it's all new stuff (modulo any repetitions in the dataset, which is possible but I think not likely to be super-common in something as curated as FineWeb). 
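Stepping back to the timing figures for this run, the validation budget worked out beforehand and the accounting afterwards can be replayed as plain arithmetic. This is a sketch using only numbers quoted in the post; the variable names are mine, and the roundings (300 batches, a four-minute interval) are the post's own approximations.

```python
# Measurements and targets quoted in the post.
SECONDS_PER_100_EVAL_BATCHES = 11
ITERATIONS_PER_SECOND = 2.57
ESTIMATED_TRAIN_MINUTES = 195        # 3h15m
TARGET_EVALS = 50
EVAL_BUDGET_MINUTES = 28

# Planning: how long may each eval take, and how many batches is that?
seconds_per_eval = EVAL_BUDGET_MINUTES * 60 / TARGET_EVALS
eval_batches = seconds_per_eval / SECONDS_PER_100_EVAL_BATCHES * 100

# Planning: how far apart should the evals be?
minutes_between_evals = ESTIMATED_TRAIN_MINUTES / TARGET_EVALS
iterations_between_evals = 4 * 60 * ITERATIONS_PER_SECOND  # using ~4 minutes

# Accounting after the run: estimated training time plus 50 evals of
# roughly 30 seconds each, compared with the measured 13,904 seconds.
ACTUAL_RUN_SECONDS = 13_904
eval_seconds = TARGET_EVALS * 30
accounted_seconds = ESTIMATED_TRAIN_MINUTES * 60 + eval_seconds
unaccounted_seconds = ACTUAL_RUN_SECONDS - accounted_seconds

print(f"{seconds_per_eval:.1f} s per eval, ~{eval_batches:.0f} eval batches")
print(f"one eval every {minutes_between_evals:.1f} min, "
      f"~{round(iterations_between_evals)} iterations")
print(f"unaccounted: {unaccounted_seconds} s")
```

The leftover is a little under twelve minutes, which is the part the post chalks up to noise.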
Now, in this post I'm aiming to identify the best options for training in the cloud -- cost in terms of dollars and time. I don't want to change the model itself or the training strategy because I want whatever I come up with to be roughly equivalent to the models I trained on my own machine. Exploring enhancements is for the next post. (Of course, given that the batch size is one of the levers I want to experiment with, and training on larger machines is already meaning that I'm doing micro-batches larger than the batch size of 6 that I used locally, and then the overall batches are 8 times larger, that's not quite true.) Validation, however, doesn't actually affect the training runs in any direct way. I could in theory remove it. However, that is a relatively large change to the code, as I've kind of linked it in with my checkpointing code. I think that what I'll do for now is leave it in. Validation will scale at the same rate as training (so long as I leave the eval batches constant) so leaving it there will give me a clean comparison between machine types. And I can keep notes on how much time was spent on validation for each train so that I can subtract it from the total time if that proves useful. However, when I start tweaking the training code with changes beyond the batch size, I should probably try removing validation first. Anyway, while validation during the training run might not be important, evaluating the model at the end and seeing how it compares to others is! Let's do that next. There were two important post-train evals that I did on the models that I trained locally: There was also a simple smoke test -- how does the model predict that the phrase ...should continue? I should do the same three tests here. A simple autoregressive generation script is easy enough to knock together, and: All we're looking for here is basic coherency, and I think this is good enough to pass that filter. Next, the loss-style testing. 
What I think I want to be able to do here is just take a file and run an eval against a standard dataset. I did not generate my own test set, but I did generate a much-larger-than-necessary eval set, 1% of both FineWeb and FineWeb-Edu -- that's 100 million tokens or so in both cases. In the validation that I was doing during the train just now, I did 300 batches of 1,024 tokens with a micro-batch size of 13. That only ran on the rank 0 process, so that's Not even 4% of the validation data. Now, for the local eval, I think it makes sense to make it run for about five minutes -- that's just for my own convenience, I don't want to spend very long -- and I know from the previous local train that I can do 3,200 batches of six 1,024-token sequences in that time: So, somewhat arbitrarily, let's use the 19,660,800 tokens starting at position 50,000,000 in the FineWeb validation dataset for our tests -- they'll never be used for training or validation during the training loop. It's kind of a hack, but it'll do for now. Here's the code. It should be easy enough to understand; it did require one tweak to our existing function, though: Originally, that function worked out the actual number of tokens to use by working out the size of each global batch, dividing our requested minimum number of tokens by that size and taking the floor, adding on one, then multiplying that by the global batch size. That works fine in cases where the is not a multiple of the global batch size -- it gives us a round number of batches that contains at least . But if is already a multiple of the global batch size, it gives us an extra batch at the end. So I added that as a special case in to avoid that. Anyway, running that gives us a loss: That's actually quite a lot lower than we were seeing with the locally-trained models on the test dataset I was using then -- but, of course, it's a different dataset so it's not strictly comparable. 
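The rounding rule described above (the smallest whole number of global batches that contains at least the requested number of tokens) can be sketched as follows. The function and parameter names are hypothetical, not the ones in the repo, and the second function uses ceiling division as a compact alternative to the special case the post describes; it gives the same result.

```python
def tokens_floor_plus_one(min_tokens, micro_batch_size, seq_len, world_size):
    """The original approach: floor, add one batch, multiply back up."""
    global_batch_tokens = micro_batch_size * seq_len * world_size
    return (min_tokens // global_batch_tokens + 1) * global_batch_tokens


def tokens_for_full_batches(min_tokens, micro_batch_size, seq_len, world_size):
    """Smallest multiple of the global batch size that is >= min_tokens."""
    global_batch_tokens = micro_batch_size * seq_len * world_size
    full_batches = -(-min_tokens // global_batch_tokens)  # ceiling division
    return full_batches * global_batch_tokens


# Not a multiple of the global batch (6 * 1024 = 6,144 tokens): both agree.
print(tokens_floor_plus_one(10_000, 6, 1024, 1))    # 12288
print(tokens_for_full_batches(10_000, 6, 1024, 1))  # 12288

# Already an exact multiple (3,200 batches): the original adds a spare batch.
print(tokens_floor_plus_one(19_660_800, 6, 1024, 1))    # 19666944
print(tokens_for_full_batches(19_660_800, 6, 1024, 1))  # 19660800
```

The exact-multiple case is the one that matters for the test set here, because 19,660,800 was chosen as a whole number of batches in the first place.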
Let's run the same test against them: That's really interesting! Those numbers are really close to the numbers I got in the last post. That does make some kind of sense, though -- while the numbers aren't strictly comparable, as I said, both the dataset that I was using then and the one I'm using now are essentially random stuff from FineWeb, so I guess they must be more similar than I thought. But, importantly, the loss on the newly-trained model is much lower -- 3.674 rather than > 3.9 for all three of the older locally-trained models. Now, the only big difference between this training run and the ones that I did locally is the batch size. As I said in the last post, while I felt that the difference between my batch size of six and the (reported) batch size of 512 for the original GPT-2 was the least-likely cause of the differences in the results, Gemini told me that it thought it was the most likely cause. It looks like Gemini (and, I should note, on Hacker News ) might have been right! Batch size is super-important. Let's do the same eval with the OpenAI weights. I wrote a quick script (in my old 'LLM from scratch' repo, which has the code used in the book) to load up the GPT-2 weights and save them as a safetensors file . When I ran that, I got an interesting error: That was easy enough to fix; in the book's code we assign the weights that have been loaded from the OpenAI TensorFlow checkpoint files with a function called that looks like this: Just adding a call to to the last line fixed the error: ...and as a result, I had safetensors files for the original OpenAI models: So now we can run our test against them: Excellent. Let's start putting together a table of these results: That's pretty amazing. Having a batch size of 13 micro-batches over eight GPUs, or 104 in total, seems to have massively improved the model -- it's much closer to the original weights. 
It will be interesting to see whether I get further improvements when I move to the larger machines, which (due to having more VRAM) will have larger possible micro-batches, so we'll get larger global batch sizes. It certainly makes me think that I could have got much better results locally by using gradient accumulation, which would mimic the effects of a larger batch size by running multiple smaller batches through, without doing an optimiser step each time, then doing one big update once enough has gone through. But all of that is for another day. Let's try the instruction fine-tuning test now. I decided to pretty much re-use my adapted version of the code from the book; that meant that I was borrowing quite a lot of Raschka's code, which he has released under the Apache 2 license . I normally use the MIT license for my code, but I'm not married to it, so I relicensed the whole repo as Apache 2 with some specific headers to say which parts came from "Build a Large Language Model (from Scratch)", and added this code . It downloads the Alpaca dataset from the site for the book, splits it into train/validation/test splits, trains on the training set, evaluating each epoch and bailing out (and restoring the previous epoch's weights) when validation loss starts rising, and then runs through the test set generating responses, and then sends them all off to the OpenAI API for GPT-5.1 to judge them. Running it against our new model gets a score of 17.09. Let's try the various other models and build out our table: Interesting! In the last run, I found the instruction fine-tune numbers came out as FineWeb-Edu extended > FineWeb > FineWeb-Edu, but here we have FineWeb-Edu > FineWeb > FineWeb-Edu extended -- exactly the opposite! I do have to wonder, though, how precise a measure this is. 
While the training should be fairly consistent (though I don't have a random seed in there to enforce it), the fact that we're using an LLM as a judge means that there is an element of randomness coming in here. Indeed, I re-ran the FineWeb-Edu extended train test, just to see what I got, and it came up with an even-worse 12.12. So I don't think we can read a huge amount into these numbers -- well, unless we can get the numbers significantly up. While it looks like a 2.5-point difference might just be randomness, I doubt that a 10-point difference could be. I think we've done the tests that we need for this model now, and we have a testing procedure in place. So let's train some further models on different instance sizes, and gather numbers. This is the biggest machine available on Lambda Labs right now, and is only sporadically available; one happens to be there now, so let's give it a go. First, we need to create the runs/8xb200m160 directory, initially with a that is a clone of the one I did for the last train, , then spin up the machine. As before, we need to log in, clone the repo, then in it run the script, run , and try to run the script: It crapped out because there was no datasets directory, which is an annoyance. We should create it if it doesn't exist. Create the directory, and run it again. It took a while to download the dataset, because every per-GPU process downloads it separately. That only took a minute or two, but it was a waste of time; I think we should only download it from the rank 0 process with some barriers to make the other processes pause. Next, we need to do a binary chop on the micro-batch size, starting with a low of 13 (which I know will be fine because it worked on the 40 GiB GPUs that we used last time), and a high of 100 (fairly random, just something I'm pretty sure will fail). While doing that, a few things are standing out, both to do with validation. 
When the script starts, it does one training iteration, then goes straight into validation. Then it starts the training run proper. However: We're going to need to work out some kind of fix for that, because it's taken me 17 minutes from spinning up the machine to getting a size for our micro-batches -- which happens to be 64. On a machine that costs US$39.92/hour, that's an expensive test! We'll look into that later. Anyway, a batch size of 64 is pretty neat, as with 8 GPUs, that means we have a global batch size of 512 -- exactly the same as in the original GPT-2 paper! So, let's kick off the train. It takes about 7 minutes to get to the first checkpoint, at which point it's averaging 801,221 tokens/second. That pattern repeats, and with about one minute to do the validation, we're spending about 12.5% of the time on this machine validating. Hmm. A further indication that we might want to remove the validation stuff if it's not adding on any value. Eventually, it finishes: So, that's 1h9m50s. The final validation loss is not as good as the previous run on the 8x A100 40 GiB machine, where we got down to 3.675. Given that we're using the same validation dataset as the previous, that's meaningful: this is not as good a model, it seems. Again, latest and best checkpoints are the same one: So we can download everything: ...and here's the training chart: OK, so that's smoother than the last one -- no loss spikes. Maybe the larger batch size smoothed them? Let's think a bit about the cost of this train. From Lambda Labs, we had that machine running for a little over 1h30m. At US$39.92/hour, the total cost was US$60.25. Yikes. So, knocking off the 1h10 or so for the train, we have 20m to allow for -- which matches up quite well to the 17 minutes of fiddling with batch sizes, and then 3 minutes to download all of the files. If this blog post isn't going to cost significantly more than it needs to, we need to get that down. 
Of the US$60.25, just over US$13 was spent on identifying the batch size. Only US$46.57 was spent on the train itself. We also did 11 validation runs as part of that; at a minute each, those cost US$7.32. So, excluding validation, we're below US$40 for the train. Now, let's run our tests. First, the smoke test: we get this: "...on all other website for..." is a bit rubbish. Still, on to the loss: That's in line with the training loss -- worse than the loss I got with the one trained on the smaller machine, with its corresponding smaller batch size, but still better than any of our local trains. Still interesting, though -- larger batches are not guaranteed to get bigger results. More investigation needed there! On to the instruction fine-tuning test. That gives us a score of 13.89 -- the worst that we've seen yet! I think I'll put together a full table including these results later; I want to try training on some other, differently sized machines first, and we can aggregate the results at the end. But before we do that, let's make some changes to the scripts to fix some of those QoL issues we encountered in that last train. The first irritation was that it errored out saying that was not a directory when it didn't exist. The script takes a datasets directory as one of its command-line options, and it's reasonable that it checks that it really is a directory (rather than, say, a file or a symlink): ...but if it doesn't exist, it might as well create it first. Now, I could just put this before the check: ...but remember, this code is run by multiple processes -- so they could easily trip over a race condition here. What I want is to have just one of them do this; I've deemed the rank 0 process the "special" one for validation, printing the progress bar, and so on, so we may as well treat it that way here. But -- there's a difference! Rank zero is the one that should be printing stuff out, it's true. And right now, we only have one node participating in this train. 
But I do want to avoid simple errors that would make it hard to run multi-node in the future. Now, if we have multiple nodes, then each one will have its own filesystem (unless we're using NFS or something like that), so we'll need a separate "datasets" directory for all of them. What we want is to do these checks on one process on each node. Usefully, we have the variable that is defined earlier in , which is per-node. Again, let's imagine we have two nodes with two GPUs each. Node 0 might be running the processes with global rank 0 and 1, and node 1 might have global ranks 2 and 3. On node 0, the processes would have local ranks 0 and 1 respectively, but on node 1, they'd also be local ranks 0 and 1. So, the full code becomes this: Note the barrier; we don't want the other processes to check whether is a directory until the local rank 0 process has had a chance to create it. (Of course, if we were running this on a setup where all of the nodes shared a filesystem, it wouldn't work -- in that case we'd want to use the global rank that we can get from instead. But we can burn that bridge if we ever come to it ;-) Phew, that was a bit more work than I expected! But it sets us up nicely for the next QoL fix on my to-do list. I don't like the fact that every process downloaded the whole dataset. The actually handled it pretty gracefully -- none of the processes tripped over any of the others. Indeed, it looks like there was some kind of global queueing going on, so they downloaded it one after the other. But it did take time -- maybe a minute or two in total, and with the clock ticking on that ~US$40/hour machine, that felt a bit stress-inducing. So: I think it would be best to only do that from the rank 0 process as well. The code that downloads the dataset is just after the bit we've been looking at: ...and looks like this: Now, the docs for say that the parameter is: If provided, the downloaded files will be placed under this directory. 
...and the return value is this: We happen to be passing in a object for , and we're not in mode -- it defaults to . So all we're doing by returning that wrapped in a object is a slightly indirect way of returning the path that we're passing in as . For tidiness, I really want to gate the call to in with the same rank stuff as we did for the directory creation. So, let's change the setup so that takes the path to the directory where we want this specific dataset to be, not the generic "all datasets" directory. And given that we're now passing this specific path into the function, we don't need to return it: Now it's just a wrapper around a single call to , which I'm not entirely sure about (it's a code smell that I'm probably creating an unnecessary level of abstraction) but I think I'm happiest leaving it that way for now, as it does hide away a bit of messiness in the HF hub API. 3 That means that we can now combine the directory-checking logic that we fixed above with download-on-local-rank-zero-only code like this: Here's the updated code with those fixes. Now, let's move on to validation. I'm increasingly of the opinion that the validation steps are just adding on to the cost without much in the way of benefit. Additionally, the validation is taking a different amount of time for each batch size, and happens a different number of times in each train -- remember, it's batches every global steps, and the batch size varies based on the micro-batch size, which is different for different amounts of GPU VRAM, and the total number of global steps in a train also varies based on the size of each batch. 
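Stepping back to the directory and download fix described above, here is a minimal sketch of the "one process per node does the setup, everyone else waits" pattern. It is not the code from the repo: `prepare_dataset_dir`, `global_rank`, and the `barrier` and `download` parameters are hypothetical names chosen for illustration. In a real DDP run you would presumably pass `torch.distributed.barrier` as the barrier and read the local rank from the `LOCAL_RANK` environment variable that `torchrun` sets. The rank formula assumes the usual contiguous layout with the same number of GPUs on every node.

```python
from pathlib import Path


def global_rank(node_rank: int, gpus_per_node: int, local_rank: int) -> int:
    """Global rank, assuming ranks are laid out contiguously node by node."""
    return node_rank * gpus_per_node + local_rank


def prepare_dataset_dir(dataset_dir, local_rank, barrier, download):
    """Create and populate dataset_dir once per node, then validate it everywhere.

    barrier:  a zero-argument callable that blocks until all processes reach it
              (hypothetical parameter; e.g. torch.distributed.barrier in DDP).
    download: a callable taking the target directory (hypothetical parameter).
    """
    dataset_dir = Path(dataset_dir)
    if local_rank == 0:
        # Only one process per node touches the filesystem, so there is no
        # race between sibling processes trying to create the same directory.
        dataset_dir.mkdir(parents=True, exist_ok=True)
        download(dataset_dir)
    # Everyone waits here, so no process runs the check below until the
    # local rank 0 process has had a chance to create the directory.
    barrier()
    if not dataset_dir.is_dir():
        raise NotADirectoryError(f"{dataset_dir} is not a directory")
    return dataset_dir
```

With two nodes of two GPUs each, `global_rank(1, 2, 0)` is 2 and `global_rank(1, 2, 1)` is 3, matching the example above. As the post notes, gating on the local rank is the wrong choice if all nodes share a filesystem; in that case the global rank would be the one to test.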
So that means that if we want to compare apples to apples in any final comparison of the time and money cost of training models on different kinds of Lambda Labs machines, we'll want to exclude the validation cost -- once we've settled on a machine type, we're going to want to fine-tune the validation size for that in much more detail than I have to date, assuming we don't drop it entirely. However: I'm loath to make such a fundamental change halfway through this comparison. It's tightly coupled to the checkpointing code, and the charting code, and so on. So I think that for this post, I'm just going to keep it there, and keep track of how much time (roughly) we're spending on each validation step for each train, so that we can remove it and get a "pure" train-time only comparison between the different kinds of machines. It's not pretty, but I think it's better than changing horses mid-stream. On the other hand, the validation is a real pain when doing the binary chop to find out the maximum micro-batch size for our VRAM before we start the training run. That's because we have to wait for one validation to run before we get into the full training loop, which makes it slower. On top of that, having to do a manual binary chop is a PITA. What I think would be a true QoL improvement for the future trains is something that does the binary chop for us, using a dummy training loop. We run it once on each new machine type, get a micro-batch size to plug into our training parameters, and then let it rip. This will re-use so much of the code from the training script that I think it actually is just an alternative way of running it. 
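The automated binary chop described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the script from the repo: `find_max_batch_size` and `fits_from_trial` are hypothetical names, the bounds of 1 and 70 follow the description of the script elsewhere in this post, and the search assumes that if a batch size fits then every smaller one fits too. The OOM check relies on the words "out of memory" appearing in the exception message, which is how CUDA OOM errors normally read.

```python
def find_max_batch_size(fits, low=1, high=70):
    """Return the largest batch size b with low <= b < high for which fits(b) is True.

    fits: callable taking a batch size and returning True if a short dummy
          train with that batch size completes without running out of memory.
    """
    if not fits(low):
        raise RuntimeError(f"even batch size {low} does not fit")
    if fits(high):
        raise RuntimeError(f"batch size {high} fits; raise the upper bound")
    # Invariant: fits(low) is True and fits(high) is False.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


def fits_from_trial(run_trial):
    """Wrap a trial function so that OOM errors become False and other errors propagate."""
    def fits(batch_size):
        try:
            run_trial(batch_size)
        except RuntimeError as exc:
            if "out of memory" in str(exc).lower():
                return False
            raise
        return True
    return fits
```

Here `run_trial` would be something like a three-step train with no validation at the given micro-batch size. With bounds of 1 and 70, the loop needs at most seven trials on top of the two boundary checks.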
After a bit of hacking, I came up with this updated code -- the diff is a bit hairy, but essentially: That takes just over six seconds to find the correct batch size on my local machine; with multiple GPUs, I expect it will be slower (there's a spinup overhead to start all of the per-GPU processes), but I'm sure it won't be as bad as the manual binary chops with validation that I was doing, and will be less error-prone. Right! We've done some QoL stuff, let's try another machine size on Lambda Labs :-) These are the machines that Andrej Karpathy is recommending for training nanochat, so let's see how we do with them. They cost US$23.92/hour; let's see how it works out. Here are the steps: Now let's download our dataset and find our micro-batch size: That took less than a minute to run -- nice! Now we can put that micro-batch size in . It does seem a little small -- after all, we could fit a batch of 64 into 160 GiB -- but I'll do some analysis later. Actually, before we kick off the train, let's see how long all of the preparatory steps took to run before we can do that -- not just the micro-batch-size script, but also the installation of the dependencies, the clone, and any overhead from boot time etc: Five minutes total. Not bad. Let's start the train: The initial validation run took 38 seconds, and then we started off. At 4m37s in, we get the first real validation run; at that point, it's running at 493k tokens/second. Eventually, it finishes, having taken about 1h50 including all of the validations. Here's the training chart: Two things stand out here: Further evidence that gradient clipping is likely to be an excellent addition to our training loop! It's also worth noting that the train loss spikes at the same time as the validation loss, so getting rid of the latter would still allow us to get a "best" checkpoint to compare with the latest at the end of the train. The machine was up and running for 2h9m, costing US$23.92/hour, for a total cost of US$51.47. 
The train took 6,650.197 seconds, so about 1h50m. Allowing for five minutes setup time, that's 1h55m accounted for. There's an extra 14m there -- that was because downloading those two checkpoints to my machine took quite a long time due to local network issues. Might want to look into ways to avoid that later. And for later cost-accounting purposes, we should note that it took 38 seconds or so for each validation run, and we can see on the chart that there were 24 of them. So, firstly, let's give our two models -- the best one and the latest one -- a smoke test: Both of those look OK! Now let's try the loss test. I started running it, but when it started downloading the dataset, I realised that it needed updating to allow for the changes I made to -- ooops! That done, let's give it a run for both of our models: As you'd expect, the best checkpoint has somewhat better loss, at 3.725, than the last one, with 3.734. Once again, better than our local trains, but not quite as good as the result with the first cloud train on that 8x A100 40 GiB machine, which was 3.674. Again, I'll put together a table comparing all of these results at the end. Does that make any real difference with the instruction fine-tune test? The test prints a lot out, but the headline numbers: So that was interesting! However, I am getting ever less convinced that the IFT test is a useful one; the randomness of the LLM-as-a-judge responses means that I don't think it can be consistent. Perhaps a better way to do this would be to batch up all of the models, and then give GPT5.1 answers from "model A", "model B", and so on all in one query, and then to ask it to give them scores all at the same time. That would hopefully make things at least a bit more consistent. Something to ponder later, I think. 
In the meantime, one extra thing I wanted to dig into before going on to the last train for this post: I mentioned that I thought that the batch size for that last run, 27, was a bit small considering that we'd managed to fit a size of 64 into the 160 GiB/GPU machine. But after thinking about it for a bit, it occurs to me that during my experiments doing fine-tuning, I came to the conclusion that memory use scaled linearly with batch size , with a fixed amount per element in the batch (the activations for the model for that batch element), plus an overhead (the model itself, the optimiser, and perhaps other stuff). We have batch sizes for: Now, that is slightly messy data because each memory "measurement" is the size of the card's VRAM, not the amount of VRAM we actually used -- there might have been anything from zero to just less than one extra batch element's worth of "spare" space -- but we can see what we get with a simple linear regression: And if we plot that, we get this: Nice! That fits really well. So we have an overhead of about 11.5 GiB, then about 2.35 GiB per batch element on top of that. That is, of course, somewhat sad news for anyone trying to repro this on a GPU with 12 GiB -- looks like it would be just too small to even fit in a single-element batch after the overhead :-( Anyway, that's been a bit of a side quest. Let's try our last machine size for what has (once again) turned into a bit of a monster of a blog post... This is the same kind of instance as the first train in this post, except that it has double the VRAM per GPU. Let's see what we can do with it. Once again, we create the run file, commit and push, then spin up the machine. On it, we clone the repo, run then . Next, we can find our micro-batch size: Interesting, we managed to squeeze an extra one in compared to the H100's batch size of 27, despite having exactly the same amount of VRAM! Not sure what might have caused that. 
It took 4 minutes to get to this point, so let's get that batch size into the config and kick off the run. The initial validation takes 1m06s, which is consistent throughout the train. The first real val run is at 8m15s in, and the estimated train time is 2h35m, with a tokens-per-second of 286,188. At the end: Again, the latest and the best global steps are the same (despite some loss spikes): ...so we just need to download that and shut down the machine. How much did that cost us? The machine was running for 3h25m, costing US$14.32 / hour, for a total of US$48.76. Our train took 11,532 seconds, which is 3h12m, and our setup took about 4 minutes -- maybe five including the time required to update the train config with the micro-batch size, so we have 7 minutes on top of that, which is about the amount of time it took to download the model. Let's run some evals! Our smoke test gives us this: Coherent enough, I think! Now the loss on our test dataset; it comes out as 3.730, so pretty similar to our other cloud trains, apart from the oddly-low one on the 40 GiB GPUs. Now let's see what GPT-5.1 thinks of the instruction fine-tuned version. It only needs two epochs of fine-tuning, and believes that "The author of 'Pride and Prejudice' is 'Pride and Prejudice'", which is not promising, and gets a score in the same kind of range as the other models, 11.71. So: we've trained four models on four different machine sizes. Let's see how they stack up against each other, against our locally-trained models, and the original OpenAI GPT-2 weights. So, I've trained four of my 163M-parameter GPT-2 models, using almost exactly the same dataset -- the Chinchilla-optimal number of tokens, rounded up to make an even number of batches. 
I did this on four different multi-GPU machines on Lambda Labs: I've done some evals on each of the models, so let's put those results together in one table -- results for the trains in this blog post, alongside those for the original OpenAI GPT-2 weights, both small and medium, and for the models I got when training locally. For all models, I've provided: I've sorted the models in order of increasing loss on the test set -- so, the best model by that measure is first. The instruction fine-tune results are kind of all over the place, and I'll look into that later 5 . For now, let's focus on the test loss. We have a pretty clear pattern, where the local trains are grouped together at around 4.0, and the cloud trains at around 3.7. For the local trains, as I noticed last time around, FineWeb is counter-intuitively better than FineWeb-Edu. There are two interesting things about the cloud trains: I think that what we're seeing here is that larger batches are better, but only up to a point. It's as if there's some kind of curve like this: I got that by taking the log of the batch size, then asking NumPy to do a polynomial regression -- that is, work out a , b and c so that the formula ...fits it as well as possible: It's kind of interesting that it's such a good fit with such an ad-hoc formula! We have a nice smooth curve hitting almost all of the points, and our optimal batch size looks like it's just a little below that 104 we managed with the smaller cloud machine, at about 97. But it's certainly not something that I'd like to read too much into. Best to treat it as purely illustrative: "it might be something like this". I think digging into that might be an interesting experiment at some later point. A bit of checking around the Internet (and a chat with ChatGPT) suggests that it's something people have looked into in some detail, unsurprisingly. 
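The curve fit described above (take the log of the batch size, then fit a quadratic with NumPy) can be sketched as follows. The formula itself is missing from the text here, so this assumes the form loss ≈ a·(ln B)² + b·ln B + c, where B is the global batch size. The data points below are synthetic, generated from a known curve purely to show the mechanics; they are not the results from this post. `fit_log_quadratic` and `best_batch_size` are hypothetical helper names.

```python
import numpy as np


def fit_log_quadratic(batch_sizes, losses):
    """Fit loss ~ a*ln(B)**2 + b*ln(B) + c and return (a, b, c)."""
    x = np.log(np.asarray(batch_sizes, dtype=float))
    a, b, c = np.polyfit(x, np.asarray(losses, dtype=float), deg=2)
    return a, b, c


def best_batch_size(a, b):
    """Batch size at the vertex of the fitted parabola (a minimum only if a > 0)."""
    return float(np.exp(-b / (2 * a)))


# Synthetic points (NOT measurements) drawn from a known curve whose
# minimum sits at exp(4.6), i.e. a batch size of roughly 99.5.
true_a, true_b, true_c = 0.05, -0.46, 4.8
synthetic_batches = [6, 50, 100, 200, 500]
synthetic_losses = [
    true_a * np.log(n) ** 2 + true_b * np.log(n) + true_c
    for n in synthetic_batches
]

a, b, c = fit_log_quadratic(synthetic_batches, synthetic_losses)
print(f"estimated optimal batch size: {best_batch_size(a, b):.1f}")
```

With only a handful of real points, a three-parameter curve will fit almost anything well, which is one more reason to treat the fitted optimum as illustrative rather than as a measurement.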
An interesting point ChatGPT raised is that with our pretty much fixed "budget" of tokens -- we're always training on something close to the Chinchilla-optimal number -- then a larger batch size means that we're doing fewer optimiser steps. Intuitively, that sounds like a problem. The larger batches mean that each move across the loss landscape is "better", or at least more stable. But we're doing fewer of those moves over the course of the train. There's obviously a tension between those two. You can imagine a degenerate case where the batch is so large you can fit the entire run into one iteration, so you do just one update of the parameters; that obviously wouldn’t work very well. Anyway, for the purposes of this post, let's flag it as interesting and move on. Let's take a look at costs. Here's another table for those -- for each cloud model, I've listed: What do these numbers tell us, given what we were trying to do here? Like I said at the start, this was a pretty expensive learning experience: I wound up spending US$215.16 on Lambda Labs instances over the course of putting this all together. But it was worth it! At the start of this post (if you can remember so far back), I said I wanted to achieve two things: Yes, absolutely. The trains I did, if we exclude the validation time, each cost between US$35.56 and US$39.14. In time, also excluding validation, the slowest ran for about 3h25m, and the fastest just less than an hour. Now, in a future post I want to try making the changes that I listed at the end of my last post to see if I can get the loss lower: If I'm to do those, what I'll need to do is start with a baseline train on one particular size of machine, and then try introducing each change separately to see what happens to loss. I'll want to use a fixed seed for random number generation, so that I start with the same initial weights each time. 
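The validation-cost bookkeeping used for the cost comparison above is simple arithmetic, sketched below. The inputs are figures quoted in this post: 11 one-minute validation runs on the US$39.92/hour machine, and 24 validation runs of about 38 seconds during the 6,650.197-second train on the US$23.92/hour machine. Whether the cost table was computed in exactly this way is an assumption, and both function names are hypothetical.

```python
def validation_cost(n_runs, seconds_per_run, dollars_per_hour):
    """Cost in dollars of the time spent on validation runs."""
    return n_runs * seconds_per_run / 3600 * dollars_per_hour


def train_without_validation(train_seconds, n_runs, seconds_per_run, dollars_per_hour):
    """Return (seconds, dollars) for the train with validation time subtracted."""
    pure_seconds = train_seconds - n_runs * seconds_per_run
    return pure_seconds, pure_seconds / 3600 * dollars_per_hour


# 8x B200: 11 validation runs at about a minute each.
b200_validation = validation_cost(11, 60, 39.92)
print(f"B200 validation cost: US${b200_validation:.2f}")
# B200 validation cost: US$7.32

# 8x H100: 24 validation runs at about 38 seconds each.
h100_validation = validation_cost(24, 38, 23.92)
h100_seconds, h100_cost = train_without_validation(6650.197, 24, 38, 23.92)
print(f"H100 validation cost: US${h100_validation:.2f}")
print(f"H100 train without validation: {h100_seconds / 60:.0f} min, US${h100_cost:.2f}")
```

The first figure reproduces the US$7.32 quoted earlier for the largest machine. The others are rough, since the per-run validation time is itself only approximate.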
Given what these experiments have already shown about loss -- that the smallest, cheapest machine has better loss than the other more expensive ones due to what I assume is the batch size -- then that actually feels like exactly the right machine to choose for this. It does take a while to train anything, but three and a half hours is pretty acceptable, I think -- I can do a train or two per day. An 8x A100 with 40 GiB VRAM per GPU is the way forward. So: next steps. I want to: This is going to be fun. Stay tuned! I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit . The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.  ↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩ I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right. DataParallel (DP). With this: The default GPU (normally ) is in charge of the process. 
It gets a batch of data, divides it up into per-GPU "micro-batches", and sends each of those to a thread for each of the other GPUs. It then sends an up-to-date version of the model to each GPU. Next, all of the per-GPU threads do a forward pass on their replica using their specific micro-batch, and send their outputs to the thread for the default GPU. The default GPU thread aggregates all of those outputs (similarly to how the losses across all of our batches and the prefix sequences are aggregated in the normal single-GPU case ) to work out an overall loss. It then does a backward pass. This will start on the default GPU, as the aggregation step is the first thing that it will come to when going backwards through the steps that came up with that overall loss. However, it will then come to operations that happened on the other GPUs and those are (somehow) parallelised. Once that is done, each GPU has gradients that represent how their copies of the model contributed to the overall loss. Finally, they send those gradients back to the default GPU, which combines them (I think of this as just being an average, though I gather it's more complex) and applies them, producing an updated model. Then the process repeats; the updated model on the default GPU will be sent to the other GPUs in the second step of the next iteration. DistributedDataParallel (DDP). This does less work on the default GPU and does less copying around. Each GPU has its own process (rather than thread), and is essentially responsible for its own training loop. Right at the very start, the default GPU's process sends the model to all of the others. Then all processes go into their training loop: Firstly, each one works out its own micro-batch (which means you need to have code to make sure that the datasets are properly split across the GPUs) Each model does its own forward pass, then its own backward pass, working out its own independent gradients. 
As it comes up with those gradients, it broadcasts them to a "reducer", which handles the aggregation. This is done in a distributed way -- there's not just one reducer handling everything. When all models have completed the backward pass, the reducer has a set of combined gradients, which is visible from the per-GPU processes. Each GPU process does its own optimizer step using those combined gradients. That means that there's no model copy required -- each GPU has applied the same gradient update, so they already have in-sync models, assuming everything went well. ZeRO. This is a much more complex system, and I went into how it works in this blog post . , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) = 0 for the process with rank 0 = 1 for the process with rank 1 = 7 for the process with rank 7 = 8 for the process with rank 0 = 9 for the process with rank 1 = 15 for the process with rank 7 Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack" But if is set to 2, which it happened to be in my case, then it will silently fail -- our first eval loop will get the first X from the validation set as , and the second X as . Zoom through the records in the dataset in batches of 1,000. 
For each batch: Tokenising each batch, so we get a list of lists of tokens. Convert that list of lists into a single list tokens separating each item. Convert that list into a PyTorch tensor. Add the tensor to a list. After that's all done, use to convert the list into a single tensor, and then save that with . I can upload the datasets to Hugging Face; their network connection will be better than mine, so I can just pay the price in time of uploading everything from home once, and then I can download them faster from HF to LL. That also has the benefit of meaning that after this experiment I can safely delete the local files, but then download them again if I need them. And if anyone else wants to repro this experiment, the data will be easily available to them. Lambda Labs have persistent filesystems that you can use. They cost $0.20/GB/month, so that would be about $5/month for all of my datasets. So I could upload the data to a cheap instance with a persistent filesystem mounted, shut down that instance but keep the filesystem, and then mount it on each machine I use to run tests. . The world size -- that is, how many per-GPU processes are we running? The micro-batch size The sequence length An 8x B200, with 160 GiB per GPU, at $39.92/hour An 8x H100, with 80 GiB per GPU, at $23.92/hour An 8x A100, with 80 GiB per GPU, at $14.32/hour An 8x A100, with 40 GiB per GPU, at $10.32/hour The loss they got on the validation set from the first train. Strictly speaking, I was kind of cheating and using that as a test set. The score given by the OpenAI GPT 5.1 model for an instruction-following dataset. This was the one provided in the book -- an Alpaca-style Q&A dataset, with a well-defined train and test set. Each model was fine-tuned on a training set of 85% of the data until loss on a validation set of 5% of the data started rising, and then tested on the remaining 10%. 
Sebastian Raschka, being a pro, was splitting up the data properly :-) If we're going to do validation then it does make some sense to do one at the start -- but doing one training iteration first seems kind of arbitrary (though it's clear how that drops out of the existing code). The validation runs on this machine are taking longer than they were on the less-powerful A100 GPUs! That confused me for a bit, until I realised that I didn't notice that it was slower with the batch-size 13 test, only with the larger ones later in the binary chop. If we're using larger batches, then there's more work to do for the validation. Doing this binary chop by hand is annoying and error-prone, and worse, we have to wait for one of those (long) validation runs before we get into proper training. The initial training iteration can succeed, while later ones hit memory limits -- it seems like we need to wait for three or four training iterations before we can be sure that we have a workable batch size. Not quite sure why that is, perhaps it's something in the optimiser or the scaler? If : Local snapshot path. If : A list of DryRunFileInfo objects containing download information. I updated the function so that it takes flags to tell it whether or not to do validation (default true) and an optional maximum number of steps, which is by default. With those default values, it does exactly the same as before, of course. I created a function, which does all of the dataset-loading stuff that the original function did, and then calls with a -wrapped model. So that maintains the current flow. Next, I added a flag to the script; if that's not set, it just calls . However, if it is set, it instead calls a new function, which determines the largest batch size we can fit onto the current hardware for the current run, and (on the rank 0 process only, to avoid log spam), prints it out. 
does what it says on the tin; it confirms that we can train with batch size of 1, and that we can't with batch size 70 (chosen because the limit was 64 on that massive B200 machine), then chops between them to find the largest batch size that doesn't OOM. It uses for that -- that just constructs a dataset with the appropriate batch size, then runs a three-step train with no validation to see if it raises an OOM. PyTorch rather messily just raises a generic for those, but we can look inside the exception's message to see if it is an OOM. Create the run file, commit and push. Spin up the machine. On it: Clone the repo We had two nasty loss spikes. As a result of the second of those, the best iteration as per validation loss is not the last one. Best checkpoint: 4 epochs of fine-tuning, and a score of 11.98 -- another record low! Amusingly, it confidently said "The author of 'Pride and Prejudice' is Sarah Palin". Latest checkpoint: 5 epochs of fine-tuning, and a rather good score of 17.91. 24 GiB locally, which was 6 40 GiB in the first train in this series, which was 13 80 GiB in the last one, giving us 27 160 GiB in the one on the huge machine, giving us 64 An 8x A100 40 GiB An 8x A100 80 GiB An 8x H100 80 GiB An 8x B200 160 GiB The loss on my test set. The results it got on an instruction fine-tune test based on Sebastian Raschka's. The global batch size (that is, for single GPU runs, just the batch size, but for the multi-GPU ones, where each batch is made up of per-GPU micro-batches, the per-GPU batch size times the number of GPUs). 4 They're all consistently better than the local ones. The one on the smaller machine is better than the ones on the larger ones; indeed, it looks like the larger the machine, the worse. How long the training run took. How much the machine cost per hour. How much the training run cost. How much of that was doing validation (which I'm now thinking is pointless on single-epoch trains like this). 
How much it would have cost, and how long it would have taken if it had been run without validation. I wanted to learn how to change a simple single-GPU training loop to make it multi-GPU. Could I get the training time for a full base model down from 48 hours to something more manageable -- and, hopefully, not too expensive? Removing dropout Tweaking the learning rate (and maybe adding the warmup and cosine learning-rate decay stuff I've read about). Reverting the architectural differences between our model and the original GPT-2: reintroducing weight tying between the token embeddings and the final linear layer, and also bias in the attention weights. Trying full-fat 32-bit precision. Fixing the exploding gradients issue with gradient clipping. Dig in to the instruction fine-tuning tests a little more -- as I've said above, I'm not 100% happy with how comparable it really is between models, at least given how I've been running it so far. Upload the models we have to Hugging Face. I have a new motherboard ready for my PC, and replacing the old one has a risk that I might mess up and break the NVMe drive I have them stored on. I was holding off on this because it would mean sharing Raschka's GPT code, but having noticed that he's already licensed it all under the Apache license, I can release them under the same one. Strip out the validation stuff. We can use training loss to track our progress, and losing evals during the train will help keep the cost down. Finally, do the trains to see how each of the levers above affects loss.

Giles's blog 4 months ago

Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090

Having worked through the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware? The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series , I did some naive scaling of numbers I'd got when fine-tuning LLMs and came to the conclusion that it would be impossible in a reasonable time. But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC. Additionally, Andrej Karpathy recently announced nanochat , "the best ChatGPT that $100 can buy". He mentions on the main page that he's trained a model called , with 32 Transformer layers, which has 1.9B parameters, for about $800. His smaller 20-layer model, with 561M parameters, he says should be trainable in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the $100 total price. What's even more interesting about nanochat is that it's built with PyTorch; initially I'd got the impression that it was based on his pure C/CUDA , which I would imagine would give a huge speedup. But no -- he's using the same stack as I have been in this series! Karpathy's models are both larger than 163M parameters, so it definitely sounded like this might be doable. 
Obviously, I'm nowhere near as experienced an AI developer, and he's using a larger machine (8 GPUs and each of them has > 3x more VRAM than mine), but he's also including the time to train a tokeniser and instruction fine-tune into that four hours -- and his smaller model is more than three times larger than mine. So that should all help. This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project. But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2 small sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs. Here's the full story. For this project, I want to use the exact same model code as Raschka presented in the LLM from scratch book -- my copy here . There have been a number of architectural improvements to LLMs since GPT-2, but for now it's best to keep things simple. But there are still some settings to decide on. The config dictionary for the models we've been using has these parameters: There's also the aspect of weight-tying -- the original GPT-2 reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits . There's nothing in the code we've been working with to enforce that, though -- when we do our small train in the book, we're using independent weights for each of those steps. The only time it is "enforced" is when we download the pretrained weights from OpenAI, where we put the same values into both the embedding matrix and the final output head. Given that Raschka says that it's in general better to avoid weight-tying, and actually doing it would be harder than not doing it, then it seems a no-brainer to not do it. So, what does that mean about our model? 
That matches what we got when working through the book; 163M parameters. Can we train it? It seems like every AI project starts with the question "what data can we use?" The original report on GPT-2, " Language Models are Unsupervised Multitask Learners ", is frustratingly lacking in details. However, it does say that they trained it on "8 million documents for a total of 40 GB of text". Now, according to OpenAI , it's reasonable to assume roughly four characters per token for typical English text. So 40 GB of text is ~10 billion tokens. That data was essentially gathered by scraping pages linked from Reddit that had more than three upvotes there, so was reasonably high quality. Can we get something similar? Conveniently, Hugging Face host a big dataset called FineWeb , and that has a 10 billion token "sample" dataset, randomly selected from the full 18.5 trillion tokens. So the sample feels like it's order-of-magnitude right. And while reading more about Karpathy's nanochat, I spotted that it uses FineWeb-Edu , which is a version of FineWeb that contains "only the most educational web pages". I wrote a script to download both of those , and kicked it off. It took about 20 minutes for each one (slow wifi in my study, I was getting < 5MB/s); FineWeb's 10B sample took up about 29 GiB, and FineWeb-Edu's about 27 GiB. Time to take a look at them. The Hugging Face function loads up all of the files you provide, and you can tell it how to split them up into train/validation/test sets. This command just loads up the whole FineWeb one and says "treat it all as the train split", which is good enough for now: Yikes. It took 1 minute, 53 seconds to generate the train split. However, that appears to be a one-off cost -- when I accessed it again later using the same code in a different Python session, it just did the second "Loading dataset shards" portion, taking three seconds, not the generation of the split. Presumably it caches it. 
Anyway, let's see what's in it: Great, so we have 14,868,862 rows, each of which has various bits of information. Checking the first one's text: Well, for FineWeb, that doesn't look particularly "fine", but I guess it's better than the stuff that Karpathy talked about in his recent interview with Dwarkesh Patel : When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet Let's take a look at FineWeb-Edu. That looks a lot better! Now let's take a look at the document lengths in terms of tokens. There's a column, but I don't know which tokeniser that's for, so to be safe we'll calculate it ourselves. How long would it take to tokenise every row in FineWeb 10B to check? Let's tokenise the first 10,000 of the 14,868,862 that we have, and see how long that would take -- then we can work out the estimated time for the whole thing. 2,160 seconds or about 36 minutes. Yikes! After a bit of digging, though, I found that tokenisers can handle batches (poorly documented, but it's there in the source ): Also, we can map a function over an entire HF dataset, and that can be made to run with multiple processes. So, we can combine the two: Just over three minutes, not too bad! (The reason the command count above jumps from 47 to 53 was that in the first run I didn't have the in there -- one of the rows in the dataset had in it, and the tokenizer rejected it. I'm going to play fast and loose and ignore that for now.) Now let's see how it added it: Cool! We've added a column with the number of GPT-2 tokens for each row, and we can extract what amounts to a list of those values. Let's plot them as a histogram. 
Trying to do it directly -- that is, just doing ...seems to make MatPlotLib very unhappy, and my interpreter crashed with an OOM -- I think it might be trying to load all of the dataset -- text, IDs, etc -- into RAM in one go. So I started a fresh one and did the stuff to load it and annotate it with token lengths again -- weirdly, this time the mapping only took 10 seconds or so! That was strange, I'll need to look into that. Perhaps the earlier command added the column to the files on disk? To work around the memory issue, I converted the column from the dataset to an actual list: That took ten or twenty seconds. Let's then try the plot again (full code this time): That took about 11s to run, and the result is this: That's really promising! The bulk of them are less than our 1,024 token sequence length. 1 If we present each row in the dataset as a stand-alone training sample, cropping them when necessary, perhaps we won't lose too much data? Let's see. First step, how many tokens are there in total? Nice, about 10B, as expected. How many tokens would we have if we cropped them to the default GPT-2 context length of 1,024? Ouch, 7.3B. That's quite a reduction: So we're losing 29% of our tokens by that cropping. That's from curtailing just 16% of the sequences: That's not great. I feel that we have two options here: At this point in the experiment, I'm going to keep both options open. I'm inclined towards the latter (I believe it's closer to what the real GPT-2 train did), but I'm not sure. Anyway, we're scoping things out here, so let's move on. After looking at the data, I've thought a bit more about this. I'd previously been thinking in terms of training across all of the tokens in the dataset; we'd work our way through the 10B tokens, and then we'd be done. 
But when training a model, you do multiple epochs, normally -- you run through the dataset once, updating your gradients as you go, then run through it again likewise, and eventually you stop when your validation loss starts rising. I think that because I'd read that LLMs are normally trained on just one epoch these days, I'd kind of internalised that we only need to do one. But it wasn't the case in 2019 when GPT-2 came out. They had less data -- just 10B tokens or so, compared to insanely huge datasets like the full FineWeb (not the 10B one we've been looking at -- the 18.5T full one), so they would have trained it for some number of epochs. How many? That's another case where the GPT-2 paper is annoyingly light. This report says in the "Replicating GPT-2" section that OpenAI trained it for 800k iterations with a batch size of 512. Plugging in a sequence length of 1024, that gives us this many tokens: Over 419B tokens! Now, if we believe that their dataset was 10B tokens, then we can work out how many epochs that came to: The same report says that they -- as in, the report authors -- make that "around a total of 60 epochs through the training set" -- I believe that the training set they're talking about could well be slightly shorter than the original GPT-2 one -- the GPT-2 authors didn't release their own, which is called "WebText", so the report's author is using a different one that tries to replicate it, OpenWebText . That sounds expensive; even without knowing how many tokens per second we can train for, 40-odd epochs of 10B tokens each sounds like it would take a long time. Are there any other comparison points that might tell us how long to train for? Well, there's a "Chinchilla heuristic" that I've heard of, which says that you should train on about 20 tokens per model parameter. 
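The arithmetic behind those figures is simple enough to sketch. The constants below come from the numbers quoted above (800k iterations, batch size 512, sequence length 1,024, a dataset of roughly 10B tokens); the parameter count is this model's 163,009,536, and the variable names are just for illustration.

```python
# Figures quoted for the original GPT-2 training run.
GPT2_ITERATIONS = 800_000
GPT2_BATCH_SIZE = 512
SEQ_LEN = 1024

# Rough size of the dataset, assuming ~10B tokens as estimated earlier.
DATASET_TOKENS = 10_000_000_000

# Parameter count of the model being trained in this post.
PARAMS = 163_009_536

# Total tokens seen during the GPT-2 run, and how many passes over 10B tokens that is.
gpt2_tokens = GPT2_ITERATIONS * GPT2_BATCH_SIZE * SEQ_LEN
gpt2_epochs = gpt2_tokens / DATASET_TOKENS

# The "20 tokens per parameter" heuristic.
chinchilla_tokens = 20 * PARAMS

print(f"GPT-2 tokens:      {gpt2_tokens:,}")
print(f"GPT-2 epochs:      {gpt2_epochs:.1f}")
print(f"Chinchilla tokens: {chinchilla_tokens:,}")
```

On these assumptions the heuristic asks for less than one hundredth of the tokens that the GPT-2 run consumed.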
I spent some time reading into where that comes from; originally it's in "Training Compute-Optimal Large Language Models" from Google DeepMind. It's an interesting paper, and surprisingly easy to read, with a few bits of maths that get a bit hairy (but aren't required to get a good-enough feel for what they're saying). I recommend you take a look. It was written in 2022, and the authors felt that people were scaling up models a lot, but weren't increasing the number of tokens that they used for training enough. So, they trained a huge number of models, trying to answer the question: "given a particular budget in training FLOPs, what is the optimal balance of training tokens versus parameters to make sure you're using those FLOPs most efficiently?". They were arguing against the method taken in a particular paper, where another team had trained a model (called Gopher) on significantly fewer tokens than they thought optimal. The number of FLOPs used to train a model is linear with both the number of parameters and the number of tokens you train it on, so if you get 2x the number of FLOPs that you had before, you can either train the same model on twice as many tokens, or you can double its size. Which is better? Their conclusion was that you should actually scale both parameters and tokens up by the same amount -- that is, in the 2x case you'd want to have √2 times both the parameters and the tokens, which would double your FLOPs (because √2 × √2 = 2) and get you better performance. As you can probably see, by doing this they indirectly worked out an optimal number of tokens to train a particular size of model for. They don't state the "20x" heuristic themselves, but it's pretty clear in table 3 in the paper, where they give a number of model sizes and the optimal number of tokens for each.
Now, this number is not the number of tokens you need to train for to get the best model you can for a particular number of parameters; a model of a given size can always be trained more and will (hopefully) get better. But it tells you when you've trained on enough tokens that you could get better results by training a larger model than you have right now. They're implicitly assuming that models can get as large as you want, which of course is not the case -- in reality, you're going to be targeting a particular model size, the size that can fit on your training hardware (or more likely with production models, the size that can fit on your planned inference hardware). But interestingly, looking at the README.md for Karpathy's nanochat project, he trained his 1.9B "d32" model on 38B tokens -- exactly 20x. And if you look at the script in the same repo, he explicitly says that he's training for 20x parameters for the smaller model: If Andrej Karpathy thinks that training for Chinchilla-optimality is the right way to go, then who am I to disagree? ;-) More seriously, perhaps the better quality of the dataset makes this a reasonable thing to do. From the GPT-2 paper, their description of how they got the data: ...we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. That's a clever trick, but I believe that FineWeb is much more carefully filtered and improved than the WebText dataset they got from that. Back in 2019, they had to do everything from scratch -- find appropriate ways to get data, filter it, and so on. Now we can just download stuff from Hugging Face. 
So maybe Chinchilla-optimal is enough. Anyway, we have 163,009,536 parameters, so on that basis, let's train for 20 × 163,009,536 = 3,260,190,720 tokens. (I'll just use 3.2B from now on, but that's the actual number I mean.) That's pretty cool! We have more than that number of tokens already in our FineWeb 10B sample, so we can do a single-epoch training run. So the question is -- is that even doable on my hardware? It all hinges on how many tokens per second we can train at. A good way to check this is to write a throwaway "trainer". We can use that to work out what our maximum batch size is on the RTX 3090's 24 GiB of VRAM, then run a bunch of batches through -- a forward and backward pass for each -- and see how many tokens per second we get. This won't estimate how much time we'll spend validating the model, of course. But my gut is telling me that we should spend no more than 5% of our training time running validations, so we can later on do a similar test, in eval mode, forward pass only with no gradient tracking, and use that to work out how many tokens should be in the validation set. So, let's estimate training speed. This code gets an estimate of tokens/second at different batch sizes. Hopefully it's clear enough to not need an in-depth explanation. An outline: Here's what it prints out: So we can see that it gets faster as we increase the batch size, which makes sense because we're handling sequences in parallel, but it does flatten off a bit, which makes sense because there's a limit to how much parallelism we can do, even on a GPU. Let's see how that fits in with the different training sizes we looked at above: OK. We're definitely not going to be able to train this thing the GPT-2 way! I expected that to be the case, but now we have solid evidence of it. But the three-day Chinchilla-optimal train actually sounds doable! I'm heading to London to visit family soon, so won't be using my home PC.
With a bit of help from Tailscale I'll be able to log into it from my laptop, though, so I can potentially nurse a run through. Can we make it any faster? Now, when doing the fine-tuning work, I found that you could generally speed things up by doing everything in 16-bit rather than 32-bit. Intuitively that makes sense -- lower-precision numbers, with fewer bits, mean less work for the GPU doing the various multiplications and additions that are involved in our train. Working with ChatGPT, I found a couple of ways to take advantage of that. Firstly, using TF32. The normal float32 format uses 8 bits for the exponent, and 23 for the mantissa. If you haven't looked into how floats are represented in memory (or if you've forgotten), that means that, using m to mean the mantissa and x the exponent, the numbers are represented in memory as, roughly, m × 2^x (with one more bit for the sign). TF32 is messier; it has the same exponent size -- and thus the same range -- as float32, but it essentially ignores the lower 13 bits of the mantissa. So it takes up the same amount of memory, but is lower-precision, which means that calculations can be faster. Most importantly, cards like the RTX 3090 have dedicated "tensor cores" -- as opposed to the normal CUDA cores that do normal matrix multiplications -- and they operate in TF32. Unsurprisingly, "TF32" is "tensor float 32-bit". PyTorch's torch.set_float32_matmul_precision function allows you to tell it what precision to use for matrix multiplications; the default is "highest", which means "use float32 all of the time", so you're stuck using just the CUDA cores. If, instead, you set it to "high", then it will use TF32 if the hardware supports it and it has the appropriate kernels available. So that will let us use the tensor cores. I added this to the code above just above the loop over the different batch sizes: Let it run, and: That's a 22% speedup! Of course, the precision of the training isn't as good.
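To get a feel for how much precision that costs, here is a small standard-library sketch that simulates "ignoring the lower 13 bits of the mantissa" by zeroing them in a float32 bit pattern. It is only an illustration: the helper names are made up, and it truncates where real hardware may round, so treat it as an assumption-laden approximation of TF32 rather than a description of what the tensor cores actually do.

```python
import struct


def float32_bits(value: float) -> int:
    """Return the 32-bit pattern of `value` after rounding it to float32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def truncate_to_tf32(value: float) -> float:
    """Zero the lowest 13 of the 23 mantissa bits, keeping sign and exponent.

    Simplification: this truncates towards zero rather than rounding.
    """
    low_13_bits = (1 << 13) - 1
    return bits_to_float32(float32_bits(value) & ~low_13_bits & 0xFFFFFFFF)


for x in (1.0, 1.0 + 2**-10, 1.0 + 2**-11, 3.14159):
    print(x, "->", truncate_to_tf32(x))
```

With only 10 mantissa bits left, 1 + 2^-10 survives but 1 + 2^-11 collapses to 1.0, which works out at roughly three decimal digits of precision with the same range as float32.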
But given that many modern models are trained at 16-bit (I've seen suggestions that some are even trained as low as 4-bit) then that shouldn't matter. Let's see whether we can train in 16-bit instead. PyTorch has a smart mode where you can tell it "use 16-bit where it makes sense, otherwise use 32-bit" -- AMP, which stands for "Automatic Mixed Precision". There's a great recipe for how to use it in the docs , so let's use that. We need to create a object to handle scaling parameters from 16-bit to 32-bit as needed -- we can re-use that across all batch sizes so we can create it just before the loop: ...then we need to replace this core part of our training loop: ...with some code to use AMP and that scaler -- basically we use a context manager to switch it on when we're doing the forward pass and work out the loss, and then use the scaler to manage the backward pass and the optimiser's step: Running that gives us these results: Wow! With that we can train on 3.2B tokens in about 160,000 seconds, which is 44 hours. That's definitely doable. Now, what happens if we remove the ...so that we're using AMP, but not the tensor cores? It's basically the same. 300tps slower at the start, down to 70 at the end. Still, it looks better to keep the "high" precision in place, rather than the "highest". Right. We have the beginnings of a training loop that should be able to let us run a Chinchilla-optimal train on a GPT-2 small sized model in 44 hours, and I have the time to do it. And it looks like a batch size of six is what we can fit into the RTX 3090's 24 GiB of VRAM. What else are we going to need to build something to do this? If I want to do a long training run, then stuff might go wrong -- it might crash for some reason. So we're going to need to save checkpoints as we go and be able to restart training from those checkpoints. In those, we're going to need to save the model and the optimiser's state, plus some kind of info about how far through the dataset we are. 
We should keep training and validation losses too, so that we can easily chart and recover our progress, and according to this forum post we're going to need to save the scaler (which makes me think that it actually has state in it, so we probably should have used a fresh scaler for each batch size in the above -- let's hope that doesn't prove to be a problem [note from later: it wasn't]). I wrote a script to create a model, train it for a bit, and then dump out all of that apart from the metadata (which I reckon is going to be less than 1kB). I wanted to use the safetensors format for all of it, but unfortunately I couldn't get it to work for the optimiser or the scaler, so had to use for those (which I don't like because it uses pickle , which introduces serious problems if you ever want to move files from machine to machine, as the Python and library versions need to match perfectly). Ah well. Here's what the test checkpoint looks like: That's huge! And it's almost all the optimiser. From what I read, that stores two numbers per parameter, so it makes sense that it's double the size of the model weights. And at 32-bit, 4 bytes per param, then 670MiB for the model is sane. Timing-wise, it takes about a second to save, the same to load, so that's fine. So that sounds reasonable in terms of timing, and disk space is pretty high, but not so huge that it can't be managed with careful planning -- don't checkpoint so much that we run out of disk during the train (I have a 2TiB disk, but it's far from empty). It's probably worth double-checking that it works, though! Because my checkpoint test already did some training, I changed it so that it does this: Looks sane! The numbers for loss are the same before and after, so I think it's vanishingly implausible that the checkpoint we restored is different from the one we saved. And the continued training seems to be working -- at least, loss is going down -- so that sounds reasonable too. 
OK, so, again, the time taken to checkpoint is negligible, but the disk space isn't. I reckon we can comfortably do 100 checkpoints over the train. That's roughly one every half-hour over 44 hours. We're going to want to do a validation run each time we checkpoint, so let's think about that next. How big should our validation set be? Let's say we only want to spend 5m per checkpoint period doing validation. How many batches can we get through in that time? I wrote a simple script to run a model (after a few hundred training steps) in eval mode on different numbers of iterations to see how long each one took. It used the same trick as the training loop above in order to use mixed precision, and I ran it with instead of the that I've used in the past (ChatGPT tells me it's a little faster). I also put in some calls to around the loop that I was timing, which should apparently help make sure that the numbers are precise. The code is here if you'd like to take a look. After some fiddling with the min/max numbers at the top: OK, so let's call it 3200. That's 3200 * 6 * 1024 tokens = 19,660,800 tokens. That's about 0.006144 of our training set. Pretty low, but we're talking about such a large training set that I think we're OK. And practically we can't do more -- we're already talking about 5 mins every half-hour, so we're bumping up our train time by 88 * 5 = 440 minutes, which is seven hours. Now let's start thinking about the datasets. We can split the HF thing into train and validation sets. I'm thinking it might be useful to load all of our training and validation data into RAM for the train loop. 3.2B tokens with four bytes per token should be about 13 GiB, after all, and I have 64 GiB RAM on the machine. ...but wait, int64 is the default for PyTorch for long ints -- that's what our token lists are in the original, and it's twice the size, so we're talking 26 GiB. I believe that PyTorch expects that format for the cross entropy loss. 
That's not the end of the world, though -- we can store the data as int32 in RAM (with 50,257 as our vocab size we could even use int16 if we wanted to) and then we'll need to make them the right type just before using them. We can do that when splatting them onto the GPU. First thought, can we store them as a Python list? Turns out they're not all that memory-efficient, though: How about PyTorch tensors? Promising! (Though ChatGPT pointed out when reviewing a draft of this post that I was using the default rather than an type here. Still, it's the same size.) Let's measure memory usage in a new interpreter. Yup, 12,801,474,560, so about 12 GiB. Can we save it? OK, let's try reloading it in a fresh session: Nice. So, I think we can write a quick script that splits our incoming dataset into, say, 99%/1% train and validation, grabs the first 3.2B tokens from the training set, glomming them together into one big tensor with EOSes between them, and saves them, and then does likewise for the first 19,660,800 tokens from the validation set. We'll use FineWeb, with the possibility of switching to FineWeb-Edu later on. Doing it that way means that we're actually using the second of the two options I considered earlier: treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. I thought it would be harder than concatenating/padding rows, but it actually turns out to be simple enough. Let's give it a go. Here's the code. I wanted to have a round number of 6-sequence batches of 1,024 tokens each, so the number of training tokens worked out at 3,282,124,800 (534,200 batches) rather than the strict Chinchilla-optimal 3,260,190,720, but that's no biggie. Running it takes 5m55s, and then: Looks about the right size -- 19M * 4 for val, 3.2B * 4 for train. Cool! Let's finally write our training script.
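The "one long document" approach described above can be sketched in a few lines. This is a toy, list-based version rather than the script linked from the post: the function name and arguments are hypothetical, the real thing works with tensors and safetensors files, and the EOS ID of 50256 mentioned in the comment is the GPT-2 tokeniser's end-of-text token.

```python
def pack_documents(tokenised_docs, eos_id, seq_len, max_sequences=None):
    """Concatenate token lists with an EOS after each, then cut into fixed-length rows.

    Hypothetical helper for illustration. Any leftover tokens that don't fill a
    complete row at the end of the stream are dropped.
    """
    stream = []
    for doc in tokenised_docs:
        stream.extend(doc)
        stream.append(eos_id)  # delimiter between documents

    n_full = len(stream) // seq_len
    if max_sequences is not None:
        n_full = min(n_full, max_sequences)

    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Tiny example: three "documents", EOS represented by 0, rows of four tokens.
# For GPT-2 itself you would use eos_id=50256 and seq_len=1024.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
rows = pack_documents(docs, eos_id=0, seq_len=4)
print(rows)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that rows can start or end mid-document, as the second one does here; that is the price of not throwing away the tokens beyond position 1,024 in long documents.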
You can see the full training script here -- note that this is the final version from the repo, so isn't exactly what I'm running at this point in the post. The checkpointing code is (sensibly enough) in a separate file. It took two days to run, and... Both train and validation losses fall nicely! Training loss is a bit choppy, but that's because I erroneously only plotted the most recent iteration's training loss rather than an average over all iterations between the last and current validation run; the validation loss is correct because I did average all of the validation numbers. (The version of the code linked above fixes that error.) The best checkpoint for val loss is not the last one, but it was close. Looking at the last 5 iterations, their val losses were: It's time to do some evals. Firstly, let's try the smoke test that we do in the book. What does our model think should come after the text "Every effort moves you"? With uninitialised weights we get gibberish, as expected. But with our best checkpoint we get this: Nice! The multiple mentions of protein are actually the kind of repetition that small models tend to do, so that's not bad news. Let's try with the last iteration's checkpoint: Also very nice, perhaps better! I think that both of those are qualitatively as good as the result we got when we loaded the pre-trained weights from OpenAI, which was: That's very reassuring. But is there something a bit more quantitative that we can do? Firstly, can we compare it to anything in the GPT-2 paper? In figure 4 they give their perplexity against their train and test sets for the different model sizes; for the small one it's a bit over 16. Let's assume that they're basing that on natural logarithms, so they mean that they have a loss of ln 16. That's about 2.77, which is much lower than our best loss of 3.9401. However, that is across different datasets, so while it makes me suspect that their model is better than ours, we can't really say for sure either way.
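The conversion between the two measures is worth spelling out. This sketch makes the same assumption as above, namely that perplexity is the exponential of the mean cross-entropy loss measured with natural logarithms; the function names are just for illustration.

```python
import math


def perplexity_to_loss(perplexity: float) -> float:
    """Cross-entropy loss (in nats) implied by a perplexity figure."""
    return math.log(perplexity)


def loss_to_perplexity(loss: float) -> float:
    """Perplexity implied by a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


print(f"{perplexity_to_loss(16):.4f}")      # the paper's figure, as a loss
print(f"{loss_to_perplexity(3.9401):.1f}")  # our best validation loss, as a perplexity
```

Because the relationship is exponential, a gap that looks modest in loss terms is a much larger one in perplexity terms.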
The cool thing is, though, that we have their model -- so we can actually run it against our dataset. I wrote a script to do that, and running it gives us this: Still better than ours :-( I considered doing the same thing against Qwen to see whether that was also better, but with a different tokeniser we couldn't really treat it as comparable. Loss and perplexity are both over next-token predictions, and if the meaning of "token" changes, then the numbers will change. 2 OK, so we have a model, but it's not as good as the original GPT-2 small. Our loss on our validation set is roughly 3.94, while the original weights get about 3.50. Expressing that in terms of perplexity gives our own model about 51.4, while the original has 33.1. That's actually still higher than the 16 that they had in the paper, which is interesting -- presumably it's related to the fact that they're validating over their own WebText test set rather than ours; they're both samples of web content, but there must be differences. At this point, my guess is that this shows that all of that extra training that the OpenAI team did beyond the Chinchilla-optimal number of tokens did have a real benefit -- and that's not surprising. Remember that the Chinchilla paper is about the best way to spend a FLOPs budget. They're not saying that you can't drive down loss by continuing to train your model further -- of course you can. They're saying that when you pass the optimal number of tokens, you should increase the model parameters and the tokens by the same ratio, and by doing that you'll get the best balance. But a Chinchilla-optimal model of 163M parameters might still be useful. What happens if we instruction fine-tune it like we did the original model in Chapter 7 of the book?
In that post and its followup , we used some training samples using the "Alpaca" one-shot question-answering format: ...to get a model that we then provided a test set of questions in the same format, then used the Llama 3 7B model to judge the results on a scale of 0 to 100. We then averaged the results and got a plausible-looking indicator of how useful the model was, as compared to the more narrowly technical loss number. One problem with that is that we ran those tests on the OpenAI weights for the medium-sized 355M-parameter GPT-2 model. If we don't want to be comparing apples to oranges, we'll need to re-run it on their weights for the small model. Let's see how we do. First, let's run it for five epochs just to see when/if it starts overfitting: OK, so two epochs looks like the right amount, just as it was with the medium model. So we can train for that (because I'm using the original code I wrote when working through the chapter, I didn't checkpoint during training -- but it takes less than a minute to run the whole thing, so no biggie). Here's the loss chart: Validation loss at the end is 0.733, noticeably above the 0.649 that I got with the medium-sized model. And the sample outputs shown at the end aren't as good, either. With the medium-sized model, I got these: ...but with the small model (remember, this is with OpenAI's original weights) I get this: Definitely worse, especially the last one! Let's see what Llama 3 thinks of it, again using the code from the book: The medium model got an average of 50, so the OpenAI small model is definitely much worse, as the examples suggested. Makes sense. Let's see how our own base model performs when fine-tuned on the same data. After a bit of fiddling I found that validation loss settled down at the end of epoch 10: (It's hard to see from the chart, but validation loss was actually very slowly dropping even after epoch 5.) 
It's interesting that our own model took longer to train here, but it does make sense in terms of it being that little bit dumber. The samples it printed out at the end are also interesting: The simile is pretty good, I think better than the OpenAI original weights' one, but the storm clouds one is dreadful. It's fascinating that they both chose the same wrong answer for "Pride and Prejudice" -- my guess is that it's because the training set contained this question: ...so both models picked up on Robert Frost being a useful author to reference in answers. Anyway, what does Llama 3 think of the output? Yup, it's dumber than the original weights -- but, at least to my mind, closer to the original weights' score than you might have thought based on that loss/perplexity number alone. But, on the other hand, I'm not convinced that Llama 3 7B is smart enough to be doing a good job. In the stuff the eval script printed out, we have this: This is clearly completely wrong, the mention of cumulonimbus is coming from the dataset response, not the model response. Llama 3 7B is tripping up over what came from where, which is pretty normal for a small model. Of course, it's possible that the scores for the OpenAI GPT-2 small weights also have been given a higher rating than they deserve -- or, indeed, that there were right answers that were incorrectly judged wrong. Conceivably it averages out. But there's no reason to assume it would, so it's essentially noise and is making the results less useful. Let's try using a much smarter LLM as a judge and run both of the models responses through it -- the just-released OpenAI GPT-5.1 model. The code is here . Running that against our own model's answers: ...and against the model fine-tuned from the small OpenAI weights: ...and, of course, it didn't make the mistake of confusing the dataset response with the model's in any of the cases printed out. 
ChatGPT 5.1 in the chat interface is very smart, I expect these results are much closer to a reasonable ground truth. Out of interest, what does it make of the model based on the GPT-2 medium weights that we train as part of the book? That's as compared to an average of about 50 from Llama 3 7B. It seems like GPT 5.1 is a tougher judge than the small local model -- and my guess is that that is because it's more accurate. 3 Anyway, the ranking remains the same; after fine-tuning on the same Alpaca dataset, GPT-2 medium > GPT-2 small > our model. But it's still a relatively close-run thing between our model and GPT-2 small. Can we close the gap without vast amounts of extra training? The results so far were from using 3.2B tokens of the FineWeb 10B corpus. Now, as I noted at the start of this post, Andrej Karpathy's nanochat project uses FineWeb-Edu, a separate corpus designed to be really informative. Indeed, back at the start when we were looking at the two datasets, the first row in the Edu dataset was about Jane Austen, so maybe we would wind up with a model that at least got that question right! That's going to take another two days to train, but that's no big deal. We first need to change our script that generates the train/validation splits to regenerate them using the Edu dataset; we'll move the old ones to one side, though -- it will be interesting to see what loss we get on the non-edu validation data with the new model. (Note to self: work out some way to split out different datasets and training runs for future experiments like this. The setup I had in my recent post on RNNs worked quite well. Throughout the remainder of this post I'm juggling directories of checkpoints and datasets, and I'm sure I got it right, but it was an error-prone process.) That being done, it's time to move the checkpoints we already have to one side, and to kick off the train! 
Here's what we have after two days on that -- oops, I forgot to add the code to average training loss across all of the batches, so again it's a bit spiky. But we got to a final eval loss of about 3.693 this time. Of course, that's on its own validation set, so it's not comparable with the numbers from before; loss is specific to a particular dataset. Let's see what it makes of the original run's validation set. Juggle some directories around (my messy file structure means that there is just one "datasets" directory and one "checkpoints" one, so I'm moving them around to make sure I'm using the right combination): We get 4.16! That's truly terrible, worse than both the original base model that we trained on FineWeb's non-edu dataset, and than the OpenAI GPT-2 small weights. Let's see what we get from the closer-to-real-world instruction fine-tuning test. Five epochs turns out to be best: I won't bother running it past Llama 3 7B, as that's proven unhelpful, so we'll go straight to GPT-5.1. Gosh! So it's judged slightly worse than our weights based on FineWeb. That does surprise me a bit. I was definitely expecting the Edu version of the dataset to give us a better model. So: OpenAI medium > OpenAI small > our FineWeb base model > our FineWeb-Edu base model. That last pairing does surprise me a bit. Handwaving wildly, perhaps the more "regular" nature of the Edu dataset meant that the model saw less variation in its training set, and that actually made it learn less? I think there's one more experiment I want to do before bringing this ( very lengthy) post to a close. We've shown that Chinchilla-optimal training of models produces worse results than OpenAI's original, we think longer, train. What would happen if we continued training for another two days? As I have it easily to hand, I want to use the FineWeb-Edu model for this. I want to start with the best checkpoint (which happens to be the last one), and train it on another 3.2B tokens from FineWeb-Edu. 
Let's see what we get. Getting a dataset is going to be a bit messy, as our existing script to generate the safetensors datasets just grabs tokens from the original dataset until it gets 534,200 batches of 6 sequences, each of 1024 tokens (3,282,124,800 total). Might as well hack it (and note that this is something worth improving for any later experiments). I'll just loop round the code to do that twice, throwing away the first set of 3.2B tokens. I was pretty sure that the ordering of the datasets I'm getting is fixed, but perhaps not -- it spent time regenerating the train/val split at the start of the script, so there's no guarantee we have different data this time. That feels like a note-to-self about data pipeline hygiene -- if the train/val split is randomised by the infra I'm using, I should persist the raw data in case I need to use more data than I thought I would. Still, for this experiment, we can play relatively fast and loose. After all, GPT-2 small -- the original OpenAI weights -- was trained on multiple epochs, so it saw tokens multiple times. What we're trying to see here is what happens if you train for longer; a more scientific experiment can happen later (if at all...). Anyway, we have 3.2B tokens that should at least be reasonably different from the original 3.2B. Right, let's clean up some disk space so that we have enough for the new train (deleted some old optimiser checkpoints, keeping the metadata and the weights). Now, we create a new checkpoints directory, and we can copy the last/best checkpoint from the original FineWeb-Edu train there. Hack the in there to zero, create and symlinks, and then we can "restart" from that checkpoint. Due to the way the restart-from-checkpoint code works in the training script, that means that it will start with an offset of 1 into the dataset, so we're dropping one of about 530,000 iterations, but that's not exactly the end of the world.
There are some interesting spikes on validation loss in there -- in particular that one at around iteration 300,000 where it goes up from 3.6 or so to 7.5 for two validation periods (which, remember, happen every ~30 minutes, or every 7020 iterations). My guess is that we got some kind of gradient spike prior to those, which led to a bad update to the parameters. However, it looks like the loss recovered really quickly after it, so while gradient clipping (that is, limiting the size of the gradients so that one-off spikes don't cause massive updates) might have prevented them, I don't think it would have improved matters much -- we might have "lost" an hour or so of training, but out of a 44-hour train (48 hours including breaks for validation), it's not the end of the world. But, looking at the raw numbers, after our second two days of training on a fresh sample from FineWeb-Edu 10B, we've managed to get the loss on our validation set down from 3.693 to... drumroll... 3.661. And that's on the "best" measurement, which was an hour before the end. The last validation number was 3.663. By spending twice the time, we've managed to get our loss down by 0.032, which is a touch less than 1%. Even measured in terms of perplexity (which, being an exponential, is more sensitive to this kind of change), we've gone from 40.2 to 38.9, which is hardly show-stopping. Let's see how this one measures up against the non-edu FineWeb validation dataset that we originally used to calibrate our first training run. Run it, and: ...we get 4.13 -- that's opposed to 4.16 on the last model, trained on half as much data. Well, maybe it's a much better base model for instruction fine-tuning? Let's give that a go, again with the Alpaca training set from the book. 8 epochs turns out to be the right number: Certainly better than the 15.18 that we got on our Chinchilla-optimal FineWeb-Edu model, and a bit better than the 16.14 we got on the Chinchilla-optimal FineWeb one.
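The perplexity figures quoted above are just the exponential of the cross-entropy loss, so they are easy to check. This is a minimal sketch of that arithmetic, assuming the losses are in nats (natural-log cross entropy); the function name is made up for illustration and is not from the training code.

```python
import math

def perplexity(loss_nats: float) -> float:
    """Perplexity is exp(loss) when the loss is cross entropy measured in nats."""
    return math.exp(loss_nats)

# Validation losses quoted in the text, before and after the extra two days.
loss_before = 3.693
loss_after = 3.661

ppl_before = perplexity(loss_before)
ppl_after = perplexity(loss_after)

# Relative improvement in raw loss versus in perplexity.
loss_drop_pct = 100 * (loss_before - loss_after) / loss_before
ppl_drop_pct = 100 * (ppl_before - ppl_after) / ppl_before

print(f"perplexity: {ppl_before:.1f} -> {ppl_after:.1f}")
print(f"loss fell by {loss_drop_pct:.2f}%, perplexity by {ppl_drop_pct:.2f}%")
```

The relative drop in perplexity comes out larger than the relative drop in loss, which is the "more sensitive" effect mentioned above.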
So by training for double the time on twice the data, we've definitely got a better model. It's just not that much better. I think that's more -- significantly more -- than enough experimentation for one blog post, so let's do some analysis. I want to sanity-check the number of FLOPs spent on this train, just to make sure that I hadn't messed up. Feel free to skip this if you want to jump straight to the conclusion :-) In appendix F, the Chinchilla paper mentions a common approximation for how many FLOPs, $C$, you spend training a model with $N$ parameters over $D$ tokens: $C \approx 6ND$. So based on that, each of those training runs cost us (using the exact numbers for $N$ and $D$) this many FLOPs: about $3.19 \times 10^{18}$. They also give a more carefully-worked out calculation; it doesn't look all that difficult -- it's just a case of plugging in the numbers from our architecture and pulling out a result 4 -- but the numbers they get from that are generally within 10% of the simpler calculations, so we may as well stick with the above. 5 Now, in terms of how many FLOPs we actually spent... well, manufacturers' datasheets for hardware are based on carefully-selected benchmarks and won't really be comparable to the code we were running (especially given that it's my crappy code based on top of a huge stack of PyTorch, CUDA kernels, CUDA itself, and so on), but we can do a Fermi estimate . From Wikipedia, the RTX 3090 has 35.58 TFLOPS performance on FP32. Way back earlier in this post, when I was measuring how many tokens per second I could get locally, the first experiment capped out at 12,599 tokens/second with FP32. GPU usage was showing as 100%, so let's say (again, this is very approximate) that we were getting about 35.58 TFLOPS and that enabled 12,599 tokens/second. We wound up training at about 19,921 tokens/second after adding in mixed precision and using the tensor cores.
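The sums in this sanity check are easy to reproduce. What follows is a rough sketch rather than the calculation actually used for the post: `N_PARAMS` and `N_TOKENS` are rounded stand-ins (163M parameters, and 20 tokens per parameter as in the Chinchilla heuristic), and the function names are invented for illustration. The throughput step simply assumes that FLOPS scale linearly with tokens per second, which is the hand-wave the Fermi estimate relies on.

```python
# Rounded stand-ins, NOT the exact figures from the training runs.
N_PARAMS = 163_000_000      # assumed: "163M parameters"
N_TOKENS = 20 * N_PARAMS    # assumed: Chinchilla heuristic, 20 tokens per parameter

def approx_training_flops(n_params: int, n_tokens: int) -> int:
    """Common approximation from the Chinchilla paper: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

def scaled_flops_per_second(datasheet_flops: float,
                            baseline_tokens_per_s: float,
                            actual_tokens_per_s: float) -> float:
    """Assume FLOPS scale linearly with measured tokens/second."""
    return datasheet_flops * actual_tokens_per_s / baseline_tokens_per_s

estimated_flops = approx_training_flops(N_PARAMS, N_TOKENS)

# 35.58 TFLOPS datasheet FP32 figure, 12,599 tokens/s at FP32,
# 19,921 tokens/s after mixed precision and tensor cores.
effective_flops_per_s = scaled_flops_per_second(35.58e12, 12_599, 19_921)

# 44 hours of pure training time (validation breaks excluded).
spent_flops = 44 * 3600 * effective_flops_per_s

print(f"6ND estimate:         {estimated_flops:.3e} FLOPs")
print(f"effective throughput: {effective_flops_per_s / 1e12:.2f} TFLOPS")
print(f"Fermi estimate spent: {spent_flops:.3e} FLOPs")
print(f"ratio spent/estimate: {spent_flops / estimated_flops:.2f}")
```

The two totals land within the same order of magnitude, which is all a Fermi estimate is meant to establish.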
So, hand-wavingly we can say that we were getting $35.58 \times \frac{19{,}921}{12{,}599} \approx 56.27$ TFLOPS. Now, we trained for 44 hours (48 including validation), so the total number of training FLOPs should have been the number of seconds in that times the total FLOPS 6 of $56.27 \times 10^{12}$: $44 \times 3600 \times 56.27 \times 10^{12} \approx 8.91 \times 10^{18}$. That's pleasingly close to the $3.19 \times 10^{18}$ above! I can easily imagine that the stack we're using could somewhat-more-than-halve performance from the theoretically optimal, or that we're running at 50% of the GPU's theoretical capacity, or some combination of the two. We're in the same order of magnitude, and for a Fermi approximation, that's what matters. Now, looking at figure 3 in the Chinchilla paper, their IsoFLOP curves (each one showing the loss they got on their training set for models of a particular size, using the same number of FLOPs for each curve), we can see that on the top one, which is for training runs of $6 \times 10^{18}$ FLOPs, the lowest point is pretty much bang-on the 168M point on the X axis. So that is at least reassuring that we did do a proper Chinchilla-optimal train here. (Their loss on that chart is showing 3, but they're using a different dataset, so I don't think it's comparable.) Apart from the obvious answer of "skill issue", let's see if there are any obvious reasons why the base model I've trained (and retrained) in this post is worse than the original OpenAI GPT-2 small. Let's review the results first: The first row is not super-interesting, it's the second and third that matter. OpenAI is clearly winning by quite some margin! Earlier on I assumed that the difference was that they trained on more data, but let's be a bit more systematic here. What specific differences do we have to the original train? Again, the amount of data in the paper is frustratingly limited, but: Right at the start, I estimated that the WebText dataset they trained on was about 10B tokens. We've trained on 3.2B tokens for two of our models, and 6.4B tokens for the extended train one. That could well have an effect.
There's more information in their larger dataset, both in terms of raw facts like "Jane Austen wrote Pride and Prejudice", and in terms of information about the structure of language. On the other hand, their dataset is, as they say, comprised of the contents of web pages that were linked from Reddit posts with more than three upvotes. FineWeb (and even more FineWeb-Edu) is a much more curated dataset, so you would expect it has more facts, and better structure -- less of the slop and junk that Andrej Karpathy talked about in his interview with Dwarkesh Patel. So I'm not sure that this is it, but it's worth keeping in mind. Again, we don't know how many epochs they trained on, but the report I linked to right at the start of this post estimated that they trained for 60, while I calculated based on their numbers that it would be 41 epochs with WebText. It certainly makes sense that grinding along, epoch after epoch, will get your loss down, at least on the training set! And there's also a phenomenon with certain kinds of neural networks where if you keep training past the point where you're overfitting (that is, validation loss starts rising while training loss continues to fall), suddenly the model can have an "aha" moment and start generalising again . 8 It's not quite comparable, because it was not a second epoch, but rather continued training with more data, but we were able to eke out an extra reduction of 0.032 in loss by training our FineWeb-Edu model for twice as long. If we'd trained it for 40 times as long, then we presumably would have managed to grind it down even further. I have no idea how much further we could get it, but I'd guess that it's going to be worse than linear (that is, each extra two days gets you less loss reduction than the previous) -- so we can bound the loss reduction at a maximum of $39 \times 0.032 = 1.248$. So... maybe? It would be a dull experiment to run, though, taking 78 days.
If I want to do that, it would be better to find a way to do it quickly, so that I can get a better feedback loop going. The reason this post has taken so long has in part been because each training run has taken so long (as well as trips to London and other life stuff). The original GPT-2 model from OpenAI had bias on the $W_q$, $W_k$ and $W_v$ projections -- that is, they were normal NN biased linear layers rather than simple matrices, so they did a projection into their respective spaces followed by a translation. In the book, Raschka says that this is not normally done these days, which is why I didn't do it for this base model train. But perhaps it actually is valuable with this architecture or size? Modern models presumably differ in multiple ways, and perhaps the bias would have been useful for this old design. Likewise, weight-tying -- the original GPT-2 re-used its embedding matrix to do the final projection from embedding space to vocab space, rather than having a separate one. That seems intuitively clever but not necessarily "right", given that it gives the model less flexibility in what it can output from the last layer. But perhaps with this size and architecture, it's the right thing to do? Contrariwise, having made those two changes to GPT-2 because I believed that modern models don't work that way, there was one "modern" change that I didn't make. In his post on the architectural changes since GPT-2, Raschka mentioned that dropout is normally not used nowadays. This looked to me like it was due to the move to single-epoch training. But single-epoch training was exactly what we were doing in this post! Perhaps I was holding myself back by keeping dropout in place. I don't have a good intuition as to what the right level is for this at the moment.
My code blindly uses the optimiser setup from the book: I have at best a vague understanding of how those work, at least when using an optimiser (LR for simple gradient descent isn't too hard to understand, although it's hard to work out an intuition for what the right value might be in any given case). Additionally, in the Chinchilla paper, they talk about using a cosine function to vary the learning rate, which is something I'm completely unfamiliar with. I gained about a day in training time by using AMP and the TF32 tensor cores; however, I lost precision. I don't know for sure, but I suspect that the original weights were trained with pure full-fat FP32. Perhaps reducing precision lost something? I know that modern models are often trained with lower precisions, but perhaps that's balanced out by something else? This is the one that I think is least likely, but it's worth mentioning. The post that I linked to estimating the size of the training run for GPT-2 small mentioned that they used a batch size of 512, which (of course) is completely impossible on consumer hardware like mine. Indeed, I think you'd be lucky to get 512 onto a single 8-GPU node -- we're talking serious cluster training scale here. Larger batches give less noisy gradient estimates, and so more stable updates to the parameters. So maybe that helped for OpenAI when they did their train? I suspect it did, but I'm pretty much certain that it's not a large part of the difference. (Counterpoint: Gemini thinks that this might actually be a big part of the problem! It recommends using gradient accumulation -- that is, not stepping the optimiser every iteration, but instead giving gradients time to build up -- as a way of getting a larger effective batch size.) While it doesn't look like we had any issues with these on the original FineWeb and FineWeb-Edu trains, they definitely did kick in on the extended Edu train. The code to clip them is easy enough, and I think it's likely that the original GPT-2 trains would have had it.
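To make "clipping" concrete, here is a minimal, dependency-free sketch of clipping by global norm: work out the L2 norm across all of the gradients, and if it exceeds a threshold, scale every gradient down by the same factor. It is an illustration only, not the code from the training script; the function name and the `max_norm` value of 1.0 are assumptions. In a real PyTorch loop you would more likely call `torch.nn.utils.clip_grad_norm_` between the backward pass and the optimiser step, which does essentially this (with a small epsilon added for numerical safety).

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their combined L2 norm is at most max_norm.

    grads is a list of flat lists of floats, one per parameter tensor.
    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm <= max_norm:
        # Small gradients are left exactly as they were.
        return grads, total_norm
    scale = max_norm / total_norm
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# A one-off "spike": two parameter tensors with unusually large gradients.
spiky_grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(spiky_grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))

print(f"norm before clipping: {norm_before}")
print(f"norm after clipping:  {norm_after:.6f}")
```

Because every gradient is scaled by the same factor, the direction of the update is preserved; only its size is capped.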
I doubt this was a major part of the difference, but it probably would have helped, at least a bit. Anyway, I think that's it in terms of differences that I can see between my train and OpenAI's (as always, comments welcome -- let me know if you spot any others!), so it's time to (finally) wrap this post up. At the start of this (ridiculously long) post, I asked the question: can we train a GPT-2 style base model at home on a single RTX 3090. The answer is a resounding "yes we can", which is great! Training base models: not just for the GPU-rich. If you have a couple of days and a decent graphics card, you can train a Chinchilla-optimal GPT-2 pretty easily. But the model itself isn't quite as good as the original GPT-2 small one, and I have some ideas about why that might be. Testing any of those would take quite a long time, given that each training run takes two days. Now, my next planned step was to see whether I could work out how to move this up to the cloud and train the same model on an 8x A100 or similar machine on Lambda Labs. This still sounds like an excellent plan! With his project, Karpathy trains a larger model on more tokens in four hours; if we could get the experiment time down to one hour (plausible if training time is linear in both tokens and parameters) then it would be much easier to check out those hypotheses above. 9 So, I think that's still the right way to go: after training a base model at home for free (if you ignore the electricity costs -- and it's cold enough in Lisbon right now that the heat from the PC was probably saving me money on my home heating bill -- and the cost of having bought the RTX 3090 in the first place), the next step is to see how cheaply we can train it in the cloud. Stay tuned :-) It's useful here, but it does make me wonder how good FineWeb would be for training a base model with a longer context length, however.  
↩ There are ways to get comparable numbers even with a different tokeniser, using a bits-per-byte or nats-per-byte measure. Let's say we're using the normal cross entropy loss with the natural logarithm; that means that loss is expressed in nats. So you add up all of the per-token losses and divide the total by the number of bytes across all of the inputs you've seen, and that would give you nats-per-byte. Likewise, if you used $\log_2$ for cross entropy, you'd get bits-per-byte. The latter is used in the Chinchilla paper (eg. table A5) as a way to compare their model with the Gopher model. I did consider digging into this a bit, but I think it's a bit of a side quest for now.  ↩ Those evals cost me $0.09 in API credits, which is actually a little more than I was expecting -- there were some responses which took quite a while to come back, though, and I believe that the GPT 5.1 model spends time thinking when it seems appropriate, so perhaps I spent a bit on thinking tokens.  ↩ Apart from a reference to a "dense layer", which I'm unsure about -- I believe it's the linear feed-forward layer after the attention calculations, though, as that doesn't appear elsewhere, and the calculation looks right. I also noticed that they don't have any terms in there for things like normalisation, which seems odd for such a carefully-worked-out formula; I assume they are small enough to vanish into the noise.  ↩ If you want a more careful calculation of the numbers -- and indeed a really nice explanation of some of the details of the Chinchilla paper, I recommend this blog post from Tomek Korbak .  ↩ I hate that we appear to have settled on FLOPs with a lower-case "s" for "floating-point operations" when "FLOPS" (and equivalently MFLOPS, GFLOPS, TFLOPS) with an upper-case "S" already meant "floating-point operations per second" because the difference in capitalisation should really not change the units. But here we are.  
↩ I estimated the OpenAI weights loss on their own dataset by taking the perplexity number for the small model from figure 4, which is about 16.5, and then taking its natural log.  ↩ The authors of the paper call it "grokking", which is a great name, but is so overloaded in the context of LLMs (even if you disregard xAI's Grok ) that I'm slightly loath to use it here. This phenomenon also looks somewhat more limited in scope than I thought -- I'd been under the impression that it happens a lot with LLMs, but it looks like it's more a thing that happens with small models trained on very structured datasets.  ↩ It would also be interesting to see how easy it is to offload the optimiser to the CPU: in my old fine-tuning experiments I found that freed up a ton of VRAM, so we could benefit from that and maybe get the batch size up to something closer to the 512 that OpenAI apparently trained with.  ↩ . This is determined by the tokenizer, and I want to use the GPT-2 one, so it will need to be . . GPT-2 has a 1,024-token context length, so I'll stick with that. , , --- these define which of the different GPT-2 model classes we're training, and I want to stick to the smallest one, so they will be , and respectively . One of the most surprising things to me in the "architectural improvements" post linked above was that dropout is no longer used so much. However, this appears to be tied in to the one-epoch training that has taken off since GPT-2, so I think it would be best to stick to here. . From what Raschka says in the book, this doesn't add on much value, even though the original GPT-2 used it, so let's set it to . Crop all of the input sequences -- that is, each row in the dataset -- so that each one is no more than our 1,024 sequence length. Then we can pad them out with end-of-sequence tokens (as is the standard) so that they're all 1,024. This will lose us quite a lot of tokens, but has the big benefit of being easy. 
Treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. Doing it this way would mean we'd use all of our training data. But it would be more complicated, especially if we hit memory constraints. We load enough GPT-2 tokens from FineWeb for batches of sequences each, every one of those sequences being long (plus one extra token for the targets we're comparing them to). Note that we're not bothering to separate them with anything for this test. We then loop over batch sizes from to . Then we create our model and put it on the CUDA device. We do this for each batch size rather than creating one and then using it for all of them so that they're all starting from the same point -- the should make sure that they're identical. For each batch size, we create input and output batches as tensors -- note that we're not putting these on CUDA yet, I wanted to do that in the training loop to mirror what a real training loop will have to do. When we're training with 3.2B tokens then having them all on CUDA will be a waste of VRAM, so we'll be pushing a batch there for each iteration. We do a stripped-down training loop -- for each batch, put the inputs and outputs onto CUDA, then a forward pass, work out the loss, backward pass, and optimiser step. We do the same iterations per batch size. Finally, we print out the number of tokens we trained on for this batch size, how long it took, and the number of tokens per second. Chinchilla heuristic, 20x parameters -- 3.2B tokens: 247,850 seconds, which is just less than three days Estimated GPT-2 train, 419B tokens: 32,452,947 seconds, which is just over a year. Create a model, optimiser and scaler. Train the model for a bit. Work out the loss. Save a checkpoint. Create a new model, optimiser, and scaler, and then restore the checkpoint into them. Work out the loss Train for a bit more to check that the optimiser and scaler still work. 
On our own validation set from FineWeb, we have OpenAI > our FineWeb train > our FineWeb-Edu extended train > our FineWeb-Edu train. On the answers judged by GPT-5.1 after instruction fine-tuning, we have OpenAI > our FineWeb-Edu extended train > our FineWeb train > our FineWeb-Edu train.

Giles's blog 5 months ago

Why smart instruction-following makes prompt injection easier

Back when I first started looking into LLMs , I noticed that I could use what I've since called the transcript hack to get LLMs to work as chatbots without specific fine-tuning. It's occurred to me that this partly explains why protection against prompt injection is so hard in practice. The transcript hack involved presenting chat text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at a base LLM: ...you would instead prepare it with an introductory paragraph, like this: That means that "simple" next-token prediction has something meaningful to work with -- a context window that is something that a sufficiently smart LLM could potentially continue in a sensible fashion without needing to be trained. That worked really well with the OpenAI API, specifically with their model -- but didn't with their earlier models. It does appear to work with modern base models (I tried Qwen/Qwen3-0.6B-Base here ). My conclusion was that had had some kind of instruction tuning (the OpenAI docs at the time said that it was good at "consistent instruction-following"), and that perhaps while the Qwen model might not have been specifically trained that way, it had been trained on so much data that it was able to generalise and learned to follow instructions anyway. The point in this case, though, is that this ability to generalise from either explicit or implicit instruction fine-tuning can actually be a problem as well as a benefit. Back in March 2023 I experimented with a simple prompt injection for ChatGPT 3.5 and 4. Firstly, I'd say: It would, of course, accept the challenge and tell me that it was thinking of a number. 
I would then send it, as one message, the following text: Both models told me that yes, I'd won -- the only way I can see to make sense of this is that they generalised from their expected chat formats and accepted the fake "transcript" that I sent in my message as part of the real transcript of our conversation. Somewhat to my amazement, this exact text still works with both the current ChatGPT-5 (as of 12 November 2025): ...and with Claude, as of the same date: This is a simple example of a prompt injection attack; it smuggles a fake transcript in to the context via the user message. I think that the problem is actually the power and the helpfulness of the models we have. They're trained to be smart, so they find it easy to generalise from whatever chat template they've been trained with to the ad-hoc ones I used both in the transcript hack and in the guessing game. And they're designed to be helpful, so they're happy to go with the flow of the conversation they've seen. It doesn't matter if you use clever stuff -- special tokens to mean "start of user message" and "end of user message" is a popular one these days -- because the model is clever enough to recognise differently-formatted stuff. Of course, this is a trivial example -- even back in the ChatGPT 3.5 days, when I tried to use the same trick to get it to give me terrible legal advice , the "safety" aspects of its training cut in and it shut me down pretty quickly. So that's reassuring. But it does go some way towards explaining why, however much work the labs put into preventing it, someone always seems to find some way to make the models say things that they should not.

Giles's blog 5 months ago

Writing an LLM from scratch, part 27 -- what's left, and what's next?

On 22 December 2024, I wrote : Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting. More than ten months and 26 blog posts later, I've reached the end of the main body of the book -- there's just the appendices to go. Even allowing for the hedging, my optimism was adorable. I don't want to put anyone else off the book by saying that, though! I expect most people will get through it much faster. I made a deliberate decision at the start to write up everything I learned as I worked through it, and that, I think, has helped me solidify things in my mind much better than I would have done if I'd only been reading it and doing the exercises. But on the other hand, writing things up does take a lot of time, much more than the actual learning does. It's worth it for me, but probably isn't for everyone. So, what next? I've finished the main body of the book, and built up a decent backlog as I did so. What do I need to do before I can treat my "LLM from scratch" journey as done? And what other ideas have come up while I worked through it that might be good bases for future, similar series? There are a few sources of ideas for this -- from the book itself and its supplementary material, from notes I've made as I went along, and from other things that I've kept on a mental checklist. There are five appendices: Raschka also gives a link at the end of chapter 7 to a notebook showing how to do further fine tuning using Direct Preference Optimization , which also looks fascinating, and he's working on a new project, " Build a reasoning model (from scratch) ". While working through the book, I've deliberately deferred various things. 
I'd kind of lost track of all of them, so I gave ChatGPT the source markdown for all of the posts in this series, and asked it to find where I'd done that. It did an amazing job! There were three categories: long context and attention efficiency, maths, and optimisers. The model we've built in the book has a context length of 1,024 tokens, and is O(n^2) in both space and time with respect to the number of tokens you feed it. There are lots of things that people do to work around that. Things I need to learn: I really want to understand softmax at a better level than "it's a magic thing that turns logits into probabilities". I'd also like to learn more about higher-order tensor operations -- the ones that we use in the book are essentially treating the extra dimensions as the batch, but I believe that there's more to it than that. I really want to understand in reasonable depth what optimisers do. I know that they make gradient updates work better than they do with simple gradient descent. But how? That was the set of things I noted at the time I wrote the posts so far, but there are a few more that come to mind as I write this. In some comments that he made on posts in this series, Simon said that it seems like this book isn't really "from scratch", given that we rely on PyTorch's magic to handle the backward pass. He's 100% right! I think I understand why it is that way, though. There would be two different ways that I can see for the book to do it: I think I'd definitely like to revisit that at some point. Another one from Simon: while the book does explain how tokenisers work, even down to a high-level overview of byte-pair encoding, we don't write our own. Again, I can see why this is -- we load in the GPT-2 weights, so we need to use that model's tokeniser. And there's no point in writing our own if we're just going to throw it away. But perhaps a bit of time playing with one would be useful? 
The book, quite reasonably, shows you how to train your LLM, does a basic train on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. But given that I was getting a pretty good training speed on my own hardware, perhaps I could train a model really from scratch, perhaps using one of the smaller FineWeb datasets? Even if I can't do it locally, perhaps it might be doable on a rented cloud machine, like the Lambda Labs ones I used when fine-tuning Llama 3 ? After all, Andrej Karpathy is training a full model that you can chat with for $100 . I don't think I ever mentioned this on the blog, but one important plan for me is to try to build an LLM from scratch, only using my own blog posts and what I remember -- no looking at the book. If I can do that, then I can be reasonably sure that I really have learned it all. I'm also thinking that I'll do that using a different library -- that is, not PyTorch. That would stop me from regurgitating code that I've learned. If you're reading this within a day or so of the post's publication, I'm running a poll on X/Twitter about which framework to use . If you have an opinion, please do stop by and vote :-) It feels like almost every new model these days is an MoE. I have read a lot around the subject and would love to build on it. Essentially, instead of having just one feed-forward network after your attention heads, you have several. In front of them you have a router -- a trainable network of some kind -- that tells you which of these "expert" FFNs the token should be forwarded to. You then send it to the top (or top k ) experts, while leaving the others inactive. The result is that you have more space (in terms of parameters) for the LLM to know about things, but not all of those parameters are active during inference -- so your model is smarter but still fast. 
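The routing idea described above can be sketched in a few lines of PyTorch. This is a minimal toy under stated assumptions, not how any particular production model does it: the class name TinyMoE, the linear router, the GELU experts and the simple loop over experts are all choices made for illustration, and real implementations add things like load-balancing losses and batched expert dispatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts layer (hypothetical, for illustration only)."""

    def __init__(self, emb_dim, hidden_dim, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        # The router is a trainable layer that scores every expert for each token.
        self.router = nn.Linear(emb_dim, num_experts)
        # Each expert is an ordinary feed-forward network.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(emb_dim, hidden_dim),
                    nn.GELU(),
                    nn.Linear(hidden_dim, emb_dim),
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x):
        # x: (batch, seq_len, emb_dim). Flatten so each token is routed on its own.
        batch, seq_len, emb_dim = x.shape
        tokens = x.reshape(-1, emb_dim)

        router_logits = self.router(tokens)                    # (num_tokens, num_experts)
        top_logits, top_idx = router_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)                # (num_tokens, top_k)

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Which tokens picked this expert, and in which of their top-k slots?
            token_pos, slot = (top_idx == expert_id).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue  # this expert stays inactive for the whole batch
            expert_out = expert(tokens[token_pos])
            out.index_add_(0, token_pos, weights[token_pos, slot].unsqueeze(-1) * expert_out)
        return out.reshape(batch, seq_len, emb_dim)


torch.manual_seed(0)
moe = TinyMoE(emb_dim=8, hidden_dim=16, num_experts=4, top_k=2)
x = torch.randn(2, 5, 8)
y = moe(x)
```

Only top_k of the experts run for any given token, which is where the "more parameters, same per-token compute" trade-off comes from. Flattening the batch and sequence dimensions is one simple way of handling many tokens at once; it is not the only one.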
There's a bunch of interesting stuff there, from how you build it in the first place, to how you handle the fact that you're processing lots of tokens at once -- multiple tokens in each sequence and multiple sequences in a batch. It would be a pretty cool follow-on to the "my own LLM" series, thinking about it. I definitely don't think I need to do all of those things in order to wrap up this series. Here's the subset I'm planning on doing: For the other things, I think there are some potential future series to write. I'm certainly not promising that I'll write up all (or even any) of that second list, but they all seem really tempting to me right now. If you're particularly interested in seeing my take on any of them, please do leave a comment below. I think the next post in this series -- maybe the next several posts -- will be on trying to train the model code provided in the book from scratch to produce my own base model. Stay tuned! Here's a link to the next post in this series . A: An introduction to PyTorch B: References and further reading C: Exercise solutions D: Adding bells and whistles to the training loop E: Parameter-efficient fine-tuning with LoRA The KV cache . This is basic stuff and I feel I sorta-kinda understand it, but I haven't written about it so I can't be sure. It's a pretty obvious enhancement to avoid repeating work when generating autoregressively -- that is, the normal setup where in order to generate n tokens, we give the model its input, sample our first token from its predictions, then feed the whole thing -- the input and that first token -- back in for the second token, and so on. Obviously, because attention is causal, we're doing exactly the same work every time for all of the tokens in each round apart from the last one, so it makes sense to cache things. The result is that generating the first token is still O(n^2), but subsequent ones will be something more like O(n) each. 
That's why real-world modern models tend to take a while pondering before they generate the first token but then speed up -- they need to fill their cache. FlashAttention and related things: there are lots of ways people have found to reduce the cost of attention generally, but this seems to be the most popular one, or at least the best to get started with. Better positional embeddings : the context length of our GPT-2-style LLM is fixed in part because you need position embeddings for every possible input position. That means that we can never extend it. More modern LLMs use better ways to represent positions -- Rotary Position Embeddings (RoPE) look like they're very popular. Manually code a backward pass to go with the forward pass on each of our modules. Simon did this, and was kind enough to share his code with me -- it looks like one of those things (like attention) that is pretty hard to get your head around initially, but once it clicks it's super-clear. Definitely kudos to him for getting it all to work! The problem with this is that I don't think any ML practitioners do this nowadays, because automatic differentiation is there in every popular framework. So it might be a good learning experience, but also might nudge people into an unprofitable direction. Create our own automatic differentiation system. Andrej Karpathy pops up again when looking into this; he created micrograd , which handles back-propagation for scalar functions. That's really clever -- but it would be hard, and a bit of a side quest from the point of the book. Also, the most interesting stuff (at least from what little I know) for automatic differentiation is how you do it with non-scalars -- the matrices and higher-order tensors that our LLM uses. From what Simon says, this is where you need to use the mysterious Jacobian matrices I've heard about in the context of back-propagation. Training the full GPT-2 base model myself. I'm 100% going to try this. 
From the appendices -- anything that surprises me from the one on PyTorch, and perhaps from the "bells and whistles" in the training loop. The others I either won't do, or will pick up later. Building my own LLM from scratch in a different framework, without using the book. That is, I think, essential, and perhaps would be the crowning post of this series. It would be a nice way to end it, wouldn't it? Improving context length -- RoPE and other tricks -- sounds like an excellent series to start on when I'm done with this. AIs tell me that other interesting things to look into would be ALiBi, NTK/YaRN scaling, and positional interpolation. Improving performance: the KV cache, FlashAttention, and other performance enhancements likewise feel like they could make a good series. I also want to do a separate series on LoRA. In that, I'll draw on appendix E from this book, but also on other tutorials. Likewise DPO, along with other post-training that can be done to make models more useful as chatbots, like Reinforcement Learning. I'd really like to spend some time understanding that area. (And Raschka's upcoming reasoning model book might fit into that category too.) Optimisers: Adam, AdamW, maybe Muon (though the latter scares me a bit). The maths -- softmax and higher-order tensor calculations -- also seems to belong in another series, perhaps an extension of the various "maths for AI" posts I've done in the past. Automatic differentiation and the backward pass; that would make a great series. A mixture-of-experts model would be excellent fun, I think. Tokenisers would be a great stand-alone post, at least at the level that I can see myself covering it. Perhaps that would develop into a series if I found myself getting sucked in.
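As a rough illustration of the KV cache item in the list above, here is a minimal single-head sketch in PyTorch. The helper causal_attention, the random projection matrices and the toy sizes are all made up for the example, and it leaves out batching, multiple heads and everything else a real model has. It only shows that caching keys and values gives the same attention outputs as recomputing everything, while each new step only projects one token.

```python
import torch


def causal_attention(q, k, v):
    # q: (n_q, d), k and v: (n_k, d). The queries are the last n_q positions.
    n_q, n_k = q.shape[0], k.shape[0]
    scores = q @ k.T / k.shape[-1] ** 0.5
    # A query at absolute position p may only attend to keys at positions <= p.
    q_pos = torch.arange(n_k - n_q, n_k).unsqueeze(1)
    k_pos = torch.arange(n_k).unsqueeze(0)
    scores = scores.masked_fill(k_pos > q_pos, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
d = 16
# Hypothetical query/key/value projections, scaled to keep values small.
W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
x = torch.randn(6, d)  # stand-in embeddings for a 6-token sequence

# No cache: project and attend over the whole sequence in one go.
full = causal_attention(x @ W_q, x @ W_k, x @ W_v)

# With a cache: each step projects only the newest token and appends its
# key and value to the cache, so earlier tokens are never re-projected.
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
outputs = []
for t in range(x.shape[0]):
    x_t = x[t : t + 1]
    k_cache = torch.cat([k_cache, x_t @ W_k])
    v_cache = torch.cat([v_cache, x_t @ W_v])
    outputs.append(causal_attention(x_t @ W_q, k_cache, v_cache))
cached = torch.cat(outputs)
```

In the cached loop each step compares one query against the keys seen so far, which is the "something more like O(n) each" behaviour described above.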

Giles's blog 5 months ago

Writing an LLM from scratch, part 26 -- evaluating the fine-tuned model

This post is on the second half of chapter 7 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In the last post I covered the part of the chapter that covers instruction fine-tuning; this time round, we evaluate our model -- particularly interestingly, we try using another, smarter, model to judge how good its responses are. Once again, Raschka's explanation in this section is very clear, and there's not that much that was conceptually new to me, so I don't have that many notes -- in fact, this post is probably the shortest one in my series so far! Unusually, when at the start of section 7.7 we generate some sample responses for the instructions in our test set, I got exactly the same results as in the book. For once, I guess, everything that uses randomness was happening in the same order as it did when Raschka ran it on his machine. The next step was to generate a file with all of the responses to all of the test instructions, which took 18.9 seconds on my RTX 3090 (compared to a minute on an A100, per the book -- that's quite surprising!) Once that was done, it was time to install Ollama so that I could use the Llama 3 model to evaluate my own. I've never used Ollama before -- when playing with other people's models, I've always used Hugging Face's Transformers library. It's a neat package, though. It wraps llama.cpp, which is a pure C/C++ inference framework (with CUDA support), and makes it easy to download and run models that have been packaged for it. Being written in C/C++, I would imagine that it's faster than PyTorch/Transformers -- though, being inference-only, it's less useful if you're planning to do things like training or fine-tuning the models. My desktop is running a fairly customised install of Arch Linux, and I didn't want to use the default install procedure (which puts it into your system-wide and directories). But it turns out that it's a very well-packaged app, and you don't need to do that. 
Using the manual install instructions for Linux , I just created a new directory , and then cd'ed there and downloaded it: It was about 1.75 GiB. I then untarred it: ...and then I could run commands with full paths, for example: ...to start up the server, or ...to start a session. Neat! It's always good to see pre-built binary packages that have no issues with their install location. The next step was to throw all of the generated test responses (and their associated targets) at Llama 3 and see what it thought about how close they were. Again, this all worked without trouble. I noted that the responses I was getting from Llama 3 were not the same as the ones in the book -- Raschka notes that Ollama is non-deterministic, so there's no surprise there (though it does make me wonder why it accepts a seed parameter in the API call). When I got on to the final eval, where you run the test results through Llama 3 and ask it to rate them compared to the target outputs, it took 11 seconds to run, and I got an average score of 48.95 / 100, which is close enough to the 50.32 that appears in the book. 1 I'd run an eval on my model, using a smarter model to judge its responses! Somewhat surprisingly, that number was stable over multiple runs. So perhaps there is some level of determinism in Ollama now that wasn't present when the book was written, and the seed (eg. ) is of value. Or perhaps Raschka's comment about it being non-deterministic was more of a "between machines" thing rather than for multiple runs on the same machine -- though then I'm not sure why he suggests re-running it for multiple results. Anyway -- that was it! Eval done. And, to my amazement, that was the end of the chapter -- and almost the end of the book. We've built an LLM from scratch, fine-tuned it, and evaluated it by using a smarter model to judge how well it was following instructions. ...or at least the end of the beginning. 
Having run the evaluation, I've reached the end of the main part of " Build a Large Language Model (from Scratch) ". But I don't think I've reached the end of this project; there's still more to do (not least working through the appendices). So, coming up next: a post summarising what I've got through so far in this series, and what the next steps are to wrap it up. Here's a link to the next post in this series . I also got 110 out of 110 scores -- that is, every response from Llama 3 was parseable as an integer. That actually kind of surprised me! Models like to be chatty and helpful. But looking into it, the famous X post by Riley Goodside where he had to "threaten" Bard to stop it from saying "Sure, no problem! Here's your JSON" was almost two years ago.  ↩

Giles's blog 5 months ago

Writing an LLM from scratch, part 25 -- instruction fine-tuning

This post is on the first part of chapter 7 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", which covers instruction fine-tuning. In my last post , I went through a technique which I'd found could sometimes make it possible to turn non-fine-tuned models into reasonable chatbots; perhaps unsurprisingly, the GPT-2 model isn't powerful enough to work that way. So, with that proven, it was time to do the work :-) This post covers the first half of the chapter, where we actually do the fine-tuning; I'll post later about the second part, where we start evaluating the model that we get. Just as with the last chapter , what we're doing here is essentially plugging together the various things we've built so far, and Raschka's explanation is very clear, so I don't have that much in the way of notes -- but here are the bits that made me pause. For this part of the book, we use the "medium" variant of the GPT-2 open weights rather than the "small" ones that we've been using so far. I have to assume that this is because the small one really isn't very good at this kind of thing. [Update: it isn't! See the metrics from my mammoth post on training an LLM completely from scratch .] This was quite interesting. In the past, all of the templates I've seen for instruction following have been designed for chatbots -- that's what we tend to use LLMs for, after all. There's a system prompt and then a format for "message from user", and another for "message from bot". In my series on fine-tuning , where I learned how to fine-tune an 8B-parameter Llama 3 base model to work as a chatbot, I used the format for Llama 2 , which is not dissimilar to the Phi3 one that's given as an example in the book. 
The Alpaca -style one is quite different; it is designed for more of a one-shot interaction than it is for chat: Now, Alpaca dates from early 2023, and it looks like they used that prompt following a paper " Self-Instruct: Aligning Language Models with Self-Generated Instructions ". I had to think a bit about why one would use that, and I think the core is that this was early days (all of two years ago!) and LLMs had very short context lengths and weren't very smart. Chat uses a lot of tokens! You need the system prompt, and then every conversational turn so far. With our GPT-2 model we have just 1024 tokens to play with -- and Alpaca wasn't much better, as it was built as a fine-tune of Meta's original Llama model , which (according to the model card ) had a context length of 4096 tokens. Chat is a good way to interact with a model, as the multiple conversational turns allow you to build up large amounts of context for the model to play with, meaning that (hopefully) it will be able to give good answers. But if that context doesn't fit into the context length, then it's not so good. Early chatbots, I believe, worked around this by replacing the "transcript" with a summary, but there's only so much you can fit into a 4k-token one. 1 Maybe modern ones do this too, but with GPT-5 having a 400,000-token context window it's not so important. So, in Alpaca times, people were thinking in terms of one-shot interactions with LLMs, and the pattern they chose was targeted at that, so that you could get all of the interesting information and a reply into one sequence. An interesting bit of history! (Again, two years ago is history. Cripes.) This was explained well in the book, but it's an interesting enough point that I thought it was worth going over. Last time around we had a bunch of text messages as our inputs to the model. 
We found the longest one, and then padded them all out to the same length with end-of-sequence tokens, which meant that we could construct batches -- naturally, every input in a batch has to be the same size. This time around we're being a bit smarter. Although every item in a given batch needs to be the same length, batches themselves can be of different lengths -- that is, if our batch size was 8, and the longest sequence in our first batch was 73 tokens long, then we would make our first batch 8 × 73 -- but then, if the longest sequence in our second batch was only 60 tokens long, then the second batch could be 8 × 60 . We only need to pad out sequences to match the longest sequence in their batch, and that saves us time when running the model. That got me thinking about inference at scale -- the kind of thing that LLM providers like OpenAI or Anthropic do. They're going to be receiving very large numbers of sequences to complete, and of course they are going to be running them through in batches. But padding tokens are kind of a waste of inference GPU cycles. They'll have a bunch of different instances of their models running on different machines to handle all of these requests, and they almost certainly have some kind of code to try to route sequences of similar length to the same instances. To take a toy example, if you had a batch size of two and received six sequences, with lengths of 2, 9, 100, 11, 3 and 120, then you'd want to route them so that one instance received the (2, 3) pair, another the (9, 11), and another (100, 120) -- that minimises the amount of padding required and saves wasted cycles. Following on from that, it looks like we could actually improve the book's code by doing something similar here, grouping similarly-sized inputs together. That would be quite complicated, though, so probably not worth it in an educational context like this. 
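The toy routing example above is easy to check with a few lines of plain Python. This is just a sketch to make the arithmetic concrete: the helper names padding_needed, batch_in_order and batch_by_length are made up for this illustration, and sorting by length is only the simplest possible grouping strategy, not a claim about what any inference provider actually does.

```python
def padding_needed(batches):
    # Every sequence is padded up to the longest sequence in its own batch.
    return sum(max(batch) - length for batch in batches for length in batch)


def batch_in_order(lengths, batch_size):
    # Group sequences into batches in the order they arrived.
    return [lengths[i : i + batch_size] for i in range(0, len(lengths), batch_size)]


def batch_by_length(lengths, batch_size):
    # Sort first, so that similarly-sized sequences end up in the same batch.
    return batch_in_order(sorted(lengths), batch_size)


lengths = [2, 9, 100, 11, 3, 120]

arrival = batch_in_order(lengths, batch_size=2)    # [[2, 9], [100, 11], [3, 120]]
bucketed = batch_by_length(lengths, batch_size=2)  # [[2, 3], [9, 11], [100, 120]]

print(padding_needed(arrival))   # 213 padding tokens
print(padding_needed(bucketed))  # 23 padding tokens
```

For training rather than inference, sorting the whole dataset like this would also remove the randomness in batch composition, which is part of why doing it properly gets complicated.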
Anyway, our collator needs to handle the variable-length batches, and through various drafts we converge on one that does it, with one tweak. This was a really important and interesting bit. Let's say that we're feeding in a 20-token input, in a batch where the longest of the other sequences is 30 tokens long. That means that we have ten padding tokens at the end. Let's represent that input sequence like this: The numbers are our token IDs, and I've used to represent the end-of-sequence token that we use for padding. Now, we need our target sequence to predict for that. The first version that we come up with in the book looks like this: So, we've just done the normal trick of shifting left by one token, and we've added an extra end-of-sequence token at the end to make the lengths match. But as the next step, we replace all of the padding tokens, apart from the one right at the end of the "real" part of the sequence, with an invalid token ID, -100. Using to represent that, we have: The core thing to remember here is that we honestly don't care what the model generates after it's done the real, unpadded sequence, plus an end-of-sequence token. It could generate random junk and it wouldn't matter, because it's already done the important part of predicting next tokens for all of the input sequence. The -100 is a magic number that PyTorch's cross_entropy function uses in target sequences to say "ignore this position". I must admit that as a software engineer, it gives me a bit of an "ick" -- magic numbers are never nice -- but it does make sense. Negative numbers are invalid targets when you're comparing predictions across tokens -- which have indexes from zero up. In general, if you're predicting categories -- which essentially we are with tokens -- then the "minus one'th" token doesn't make sense. 
You could use any other negative number, but -1 might cause confusion (being used heavily in ML code to get the last element of a sequence) and if you're going to use any other negative number it might as well be -100. "Purer" solutions would be hard, anyway. We're working with a PyTorch tensor here, so it has to be a number -- which rules out using something like or some kind of special object. You could keep an "ignore after this index" number, but you'd need as many of them as you have items in the batch and it would be just another thing to keep track of. You could even keep a tensor of boolean "ignore these tokens" of the same size as your batch -- a mask -- but that would have the same problem of being something to pass around in your code. As I understand it, those last two solutions are actually used in some systems -- imagine that your outputs were not logits to create a probability distribution across categories or tokens, but were meaningful numbers in and of themselves. Pretty much any number you picked might be a valid output from the model. You wouldn't be using cross entropy loss in those cases anyway, of course, but you'd need to keep some record of where the padding starts so that you can ignore it. One final thing that is worth noting is that we only add the -100s on to the targets. This makes sense, as all of the inputs will be fed into the LLM, so things that aren't valid tokens are going to make the embedding layer very unhappy. That also explains why we firstly add them on to the sequence as regular padding and then convert them to -100 for the targets: it allows us to add on the padding, then get the input sequence as all but the last token, then get the targets as tokens 1 up to the end. After that's done we run the code to replace all but the first end-of-sequence padding tokens with -100 on the targets. 
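The pad, shift, then mask sequence described above can be sketched as follows. This is not the book's collate function, just a minimal single-sequence illustration: the function name make_inputs_and_targets is hypothetical, 50256 is GPT-2's end-of-text token ID, and the sketch assumes the real tokens never contain that ID themselves. The last few lines check that cross_entropy really does skip the -100 positions, which is its default ignore_index.

```python
import torch
import torch.nn.functional as F

PAD_ID = 50256       # GPT-2's <|endoftext|> token, used here for padding
IGNORE_INDEX = -100  # the default ignore_index of F.cross_entropy


def make_inputs_and_targets(token_ids, padded_len, pad_id=PAD_ID):
    # Add one end-of-sequence token, then pad out to padded_len + 1 so that
    # both the inputs and the shifted targets end up with padded_len items.
    padded = token_ids + [pad_id] * (padded_len + 1 - len(token_ids))
    inputs = torch.tensor(padded[:-1])   # everything but the last token
    targets = torch.tensor(padded[1:])   # shifted left by one token

    # Keep the first padding token in the targets (the model should learn to
    # emit end-of-sequence), and mark the rest as "ignore this position".
    pad_positions = (targets == pad_id).nonzero().squeeze(-1)
    if pad_positions.numel() > 1:
        targets[pad_positions[1:]] = IGNORE_INDEX
    return inputs, targets


inputs, targets = make_inputs_and_targets([1, 2, 3], padded_len=6)
# inputs:  [1, 2, 3, 50256, 50256, 50256]
# targets: [2, 3, 50256, -100, -100, -100]

# Stand-in logits: one row per position, one column per vocabulary entry.
torch.manual_seed(0)
logits = torch.randn(6, 50257)

loss_with_ignored = F.cross_entropy(logits, targets)
loss_real_positions_only = F.cross_entropy(logits[:3], targets[:3])
```

The two losses agree, because with the default mean reduction cross_entropy averages only over the positions that are not ignored. Note that the -100 values only ever appear in the targets, never in the inputs that go through the embedding layer.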
As with the last chapter, I got different results to the ones in the book; something different about the order of execution in my version of the code when compared to Raschka's meant that despite all of the careful use of , the numbers didn't quite match up. But, again as before, they were close and the trends in -- for example -- loss were the same, so the right things were happening. When I finally ran the train on my RTX 3090, it took 48 seconds; I watched it in and saw that it was using 9GiB VRAM. Due to the usual differences in randomness, I got slightly different results to the book -- but similar enough to not be any cause for concern: Also, due to a typo, I accidentally ran it with five epochs -- that took two minutes. I noticed that validation loss started rising fairly steadily after epoch 2, with train loss dropping -- clearly overfitting. Presumably Raschka chose two epochs for exactly that reason :-) A couple of things that I noticed while working through the code; when I first ran the download script, I got . That's because of a typo in the import at the start -- instead of ...it should be: The other thing that tripped me up was the original . We add on a padding token, then pad out the sequence with more padding tokens, then remove the last one. I found that confusing -- why not just add on the required number in the first place rather than adding on an extra one and then deleting it? It became clear later on; it's to make it mirror the next function, which adds on an extra end-of-sequence token for our targets, but having this anticipatory code in there with no explanation in the first draft made me start doubting my sanity for a little while... Minor points, though. So, that was it for the first half of chapter 7 in the book. The next bit looks like fun -- we're going to use a smart model to evaluate our relatively dumb one on how well it follows instructions. Definitely looking forward to that :-) Here's a link to the next post in this series . 
I experimented with ChatGPT 3.5 at around the time Alpaca came out and came to the conclusion that it had a similar context length, of about 4k tokens. It looked like it worked around it by, when the transcript started reaching the context length, spinning off a separate instance to summarise it into a "story so far" kind of thing, which was then injected in to the start of the chat instead of the full context. My experiment was to say "my favourite colour is green, please remember that", then to send a quote of about 4,000 words from "Moby Dick", prefacing that with either "this is unimportant, please ignore" or "this is important, please remember". Next, I'd ask what my favourite colour was again. If I told it that the quote was unimportant, then it would remember, but if I told it that it was important, it would think my favourite colour was blue. Asking it for transcripts of the conversation so far would give a reasonable one, skipping the quote, if the quote was tagged as unimportant, but would give a completely hallucinated one if the quote was tagged important.  ↩
