Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32d -- Interventions: adding attention bias

I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices.

In the code from the book, we initialise the weights W_q, W_k and W_v as linear layers rather than simple matrices of weights, and have a parameter to say whether or not we should add bias to those. In all of our trains so far we've set that to False.

Why do we have this parameter, and where did it come from?

In Raschka's book, the use of nn.Linear for these weights is introduced in section 3.4.2 with the wording:

We can improve the implementation further by utilizing PyTorch's nn.Linear layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.

So, it's presented essentially as a way of getting better weights for our untrained model, which makes good sense in and of itself -- but, if that's the only reason, why don't we just hard-wire it to have bias=False? That would be the sensible thing to do if the initialisation were the only reason, but clearly there's more to it than that.

Section 4.1 has a bit more information:

qkv_bias determines whether to include a bias vector in the Linear layers of the multi-head attention ... We will initially disable this, following the norms of modern LLMs, but we will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model.

That looks like a typo, as the real explanation is in chapter 5, section 5 (page 164 in my copy), where we do indeed load the OpenAI weights:

OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key and value matrix computations.
Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary.

So, that all makes sense so far. QKV bias was part of the original GPT-2 models, perhaps just because it was standard at the time, inherited from something else, or perhaps for some other reason -- I can't find any reference to it in the actual paper. But people have found it doesn't help, so no-one uses it these days.

But... might an LLM of this specific size, or one otherwise similar to the GPT-2 small model that we're training, benefit from having bias? That's what this experiment is for :-)

One thing that occurred to me while setting this up is that we have been training on a Chinchilla-optimal number of tokens, 20x the number of parameters. Without QKV bias, we have 163,009,536 parameters, so we've been training on 3,260,190,720 tokens, rounded up to the nearest batch size, which is 3,260,252,160 in our current setup for these experiments (per-GPU micro-batches of 12, with 8 GPUs, so a total batch size of 96).

These extra bias terms will be parameters, though! We're essentially making our model larger by adding them, which changes the Chinchilla calculation. How much?

OK, that's essentially nothing -- 27,648 extra total parameters on top of 163 million. I make it less than two hundredths of a percentage point larger! The correct number of tokens goes up to 3,260,743,680, so if we wanted to be very pedantic, we're under-training. But I feel like training on a larger dataset is worse in terms of comparability between the baseline and our "intervened-on" model with QKV bias.

So: we'll train a model with QKV bias on 3,260,252,160 tokens, accepting that it's a tiny bit less than Chinchilla-optimal. Let's see how it goes!

Here's the config file for this train. Running it gives this training chart:

Pretty standard, though the loss spikes look less prominent than they have been in the other trains.
Might QKV bias actually help with model stability in some way...?

The train finished with these stats:

Timing-wise, pretty much indistinguishable from the baseline train's 12,243.523 seconds. The final train loss looks a tad better, but we can't rely on that -- the test set loss is the important one.

So it was time to download it, upload it to Hugging Face Hub, and then on to the evals. Firstly, our normal sequence-completion smoke test:

Not bad at all, borderline coherent! Next, the loss on the test set:

Well, crap! Now that's a surprise. Let's look at that in the context of the other interventions to see how surprising that is, given Raschka's comments (which were undoubtedly backed up by serious research):

So, adding QKV bias actually improved our test set loss by more than gradient clipping did! The loss spikes in the training chart look smaller than in the other trains [1], so, speculating wildly, perhaps with a model of this size, the bias stabilises things somehow? Or perhaps what we're seeing is the model become that tiny bit smarter because it has some extra parameters -- albeit less than 0.02 percent more?

I'm not going to spend time investigating things now, but this is a really interesting result. One extra thing that does occur to me is that the direction research has taken since GPT-2 has definitely been in the direction of larger models. The attention weight matrices are sized d_emb × d_emb, so excluding bias they have d_emb^2 weights each. Bias adds on another d_emb. So, as a model scales up, the attention-related non-bias weights will scale quadratically -- doubling d_emb will quadruple their number -- while the bias weights will scale linearly, merely doubling.

So perhaps it's just that the effect -- whatever causes it -- gets rapidly swamped as you scale out of toy-model territory. That, at least, seems pretty plausible.
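The parameter and token arithmetic in this post can be checked with a few lines of Python. This is a minimal sketch, not the training code: it assumes the standard GPT-2 small shape (12 layers, d_emb = 768) and 1,024-token sequences, neither of which is stated in this post, and all function names are made up for illustration. The parameter count and batch size are the ones quoted above.

```python
# Back-of-envelope arithmetic for adding QKV bias to GPT-2 small.
# Assumptions (not stated in the post): 12 layers, d_emb = 768, 1,024-token sequences.

N_LAYERS = 12
D_EMB = 768
SEQ_LEN = 1024              # assumed context length
GLOBAL_BATCH = 96           # 8 GPUs x micro-batches of 12, from the post
BASE_PARAMS = 163_009_536   # parameter count without QKV bias, from the post


def qkv_bias_params(n_layers: int, d_emb: int) -> int:
    """One bias vector of length d_emb for each of W_q, W_k and W_v, per layer."""
    return n_layers * 3 * d_emb


def qkv_weight_params(n_layers: int, d_emb: int) -> int:
    """Three d_emb x d_emb weight matrices per layer, excluding bias."""
    return n_layers * 3 * d_emb * d_emb


def chinchilla_tokens(n_params: int) -> int:
    """The 20-tokens-per-parameter rule of thumb used in the post."""
    return 20 * n_params


def round_up_to_batches(tokens: int, seq_len: int, batch: int) -> int:
    """Round a token budget up to a whole number of global batches."""
    per_step = seq_len * batch
    return -(-tokens // per_step) * per_step  # integer ceiling division


extra = qkv_bias_params(N_LAYERS, D_EMB)
print("extra bias parameters:", extra)
print("relative increase (%):", 100 * extra / BASE_PARAMS)
print("tokens without bias:", round_up_to_batches(
    chinchilla_tokens(BASE_PARAMS), SEQ_LEN, GLOBAL_BATCH))
print("tokens with bias:", chinchilla_tokens(BASE_PARAMS + extra))

# Doubling d_emb quadruples the weights but only doubles the biases,
# so the bias share of the attention parameters halves each time.
for d in (768, 1536, 3072):
    share = qkv_bias_params(N_LAYERS, d) / qkv_weight_params(N_LAYERS, d)
    print(f"d_emb={d}: bias/weight ratio = {share:.6f}")
```

Under those assumptions the numbers reproduce the ones in the post, and the ratio in the final loop is simply 1/d_emb, which is the "swamped as you scale" effect in miniature.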
One final note to self, though: these improvements are small enough that I do find myself wondering whether or not it might be some kind of noise, despite the random seeds I'm setting. I think that at the end of this, before I do a final train, it would be worth doing another baseline train and measuring the test set loss again, and doing another comparison. If it comes out exactly the same -- and I can bump up the number of significant figures in the output, it's just a formatting parameter -- then I don't need to worry. But if they vary to some degree, perhaps I'll need to update my mental model of what level of finding is significant, and what isn't.

I think it goes without saying that QKV bias definitely goes onto the list of interventions we want to add when training our best-possible GPT-2 small-scale model, assuming that the random seed test goes well. That surprises me a bit; I was expecting it to have negligible impact! That, of course, is why it's worth doing these tests.

Next up, I think, is trying to understand how we can tweak the learning rate, and its associated parameters like weight decay. This will need a bit of a deep dive, so you can expect the next post late next week, or perhaps even later. I'm sure you can't wait ;-)

[1] Note to self: is there some way I could quantitatively measure those?

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32c -- Interventions: removing dropout

This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.678. Not much, but it's something! This time, I wanted to see what happened if we trained without dropout. Would removing it make the test loss worse, or better?

In a blog post last summer about architectural advances in LLMs since GPT-2, Sebastian Raschka wrote:

Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting.

That makes quite a lot of sense. My own understanding of dropout was that it was a bit broader than just preventing overfitting -- it seemed to me to be similar to the mandatory vacation policies that financial firms use to prevent over-dependence on individuals. My instinct was that having knowledge distributed across different weights in the model was good in and of itself, even beyond its benefit on multiple-epoch training.
But it is quite a high price to pay. With the training parameters we've been using we're literally discarding 10% of our calculations' results -- attention weights, feed-forward neuron activations, and so on -- as we do the forward pass. It's easy to see why it would harm training.

Let's give it a go. The nice thing about this one is that, unlike the gradient clipping experiment, I didn't have to write any new code. The dropout level was already controlled by a config setting, so by setting that to zero for this run, I could just kick it off and let it do its thing while I worked on something else:

Here's what the training run chart looked like (please disregard the stuff about grad norms in the title and the axis -- I'll remove that for the next train):

As you can see, we still have loss spikes, including one just after global step 20,000 that lasts for several checkpoint periods of 617 steps. I imagine gradient clipping might have helped with that, but I'm very deliberately testing each intervention in isolation.

At the end of the training run, we got this:

So, interestingly, it took 967 seconds -- about 16 minutes -- less time than the gradient clipping run, and about 15 minutes less than the baseline train. So while gradient clipping added on a small amount of time (or maybe that was just noise), dropping dropout certainly seems to speed things up! I guess there's quite a lot of work involved in generating and applying the random masks that drop things out as we're doing the forward pass.

Anyway, with the model trained, it was time to download it, upload it to Hugging Face Hub, and run the evals. Firstly, the smoke test, where it just needs to continue a prompt sequence; it came up with something reasonably coherent:

...but it was on the loss on the test set that it was most impressive:

That's a bigger improvement on the baseline train's 3.692 than gradient clipping: 0.051, which is more than three times the improvement!
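As an aside on the "discarding 10% of our calculations' results" point made earlier in this post, here is a minimal pure-Python sketch of what dropout does to a list of activations. It is an illustration only, not the PyTorch implementation used in the training code; the function name and the use of a seeded `random.Random` are assumptions made for the example. Like PyTorch's dropout in training mode, it scales the surviving values by 1/(1-p) so the expected value is unchanged.

```python
import random


def dropout(values, p, rng):
    """Zero each value with probability p and scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if p == 0.0:
        # Dropout disabled: nothing to mask, nothing to rescale.
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)          # seeded so the example is repeatable
activations = [1.0] * 10_000

dropped = dropout(activations, 0.1, rng)
print("fraction zeroed:", dropped.count(0.0) / len(dropped))

# With p = 0 the activations pass through untouched, and no random
# mask has to be generated at all.
assert dropout(activations, 0.0, rng) == activations
```

The p = 0 branch also hints at why the dropout-free run was faster: there is no mask to generate or apply on each forward pass.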
Let's start keeping a table of these: Now, of course, we don't know how these different interventions combine together -- it would be naive to think that if we did both gradient clipping and dropout removal, we'd get a total loss reduction of 0.014 + 0.051 -- but, especially with that long-lived loss spike in our training run -- it does feel like they might play well together. So, that's dropout covered. Which one next? I think a nice easy one that I should be able to get done on a Friday will be adding bias to the attention weight calculations. Let's give that a go and see if it makes things worse or better! Stay tuned...
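The running comparison mentioned above can be kept as a small script. This is a sketch using only the figures quoted in this post: the baseline and gradient-clipping test losses are stated directly, while the dropout-removal loss is derived here from the quoted 0.051 improvement rather than copied from an eval output. The names are illustrative.

```python
BASELINE_TEST_LOSS = 3.692  # baseline test-set loss, from the post

# Test-set loss per intervention. The dropout figure is derived from the
# quoted improvement of 0.051, not read from an eval log.
interventions = {
    "gradient clipping": 3.678,
    "removing dropout": round(BASELINE_TEST_LOSS - 0.051, 3),
}


def improvement(loss, baseline=BASELINE_TEST_LOSS):
    """Reduction in test loss relative to the baseline (positive is better)."""
    return round(baseline - loss, 3)


print(f"{'intervention':<20} {'test loss':>9} {'improvement':>12}")
for name, loss in interventions.items():
    print(f"{name:<20} {loss:>9.3f} {improvement(loss):>+12.3f}")
```

As the post says, the improvements from separate runs should not be assumed to add up when interventions are combined.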

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32b -- Interventions: gradient clipping

I'm still working on training the best GPT-2 small sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the minimal training run in the book, just modified to run using DDP on an 8x A100 40 GiB/GPU machine in the cloud. There are a bunch of "interventions" I want to try to see if they'll make it better, as measured by the loss they get on a test set. I'll do a post for each intervention, and this is the first: gradient clipping. In the training chart for the baseline model, you can see that there are three places where the loss suddenly spiked up, at around global steps 4,200, 13,000, and 23,000: There are a number of things that could cause loss spikes like that: Exploding gradients are common in RNNs, and also happen in LLMs like this one. I spent a bit of time reading around to find out how they happen, and the ah-ha moment came when I came across this post from Wanshun Wong . Not only is the post itself a good intro in terms of how it affects RNNs, but in the "further reading" at the end, there's some gold: Chapter 10.11 of [1] has a good overview of how gradient clipping works. Now, I bought a copy of " Deep Learning " at the same time as I bought Raschka's book, but I'd only glanced through it. Now was the time to get it down from the shelf -- and, indeed, section 10.11.1 is all about clipping to handle exploding gradients. I'll put the explanation of how they happen into my own words, to see if I can clarify things (at least in my mind). Normally, when we learn about gradient descent, it's illustrated with nice smooth loss charts like this imaginary one for a single-parameter model: We're told that we might start at point A. 
The gradient is quite high and negative, so we multiply it by our learning rate and subtract it from our parameter. That gets us to point B. This time around, the gradient is smaller as the curve is flatter there, so when we do the same -- multiply by LR and subtract -- we take a smaller step, and wind up at C. Rinse and repeat and we'll wind up near the minimum.

The problem is, what if the loss curve actually looks like this:

We start at A, with a small gradient, move a little to the right, and now we're at B halfway down a cliff! The gradient is massive, and when we subtract it, even scaled by the learning rate, we can zoom off somewhere to the right -- maybe not even on the chart. Indeed, you can imagine a cliff that is so steep that it would have vertical portions -- negative infinite gradients in this case -- and no matter what your learning rate is, you'll wind up with an infinite parameter update and everything will break. It's hard to see how a model can continue training in a case like that.

Now, what can cause steep cliffs like that? The book says "strongly nonlinear functions, such as those computed by a recurrent neural net over many time steps".

If you know about RNNs (I wrote about them if you'd like a summary), you'll remember that a single RNN might be quite shallow -- maybe three or four layers -- but when you're doing backpropagation, you run a number of inputs through, one after the other, work out the overall loss, and then "unroll" it to something similar to a "vanilla" neural net to do the backward pass. To put that in concrete terms, a 3-layer neural network trained with a 100-element sequence would unroll to a 300-layer deep network. Every one of those layers has several operations, including (in the implementation I was looking at in my post above), a tanh. It's not surprising that there are cliffs in the loss landscape -- it's more surprising that there are any smooth bits!
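The cliff scenario can be reproduced numerically with a toy one-parameter model. This is a minimal sketch with a made-up gradient function (gentle slope everywhere except a narrow, very steep band); the numbers are chosen purely for illustration and have nothing to do with the real training run.

```python
def grad(x: float) -> float:
    """Slope of an imaginary one-parameter loss curve with a cliff."""
    if 1.0 <= x < 1.1:
        return -1000.0   # the cliff: a very steep downward slope
    return -0.1          # gentle downward slope everywhere else


def descend(x: float, lr: float, steps: int) -> list[float]:
    """Plain gradient descent, recording the parameter after every step."""
    path = [x]
    for _ in range(steps):
        x = x - lr * grad(x)   # multiply by the learning rate and subtract
        path.append(x)
    return path


# Start at "A" just before the cliff. The first step is tiny and lands us
# on the cliff at "B"; the second step is lr * 1000 = 100 units wide.
path = descend(0.995, lr=0.1, steps=2)
print(path)
```

The first update moves the parameter by about 0.01; the second, taken from the cliff face, moves it by 100, far beyond anything the gentle part of the curve would have suggested.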
Now in LLMs, we don't have that unrolling through time -- but our network is deep enough as it is. For the GPT-2 small model, disregarding the embeddings and the final output head, we have 12 Transformer layers, each of which is multiple matrix multiplications for attention, then a softmax, then another layer, and then a feed-forward... mapping precisely to the equivalent vanilla NN is hard, but I think you can treat each one as at least four layers, so we've got 48. And there are GELUs and logs and exps [1] dotted around, so again -- we should expect cliffs.

So if sometimes we'll get crazy gradients, what can we do about them? We clip them.

Clipping gradients simply means that if they get larger than a particular number -- v, which we define -- we reduce them to that number. In other words, we have a cap on how big they can get.

"Deep Learning" ("DL" from now on) suggests two ways to do it. Remember that while in the example above, we only had one parameter -- on the X axis -- for the GPT-2 small LLM we're training, we have 163 million of them. So the gradients, instead of being one number, will be a 163M-long vector, one per parameter. The two ways to clip are: element-wise, capping each element of the gradient vector at v individually; and by norm, scaling the whole vector down so that its norm is v whenever the norm exceeds v.

The second feels more elegant -- we're scaling all of the elements of the gradient vector by the same amount, so it still points in the same direction. Interestingly, though, DL says that the two methods "work similarly", which I'll read as "are pretty much the same in practice".

DL then goes on to say how infinite or not-a-number gradients should be handled. With the first way, clearly doing it naively would set every infinite element in the gradient vector to v, which would make the total size (norm) of the update very large. With the second, it would be even worse -- we'd still wind up with completely junk gradients, because the norm would be infinite, and in Python infinity divided by infinity is NaN, so we'd be applying gradients with NaNs in them at best.
That would be likely to knock our model into unrecoverable territory, as any parameter that had that applied to it would be NaN forever. Their suggested solution is that if you get garbage gradients like that, you can take a random step -- that is, create a new gradient to apply that has the norm v but just points in a random direction. The idea is that this will move you away from the cliff-ridden part of the loss landscape where you've found yourself (more about that later), and things will continue nicely. So, anyway, how to do this in practice? PyTorch has a function, , and that's what's referenced in almost every bit of writing I've found about how to clip gradients. So I decided to use that, assuming it would do what was described in DL's second option and that it would do the random updates they suggest for non-finite gradients. (I was half-correct -- see later.) As to how to use it -- if we had a normal training loop, where we were just using a normal optimiser, we would go from: ...to something like ...where is the max value v from above. However, for our training code using Automatic Mixed Precision (AMP), it's a little more complicated -- but luckily, the AMP explainer we've been using has a section explaining what to do . Right now we have this: Per that explainer, we need to move to this: That looks a bit weird; we're "unscaling" the gradients, then clipping them, then using the scaler to step the optimiser. You'd think that you'd need to "re-scale" the scaler after clipping the gradients -- to get back to where you started from before the optimiser step. From the help page I gather it keeps track of whether or not the gradients it has right now are currently scaled and handles them appropriately based on that state in . Anyway, given that we know what the code looks like now, we need to implement it in a way that can be easily switched on for this experiment (and potentially in the future), but which also allows us to not use it if we don't want to. 
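Before wiring anything into the training loop, it may help to see the ideas from DL in isolation. The following is a minimal pure-Python sketch of element-wise clipping, norm clipping, and the random-step fallback for non-finite gradients. It is not what PyTorch does internally, and the function names are invented for this example; gradients are plain lists of floats.

```python
import math
import random


def l2_norm(grads):
    """Euclidean (L2) norm of a gradient vector."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_value(grads, v):
    """First method: cap every element individually to the range [-v, v]."""
    return [max(-v, min(v, g)) for g in grads]


def clip_by_norm(grads, v):
    """Second method: if the norm exceeds v, rescale the whole vector to norm v."""
    norm = l2_norm(grads)
    if norm <= v:
        return list(grads)
    return [g * v / norm for g in grads]


def random_step(n, v, rng):
    """DL's suggested fallback: a vector of norm v pointing in a random direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = l2_norm(direction)
    return [d * v / norm for d in direction]


def clip_by_norm_with_fallback(grads, v, rng):
    """Norm clipping, replacing non-finite gradients with a random step."""
    if not math.isfinite(l2_norm(grads)):
        return random_step(len(grads), v, rng)
    return clip_by_norm(grads, v)


print(clip_by_value([3.0, -4.0, 0.5], 1.0))   # each element capped separately
print(clip_by_norm([3.0, 4.0], 1.0))          # direction preserved, norm becomes 1

# Naive norm clipping of a gradient with an infinite component produces NaN,
# because inf * v / inf is NaN.
print(clip_by_norm([math.inf, 1.0], 1.0))
```

Element-wise clipping changes the direction of the update, whereas norm clipping only shortens it, which is the "more elegant" property noted above.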
The best way with our setup is to make it a training option, so we can do it this way: ...with extracted from the file where we call it in : ...and we can just pass in for it in our function that we use to find the maximum micro-batch size for our current hardware, as all we're testing for there is memory usage -- we don't care if we're doing good updates. Here's the code delta for that , plus a bugfix to allow for files without a in them. But it would also be useful to be able to track when it "fired" -- that is, when we had to clip our gradients. Then we can see two things: Now, the docs for say that it returns the "[t]otal norm of the parameter gradients (viewed as a single vector)". It doesn't say whether that's before or after the clipping, but given that the return value would always be if it was after, I'm going to guess that it returns the pre-clipping norm (ChatGPT agrees). So we can chart that; changes in these diffs: 1 , 2 , 3 , 4 . So we now have code to clip gradients to a given norm size and to chart the gradient norms so that we know what they were before clipping. The question is, what should that clipping norm be? Some googling around suggested that there was no standard way of saying "for such-and-such a kind of model, gradients should be clipped at around x ". For example, on this Reddit thread , says "Common values are 1, 3, 5, 8, 10", and likewise sample code in this tutorial . has 1, as does this one . So my initial thought was, let's just use 1. But then I wondered, what actually are the gradient norms that we're getting in normal training? I decided to run a local short train on 3m tokens (a thousandth of the full training set, taking just less than four minutes) with very frequent checkpointing, and gradient clipping set to 1, and see what happened. You can see that the "grad max" line is almost always above the "grad clip" -- we're almost always clipping. This doesn't sound right. 
It looked like the range of the grad max was generally between 1.1 and a little above 3, so I set the clipping norm to 3.5 and did another train:

Our loss is about the same, but we're no longer clipping -- and that's what we want; there was no evidence of exploding gradients for that short run -- just big updates near the start, as you'd expect.

I then ran the same with no gradient clipping at all, and got exactly the same shape for the loss chart as I did with gradient clipping at 3.5, and the same final loss -- that's a good signal that clipping is not affecting the train when we stay inside the limit, which is exactly what we want.

So, it was time to train our model! I kicked off the train, and after a little while, I looked at the training chart, which is updated dynamically as the model trains:

You can see the dotted green lines, both the light one and the dark one -- that is, the "grad max" and the "grad avg" -- disappear starting just before global step 4,000, only coming back at about 5,500 -- that is, these were not plotted for global steps 4,319 and 4,936, even though the loss was. What was going on?

I took a look at the checkpoint meta file for the first of those to see what the actual numbers were, and saw this:

Aha! The PyPlot code I was using could not handle infinite values, which is entirely reasonable. That was easy enough to fix, though -- I just replaced positive infinity by 1,000,000 and negative infinity by -1,000,000, and then (in the interest of getting a proper from-scratch run) kicked everything off from the beginning.

That training run completed with this chart:

That's a little hard to read, but if you look closely at the green lines, you can see that there are seven periods where gradients were either very large or infinite. Weirdly, though, out of the seven, two of them were two checkpoint periods long (that is, two periods of 617 global steps).
That felt weird, though of course we're looking at the maximum gradient norm and the average gradient norm -- so two single infinite/high-gradient steps in successive 617-step periods would lead to that effect. What was even stranger, though, was that if you look at the training chart for the run with no gradient clipping, we have only three loss spikes rather than seven: ...though it's also very noticeable that the gradient-clipped run had only two small loss spikes, unlike the three larger ones in the unclipped run. The training loss the gradient-clipped run reported at the end was better, too: ...versus 3.743 at the end of the baseline train. So it was time to download it, and run the sequence-completion smoke test: Coherent enough! Next, we evaluate it against our held-back test set: So, the loss had gone down -- but only from 3.743 to 3.678, a reduction of 0.065, or about 1.7%. That's not actually all that bad! After all, in my initial experiments on my local machine, training for a Chinchilla-optimal number of tokens from FineWeb-Edu (rather than the regular FineWeb I'm using now) got a loss of 4.167 on the same dataset (weirdly worse with the more-curated training set), and training for a further Chinchilla-optimal number of tokens only brought that down to 4.135, for a difference of 0.032, or 0.7%. It's not strictly comparable due to the different training sets, but speaking very loosely, we could say that gradient clipping for this train had more effect than doubling the training time for the other one. That's pretty nifty. But the question remained: why those long periods of high gradients, even with gradient clipping? And why were there still loss spikes -- in particular the one just before global step 12,000, which lasted for two checkpoint periods? Remember that when I started the first run of this train, and got the chart with the missing bits, it was because the logged and were infinite. 
What happens when clip_grad_norm_ gets an infinite gradient -- either one that has an infinity as one of its components, or one that (due to numerical overflow) winds up with a norm of infinity anyway? I'd been kind of assuming that it did what the authors described in "Deep Learning" -- a random update of norm v -- given that the book stated pretty confidently that you "can" do it but then appeared to consider the topic closed.

But it doesn't! If you check that link to the docs, you'll see that it has a parameter error_if_nonfinite, which is False by default. If it's set to True, that will raise an exception if the norm is positive or negative infinity, or if it's not a number -- which catches both the infinite component and the norm overflow cases above. But if it's not set -- and we weren't setting it -- and the norm or the gradients are non-finite, then clip_grad_norm_ will essentially return garbage gradients. Depending on the exact cause, elements will either be infinities of one sign or another, or NaNs. And if these are added to parameters, then those parameters will become garbage too.

Now that leads to the question, given that we know that somewhere in the period between the checkpoint at global step 4,319 and the previous one at 3,702 there was an infinite norm at some point, how on earth did the model manage to continue training after that? Loss went up at around the same time, but it wasn't completely broken as it would have been with NaNs or infinities in its parameters.

Obscurely enough, the answer turned out to be in the AMP explainer, in a comment in one of the bits of example code. Regarding the GradScaler class we're using:

So what was happening was that the scaler -- something we introduced into our code to get a speedup by using 16-bit floats instead of 32-bit whenever PyTorch thought it would make sense -- was protecting us against infinite and NaN gradients as a side-effect. It was skipping updates that would have polluted our weights with bad values from non-finite gradients.
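The "skip the update" behaviour is simple enough to sketch without PyTorch. The following is an illustration of the idea only, assuming a plain SGD update on lists of floats; it is not the scaler's actual implementation, and the function names are hypothetical.

```python
import math


def all_finite(grads):
    """True when no gradient component is an infinity or a NaN."""
    return all(math.isfinite(g) for g in grads)


def guarded_sgd_step(params, grads, lr):
    """Apply an SGD update unless the gradients contain non-finite values.

    Returns the (possibly unchanged) parameters and a flag saying whether
    the update was applied.
    """
    if not all_finite(grads):
        return list(params), False          # skip: leave the weights alone
    return [p - lr * g for p, g in zip(params, grads)], True


params = [1.0, 2.0]

# A batch that produced an infinite gradient component: the step is skipped.
params, applied = guarded_sgd_step(params, [0.5, math.inf], lr=0.1)
print(params, applied)

# A normal batch: the update goes through.
params, applied = guarded_sgd_step(params, [0.5, -1.0], lr=0.1)
print(params, applied)
```

The important property is that a single bad batch costs one wasted step rather than permanently contaminating the parameters with infinities or NaNs.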
If the above comes across as a little frustrated, then it's because I am a bit! From a software engineering viewpoint, this situation really does feel a bit like a rather messy part of the API. There are three things that it's reasonable for a library to do with infinite/NaN gradients: Now, if we look at that , we can see that the first two of those cases are handled there; and the developer can choose which option to follow. It's not where I'd personally put it (the function on the optimiser seems more natural) and I think I'd probably set the default to too, but I can also imagine good reasons for it being the way it is -- backward compatibility for one. But the "skip non-finite gradients" being a (not even optional!) behaviour that is on a class designed for handling mixed-precision training just seems outright bonkers. I would be surprised if there weren't people out there who've spent days trying to work out why their training runs failed catastrophically when they decided to switch from mixed-precision to "full fat" 32-bit floats, not realising that a hardly-even-documented feature of the scaler 3 had been saving them from gradient issues previously. Anyway, rant over. What does this all mean? There are three ways a gradient can explode: With both the baseline code and our new code, the was saving us from the last two of those, by skipping the optimiser steps with non-finite gradients. However, the baseline run was not protected against the first kind -- large but finite gradients with a finite norm -- while this run was protected. What I'm almost certain is happening here is that in all of my training runs so far, there have been all three kinds of issues with exploding gradients. The , which again, we introduced for faster training, happened to be saving us from the infinite gradients/norms. But we were still being bitten by the finite but excessively large ones. 
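The three kinds of exploding gradient discussed here (large but finite, a non-finite component, and finite components whose norm overflows) can be told apart with a small classifier. This is a hedged sketch on plain Python floats, with invented names and category labels; real training would be dealing with 16- or 32-bit tensors, where overflow happens at far smaller magnitudes than in Python's 64-bit floats.

```python
import math
from collections import Counter


def classify_gradient(grads, max_norm):
    """Label a gradient vector with one of the categories discussed above."""
    if not all(math.isfinite(g) for g in grads):
        return "non-finite component"
    # Summing squares can overflow to infinity even when every component is finite.
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        return "finite components, overflowing norm"
    if norm > max_norm:
        return "large but finite"
    return "ok"


# Made-up example gradients, one per category.
examples = [
    [0.1, 0.2],          # well inside the clipping norm
    [3.0, 4.0],          # norm 5.0, above a clipping norm of 3.5
    [math.inf, 1.0],     # an infinite component
    [1e200, 1e200],      # finite values whose squares overflow a 64-bit float
]

counts = Counter(classify_gradient(g, max_norm=3.5) for g in examples)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Logging a counter like this once per checkpoint period would be one way to see how often each category actually occurs during a run.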
And that, I think, is why this training run had a positive -- not huge, but certainly worthwhile -- effect on the test set loss. If I had more time, I think I'd do another run, logging all three of those categories of error to see how frequent they are, and charting the result. That might go some way to explaining the final question I had here: why is it that the renowned "Deep Learning" suggests a random update to get away from the cliff where you've found yourself, while we seem to be getting away with just skipping the update, which is much simpler? Well, the book was written in 2016, and I guess rather a lot has changed in the last 10 years :-) My guess is that their solution might have been a solid default in the age of RNNs, but might not make so much sense with the kind of models we're training these days. I think I can see a way in which that makes sense. Think of the illustration of a loss "cliff" in a one-parameter world that we had at the start of this post: If you happen to wind up on that cliff, you're in trouble. But imagine a two-parameter model -- the line of the loss function becomes a surface. Just as in the real world you might be able to walk along the edge at the top of a cliff and find a nice easy slope down next to it, you can imagine that the cliff in the two-parameter case might be less of a problem because you don't need to be lucky enough to jump down it -- you can walk around it. Extrapolating examples like this to higher dimensions is risky, but I think it should hold that the more dimensions you're working with, the less likely it is that a cliff is an issue -- you're more likely to be able to find a way around it. I've heard a very similar argument made for why local minima are less of an issue with lots of parameters. It's certainly worth saying that this is far from a mathematical proof, but I think it's a decent grounding for intuition. Now think about an RNN. 
Although you're doing back-propagation through time over what amounts to a very deep network, there aren't actually all that many parameters, certainly compared to an LLM like this. Each parameter is involved in the back-propagation multiple times. So, thinking of it that way, the gradient vector for the RNNs they were dealing with was of much lower dimensionality than the ones we're dealing with, even for this tiny model. They say that the random step "will typically move away from the numerically unstable configuration". I'm probably playing fast and loose here, but I'll take that as something like: if you wound up on a cliff, you were likely in a very "cliffy" area of the loss landscape. "Teleporting" randomly to somewhere some distance away was a sensible way to handle that. In our situation, even if the area is "cliffy" in the direction that one particular batch might push us, we have so many extra dimensions that it may well be that it won't be so bad with the next one. So just skipping the problematic update -- under all of those assumptions -- seems a perfectly reasonable way to handle it. All of this, BTW, made me think back to validation loss. In our previous training runs, where we were measuring it just before each checkpoint, its spikes were in general correlated with but not identical to spikes in training loss: Now, of course, exploding gradients don't have to be related to high training loss -- there's enough non-linearity in there that we can treat them as being completely uncorrelated, I think. But you definitely would expect them to have an effect on validation loss if applied. Disregarding the infinite ones (which were being filtered out anyway), the very high ones that we are now clipping would, in the unclipped baseline train, seem very likely to have caused validation loss spikes. So: if I hadn't stripped that out, we would likely have been able to see a clear difference in the validation loss line between clipped and unclipped. 
That would have been useful! I'm not going to re-introduce it, though. Best to keep the number of code changes to a minimum if I'm trying to compare like with like over the course of these intervention tests. I think that's enough for gradient clipping. I may come back and do the experiment another time to see what the relative ratios of the different kinds of problematic gradients are. Are there parts of the train where we get lots of them as a percentage (ie. we're somewhere "cliffy" in the loss landscape)? How many infinite gradient vs infinite norm vs big-but-not-infinite instances do we have relative to each other, and to normal gradient updates? What do we see if we have validation loss? And so on. But for now: gradient clipping definitely helps, and goes on the positive interventions list! I'm thinking I'll see what happens with switching off dropout next. That should at least be a bit easier... Stay tuned! Oh my .  ↩ Technically the L2 norm -- if you used cubes/cube root it would be L3, and likewise for the power of four and L4 and so on. But the L2 is the one used for gradient clipping.  ↩ Shades of Douglas Adams , really: "But the plans were on display..." "On display? I eventually had to go down to the cellar to find them." “That’s the display department." “With a flashlight." “Ah, well, the lights had probably gone." “So had the stairs." “But look, you found the notice, didn’t you?" “Yes," said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard."  ↩ A "bad batch" -- that is, one batch, or even one sequence in a batch, was massively different in structure to the others that the model had seen, so it just had much worse loss. That doesn't seem likely in this case, though: the numbers on the chart are averages over 617 global steps each, and it would take a truly pathological sequence to move the needle that much. Something weird in the optimiser. 
That's not something I understand well, but according to the various LLMs I'm working with, it's a possibility.
Exploding gradients. This is my working hypothesis, and so in this post I'll try out gradient clipping, the normal solution to that problem.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning (2016), MIT Press.

We clip element-wise. If any one of the gradients in the vector is larger than $v$, we reduce it to $v$.
We clip based on the norm: the length of the gradient vector in -- in our case -- 163M-dimensional space. That sounds harder than it is -- it's really just an extension of the Pythagorean equation $a^2 + b^2 = c^2$ to multiple dimensions. If you want to work out the length of a vector $(a, b)$ then you can use Pythagoras to work out $c = \sqrt{a^2 + b^2}$, and that generalises to any number of dimensions. So for our model we'd just square all 163M elements of the vector, sum those, and take the square root of the result, and that's the norm. [2] If the norm is greater than $v$, we just divide every element of the gradient vector by the norm and multiply the result by $v$, to produce a new gradient vector whose norm is $v$.

Whether we actually did wind up clipping them and fixing those loss spikes.
Whether we were clipping at other times -- we don't want to be doing it unnecessarily.

Blindly apply them and expect the developer to sanitise their inputs.
Raise an error.
Take some kind of default sane action, like skipping the update.

It can get very large, still be finite, and have a finite norm.
It can get very large, still be finite, but have an infinite norm (eg. due to numerical overflow).
It can become infinite -- that is, at least one of the parameters' gradients is infinite (which of course means an infinite norm regardless of any numerical stuff).
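The norm-based clipping described above, together with the "skip the update" handling for gradients that are infinite or have an infinite norm, can be sketched in plain Python. This is only an illustration on a flat list of numbers, not the training code from the post: the function name `clip_by_norm` and its return format are made up for this example, and in a real PyTorch training loop the job would normally be done by `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_norm(grads, max_norm):
    """Clip a flat list of gradient values to an L2 norm of at most max_norm.

    Returns (new_grads, action), where action is "unchanged", "clipped" or
    "skipped". When the update is skipped, new_grads is None.
    (Hypothetical helper for illustration only.)
    """
    # Case 3: at least one gradient is infinite (or NaN) -- skip the update.
    if not all(math.isfinite(g) for g in grads):
        return None, "skipped"

    # L2 norm: square every element, sum the squares, take the square root.
    norm = math.sqrt(sum(g * g for g in grads))

    # Case 2: every element is finite, but the squares overflowed to infinity.
    if not math.isfinite(norm):
        return None, "skipped"

    # Small enough already: leave the gradients alone.
    if norm <= max_norm:
        return list(grads), "unchanged"

    # Case 1: finite but too large -- rescale so that the new norm is max_norm.
    scale = max_norm / norm
    return [g * scale for g in grads], "clipped"


print(clip_by_norm([3.0, 4.0], max_norm=1.0))       # norm is 5, so it gets scaled down
print(clip_by_norm([3.0, 4.0], max_norm=10.0))      # norm is under the limit
print(clip_by_norm([1e200, 1e200], max_norm=1.0))   # finite values, infinite norm
```

Note that rescaling keeps the direction of the gradient vector and only shortens it, which is the main difference from element-wise clipping.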

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

I'm rounding out my series of posts on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) " by seeing how I could train the best base model I can from scratch on my own hardware. I started by training one in two days on my RTX 3090 , and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better. For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud . That led to some refinements in the prompt-following test I was using , and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub . Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to. I listed a number of possible interventions at the end of the RTX 3090 post; I'm not going to do them all, but for completeness, here's the full list: I'm going to work through each of those apart from the first two and the batch size (and will retrospectively add links to the posts when I do), trying a train with just that intervention and nothing else, on a cloud machine. Once that's done, I'll bake all of the things that helped into the training loop, and do another local train -- with gradient accumulation to make the batch size match the cloud instances'. The cloud machine size that I decided to use for this was the one that came out the most cost-effective (and due to its VRAM size, had the best loss) in my earlier cloud training test: an 8x A100 machine with 40 GiB VRAM per GPU. 
But first, we need a baseline model. I've already done a train on an 8x A100 40 GiB machine -- why do we need a new one? In my cloud training post, I came to the conclusion that the cost in terms of training time of running a periodic validation loop as we trained was not really worth it, at least in this case. Two of the biggest reasons to have validation during training are to work out when you're overfitting on a multi-epoch train, and to see how your model can handle datasets that it has not been trained on. In a single-epoch train like this, you're not going to overfit -- every sample it sees will be new to it -- and the training loss itself is over samples it's not been trained on at the time it was calculated, for the same reason (though of course it will be trained on them as soon as we do the backward pass starting with that loss). Of course, it's not perfect -- a big benefit of the validation loss is that it's over the same held-back dataset on every run -- and there are arguments for keeping it (albeit, perhaps doing full runs less frequently than I was). But for these experiments, I decided that I'd simply drop it. I also wanted to introduce a consistent random seed at the start of the training loop. I didn't have that in my cloud trains, and of course if we want to have solid results on whether each intervention really does improve matters, then we need one so that we can be sure they're all starting from the same point. Both of those meant that I couldn't use the earlier train on the 8x A100 40 GiB machine as a baseline; I'd need a new one, introducing those two changes: no validation during the training run (using training loss as a proxy), and setting a random seed at the start for reproducibility. So: what was the baseline train going to look like? 
The first step was to strip out the validation code and to replace it with code that just took periodic checkpoints, keeping track of which one had the best average training loss over the period since the previous one. Next, I decided to plot on the training chart that is generated during the run not just the training loss, but also an indicator of the maximum and minimum training loss over all of the steps in that period. Then I added the random seed, which I set to 42. A couple of bugfixes, and we were left with this version of the code. One thing to highlight: in the file that specifies the various training parameters, I set the per-GPU micro-batch size to 12 rather than the 13 I'd used on this size of machine earlier. Two reasons for that: Firstly, I'm going to want to do a local run with gradient accumulation later, using all of the helpful interventions. With gradient accumulation, you do a number of steps with batches that you can fit into your memory, but you don't update the parameters each time; you just let the gradients add up. After a number of those, you do one big update based on the accumulated gradients -- hence the name. The full batch is all of those smaller batches taken together. If I want that to closely match the cloud train, I'll want the accumulated batches to be the same size as each global batch in the cloud. Now, on my local machine, I can fit a batch of 6 into VRAM. So that means that the full batch needs to be divisible by 6 [1]. On the cloud train, with a micro-batch of 13 and 8 GPUs, we had an overall batch size of 104 in the previous train. 104 is not divisible by 6: no joy. But with a micro-batch size of 12, we have an overall batch of 12 × 8 = 96, which means we'd be able to do gradient accumulation and do a parameter update every 96 ÷ 6 = 16 steps. Secondly, while my estimate of the ideal overall batch size was based on a rather arbitrary bit of curve-fitting, it did say that 97 was the ideal size. So it could be interesting to see whether it did help! 
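To make the gradient accumulation arithmetic concrete, here's a small pure-Python sketch. It uses a toy one-parameter model (`y = w * x` with a mean-squared-error loss) rather than the real LLM, and every name in it is invented for illustration. It checks that accumulating the gradients from 16 micro-batches of 6, each scaled by 1/16, gives the same gradient as one batch of 96.

```python
import math
import random


def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(42)

MICRO_BATCH = 6          # what fits into local VRAM
GLOBAL_BATCH = 12 * 8    # cloud: micro-batch of 12 on each of 8 GPUs = 96
accum_steps = GLOBAL_BATCH // MICRO_BATCH  # 96 / 6 = 16

# 104 (13 per GPU x 8 GPUs) would not split evenly into micro-batches of 6.
assert (13 * 8) % MICRO_BATCH != 0
assert GLOBAL_BATCH % MICRO_BATCH == 0

# Toy data and a toy parameter.
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(GLOBAL_BATCH)]
w = 0.5

# Gradient computed over the full batch of 96 in one go.
full_grad = mse_grad(w, data)

# Gradient accumulated over 16 micro-batches, with no parameter update in between.
accumulated = 0.0
for step in range(accum_steps):
    micro = data[step * MICRO_BATCH:(step + 1) * MICRO_BATCH]
    accumulated += mse_grad(w, micro) / accum_steps

# Only now do we do the one big parameter update.
learning_rate = 0.1
w = w - learning_rate * accumulated

print(accum_steps)
print(math.isclose(full_grad, accumulated, rel_tol=1e-9, abs_tol=1e-12))
```

The division by `accum_steps` matters: each micro-batch gradient is already a mean over 6 samples, so averaging the 16 of them is what recovers the mean over all 96.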
So, having coded that up and set up the configuration, it was time to run it. Here's the training chart it came up with: Note the loss spikes at around global steps 4,200, 13,000 and 23,000. Those are important, I'll explain why later. The training run reported this at the end: So it took about 3h24m to train, even less than we expected from the previous cloud experiments' estimates of how long it would take excluding validation. About US$35 in cost. Here is the model on Hugging Face Hub . Let's see how it looks. For these intervention posts, I won't run the instruction-following tests, as they can only be run against a batch of models in one go to get results that are consistent with each other . But the smoke test -- how does it complete the sequence is worthwhile: Looks good! Reasonably coherent. Now we can find the loss on our held-back test set: That's a bit worse than the 3.674 we got for the original cloud train. Either the calculations of the optimal batch size I did were not quite right (entirely likely, they were very ad-hoc) or the model weights we started with, given the random seed we're using, just happened to lead us in a slightly worse direction (also plausible). Either way, it's in line with what we expected, and is still better than the test loss of 3.725 that we got with the second-best machine in the cloud comparison post (the 8x H100 80 GiB with a global batch size of 216). So: we have a solid baseline model -- before we wrap up, let's consider those spikes in the loss that I called out in the training chart. Random spikes in the loss are a Bad Thing, right? Certainly they're a bad thing for a train in general, especially if you don't know for sure what's causing them. But my working assumption has been that they're caused by exploding gradients -- for some specific sample in the dataset, the gradients have gone up to some insanely high value, and we've had a bad update to our parameters as a result. 
It hasn't completely knocked the model back to its starting point, but it does take some time to recover, so we lose the benefit of some of our training. If that is the case -- and it's not just something like a batch happening to have stuff that's wildly different to the rest of the training data, or something weird in the optimiser -- then gradient clipping is the solution. I wanted to see if it would help the model quality in general, but of course if we hadn't had any loss spikes in this baseline train it would have been hard to see if that was the case! So I was very glad to see them here, as if there had been none I would either have had to do a gradient clipping experiment with no real expectation of it helping -- or do another baseline train with a different random seed in the hope that that caused some spikes, which would have cost another US$35. All in all, it was good to see them there, as it sets us up well for that experiment. So, we've trained a baseline model that we can make changes to -- the interventions I listed at the start -- and get a pretty reliable understanding of whether or not they help the quality of the final model. With that in place, we're in a good position to start running those intervention tests! Given the loss spike situation in that chart, I think that a solid first one to go for -- even though it was the last in that list at the top of this post -- is gradient clipping. Where are those loss spikes coming from, and if it's exploding gradients, what happens if we limit the damage they do with gradient clipping? Stay tuned! I've already done the training run for that (while I wrote this one up), so I should be able to post about it tomorrow. Well, you could potentially do something with batches of different sizes, but that would be fiddly.  ↩ The amount of training data. 
I'm not going to dig into this one; it looks like it does help, but the returns diminish rapidly, so I think that in order to get any serious improvement we'd need to train for much more than two days locally. In the one "extended training" test I did, I managed to get the loss down from 4.167 to 4.135, which was... less-than-inspiring. The number of epochs. I'm going to stick to single-epoch training -- that is, I'll train on a single pass through an amount of non-repeating data chosen to take 48 hours to handle on my local machine. The bias on the W q , W k and W v matrices. This one definitely sounds worth looking into -- easy, as it's just a change to a config flag, and makes the model more like the original GPT-2. I'll give that a go. Dropout. I've read that for single-epoch training, dropout doesn't help (which doesn't quite work with my mental model of what it's for, but does sound plausible). Worth a look! The learning rate, and weight decay. The values I've used for these are basically copypasta from the book. I think I should learn to understand these and try to optimise them a bit. The precision. I'm using AMP , which means that some calculations are done in 16-bit rather than 32-bit, and calling with to let PyTorch choose to use the GPU's tensor cores, which use TF32, a kind of "32-bit float lite" (see the post on the local train for details). Those both (at least potentially) reduce the precision of the train below what you'd get if you trained with full-fat . Would reverting that be worth the longer train time? I should probably at least poke at that. The batch size. I've already, in effect, tried playing with that. The different cloud machines I played with had different amounts of per-GPU VRAM, so supported different per-GPU micro-batch sizes. So I wound up trying batch sizes from 512 (the same as the original GPT-2 was trained with) down to 104 in the cloud, plus my local trains with a batch size of 6. 
I did a rough-and-ready calculation at the end of the cloud training post where I estimated that the ideal batch size might be something like 97. So, probably not worth much more investigation. Exploding gradients. In one of my local trains, and in three out of the four cloud trains, I had sudden spikes in both training and validation loss. It generally took quite a bit of training -- maybe 10-15% of training time -- to get back on track after some of these, so we had what could be seen as wasted time in the training runs. Exploding gradients can be fixed by gradient clipping, which is relatively easy to do. Definitely worth investigating!

Giles's blog 1 month ago

Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)

I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone that was interested. I managed to get it done, but it was kind of tricky to get right. The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end. This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-) Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it? You could, of course, just dump the code on GitHub and share the weights somewhere. 
If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on. That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library , using models that had been uploaded to their hub . What it would be nice to do is share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this: ...rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config then -- rather than like this , with its >100-line function. Here's what I had to do to get it working. To make it easier to follow along with this post, I've created a GitHub repo . As a starting point, I recommend you clone that, and then check out the tag: You'll see that there's a file, which contains my version of the GPT-2 style LLM code from Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". There's also a script called , which is some code to run a model and get it to predict the 20 next words after the string , and a config file for the LLM code called , which tells it the number of layers, attention heads, and so on. If you want to use it and see what it comes up with, you can download the model weights from one of my trains, and install the dependencies with (recommended) or by running it in a Python environment with the libraries listed in installed. 
You'll get something like this: Your output will probably vary (for this and the later examples), as you'd expect from sampled LLM output, but it should at least be reasonably coherent. So: let's get it on Hugging Face! Our goal of being able to run inference with Transformers' system relies on a couple of deeper levels of abstraction. The requires that the model be available for download -- complete with all of its code and weights -- using code like this: is the HF abstraction for models that generate text. If that flag is concerning you, it is indeed a bit scary-looking. But remember that our goal here is to share a model on HF that has its own code, and that means that anyone that downloads it will have to opt in to downloading and running the code -- the flag is how they do that opt-in. So it is, unfortunately, necessary. Now, that model will need a tokeniser in order to run. Perhaps not surprisingly, the HF system expects to be able to download that with similar code: With both of those working, appropriate code for our pretrained models, and a bit (well, to be fair, quite a lot of) configuration, we'll be all set. But that's quite a big jump. There is a more general class called ; it's much simpler, just wrapping a generic model that might be doing anything. If we support it, we'll still need to use all of that clunky inference code, but the model's code and weights will be on Hugging Face Hub, and can be downloaded and instantiated easily. So let's get that working first, just to work out the bugs and get the basic process down pat. Our goal is to be able to run this in a Python environment where we just have and installed: ...and then have a model that we can run inference on, just like the code in our repo , but without the hassle of having to download the weights ourselves. Definitely a QoL improvement, even if it's not the endgame. If you're following along with the git repo, the tag to check out for this section is . 
In this version, you'll see a new subdirectory to contain our HF wrapper code (which I've imaginatively called ); you'll see why we need that later. In there, I've added a symlink to the model code itself (also to be explained later), an empty file to make the directory a Python module, and two files with some Transformers code: Let's dig into what's going on in those two. The first thing to understand is that whole thing in the filenames. Transformers is designed to handle all kinds of different models -- for example, Meta's Llama models and Qwen's models have their own codebases. These widely-used public models have code that is already built in to the library, with "model types" like and or respectively -- but we don't have that advantage. Our code is not built in to the library. So we need a distinct name for our type of model, which will let the library know that it has its own code and it shouldn't try to rely on built-in stuff. I chose because my Hugging Face username is my initials, 1 , and this model is the implementation of the GPT-2 architecture I'm playing with. That feels like a solid pattern to me -- it's unlikely to clash with anything built in. But the format appears to be fairly free-form, so you can choose pretty much anything so long as you're consistent throughout your code, and so long as it doesn't clash with any of the built-ins. So, you need two files with those specific names: your-model-type , and your-model-type . Let's look at them now. They're really simple at this stage; here's the configuration one: Now, when Transformers is loading a model with , it's going to need to know how to configure it. At the very least, it will need to know what to pass into the . If you look at the code , it's taking a config dictionary with stuff like the number of layers, the number of attention heads, and what-have-you. That's going to be required to instantiate the model with the right setup so that it can load the weights that we're providing. 
There's other config stuff that will come there later, but that's all we have for now. It does this using the same pattern as the various methods we were looking at earlier: All we're doing here is defining what kind of thing that method will return when it's all set up properly. You can see that we're inheriting from a class -- this provides all of the infrastructure we're going to need to push things to HF. I don't think that the name of the config class technically matters, but it definitely seems like best practice to name it based on the model name -- so, we're using for our model. However, the is important -- it has to match the model type that we've chosen and used for our filenames. Apart from that, we're stashing away the config that we're provided on a field, and then calling our superclass , forwarding on any kwargs we got in our own . Now let's look at : Just as with the config, there's for us to inherit from 2 . We're defining the thing that will return when it's all set up properly. We tell transformers that this should be configured with the that we just defined using that class variable, but apart from that, we're basically just wrapping the that is defined in 3 . That is imported using a relative import using rather than : This is important -- it has to be that way, as we'll discover later. But for now: that's why we had to create the subdirectory and the symlink to -- a relative import in Python can only happen if you're not in the "root" module, so we would not have been able to do that kind of import if the files were at the top of our repo. Now, let's take a look at the . We're calling the superclass , as you'd expect, then we're creating an underlying wrapped . We're expecting a parameter, which has the underlying model's configuration stashed away in its field by its own , so we can pass that down to the wrapped model. 
Finally, we call this special function; that does some extra configuration, and prior to Transformers 5.0.0 you could get away without calling it, but now it's 100% necessary, as otherwise it will not initialise its internal fields relating to whether or not the model uses weight tying. Now let's take a look at how we actually use those to upload the model. That's back at the root of the repo, in the file . Before looking at the code, try running it: So, it takes a model config path -- that file we have to set the number of layers and so on -- and the path of a safetensors file containing the weights. It will then try to upload our HF-friendly wrapped version of the model -- code, weights and config -- to the Hub. Let's see how it works. We do some boilerplate imports, and then import our config and our model classes -- importantly, via the submodule. Don't worry, we're getting close to the explanation of why that is :-) A bit of argument-validating boilerplate and the loading of the model config file into a dictionary so that we can use it, and now we get to the meat of it: What this is doing is telling our to register itself so that it is a thing that will be returned by the call. This only applies locally for now, but by setting things up locally we're telling the library what it will need to push up to the hub later. Next: We're doing exactly the same for our model, saying that it should be returned from . We need to be explicit about which of the various model classes we want to register it for -- the config class can only be loaded from , whereas the model might be something we'd want to have returned from , or if it was a different kind of model, perhaps , or something else entirely. What we want to do here is expose the basic model using , so that's what we do. 
We're creating our config class, passing in that model configuration that we loaded from the file earlier, so that it will stash it on its field, then: ...we create our model wrapper using that config. We now have an instance of our custom model, but with uninitialised weights. So: ...we load in the weights that were specified on the command line. Note that we have to load them into the wrapped model. The file we have is specifically for the custom that we want to publish, not for the wrapped one. But that's easily done by using the field. Finally, the magic: This is where the Transformers library really shows its strength. It will push the model, which means it needs to push the weights that we loaded into its wrapped . Then it will look at the class that defines the model, and will push the file that has the source for that class. It will see that it also has a dependency on , and will push that and its source . It will also spot the setup we did with our two calls to the different methods above to register them for the and and push that too. And when it's pushing the source, it will try to push the source of any dependencies too. This is where we get the final explanation of why we had to put it in a submodule, and have a symlink to . The code doesn't want to upload loads of extra stuff -- for example, any libraries you're using. It wants to be sure that it's only uploading your model code. The logic it uses for deciding whether or not something is part of the uploadable set of files is "was it imported relatively from the or the file" -- that is, with a dot at the start of the module name, rather than . In order to do that kind of import, we needed to create a submodule. And in order to access our file we need a copy of it inside the submodule. I didn't want to have two actual copies of the file -- too easy to let them get out of sync -- so a symlink sorts that out. Hopefully that clears up any mystery about this slightly-strange file layout. 
Let's give it a go and see what it creates! In order to upload a model to the HF Hub, you'll need an account, of course, so create one if you don't have one. Next, create an access token with write access -- the option is in the "Access Tokens" section of the "Settings". Then you need to authorize your local machine to access the hub using that token; if you're using , then you can just run: If you're not, you'll need to download and install the HF CLI and then run That will store stuff on your machine so that you don't need to log in again in the future -- if you're concerned about security, there's an you can call, and you can completely trash the session by deleting the associated token from the HF website. Now, let's run our upload script! You'll need to change the target HF model name at the end of the command to one with your username before the slash, of course. Once you've done that, take a look at the model on Hugging Face. You'll see a rather ugly default model card, but let's ignore that for now and take a look at the "Files and versions" tab. You should see the following files: Now, let's look into that . It will look like this: The bit is just showing the name of the class that was used in the call. This will become useful later when we get onto the pipeline code, but doesn't matter right now -- the next one is more important. The is essentially saying, if someone does on this model, then use the class from here, and likewise for should use . It's what that stuff we did in the upload script set up. The is just the parameters that we're threading down to our underlying custom class; nothing exciting there. The is, of course, the floating point type we're using for the model, and the is our unique name for this particular architecture. And the is the version of the library used to upload it, presumably used to determine compatibility when downloading models with earlier or later versions. 
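As a rough illustration of the fields just described, here's a stdlib-only sketch of the general shape of such a config, built as a Python dict. Every concrete name in it (`myuser_gpt2`, `MyUserGPT2Config`, `MyUserGPT2Model`, and the `configuration_...` / `modeling_...` module names) is a hypothetical placeholder following the usual Transformers naming convention, not the names used in this post, and the nested model parameters, the floating point type and the library version are left out. The small `resolve` helper is likewise invented here, just to show how an auto-map entry splits into "which file" and "which class".

```python
import json

# Hypothetical model type -- a stand-in, not the one used in the post.
MODEL_TYPE = "myuser_gpt2"

config = {
    # Name of the class that was used when the model was pushed.
    "architectures": ["MyUserGPT2Model"],
    # Which class, from which uploaded file, each Auto* loader should use.
    "auto_map": {
        "AutoConfig": f"configuration_{MODEL_TYPE}.MyUserGPT2Config",
        "AutoModel": f"modeling_{MODEL_TYPE}.MyUserGPT2Model",
    },
    # The unique name for this architecture.
    "model_type": MODEL_TYPE,
}


def resolve(auto_class):
    """Split an auto_map entry into (source file name, class name)."""
    module, class_name = config["auto_map"][auto_class].rsplit(".", 1)
    return f"{module}.py", class_name


print(json.dumps(config, indent=2))
print(resolve("AutoConfig"))
print(resolve("AutoModel"))
```

The point of the mapping is that the loader only needs the config file to find the right source file and class in the repo, which is why the model type has to be used consistently in the filenames and in the classes.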
So, it looks like there's enough information across those files on the hub to instantiate and use our model! Let's give that a go. The best way to check it out thoroughly is to create a completely fresh directory, away from our existing ones, and a fresh environment: and then to try to use the model: So we can see where Transformers has put the downloaded code, inside a submodule that appears to have a GUID-like name. Now let's try to run some inference on it: So there we go! We've gone from a situation where we would have to publish the code and the safetensors in some way and tell people how to combine them, to a neatly-packaged model that we can download, fully set up, with just one line: But that inference loop is still a pig; if you've been working with LLM code then it's not too bad -- a basic bit of autoregression with top-k and temperature -- but it's definitely holding us back. What next? One obvious issue with the code above is that we still have that dependency on . If we're going to run inference using the simple HF object, it's going to need to know how to encode the input and decode the outputs. And if you have your own tokeniser (which, if you have a truly custom model, you probably do) then you won't have the luxury of being able to just install it into the target runtime env -- you would still need to copy file around. Now, as I said at the start, I'm not going to go into this in as much detail, because my use case was really simple -- although I was using , the specific tokeniser I was using from that library was the standard GPT-2 one. Transformers has its own version of that installed. So here I'll explain how you do things for models that use a built-in Transformers tokeniser. After that I'll give some pointers that you might find useful if you're using something more custom. The good news if you're using a "standard" tokeniser that is already built into the Transformers library is that you can tell your model to use it. 
The downside is that you can't do it by using the trick that we did above -- that is, you can't just import it: ...and then add this below our previous calls to register the model and config as auto classes: That will essentially do nothing. However, tokenisers do have their own method, and the target that you specify can be your model. So, for my own models, I'm using this: That is, we get the tokeniser for the built-in GPT-2 implementation (specifically the "fast" one, written in Rust), set the padding token to the end-of-sequence one for tidiness (not sure why that's not the case by default), and then push it to the model. If you're following along with the code, you can check out the tag to see that. The code goes immediately after we've pushed the model itself to the hub. So, run the upload again: And now we can do a completely fresh env without tiktoken: In there, we can see that works: (Note that I had to use here -- that appears to be new in Transformers 5.0.0.) And do our inference test: It may not be much shorter than the code we had when we just had the , but it's an important step forward: we can now download and run inference on our custom model with none of the custom code -- neither the model itself nor the tokeniser -- on the machine where we're doing it. Everything is nicely packaged on the HF Hub. Now, what if you're using a tokeniser that's not already in Transformers? There are two possibilities here: As I said, I have not done either of these, but that's the direction I'd explore if I needed it. If you do either and want to share your experiences, then please do leave a comment below! And likewise, if and when I start writing things with custom tokenisers, I'll link to the details of how to upload them then. Anyway, we've got the tokeniser done to the level we need for this walkthrough, so let's do the QoL improvements so that we can run inference on the model using the nice HF abstraction. 
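For reference, the hand-rolled inference loop mentioned earlier ("a basic bit of autoregression with top-k and temperature") boils down to something like the sketch below. It is written in plain Python rather than PyTorch so that it runs anywhere, and `toy_logits` is a hypothetical stand-in for a real model; none of these names come from the post's repo.

```python
import math
import random


def sample_top_k(logits, k, temperature, rng):
    """Sample one token id from `logits` using top-k filtering and temperature."""
    # Keep only the k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the surviving logits.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_token_logits, prompt_ids, max_new_tokens, k=3, temperature=0.8, seed=0):
    """Autoregressive loop: feed the whole sequence back in at every step."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_top_k(next_token_logits(ids), k, temperature, rng))
    return ids


def toy_logits(ids, vocab_size=8):
    """Hypothetical stand-in for a model: favours the token after the last one."""
    target = (ids[-1] + 1) % vocab_size
    return [5.0 if t == target else 0.0 for t in range(vocab_size)]


print(generate(toy_logits, [0], max_new_tokens=5))
```

With `k=1` the loop becomes greedy decoding, which is a handy way to check the plumbing before adding randomness back in.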
Let's look at our target code for inference again: The version of the code that does this is in the repo on the tag , but I'll explain how it was put in place, with the logic behind each step. In order to run a text-generation pipeline, we're going to need to wrap our model in something that provides the interface for LLMs in the Hugging Face ecosystem: . So, our first step is to put the plumbing in place so that we can use the method on that class to download our wrapped model. IMO it's cleanest to have two separate models, one for "simple" inference that is just a regular model -- the we have right now -- and one supporting the richer interface that supports easy text generation. So we can start off by adding the basic structure to : We can then add code to register that to our script -- the last line in this snippet, just below the two that already exist. That feels like it should be enough, but for reasons I've not been able to pin down, it's not -- you also need to massage the "auto-map" in the object to make it all work properly. So after that code, after we've created the object, we need this: With that in place, we could just upload our model -- would work just fine. But the model that it would return would not be any different to the one we've been using so far. To get that to work, we need to update the model to say that it can generate text. That's actually pretty easy. Firstly, we need it to inherit from a mixin class provided by Transformers: Now, the semantics of the method on this class are a bit different to the ones we had previously; we were just returning the outputs of the last layer of the underlying model, the logits. For this kind of model, we need to put them in a wrapper -- the reasoning behind this will become clearer when we get on to training. So our forward pass needs to change to look like this: Finally, some changes to our config class. For text generation, Transformers needs to know how many hidden layers the model has 4 . 
In the case of the model I'm using to demonstrate, that's the parameter in the underlying configuration, so this can go inside the : Another change in the config that took me a while to puzzle out, and might catch you if you're in the same situation: Transformers, by default, assumes that the model caches previous inputs. So in an autoregressive loop starting with , the first run of the model will get the full input; let's say it returns . The next iteration of the loop, however, won't be passed the full new sequence , but rather just the token that was generated last time around, . So you'll get a series of predicted tokens where the first one might make sense but the rest degenerate into gibberish: All of the tokens generated after had just the previous token as their context. Luckily, you just need to specify that your model doesn't have a cache in the config class as well, after the call to the superclass : We're almost there! At this point, we actually have all of the code that we need for a working . But there's one final tweak. A model on the hub has a "default" model type, which is the one that we use when we do the original . You might remember that it appeared in the in that single-element list keyed on . Previously we had this in our upload script: That means that our default is the model. But when the pipeline creates a model for us, it will just use the default -- even for the text-generation task, it doesn't assume we want to use the . Luckily, that's a small change: we just upload our text-generation model instead of the basic one: With all of that in place, we can run the script, upload the model, and then in a fresh environment: Lovely! Now let's get it training. For this section, check out the tag. You'll see a new file, , which has the training loop from the notebook I linked to at the start of this post. It will train the model on this dataset , which is essentially a bunch of chatbot-style transcripts in the Llama 2 format. 
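To see why the cache assumption described above produces gibberish, here is a toy sketch. `toy_model` is a made-up stand-in whose prediction depends on the whole context, as a real cache-less model's would; the two loops differ only in what they pass to it after the first step.

```python
def toy_model(context):
    """Hypothetical cache-less model: the next token depends on the whole context."""
    return (sum(context) + 1) % 10


def generate_full_context(prompt, steps):
    """What a model without a cache needs: the full sequence every iteration."""
    seq = list(prompt)
    for _ in range(steps):
        seq.append(toy_model(seq))
    return seq


def generate_last_token_only(prompt, steps):
    """What happens if the caller assumes a cache that the model does not have."""
    seq = list(prompt)
    seq.append(toy_model(seq))           # first call sees the full prompt
    for _ in range(steps - 1):
        seq.append(toy_model(seq[-1:]))  # later calls see only the newest token
    return seq


prompt = [1, 2, 3]
print(generate_full_context(prompt, 4))     # [1, 2, 3, 7, 4, 8, 6]
print(generate_last_token_only(prompt, 4))  # [1, 2, 3, 7, 8, 9, 0]
```

The first generated token is the same in both runs, because both saw the full prompt; everything after it diverges, which is the "first one might make sense but the rest degenerate" pattern.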
Its goal is to help fine-tune a base model to become an instruction-following one, though of course the model I'm using here is too tiny for that to work well! It's still a useful way of checking that training works, though. To save time, it only does one training epoch, which should be enough to get the loss down a bit. If you run against one of my other models, you can see it working (you will need to tweak the batch size if you have less than 24 GiB of VRAM). You can see that it's at least trying to answer the question after training, even if its answer is completely wrong -- pretty much what you'd expect from the tiny model in question (163M parameters trained on about 3B tokens). In order to get it working with our custom models, we just need to return the loss as well as the logits from the method of our class: You can see that we're getting the targets for our predictions in , and an attention mask; we have to shift them ourselves (that is, if the inputs are , then the labels will be ), and also apply the attention mask manually, and then we can do the normal PyTorch cross-entropy calculation. This makes some kind of sense. The model on HF does need to package its own loss function somehow -- cross entropy is, of course, going to be the most likely option for a causal LM, but there's no guarantee. And while I think that personally I would have just had return logits and package up the loss calculation elsewhere so as not to muddy the interface, I can see the convenience of having it there. Anyway, having done that, we can upload the model one final time, and then use that training code to run it. We have a working training loop! Once again, it's replying, even if it has no idea what the answer is, and starts looping in a typical small-model fashion. And with that, we're done. We've gone from having a custom model that was hard for other people to discover and work with, to something that plays well with the Hugging Face ecosystem. 
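The loss calculation described above (shift the labels, apply the attention mask, then take the cross entropy) can be sketched in plain Python so that the arithmetic is visible. The real code uses PyTorch tensors; the function name and the choice to mask on the target position are assumptions made for this illustration.

```python
import math


def causal_lm_loss(logits, labels, attention_mask):
    """Mean cross entropy for next-token prediction over one sequence.

    logits: one list of vocabulary scores per position.
    labels: token ids, same length as logits (typically the input ids).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    total, count = 0.0, 0
    # Position t predicts token t + 1, so drop the last logits row and the
    # first label, and use the mask of the *target* position (an assumption).
    for scores, target, keep in zip(logits[:-1], labels[1:], attention_mask[1:]):
        if not keep:
            continue  # padding targets contribute nothing to the loss
        peak = max(scores)
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum_exp - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count if count else 0.0


# Toy example: two-token vocabulary, three positions, last position is padding.
logits = [[0.0, 0.0], [10.0, 0.0], [0.0, 0.0]]
labels = [0, 0, 1]
print(causal_lm_loss(logits, labels, attention_mask=[1, 1, 0]))
print(causal_lm_loss(logits, labels, attention_mask=[1, 1, 1]))
```

In the masked case only the first prediction counts, and since its logits are uniform over two tokens the loss is ln 2; unmasking the padded position adds a confidently wrong prediction and pushes the mean up sharply.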
The final step is to write a decent model card so that people know what to do with it -- that, of course, depends very much on your model. I was uploading a bunch of very similar models in one go, so I wound up writing a Jinja2 template and using the class to upload it, but that's just simple plumbing code -- you can see it here if you're interested. As I said at the start, this isn't a full tutorial -- it's just the code I needed to upload my own models, so it doesn't cover tokenisers that aren't already baked in to Transformers -- and there are probably other gaps too. But hopefully it's useful as-is. If you find gaps that your model needs and work out how to solve them, then please do leave comments here -- if there are useful resources out there, either things I missed or things you've written, I'd be happy to link to them from this post. Thanks for reading! I'll be returning to my normal "LLM from scratch" series shortly... It's a fun coincidence that my initials are so similar to the architecture. Someday I should do something with my domain ...  ↩ I'm not sure why the capitalisation of the "t" is different -- vs -- but it seems very deliberate in the Transformers codebase, at least as of version 4.57.6. Some kind of backward-compatibility cruft, I assume. 5.0.0 provides a alias as well, so it looks like they're making things consistent in the future.  ↩ You might reasonably suggest that we could inherit from rather than wrapping it. I've chosen to wrap it instead because I generally prefer composition to inheritance -- the code generally works out nicer, to my mind. I'd suggest starting this way and then refactoring to use inheritance if you prefer later on.  ↩ No idea why, but it does ¯_(ツ)_/¯  ↩ -- a file telling git (which is used to manage the models on the hub) which file types should use the Large File Support plugin. Big binary files don't play nicely with git, so it uses LFS for them. We don't need to pay much more attention to that for our purposes. 
-- that ugly model card. Updating that is useful, but out of scope for this post.
. We'll come back to that one in a moment.
-- a copy of the file we created locally with our class.
-- again, the same file as the local one, uploaded due to that clever dependency-finding stuff.
-- our weights. There should be an icon next to it to say that it's stored using the LFS system.
-- once more, a file that was just copied up from our local filesystem.
You're using the HF library. With that, you can save your tokeniser to a JSON file, then you could load that into a object, which provides a method to push it like I did with the one above.
You've got something completely custom. Just like there is a and a , I believe you can also add a that defines a subclass of , and then you can push that to the Hub just like we did our model wrapper class.
Working , , , and helpers.
A working text-generation .
Support for HF's abstraction for follow-on training and fine-tuning.

Giles's blog 1 month ago

Writing an LLM from scratch, part 31 -- the models are now on Hugging Face

As part of my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally , and four in the cloud . I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank. It makes sense to share these models somewhere, both so that other people can take a look if they like, and also to build the knowledge of how to do it so that if I produce something more interesting in the future, I'll know how to share that too. Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later). From the post where I trained the models locally , we have: Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs , four models (with two checkpoints from one of them): You can see how they compare on my evals at the bottom of this post . I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that. Here's the code I've been using as a smoke test after training a model to make sure it's not complete garbage. There's quite a lot of it. That's a lot of faffing about to generate a continuation of ! 
Disregarding the boilerplate with the argument parsing and validating, we have to load up the model, load up the tokeniser, encode our prompt, and then do a bunch of rather arcane stuff 1 to sample from the model to generate some tokens before we finally print out the result. With the HF Transformers library, there are extra levels of abstraction that allow you to do things much more simply: ...and I wanted what I published to work with that -- and, indeed to be trainable further using the associated training library, like I did during my fine-tuning experiments . I managed to get that all to work, but it was quite a lot more effort than I expected. But at the end, both the pipeline code above, and the training code that you can see in this notebook worked fine. I'll write a follow-up blog post shortly about how to write the code to make a vanilla PyTorch model work within the Hugging Face ecosystem (probably not as part of this LLM from scratch series, as it's a bit of a tangent). But in the meantime, if you're using HF and want to take a look, have fun :-) I've put all of the models in a collection . Of course, if you've been reading the posts in this series carefully I'm sure it's all as clear as day ;-)  ↩ -- the first model in that post, trained on a roughly Chinchilla-optimal number of tokens (20x the number of parameters) from FineWeb . -- the second model, trained on the same number of tokens from FineWeb-Edu . -- the third one, which is the model trained further on another roughly Chinchilla-optimal number of tokens from the same dataset. -- trained on a 8x A100, 40 GiB/GPU machine. -- trained on a 8x B200, 160 GiB/GPU machine. -- trained on a 8x H100, 80 GiB/GPU machine. The best validation loss for this train was not in the last iteration, so this is the checkpoint with the best loss. -- this one is the final checkpoint from the one above. -- trained on a 8x A100, 80 GiB/GPU machine. 

Giles's blog 1 month ago

Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI: Here were the results I got, sorted by the loss: Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern. I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now. In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change. Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset in order to let it know how to follow instructions; that comprises a series of sequences like this: More details in this post . In the version I've settled on , I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples. 
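The fine-tuning procedure described above (85% train, 5% validation, 10% test, stopping when validation loss rises and keeping the previous epoch's weights) can be sketched as follows. The function names and the callback-style interface are assumptions for illustration, not the post's actual script.

```python
def split_dataset(samples, train_pct=85, val_pct=5):
    """Split into train / validation / test; the remainder is the test set."""
    n_train = len(samples) * train_pct // 100
    n_val = len(samples) * val_pct // 100
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]


def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs):
    """Train epoch by epoch and stop as soon as validation loss rises.

    Returns (best_weights, best_loss): the weights from the epoch *before*
    validation loss started going up.
    """
    best_weights, best_loss = None, float("inf")
    for _ in range(max_epochs):
        weights = train_one_epoch()
        loss = validation_loss(weights)
        if loss > best_loss:
            break  # validation loss rose: keep the previous epoch's weights
        best_weights, best_loss = weights, loss
    return best_weights, best_loss


# Toy usage: validation loss improves for three epochs, then rises.
fake_losses = iter([2.0, 1.5, 1.2, 1.3, 1.1])
epoch_counter = iter(range(1, 100))
best_epoch, best_loss = train_with_early_stopping(
    train_one_epoch=lambda: next(epoch_counter),  # "weights" are just the epoch number
    validation_loss=lambda weights: next(fake_losses),
    max_epochs=10,
)
print(best_epoch, best_loss)
```

Note that the loop stops at the first rise, so the later 1.1 in the toy data is never reached; that matches the "bail out when it starts rising" rule rather than a patience-based scheme.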
Once that's done, the script hits the OpenAI API, using GPT-5.1, default parameters for all of the options (eg. no explicit temperature) with queries like this: We do that for every model-generated response in the test set, then take the average of the scores and use that as our result. To see why that's problematic, imagine this simple instruction with no separate input: One response I've seen from my models was this: That's obvious garbage, and should get a zero -- and GPT-5.1 consistently does that. Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), is this: That's correct, so it deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset). But now how about this one: One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason for that, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response. The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment. I came to the conclusion that the best way to fix this is to do a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, and generate responses for the test set and store them in a file. 
Then, once we've done that for all models, we can score them all at once, prompting GPT-5.1 with something like this: The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better. Here's the code: Running the first against each of our models, and then the second against all of the output files, gives us this updated table (with links to the annotated JSON files in case anyone else wants to take a look): (Still sorted by loss so that you can compare it more easily with the one above.) That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern. It looks like we have three groups of models: I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB, and the two local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower. A hypothesis: there are two things that contribute to how good a model is at these IFT tests: Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here. While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. 
Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible. By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training. So we can imagine the OpenAI models are smart but not knowledgeable, as we can our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well. Their intelligence makes up for some of their lack of knowledge. Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, but they didn't wind up with a particularly amazing loss. So they get low scores. And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more because their training data was better. Well, it sounds plausible ;-) And I'd like to spend some time digging in to see if there's any indication if it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed how you'd test that hypothesis in any real depth. TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning about how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. 
I will keep on doing these IFT tests for future models, though, just out of interest. So: let's get back to our regularly scheduled LLM training. Next up, how do we upload our models to Hugging Face quickly and easily so that other people can play with them?
A simple cross entropy loss over a fixed test set.
The results for an instruction fine-tune test that's covered in the book.
A script to fine-tune a model and generate test responses and to dump them into a JSON file.
The LLM-as-a-judge code to send a bunch of models' responses to GPT-5.1. It scrambles the order of the models in each query, to try to avoid any preference the model might have for the first one vs the last one, and it stores GPT-5.1's per-response scores and comments in a new "annotated" JSON file.
The OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores.
The other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores.
The FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's.
The loss. Models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning.
The amount of information in the dataset. It doesn't matter how clever a model is, if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question.
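The order-scrambling step of the batch judging approach described in this post can be sketched like this. The prompt wording, the function names and the model names are all assumptions for illustration; the actual prompt sent to GPT-5.1 is not reproduced here, and no API call is made.

```python
import random


def build_judge_query(instruction, responses_by_model, rng):
    """Build one prompt containing every model's response in shuffled order.

    Returns the prompt text plus the list mapping response position to model
    name, so that scores can be attributed once the judge replies.
    """
    order = list(responses_by_model)
    rng.shuffle(order)  # avoid a systematic first-vs-last position preference
    lines = [f"Instruction: {instruction}", "Score each response from 0 to 100."]
    for number, model in enumerate(order, start=1):
        lines.append(f"Response {number}: {responses_by_model[model]}")
    return "\n".join(lines), order


def unscramble(scores_in_prompt_order, order):
    """Map the judge's scores (given in prompt order) back to model names."""
    return dict(zip(order, scores_in_prompt_order))


# Hypothetical model names; the responses echo the examples discussed above.
responses = {
    "model-a": "The book wrote itself.",
    "model-b": "Jane Austen",
    "model-c": "Sarah Palin",
}
prompt, order = build_judge_query(
    "Who is the author of 'Pride and Prejudice'?", responses, random.Random(0)
)
print(prompt)
```

Because all responses to one instruction are graded in a single query, the judge's "mood" is shared between models for that instruction, which is the consistency the batch approach is after.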

Giles's blog 1 month ago

Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that: In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way? Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing. So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?" Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. They can train the model in less than four hours, they happen to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation. 
If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results . Back when I was messing around with fine-tuning LLMs using the Hugging Face ecosystem -- their "Transformers" library and so on -- one of the experiments I did was to fine-tune a 0.5B Qwen model on an 8x GPU machine . As part of that, I came across this excellent HF page summarising different kinds of multi-GPU training techniques . The three that are relevant are: Now, from what I understand, due to all of the copying around of models, plus the issues inherent with the GIL in Python, DDP is actually better than DP despite being more complicated -- and more flexible! Per Hugging Face: DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine. It might be a while before I want to try multi-machine training, but it would be awesome to have code that's ready to do that without needing any extra work. Now, how to implement it? Hugging Face have a library called Accelerate , which does everything for you: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! That does sound very useful, but I worry that by using it I won't learn as much. It also rather ties you in to the HF ecosystem. That's not necessarily a bad thing -- I enjoyed using their stuff in my fine-tuning project -- but I'm trying for a somewhat lower-level view in this series. So, let's use the PyTorch-native stuff. There's a "getting started" tutorial , so we can follow that. It has two options for running using DDP, one with a bit of extra setup code -- the first example, under "Basic Use Case" -- and one that uses to make things easier. The second sounds best. 
The code changes actually look really simple; given a normal single-GPU training script, you need to do some setup at the start: ...then wrap the model itself in a object, which is what you actually do the train on: ...and a bit of teardown at the end: The way to look at this is that will spin off one process per GPU, each running exactly the same code. They have a "rank", which is an integer saying which of the per-GPU processes they are -- 0 for GPU 0, 1 for GPU 1, and so on. There's a bit of a gotcha here, though -- you can see that we're looking at an environment variable called at the start, but we then get a (non-"local") variable from a bit later on. This is due to the multi-machine possibilities with DDP -- if you have multiple machines, then the local rank will be "which GPU on the machine does this process relate to", but there will also be a "global" rank, which is unique across all machines. This distinction won't matter that much during this one-machine test, but it's worth keeping in mind if we want to keep the code in a shape where it could potentially scale to multiple machines. Anyway, after the processes are spun up, they will do their training, and the synchronisation and passing around of gradients during the backward pass will all happen invisibly in the background, so when we do our , it will have the full set of gradients. Now that means that we'll presumably also need to use the rank -- that is, which of the n per-GPU processes the current code is running in -- when selecting which dataset items to train on. More about that later. Let's start writing some code! I'll use a new repo , into which I can put just the code needed for this train. I'll also structure it a little better than last time, with separate "runs", each of which has a model config and training parameters, and will later on have its own checkpoints. You can think of these as being one per machine size that I'm trying out -- I'll create a run directory for each one. 
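The local-versus-global rank distinction, and the idea of using the rank to pick dataset items, can be sketched without any GPU. This assumes a `torchrun`-style launcher, which exports `LOCAL_RANK`, `RANK` and `WORLD_SIZE` to each process; the helper names and the simple striding scheme are illustrative choices, not necessarily what the training script ends up doing.

```python
import os


def distributed_context(env=os.environ):
    """Read the per-process values that a torchrun-style launcher exports."""
    return {
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # GPU index on this machine
        "rank": int(env.get("RANK", 0)),              # unique across all machines
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


def shard_indices(num_items, rank, world_size):
    """Give each process every world_size-th item, offset by its global rank."""
    return list(range(rank, num_items, world_size))


# Simulate the second GPU (local rank 1) on the second of two 8-GPU machines.
ctx = distributed_context({"LOCAL_RANK": "1", "RANK": "9", "WORLD_SIZE": "16"})
print(ctx)
print(shard_indices(40, ctx["rank"], ctx["world_size"]))
```

The local rank is what you would use to select a CUDA device, while the global rank is the one to use for sharding data, since only it is unique across machines.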
Here's a first cut , simply loading up a model config from a run's directory, using it to create the model, and then doing the wrapping above -- no training at all. Running it with (and , as I'm using that for all new projects): Promising. Now, unfortunately we only have one GPU locally, and the code assumes that it's one process per GPU (I believe that's a hard limitation for PyTorch's DDP), so running with blows up. So we can't do an in-depth test locally. But at least we know that the basic infra is there and working. Now let's move the other training code from the single-GPU script into that file, pretty much blindly. This is the result -- it's doing almost nothing beyond what the last train did, apart from wrapping the model in a object -- the only other changes are to use this "runs" directory that we've introduced. As a quick hack, we should try running it. It does a validation and checkpoint before it starts, and we can make that happen quickly by hacking the validation loop to only do a couple of iterations: (Foreshadowing: that hack will come back to haunt us later!) Running that, then hitting control-C after the validation completes, and it looks OK: ...and we have what look like solid checkpoints: However, loading one of those checkpoints fails: It turns out that the problem is this code when we save it: The that we're saving is the wrapper around our model; my guess is that it does actually include all of the weights for the model, hence the correct-looking size for the checkpoint file, but they're renamed -- the wrapper sees the underlying model as something called , so (for example) would be called . Fixing that, with this diff: ...sorts it out -- we can load our checkpoints again. Here's the updated file . 
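For checkpoints that were already written from the wrapped model, one alternative to re-saving them is to rename the keys at load time. The sketch below assumes the wrapper prefixes every parameter name with `module.`, which is how PyTorch's `DistributedDataParallel` exposes the model it wraps; the function name and the example keys are hypothetical, and plain strings stand in for tensors.

```python
from typing import Any, Dict

DDP_PREFIX = "module."  # DDP holds the wrapped model as its `module` attribute


def strip_ddp_prefix(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state_dict` with a leading 'module.' removed from keys."""
    cleaned = {}
    for key, value in state_dict.items():
        if key.startswith(DDP_PREFIX):
            key = key[len(DDP_PREFIX):]
        cleaned[key] = value
    return cleaned


# Hypothetical parameter names, with strings standing in for tensors.
wrapped = {"module.tok_emb.weight": "w0", "module.out_head.weight": "w1"}
print(strip_ddp_prefix(wrapped))
```

Saving the unwrapped model's state in the first place is the cleaner fix; a helper like this is only useful for rescuing files that already exist.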
I think we're going to have to revisit checkpointing and validation again; we don't want to do it in all of our processes, probably only on global rank 0, and we'll need to somehow synchronise everything so that the other processes don't carry on training while we're doing it. But before we get on to that, there are a couple of other things to change. At the top of the file we're defining some constants that look wrong: We'll handle the dumbest of these first; it was actually silly that in the old code we had a constant for sequence length. We're using the context length of the model for that, so it's duplicated information. Let's get it from the : ...and here's the updated file . That was nice and simple. The code that we have specifies the batch size for each GPU -- that is, with , we'll have six sequences in each batch on each one. Like I mentioned earlier, that's called a "micro-batch" in distributed training like this 1 -- a per-GPU batch, as opposed to the overall global size across all GPUs -- so we could just rename it, and then we'd have 6 × n gpus as a global batch size. However, it feels to me like this is a useful metaparameter to be able to tweak from outside the code. I can see machines with per-GPU VRAM varying from 40 GiB to 160 GiB on Lambda Labs, and pretty clearly that will mean there will be a varying largest micro-batch size on each type. So this is something we'll want to configure on a per-run basis, so let's add a new file to our run config, load that up, and pass it through. That's a simple enough fix; no need to note the diff, but here's the code . This one we'll need to think about. The size of our validation set is based on what one process running on my local RTX 3090 can validate in five minutes, and the interval (for which I fairly arbitrarily put 2000 in the code when copying it across) was calibrated for roughly every half-hour. Those numbers in turn were aimed at the 44 hours of training time I expected locally. 
For this train, we'll (hopefully!) be taking significantly less time. We'll have eight GPUs, so naively that's 5.5 hours of train time, and each will have more VRAM, so we should be able to bump up the batch size and potentially get even faster than that. Depending on which kind of cards we're using, they may be faster, too -- I found that an A100 is slower (with the same batch size) than the RTX 3090 in my fine-tuning experiments, but the H100 and B200 are likely faster. I think this is another thing for the train config; we should have the validation interval (in terms of iterations) and the number of batches to do for validation. Here's the updated code . Now, let's move on to the dataset. With the code as it is right now, all of our per-GPU processes are using this code to iterate over the same dataset: That means that they'll all be training on the same data; the synchronisation that is happening "magically" in the background means that they'll all train on the first item, work out gradients, and step their optimiser -- so they'll essentially (modulo randomness) have the same updates. Pretty pointless! What we want is for each of the n per-GPU processes to train on 1 / n of the data. We have two useful helpers in : , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) So, the simplest thing to do is to use the world size as a step, and the rank as an offset: Here's the code with that . Now, remember that the same code is running for every one of our per-GPU processes. That means that all of them will do the training with forward and backward passes, and their own optimiser steps, all synchronised by PyTorch DDP magic. 
But they will also do their own validations -- which is kind of pointless -- and they'll also try to save their own checkpoints, which would be messy because they could quite easily interfere with each other; after all, all of the processes are running on the same machine and would be writing to the same filesystem. So, as a first cut, let's just wrap an around the eval and checkpointing stuff -- we change this: ...to this: That line is getting a bit long, so let's break it apart a bit: That looks OK, but there's an extra wrinkle: all of the processes are running the same code, so while the rank zero one will do the eval, the others will continue through the script, so they will go right back around our loop and start training on the next batches -- which is bad. We want our processes to be proceeding in lockstep, iteration-by-iteration. Luckily, the solution is simple: the function in basically says "stop here until all of our processes have reached this point". So we can use two of those -- one before the eval loop, to make sure that all of the processes have finished their training part of the iteration before we do the eval on rank zero, and one after the eval, so that the non-rank-zero processes will wait. One bit of complexity -- we want to do those barriers only if it's an eval iteration, but we want to do them for all processes. So we have to break up the statement, and we wind up with this: That seems to work OK (code here), but it does give a warning: So, we want to pass the device ID in when we call . Let's dig into that a bit. Here's the copypasta that I took from the PyTorch tutorial earlier in this post: Let's look at what that is doing. The environment variable is being set by to 0, 1, 2, etc as appropriate to tell us which process we are on this machine. So the first line is telling PyTorch to use the device with that index for this process.
The next line is getting the current accelerator -- that is, an object that represents which acceleration hardware we're using in this process. I think that the best way to see the combination of these two lines is that the first says "use " (or 1, or 2, or...), and then the second says "get the object describing the GPU you're using right now". So it's a slightly indirect way of getting the object containing the details of the GPU in question. Next, we call . A backend in this context is an abstraction of whatever system the device in question is programmed using -- in the case of an Nvidia GPU, it would be some kind of thing that encapsulates CUDA. Once that's done, we call , passing in the backend that we're using. We're saying "initialise the internal data structures for so that they're all set up properly to work with the backend we specified". After that, we can do stuff like getting the global rank with and so on, because has been properly initialized. Presumably at this point we're talking to any other machines in a multi-machine cluster, so we can find out what our world size is and that kind of thing. That extra line at the end, to get the : ...actually looks erroneous to me. All of our code is assuming one process per GPU. So I think we can just use the there as well. Let's rewrite it like this (with some useful comments): That seems to work well! Here's the code . However, I ran it past ChatGPT (largely to validate my understanding of what was going on), and it highlighted something slightly misleading about it. Right now, we're training on a single node, with one process per GPU. But again, one of the neat-o things about this DDP stuff is that it should be able to scale to multiple nodes. Now, remember that is just the rank of the current process on the specific node that it's running on -- hence the name. If we had two machines, each with 8 GPUs, then there would be a process with rank zero on each of them. 
The "real" rank -- that is, across all machines -- is the one that you can get from once it has been initialised. One of the things it does during that initialisation is to talk to all of the other nodes and work that kind of thing out -- which of the local rank zero processes across all of the machines is the global rank zero process. So we need to use the local rank when working out which GPU we should be running on and so on, but we should not treat it as a global rank. That's actually quite fine in this case, as we're calling inside the training loop when we actually need to use the global one (when indexing into the dataset, or when deciding if we're the process that should be doing evals and checkpoints). The only place where we might be confusing matters is in that print, which is not important anyway, as the training loop also prints out its rank. So, let's tweak it a little more for clarity: That seems to work well! Here's the code . Time to run it past ChatGPT to see if I've made any dumb errors. Turns out that (unsurprisingly) I have... Let's go back to our code that decides whether or not it's an iteration where we need to do a validation run and a checkpoint: The problem is that our index is different in the different processes! Remember, we have this in order to pick out the correct training items: So let's think about it; in the first run through the loop, with 8 GPUs, we would have In the next run through the loop, we'd have: So will give different results for each process. That might not sound like the end of the world -- will only be zero for one of them, so long as is larger than the number of GPUs -- but remember that our validation code looks like this: Now, if different processes have different values for , then will only be called in the one(s) for which it is . But means "wait until all processes have reached this barrier". 
So the ones that call it will block until the other processes get there, and everything will at best get out of sync, and at worst will lock up completely. I think that the problem here is that I'm conflating two things: the index of the global step -- that is, one iteration across all GPUs -- and the dataset element that we want to use. In the original one-GPU case that made sense; iteration 0 was on dataset element 0, iteration 1 was on element 1, and so on. But now the offset into the dataset, and the global step, are quite different things. This is quite deeply embedded in the code, but we can fix it! Let's start off by changing our checkpoint code, just to rename things. It keeps track of a variable called , our offset into the training dataset, and uses that both to index into the dataset, and to work out how far through the train we are. The latter is a much better thing to store in a checkpoint, so instead of saving , we'll store (and restore) . Basically, just a rename so that the variables and stored JSON match the new reality. Here's the updated code. Now we need to make a number of minor changes to the training loop just to match that rename of the value that we're checkpointing (eg. for the code to generate the training chart) but the most important change is to our loop. Instead of iterating over our dataset with a step and an offset so that we can index into it, we firstly work out how many global steps there will be: ...then we iterate from our initial global step -- zero if we're starting a fresh train, or whatever global step we were on in a loaded checkpoint plus one if we're doing a continued train from a checkpoint -- up to the : That means that we need to use the global step, the world size, and our current rank to work out which dataset item we should be training on for this process at this global step.
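A minimal sketch of that mapping, using plain integers rather than the real dataset. It assumes each process consumes exactly one dataset item per global step, so consecutive items are dealt out across the ranks; the function names and the interval of 10 are illustrative. It also shows why the old per-process index could not be used for the validation check: only the global step is the same in every process.

```python
def dataset_index(global_step: int, world_size: int, rank: int) -> int:
    """Dataset item that `rank` trains on at `global_step`."""
    return global_step * world_size + rank


def is_eval_step(global_step: int, eval_interval: int) -> bool:
    """Every process gets the same answer, because global_step is shared."""
    return global_step % eval_interval == 0


world_size = 8
for step in range(2):
    items = [dataset_index(step, world_size, r) for r in range(world_size)]
    print(step, items)

# Under the old scheme the per-process dataset index was tested against the
# interval, so the processes disagreed about whether to validate:
old_style = [dataset_index(1, world_size, r) % 10 == 0 for r in range(world_size)]
print(old_style)
```

Because `is_eval_step` depends only on the shared global step, either every process reaches the barriers or none does.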
Let's say that we have eight processes; on the 0th global step, we should have rank 0 training on dataset item 0, rank 1 on item 1, and so on. On the next global step, rank 0 should train on item 8, rank 1 on 9, and so on. So: That's actually much more elegant than the earlier code, and seems to work fine. Here it is. Phew, glad to have caught that before I started spending money on machines -- it would have been confusing if everything locked up. Thanks, ChatGPT! Another thing that ChatGPT raised is about the validation. We don't want to validate across all of the validation dataset -- we're using a number from the . I have this code: This looked like a nice, quick way to get the first elements of the validation dataset. But ChatGPT told me it would raise an exception. It didn't, though -- why? The problem is that I had set to in my training config for testing. Stepping through what that slice does, when we run : Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack". Nasty! AI code review certainly helped me dodge a bullet on that one. Let's fix it; it's not a big change: we can just do this: ...and that works! So here's the code now. So, I think we have one final issue, which is the training and validation datasets.
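Before moving on to the datasets: the slicing pitfall above is easy to reproduce without PyTorch. In this sketch a hypothetical `PairDataset` holds plain lists where the real dataset holds tensors, and its `__getitem__` passes whatever index it receives straight through. Slicing it returns a single pair of slices rather than a sequence of pairs, so unpacking in a `for` loop only fails once the slice is longer than two items. The index-by-index loop at the end is one possible fix, not necessarily the one used in the repo.

```python
class PairDataset:
    """Toy stand-in for a dataset of (input, target) pairs."""

    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, ix):
        # No special handling for slices: `ix` goes straight to both stores.
        return self.xs[ix], self.ys[ix]


ds = PairDataset(["x0", "x1", "x2", "x3"], ["y0", "y1", "y2", "y3"])

# Two items: unpacks "successfully", but pairs up the wrong things.
silently_wrong = [(x, y) for x, y in ds[:2]]
print(silently_wrong)  # [('x0', 'x1'), ('y0', 'y1')]

# Three items: the unpacking finally fails.
try:
    for x, y in ds[:3]:
        pass
except ValueError as exc:
    print("ValueError:", exc)

# Indexing one item at a time gives the intended pairs.
correct = [ds[i] for i in range(3)]
print(correct)  # [('x0', 'y0'), ('x1', 'y1'), ('x2', 'y2')]
```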
In our single-GPU train, we worked out ahead of time how much of FineWeb (or FineWeb-Edu) to train on -- the Chinchilla-optimal number -- and generated a dataset that contained a round number of 6-sequence, 1024-token batches that was the smallest such round number that was larger than our target. We also worked out exactly how large (in terms of batches) our validation dataset needed to be so that each validation run would take five minutes. There was one big issue with that system; when I decided to do an "extended" train on more of the FineWeb-Edu dataset, in order to see whether I could get the loss down further, I had to do some nasty hackery in order to generate a new one. So it would be nice to not have that problem this time around. Additionally, we're likely to be tweaking the batch size quite a lot in this experiment while we find what the appropriate level is to fit onto the cloud GPUs, and also varying how much validation we do -- and additionally, we have the world size to worry about. I think that the best way to give us the flexibility we need will be to pre-convert the complete FineWeb and FineWeb-Edu datasets into the format we need -- each sequence in the dataset converted to GPT-2 tokens, and then those sequences concatenated together, with the token 50256 (GPT-2's end-of-text token) separating them. It would be good to properly nail down the validation dataset at the same time. So we can have a script that loads up the original dataset as downloaded from Hugging Face, splits it into 99% train, 1% validation, does the conversion, and then saves them as safetensors files. If we use 16-bit unsigned integers for those (which is just large enough for our 50,257-token vocab), we can fit the ~10B tokens in each dataset's train split into 20 GiB of disk. Not too bad. But there will still be the issue of getting them onto our cloud machines. Let's generate the data, and then work out how to handle that. I tried initially with the code I used last time, adapted to run through the entire dataset.
It does the 99%/1% train/validation split, and then for each of those generates a single massive tensor of tokens like this: It almost worked! To my surprise, it got all the way to the end, and only blew up with an out-of-memory error when it was trying to save the result -- and it did that completely silently, so I thought it had worked right up until I tried to check the file on disk to see how large it was, and it wasn't there. The obvious tweak: set the list to just after the , to free up the memory it's using. Given that it was the save that triggered the OOM, you'd think that that would be enough -- but it turned out not to be so. Rather than mess around with this for much longer, I just decided to add on 128 GiB of swap to my machine temporarily: ...and that was enough to make it run. So I've now generated pre-tokenised, pre-concatenated train and validation sets for both FineWeb and FineWeb-Edu: Now, thinking about how to get it up to the Lambda Labs machines. I have normal 1 Gb residential broadband, so conceivably I could upload 20 GiB in about 200 seconds. But that's assuming that there's no network congestion, so I would expect it to take longer. The LL machines are quite expensive, and I don't want to waste money keeping them up while I'm just uploading data. There are possibilities here: I think the best option is to use option (1), but with the option of also doing (2). The HF dataset will still take time to download to LL, even over the faster network connection. That might not be a problem -- but if it is, I download it once on a cheap instance and use a persistent disk too. Essentially I'd be using the persistent disk as a "cache", and still get the benefits of the easily-shareable datasets on Hugging Face. So, that decided, let's find out how we can upload a whacking great 20 GiB safetensors file as a dataset on Hugging Face. 
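Before that, a quick sanity check of the storage and upload estimates above. The sketch assumes two bytes per token, a round 10 billion tokens, and a link that sustains its full nominal 1 Gb/s; none of those will hold exactly in practice, so treat the output as a lower bound on upload time.

```python
from array import array

VOCAB_SIZE = 50_257          # GPT-2 vocabulary size quoted above
N_TOKENS = 10_000_000_000    # "~10B tokens", rounded for the estimate
BYTES_PER_TOKEN = 2          # assumption: 16-bit unsigned storage
LINK_BITS_PER_SECOND = 1e9   # nominal 1 Gb/s, ignoring overhead and congestion

# The largest token ID fits in an unsigned 16-bit slot but not a signed one.
largest_id = VOCAB_SIZE - 1
unsigned = array("H", [largest_id])
try:
    array("h", [largest_id])
    fits_signed = True
except OverflowError:
    fits_signed = False

size_bytes = N_TOKENS * BYTES_PER_TOKEN
size_gib = size_bytes / 2**30
upload_seconds = size_bytes * 8 / LINK_BITS_PER_SECOND

print(f"unsigned itemsize: {unsigned.itemsize} bytes, fits signed: {fits_signed}")
print(f"{size_gib:.1f} GiB, about {upload_seconds:.0f} s at line rate")
```

That lands in the same ballpark as the 20 GiB and 200-second figures, which are rounded up.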
It turns out that resources like datasets on HF are just Git repositories using the LFS (Large File Storage) extension to be able to handle, well, large files. Conveniently, given that I'm using to manage my project, there's a plugin that allows me to use their CLI tools with minimal effort, so: Both datasets show up on my profile page on Hugging Face, so that's looking good. Now it's time to try to upload the data. We'll need to install Git's LFS support first: Now let's try the FineWeb one first: OK, so we need some kind of extra thing to tell it we can use large files on top of the LFS stuff: Right, now let's try again: Weird that it prompted for the credentials twice, but it did appear to try to do something there -- but obviously it didn't work. Let's see if Git over SSH is any better. ...then the same stuff to copy in the files and create the metadata file, then: Looks like the same error. Odd. Let's try using HF's upload tools rather than Git -- feels like a bit of a cop-out, but maybe it'll work better. That did indeed take about 200 seconds to run, but the upload speed was only about 10 MiB/s -- from the output, I think it must have been compressing it. Anyway, it looks like it succeeded, so let's upload the others! ...and that's done :-) Next, a bit of manual editing of the dataset cards on the Hugging Face website, and we have our two new public datasets: That looks solid. So, the next thing: change our codebase so that we have some quick and easy way to download them (I'm feeling a little wary of using Git for that after the upload issue), and then to use the downloaded files in our training code. We already have the code to download a dataset; the stuff that I wrote to download FineWeb and FineWeb-Edu originally.
Here's the important bit: ...so we can adapt that to download all files in an arbitrary dataset: ...and call that from our , using a new command-line argument , and a new element in our train config JSON file: I was thinking that we'd need extra guard code to not download the dataset again if it's already there, but it looks like handles that all nicely for us. So we have a way to specify which dataset we should use for a training run, and code to download it. Now we just need to adjust the code that loads our datasets so that instead of looking in the , it looks in the directory returned by : ...and update the directory so that it just blindly uses the directory provided rather than trying to look in a subdirectory: That all works! We successfully download the datasets and try to use them. Here's the code. But now we have a problem; when the tries to reshape the huge tensor that we have as our inputs: ...it craps out: That makes perfect sense. Our original files were carefully sized for a batch size of six, and 1024-token sequences. We need some way to work out an appropriate slice of both the training and the validation data. Most of the trains are likely to be Chinchilla-optimal, or at least use a Chinchilla-optimal number of tokens -- rounded up appropriately to match our micro-batch size, sequence length, and world size. But I'd like it to be more configurable. What I'll do is add a key to the training config dictionary, along with a so that we can (for example) train on the first Chinchilla-optimal tokens, then do an extended train continuing on from there. The idea is that we can use as a base, and train on the smallest number of full batches that contains at least that many tokens. For validation, I think that the key that we already have is actually quite nice. Validation is time-bound, and the number of batches is the easiest lever to pull to handle that. However, a would be nice for symmetry. So, here are some numbers for debugging: Now let's use them.
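The "smallest number of full batches that contains at least that many tokens" rule can be sketched as follows. This is an illustration rather than the repo's code: the function and parameter names are hypothetical, it assumes one global step consumes one micro-batch on every process, and it ignores the extra token needed for the shifted targets.

```python
def tokens_to_train_on(min_tokens: int, micro_batch_size: int,
                       seq_len: int, world_size: int) -> tuple:
    """Smallest whole number of global steps covering at least `min_tokens`.

    Returns (global_steps, total_tokens). One global step is one micro-batch
    on every process, so it consumes micro_batch_size * seq_len * world_size
    tokens.
    """
    tokens_per_step = micro_batch_size * seq_len * world_size
    # Integer ceiling division, which stays exact for very large token counts.
    global_steps = (min_tokens + tokens_per_step - 1) // tokens_per_step
    return global_steps, global_steps * tokens_per_step


# Hypothetical debugging-sized budget: 100,000 tokens, micro-batches of 6
# sequences of 1,024 tokens, on a single GPU.
print(tokens_to_train_on(100_000, 6, 1024, 1))   # (17, 104448)
# The same budget spread across 8 processes.
print(tokens_to_train_on(100_000, 6, 1024, 8))   # (3, 147456)
```

The overshoot grows with the world size and micro-batch size, which only matters for tiny debugging budgets like these.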
Initially, we have this to load the train dataset: Let's work through that one first then make appropriate changes to the validation one. The pieces of information we need to work out which tokens to use are: Let's update our function so that it takes those parameters in that order: ...and now we can write an updated that uses those numbers to get the right number of tokens: Validation is less obvious; I think that the best way to do this (given that the validation dataset is small) is just to have a "magic" value for , which means "just get a round number of full batches starting at . It's also worth remembering that we only do evals on the rank 0 process, so we could in theory pass in a world size of 1 -- but I think that passing in the real world size might be a good idea, because it gives us one fewer thing to change if, in the future, we move towards distributed evals. ...and we change to be able to handle the magic : I also added in a quick sanity check to make sure that we don't get weird behaviour if the is past the end of the original dataset. That all looks good! Running it kicks off training, and validation is running happily every ten global steps, but just with three samples, as configured in the JSON file. Here's the code . One thing that hasn't shown up while running this code locally is that our training loop has this: With one GPU, that's fine, but on a multi-GPU machine, that is going to happen in all of our per-GPU processes -- so they'll all be spamming out progress bars, which will be ugly. So, as a first cut: Now, in order to compare different machines (say, an 8x H100 vs an 8x A100) it would be nice to get tokens-per-second numbers while training. We can do that in the progress bar too! It has a method that adds stuff to the end of the bar, just after the elapsed time and iterations/second numbers. 
For that, we'll need to have the object available in a variable: ...and now we can count the total tokens seen in the training run, plus keep track of the start time -- just before the start of the training loop: ...then inside, after the training step: That will give us a running average of tokens per second over the train as a whole since the start. Running that, we get a nice progress bar like this (you'll need to scroll to the right): Note that the tokens per second is worse than the just less than 20k that we got when running the single-GPU test previously, but that's due to the testing setup I have -- I'm doing an eval every 10 global steps. Changing that to 1,000,000 so that we just get a single eval when we start, then letting it run for a while to settle down from the initial eval, we get this: ...which is close enough to what we had before. Finally, let's print out some summary information at the end: Ran that on a super-short train with about 50 iterations-worth of tokens, and: Looking good. Here's the code . I think we now have something where it's worth spinning up a Lambda Labs machine to run it. Let's kick off a training run on the cheapest two-GPU machine that they have available right now. That's actually not all that cheap, it's a $6.38/hour 2x H100 80 GiB SXM5. But I'm not planning to do a full train on it yet, this is just a sanity test. I won't attach a filesystem this time, either -- let's see how things go without the caching of the datasets that I was considering. First thing: do we have ? Nope. OK, let's install it: Right, now let's clone our repo and set up our environment: And now I think we can just try running it! It took 18 seconds to download the dataset! I don't think we need to worry about the caching thing with persistent disks, at least at this point. But there are a couple of issues here. I didn't put the number of processes in the command line -- I should be using Also, we don't have the XKCD font family. 
I'll ignore that for now. OK, that's looking good! Let's make our validations happen less often, and see how high we can get the micro-batches with the 80 GiB VRAM we have on each of our two GPUs. Doing a binary chop, I set the micro-batch size to 100 (OOM), then to 50 (OOM), then to 25 (worked), then to 37 (OOM), then 31 (OOM), then 28 (worked), and finally 29 (OOM). So we have a batch size of 28 for our 80 GiB machines. Leaving it for a little while to settle down, and we get to about 142,000 tokens/second. Now, on the 3090, we were training at 20,000 tokens/second. That means that this machine is running at about 7 times the speed. Given that our original train finished in 48 hours, we'd expect the train to finish in about 6, which indeed is the estimated time on the tqdm progress bar. At $6.38 per hour, that comes to $38.28. Not bad! And this instance is actually quite pricey on a per-GPU basis -- it's $3.19 per GPU/hour, whereas there is an 8x H100 that costs $2.99 per GPU/hour. I'm almost tempted to let it run. But the purpose of this run was to work out the bugs. We're going to want to track the training chart -- remember that after every validation run, our training code generates a chart showing the training and validation loss so far, like this one . I ran the normal quick-and-dirty Python webserver command on the instance, inside the directory containing the training chart: My browser didn't connect to it, but looking at the Lambda Labs interface, there's a new "Firewall" section, where you configure rules for allowing incoming connections to your instances. That's sensible, and the default rules are just "allow SSH from any IP" and "allow ping from any IP". Adding one letting anyone access port 8000 fixed the problem, and I saw a directory listing; clicking on the chart showed exactly what I'd expect, but without the XKCD fonts. Nice. Let's work out how to fix that XKCD font thing. 
Looking around, it seems like there are approximately twenty thousand ways to do it. Here's one that seems to work; firstly, install the font on the system: Now, that installs a font that has the family name 'xkcd Script' (with that erratic capitalisation). So we need to change the code to pick up pretty much anything that looks like it's XKCD, so instead of this: ...we can do this: That seems to work OK. So, now, I think we have the beginnings of a script to set up a Lambda Labs machine so that we can use it. Let's write a with this: ...and give it another go on a fresh machine. Shut this one down -- total cost so far $7.28. Now there are no 2-GPU instances available. There is a super-cheap 1x A10 (basically the datacenter version of a 3090), though, so let's use that -- we're as certain as we can be that the multi-GPU stuff works, and the proof of the pudding will be whether we can train a model that works. After spinning up our 1x A10 machine: Looking good! I think we have something that (in theory) should work. That cost $0.05. I think it's time to do our first train on a big instance. There are four 8x instances available on Lambda Labs for me right now: I think I'm going to want to train on all of those, to try to work out some kind of metric (dollars per megatoken?) to compare them. But let's start with something reasonably low-end -- in fact, let's try the cheapest, and see what happens. Spin one up, and first thing; after the setup, we need to work out the micro-batch size. Last time we used 28, but this machine has GPUs with half as much VRAM. I did a binary chop again... it turns out to be 13. Now let's think about validation frequency. Let's try to get a feel for how long it will take. We can set the eval batches to (say) 100, so that we can see how fast evals are, but also set the interval to 10,000,000 so that it never does one after the first.
It took 11 seconds to run 100 validation batches, and after a few minutes, it settles down at 254,000 tokens/second or so, and is estimating 3h15m to completion. Nice! The cards are an earlier generation to the H100s we used in the two-GPU test, so they're slower, and they have half the VRAM. So eight of them are, working together, about twice as fast as two H100s. Doesn't sound completely crazy. So, in our local train, we spent 5 minutes evaluating every 30 minutes. So our eval time was 16% of our train time. Probably a bit high, but let's run with it. If we're going to take 3 hours training time, then 16% of that is about 28 minutes. Previously we did about 88 evals (44 hours train time, with an eval after each half hour). That seems a bit too high. So let's say that we want to do 50 evals. 28 minutes eval time in total, with 50 of them, means about 30 seconds per eval. If 100 eval batches take 11 seconds, let's approximate it to 300 eval batches. As to the interval between them -- if we want to do 50 over 3h15m, or 195 minutes, then that's one every (let's approximate) 4 minutes. We seem to have settled down to 2.57 iterations per second, so that's about every 617 iterations. Let's bake those in and let it rip. After the run: OK, let's download everything. Looking at the checkpoints, the latest (that is, the last one at the end of the training) and best (the checkpoint that had the lowest validation loss) are the same one, meaning that validation loss kept falling consistently: So let's just download using the "best" symlink to get the weights for that checkpoint: And now we can shut the cloud machine down. Now that the clock is no longer ticking and we aren't spending money on an unused machine, here's the training chart: It looks like we had a couple of gradient spikes there. 
I'm going to add some gradient clipping code at some point, but I think I'll hold off for a little bit -- I want to do a few cloud trains first to work out the best instance sizes to use, and only then start exploring the possibilities for making the models better. Apart from that, it looks pretty normal. Looking at the billing page on Lambda Labs, that machine was up for about 4 hours and 35 minutes, costing US$10.32 per hour, for a total cost of US$47.35. Of that 4h35m, 13,904 seconds, or 3h52 was the actual training run -- somewhat more than the 3h15m that was predicted at the start of the run. The validation will have accounted for most of that -- we did 50 evals, at 30 seconds each, so that's 25 minutes. That means that 3h40m is accounted for, and the remainder can just be chalked up to noise, I guess. That leads to one question: do we actually need to be doing validation for these trains? I've been doing validation loops in these trains largely out of habit -- when you're training an ML model, it's just "what you do". The reason you'd normally hold out a validation set is simple: if you're training over multiple epochs, then eventually your model is going to start overfitting to the training data 2 . You validate as you go along so that you can spot any points where, while the training loss continues to drop, the validation loss -- which is loss on data that the model hasn't been trained on -- starts rising. That's the classic indicator of overfitting. But for these models we're not doing multiple epochs -- we're just training through a stream of constantly new tokens. So, in fact, there's no real difference between the training data and the validation data, apart from the fact that the validation data is constant. From the model's perspective, it's all new stuff (modulo any repetitions in the dataset, which is possible but I think not likely to be super-common in something as curated as FineWeb). 
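To make the "classic indicator" concrete, here is a minimal sketch of the check that a validation loop is normally there to support: find the first evaluation at which the training loss fell while the validation loss rose. The function name and the loss values are invented for illustration; they are not results from any of these trains.

```python
def first_overfitting_eval(train_losses, val_losses):
    """Return the index of the first eval where train loss fell but
    validation loss rose compared with the previous eval, or None."""
    for i in range(1, min(len(train_losses), len(val_losses))):
        train_fell = train_losses[i] < train_losses[i - 1]
        val_rose = val_losses[i] > val_losses[i - 1]
        if train_fell and val_rose:
            return i
    return None


# Made-up multi-epoch example: validation loss turns upwards at eval 2.
multi_epoch = first_overfitting_eval([4.0, 3.5, 3.2, 3.0], [4.1, 3.7, 3.8, 3.9])

# Made-up single-epoch example: both curves keep falling, so nothing is flagged.
single_epoch = first_overfitting_eval([4.0, 3.8, 3.7], [4.1, 3.9, 3.8])

print(multi_epoch, single_epoch)
```

A real check would want some smoothing or patience, since a single noisy eval can trip this; the point is only that in a single-epoch train over fresh tokens there is little reason to expect it to fire at all.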
Now, in this post I'm aiming to identify the best options for training in the cloud -- cost in terms of dollars and time. I don't want to change the model itself or the training strategy because I want whatever I come up with to be roughly equivalent to the models I trained on my own machine. Exploring enhancements is for the next post. (Of course, given that the batch size is one of the levers I want to experiment with, and training on larger machines is already meaning that I'm doing micro-batches larger than the batch size of 6 that I used locally, and then the overall batches are 8 times larger, that's not quite true.) Validation, however, doesn't actually affect the training runs in any direct way. I could in theory remove it. However, that is a relatively large change to the code, as I've kind of linked it in with my checkpointing code. I think that what I'll do for now is leave it in. Validation will scale at the same rate as training (so long as I leave the eval batches constant) so leaving it there will give me a clean comparison between machine types. And I can keep notes on how much time was spent on validation for each train so that I can subtract it from the total time if that proves useful. However, when I start tweaking the training code with changes beyond the batch size, I should probably try removing validation first. Anyway, while validation during the training run might not be important, evaluating the model at the end and seeing how it compares to others is! Let's do that next. There were two important post-train evals that I did on the models that I trained locally: There was also a simple smoke test -- how does the model predict that the phrase ...should continue? I should do the same three tests here. A simple autoregressive generation script is easy enough to knock together, and: All we're looking for here is basic coherency, and I think this is good enough to pass that filter. Next, the loss-style testing. 
What I think I want to be able to do here is just take a file and run an eval against a standard dataset. I did not generate my own test set, but I did generate a much-larger-than-necessary eval set, 1% of both FineWeb and FineWeb-Edu -- that's 100 million tokens or so in both cases. In the validation that I was doing during the train just now, I did 300 batches of 1,024 tokens with a micro-batch size of 13. That only ran on the rank 0 process, so that's not even 4% of the validation data. Now, for the local eval, I think it makes sense to make it run for about five minutes -- that's just for my own convenience, I don't want to spend very long -- and I know from the previous local train that I can do 3,200 batches of six 1,024-token sequences in that time: So, somewhat arbitrarily, let's use the 19,660,800 tokens starting at position 50,000,000 in the FineWeb validation dataset for our tests -- they'll never be used for training or validation during the training loop. It's kind of a hack, but it'll do for now. Here's the code . It should be easy enough to understand; it did require one tweak to our existing function, though: Originally, that function worked out the actual number of tokens to use by working out the size of each global batch, dividing our requested minimum number of tokens by that size and taking the floor, adding on one, then multiplying that by the global batch size. That works fine in cases where the is not a multiple of the global batch size -- it gives us a round number of batches that contains at least . But if is already a multiple of the global batch size, it gives us an extra batch at the end. So I added that as a special case in to avoid that. Anyway, running that gives us a loss: That's actually quite a lot lower than we were seeing with the locally-trained models on the test dataset I was using then -- but, of course, it's a different dataset so it's not strictly comparable. 
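The rounding tweak described above is a classic off-by-one, so a small sketch may help. This is not the function from the repo; `tokens_original`, `tokens_fixed` and their parameter names are hypothetical stand-ins that follow the prose description: round a requested minimum token count up to a whole number of global batches.

```python
def tokens_original(min_tokens, global_batch_tokens):
    """The behaviour described as the original: floor, add one, multiply.
    Over-allocates by one batch when min_tokens is an exact multiple."""
    return (min_tokens // global_batch_tokens + 1) * global_batch_tokens


def tokens_fixed(min_tokens, global_batch_tokens):
    """Round up to a whole number of batches, with no extra batch
    when min_tokens is already an exact multiple."""
    n_batches, remainder = divmod(min_tokens, global_batch_tokens)
    if remainder:
        n_batches += 1
    return n_batches * global_batch_tokens


# Local eval from the text: 3,200 batches of six 1,024-token sequences.
batch_tokens = 6 * 1024
wanted = 3200 * batch_tokens

print(wanted)                                  # the 19,660,800 tokens quoted above
print(tokens_original(wanted, batch_tokens))   # one batch too many
print(tokens_fixed(wanted, batch_tokens))      # exactly what was asked for
print(tokens_fixed(wanted + 1, batch_tokens))  # non-multiples still round up
```

The same arithmetic confirms the "not even 4%" figure: 300 batches of 13 sequences of 1,024 tokens is 3,993,600 tokens, against an eval set of roughly 100 million.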
Let's run the same test against them: That's really interesting! Those numbers are really close to the numbers I got in the last post. That does make some kind of sense, though -- while the numbers aren't strictly comparable, as I said, both the dataset that I was using then and the one I'm using now are essentially random stuff from FineWeb, so I guess they must be more similar than I thought. But, importantly, the loss on the newly-trained model is much lower -- 3.674 rather than > 3.9 for all three of the older locally-trained models. Now, the only big difference between this training run and the ones that I did locally is the batch size. As I said in the last post, while I felt that the difference between my batch size of six and the (reported) batch size of 512 for the original GPT-2 was the least-likely cause of the differences in the results, Gemini told me that it thought it was the most likely cause. It looks like Gemini (and, I should note, on Hacker News ) might have been right! Batch size is super-important. Let's do the same eval with the OpenAI weights. I wrote a quick script (in my old 'LLM from scratch' repo, which has the code used in the book) to load up the GPT-2 weights and save them as a safetensors file . When I ran that, I got an interesting error: That was easy enough to fix; in the book's code we assign the weights that have been loaded from the OpenAI TensorFlow checkpoint files with a function called that looks like this: Just adding a call to to the last line fixed the error: ...and as a result, I had safetensors files for the original OpenAI models: So now we can run our test against them: Excellent. Let's start putting together a table of these results: That's pretty amazing. Having a batch size of 13 micro-batches over eight GPUs, or 104 in total, seems to have massively improved the model -- it's much closer to the original weights. 
It will be interesting to see whether I get further improvements when I move to the larger machines, which (due to having more VRAM) will have larger possible micro-batches, so we'll get larger global batch sizes. It certainly makes me think that I could have got much better results locally by using gradient accumulation, which would mimic the effects of a larger batch size by running multiple smaller batches through, without doing an optimiser step each time, then doing one big update once enough has gone through. But all of that is for another day. Let's try the instruction fine-tuning test now. I decided to pretty much re-use my adapted version of the code from the book; that meant that I was borrowing quite a lot of Raschka's code, which he has released under the Apache 2 license . I normally use the MIT license for my code, but I'm not married to it, so I relicensed the whole repo as Apache 2 with some specific headers to say which parts came from "Build a Large Language Model (from Scratch)", and added this code . It downloads the Alpaca dataset from the site for the book, splits it into train/validation/test splits, trains on the training set, evaluating each epoch and bailing out (and restoring the previous epoch's weights) when validation loss starts rising, and then runs through the test set generating responses, and then sends them all off to the OpenAI API for GPT-5.1 to judge them. Running it against our new model gets a score of 17.09. Let's try the various other models and build out our table: Interesting! In the last run, I found the instruction fine-tune numbers came out as FineWeb-Edu extended > FineWeb > FineWeb-Edu, but here we have FineWeb-Edu > FineWeb > FineWeb-Edu extended -- exactly the opposite! I do have to wonder, though, how precise a measure this is. 
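On the gradient accumulation idea mentioned above: the reason it can mimic a larger batch is that the gradient of a mean loss over a big batch equals the average of the gradients over equal-sized micro-batches. Here is a toy, pure-Python sketch of that equivalence using a one-parameter model `y = w * x` with a mean-squared-error loss. Everything here (the model, the data, the function names) is invented for illustration and is unrelated to the training code; in a real PyTorch loop the same effect comes from calling `backward()` on several scaled micro-batch losses before a single optimiser step.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches, scaling each
    by 1/n_micro so that the sum matches the full-batch gradient.
    Assumes len(xs) is a multiple of micro_batch_size."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mse_grad(w, xs, ys)                               # one batch of 4
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # two micro-batches of 2
print(full, accum)

# One optimiser step using the accumulated gradient (plain SGD, lr assumed).
lr = 0.01
w_after = w - lr * accum
```

The trade-off is wall-clock time: the micro-batches run one after another rather than in parallel, so the larger effective batch is paid for with proportionally fewer optimiser steps per hour.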
While the training should be fairly consistent (though I don't have a random seed in there to enforce it), the fact that we're using an LLM as a judge means that there is an element of randomness coming in here. Indeed, I re-ran the FineWeb-Edu extended train test, just to see what I got, and it came up with an even-worse 12.12. So I don't think we can read a huge amount into these numbers -- well, unless we can get the numbers significantly up. While it looks like a 2.5-point difference might just be randomness, I doubt that a 10-point difference could be. I think we've done the tests that we need for this model now, and we have a testing procedure in place. So let's train some further models on different instance sizes, and gather numbers. This is the biggest machine available on Lambda Labs right now, and is only sporadically available; one happens to be there now, so let's give it a go. First, we need to create the runs/8xb200m160 directory, initially with a that is a clone of the one I did for the last train, , then spin up the machine. As before, we need to log in, clone the repo, then in it run the script, run , and try to run the script: It crapped out because there was no datasets directory, which is an annoyance. We should create it if it doesn't exist. Create the directory, and run it again. It took a while to download the dataset, because every per-GPU process downloads it separately. That only took a minute or two, but it was a waste of time; I think we should only download it from the rank 0 process with some barriers to make the other processes pause. Next, we need to do a binary chop on the micro-batch size, starting with a low of 13 (which I know will be fine because it worked on the 40 GiB GPUs that we used last time), and a high of 100 (fairly random, just something I'm pretty sure will fail). While doing that, a few things are standing out, both to do with validation. 
When the script starts, it does one training iteration, then goes straight into validation. Then it starts the training run proper. However: We're going to need to work out some kind of fix for that, because it's taken me 17 minutes from spinning up the machine to getting a size for our micro-batches -- which happens to be 64. On a machine that costs US$39.92/hour, that's an expensive test! We'll look into that later. Anyway, a batch size of 64 is pretty neat, as with 8 GPUs, that means we have a global batch size of 512 -- exactly the same as in the original GPT-2 paper! So, let's kick off the train. It takes about 7 minutes to get to the first checkpoint, at which point it's averaging 801,221 tokens/second. That pattern repeats, and with about one minute to do the validation, we're spending about 12.5% of the time on this machine validating. Hmm. A further indication that we might want to remove the validation stuff if it's not adding on any value. Eventually, it finishes: So, that's 1h9m50s. The final validation loss is not as good as the previous run on the 8x A100 40 GiB machine, where we got down to 3.675. Given that we're using the same validation dataset as the previous, that's meaningful: this is not as good a model, it seems. Again, latest and best checkpoints are the same one: So we can download everything: ...and here's the training chart: OK, so that's smoother than the last one -- no loss spikes. Maybe the larger batch size smoothed them? Let's think a bit about the cost of this train. From Lambda Labs, we had that machine running for a little over 1h30m. At US$39.92/hour, the total cost was US$60.25. Yikes. So, knocking off the 1h10 or so for the train, we have 20m to allow for -- which matches up quite well to the 17 minutes of fiddling with batch sizes, and then 3 minutes to download all of the files. If this blog post isn't going to cost significantly more than it needs to, we need to get that down. 
Of the US$60.25, just over US$13 was spent on identifying the batch size. Only US$46.57 was spent on the train itself. We also did 11 validation runs as part of that; at a minute each, those cost US$7.32. So, excluding validation, we're below US$40 for the train. Now, let's run our tests. First, the smoke test: we get this: "...on all other website for..." is a bit rubbish. Still, on to the loss: That's in line with the training loss -- worse than the loss I got with the one trained on the smaller machine, with its corresponding smaller batch size, but still better than any of our local trains. Still interesting, though -- larger batches are not guaranteed to get bigger results. More investigation needed there! On to the instruction fine-tuning test. That gives us a score of 13.89 -- the worst that we've seen yet! I think I'll put together a full table including these results later; I want to try training on some other, differently sized machines first, and we can aggregate the results at the end. But before we do that, let's make some changes to the scripts to fix some of those QoL issues we encountered in that last train. The first irritation was that it errored out saying that was not a directory when it didn't exist. The script takes a datasets directory as one of its command-line options, and it's reasonable that it checks that it really is a directory (rather than, say, a file or a symlink): ...but if it doesn't exist, it might as well create it first. Now, I could just put this before the check: ...but remember, this code is run by multiple processes -- so they could easily trip over a race condition here. What I want is to have just one of them do this; I've deemed the rank 0 process the "special" one for validation, printing the progress bar, and so on, so we may as well treat it that way here. But -- there's a difference! Rank zero is the one that should be printing stuff out, it's true. And right now, we only have one node participating in this train. 
But I do want to avoid simple errors that would make it hard to run multi-node in the future. Now, if we have multiple nodes, then each one will have its own filesystem (unless we're using NFS or something like that), so we'll need a separate "datasets" directory for all of them. What we want is to do these checks on one process on each node. Usefully, we have the variable that is defined earlier in , which is per-node. Again, let's imagine we have two nodes with two GPUs each. Node 0 might be running the processes with global rank 0 and 1, and node 1 might have global ranks 2 and 3. On node 0, the processes would have local ranks 0 and 1 respectively, but on node 1, they'd also be local ranks 0 and 1. So, the full code becomes this: Note the barrier; we don't want the other processes to check whether is a directory until the local rank 0 process has had a chance to create it. (Of course, if we were running this on a setup where all of the nodes shared a filesystem, it wouldn't work -- in that case we'd want to use the global rank that we can get from instead. But we can burn that bridge if we ever come to it ;-) Phew, that was a bit more work than I expected! But it sets us up nicely for the next QoL fix on my to-do list. I don't like the fact that every process downloaded the whole dataset. The actually handled it pretty gracefully -- none of the processes tripped over any of the others. Indeed, it looks like there was some kind of global queueing going on, so they downloaded it one after the other. But it did take time -- maybe a minute or two in total, and with the clock ticking on that ~US$40/hour machine, that felt a bit stress-inducing. So: I think it would be best to only do that from the rank 0 process as well. The code that downloads the dataset is just after the bit we've been looking at: ...and looks like this: Now, the docs for say that the parameter is: If provided, the downloaded files will be placed under this directory. 
...and the return value is this: We happen to be passing in a object for , and we're not in mode -- it defaults to . So all we're doing by returning that wrapped in a object is a slightly indirect way of returning the path that we're passing in as . For tidiness, I really want to gate the call to in with the same rank stuff as we did for the directory creation. So, let's change the setup so that takes the path to the directory where we want this specific dataset to be, not the generic "all datasets" directory. And given that we're now passing this specific path into the function, we don't need to return it: Now it's just a wrapper around a single call to , which I'm not entirely sure about (it's a code smell that I'm probably creating an unnecessary level of abstraction) but I think I'm happiest leaving it that way for now, as it does hide away a bit of messiness in the HF hub API. 3 That means that we can now combine the directory-checking logic that we fixed above with download-on-local-rank-zero-only code like this: Here's the updated code with those fixes. Now, let's move on to validation. I'm increasingly of the opinion that the validation steps are just adding on to the cost without much in the way of benefit. Additionally, the validation is taking a different amount of time for each batch size, and happens a different number of times in each train -- remember, it's batches every global steps, and the batch size varies based on the micro-batch size, which is different for different amounts of GPU VRAM, and the total number of global steps in a train also varies based on the size of each batch. 
So that means that if we want to compare apples to apples in any final comparison of the time and money cost of training models on different kinds of Lambda Labs machines, we'll want to exclude the validation cost -- once we've settled on a machine type, we're going to want to fine-tune the validation size for that in much more detail than I have to date, assuming we don't drop it entirely. However: I'm loath to make such a fundamental change halfway through this comparison. It's tightly coupled to the checkpointing code, and the charting code, and so on. So I think that for this post, I'm just going to keep it there, and keep track of how much time (roughly) we're spending on each validation step for each train, so that we can remove it and get a "pure" train-time only comparison between the different kinds of machines. It's not pretty, but I think it's better than changing horses mid-stream. On the other hand, the validation is a real pain when doing the binary chop to find out the maximum micro-batch size for our VRAM before we start the training run. That's because we have to wait for one validation to run before we get into the full training loop, which makes it slower. On top of that, having to do a manual binary chop is a PITA. What I think would be a true QoL improvement for the future trains is something that does the binary chop for us, using a dummy training loop. We run it once on each new machine type, get a micro-batch size to plug into our training parameters, and then let it rip. This will re-use so much of the code from the training script that I think it actually is just an alternative way of running it. 
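The core of an automated binary chop is small enough to sketch on its own. What follows is not the code from the repo, just a minimal illustration under stated assumptions: `fits` is a hypothetical callable that reports whether a given micro-batch size survives a dummy training step (in practice it would typically run one and treat a CUDA out-of-memory error as "doesn't fit"), and fitting is assumed to be monotonic, so that if a size fails, every larger size fails too.

```python
def max_fitting_batch_size(fits, low, high):
    """Binary chop for the largest batch size that fits.

    Assumes fits(low) is True, fits(high) is False, and that fitting is
    monotonic in between. `fits` is a hypothetical probe supplied by the caller.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid    # mid works, so the answer is mid or larger
        else:
            high = mid   # mid fails, so the answer is below mid
    return low


# Stand-in probe for illustration only: pretend anything up to 64 fits,
# matching the manual result on the 160 GiB cards.
probes = []

def fake_fits(batch_size):
    probes.append(batch_size)
    return batch_size <= 64

best = max_fitting_batch_size(fake_fits, low=13, high=100)
print(best, probes)
```

With the low of 13 and high of 100 used in the manual search, this needs only a handful of probes, which is where the time saving comes from once each probe no longer includes a validation pass.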
After a bit of hacking, I came up with this updated code -- the diff is a bit hairy, but essentially: That takes just over six seconds to find the correct batch size on my local machine; with multiple GPUs, I expect it will be slower (there's a spinup overhead to start all of the per-GPU processes), but I'm sure it won't be as bad as the manual binary chops with validation that I was doing, and will be less error-prone. Right! We've done some QoL stuff, let's try another machine size on Lambda Labs :-) These are the machines that Andrej Karpathy is recommending for training nanochat, so let's see how we do with them. They cost US$23.92/hour; let's see how it works out. Here are the steps: Now let's download our dataset and find our micro-batch size: That took less than a minute to run -- nice! Now we can put that micro-batch size in . It does seem a little small -- after all, we could fit a batch of 64 into 160 GiB -- but I'll do some analysis later. Actually, before we kick off the train, let's see how long all of the preparatory steps took to run before we can do that -- not just the micro-batch-size script, but also the installation of the dependencies, the clone, and any overhead from boot time etc: Five minutes total. Not bad. Let's start the train: The initial validation run took 38 seconds, and then we started off. At 4m37s in, we get the first real validation run; at that point, it's running at 493k tokens/second. Eventually, it finishes, having taken about 1h50 including all of the validations. Here's the training chart: Two things stand out here: Further evidence that gradient clipping is likely to be an excellent addition to our training loop! It's also worth noting that the train loss spikes at the same time as the validation loss, so getting rid of the latter would still allow us to get a "best" checkpoint to compare with the latest at the end of the train. The machine was up and running for 2h9m, costing US$23.92/hour, for a total cost of US$51.47. 
The train took 6,650.197 seconds, so about 1h50m. Allowing for five minutes setup time, that's 1h55m accounted for. There's an extra 14m there -- that was because downloading those two checkpoints to my machine took quite a long time due to local network issues. Might want to look into ways to avoid that later. And for later cost-accounting purposes, we should note that it took 38 seconds or so for each validation run, and we can see on the chart that there were 24 of them. So, firstly, let's give our two models -- the best one and the latest one -- a smoke test: Both of those look OK! Now let's try the loss test. I started running it, but when it started downloading the dataset, I realised that it needed updating to allow for the changes I made to -- ooops! That done, let's give it a run for both of our models: As you'd expect, the best checkpoint has somewhat better loss, at 3.725, than the last one, with 3.734. Once again, better than our local trains, but not quite as good as the result with the first cloud train on that 8x A100 40 GiB machine, which was 3.674. Again, I'll put together a table comparing all of these results at the end. Does that make any real difference with the instruction fine-tune test? The test prints a lot out, but the headline numbers: So that was interesting! However, I am getting ever less convinced that the IFT test is a useful one; the randomness of the LLM-as-a-judge responses means that I don't think it can be consistent. Perhaps a better way to do this would be to batch up all of the models, and then give GPT5.1 answers from "model A", "model B", and so on all in one query, and then to ask it to give them scores all at the same time. That would hopefully make things at least a bit more consistent. Something to ponder later, I think. 
In the meantime, one extra thing I wanted to dig into before going on to the last train for this post: I mentioned that I thought that the batch size for that last run, 27, was a bit small considering that we'd managed to fit a size of 64 into the 160 GiB/GPU machine. But after thinking about it for a bit, it occurs to me that during my experiments doing fine-tuning, I came to the conclusion that memory use scaled linearly with batch size , with a fixed amount per element in the batch (the activations for the model for that batch element), plus an overhead (the model itself, the optimiser, and perhaps other stuff). We have batch sizes for: Now, that is slightly messy data because each memory "measurement" is the size of the card's VRAM, not the amount of VRAM we actually used -- there might have been anything from zero to just less than one extra batch element's worth of "spare" space -- but we can see what we get with a simple linear regression: And if we plot that, we get this: Nice! That fits really well. So we have an overhead of about 11.5 GiB, then about 2.35 GiB per batch element on top of that. That is, of course, somewhat sad news for anyone trying to repro this on a GPU with 12 GiB -- looks like it would be just too small to even fit in a single-element batch after the overhead :-( Anyway, that's been a bit of a side quest. Let's try our last machine size for what has (once again) turned into a bit of a monster of a blog post... This is the same kind of instance as the first train in this post, except that it has double the VRAM per GPU. Let's see what we can do with it. Once again, we create the run file, commit and push, then spin up the machine. On it, we clone the repo, run then . Next, we can find our micro-batch size: Interesting, we managed to squeeze an extra one in compared to the H100's batch size of 27, despite having exactly the same amount of VRAM! Not sure what might have caused that. 
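As a side note on the linear fit described above, an ordinary least-squares line is simple enough to compute without any libraries. This sketch is not the regression code from the post. Three of the data points come straight from the text (micro-batch 13 on 40 GiB, 27 on 80 GiB, 64 on 160 GiB); the fourth, a batch of 6 on a 24 GiB card for the local machine, is an assumption on my part about the local GPU's VRAM, included because it reproduces the quoted overhead and per-element figures.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (micro-batch size, VRAM per GPU in GiB)
points = [
    (6, 24),    # local machine: 24 GiB is an assumption, not stated in this section
    (13, 40),   # 8x A100 40 GiB
    (27, 80),   # 8x H100 80 GiB
    (64, 160),  # 8x B200 160 GiB
]

gib_per_element, overhead_gib = fit_line(points)
print(f"overhead ~{overhead_gib:.2f} GiB, ~{gib_per_element:.2f} GiB per batch element")
```

As the text notes, the y-values are card capacities rather than measured usage, so the fit is an upper bound on real memory use and should not be trusted to within a batch element or so.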
It took 4 minutes to get to this point, so let's get that batch size into the config and kick off the run. The initial validation takes 1m06s, which is consistent throughout the train. The first real val run comes at 8m15s in, and the estimated train time is 2h35m, with a tokens-per-second of 286,188. At the end: Again, the latest and the best global steps are the same (despite some loss spikes): ...so we just need to download that and shut down the machine. How much did that cost us? The machine was running for 3h25m, costing US$14.32 / hour, for a total of US$48.76. Our train took 11,532 seconds, which is 3h12m, and our setup took about 4 minutes -- maybe five including the time required to update the train config with the micro-batch size, so we have 7 minutes on top of that, which is about the amount of time it took to download the model. Let's run some evals! Our smoke test gives us this: Coherent enough, I think! Now the loss on our test dataset; it comes out as 3.730, so pretty similar to our other cloud trains, apart from the oddly-low one on the 40 GiB GPUs. Now let's see what GPT-5.1 thinks of the instruction fine-tuned version. It only needs two epochs of fine-tuning, and believes that "The author of 'Pride and Prejudice' is 'Pride and Prejudice'", which is not promising, and gets a score in the same kind of range as the other models, 11.71. So: we've trained four models on four different machine sizes. Let's see how they stack up against each other, against our locally-trained models, and the original OpenAI GPT-2 weights. So, I've trained four of my 163M-parameter GPT-2 models, using almost exactly the same dataset -- the Chinchilla-optimal number of tokens, rounded up to make an even number of batches. 
I did this on four different multi-GPU machines on Lambda Labs: I've done some evals on each of the models, so let's put those results together in one table -- results for the trains in this blog post, alongside those for the original OpenAI GPT-2 weights, both small and medium, and for the models I got when training locally. For all models, I've provided: I've sorted the models in order of increasing loss on the test set -- so, the best model by that measure is first. The instruction fine-tune results are kind of all over the place, and I'll look into that later 5 . For now, let's focus on the test loss. We have a pretty clear pattern, where the local trains are grouped together at around 4.0, and the cloud trains at around 3.7. For the local trains, as I noticed last time around, FineWeb is counter-intuitively better than FineWeb-Edu. There are two interesting things about the cloud trains: I think that what we're seeing here is that larger batches are better, but only up to a point. It's as if there's some kind of curve like this: I got that by taking the log of the batch size, then asking NumPy to do a polynomial regression -- that is, work out a , b and c so that the formula ...fits it as well as possible: It's kind of interesting that it's such a good fit with such an ad-hoc formula! We have a nice smooth curve hitting almost all of the points, and our optimal batch size looks like it's just a little below that 104 we managed with the smaller cloud machine, at about 97. But it's certainly not something that I'd like to read too much into. Best to treat it as purely illustrative: "it might be something like this". I think digging into that might be an interesting experiment at some later point. A bit of checking around the Internet (and a chat with ChatGPT) suggests that it's something people have looked into in some detail, unsurprisingly. 
An interesting point ChatGPT raised is that with our pretty much fixed "budget" of tokens -- we're always training on something close to the Chinchilla-optimal number -- then a larger batch size means that we're doing fewer optimiser steps. Intuitively, that sounds like a problem. The larger batches mean that each move across the loss landscape is "better", or at least more stable. But we're doing fewer of those moves over the course of the train. There's obviously a tension between those two. You can imagine a degenerate case where the batch is so large you can fit the entire run into one iteration, so you do just one update of the parameters; that obviously wouldn’t work very well. Anyway, for the purposes of this post, let's flag it as interesting and move on. Let's take a look at costs. Here's another table for those -- for each cloud model, I've listed: What do these numbers tell us, given what we were trying to do here? Like I said at the start, this was a pretty expensive learning experience: I wound up spending US$215.16 on Lambda Labs instances over the course of putting this all together. But it was worth it! At the start of this post (if you can remember so far back), I said I wanted to achieve two things: Yes, absolutely. The trains I did, if we exclude the validation time, each cost between US$35.56 and US$39.14. In time, also excluding validation, the slowest ran for about 3h25m, and the fastest just less than an hour. Now, in a future post I want to try making the changes that I listed at the end of my last post to see if I can get the loss lower: If I'm to do those, what I'll need to do is start with a baseline train on one particular size of machine, and then try introducing each change separately to see what happens to loss. I'll want to use a fixed seed for random number generation, so that I start with the same initial weights each time. 
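On the batch-size versus step-count tension discussed above, a quick calculation shows how sharply the number of optimiser steps drops as the global batch grows under a fixed token budget. This is a sketch under assumptions: the budget is taken as 20 tokens per parameter for a 163M-parameter model (the usual Chinchilla rule of thumb; the exact token count used for these trains isn't given in this section), and the helper name is made up.

```python
SEQ_LEN = 1024                    # tokens per sequence, as in the text
TOKEN_BUDGET = 20 * 163_000_000   # assumption: ~20 tokens per parameter


def optimiser_steps(global_batch_size, token_budget=TOKEN_BUDGET, seq_len=SEQ_LEN):
    """Number of optimiser steps needed to consume the token budget,
    rounding up to a whole number of global batches."""
    tokens_per_step = global_batch_size * seq_len
    return -(-token_budget // tokens_per_step)  # integer ceiling division


# Global batch sizes from the text: local (6), 8 x 13 = 104, 8 x 64 = 512.
steps_local = optimiser_steps(6)
steps_104 = optimiser_steps(104)
steps_512 = optimiser_steps(512)

for size, steps in [(6, steps_local), (104, steps_104), (512, steps_512)]:
    print(f"global batch {size:>3}: {steps:,} optimiser steps")
```

Going from a global batch of 104 to 512 cuts the number of parameter updates by roughly a factor of five, which is at least consistent with the idea that there is a sweet spot somewhere between "too noisy" and "too few steps".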
Given what these experiments have already shown about loss -- that the smallest, cheapest machine has better loss than the other more expensive ones due to what I assume is the batch size -- then that actually feels like exactly the right machine to choose for this. It does take a while to train anything, but three and a half hours is pretty acceptable, I think -- I can do a train or two per day. An 8x A100 with 40 GiB VRAM per GPU is the way forward. So: next steps. I want to: This is going to be fun. Stay tuned! I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit . The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.  ↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩ I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right. DataParallel (DP). With this: The default GPU (normally ) is in charge of the process. 
It gets a batch of data, divides it up into per-GPU "micro-batches", and sends each of those to a thread for each of the other GPUs. It then sends an up-to-date version of the model to each GPU. Next, all of the per-GPU threads do a forward pass on their replica using their specific micro-batch, and send their outputs to the thread for the default GPU. The default GPU thread aggregates all of those outputs (similarly to how the losses across all of our batches and the prefix sequences are aggregated in the normal single-GPU case ) to work out an overall loss. It then does a backward pass. This will start on the default GPU, as the aggregation step is the first thing that it will come to when going backwards through the steps that came up with that overall loss. However, it will then come to operations that happened on the other GPUs and those are (somehow) parallelised. Once that is done, each GPU has gradients that represent how their copies of the model contributed to the overall loss. Finally, they send those gradients back to the default GPU, which combines them (I think of this as just being an average, though I gather it's more complex) and applies them, producing an updated model. Then the process repeats; the updated model on the default GPU will be sent to the other GPUs in the second step of the next iteration. DistributedDataParallel (DDP). This does less work on the default GPU and does less copying around. Each GPU has its own process (rather than thread), and is essentially responsible for its own training loop. Right at the very start, the default GPU's process sends the model to all of the others. Then all processes go into their training loop: Firstly, each one works out its own micro-batch (which means you need to have code to make sure that the datasets are properly split across the GPUs) Each model does its own forward pass, then its own backward pass, working out its own independent gradients. 
As it comes up with those gradients, it broadcasts them to a "reducer", which handles the aggregation. This is done in a distributed way -- there's not just one reducer handling everything. When all models have completed the backward pass, the reducer has a set of combined gradients, which is visible from the per-GPU processes. Each GPU process does its own optimizer step using those combined gradients. That means that there's no model copy required -- each GPU has applied the same gradient update, so they already have in-sync models, assuming everything went well. ZeRO. This is a much more complex system, and I went into how it works in this blog post . , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) = 0 for the process with rank 0 = 1 for the process with rank 1 = 7 for the process with rank 7 = 8 for the process with rank 0 = 9 for the process with rank 1 = 15 for the process with rank 7 Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack" But if is set to 2, which it happened to be in my case, then it will silently fail -- our first eval loop will get the first X from the validation set as , and the second X as . Zoom through the records in the dataset in batches of 1,000. 
For each batch: Tokenising each batch, so we get a list of lists of tokens. Convert that list of lists into a single list tokens separating each item. Convert that list into a PyTorch tensor. Add the tensor to a list. After that's all done, use to convert the list into a single tensor, and then save that with . I can upload the datasets to Hugging Face; their network connection will be better than mine, so I can just pay the price in time of uploading everything from home once, and then I can download them faster from HF to LL. That also has the benefit of meaning that after this experiment I can safely delete the local files, but then download them again if I need them. And if anyone else wants to repro this experiment, the data will be easily available to them. Lambda Labs have persistent filesystems that you can use. They cost $0.20/GB/month, so that would be about $5/month for all of my datasets. So I could upload the data to a cheap instance with a persistent filesystem mounted, shut down that instance but keep the filesystem, and then mount it on each machine I use to run tests. . The world size -- that is, how many per-GPU processes are we running? The micro-batch size The sequence length An 8x B200, with 160 GiB per GPU, at $39.92/hour An 8x H100, with 80 GiB per GPU, at $23.92/hour An 8x A100, with 80 GiB per GPU, at $14.32/hour An 8x A100, with 40 GiB per GPU, at $10.32/hour The loss they got on the validation set from the first train. Strictly speaking, I was kind of cheating and using that as a test set. The score given by the OpenAI GPT 5.1 model for an instruction-following dataset. This was the one provided in the book -- an Alpaca-style Q&A dataset, with a well-defined train and test set. Each model was fine-tuned on a training set of 85% of the data until loss on a validation set of 5% of the data started rising, and then tested on the remaining 10%. 
Sebastian Raschka, being a pro, was splitting up the data properly :-) If we're going to do validation then it does make some sense to do one at the start -- but doing one training iteration first seems kind of arbitrary (though it's clear how that drops out of the existing code). The validation runs on this machine are taking longer than they were on the less-powerful A100 GPUs! That confused me for a bit, until I realised that I hadn't seen any slowdown with the batch-size 13 test, only with the larger ones later in the binary chop. If we're using larger batches, then there's more work to do for the validation. Doing this binary chop by hand is annoying and error-prone, and worse, we have to wait for one of those (long) validation runs before we get into proper training. The initial training iteration can succeed, while later ones hit memory limits -- it seems like we need to wait for three or four training iterations before we can be sure that we have a workable batch size. I'm not quite sure why that is; perhaps it's something in the optimiser or the scaler? If : Local snapshot path. If : A list of DryRunFileInfo objects containing download information. I updated the function so that it takes flags to tell it whether or not to do validation (default true) and an optional maximum number of steps, which is by default. With those default values, it does exactly the same as before, of course. I created a function, which does all of the dataset-loading stuff that the original function did, and then calls with a -wrapped model. So that maintains the current flow. Next, I added a flag to the script; if that's not set, it just calls . However, if it is set, it instead calls a new function, which determines the largest batch size we can fit onto the current hardware for the current run, and (on the rank 0 process only, to avoid log spam), prints it out. 
does what it says on the tin; it confirms that we can train with batch size of 1, and that we can't with batch size 70 (chosen because the limit was 64 on that massive B200 machine), then chops between them to find the largest batch size that doesn't OOM. It uses for that -- that just constructs a dataset with the appropriate batch size, then runs a three-step train with no validation to see if it raises an OOM. PyTorch rather messily just raises a generic for those, but we can look inside the exception's message to see if it is an OOM. Create the run file, commit and push. Spin up the machine. On it: Clone the repo We had two nasty loss spikes. As a result of the second of those, the best iteration as per validation loss is not the last one. Best checkpoint: 4 epochs of fine-tuning, and a score of 11.98 -- another record low! Amusingly, it confidently said "The author of 'Pride and Prejudice' is Sarah Palin". Latest checkpoint: 5 epochs of fine-tuning, and a rather good score of 17.91. 24 GiB locally, which was 6 40 GiB in the first train in this series, which was 13 80 GiB in the last one, giving us 27 160 GiB in the one on the huge machine, giving us 64 An 8x A100 40 GiB An 8x A100 80 GiB An 8x H100 80 GiB An 8x B200 160 GiB The loss on my test set. The results it got on an instruction fine-tune test based on Sebastian Raschka's. The global batch size (that is, for single GPU runs, just the batch size, but for the multi-GPU ones, where each batch is made up of per-GPU micro-batches, the per-GPU batch size times the number of GPUs). 4 They're all consistently better than the local ones. The one on the smaller machine is better than the ones on the larger ones; indeed, it looks like the larger the machine, the worse. How long the training run took. How much the machine cost per hour. How much the training run cost. How much of that was doing validation (which I'm now thinking is pointless on single-epoch trains like this). 
How much it would have cost, and how long it would have taken if it had been run without validation. I wanted to learn how to change a simple single-GPU training loop to make it multi-GPU. Could I get the training time for a full base model down from 48 hours to something more manageable -- and, hopefully, not too expensive? Removing dropout Tweaking the learning rate (and maybe adding the warmup and cosine learning-rate decay stuff I've read about). Reverting the architectural differences between our model and the original GPT-2: reintroducing weight tying between the token embeddings and the final linear layer, and also bias in the attention weights. Trying full-fat 32-bit precision. Fixing the exploding gradients issue with gradient clipping. Dig in to the instruction fine-tuning tests a little more -- as I've said above, I'm not 100% happy with how comparable it really is between models, at least given how I've been running it so far. Upload the models we have to Hugging Face. I have a new motherboard ready for my PC, and replacing the old one has a risk that I might mess up and break the NVMe drive I have them stored on. I was holding off on this because it would mean sharing Raschka's GPT code, but having noticed that he's already licensed it all under the Apache license, I can release them under the same one. Strip out the validation stuff. We can use training loss to track our progress, and losing evals during the train will help keep the cost down. Finally, do the trains to see how each of the levers above affects loss.
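The batch-size binary chop described earlier in this post can be sketched in a few lines. This is a minimal illustration rather than the code from the repo: `find_max_batch_size`, `looks_like_oom` and `make_fake_trial` are hypothetical names, the "trial" here is a stand-in that pretends to run out of memory above a chosen limit instead of running a real three-step train, and the use of `RuntimeError` with an "out of memory" message is an assumption about what the real failure looks like.

```python
def looks_like_oom(exc):
    """Hypothetical helper: decide from the message whether an error is an OOM."""
    return "out of memory" in str(exc).lower()


def make_fake_trial(limit):
    """Build a stand-in for 'run a short train at this batch size'.

    A real version would build the dataset and run a few training steps;
    this one simply raises an OOM-looking error above `limit`.
    """
    def fits(batch_size):
        try:
            if batch_size > limit:
                # Assumed shape of the error; the message text is a stand-in.
                raise RuntimeError("CUDA out of memory (simulated)")
            return True
        except RuntimeError as exc:
            if looks_like_oom(exc):
                return False
            raise  # anything that is not an OOM is a real bug
    return fits


def find_max_batch_size(fits, known_good=1, known_bad=70):
    """Largest batch size for which fits() is True.

    Assumes that if one size fails, every larger size fails too.
    """
    if not fits(known_good):
        raise ValueError(f"batch size {known_good} does not fit")
    if fits(known_bad):
        raise ValueError(f"batch size {known_bad} fits; raise the upper bound")
    low, high = known_good, known_bad  # invariant: low fits, high does not
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


print(find_max_batch_size(make_fake_trial(64)))  # 64
```

Checking both ends first means a broken setup (where even a batch size of 1 fails) shows up as an error rather than as a misleading result.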

Giles's blog 2 months ago

Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090

Having worked through the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware? The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series , I did some naive scaling of numbers I'd got when fine-tuning LLMs and came to the conclusion that it would be impossible in a reasonable time. But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC. Additionally, Andrej Karpathy recently announced nanochat , "the best ChatGPT that $100 can buy". He mentions on the main page that he's trained a model called , with 32 Transformer layers, which has 1.9B parameters, for about $800. His smaller 20-layer model, with 561M parameters, he says should be trainable in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the $100 total price. What's even more interesting about nanochat is that it's built with PyTorch; initially I'd got the impression that it was based on his pure C/CUDA , which I would imagine would give a huge speedup. But no -- he's using the same stack as I have been in this series! Karpathy's models are both larger than 163M parameters, so it definitely sounded like this might be doable. 
Obviously, I'm nowhere near as experienced an AI developer, and he's using a larger machine (8 GPUs and each of them has > 3x more VRAM than mine), but he's also including the time to train a tokeniser and instruction fine-tune into that four hours -- and his smaller model is more than three times larger than mine. So that should all help. This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project. But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2 small sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs. Here's the full story. For this project, I want to use the exact same model code as Raschka presented in the LLM from scratch book -- my copy here . There have been a number of architectural improvements to LLMs since GPT-2, but for now it's best to keep things simple. But there are still some settings to decide on. The config dictionary for the models we've been using has these parameters: There's also the aspect of weight-tying -- the original GPT-2 reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits . There's nothing in the code we've been working with to enforce that, though -- when we do our small train in the book, we're using independent weights for each of those steps. The only time it is "enforced" is when we download the pretrained weights from OpenAI, where we put the same values into both the embedding matrix and the final output head. Given that Raschka says that it's in general better to avoid weight-tying, and actually doing it would be harder than not doing it, then it seems a no-brainer to not do it. So, what does that mean about our model? 
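As a rough cross-check, the parameter count can be tallied by hand. This sketch assumes the standard GPT-2 small hyperparameters (vocabulary of 50,257, context length of 1,024, embedding size of 768, 12 layers) and a layer layout like the book's model: no bias on the query/key/value projections, a bias on the attention output projection and the feed-forward layers, a 4x feed-forward expansion, LayerNorms with scale and shift, and an output head without bias. `gpt2_param_count` is a name made up for this illustration.

```python
def gpt2_param_count(vocab_size=50257, context_length=1024, emb_dim=768,
                     n_layers=12, qkv_bias=False, tie_weights=False):
    """Count trainable parameters for a GPT-2 style model (assumed layout)."""
    token_emb = vocab_size * emb_dim
    pos_emb = context_length * emb_dim

    # Attention: three projections (optionally with bias) plus the output projection.
    qkv = 3 * (emb_dim * emb_dim + (emb_dim if qkv_bias else 0))
    attn_out = emb_dim * emb_dim + emb_dim

    # Feed-forward: expand to 4x the embedding size, then project back.
    ff_hidden = 4 * emb_dim
    feed_forward = (emb_dim * ff_hidden + ff_hidden) + (ff_hidden * emb_dim + emb_dim)

    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * emb_dim)

    per_block = qkv + attn_out + feed_forward + layer_norms
    final_norm = 2 * emb_dim
    # With weight tying the output head reuses the token embedding matrix.
    out_head = 0 if tie_weights else emb_dim * vocab_size

    return token_emb + pos_emb + n_layers * per_block + final_norm + out_head


print(f"{gpt2_param_count():,}")                  # 163,009,536
print(f"{gpt2_param_count(tie_weights=True):,}")  # 124,412,160
```

The gap between the two numbers is the 38.6M-parameter output head, which is why the untied model is noticeably bigger than the "124M" usually quoted for GPT-2 small.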
That matches what we got when working through the book; 163M parameters. Can we train it? It seems like every AI project starts with the question "what data can we use?" The original report on GPT-2, " Language Models are Unsupervised Multitask Learners ", is frustratingly lacking in details. However, it does say that they trained it on "8 million documents for a total of 40 GB of text". Now, according to OpenAI , it's reasonable to assume roughly four characters per token for typical English text. So 40 GB of text is ~10 billion tokens. That data was essentially gathered by scraping pages linked from Reddit that had more than three upvotes there, so was reasonably high quality. Can we get something similar? Conveniently, Hugging Face host a big dataset called FineWeb , and that has a 10 billion token "sample" dataset, randomly selected from the full 18.5 trillion tokens. So the sample feels like it's order-of-magnitude right. And while reading more about Karpathy's nanochat, I spotted that it uses FineWeb-Edu , which is a version of FineWeb that contains "only the most educational web pages". I wrote a script to download both of those , and kicked it off. It took about 20 minutes for each one (slow wifi in my study, I was getting < 5MB/s); FineWeb's 10B sample took up about 29 GiB, and FineWeb-Edu's about 27 GiB. Time to take a look at them. The Hugging Face function loads up all of the files you provide, and you can tell it how to split them up into train/validation/test sets. This command just loads up the whole FineWeb one and says "treat it all as the train split", which is good enough for now: Yikes. It took 1 minute, 53 seconds to generate the train split. However, that appears to be a one-off cost -- when I accessed it again later using the same code in a different Python session, it just did the second "Loading dataset shards" portion, taking three seconds, not the generation of the split. Presumably it caches it. 
Anyway, let's see what's in it: Great, so we have 14,868,862 rows, each of which has various bits of information. Checking the first one's text: Well, for FineWeb, that doesn't look particularly "fine", but I guess it's better than the stuff that Karpathy talked about in his recent interview with Dwarkesh Patel : When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet Let's take a look at FineWeb-Edu. That looks a lot better! Now let's take a look at the document lengths in terms of tokens. There's a column, but I don't know which tokeniser that's for, so to be safe we'll calculate it ourselves. How long would it take to tokenise every row in FineWeb 10B to check? Let's tokenise the first 10,000 of the 14,868,862 that we have, and see how long that would take -- then we can work out the estimated time for the whole thing. 2,160 seconds or about 36 minutes. Yikes! After a bit of digging, though, I found that tokenisers can handle batches (poorly documented, but it's there in the source ): Also, we can map a function over an entire HF dataset, and that can be made to run with multiple processes. So, we can combine the two: Just over three minutes, not too bad! (The reason the command count above jumps from 47 to 53 was that in the first run I didn't have the in there -- one of the rows in the dataset had in it, and the tokenizer rejected it. I'm going to play fast and loose and ignore that for now.) Now let's see how it added it: Cool! We've added a column with the number of GPT-2 tokens for each row, and we can extract what amounts to a list of those values. Let's plot them as a histogram. 
Trying to do it directly -- that is, just doing ...seems to make MatPlotLib very unhappy, and my interpreter crashed with an OOM -- I think it might be trying to load all of the dataset -- text, IDs, etc -- into RAM in one go. So I started a fresh one and did the stuff to load it and annotate it with token lengths again -- weirdly, this time the mapping only took 10 seconds or so! That was strange, I'll need to look into that. Perhaps the earlier command added the column to the files on disk? To work around the memory issue, I converted the column from the dataset to an actual list: That took ten or twenty seconds. Let's then try the plot again (full code this time): That took about 11s to run, and the result is this: That's really promising! The bulk of them are less than our 1,024 token sequence length. 1 If we present each row in the dataset as a stand-alone training sample, cropping them when necessary, perhaps we won't lose too much data? Let's see. First step, how many tokens are there in total? Nice, about 10B, as expected. How many tokens would we have if we cropped them to the default GPT-2 context length of 1,024? Ouch, 7.3B. That's quite a reduction: So we're losing 29% of our tokens by that cropping. That's from curtailing just 16% of the sequences: That's not great. I feel that we have two options here: At this point in the experiment, I'm going to keep both options open. I'm inclined towards the latter (I believe it's closer to what the real GPT-2 train did), but I'm not sure. Anyway, we're scoping things out here, so let's move on. After looking at the data, I've thought a bit more about this. I'd previously been thinking in terms of training across all of the tokens in the dataset; we'd work our way through the 10B tokens, and then we'd be done. 
But when training a model, you do multiple epochs, normally -- you run through the dataset once, updating your gradients as you go, then run through it again likewise, and eventually you stop when your validation loss starts rising. I think that because I'd read that LLMs are normally trained on just one epoch these days, I'd kind of internalised that we only need to do one. But it wasn't the case in 2019 when GPT-2 came out. They had less data -- just 10B tokens or so, compared to insanely huge datasets like the full FineWeb (not the 10B one we've been looking at -- the 18.5T full one), so they would have trained it for some number of epochs. How many? That's another case where the GPT-2 paper is annoyingly light. This report says in the "Replicating GPT-2" section that OpenAI trained it for 800k iterations with a batch size of 512. Plugging in a sequence length of 1024, that gives us this many tokens: Over 419B tokens! Now, if we believe that their dataset was 10B tokens, then we can work out how many epochs that came to: The same report says that they -- as in, the report authors -- make that "around a total of 60 epochs through the training set" -- I believe that the training set they're talking about could well be slightly shorter than the original GPT-2 one -- the GPT-2 authors didn't release their own, which is called "WebText", so the report's author is using a different one that tries to replicate it, OpenWebText . That sounds expensive; even without knowing how many tokens per second we can train for, 40-odd epochs of 10B tokens each sounds like it would take a long time. Are there any other comparison points that might tell us how long to train for? Well, there's a "Chinchilla heuristic" that I've heard of, which says that you should train on about 20 tokens per model parameter. 
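The two token budgets being compared here are simple enough to work out directly. This is just the arithmetic from the text as a sketch: the 10B-token dataset size is the rough estimate used above, the parameter count is the one for our untied model, and `chinchilla_tokens` is a hypothetical helper name.

```python
# GPT-2 style budget: iterations x batch size x sequence length.
GPT2_ITERATIONS = 800_000
GPT2_BATCH_SIZE = 512
SEQ_LEN = 1024
DATASET_TOKENS = 10_000_000_000  # rough estimate of the dataset size

gpt2_tokens = GPT2_ITERATIONS * GPT2_BATCH_SIZE * SEQ_LEN
gpt2_epochs = gpt2_tokens / DATASET_TOKENS


def chinchilla_tokens(n_params, tokens_per_param=20):
    """Training tokens suggested by the ~20 tokens-per-parameter heuristic."""
    return n_params * tokens_per_param


print(f"{gpt2_tokens:,}")                     # 419,430,400,000
print(f"{gpt2_epochs:.1f}")                   # 41.9
print(f"{chinchilla_tokens(163_009_536):,}")  # 3,260,190,720
```

So the heuristic asks for more than a hundred times fewer training tokens than the GPT-2 style run.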
I spent some time reading into where that comes from; originally it's in "Training Compute-Optimal Large Language Models" from Google DeepMind, and it's an interesting paper, and is surprisingly easy to read, with a few bits of maths that get a bit hairy (but aren't required to get a good-enough feel for what they're saying). I recommend you take a look. It was written in 2022, and the authors felt that people were scaling up models a lot, but weren't increasing the number of tokens that they used for training enough. So, they trained a huge number of models, trying to answer the question: "given a particular budget in training FLOPs, what is the optimal balance of training tokens versus parameters to make sure you're using those FLOPs most efficiently?". They were arguing against the method taken in a particular paper, where another team had trained a model (called Gopher) on significantly fewer tokens than they thought optimal. The number of FLOPs used to train a model is linear with both the number of parameters and the number of tokens you train it on, so if you get 2x the number of FLOPs that you had before, you can either train the same model on twice as many tokens, or you can double its size. Which is better? Their conclusion was that you should actually scale both parameters and tokens up by the same amount -- that is, in the 2x case you'd want to have √2 times both the parameters and tokens, which (because the FLOPs scale with the product of the two) would double your FLOPs and get you better performance. As you can probably see, by doing this they indirectly worked out an optimal number of tokens to train a particular size of model for. They don't state the "20x" heuristic themselves, but it's pretty clear in table 3 in the paper, where they give a number of model sizes and the optimal number of tokens for each.
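To make the scaling argument concrete: because training compute grows with the product of parameters and tokens, growing both by the same factor while doubling the compute means multiplying each by the square root of two. The sketch below assumes the commonly used approximation of about 6 FLOPs per parameter per token, which the text above does not state (it only says the relationship is linear in both), and the function names are invented for this example.

```python
import math


def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def scale_for_budget(n_params, n_tokens, budget_multiplier):
    """Scale parameters and tokens equally so compute grows by budget_multiplier."""
    factor = math.sqrt(budget_multiplier)
    return n_params * factor, n_tokens * factor


base_params, base_tokens = 1e9, 20e9
new_params, new_tokens = scale_for_budget(base_params, base_tokens, 2)

ratio = training_flops(new_params, new_tokens) / training_flops(base_params, base_tokens)
print(f"compute ratio: {ratio:.3f}")
print(f"tokens per parameter: {new_tokens / new_params:.1f}")
```

Scaling both by the same factor keeps the tokens-per-parameter ratio fixed, which is where a single rule-of-thumb number like 20x can come from.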
Now, this number is not the number of tokens you need to train for to get the best model you can for a particular number of parameters; a model of a given size can always be trained more and will (hopefully) get better. But it tells you when you've trained on enough tokens that you could get better results by training a larger model than you have right now. They're implicitly assuming that models can get as large as you want, which of course is not the case -- in reality, you're going to be targeting a particular model size, the size that can fit on your training hardware (or more likely with production models, the size that can fit on your planned inference hardware). But interestingly, looking at the README.md for Karpathy's nanochat project, he trained his 1.9B "d32" model on 38B tokens -- exactly 20x. And if you look at the script in the same repo, he explicitly says that he's training for 20x parameters for the smaller model: If Andrej Karpathy thinks that training for Chinchilla-optimality is the right way to go, then who am I to disagree? ;-) More seriously, perhaps the better quality of the dataset makes this a reasonable thing to do. From the GPT-2 paper, their description of how they got the data: ...we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. That's a clever trick, but I believe that FineWeb is much more carefully filtered and improved than the WebText dataset they got from that. Back in 2019, they had to do everything from scratch -- find appropriate ways to get data, filter it, and so on. Now we can just download stuff from Hugging Face. 
So maybe Chinchilla-optimal is enough. Anyway, we have 163,009,536 parameters, so on that basis, let's train for: ...tokens. (I'll just use 3.2B from now on, but that's the actual number I mean.) That's pretty cool! We have more than that number of tokens already in our FineWeb 10B sample, so we can do a single-epoch training run. So the question is -- is that even doable on my hardware? It all hinges on how many tokens per second we can train at. A good way to check this is to write a throwaway "trainer". We can use that to work out what our maximum batch size is on the RTX 3090's 24 GiB of VRAM, then run a bunch of batches through -- a forward and backward pass for each -- and see how many tokens per second we get. This won't estimate how much time we'll spend validating the model, of course. But my gut is telling me that we should spend no more than 5% of our training time running validations, so later on we can do a similar test (eval mode, forward pass only, with no gradient tracking) and use that to work out how many tokens should be in the validation set. So, let's estimate training speed. This code gets an estimate of tokens/second at different batch sizes. Hopefully it's clear enough to not need an in-depth explanation. An outline: Here's what it prints out: So we can see that it gets faster as we increase the batch size, which makes sense because we're handling sequences in parallel, but it does flatten off a bit, which makes sense because there's a limit to how much parallelism we can do, even on a GPU. Let's see how that fits in with the different training sizes we looked at above: OK. We're definitely not going to be able to train this thing the GPT-2 way! I expected that to be the case, but now we have solid evidence of it. But the three-day Chinchilla-optimal train actually sounds doable! I'm heading to London to visit family soon, so won't be using my home PC. 
With a bit of help from Tailscale I'll be able to log into it from my laptop, though, so I can potentially nurse a run through. Can we make it any faster? Now, when doing the fine-tuning work, I found that you could generally speed things up by doing everything in 16-bit rather than 32-bit. Intuitively that makes sense -- lower-precision numbers, fewer bits, means less work for the GPU doing the various multiplications and additions that are involved in our train. Working with ChatGPT, I found a couple of ways to take advantage of that. Firstly, using TF32. The normal float32 format uses 8 bits for the exponent, and 23 for the mantissa. If you haven't looked into how floats are represented in memory (or if you've forgotten), that means that, using m to mean the mantissa and x the exponent, the numbers are represented in memory as TF32 is messier; it has the same exponent size -- and thus the same range -- as float32, but it essentially ignores the lower 13 bits of the mantissa. So it takes up the same amount of memory, but is lower-precision, which means that calculations can be faster. Most importantly, cards like the RTX 3090 have dedicated "tensor cores" -- as opposed to the normal CUDA cores that do normal matrix multiplications -- and they operate in TF32. Unsurprisingly, "TF32" is "tensor float 32-bit". The PyTorch allows you to tell it what precision to use for matrix multiplications; the default is , which means "use float32 all of the time", so you're stuck using just the CUDA cores. If, instead, you set it to , then it will use TF32 if the hardware supports it and it has the appropriate kernels available. So that will let us use the tensor cores. I added this to the code above just above the loop over the different batch sizes: Let it run, and: That's a 22% speedup! Of course, the precision of the training isn't as good. 
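To get a feel for how much precision is given up, here's a small sketch that pulls a float32 apart into its sign, exponent and mantissa bits, and then emulates TF32 by zeroing the lower 13 bits of the mantissa. It's an illustration only: real hardware may round rather than truncate, and the function names are made up for this example.

```python
import struct


def float32_bits(value):
    """Return the 32-bit pattern of `value` stored as a float32."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def split_float32(value):
    """Split a float32 into (sign bit, 8-bit exponent, 23-bit mantissa)."""
    bits = float32_bits(value)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def truncate_to_tf32(value):
    """Emulate TF32 by clearing the lower 13 mantissa bits (truncation, not rounding)."""
    bits = float32_bits(value) & 0xFFFFE000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(split_float32(1.0))              # (0, 127, 0)
print(truncate_to_tf32(1 + 2 ** -10))  # 1.0009765625 (survives: 10 mantissa bits are kept)
print(truncate_to_tf32(1 + 2 ** -11))  # 1.0 (the 11th mantissa bit is dropped)
print(truncate_to_tf32(0.1))
```

With only 10 mantissa bits left, values keep roughly three significant decimal digits, while the range stays the same as float32 because the exponent is untouched.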
But given that many modern models are trained at 16-bit (I've seen suggestions that some are even trained as low as 4-bit) then that shouldn't matter. Let's see whether we can train in 16-bit instead. PyTorch has a smart mode where you can tell it "use 16-bit where it makes sense, otherwise use 32-bit" -- AMP, which stands for "Automatic Mixed Precision". There's a great recipe for how to use it in the docs , so let's use that. We need to create a object to handle scaling parameters from 16-bit to 32-bit as needed -- we can re-use that across all batch sizes so we can create it just before the loop: ...then we need to replace this core part of our training loop: ...with some code to use AMP and that scaler -- basically we use a context manager to switch it on when we're doing the forward pass and work out the loss, and then use the scaler to manage the backward pass and the optimiser's step: Running that gives us these results: Wow! With that we can train on 3.2B tokens in about 160,000 seconds, which is 44 hours. That's definitely doable. Now, what happens if we remove the ...so that we're using AMP, but not the tensor cores? It's basically the same. 300tps slower at the start, down to 70 at the end. Still, it looks better to keep the "high" precision in place, rather than the "highest". Right. We have the beginnings of a training loop that should be able to let us run a Chinchilla-optimal train on a GPT-2 small sized model in 44 hours, and I have the time to do it. And it looks like a batch size of six is what we can fit into the RTX 3090's 24 GiB of VRAM. What else are we going to need to build something to do this? If I want to do a long training run, then stuff might go wrong -- it might crash for some reason. So we're going to need to save checkpoints as we go and be able to restart training from those checkpoints. In those, we're going to need to save the model and the optimiser's state, plus some kind of info about how far through the dataset we are. 
We should keep training and validation losses too, so that we can easily chart and recover our progress, and according to this forum post we're going to need to save the scaler (which makes me think that it actually has state in it, so we probably should have used a fresh scaler for each batch size in the above -- let's hope that doesn't prove to be a problem [note from later: it wasn't]). I wrote a script to create a model, train it for a bit, and then dump out all of that apart from the metadata (which I reckon is going to be less than 1kB). I wanted to use the safetensors format for all of it, but unfortunately I couldn't get it to work for the optimiser or the scaler, so had to use for those (which I don't like because it uses pickle , which introduces serious problems if you ever want to move files from machine to machine, as the Python and library versions need to match perfectly). Ah well. Here's what the test checkpoint looks like: That's huge! And it's almost all the optimiser. From what I read, that stores two numbers per parameter, so it makes sense that it's double the size of the model weights. And at 32-bit, 4 bytes per param, then 670MiB for the model is sane. Timing-wise, it takes about a second to save, the same to load, so that's fine. So that sounds reasonable in terms of timing, and disk space is pretty high, but not so huge that it can't be managed with careful planning -- don't checkpoint so much that we run out of disk during the train (I have a 2TiB disk, but it's far from empty). It's probably worth double-checking that it works, though! Because my checkpoint test already did some training, I changed it so that it does this: Looks sane! The numbers for loss are the same before and after, so I think it's vanishingly implausible that the checkpoint we restored is different from the one we saved. And the continued training seems to be working -- at least, loss is going down -- so that sounds reasonable too. 
OK, so, again, the time taken to checkpoint is negligible, but the disk space isn't. I reckon we can comfortably do 100 checkpoints over the train. That's roughly one every half-hour over 44 hours. We're going to want to do a validation run each time we checkpoint, so let's think about that next. How big should our validation set be? Let's say we only want to spend 5m per checkpoint period doing validation. How many batches can we get through in that time? I wrote a simple script to run a model (after a few hundred training steps) in eval mode on different numbers of iterations to see how long each one took. It used the same trick as the training loop above in order to use mixed precision, and I ran it with instead of the that I've used in the past (ChatGPT tells me it's a little faster). I also put in some calls to around the loop that I was timing, which should apparently help make sure that the numbers are precise. The code is here if you'd like to take a look. After some fiddling with the min/max numbers at the top: OK, so let's call it 3200. That's 3200 * 6 * 1024 tokens = 19,660,800 tokens. That's about 0.006144 of our training set. Pretty low, but we're talking about such a large training set that I think we're OK. And practically we can't do more -- we're already talking about 5 mins every half-hour, so we're bumping up our train time by 88 * 5 = 440 minutes, which is seven hours. Now let's start thinking about the datasets. We can split the HF thing into train and validation sets. I'm thinking it might be useful to load all of our training and validation data into RAM for the train loop. 3.2B tokens with four bytes per token should be about 13 GiB, after all, and I have 64 GiB RAM on the machine. ...but wait, int64 is the default for PyTorch for long ints -- that's what our token lists are in the original, and it's twice the size, so we're talking 26 GiB. I believe that PyTorch expects that format for the cross entropy loss. 
That's not the end of the world, though -- we can store the data as int32 in RAM (with 50,257 as our vocab size we could even squeeze them into 16 bits if we wanted to, though that would need an unsigned type, as signed int16 tops out at 32,767) and then we'll need to make them the right type just before using them. We can do that when splatting them onto the GPU, eg. First thought, can we store them as a Python list? Turns out they're not all that memory-efficient, though: How about PyTorch tensors? Promising! (Though ChatGPT pointed out when reviewing a draft of this post that I was using the default rather than an type here. Still, it's the same size.) Let's measure memory usage in a new interpreter. Yup, 12,801,474,560, so about 12 GiB. Can we save it? OK, let's try reloading it in a fresh session: Nice. So, I think we can write a quick script that splits our incoming dataset into say 99/1% train and validation, grabs the first 3.2B tokens from the training set, glomming them together into one big tensor with EOSes between them, and saves them, and then does likewise for the first 19,660,800 tokens from the validation set. We'll use FineWeb, with the possibility of switching to FineWeb-Edu later on. Doing it that way means that we're actually using the second of the two options I considered earlier: Treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. I thought it would be harder than concatenating/padding rows, but it actually turns out to be simple enough. Let's give it a go. Here's the code . I wanted to have a round number of 6-sequence batches of 1,024 tokens each, so the number of training tokens worked out at ...rather than the strict Chinchilla-optimal 3,260,190,720, but that's no biggie. Running it takes 5m55s, and then: Looks about the right size -- 19M * 4 for val, 3.2B * 4 for train. Cool! Let's finally write our training script. 
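The "one long document" approach described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the script linked from the post: `pack_documents` and its parameters are hypothetical names, the documents are assumed to be already tokenised into lists of ints, and an EOS token is appended after every document (so the stream also ends with one).

```python
from typing import Iterable, List

EOS_ID = 50256    # GPT-2's <|endoftext|> token id
SEQ_LEN = 1024    # GPT-2 context length, as used in the post
BATCH_SIZE = 6    # the batch size that fitted into the RTX 3090


def pack_documents(
    docs: Iterable[List[int]],
    eos_id: int = EOS_ID,
    seq_len: int = SEQ_LEN,
    batch_size: int = BATCH_SIZE,
    max_batches: int = 1,
) -> List[List[int]]:
    """Concatenate tokenised documents with EOS delimiters, then cut the
    stream into fixed-length sequences, keeping only whole batches."""
    tokens_per_batch = batch_size * seq_len
    tokens_needed = max_batches * tokens_per_batch

    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # delimiter after each document
        if len(stream) >= tokens_needed:
            break

    # Drop any tail that would not fill a complete batch.
    usable = min(len(stream), tokens_needed) // tokens_per_batch * tokens_per_batch
    return [stream[i:i + seq_len] for i in range(0, usable, seq_len)]


if __name__ == "__main__":
    # Toy example: tiny "documents", 2-token sequences, 2 sequences per batch.
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(pack_documents(toy_docs, eos_id=0, seq_len=2, batch_size=2, max_batches=2))
```

Note that documents are allowed to straddle sequence boundaries here, which is what makes this simpler than cropping and padding each row.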
You can see the full training script here -- note that this is the final version from the repo, so isn't exactly what I'm running at this point in the post. The checkpointing code is (sensibly enough) in a separate file, . It took two days to run, and... Both train and validation losses fall nicely! Training loss is a bit choppy, but that's because I erroneously only plotted the most recent iteration's training loss rather than an average over all iterations between the last and current validation run; the validation loss is correct because I did average all of the validation numbers. (The version of the code linked above fixes that error.) The best epoch for val loss is not the last one but it was close. Looking at the last 5 iterations, their val losses were: It's time to do some evals. Firstly, let's try the smoke test that we do in the book. What does our model think should come after the text "Every effort moves you"? With uninitialised weights we get gibberish, as expected. But with our best checkpoint we get this: Nice! The multiple mentions of protein are actually the kind of repetition that small models tend to do, so that's not bad news. Let's try with the last iteration's checkpoint: Also very nice, perhaps better! I think that both of those are qualitatively as good as the result we got when we loaded the pre-trained weights from OpenAI , which was: That's very reassuring. But is there something a bit more quantitative that we can do? Firstly, can we compare it to anything in the GPT-2 paper? In figure 4 they give their perplexity against their train and test sets for the different model sizes; for the small one it's a bit over 16. Let's assume that they're basing that on natural logarithms, so they mean that they have a loss of ln 16 . That's , which is much lower than our best loss of 3.9401. However, that is across different datasets, so while it makes me suspect that their model is better than ours, we can't really say for sure either way. 
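For reference, the conversion between loss and perplexity used in that comparison is just an exponential: if the cross-entropy loss is measured in nats (the natural-logarithm assumption made above), perplexity is `exp(loss)` and loss is `ln(perplexity)`. A minimal sketch, with function names that are illustrative rather than taken from the post's code:

```python
import math


def loss_to_perplexity(loss_nats: float) -> float:
    """Perplexity for a mean cross-entropy loss expressed in nats."""
    return math.exp(loss_nats)


def perplexity_to_loss(perplexity: float) -> float:
    """Mean cross-entropy loss in nats for a given perplexity."""
    return math.log(perplexity)


if __name__ == "__main__":
    # The paper's rough perplexity of 16, expressed as a loss.
    print(f"{perplexity_to_loss(16):.3f}")      # 2.773
    # Our best validation loss, expressed as a perplexity.
    print(f"{loss_to_perplexity(3.9401):.1f}")  # 51.4
```

Because the mapping is exponential, a modest-looking gap in loss becomes a much larger gap in perplexity, which is worth bearing in mind when comparing numbers quoted in the two forms.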
The cool thing is, though, that we have their model -- so we can actually run it against our dataset. I wrote a script called , and running it gives us this: Still better than ours :-( I considered doing the same thing against Qwen to see whether that was also better, but with a different tokeniser we couldn't really treat it as comparable. Loss and perplexity are both over next-token predictions, and if the meaning of "token" changes, then the numbers will change. 2 OK, so we have a model, but it's not as good as the original GPT-2 small. Our loss on our validation set is roughly 3.94, while the original weights get about 3.50. Expressing that in terms of perplexity gives our own model about 51.4, while the original has 33.1. That's actually still higher than the 16 that they had in the paper, which is interesting -- presumably it's related to the fact that they're validating over their own WebText test set rather than ours; they're both samples of web content, but there must be differences. At this point, my guess is that this shows that all of that extra training that the OpenAI team did beyond the Chinchilla-optimal number of tokens did have a real benefit -- and that's not surprising. Remember that the Chinchilla paper is about the best way to spend a FLOPs budget. They're not saying that you can't drive down loss by continuing to train your model further -- of course you can. They're saying that when you pass the optimal number of tokens, you should increase the model parameters and the tokens by the same ratio, and by doing that you'll get the best balance. But still, a Chinchilla-optimal model of 163M parameters might still be useful. What happens if we instruction fine-tune it like we did the original model in Chapter 7 of the book ? 
In that post and its followup , we used some training samples using the "Alpaca" one-shot question-answering format: ...to get a model that we then provided a test set of questions in the same format, then used the Llama 3 7B model to judge the results on a scale of 0 to 100. We then averaged the results and got a plausible-looking indicator of how useful the model was, as compared to the more narrowly technical loss number. One problem with that is that we ran those tests on the OpenAI weights for the medium-sized 355M-parameter GPT-2 model. If we don't want to be comparing apples to oranges, we'll need to re-run it on their weights for the small model. Let's see how we do. First, let's run it for five epochs just to see when/if it starts overfitting: OK, so two epochs looks like the right amount, just as it was with the medium model. So we can train for that (because I'm using the original code I wrote when working through the chapter, I didn't checkpoint during training -- but it takes less than a minute to run the whole thing, so no biggie). Here's the loss chart: Validation loss at the end is 0.733, noticeably above the 0.649 that I got with the medium-sized model. And the sample outputs shown at the end aren't as good, either. With the medium-sized model, I got these: ...but with the small model (remember, this is with OpenAI's original weights) I get this: Definitely worse, especially the last one! Let's see what Llama 3 thinks of it, again using the code from the book: The medium model got an average of 50, so the OpenAI small model is definitely much worse, as the examples suggested. Makes sense. Let's see how our own base model performs when fine-tuned on the same data. After a bit of fiddling I found that validation loss settled down at the end of epoch 10: (It's hard to see from the chart, but validation loss was actually very slowly dropping even after epoch 5.) 
It's interesting that our own model took longer to train here, but it does make sense in terms of it being that little bit dumber. The samples it printed out at the end are also interesting: The simile is pretty good, I think better than the OpenAI original weights' one, but the storm clouds one is dreadful. It's fascinating that they both chose the same wrong answer for "Pride and Prejudice" -- my guess is that it's because the training set contained this question: ...so both models picked up on Robert Frost being a useful author to reference in answers. Anyway, what does Llama 3 think of the output? Yup, it's dumber than the original weights -- but, at least to my mind, closer to the original weights' score than you might have thought based on that loss/perplexity number alone. But, on the other hand, I'm not convinced that Llama 3 7B is smart enough to be doing a good job. In the stuff the eval script printed out, we have this: This is clearly completely wrong: the mention of cumulonimbus is coming from the dataset response, not the model response. Llama 3 7B is tripping up over what came from where, which is pretty normal for a small model. Of course, it's possible that the scores for the OpenAI GPT-2 small weights also have been given a higher rating than they deserve -- or, indeed, that there were right answers that were incorrectly judged wrong. Conceivably it averages out. But there's no reason to assume it would, so it's essentially noise and is making the results less useful. Let's try using a much smarter LLM as a judge and run both of the models' responses through it -- the just-released OpenAI GPT-5.1 model. The code is here . Running that against our own model's answers: ...and against the model fine-tuned from the small OpenAI weights: ...and, of course, it didn't make the mistake of confusing the dataset response with the model's in any of the cases printed out. 
ChatGPT 5.1 in the chat interface is very smart, I expect these results are much closer to a reasonable ground truth. Out of interest, what does it make of the model based on the GPT-2 medium weights that we train as part of the book? That's as compared to an average of about 50 from Llama 3 7B. It seems like GPT 5.1 is a tougher judge than the small local model -- and my guess is that that is because it's more accurate. 3 Anyway, the ranking remains the same; after fine-tuning on the same Alpaca dataset, GPT-2 medium > GPT-2 small > our model. But it's still a relatively close-run thing between our model and GPT-2 small. Can we close the gap without vast amounts of extra training? The results so far were from using 3.2B tokens of the FineWeb 10B corpus. Now, as I noted at the start of this post, Andrej Karpathy's nanochat project uses FineWeb-Edu, a separate corpus designed to be really informative. Indeed, back at the start when we were looking at the two datasets, the first row in the Edu dataset was about Jane Austen, so maybe we would wind up with a model that at least got that question right! That's going to take another two days to train, but that's no big deal. We first need to change our script that generates the train/validation splits to regenerate them using the Edu dataset; we'll move the old ones to one side, though -- it will be interesting to see what loss we get on the non-edu validation data with the new model. (Note to self: work out some way to split out different datasets and training runs for future experiments like this. The setup I had in my recent post on RNNs worked quite well. Throughout the remainder of this post I'm juggling directories of checkpoints and datasets, and I'm sure I got it right, but it was an error-prone process.) That being done, it's time to move the checkpoints we already have to one side, and to kick off the train! 
Here's what we have after two days on that -- oops, I forgot to add the code to average training loss across all of the batches, so again it's a bit spiky. But we got to a final eval loss of about 3.693 this time. Of course, that's on its own validation set, so it's not comparable with the numbers from before; loss is specific to a particular dataset. Let's see what it makes of the original run's validation set. Juggle some directories around (my messy file structure means that there is just one "datasets" directory and one "checkpoints" one, so I'm moving them around to make sure I'm using the right combination): We get 4.16! That's truly terrible, worse than both the original base model that we trained on FineWeb's non-edu dataset, and than the OpenAI GPT-2 small weights. Let's see what we get from the closer-to-real-world instruction fine-tuning test. Five epochs turns out to be best: I won't bother running it past Llama 3 7B, as that's proven unhelpful, so we'll go straight to GPT-5.1. Gosh! So it's judged slightly worse than our weights based on FineWeb. That does surprise me a bit. I was definitely expecting the Edu version of the dataset to give us a better model. So: OpenAI medium > OpenAI small > our FineWeb base model > our FineWeb-Edu base model. That last pairing does surprise me a bit. Handwaving wildly, perhaps the more "regular" nature of the Edu dataset meant that the model saw less variation in its training set, and that actually made it learn less? I think there's one more experiment I want to do before bringing this ( very lengthy) post to a close. We've shown that Chinchilla-optimal training of models produces worse results than OpenAI's original, we think longer, train. What would happen if we continued training for another two days? As I have it easily to hand, I want to use the FineWeb-Edu model for this. I want to start with the best checkpoint (which happens to be the last one), and train it on another 3.2B tokens from FineWeb-Edu. 
Let's see what we get. Getting a dataset is going to be a bit messy, as our existing script to generate the safetensors datasets just grabs tokens from the original dataset until it gets 534,200 batches of 6 sequences, each of 1024 tokens (3,282,124,800 total). Might as well hack it (and note that this is something worth improving for any later experiments). I'll just loop round the code to do that twice, throwing away the first set of 3.2B tokens. I was pretty sure that the ordering of the datasets I'm getting is fixed, but perhaps not -- it spent time regenerating the train/val split at the start of the script, so there's no guarantee we have different data this time. That feels like a note-to-self about data pipeline hygiene -- if the train/val split is randomised by the infra I'm using, I should persist the raw data in case I need to use more data than I thought I would need to. Still, for this experiment, we can play relatively fast and loose. After all, GPT-2 small -- the original OpenAI weights -- was trained on multiple epochs, so it saw tokens multiple times. What we're trying to see here is what happens if you train for longer; a more scientific experiment can happen later (if at all...). Anyway, we have 3.2B tokens that should at least be reasonably different from the original 3.2B. Right, let's clean up some disk space so that we have enough for the new train (deleted some old optimiser checkpoints, keeping the metadata and the weights). Now, we create a new checkpoints directory, and we can copy the last/best checkpoint from the original FineWeb-Edu train there. Hack the in there to zero, create and symlinks, and then we can "restart" from that checkpoint. Due to the way the restart-from-checkpoint code works in the training script, that means that it will start with an offset of 1 into the dataset, so we're dropping one of about 530,000 iterations, but that's not exactly the end of the world. 
There are some interesting spikes on validation loss in there -- in particular that one at around iteration 300,000 where it goes up from 3.6 or so to 7.5 for two validation periods (which, remember, happen every ~30 minutes, or every 7020 iterations). My guess is that we got some kind of gradient spike prior to those, which led to a bad update to the parameters. However, it looks like the loss recovered really quickly after it, so while gradient clipping (that is, limiting the size of the gradients so that one-off spikes don't cause massive updates) might have prevented them, I don't think it would have improved matters much -- we might have "lost" an hour or so of training, but out of a 44-hour train (48 hours including breaks for validation), it's not the end of the world. But, looking at the raw numbers, after our second two days of training on a fresh sample from FineWeb-Edu 10B, we've managed to get the loss on our validation set down from 3.693 to... drumroll... 3.661. And that's on the "best" measurement, which was an hour before the end. The last validation number was 3.663. By spending twice the time, we've managed to get our loss down by 0.032, which is a touch less than 1%. Even measured in terms of perplexity (which, being an exponential, is more sensitive to this kind of change), we've gone from 40.2 to 38.9, which is hardly show-stopping. Let's see how this one measures up against the non-edu FineWeb validation dataset that we originally used to calibrate our first training run. Run it, and: ...we get 4.13 -- that's opposed to 4.16 on the last model, trained on half as much data. Well, maybe it's a much better base model for instruction fine-tuning? Let's give that a go, again with the Alpaca training set from the book. 8 epochs turns out to be the right number: Certainly better than the 15.18 that we got on our Chinchilla-optimal FineWeb-Edu model, and a bit better than the 16.14 we got on the Chinchilla-optimal FineWeb one. 
So by training for double the time on twice the data, we've definitely got a better model. It's just not that much better. I think that's more -- significantly more -- than enough experimentation for one blog post, so let's do some analysis. I want to sanity-check the number of FLOPs spent on this train, just to make sure that I hadn't messed up. Feel free to skip this if you want to jump straight to the conclusion :-) In appendix F, the Chinchilla paper mentions a common approximation for how many FLOPs, C , you spend training a model with N parameters over D tokens: C ≈ 6ND. So based on that, each of those training runs cost us (using the exact numbers for N and D ) about 3.19 × 10^18 FLOPs. They also give a more carefully-worked out calculation; it doesn't look all that difficult -- it's just a case of plugging in the numbers from our architecture and pulling out a result 4 -- but the numbers they get from that are generally within 10% of the simpler calculations, so we may as well stick with the above. 5 Now, in terms of how many FLOPs we actually spent... well, manufacturers' datasheets for hardware are based on carefully-selected benchmarks and won't really be comparable to the code we were running (especially given that it's my crappy code based on top of a huge stack of PyTorch, CUDA kernels, CUDA itself, and so on), but we can do a Fermi estimate . From Wikipedia, the RTX 3090 has 35.58 TFLOPS performance on FP32. Way back earlier in this post, when I was measuring how many tokens per second I could get locally, the first experiment capped out at 12,599 tokens/second with FP32. showed the GPU usage at 100%, so let's say (again, this is very approximate) that we were getting about 35.58 TFLOPS and that enabled 12,599 tokens/second. We wound up training at about 19,921 tokens/second after adding in mixed precision and using the tensor cores. 
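The two estimates being compared here can be reproduced with a few lines of arithmetic. This is a rough sketch rather than the author's own calculation: the function names are hypothetical, and the parameter count is an assumption, inferred by dividing the Chinchilla-optimal token count quoted earlier (3,260,190,720) by the usual 20 tokens per parameter. The throughput and datasheet figures are the ones quoted in the text.

```python
# Assumption: N is inferred as D / 20 (the common Chinchilla heuristic);
# the post only says "163M parameters" and "exact numbers for N and D".
D_TOKENS = 3_260_190_720
N_PARAMS = D_TOKENS // 20


def approx_training_flops(n_params: int, n_tokens: int) -> float:
    """Common approximation for training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def fermi_flops_spent(
    peak_flops_per_sec: float,
    baseline_tokens_per_sec: float,
    actual_tokens_per_sec: float,
    training_hours: float,
) -> float:
    """Scale the datasheet FLOPS by the observed speedup in tokens/second,
    then multiply by the training time in seconds."""
    effective = peak_flops_per_sec * actual_tokens_per_sec / baseline_tokens_per_sec
    return effective * training_hours * 3600


if __name__ == "__main__":
    needed = approx_training_flops(N_PARAMS, D_TOKENS)
    spent = fermi_flops_spent(35.58e12, 12_599, 19_921, 44)
    print(f"6ND estimate:   {needed:.3e} FLOPs")
    print(f"Fermi estimate: {spent:.3e} FLOPs")
    print(f"ratio:          {spent / needed:.2f}")
```

Under those assumptions the 6ND figure comes out at roughly 3.19 × 10^18 and the Fermi figure at a bit under three times that, so the two agree to within an order of magnitude, which is all a Fermi estimate is meant to show.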
So, hand-wavingly we can say that we were getting about 56.27 × 10^12 FLOPS (that is, 35.58 TFLOPS scaled up by the ratio of 19,921 to 12,599 tokens/second). Now, we trained for 44 hours (48 including validation), so the total number of training FLOPs should have been the number of seconds in that times the total FLOPS 6 of 56.27 × 10^12: 158,400 × 56.27 × 10^12, or about 8.91 × 10^18 FLOPs. That's pleasingly close to the 3.19 × 10^18 above! I can easily imagine that the stack we're using could somewhat-more-than-halve performance from the theoretically optimal, or that we're running at 50% of the GPU's theoretical capacity, or some combination of the two. We're in the same order of magnitude, and for a Fermi approximation, that's what matters. Now, looking at figure 3 in the Chinchilla paper, their IsoFLOP curves (each one showing the loss they got on their training set for models of a particular size, using the same number of FLOPs for each curve), we can see that on the top one, which is training runs of 6 × 10^18 FLOPs, the lowest point is pretty much bang-on the 168M point on the X axis. So that is at least reassuring that we did do a proper Chinchilla-optimal train here. (Their loss on that chart is showing 3, but they're using a different dataset, so I don't think it's comparable.) Apart from the obvious answer of "skill issue", let's see if there are any obvious reasons why the base model I've trained (and retrained) in this post is worse than the original OpenAI GPT-2 small. Let's review the results first: The first row is not super-interesting, it's the second and third that matter. OpenAI is clearly winning by quite some margin! Earlier on I assumed that the difference was that they trained on more data, but let's be a bit more systematic here. What specific differences do we have to the original train? Again, the amount of data in the paper is frustratingly limited, but: Right at the start, I estimated that the WebText dataset they trained on was about 10B tokens. We've trained on 3.2B tokens for two of our models, and 6.4B tokens for the extended train one. That could well have an effect. 
There's more information in their larger dataset, both in terms of raw facts like "Jane Austen wrote Pride and Prejudice", and in terms of information about the structure of language. On the other hand, their dataset is, as they say, comprised of the contents of web pages that were linked from Reddit posts with more than three upvotes. FineWeb (and even more FineWeb-Edu) is a much more curated dataset, so you would expect it has more facts, and better structure -- less of the slop and junk that Andrej Karpathy talked about in his interview with Dwarkesh Patel. So I'm not sure that this is it, but it's worth keeping in mind. Again, we don't know how many epochs they trained on, but the report I linked to right at the start of this post estimated that they trained for 60, while I calculated based on their numbers that it would be 41 epochs with WebText. It certainly makes sense that grinding along, epoch after epoch, will get your loss down, at least on the training set! And there's also a phenomenon with certain kinds of neural networks where if you keep training past the point where you're overfitting (that is, validation loss starts rising while training loss continues to fall), suddenly the model can have an "aha" moment and start generalising again . 8 It's not quite comparable, because it was not a second epoch, but rather continued training with more data, but we were able to eke out an extra reduction of 0.032 in loss by training our FineWeb-Edu model for twice as long. If we'd trained it for 40 times as long, then we presumably would have managed to grind it down even further. I have no idea how much further we could get it, but I'd guess that it's going to be worse than linear (that is, each extra two days gets you less loss reduction than the previous) -- so we can bound the loss reduction at a maximum of 39 × 0.032 = 1.248 . So... maybe? It would be a dull experiment to run, though, taking 78 days. 
If I want to do that, it would be better to find a way to do it quickly, so that I can get a better feedback loop going. The reason this post has taken so long has in part been because each training run has taken so long (as well as trips to London and other life stuff). The original GPT-2 model from OpenAI had bias on the W q , W k and W v projections -- that is, they were normal NN biased linear layers rather than simple matrices, so they did a projection into their respective spaces followed by a translation. In the book, Raschka says that this is not normally done these days, which is why I didn't do it for this base model train. But perhaps it actually is valuable with this architecture or size? Modern models presumably differ in multiple ways, and perhaps the bias would have been useful for this old design. Likewise, weight-tying -- the original GPT-2 re-used its embedding matrix to do the final projection from embedding space to vocab space, rather than having a separate one. That seems intuitively clever but not necessarily "right", given that it gives the model less flexibility in what it can output from the last layer. But perhaps with this size and architecture, it's the right thing to do? Contrariwise, having made those two changes to GPT-2 because I believed that modern models don't work that way, there was one "modern" change that I didn't make. In his post on the architectural changes since GPT-2, Raschka mentioned that dropout is normally not used nowadays. This looked to me like it was due to the move to single-epoch training. But single-epoch training was exactly what we were doing in this post! Perhaps I was holding myself back by keeping dropout in place. I don't have a good intuition as to what the right level is for this at the moment. 
My code blindly uses the optimiser setup from the book: I have at best a vague understanding of how those work, at least when using an optimiser (LR for simple gradient descent isn't too hard to understand, although it's hard to work out an intuition for what the right value might be in any given case). Additionally, in the Chinchilla paper, they talk about using a cosine function to vary the learning rate, which is something I'm completely unfamiliar with. I gained about a day in training time by using AMP and the TF32 tensor cores; however, I lost precision. I don't know for sure, but I suspect that the original weights were trained with pure full-fat FP32. Perhaps reducing precision lost something? I know that modern models are often trained with lower precisions, but perhaps that's balanced out by something else? This is the one that I think is least likely, but it's worth mentioning. The post that I linked to estimating the size of the training run for GPT-2 small mentioned that they used a batch size of 512, which (of course) is completely impossible on consumer hardware like mine. Indeed, I think you'd be lucky to get 512 onto a single 8-GPU node -- we're talking serious cluster training scale here. Larger batches lead to more stable updates to the gradients. So maybe that helped for OpenAI when they did their train? I suspect it did, but I'm pretty much certain that it's not a large part of the difference. (Counterpoint: Gemini thinks that this might actually be a big part of the problem! It recommends using gradient accumulation -- that is, not stepping the optimiser every iteration, but instead giving gradients time to build up -- as a way of getting a larger effective batch size.) While it doesn't look like we had any issues with these on the original FineWeb and FineWeb-Edu trains, they definitely did kick in on the extended Edu train. The code to clip them is easy enough, and I think it's likely that the original GPT-2 trains would have had it. 
I doubt this was a major part of the difference, but it probably would have helped, at least a bit. Anyway, I think that's it in terms of differences that I can see between my train and OpenAI's (as always, comments welcome -- let me know if you spot any others!), so it's time to (finally) wrap this post up. At the start of this (ridiculously long) post, I asked the question: can we train a GPT-2 style base model at home on a single RTX 3090. The answer is a resounding "yes we can", which is great! Training base models: not just for the GPU-rich. If you have a couple of days and a decent graphics card, you can train a Chinchilla-optimal GPT-2 pretty easily. But the model itself isn't quite as good as the original GPT-2 small one, and I have some ideas about why that might be. Testing any of those would take quite a long time, given that each training run takes two days. Now, my next planned step was to see whether I could work out how to move this up to the cloud and train the same model on an 8x A100 or similar machine on Lambda Labs. This still sounds like an excellent plan! With his project, Karpathy trains a larger model on more tokens in four hours; if we could get the experiment time down to one hour (plausible if training time is linear in both tokens and parameters) then it would be much easier to check out those hypotheses above. 9 So, I think that's still the right way to go: after training a base model at home for free (if you ignore the electricity costs -- and it's cold enough in Lisbon right now that the heat from the PC was probably saving me money on my home heating bill -- and the cost of having bought the RTX 3090 in the first place), the next step is to see how cheaply we can train it in the cloud. Stay tuned :-) It's useful here, but it does make me wonder how good FineWeb would be for training a base model with a longer context length, however.  
↩ There are ways to get comparable numbers even with a different tokeniser, using a bits-per-byte or nats-per-byte measure. Let's say we're using the normal cross entropy loss with the natural logarithm; that means that loss is expressed in nats. So you add up all of the per-token losses and divide it by the number of bytes across all of the inputs you've seen, and that would give you nats-per-byte. Likewise, if you used l o g 2 for cross entropy, you'd get bits-per-byte. The latter is used in the Chinchilla paper (eg. table A5) as a way to compare their model with the Gopher model. I did consider digging into this a bit, but I think it's a bit of a side quest for now.  ↩ Those evals cost me $0.09 in API credits, which is actually a little more than I was expecting -- there were some responses which took quite a while to come back, though, and I believe that the GPT 5.1 model spends time thinking when it seems appropriate, so perhaps I spent a bit on thinking tokens.  ↩ Apart from a reference to a "dense layer", which I'm unsure about -- I believe it's the linear feed-forward layer after the attention calculations, though, as that doesn't appear elsewhere, and the calculation looks right. I also noticed that they don't have any terms in there for things like normalisation, which seems odd for such a carefully-worked-out formula; I assume they are small enough to vanish into the noise.  ↩ If you want a more careful calculation of the numbers -- and indeed a really nice explanation of some of the details of the Chinchilla paper, I recommend this blog post from Tomek Korbak .  ↩ I hate that we appear to have settled on FLOPs with a lower-case "s" for "floating-point operations" when "FLOPS" (and equivalently MFLOPS, GFLOPS, TFLOPS) with an upper-case "S" already meant "floating-point operations per second" because the difference in capitalisation should really not change the units. But here we are.  
↩ I estimated the OpenAI weights loss on their own dataset by taking the perplexity number for the small model from figure 4, which is about 16.5, and then taking its natural log.  ↩ The authors of the paper call it "grokking", which is a great name, but is so overloaded in the context of LLMs (even if you disregard xAI's Grok ) that I'm slightly loath to use it here. This phenomenon also looks somewhat more limited in scope than I thought -- I'd been under the impression that it happens a lot with LLMs, but it looks like it's more a thing that happens with small models trained on very structured datasets.  ↩ It would also be interesting to see how easy it is to offload the optimiser to the CPU: in my old fine-tuning experiments I found that freed up a ton of VRAM, so we could benefit from that and maybe get the batch size up to something closer to the 512 that OpenAI apparently trained with.  ↩ . This is determined by the tokenizer, and I want to use the GPT-2 one, so it will need to be . . GPT-2 has a 1,024-token context length, so I'll stick with that. , , --- these define which of the different GPT-2 model classes we're training, and I want to stick to the smallest one, so they will be , and respectively . One of the most surprising things to me in the "architectural improvements" post linked above was that dropout is no longer used so much. However, this appears to be tied in to the one-epoch training that has taken off since GPT-2, so I think it would be best to stick to here. . From what Raschka says in the book, this doesn't add on much value, even though the original GPT-2 used it, so let's set it to . Crop all of the input sequences -- that is, each row in the dataset -- so that each one is no more than our 1,024 sequence length. Then we can pad them out with end-of-sequence tokens (as is the standard) so that they're all 1,024. This will lose us quite a lot of tokens, but has the big benefit of being easy. 
Treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. Doing it this way would mean we'd use all of our training data. But it would be more complicated, especially if we hit memory constraints. We load enough GPT-2 tokens from FineWeb for batches of sequences each, every one of those sequences being long (plus one extra token for the targets we're comparing them to). Note that we're not bothering to separate them with anything for this test. We then loop over batch sizes from to . Then we create our model and put it on the CUDA device. We do this for each batch size rather than creating one and then using it for all of them so that they're all starting from the same point -- the should make sure that they're identical. For each batch size, we create input and output batches as tensors -- note that we're not putting these on CUDA yet, I wanted to do that in the training loop to mirror what a real training loop will have to do. When we're training with 3.2B tokens then having them all on CUDA will be a waste of VRAM, so we'll be pushing a batch there for each iteration. We do a stripped-down training loop -- for each batch, put the inputs and outputs onto CUDA, then a forward pass, work out the loss, backward pass, and optimiser step. We do the same iterations per batch size. Finally, we print out the number of tokens we trained on for this batch size, how long it took, and the number of tokens per second. Chinchilla heuristic, 20x parameters -- 3.2B tokens: 247,850 seconds, which is just less than three days Estimated GPT-2 train, 419B tokens: 32,452,947 seconds, which is just over a year. Create a model, optimiser and scaler. Train the model for a bit. Work out the loss. Save a checkpoint. Create a new model, optimiser, and scaler, and then restore the checkpoint into them. Work out the loss Train for a bit more to check that the optimiser and scaler still work. 
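The two ways of turning dataset rows into fixed-length sequences described above can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the code actually used for the trains: the function names are hypothetical, the rows are assumed to be already tokenised into lists of IDs, and letting consecutive packed examples share one boundary token (so each has its "plus one" target token) is just one possible choice.

```python
EOS_ID = 50256  # GPT-2's <|endoftext|> token ID

def crop_and_pad(rows, seq_len, eos_id=EOS_ID):
    """Option 1: crop each row to seq_len, then pad short rows with end-of-sequence tokens."""
    return [(row[:seq_len] + [eos_id] * seq_len)[:seq_len] for row in rows]

def pack_rows(rows, seq_len, eos_id=EOS_ID):
    """Option 2: treat the corpus as one long stream with EOS between rows, then slice it up."""
    stream = []
    for row in rows:
        stream.extend(row)
        stream.append(eos_id)
    examples = []
    # Each example needs seq_len input tokens plus one extra token for the shifted targets.
    for start in range(0, len(stream) - seq_len, seq_len):
        chunk = stream[start:start + seq_len + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

# Toy rows with a tiny sequence length so the result is easy to read.
rows = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
padded = crop_and_pad(rows, seq_len=4)
packed = pack_rows(rows, seq_len=4)
print(padded)
print(packed)
```

The trade-off from the text shows up directly: cropping throws away the tail of long rows and spends positions on padding, while packing keeps nearly every token at the cost of sequences that straddle document boundaries.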
On our own validation set from FineWeb, we have OpenAI > our FineWeb train > our FineWeb-Edu extended train > our FineWeb-Edu train
On the answers judged by GPT-5.1 after instruction fine-tuning, we have OpenAI > our FineWeb-Edu extended train > our FineWeb train > our FineWeb-Edu train
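Two of the conversions mentioned in the footnotes above, per-token loss to nats-per-byte or bits-per-byte, and perplexity to loss, are simple enough to write down. The sketch below is only an illustration: the function names are hypothetical, the byte count assumes UTF-8 encoding, and the example losses and text are made up.

```python
import math

def nats_per_byte(per_token_losses, texts):
    """Sum the per-token cross entropy losses (in nats) and divide by the total bytes seen."""
    total_nats = sum(per_token_losses)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)  # assumes UTF-8
    return total_nats / total_bytes

def bits_per_byte(per_token_losses, texts):
    """Same measure with log base 2: one nat is 1 / ln(2) bits."""
    return nats_per_byte(per_token_losses, texts) / math.log(2)

def loss_from_perplexity(perplexity):
    """Perplexity is exp(loss) when the loss is in nats, so the loss is its natural log."""
    return math.log(perplexity)

# Made-up example: three tokens covering a 12-byte string.
example_npb = nats_per_byte([2.0, 3.0, 1.0], ["hello world!"])
example_bpb = bits_per_byte([2.0, 3.0, 1.0], ["hello world!"])
# The perplexity of about 16.5 quoted above corresponds to a loss of roughly 2.80.
openai_loss = loss_from_perplexity(16.5)
print(example_npb, example_bpb, openai_loss)
```

Because the denominator is bytes rather than tokens, the result doesn't depend on how the tokeniser happened to split the text, which is what makes it comparable across models.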

Giles's blog 3 months ago

Why smart instruction-following makes prompt injection easier

Back when I first started looking into LLMs , I noticed that I could use what I've since called the transcript hack to get LLMs to work as chatbots without specific fine-tuning. It's occurred to me that this partly explains why protection against prompt injection is so hard in practice. The transcript hack involved presenting chat text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at a base LLM: ...you would instead prepare it with an introductory paragraph, like this: That means that "simple" next-token prediction has something meaningful to work with -- a context window that is something that a sufficiently smart LLM could potentially continue in a sensible fashion without needing to be trained. That worked really well with the OpenAI API, specifically with their model -- but didn't with their earlier models. It does appear to work with modern base models (I tried Qwen/Qwen3-0.6B-Base here ). My conclusion was that had had some kind of instruction tuning (the OpenAI docs at the time said that it was good at "consistent instruction-following"), and that perhaps while the Qwen model might not have been specifically trained that way, it had been trained on so much data that it was able to generalise and learned to follow instructions anyway. The point in this case, though, is that this ability to generalise from either explicit or implicit instruction fine-tuning can actually be a problem as well as a benefit. Back in March 2023 I experimented with a simple prompt injection for ChatGPT 3.5 and 4. Firstly, I'd say: It would, of course, accept the challenge and tell me that it was thinking of a number. 
I would then send it, as one message, the following text: Both models told me that yes, I'd won -- the only way I can see to make sense of this is that they generalised from their expected chat formats and accepted the fake "transcript" that I sent in my message as part of the real transcript of our conversation. Somewhat to my amazement, this exact text still works with both the current ChatGPT-5 (as of 12 November 2025): ...and with Claude, as of the same date: This is a simple example of a prompt injection attack; it smuggles a fake transcript in to the context via the user message. I think that the problem is actually the power and the helpfulness of the models we have. They're trained to be smart, so they find it easy to generalise from whatever chat template they've been trained with to the ad-hoc ones I used both in the transcript hack and in the guessing game. And they're designed to be helpful, so they're happy to go with the flow of the conversation they've seen. It doesn't matter if you use clever stuff -- special tokens to mean "start of user message" and "end of user message" is a popular one these days -- because the model is clever enough to recognise differently-formatted stuff. Of course, this is a trivial example -- even back in the ChatGPT 3.5 days, when I tried to use the same trick to get it to give me terrible legal advice , the "safety" aspects of its training cut in and it shut me down pretty quickly. So that's reassuring. But it does go some way towards explaining why, however much work the labs put into preventing it, someone always seems to find some way to make the models say things that they should not.

Giles's blog 3 months ago

Writing an LLM from scratch, part 27 -- what's left, and what's next?

On 22 December 2024, I wrote : Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting. More than ten months and 26 blog posts later, I've reached the end of the main body of the book -- there's just the appendices to go. Even allowing for the hedging, my optimism was adorable. I don't want to put anyone else off the book by saying that, though! I expect most people will get through it much faster. I made a deliberate decision at the start to write up everything I learned as I worked through it, and that, I think, has helped me solidify things in my mind much better than I would have done if I'd only been reading it and doing the exercises. But on the other hand, writing things up does take a lot of time, much more than the actual learning does. It's worth it for me, but probably isn't for everyone. So, what next? I've finished the main body of the book, and built up a decent backlog as I did so. What do I need to do before I can treat my "LLM from scratch" journey as done? And what other ideas have come up while I worked through it that might be good bases for future, similar series? There are a few sources of ideas for this -- from the book itself and its supplementary material, from notes I've made as I went along, and from other things that I've kept on a mental checklist. There are five appendices: Raschka also gives a link at the end of chapter 7 to a notebook showing how to do further fine tuning using Direct Preference Optimization , which also looks fascinating, and he's working on a new project, " Build a reasoning model (from scratch) ". While working through the book, I've deliberately deferred various things. 
I'd kind of lost track of all of them, so I gave ChatGPT the source markdown for all of the posts in this series, and asked it to find where I'd done that. It did an amazing job! There were three categories: long context and attention efficiency, maths, and optimisers. The model we've built in the book has a context length of 1,024 tokens, and is O ( n 2 ) in both space and time with respect to the number of tokens you feed it. There are lots of things that people do to work around that. Things I need to learn: I really want to understand softmax at a better level than "it's a magic thing that turns logits into probabilities". I'd also like to learn more about higher-order tensor operations -- the ones that we use in the book are essentially treating the extra dimensions as the batch, but I believe that there's more to it than that. I really want to understand in reasonable depth what optimisers do. I know that they make gradient updates work better than they do with simple gradient descent. But how? That was the set of things I noted at the time I wrote the posts so far, but there are a few more that come to mind as I write this. In some comments that he made on posts in this series, said that it seems like this book isn't really "from scratch", given that we rely on PyTorch's magic to handle the backward pass. He's 100% right! I think I understand why it is that way, though. There would be two different ways that I can see for the book to do it: I think I'd definitely like to revisit that at some point. Another one from Simon; while the book does explain how tokenisers work, even down to a high-level overview of byte-pair encoding, we don't write our own. Again, I can see why this is -- we load in the GPT-2 weights, so we need to use that model's tokeniser. And there's no point in writing our own if we're just going to throw it away. But perhaps a bit of time playing with one would be useful? 
The book, quite reasonably, shows you how to train your LLM, does a basic train on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. But given that I was getting a pretty good training speed on my own hardware, perhaps I could train a model really from scratch, perhaps using one of the smaller FineWeb datasets? Even if I can't do it locally, perhaps it might be doable on a rented cloud machine, like the Lambda Labs ones I used when fine-tuning Llama 3 ? After all, Andrej Karpathy is training a full model that you can chat with for $100 . I don't think I ever mentioned this on the blog, but one important plan for me is to try to build an LLM from scratch, only using my own blog posts and what I remember -- no looking at the book. If I can do that, then I can be reasonably sure that I really have learned it all. I'm also thinking that I'll do that using a different library -- that is, not PyTorch. That would stop me from regurgitating code that I've learned. If you're reading this within a day or so of the post's publication, I'm running a poll on X/Twitter about which framework to use . If you have an opinion, please do stop by and vote :-) It feels like almost every new model these days is an MoE. I have read a lot around the subject and would love to build on it. Essentially, instead of having just one feed-forward network after your attention heads, you have several. In front of them you have a router -- a trainable network of some kind -- that tells you which of these "expert" FFNs the token should be forwarded to. You then send it to the top (or top k ) experts, while leaving the others inactive. The result is that you have more space (in terms of parameters) for the LLM to know about things, but not all of those parameters are active during inference -- so your model is smarter but still fast. 
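As a rough illustration of the routing idea just described, here's a pure-Python sketch for a single token. It rests on several assumptions that the text above doesn't specify: the router is a single linear layer, the chosen experts' outputs are combined using a softmax over just their router logits, and the "experts" are toy scaling functions rather than real feed-forward networks. All of the names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    # Router: one logit per expert (here just a dot product with a weight row).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Gate weights over the chosen experts only; the others are never run.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        y = experts[i](token)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

def make_expert(scale, name, calls):
    """Toy stand-in for an expert FFN: scales its input and records that it ran."""
    def expert(vec):
        calls.append(name)
        return [scale * x for x in vec]
    return expert

calls = []
experts = [make_expert(10.0 ** i, i, calls) for i in range(3)]
router_weights = [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
out, chosen = moe_forward([1.0, 0.0], router_weights, experts, top_k=2)
print(chosen, calls, out)
```

The `calls` list shows the point of the design: only the selected experts do any work for this token, even though all three sets of parameters exist.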
There's a bunch of interesting stuff there, from how you build it in the first place, to how you handle the fact that you're processing lots of tokens at once -- multiple tokens in each sequence and multiple sequences in a batch. It would be a pretty cool follow-on to the "my own LLM" series, thinking about it. I definitely don't think I need to do all of those things in order to wrap up this series. Here's the subset I'm planning on doing: For the other things, I think there are some potential future series to write. I'm certainly not promising that I'll write up all (or even any) of that second list, but they all seem really tempting to me right now. If you're particularly interested in seeing my take on any of them, please do leave a comment below. I think the next post in this series -- maybe the next several posts -- will be on trying to train the model code provided in the book from scratch to produce my own base model. Stay tuned! Here's a link to the next post in this series . A: An introduction to PyTorch B: References and further reading C: Exercise solutions D: Adding bells and whistles to the training loop E: Parameter-efficient fine-tuning with LoRA The KV cache . This is basic stuff and I feel I sorta-kinda understand it, but I haven't written about it so I can't be sure. It's a pretty obvious enhancement to avoid repeating work when generating autoregressively -- that is, the normal setup where in order to generate n tokens, we give the model its input, sample our first token from its predictions, then feed the whole thing -- the input and that first token -- back in for the second token, and so on. Obviously, because attention is causal, we're doing exactly the same work every time for all of the tokens in each round apart from the last one, so it makes sense to cache things. The result is that generating the first token is still O ( n 2 ) , but subsequent ones will be something more like O ( n ) each. 
That's why real-world modern models tend to take a while pondering before they generate the first token but then speed up -- they need to fill their cache. FlashAttention and related things: there are lots of ways people have found to reduce the cost of attention generally, but this seems to be the most popular one, or at least the best to get started with. Better positional embeddings : the context length of our GPT-2-style LLM is fixed in part because you need position embeddings for every possible input position. That means that we can never extend it. More modern LLMs use better ways to represent positions -- Rotary Position Embeddings (RoPE) look like they're very popular. Manually code a backward pass to go with the forward pass on each of our modules. Simon did this, and was kind enough to share his code with me -- it looks like one of those things (like attention) that is pretty hard to get your head around initially, but once it clicks it's super-clear. Definitely kudos to him for getting it all to work! The problem with this is that I don't think any ML practitioners do this nowadays, because automatic differentiation is there in every popular framework. So it might be a good learning experience, but also might nudge people into an unprofitable direction. Create our own automatic differentiation system. Andrej Karpathy pops up again when looking into this; he created micrograd , which handles back-propagation for scalar functions. That's really clever -- but it would be hard, and a bit of a side quest from the point of the book. Also, the most interesting stuff (at least from what little I know) for automatic differentiation is how you do it with non-scalars -- the matrices and higher-order tensors that our LLM uses. From what Simon says, this is where you need to use the mysterious Jacobian matrices I've heard about in the context of back-propagation. Training the full GPT-2 base model myself. I'm 100% going to try this. 
From the appendices -- anything that surprises me from the one on PyTorch, and perhaps from the "bells and whistles" in the training loop. The others I either won't do, or will pick up later. Building my own LLM from scratch in a different framework, without using the book. That is, I think, essential, and perhaps would be the crowning post of this series. It would be a nice way to end it, wouldn't it? Improving context length -- RoPE and other tricks -- sounds like an excellent series to start on when I'm done with this. AIs tell me that other interesting things to look into would be ALiBi, NTK/YaRN scaling, and positional interpolation. Improving performance: the KV cache, FlashAttention, and other performance enhancements likewise feel like they could make a good series. I also want to do a separate series on LoRA. In that, I'll draw on appendix E from this book, but also on other tutorials. Likewise DPO, along with other post-training that can be done to make models more useful as chatbots, like Reinforcement Learning. I'd really like to spend some time understanding that area. (And Raschka's upcoming reasoning model book might fit into that category too.) Optimisers: Adam, AdamW, maybe Muon (though the latter scares me a bit). The maths -- softmax and higher-order tensor calculations -- also seems to belong in another series, perhaps an extension of the various "maths for AI" posts I've done in the past. Automatic differentiation and the backward pass; that would make a great series. A mixture-of-experts model would be excellent fun, I think. Tokenisers would be a great stand-alone post, at least at the level that I can see myself covering it. Perhaps that would develop into a series if I found myself getting sucked in.
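The KV cache item in the list above can be illustrated with a toy single-head attention in pure Python. This is a sketch under heavy simplifications: the query, key and value vectors are given directly instead of being computed by projection layers, there is one head and no batching, and the class and function names are hypothetical.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum((w / z) * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def full_recompute(qs, ks, vs):
    """No cache: redo causal attention for every position each time we generate."""
    return [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]

class KVCache:
    """Keep the keys and values from earlier positions so each new token only adds one row."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy vectors for a three-token sequence.
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
recomputed_outputs = full_recompute(qs, ks, vs)
print(cached_outputs[-1], recomputed_outputs[-1])
```

Because attention is causal, the cached and recomputed outputs agree at every position; the difference is that each cached step compares one new query against the stored keys instead of redoing the whole sequence.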

Giles's blog 3 months ago

Writing an LLM from scratch, part 26 -- evaluating the fine-tuned model

This post is on the second half of chapter 7 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In the last post I covered the part of the chapter that covers instruction fine-tuning; this time round, we evaluate our model -- particularly interestingly, we try using another, smarter, model to judge how good its responses are. Once again, Raschka's explanation in this section is very clear, and there's not that much that was conceptually new to me, so I don't have that many notes -- in fact, this post is probably the shortest one in my series so far! Unusually, when at the start of section 7.7 we generate some sample responses for the instructions in our test set, I got exactly the same results as in the book. For once, I guess, everything that uses randomness was happening in the same order as it did when Raschka ran it on his machine. The next step was to generate a file with all of the responses to all of the test instructions, which took 18.9 seconds on my RTX 3090 (compared to a minute on an A100, per the book -- that's quite surprising!) Once that was done, it was time to install Ollama so that I could use the Llama 3 model to evaluate my own. I've never used Ollama before -- when playing with other people's models, I've always used Hugging Face's Transformers library. It's a neat package, though. It wraps , which is a pure C/C++ inference framework (with CUDA support), and makes it easy to download and run models that have been packaged for it. Being written in C, I would imagine that it's faster than PyTorch/Transformers -- though, being inference-only, it's less useful if you're planning to do things like training or fine-tuning the models. My desktop is running a fairly customised install of Arch Linux, and I didn't want to use the default install procedure (which puts it into your system-wide and directories). But it turns out that it's a very well-packaged app, and you don't need to do that. 
Using the manual install instructions for Linux , I just created a new directory , and then ed there and downloaded it: It was about 1.75 GiB. I then untarred it: ...and then I could run commands with full paths, for example: ...to start up the server, or ...to start a session. Neat! It's always good to see pre-built binary packages that have no issues with their install location. The next step was to throw all of the generated test responses (and their associated targets) at Llama 3 and see what it thought about how close they were. Again, this all worked without trouble. I noted that the responses I was getting from Llama 3 were not the same as the ones in the book -- Raschka notes that Ollama is non-deterministic, so there's no surprise there (though it does make me wonder why it accepts a parameter in the API call). When I got on to the final eval, where you run the test results through Llama 3 and ask it to rate them compared to the target outputs, it took 11 seconds to run, and I got an average score of 48.95 / 100, which is close enough to the 50.32 that appears in the book. 1 I'd run an eval on my model, using a smarter model to judge its responses! Somewhat surprisingly, that number was stable over multiple runs. So perhaps there is some level of determinism in Ollama now that wasn't present when the book was written, and the seed (eg. ) is of value. Or perhaps Raschka's comment about it being non-deterministic was more of a "between machines" thing rather than for multiple runs on the same machine -- though then I'm not sure why he suggests re-running it for multiple results. Anyway -- that was it! Eval done. And, to my amazement, that was the end of the chapter -- and almost the end of the book. We've built an LLM from scratch, fine-tuned it, and evaluated it by using a smarter model to judge how well it was following instructions. ...or at least the end of the beginning. 
Having run the evaluation, I've reached the end of the main part of "Build a Large Language Model (from Scratch)". But I don't think I've reached the end of this project; there's still more to do (not least working through the appendices). So, coming up next: a post summarising what I've got through so far in this series, and what the next steps are to wrap it up. Here's a link to the next post in this series. I also got 110 out of 110 scores -- that is, every response from Llama 3 was parseable as an integer. That actually kind of surprised me! Models like to be chatty and helpful. But looking into it, the famous X post by Riley Goodside where he had to "threaten" Bard to stop it from saying "Sure, no problem! Here's your JSON" was almost two years ago.  ↩

Giles's blog 4 months ago

Writing an LLM from scratch, part 25 -- instruction fine-tuning

This post is on the first part of chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers instruction fine-tuning. In my last post, I went through a technique which I'd found could sometimes make it possible to turn non-fine-tuned models into reasonable chatbots; perhaps unsurprisingly, the GPT-2 model isn't powerful enough to work that way. So, with that proven, it was time to do the work :-) This post covers the first half of the chapter, where we actually do the fine-tuning; I'll post later about the second part, where we start evaluating the model that we get. Just as with the last chapter, what we're doing here is essentially plugging together the various things we've built so far, and Raschka's explanation is very clear, so I don't have that much in the way of notes -- but here are the bits that made me pause.

For this part of the book, we use the "medium" variant of the GPT-2 open weights rather than the "small" ones that we've been using so far. I have to assume that this is because the small one really isn't very good at this kind of thing. [Update: it isn't! See the metrics from my mammoth post on training an LLM completely from scratch.]

This was quite interesting. In the past, all of the templates I've seen for instruction following have been designed for chatbots -- that's what we tend to use LLMs for, after all. There's a system prompt and then a format for "message from user", and another for "message from bot". In my series on fine-tuning, where I learned how to fine-tune an 8B-parameter Llama 3 base model to work as a chatbot, I used the format for Llama 2, which is not dissimilar to the Phi3 one that's given as an example in the book.
The Alpaca -style one is quite different; it is designed for more of a one-shot interaction than it is for chat: Now, Alpaca dates from early 2023, and it looks like they used that prompt following a paper " Self-Instruct: Aligning Language Models with Self-Generated Instructions ". I had to think a bit about why one would use that, and I think the core is that this was early days (all of two years ago!) and LLMs had very short context lengths and weren't very smart. Chat uses a lot of tokens! You need the system prompt, and then every conversational turn so far. With our GPT-2 model we have just 1024 tokens to play with -- and Alpaca wasn't much better, as it was built as a fine-tune of Meta's original Llama model , which (according to the model card ) had a context length of 4096 tokens. Chat is a good way to interact with a model, as the multiple conversational turns allow you to build up large amounts of context for the model to play with, meaning that (hopefully) it will be able to give good answers. But if that context doesn't fit into the context length, then it's not so good. Early chatbots, I believe, worked around this by replacing the "transcript" with a summary, but there's only so much you can fit into a 4k-token one. 1 Maybe modern ones do this too, but with GPT-5 having a 400,000-token context window it's not so important. So, in Alpaca times, people were thinking in terms of one-shot interactions with LLMs, and the pattern they chose was targeted at that, so that you could get all of the interesting information and a reply into one sequence. An interesting bit of history! (Again, two years ago is history. Cripes.) This was explained well in the book, but it's an interesting enough point that I thought it was worth going over. Last time around we had a bunch of text messages as our inputs to the model. 
We found the longest one, and then padded them all out to the same length with end-of-sequence tokens, which meant that we could construct batches -- naturally, every input in a batch has to be the same size. This time around we're being a bit smarter. Although every item in a given batch needs to be the same length, batches themselves can be of different lengths -- that is, if our batch size was 8, and the longest sequence in our first batch was 73 tokens long, then we would make our first batch 8 × 73 -- but then, if the longest sequence in our second batch was only 60 tokens long, then the second batch could be 8 × 60 . We only need to pad out sequences to match the longest sequence in their batch, and that saves us time when running the model. That got me thinking about inference at scale -- the kind of thing that LLM providers like OpenAI or Anthropic do. They're going to be receiving very large numbers of sequences to complete, and of course they are going to be running them through in batches. But padding tokens are kind of a waste of inference GPU cycles. They'll have a bunch of different instances of their models running on different machines to handle all of these requests, and they almost certainly have some kind of code to try to route sequences of similar length to the same instances. To take a toy example, if you had a batch size of two and received six sequences, with lengths of 2, 9, 100, 11, 3 and 120, then you'd want to route them so that one instance received the (2, 3) pair, another the (9, 11), and another (100, 120) -- that minimises the amount of padding required and saves wasted cycles. Following on from that, it looks like we could actually improve the book's code by doing something similar here, grouping similarly-sized inputs together. That would be quite complicated, though, so probably not worth it in an educational context like this. 
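To put some numbers on that toy example, here is a minimal sketch in plain Python that counts the padding tokens needed under three schemes: padding everything to the longest sequence overall, padding each batch to its own longest sequence in arrival order, and doing the same after sorting by length. The function name `padding_needed` is made up for illustration; this is not the book's code, and real routing at an LLM provider is certainly more involved.

```python
def padding_needed(lengths, batch_size):
    """Count padding tokens when consecutive items are batched together
    and each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


lengths = [2, 9, 100, 11, 3, 120]   # the toy sequence lengths from above
batch_size = 2

# Pad every sequence to the longest one in the whole dataset
global_pad = sum(max(lengths) - n for n in lengths)

# Pad per batch, taking the sequences in the order they arrived
arrival_order = padding_needed(lengths, batch_size)

# Pad per batch after grouping similar lengths together
grouped = padding_needed(sorted(lengths), batch_size)

print(global_pad, arrival_order, grouped)  # 475 213 23
```

Per-batch padding alone more than halves the waste here, and grouping by length gets it down to the (2, 3), (9, 11), (100, 120) pairing described above.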
Anyway, our collator needs to handle the variable-length batches, and through various drafts we converge on one that does it, with one tweak. This was a really important and interesting bit. Let's say that we're feeding in a 20-token input, in a batch where the longest of the other sequences is 30 tokens long. That means that we have ten padding tokens at the end. Let's represent that input sequence like this: The numbers are our token IDs, and I've used to represent the end-of-sequence token that we use for padding. Now, we need our target sequence to predict for that. The first version that we come up with in the book looks like this: So, we've just done the normal trick of shifting left by one character, and we've added an extra end-of-sequence token at the end to make the lengths match. But as the next step, we replace all of the padding tokens, apart from the one right at the end of the "real" part of the sequence, with an invalid token ID, . Using to represent that, we have: The core thing to remember here is that we honestly don't care what the model generates after it's done the real, unpadded sequence, plus an end-of-sequence token. It could generate random junk and it wouldn't matter, because it's already done the important part of predicting next tokens for all of the input sequence. The is a magic number that PyTorch's cross_entropy function uses in target sequences to say "ignore this position". I must admit that as a software engineer, it gives me a bit of an "ick" -- magic numbers are never nice -- but it does make sense. Negative numbers are invalid targets when you're comparing predictions across tokens -- which have indexes from zero up. In general, if you're predicting categories -- which essentially we are with tokens -- then the "minus one'th" token doesn't make sense. 
You could use any other negative number, but -1 might cause confusion (being used heavily in ML code to get the last element of a sequence) and if you're going to use any other negative number it might as well be -100. "Purer" solutions would be hard, anyway. We're working with a PyTorch tensor here, so it has to be a number -- which rules out using something like or some kind of special object. You could keep an "ignore after this index" number, but you'd need as many of them as you have items in the batch and it would be just another thing to keep track of. You could even keep a tensor of boolean "ignore these tokens" of the same size as your batch -- a mask -- but that would have the same problem of being something to pass around in your code. As I understand it, those last two solutions are actually used in some systems -- imagine that your outputs were not logits to create a probability distribution across categories or tokens, but were meaningful numbers in and of themselves. Pretty much any number you picked might be a valid output from the model. You wouldn't be using cross entropy loss in those cases anyway, of course, but you'd need to keep some record of where the padding starts so that you can ignore it. One final thing that is worth noting is that we only add the -100s on to the targets. This makes sense, as all of the inputs will be fed into the LLM, so things that aren't valid tokens are going to make the embedding layer very unhappy. That also explains why we firstly add them on to the sequence as regular padding and then convert them to -100 for the targets: it allows us to add on the padding, then get the input sequence as all but the last token, then get the targets as tokens 1 up to the end. After that's done we run the code to replace all but the first end-of-sequence padding tokens with -100 on the targets.
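The sequence of steps described above (pad, split into inputs and targets, then mask) can be sketched like this. It is a simplified stand-in for the book's collator, not a copy of it: the name `collate` is my own, it uses plain Python lists instead of PyTorch tensors so that it runs without dependencies, and it assumes the real sequences never contain the padding token themselves. The value 50256 is GPT-2's end-of-sequence token ID, and -100 is the default `ignore_index` for PyTorch's `cross_entropy`.

```python
PAD_TOKEN_ID = 50256   # GPT-2's end-of-sequence token, used as padding
IGNORE_INDEX = -100    # default ignore_index for PyTorch's cross_entropy


def collate(batch, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Hypothetical collator: pad to the longest item in this batch,
    then build shifted targets with the surplus padding masked out."""
    # +1 because every item gets at least one end-of-sequence token appended
    padded_length = max(len(item) for item in batch) + 1
    inputs, targets = [], []
    for item in batch:
        padded = item + [pad_token_id] * (padded_length - len(item))
        inp = padded[:-1]    # all but the last token
        tgt = padded[1:]     # tokens 1 up to the end
        # Keep the first padding token in the targets, mask all later ones
        pad_positions = [i for i, tok in enumerate(tgt) if tok == pad_token_id]
        for i in pad_positions[1:]:
            tgt[i] = ignore_index
        inputs.append(inp)
        targets.append(tgt)
    return inputs, targets


inputs, targets = collate([[1, 2, 3, 4], [5, 6]])
print(inputs)   # [[1, 2, 3, 4], [5, 6, 50256, 50256]]
print(targets)  # [[2, 3, 4, 50256], [6, 50256, -100, -100]]
```

Note that the inputs only ever contain valid token IDs, while the targets keep exactly one end-of-sequence token per item, so the model is still trained to stop after the real content.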
As with the last chapter, I got different results to the ones in the book; something different about the order of execution in my version of the code when compared to Raschka's meant that despite all of the careful use of , the numbers didn't quite match up. But, again as before, they were close and the trends in -- for example -- loss were the same, so the right things were happening. When I finally ran the train on my RTX 3090, it took 48 seconds; I watched it in and saw that it was using 9GiB VRAM. Due to the usual differences in randomness, I got slightly different results to the book -- but similar enough to not be any cause for concern: Also, due to a typo, I accidentally ran it with five epochs -- that took two minutes. I noticed that validation loss started rising fairly steadily after epoch 2, with train loss dropping -- clearly overfitting. Presumably Raschka chose two epochs for exactly that reason :-) A couple of things that I noticed while working through the code; when I first ran the download script, I got . That's because of a typo in the import at the start -- instead of ...it should be: The other thing that tripped me up was the original . We add on a padding token, then pad out the sequence with more padding tokens, then remove the last one. I found that confusing -- why not just add on the required number in the first place rather than adding on an extra one and then deleting it? It became clear later on; it's to make it mirror the next function, which adds on an extra end-of-sequence token for our targets, but having this anticipatory code in there with no explanation in the first draft made me start doubting my sanity for a little while... Minor points, though. So, that was it for the first half of chapter 7 in the book. The next bit looks like fun -- we're going to use a smart model to evaluate our relatively dumb one on how well it follows instructions. Definitely looking forward to that :-) Here's a link to the next post in this series . 
I experimented with ChatGPT 3.5 at around the time Alpaca came out and came to the conclusion that it had a similar context length, of about 4k tokens. It looked like it worked around it by, when the transcript started reaching the context length, spinning off a separate instance to summarise it into a "story so far" kind of thing, which was then injected in to the start of the chat instead of the full context. My experiment was to say "my favourite colour is green, please remember that", then to send a quote of about 4,000 words from "Moby Dick", prefacing that with either "this is unimportant, please ignore" or "this is important, please remember". Next, I'd ask what my favourite colour was again. If I told it that the quote was unimportant, then it would remember, but if I told it that it was important, it would think my favourite colour was blue. Asking it for transcripts of the conversation so far would give a reasonable one, skipping the quote, if the quote was tagged as unimportant, but would give a completely hallucinated one if the quote was tagged important.  ↩

Giles's blog 4 months ago

Writing an LLM from scratch, part 24 -- the transcript hack

Chapter 7 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) " explains how we fine-tune our LLM to follow instructions -- essentially turning a model that can do next-token completion for text generation into something we can use for a chatbot. Back when I first started looking into LLMs , I used a setup that didn't require that, and got surprisingly good results, at least with later OpenAI models. The trick was to present the text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at the LLM: ...you would instead prepare it with an introductory paragraph, like this: Earlier OpenAI models couldn't do this when I accessed them through the API, but later ones could. How does our GPT-2 model stack up with this kind of thing -- and for comparison, how about a newer, more sophisticated base (as in, not instruction fine-tuned) model? I wrote a simple script that just allowed me to test the transcript above against different GPT-2 models, using the infrastructure that I'd already written for the code in the book. It uses the function -- that is, the basic one using greedy decoding (always pick the most likely next token) and just gets the next 23 tokens. Here's what I got with the different GPT-2 models: It looks like it has the concept of a transcript, at least -- but not very useful. Better in a way -- it's got some actual text in there. Hmm. Still not looking good, and that dodgy Unicode isn't great (unless the bot was trying to say ;-) OK, still not getting there. So it looks like the GPT-2 model can't do this. That actually makes a lot of sense -- at the time I wrote my post using this transcript hack, I found that the , and versions could not either, and nor could earlier versions of . These were all versions of GPT-3, so one generation later than the version we're working with here. The first version that "got it" was , which was GPT-3.5. 
Looking back at my blog post way back when, I spotted something about the model. At the time, the docs on the OpenAI site said that it can: do any language task with better quality, longer output, and consistent instruction-following than the curie, babbage, or ada models Did it perhaps have instruction-following built in -- that is, had it been instruction fine-tuned already? That doesn't quite make sense, though, because instruction fine-tuning is normally done using a particular format. Imagine that a model was fine-tuned on chats that looked like this Guanaco -format dataset: ...and then it was fed something more like this (which is from a Reddit post about the Llama 2 prompt format ): That is, at least intuitively, going to cause problems. What happens if we take a look at a more modern model, but make sure it's a base one? I've been playing with , so I wrote a script to use it. Here's what I got: That's actually not half-bad! The first response is OK apart from the "a room or a room". Its prediction of what the user would say as their second question isn't great, though. Let's see what happens if we prompt it with a more reasonable second round -- I modified the script to complete this sequence: The full result (including the input sequence) was: This is looking almost solid. And there's no mention of instruction training on the model card for . I think that my initial intuition was right -- a sufficiently advanced base model can operate as a chatbot without instruction fine-tuning. However, I think I was actually wrong about it when I originally had the intuition! The OpenAI GPT-3.5 model that I believed was a base one, , had already had some kind of instruction-following tuning after it was initially trained. It was really impressive that it was able to generalise from that to the somewhat ad-hoc chat transcript format that I was using at the time, but it was not a base model. 
By contrast, a modern 600M-parameter model -- smaller than GPT-2 large (at 774M) and less than half the size of GPT-2 XL (at 1.5B) -- can actually work with a transcript without difficulty. My guess is that as well as the architectural improvements that have happened over the years since GPT-2, it's the size and quality of the dataset it was trained on. GPT-2 was trained on "8 million documents for a total of 40GB of text" according to the paper . It's not entirely clear how many tokens that is, but I've seen claims of 10B tokens (for example here ), and that seems in line with 40GB, as that would be 40 billion bytes, and 4 bytes/token seems reasonable. GPT-3, according to Wikipedia , was trained on around 500B tokens. The Qwen3 series, by contrast, according to the model card , was trained on 36 trillion tokens across 119 languages. That's 72 times as much data -- and the data was probably much more curated, too. That's a big difference! Perhaps it had seen lots of transcripts in there, and that was why it was able to mimic them? Or just lots of different kinds of text in general? I guess it's no surprise that training runs are getting ever-more expensive if that's the size of a frontier model run, though. So, that's all for the transcript trick -- base models actually can work as chatbots without instruction fine-tuning, if they're sufficiently advanced and trained on enough data. That's useful to know! Time to go back to the book; coming next, my notes on the actual fine-tuning I was meant to be doing rather than messing around with this :-) Here's a link to the next post in this series .

Giles's blog 4 months ago

A classifier using Qwen3

I wanted to build on what I'd learned in chapter 6 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". That chapter takes the LLM that we've built, and then turns it into a spam/ham classifier. I wanted to see how easy it would be to take another LLM -- say, one from Hugging Face -- and do the same "decapitation" trick on it: removing the output head and replacing it with a small linear layer that outputs class logits. Turns out it was really easy! I used , and you can see the code here . The only real difference between our normal PyTorch LLMs and one based on Hugging Face is that the return value when you call your model is a object with more to it than just the output from the model itself. But it has a field on it to get the raw output, and with that update, the code works largely unchanged. The only other change I needed to make was to change the padding token from the fixed 50256 that the code from the book uses to . ChatGPT wrote a nice, detailed README for it, so hopefully it's a useful standalone artifact.

Giles's blog 4 months ago

Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch

I recently posted about Andrej Karpathy's classic 2015 essay, " The Unreasonable Effectiveness of Recurrent Neural Networks ". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about. This post is a bit more hands-on. To understand how these RNNs really work, it's best to write some actual code, so I've implemented a version of Karpathy's original code using PyTorch's built-in class -- here's the repo . I've tried to stay as close as possible to the original, but I believe it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising, given that he wrote it using Torch, the Lua-based predecessor to PyTorch.) In this post, I'll walk through how it works, as of commit . In follow-up posts, I'll dig in further, actually implementing my own RNNs rather than relying on PyTorch's. If you already have a basic understanding of what RNNs are and roughly how they work, you should be fine with this post. However, if you're coming directly from normal "vanilla" neural nets, or even Transformers-based LLMs (like the one I'm working through in my LLM from scratch series), then it's definitely worth reading through the last post , where I give a crash course in the important stuff. So with that said, let's get into the weirdest bit from a "normal" LLM perspective: the dataset. Every now and then on X/Twitter you'll see wry comments from practitioners along the lines of "AI is 5% writing cool models and 95% wrangling data". My limited experience bears this out, and for RNNs it's particularly weird, because the format of the data that you feed in is very different to what you might be used to for LLMs. With a transformers-based LLM, you have a fixed context length -- for the GPT-2 style ones I've posted about in the past, for example, you have a fixed set of position embeddings. 
More recent position encoding mechanisms exist that aren't quite so constraining, but even then, for a given training run you're going to be thinking in terms of a specific context length -- let's call it n -- that you want to train for. So: you split up your training data into independent chunks, each one n long. Then you designate some subset of those your validation set (and perhaps another bunch your test set), and train on them -- probably in a completely random order. You'll be training with batches of course; each batch would likely be a completely random set of chunks. To get to the core of how different RNNs are, it helps to start with an idealised model of how you might train one. Remember, an RNN receives an input, uses that to modify its internal hidden state , and then emits an output based on the updated hidden state. Then you feed in the next input, update the hidden state again, get the next output, and so on. Let's imagine that you wanted to train an RNN on the complete works of Shakespeare. A super-simple -- if impractical -- way to do that would be to feed it in, character by character. Each time you'd work out your cross-entropy loss . Once you'd run it all through, you'd use those accumulated per-character losses to work out an overall loss (probably just by averaging them). You would run a backward pass using that loss, and use that to adjust the parameters. If you're feeling all at sea with that backpropagation over multiple steps of a single neural network with hidden state, check out the " Training RNNs " section of the last post. You can see that in this model, we don't have any kind of chunked data. The whole thing is just run through as a single sequence. But there are three problems: Let's address those -- firstly, those vanishing or exploding gradients. In the last post I touched on truncated backpropagation through time (TBPTT). 
The idea is that instead of backpropagating through every step we took while going through our batched input sequences, we run a number of them through, then backpropagate, and then continue. Importantly, we keep the hidden state going through the whole sequence -- but we detach it from the compute graph after each of these steps, which essentially means that we start accumulating gradients afresh, as if it was a new sequence, but because it started from a non-zero initial hidden state, we're still getting some training value from the stuff we've already been through. 2 Imagine we have this simple sequence: Let's say we're doing TBPTT of length 3: we can split up our training set so that it looks like this: So now, we just feed in "a", then "b", then "c", then do our TBPTT -- we calculate loss just over those items, update our gradients, and then detach the hidden state, but keep its raw, un-gradient-ed value. Then we start with that stored hidden state, and feed in "d", "e", "f". Rinse and repeat. In practice we'd probably throw away that short sequence at the end (because it would cause issues with gradient updates -- more here), so we'd just get this: Now, let's look into batching. It's a bit harder, but with a bit of thought it's clear enough. Let's say that you want b items in your batch. You can just split your data into b separate sequences, and then "stack them up", like this with b = 2: So for training, we'd feed our vector in as a batch, calculate loss on both of them, then , and so on. The important thing is that each batch position -- each row, in that example -- is a consistent, continuous, meaningful sequence in and of itself. Finally, for validation, you also need some real sequences. For that, you can just split up the batched subsequences, with a "vertical" slice.
Let's take the rather extreme view that you want 50% of your data for validation (in reality it would be more like 10-20%, but using 50% here makes it clearer): Your training set would wind up being this: ...and the validation set this: And we're done! So that's what we wind up feeding in. And it kind of looks a bit like what we might wind up feeding in to a regular LLM training loop! It's a set of fixed-length chunks. But there's one critically important difference -- they're not in an arbitrary order, and we can't randomise anything. The sequence of inputs in, for example, batch position one, needs to be a real sequence from our original data. This has been a lot of theoretical stuff for a post that is meant to be getting down and dirty with the code. But I think it's important to get it clear before moving on to the code because when you see it, it looks pretty much like normal dataset-wrangling -- so you need to know why it's really not. Let's get into the code now. In the file , we define our dataset: The that we pass in will be our complete training corpus -- eg. the complete works of Shakespeare -- and is the limit we're going to apply to our truncated backpropagation through time -- that is, three in the example above. Karpathy's blog post mentions using 100, though he says that limiting it to 50 doesn't have any major impact. Next, we make sure that we have at least enough data to do one of those TBPTTs, plus one extra byte at the end (remember, we need our targets for the predictions -- the Ys are the Xs shifted left with an extra byte at the end). ...and we stash away the data, trimmed so that we have an exact number of these sequences, plus one extra byte for our shifted-left targets. Now we create a tokeniser. 3 This is related to something I mentioned in the last post. Karpathy's post talks about character-based RNNs, but the code works with bytes. The RNNs receive as their input a one-hot vector. 
Now, if we just used the bytes naively, that would mean we'd need 256 inputs (and accept 256 outputs) to handle that representation. That's quite a lot of inputs, and the network would have to learn quite a lot about them -- which would be wasteful, because real human-language text, at least in European languages, will rarely use most of them. His solution is to convert each byte into an ID; there are exactly as many possible IDs as there are different bytes in the training corpus, and they're assigned an ID based on their position in their natural sort order -- that is, if our corpus was just the bytes , and , then we'd have this mapping 4 : We just run the full dataset through to get the set of unique bytes, then sort it -- that gives us a Python list in the right order so that we can just do lookups into it to map from an ID to the actual byte. The class is defined in and is too simple to be worth digging into; it just defines quick and easy ways to get the vocab size (the number of IDs we have), and to encode sequences of bytes into PyTorch tensors of byte IDs and to decode them in the other direction. Because these byte IDs are so similar to the token IDs that we use in LLMs, I've adopted the name "tokens" for them just because it's familiar (I don't know if this is standard). So, at this point, we have our data and our tokenizer; we finish up by stashing away an encoded version of the data ready to go: Next we define a method to say how long our dataset is -- this is calculated in terms of how many TBPTT sequences it has: -- and a method: This works out the start and the end of the th subsequence of length in the data. It then returns four things: The code as it stands doesn't actually use the last two, the raw bytes -- but they did prove useful when debugging, and I've left them in just in case they're useful in the future. 
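As a rough illustration of the byte-to-ID mapping described above, here is a minimal sketch. The class name `ByteTokenizer` and its method names are hypothetical, and it returns plain Python lists where the real code returns PyTorch tensors; the only thing taken from the text is the idea that an ID is a byte's position in the sorted list of distinct bytes in the corpus.

```python
class ByteTokenizer:
    """Hypothetical stand-in: maps each distinct byte in a corpus to a
    small ID given by its position in sorted order."""

    def __init__(self, data: bytes):
        # Iterating over bytes yields ints, so this is a sorted list of ints
        self.id_to_byte = sorted(set(data))
        self.byte_to_id = {b: i for i, b in enumerate(self.id_to_byte)}

    @property
    def vocab_size(self) -> int:
        return len(self.id_to_byte)

    def encode(self, data: bytes) -> list[int]:
        return [self.byte_to_id[b] for b in data]

    def decode(self, ids: list[int]) -> bytes:
        return bytes(self.id_to_byte[i] for i in ids)


tokenizer = ByteTokenizer(b"banana bread")
ids = tokenizer.encode(b"bad")

print(tokenizer.vocab_size)   # 7 distinct bytes: space, a, b, d, e, n, r
print(ids)                    # [2, 1, 3]
print(tokenizer.decode(ids))  # b'bad'
```

With this toy corpus the network would need only 7 inputs and outputs rather than 256, which is the saving the mapping is there to provide.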
If you look back at the more theoretical examples above, what this Dataset is doing is essentially the first bit: the splitting into BPTT-length subsequences and dropping any short ones from the end -- the bit where we go from The only extra thing is that it also works out our target sequences, which will be a transformation like this: So that's our . Next we have a simple function to read in data; like the original code I just assume that input data is in some file called in a directory somewhere: Now we have the next step, the function : This looks a little more complicated than it actually is, because it's building up a list of tuples, each one of which is a set of , , and . If we imagine that it only did the , it would look like this: So, what it's doing is working out how many batches of size there are in the sequence. With our toy sequence ...and a batch size of two, there are . In this case, it would then loop from zero to 3 inclusive. Inside that loop it would create a list, then loop from zero to 1 inclusive. The first time round that loop it would get the item at , which is 0 + 0 * 4 = 0, so the subsequence . It would add that to the list. Then it would go round the inner loop again, and get the item at the new . is now 1, so that would be 0 + 1 * 4 = 4, so it would get the subsequence at index 4, which is , and add that to the list. We'd now have finished our first run through the inner loop, and we'd have the list [ , ], so we stack them up into a 2-D tensor: Hopefully it's now fairly clear that in our next pass around the outer loop, we'll pull out the items at index 1 and index 5 to get our next batch, and , and so on, so that at the end we have done the full calculation to get this: ...as a list of 2 × 3 PyTorch tensors. And equally hopefully, it's clear that the code in is just doing that, but not only for the but also for the , and . 
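To make the index arithmetic concrete, here is a small self-contained sketch of both steps: chunking into TBPTT-length subsequences with shifted targets, and then arranging those chunks into batches. The function names and the 25-letter toy corpus are my own assumptions, chosen to match the shape of the walkthrough above (eight subsequences of length three, batch size two); it uses strings in place of token-ID tensors and handles only the Xs and Ys, not the raw bytes.

```python
def make_chunks(tokens, seq_length):
    """Split into (x, y) pairs of length seq_length, where y is x shifted
    left by one. Any short remainder at the end is dropped."""
    num_chunks = (len(tokens) - 1) // seq_length
    chunks = []
    for i in range(num_chunks):
        start = i * seq_length
        end = start + seq_length
        chunks.append((tokens[start:end], tokens[start + 1:end + 1]))
    return chunks


def batchify(chunks, batch_size):
    """Arrange chunks so that each batch position (row) is one continuous
    run through the data across successive batches."""
    num_batches = len(chunks) // batch_size   # leftover chunks are dropped
    batches = []
    for batch_num in range(num_batches):
        xs, ys = [], []
        for batch_pos in range(batch_size):
            x, y = chunks[batch_num + batch_pos * num_batches]
            xs.append(x)
            ys.append(y)
        batches.append((xs, ys))
    return batches


data = "abcdefghijklmnopqrstuvwxy"   # 24 characters plus one for the last target
chunks = make_chunks(data, seq_length=3)
batches = batchify(chunks, batch_size=2)
first_x, first_y = batches[0]

print(first_x)        # ['abc', 'mno']  (chunks 0 and 4)
print(first_y)        # ['bcd', 'nop']
print(batches[1][0])  # ['def', 'pqr']  (chunks 1 and 5)
```

Reading row 0 across the batches gives "abc", "def", "ghi", "jkl", which is the continuous sequence that the hidden state for that batch position gets to see.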
One thing to note before moving on is what happens if the number of items doesn't divide evenly into batches -- this code: ...means that we'll drop them. So, for example, if we wanted a batch size of three with our toy sequence ...then we'd get this: ...and the and would be dropped. And that's it for the dataset code! You might be wondering where the split to get the validation set comes -- that's actually later on, in the training code that actually uses this stuff. So let's move on to that! This is, logically enough, in the file train_rnn.py . There's quite a lot of code in there, but much of it is stuff I put in for quality-of-life (QoL) while using this. It's useful -- but I'll skip it for now and come back to it later. Initially, I want to focus on the core. We'll start with the function at the bottom. It starts like this: The -related stuff is QoL, so we'll come back to it later. All we need to know right now is that it's a way of getting information into the system about where its input data is, plus some other stuff -- in particular our TBPTT sequence length and our . So it uses that to read in some training data, then initialises one of our s with it and the , then uses to split it into batches. Next we have this: So our gives us a validation data percentage; we do some sanity checks and then just slice off an appropriate amount from the end of the we got to split the data into train and validation sets. That's the equivalent of the transform from the example earlier from To this training set: ...and this validation set: Now, we create our model: We're using a new class, which is an extension of the PyTorch built-in class -- we'll come back to that later. It's also getting parameters (things like the size of the hidden state and the number of layers) from the . Finally, we do the training in a function: So let's look at that now. 
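The validation split itself is just a slice off the end of the list of batches. A minimal sketch, assuming a percentage-based setting; the function name `split_train_val` and the exact sanity checks are illustrative rather than taken from the repo:

```python
def split_train_val(batches, val_percent):
    """Slice the last val_percent of the batches off as a validation set."""
    if not 0 < val_percent < 100:
        raise ValueError("val_percent must be between 0 and 100 (exclusive)")
    num_val = int(len(batches) * val_percent / 100)
    if num_val == 0:
        raise ValueError("not enough batches to hold out a validation set")
    return batches[:-num_val], batches[-num_val:]


batches = ["batch0", "batch1", "batch2", "batch3"]   # placeholders for real batches
train_batches, val_batches = split_train_val(batches, val_percent=50)

print(train_batches)  # ['batch0', 'batch1']
print(val_batches)    # ['batch2', 'batch3']
```

Because the slice is taken across whole batches, every batch position in both the training and the validation set is still a continuous run of the original data.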
It starts like this: That's fairly standard boilerplate to use CUDA if we have it, and to put the model onto whatever device we wind up using. Next: The class name for the optimiser is another one of those things from the , as are the learning rate and weight decay hyperparameters. So we just create an instance of it, and give it the model's parameters to work with along with those. Next, we get our patience: This is a QoL thing, but I think it's worth going into what it actually means. When we're training, we normally train for a fixed number of epochs. However, sometimes we might find that our model was overfitting -- say, at epoch 50 out of 100 we might see that the training loss was still decreasing, but our validation loss started rising. Any further training past that point might be pointless -- if we're doing things properly, we're saving checkpoints of the model periodically, so we'd be able to resurrect the model that we had at the point where validation loss was lowest, but we're still wasting time continuing training. A common solution to that is to have early stopping in the training loop. If the validation loss starts rising then we bail out early, and don't do the full number of epochs that we originally planned to do. Naively, we might keep track of the validation loss from the last epoch, and then if the current epoch has a higher loss, then we bail out. However, sometimes you find that validation loss rises a bit, but then starts going down again -- it's kind of like a meta version of finding a local minimum in the loss function itself. The solution to that is to use patience -- a measure of how many epochs of rising validation loss you're willing to put up with before you do your early exit. That's the number we're getting from our here -- it's a positive number (note the paranoid ), and if it's not defined we just assume that we have infinite patience. 
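Here is a small sketch of early stopping with patience, run over a made-up list of per-epoch validation losses instead of a real training loop. The function name and the exact stopping rule (stop once `patience` epochs have passed since the best loss) are my assumptions about one reasonable way to do it, not necessarily what the code in the repo does.

```python
def train_with_patience(val_losses, patience=None):
    """Walk through per-epoch validation losses, stopping early once
    `patience` epochs have passed without a new best loss.
    patience=None means infinite patience."""
    best_loss, best_epoch = None, None
    last_epoch_run = None
    for epoch, val_loss in enumerate(val_losses):
        last_epoch_run = epoch
        # ... a real loop would train here and then compute val_loss ...
        if best_loss is None or val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif patience is not None and epoch - best_epoch >= patience:
            break   # out of patience
    return best_epoch, best_loss, last_epoch_run


# Loss dips, rises briefly, dips to a new best, then rises for good
losses = [3.0, 2.5, 2.6, 2.4, 2.5, 2.6, 2.7, 2.8]

result_naive = train_with_patience(losses, patience=1)
result_patient = train_with_patience(losses, patience=2)
result_infinite = train_with_patience(losses)

print(result_naive)     # (1, 2.5, 2): bails at the first rise, misses 2.4
print(result_patient)   # (3, 2.4, 5): rides out the blip, stops two epochs after the best
print(result_infinite)  # (3, 2.4, 7): runs every epoch
```

The naive version (patience of one) stops at the temporary rise and never sees the better loss at epoch 3, which is exactly the local-minimum problem described above.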
The next two lines are related to patience too -- before we go into our main training loop, we define the two variables we need to control early exit with patience: Pretty obviously, those are the best validation loss that we've seen so far, and the number of the epoch where we saw it. Right, finally we get to some training code! We have our epoch loop: We're using the rather nice module to get progress bars showing how far we are through the train (ignoring any early exits due to running out of patience, of course). We start the epoch by generating some random text from the model. This gives us a reasonably easy-to-understand indication of progress as we go. Next we put our model into training mode: ...set an initial empty hidden state: You might be wondering why the hidden state is getting a variable of its own, given that it's meant to be hidden -- it's right there in the name! Don't worry, we'll come to that. Next we initialise some variables we'll use to keep track of loss -- the total loss across all of the batches we've pushed through, plus the total number of tokens. The metric we track for each epoch is the loss per token, so we use those to work out an average. Now it's time to start the inner training loop over our batches: We're just unpacking those tuples that were created by into our and (I think I was being ultra-cautious about things here when I added to the start of ). And again we're using to have a sub-progress bar for this epoch. Next, we move our Xs and Ys to the device we have the model sitting on: And then run it through the model. The code to do this looks like this: ...and I think it's worth breaking down a bit. You can see that there's a branch at the top, if there's a hidden state then we need to pass it in and if there isn't, we don't. 
But let's focus on the no-hidden-state option in the branch first, because there's something surprising there: Remember the description of an RNN from above: an RNN receives an input, uses that to modify its internal hidden state, and then emits an output based on the updated hidden state. Then you feed in the next input, update the hidden state again, get the next output, and so on. We can easily extend that to handle batches -- you'd give the RNN a batch of inputs (let's say a tensor b × 1), and get a batch of results, also b × 1. You'd also need the RNN to hold b hidden states, but that's not a big jump. But what we're doing in that code is something different -- we're feeding in a whole series of inputs -- that is, is of size b × n, where n is our desired TBPTT sequence length. What's worse, in our description above, the hidden state was just that -- something hidden in the model. Now it's being returned by the RNN! What's going on? Let's start off with that hidden state. We often need to do stuff with the hidden state from outside the RNN -- indeed, we're detaching it as an important part of our TBPTT. So the PyTorch RNN actually does work rather like the simplified model that I described in my last post, and treats the hidden state like an output, like in this pseudocode: That is, the hidden state is an input and a return value, like this: OK, so the hidden state thing makes sense. How about the fact that we're feeding in a whole set of inputs? This is actually just due to a quality of life thing provided by PyTorch's various RNN classes. Wanting to feed in a sequence is, of course, a super-common thing to want to do with an RNN. So instead of having to do something like the pseudocode above, it's baked in. When you run ...then because is b × n, it just runs the RNN n times, accumulating the outputs, then returns the outputs as another b × n tensor, along with the final from the last run through that loop.
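As a rough check of that "it just runs the RNN n times" description, the sketch below feeds a sequence through a bare `torch.nn.LSTM` in one call, and then again one step at a time while passing the hidden state back in by hand. The sizes, the random float inputs (standing in for one-hot vectors), and the variable names are all assumptions for illustration, and `batch_first=True` is set so that the batch dimension comes first. Note that with a real `nn.LSTM` the outputs carry an extra trailing dimension of the hidden size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
b, n, features, hidden = 4, 5, 8, 16  # illustrative sizes, not the post's

lstm = nn.LSTM(input_size=features, hidden_size=hidden, batch_first=True)
x = torch.randn(b, n, features)  # stand-in for a batch of one-hot sequences

# One call over the whole sequence: outputs for every step, plus the final state.
outputs, (h_n, c_n) = lstm(x)

# The same thing done manually, one time step per call.
state = None
step_outputs = []
for t in range(n):
    step_in = x[:, t:t + 1, :]  # shape (b, 1, features)
    if state is None:
        out, state = lstm(step_in)          # no hidden state yet
    else:
        out, state = lstm(step_in, state)   # pass the previous state back in
    step_outputs.append(out)
stepped = torch.cat(step_outputs, dim=1)

print(outputs.shape)  # (b, n, hidden)
print(h_n.shape)      # (num_layers, b, hidden)
print(torch.allclose(outputs, stepped, atol=1e-5))

# For truncated BPTT, the carried-over state is detached before reuse.
detached = tuple(s.detach() for s in state)
```

The LSTM state is a tuple of two tensors, which is why the detach at the end has to handle each element separately.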
(There is a wrinkle there that we'll come to shortly.) With that explained, hopefully that branch is clear. We don't have a hidden state right now, so we run all of the inputs across all of our batch items through the RNN in one go, and we get the outputs plus the hidden state that the RNN had at the end of processing that batch of sequences. Now let's look at the other branch, where there is a pre-existing hidden state: Hopefully the last line is clear -- we're just doing the same as we did in the branch, but we're passing the hidden state in because in this case we actually have one. The first two lines are a bit more complex. As you know, we need to detach the hidden state from PyTorch's computation graph in order to truncate our backpropagation through time. We're doing that here at the start of the loop just to make sure that each batch that we're pushing through starts with a guaranteed-detached hidden state. So that explains those calls to the methods. The fact that our hidden state is a tuple of two things that we have to detach separately is a little deeper; for now, all we need to know is that the LSTM models that we're using are a variant of RNN that has two hidden states rather than one, and so we need to handle that. I'll go into that in more depth in a future post. Once we've done that, we've completed our forward pass for the epoch. Let's move on to the backward pass. Next, we have this: Pretty standard stuff. is defined further up in the file: It's exactly the same as the function we used to calculate loss in the LLM-from-scratch posts: I wrote more about that here if you're interested in the details. Next, we do something new: This is something that is generally very useful in RNNs. They are prone to vanishing and exploding gradients, and this code is to help handle the exploding case. 
What it says is, if we've defined a , we use it to clip gradients when they get too big, which means that training is going to be better because we're not going to have updates swinging wildly up and down. Let's say that we set to 1.0. If, at the time this code is run, the norm of the gradients -- which is a measurement of their size 5 -- is, say, 10, then they would all be scaled down to 10% of their size, making the new norm 1.0. So that keeps them in check, and stops any wild variations in gradient updates. So, in short -- it's a stabilisation technique to stop exploding gradients leading to issues with training. Next, we have our normal code to update the parameters based on these (potentially clipped) gradients: And finally, we update our count of how many inputs we've seen and our total loss so far in this epoch: That's our training loop! Once we've run that code -- run our input through the model, calculated loss, worked out our gradients, clipped them if necessary, done our update and stored away our housekeeping data -- we can move on to the next batch in our sequences. When we've gone through all of the batches that we have, our training for the epoch is complete. We print out our loss per token: ...and then it's time for our validation loop. This is so similar to the training loop that I don't think it needs a detailed explanation: The only big difference (apart from the lack of a backward pass and parameter updates) is that we're not detaching the hidden state, which makes sense -- we're in a block with the model in mode, so there is no computation graph to detach it from. Validation done, it's time for a bit of housekeeping: All we're doing here is keeping track of whether this is the best epoch in terms of validation loss. The boolean is exactly what it says it is. If we're on our first run through the loop ( is None) then we record our current val loss as , and store this epoch's number into .
Otherwise, we do have an existing , and if our current val loss is lower than that one, we also stash away our current loss and epoch as the best ones. Otherwise we are clearly not in the best epoch so we update to reflect that. Once we've done that, we save a checkpoint: I'll go into the persistence stuff -- saving and loading checkpoints -- later on. Next, a QoL thing -- we generate a chart showing how training and validation loss have been going so far: Again, I'll go into that later. Finally, we do our early stopping if we need to: If the current epoch is more than epochs past the one that had the best validation loss so far, then we stop. That's the end of the outside loop over epochs for our training! If we manage to get through all of that, we print out some sample text: ...and we're done! That's our training loop. Now let's move on to the model itself. I called my model class a , and you can see the code here . It's actually not a great name, as it implies there's something specifically Andrej Karpathy-like about it as a way of doing LSTMs, while what I was trying to express is that it wraps a regular PyTorch LSTM with some extra stuff to make it work more like his original Lua Torch implementation . I tried to come up with a more descriptive name, but they all started feeling like the kinds of class names you get in "Enterprise" Java code like so I gave up and named it after Karpathy. Hopefully he'll never find out, and won't mind if he does... 6 The Lua code does four things differently to PyTorch's built-in class: Let's look at the code now: You can see that it's doing 1 to 3 of those steps above -- the one-hot, the extra dropout, and the linear layer to project back to vocab space. The only other oddity there is this kwarg: That's the wrinkle I was talking about when we went through the training loop and was discussing batches. 
The PyTorch LSTM by default expects the batch dimension to be the second one of the input tensors -- that is, instead of passing in a b × n tensor, it wants an n × b one. That's not what I'm used to (nor is it what the original Lua code uses, if I'm reading it correctly), but luckily it can be overridden by the logically-named option. The only step we don't do in this class is the softmaxing of the logits to convert them to probabilities. That's because PyTorch's built-in wants logits rather than probabilities, so it was easier to just call softmax on the outputs where necessary. So that's our model. Let's take a look at the code that we can use to run it and generate some text. The code for this is in . Ignoring the boilerplate that parses the command-line options, we can start here: So, we're taking the directory and run name used by the QoL helpers that I'll be describing later, a specific checkpoint of a training run to use, the number of bytes that we want to generate, the temperature to use when sampling (more about temperature here) and a "primer" text. That last one is because in order to get something out of our RNN, we need to feed something in. I tried using a single random byte from the vocab initially (that's still the default, as we'll see shortly), and that was OK, but the bytes aren't equally represented in the training data (e.g. "z" is less common than "e", but weird bytes that only occur in occasional multibyte Unicode characters are rarer still) -- and that means that we might be trying to get our RNN to start with something it hasn't seen very much, so we get bad results. Even worse, because some of the input text is Unicode, there's no guarantee that a random byte is even valid on its own -- it might be something that only makes sense after some leader bytes. So I found that in general it's best to provide a fixed string to start with -- say, "ACT" for Shakespeare, or "He said" for "War and Peace".
So, with those command-line flags, we start off by using the QoL stuff to get the metadata we need about the model: ...then we use our persistence code to load up the desired checkpoint: At this point we have the version of the model that was saved for that checkpoint, and its associated tokeniser. We move this to an appropriate device -- CUDA if we have it, CPU otherwise: ...and then use a helper function to generate some text: Once we have that, we print it out, after decoding it as UTF-8: If a primer was provided, we print it first, but if the primer was a random byte we don't. Also, because the generated bytes might include invalid Unicode, we just replace those with "?" when we decode (that kwarg). Let's look at the helper next. So, after a little bit of paranoia about our desired sequence length, we make sure we're not tracking gradients and put the model into eval mode (to disable dropout). Next, we work out our primer bytes -- either by picking a random one, or by encoding the string that we were provided into its constituent UTF-8 bytes: The primer needs to be converted to the byte token IDs that our tokeniser uses: The is something you might remember from the LLM posts -- we need to run a batch through our RNN, and the is just a tensor of n bytes. adds on an extra dimension so that it's 1 × n, as we want. Next, we put the primer onto the same device as the model: As an aside, I think I might start using code like that more often; I often find myself passing variables around and TBH it seems much more natural to just ask the model what device it's using. Next, we run it through the model: Now we use a helper function to sample from those logits to get our first generated byte: Note that we are explicitly taking the last item from . It is a b × n × v tensor, where b is our batch size (always one in this script), n is the length of the primer that we fed in, and v is our vocab size.
The just extracts the last item along the n dimension so that we have the b × v logits that came out of the RNN for the last character of the primer, which is what we want. We'll get to the function later, but it returns a b × 1 tensor, so now, we just extract the byte ID from it and put it into a new list: Next comes our autoregressive loop -- we've already generated one byte, so we loop times to get the rest, each time running the model on the last byte we got, sampling from the distribution implied by the logits, and adding it onto our list: Once that's done, we have our generated byte IDs in , so we just use the tokeniser to turn them back into bytes and return the result: Easy, right? Now let's look at . The function takes logits and the temperature: Firstly, we handle the case where temperature is zero. By convention this means greedy sampling -- we just always return the highest-probability next token, so we can use for that: If the temperature is non-zero, we divide the logits by it and run softmax over the result: ...and then we just sample from the probability distribution that we get from that: And that's it! The only things to explain now are the quality of life stuff, and the persistence functions that handle saving and loading checkpoints. Let's look at our QoL things first. When I started building this code I knew I wanted to run RNNs on multiple input texts -- Shakespeare, "War and Peace", etc. I also realised that for each of those input texts, I'd want to try different model sizes. The underlying concept I came up with was to have "experiments", which would each have a particular training text. Each experiment would have multiple "runs", which would have particular training hyperparameters -- the model size, number of epochs, and so on. I decided to represent that with a directory structure, which you can see here . 
One subdirectory per experiment, and if you go into the one you'll see that it has two subdirectories, for the training data and for the different training runs I tried. The directory contains a file called , which is the training data itself. That one only exists in the experiment, though, because I was concerned with copyright for the other training sets. There is a file in all data directories for all experiments, though, which explains how to get the data. The directory has more in it. Each run is for a particular set of hyperparameters, so let's look at the ones for the run. We have two files, , which looks like this: It's essentially the model-specific hyperparameters, the ones we pass in when creating our -- for example, remember this from the training code: is this JSON dict loaded into Python. There's also , which has the training data: Hopefully these are all familiar from the training code; they all go into , so they're used in code like this: So, now when we look at the start of the and scripts, and see things like this: ...it should be clear that we're loading up those JSON dicts from those files. You can see that code at the start of . It looks like this: So, some basic sanity checking that we have the directories we expect. Next: ...we create a checkpoints directory if it doesn't exist, stashing away its path, then finally we load up those two JSON files: The rest of that file handles checkpointing, so let's move on to that. Remember, in the training loop, each epoch we saved a checkpoint: ..and at the start of the code to generate some text, we load one: Let's take a look at saving first. Each checkpoint is a directory with a filename based on the timestamp when it was saved, inside the directory for the run that it relates to, so firstly we work out the full path for that: (The directories inside experiments are explicitly ignored in our file so that we don't accidentally commit them.) 
Now, we don't want half-saved checkpoints due to crashes or anything like that, so we initially create a directory to write to using the path that we're going to use but with at the end: Next, we write a file (the path within the checkpoint's dir is worked out by a helper function) containing some useful information about the model's progress -- its epoch number, the training and validation loss, and the mapping that its tokeniser uses (from which we can later construct a new tokeniser): Then we dump the model's current parameters into a file using a function from the Hugging Face library (getting the file's path through another helper function): Now that our checkpoint is complete, we can rename our temporary directory to the real name for the checkpoint: Next, we do some symlinks. We want a symlink in the directory called , which links to the checkpoint that had the lowest validation loss. The training loop is tracking whether any given epoch had the lowest, and you can see it passed in an parameter, so if that's true, we create the symlink, removing any pre-existing one: For completeness, we also create one that points to the most recent checkpoint -- that will always be the one we're doing right now, so: And that's it for saving! Loading is even simpler (and note that we can just specify "best" as the checkpoint due to that symlink -- I pretty much always do): So, we've made sure that the checkpoint directory is indeed a directory. Next, we load up the model metadata: ...and we use the same library to load our parameters: Now we can construct a tokeniser based on that mapping that we put into the metadata: ...and an based on the other metadata parameters: ...and load the parameters into the model: That's it! We can return the model and the tokeniser for use: So that's all the code needed for checkpointing. Now let's look at the final QoL trick, one that I left out of the earlier list because it needs the checkpoints to work: charting our progress.
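Before moving on to the charts, the write-to-a-temporary-directory, rename, then symlink pattern from the saving code can be sketched with just the standard library. This is not the repo's code: the function name `save_checkpoint_sketch`, the `.tmp` suffix, the `meta.json` filename and the `latest` link name are assumptions for illustration (only the `best` name is mentioned in the text), and the parameter file is left out entirely.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint_sketch(checkpoints_dir, name, meta, is_best):
    """Write a checkpoint directory atomically, then update symlinks to it."""
    checkpoints_dir = Path(checkpoints_dir)
    final_dir = checkpoints_dir / name
    tmp_dir = checkpoints_dir / f"{name}.tmp"  # assumed suffix

    # Write everything into the temporary directory first...
    tmp_dir.mkdir(parents=True)
    (tmp_dir / "meta.json").write_text(json.dumps(meta))  # assumed filename
    # (a real checkpoint would also write the model parameters here)

    # ...then rename, so a crash never leaves a half-written checkpoint
    # under the final name.
    tmp_dir.rename(final_dir)

    # "latest" always moves; "best" only moves when this epoch was the best.
    link_names = ["latest"] + (["best"] if is_best else [])
    for link_name in link_names:
        link = checkpoints_dir / link_name
        if link.is_symlink():
            link.unlink()
        link.symlink_to(final_dir.name, target_is_directory=True)
    return final_dir


with tempfile.TemporaryDirectory() as root:
    save_checkpoint_sketch(root, "ckpt-1", {"epoch": 0, "val_loss": 1.0}, is_best=True)
    save_checkpoint_sketch(root, "ckpt-2", {"epoch": 1, "val_loss": 1.2}, is_best=False)
    print(json.loads((Path(root) / "best" / "meta.json").read_text()))
    print(json.loads((Path(root) / "latest" / "meta.json").read_text()))
```

Using relative symlink targets means the run directory can be moved or copied elsewhere without the links breaking.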
Remember this line from the training loop, which was called after we saved our checkpoint? It generates charts like this: The chart is updated every epoch, and saved into the root of the directory. There's also a helpful file placed there that reloads that generated chart every second, so you can just load it into a browser tab while you are training and watch it live. Let's look into the code. It's in . The function starts like this: So, we use a utility function (which we'll get into in a moment) to load up the data -- training and validation loss per epoch, and the specific epoch that was the best. Once we have that, we just use (with my preferred xkcd styling) to plot the two loss lines: We also plot a single vertical red line at the best epoch so that we can see if we're past that and running into the patience period: Then a bit more pyplot boilerplate... ...and we've got our chart, saved as . Finally, we just copy that useful auto-reloading into the same directory as the chart: ...and we're done. So, how do we get the data? Originally I was keeping lists of loss values over time, but eventually realised that the data was already there in the checkpoint metadata files. So, the helper function just iterates over the checkpoints, skipping the symlinks, creating lists of (epoch number, loss) tuples for both training and validation loss using the numbers in those metadata files, and for the symlink just storing its epoch number: Those loss lists will just be in whatever random order returned them in, so we sort them by epoch number: ...and we have something we can return to the charting code: That brings us to the end of the charting code -- and, indeed, to the end of all of the code in this repo! So let's wrap up. That was quite a long writeup, but I think it was worthwhile. 
Indeed, if you look at the commit history, you'll see that there were one or two things where while explaining the code I realised that it was doing things badly -- not so badly that it didn't work, or gave bad results, but doing things in a way that offended my sense of what's right as an engineer. Hopefully it was interesting, and has set things up well for the next step, where I'll use the same framework, but plug in my own RNN implementation so that we can see how it compares. Stay tuned :-) Intuitively: if you train on "I like bacon", then "I like cheese", then "I like wine", then you can imagine that they might have different effects -- maybe the first would have the largest impact, then the second, then the third -- or perhaps it might be the other way around. By comparison, if you trained on all three in parallel, you would expect them to be more evenly balanced in their effect.  ↩ I'm accumulating a never-ending list of things to dig into in the future, but let me add yet another one: it would be good to work through how PyTorch uses this compute graph in practice to do all of its automated differentiation magic! Andrej Karpathy will likely pop up again, as he did pretty much that in his micrograd project .  ↩ In case you're wondering: I tend to use UK spelling like "tokeniser" in writing, as it's much more natural to me. But in code I tend to standardise (or standardize) on the US spelling. For private projects like this, it doesn't matter much, but when collaborating with other people from various places in the world, it's helpful to use a standardised spelling just to make life easier when searching code.  ↩ Sharp-eyed readers might note that my token IDs start at zero, while Karpathy's start at 1. Zero-based indexing is the natural way to represent them in Python, one-based in Lua. Keeping things natural like that makes it a bit easier when we convert things into one-hot vectors later.  ↩ Remember that gradients are vectors in a high-dimensional space. 
So to work out a measurement of size, for each parameter we square all of the numbers in its gradient, then add them together. We then add all of those squared numbers across all parameters together, and take the square root of the sum.  ↩ Thanks to Claude for generating that monstrosity of a Java class name. It added: "For bonus points, imagine this is in a package like: And it probably has exactly one method: :-)"  ↩ Vanishing/exploding gradients. Let's say that we're training a three-layer network on the 5,617,124 characters of the Project Gutenberg "Complete Works of Shakespeare" . That's essentially backpropagation through a 16-million layer network. You won't get far through that before your gradients vanish to zero or explode to infinity. The only meaningful parameter updates will be for the last something-or-other layers. Batching. Running multiple inputs through a model in parallel has two benefits: it's faster and more efficient, and it means that your gradient updates are informed by multiple inputs at the same time, which will make them more stable. 1 Validation . There's nothing in there as a validation set, so we will have no way of checking whether our model is really learning, or just memorising the training set. (There's the same problem with the test set, but for this writeup I'll ignore that, as the solution is the same too.) : the byte IDs of the bytes in that sequence -- these are the ones we'll run through the model, our Xs. Note that these are slices of the PyTorch tensors that were returned by the tokeniser, so they're tensors themselves. : the shifted-left-by-one-plus-an-extra-byte target sequence as byte IDs -- the Ys for those Xs. These are likewise tensors. : the raw bytes for the . :the raw bytes for the . It accepts the inputs as "token IDs", and maps them to a one-hot vector itself. It applies dropout after the last layer of the LSTM (rather than just internally between the layers). 
It expands the output vector back out to the vocab size with a linear layer after the LSTM so that we have logits across our vocab space. This is because an LSTM's output has the same dimensionality as the hidden state. It runs those logits through softmax so that it returns probabilities.

Giles's blog 4 months ago

Writing an LLM from scratch, part 23 -- fine-tuning for classification

In chapter 5 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", we finally trained our LLM (having learned essential aspects like cross entropy loss and perplexity along the way). This is amazing -- we've gone from essentially zero to a full pretrained model. But pretrained models aren't all that useful in and of themselves -- we normally do further training to specialise them on a particular task, like being a chatbot. Chapter 6 explains a -- to me -- slightly surprising thing that we can do with this kind of fine-tuning. We take our LLM and convert it into a classifier that assesses whether or not a given piece of text is spam. That's simple enough that I can cover everything in one post -- so here it is :-) The core idea in the chapter is the trick that we use to change an LLM that is designed to predict the next token into one that classifies texts into different categories. What we do is remove its final layer -- the one that maps from 768-dimensional embedding space to vocab space so that we get our logits to turn into a probability distribution -- and replace it with a much simpler one, which maps from the 768-dimensional embedding space to a 2-dimensional one, where logits -- after softmax -- represent the respective probabilities of spam vs ham (non-spam). As Raschka says, it's as if we're constraining our model to having an output vocabulary of two tokens. We're replacing our output head with a classification head instead. Because it involves removing the output head of the existing model, I'm calling it the decapitation technique. (ChatGPT tells me that it has a more prosaic name -- it's a linear probe .) 
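As a rough illustration of the head swap, here is a sketch using a tiny stand-in model instead of the book's GPT code. The class `TinyLM`, its attribute names, and the vocabulary size are made up for the example; only the 768-dimensional embedding size and the two output classes come from the description above. The last-position slice at the end reflects how the chapter reads off the prediction.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for an LLM: embeddings, a 'body', and an output head."""

    def __init__(self, vocab_size=100, emb_dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.body = nn.Linear(emb_dim, emb_dim)  # stands in for the transformer blocks
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    def forward(self, token_ids):
        return self.out_head(self.body(self.tok_emb(token_ids)))


torch.manual_seed(0)
model = TinyLM()

# "Decapitate": replace the projection to vocab space with a 2-class head.
model.out_head = nn.Linear(768, 2)

token_ids = torch.randint(0, 100, (3, 10))  # batch of 3 sequences, 10 tokens each
logits = model(token_ids)                   # shape (3, 10, 2)

# Only the last token's logits are used for the prediction and the loss.
last_logits = logits[:, -1, :]              # shape (3, 2)
probs = torch.softmax(last_logits, dim=-1)  # index 0 = ham, index 1 = spam
targets = torch.tensor([0, 1, 0])           # made-up labels
loss = nn.functional.cross_entropy(last_logits, targets)

print(logits.shape, last_logits.shape)
print(probs.sum(dim=-1))
```

Because the new head is a freshly initialised layer, its outputs are meaningless until it has been trained.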
This was something I recognised from Jeremy Howard's fast.ai course -- it looks like it might no longer be part of it (the trick in question has been absorbed into the library that the course uses), but I do remember doing an image classifier by removing the output head from an existing one and then training a new, simpler one to replace it. With an LLM, there's an extra tweak; we only consider the logits provided for the last token. That makes sense; with the original head, the last token's embedding, projected into vocab space, is the predicted next token for the sequence as a whole, and it is the only one with information from all of the other tokens blended into it by the attention layers. Being the richest representation of the sequence, it's obviously the best one to check when we're trying to classify that sequence. But that leads to another interesting thing about the training -- when we're calculating our loss, we only consider the cross-entropy loss between the logits vector for the last token and the target category. That makes sense simply because we don't have spam vs ham predictions for the shorter prefix sequences -- but also, we honestly don't care what its predictions for them might be. Which leads to the interesting possibility that they might wind up not being spam vs ham predictions at all -- the model has more "freedom" in how it uses them, so they could be anything. There is a lot of data-wrangling going on in this chapter -- all of which is relatively simple, but one thing that stood out for me was that we make sure that we have the same amount of spam and ham in our training, validation and test sets. Intuitively this makes sense. If we had a training set that was (say) 95% ham and only 5% spam, then a model that was essentially this: ...would be right 95% of the time, which is around the accuracy of the trained model that we wind up with. But it would also be pretty useless. 
Making sure that we have similar amounts of both is a good way to avoid the model getting trained in a dumb direction like that. I also spotted something a bit odd in the code that loads the different datasets in. Here it is: Look at those parameters -- they're for the training set, but for the others. From the docs for PyTorch's drop_last (bool, optional ) -- set to to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If and the size of dataset is not divisible by the batch size, then the last batch will be smaller. So: if the number of training samples isn't divisible by the batch size, and we had set to for training data, then our last batch would have fewer items in it. I think it makes intuitive sense that you'd want all of the batches going through during your training to be the same size. Let's imagine that we have ten items in most of our batches; loosely speaking, that means that on each gradient update, each one of those items will contribute 10% of the update. But if the last batch has only four items, then those four will each contribute 25%. So the items in the last, smaller batch will have an outsized impact on the model's training, which would obviously be a bad thing. For our validation and test sets, it's pretty clear that it doesn't matter -- all we want to do for them is work out how good the model is at classifying them, and they don't have a direct impact on the model's parameter updates. Before we start training, we see whether our model -- with its original projection-to-vocab head in place -- can classify text as spam or not. We feed it this: ...and it responds with this: So, no luck -- not a big surprise. Now, quite some time ago when I started playing with LLMs, I started a series trying to build an AI chatbot using the OpenAI APIs, which -- at the time -- were just text-completion based, rather than the chat-template ones that they are now. 
I found that even with some of the older models, you could get it to work like a chatbot by telling it what this text was meant to look like. So I tried that: Unfortunately it didn't help much -- here's what it came up with: Ah well, worth a try! I might give it another go once we've instruction-trained it. So, we know we need to train our LLM -- the next bit of the book explains how to do it. Just like last time , even though Raschka uses in lots of places to make the results reproducible, I got different numbers to his. Interestingly, in this chapter I got pretty much the same numbers for the first few steps, but after the accuracy functions were introduced, my results started differing. Again, I don't think this matters much so long as the numbers are similar. (With the caveat of some slightly disappointing results I got at the end.) I was interested in the trick of freezing the gradients on almost all of the LLM's layers -- only allowing the last one, and the final layer-norm to be trained, plus the new output head. It kind of reminded me of LoRA, which reduces the amount of work you need to do to fine-tune a model by limiting the number of weights you change -- though the way that works is quite different (you introduce new, smaller weight matrices that "adapt" the results of the existing frozen ones, and then train those). Raschka suggests an exercise where you try training all of the layers -- that is, with none of them frozen. I did that, and the results were really interesting -- more on that later... One thing that did surprise me a bit was that in this chapter we're training with dropout of zero. It seems strange, but I think it's because we're freezing those layers. The number of parameters that are actually being trained is small, so dropping them out might just throw away signal from the training, and make it converge more slowly. 
What's worse is that there's nothing stopping dropout from happening in those frozen layers, too -- we've set requires_grad to False on their parameters, so they're not being updated by the training loop, but the model is still in training mode, so dropout would happen in them if it wasn't set to zero. Anyway, the code to actually run the training is pretty simple and I won't repeat what is already explained perfectly well in the book. My train (with whatever differences in seeding it had from Raschka's) came out with broadly similar results: It took 15 seconds on my RTX 3090 -- within the "less than half a minute" range he mentions for the V100 and A100 datacenter cards. Plotting the loss: ...you can see that it looks very similar to what happens in the book. The accuracy was kind of interesting, though: You can see that mine actually dropped off on the validation set near the start -- however, it did recover nicely, so no big deal. It looks like the best results were somewhere around epoch 4, but there's not a huge drop-off from there to where we stop at the end of epoch 5. [1] The final results were a little disappointing, though. In the last pages of the chapter, we run two sentences through our trained classifier. Firstly, something that is very spammy: And secondly something that isn't spammy: Unfortunately, due to the differences in seeding between my model and the one Raschka was working with, I got the same result for both: each was classified as not spam. After checking the code carefully, I decided to take a look at what the actual predictions looked like; in the function that we write to do these tests, I added this: For the ham case, this printed: The first element (index 0) is the probability that it is predicting for ham, and the second is the probability for spam. So it was 97% sure that the ham message was indeed ham. That's good! For the spam message, though, it was pretty much on the fence: It thought that there was a 59.87% chance that it was ham, but a 40.13% chance that it was spam.
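To make the "index 0 is ham, index 1 is spam" reading concrete, here is a minimal sketch of turning a pair of logits into those probabilities with softmax. It is plain Python rather than the book's PyTorch code, and the logit values are hypothetical: I picked them only because they give splits roughly like the ones described above, not because they are what my model actually produced.

```python
import math

LABELS = ["ham", "spam"]  # index 0 is ham, index 1 is spam, as in the text


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the winning label plus the full probability list."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Hypothetical logits, not values taken from the real model.
confident_label, confident_probs = classify([3.5, 0.0])
close_label, close_probs = classify([0.4, 0.0])

print(confident_label, confident_probs)
print(close_label, close_probs)
```

Both inputs come out as ham, because the label is just the larger of the two probabilities. The second case only barely leans that way, which is the sort of close call that the bare label hides.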
It would be interesting to know what Raschka's own train came up with, but I can imagine that it might be a similarly close-run thing, and I was just unlucky with the way the randomness in training fell out. But I decided to try training all of the layers to see if I got any different results. This took 42 seconds rather than 15 -- hardly a big deal at this scale, but you can see how not having to train everything and getting a 3x speedup would be worthwhile for larger models. The results were definitely better, though: Maybe a little bit of overfitting going on? But the important thing is that the validation and test samples' accuracy are higher too, we're not just memorising the training set. The loss graph also looks solid: And accuracy is really interesting: You can see that validation accuracy got up to 100% sometime during the first epoch, lining up with the point where the loss plateaus! Perhaps a snapshot taken then would have been the best form of the model. But anyway, I tried running the two samples from the last pages of the chapter through this new version of the model -- the results were exactly right! "You are a winner..." was classified as spam, and "Hey, just wanted to check..." as ham. Looking at the predictions that my modification to the classification code printed out, things were even clearer. The spam had this: A 97% chance that it was spam, which is excellent. Looking at the ham message, it said this: That's a 99.9% chance that it's ham! So that was a nice note to end on. Not only did I have results that showed that the classification worked -- the fact that it worked with just the 30 seconds of extra training that enabling gradients on all of the layers gave meant that it was unlikely that my original slightly-disappointing results were due to an error in the code -- it was, as I'd suspected, just a bit of bad luck with some randomness. So, that's it for chapter 6! Classification with a decapitated LLM done and dusted. 
Next time, it's on to instruction fine-tuning -- something that I spent quite a lot of time on last year, so let's see if all that work will pay off. Here's a link to the next post in this series. Update: I decided to see how easy it would be to do the same decapitation trick on a different LLM, one that I'd downloaded from Hugging Face. Turns out the answer is "pretty easy".

[1] I've been running a number of separate experiments -- will be writing them up shortly -- in which I store a checkpoint after each validation check, and keep a symlink updated to point to the one with the best validation loss. This feels like a good practice to me. ↩
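The checkpoint-plus-symlink habit from that footnote could look something like the sketch below. It is an illustration rather than my actual experiment code: the function names, the file naming scheme and the validation losses are all invented, and the "checkpoint" is a small text file standing in for a real model save.

```python
import os
import tempfile
from pathlib import Path


def update_best_link(checkpoint_dir, checkpoint_name, link_name="best"):
    """Point checkpoint_dir/link_name at checkpoint_name, replacing any old link."""
    checkpoint_dir = Path(checkpoint_dir)
    link = checkpoint_dir / link_name
    tmp_link = checkpoint_dir / (link_name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(checkpoint_name, tmp_link)  # relative target, so the directory can be moved
    os.replace(tmp_link, link)  # rename over the old link in one step


def record_validation(checkpoint_dir, step, val_loss, best_so_far):
    """Write a stand-in checkpoint and move the link if this is the best loss yet."""
    name = f"step-{step:06d}.ckpt"
    # Stand-in for a real save of the model's weights.
    (Path(checkpoint_dir) / name).write_text(f"val_loss={val_loss}\n")
    if val_loss < best_so_far:
        update_best_link(checkpoint_dir, name)
        return val_loss
    return best_so_far


workdir = tempfile.mkdtemp()
best = float("inf")
for step, loss in [(100, 0.9), (200, 0.6), (300, 0.7)]:  # made-up validation losses
    best = record_validation(workdir, step, loss, best)

print(os.readlink(os.path.join(workdir, "best")))
```

Every checkpoint stays on disk, and the link ends up pointing at the step with the lowest validation loss rather than the most recent one. Creating symlinks may need extra privileges on Windows.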

Giles's blog 4 months ago

Writing an LLM from scratch, part 22 -- finally training our LLM!

This post wraps up my notes on chapter 5 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Understanding cross entropy loss and perplexity were the hard bits for me in this chapter -- the remaining 28 pages were more a case of plugging bits together and running the code, to see what happens. The shortness of this post almost feels like a damp squib. After writing so much in the last 22 posts, there's really not all that much to say -- but that hides the fact that this part of the book is probably the most exciting to work through. All these pieces developed with such care, and with so much to learn, over the preceding 140 pages, with not all that much to show -- and suddenly, we have a codebase that we can let rip on a training set -- and our model starts talking to us! I trained my model on the sample dataset that we use in the book, the 20,000 characters of "The Verdict" by Edith Wharton, and then ran it to predict next tokens after "Every effort moves you". I got: Not bad for a model trained on such a small amount of data (in just over ten seconds). The next step was to download the weights for the original 124M-parameter version of GPT-2 from OpenAI, following the instructions in the book, and then to load them into my model. With those weights, against the same prompt, I got this: That's amazingly cool. Coherent enough that you could believe it's part of the instructions for a game. Now, I won't go through the remainder of the chapter in detail -- as I said, it's essentially just plugging together the various bits that we've gone through so far, even though the results are brilliant. In this post I'm just going to make a few brief notes on the things that I found interesting. One thing I really do recommend to anyone working through the book is that you type in all of the code, and run it yourself -- it really will help you remember how stuff fits together. 
There is one slight issue I found with that, however: the book has a number of examples where you get output from code that uses randomness -- for example, where you take a look at the loss it has on some sample text before you start training, or make it generate samples during the train. Now, in theory, because Raschka puts torch.manual_seed calls before all of these, the results you get should be exactly the same as the outputs in the book. However, the amount of code we're working with at this stage is quite large -- we have various helper functions that were created in earlier sections, for example. And some of these use randomness. That means that to get the same results as the ones in the book, you would need to ensure that all of the code that uses randomness was running in exactly the same order as it was when Raschka did it for the book. That turns out to be surprisingly hard! My instinct is that it doesn't actually matter all that much. So long as the loss numbers that you see are in the same ballpark as the ones in the book, and the outputs you see are roughly equally incoherent (before training) and become more coherent at what feels like the same kind of rate, you're fine. Probably the most important one to look out for is when the training run starts -- you should see loss on the training set decreasing steadily, just like in the book, and likewise as in the book, the validation loss should plateau out pretty early. When I have built simple backpropagation through neural networks in the past, I've generally updated parameters by multiplying the gradients by a small number, the learning rate, and then subtracting them from their respective parameters to get updated ones -- classic stochastic gradient descent. Non-trivial ML uses optimisers; I'd come across them while fine-tuning LLMs, and also used one in the RNN code I wrote last week. Instead of updating the parameters yourself, you ask the optimiser to do it for you, by calling its step function.
AdamW appears to be the default optimiser in most textbooks, though Muon seems to be the most popular in use, if my AI X/Twitter feed is to be believed. I don't understand how optimisers work in any detail, and I'm going to have to dig into that in the future. However, my high-level simplified picture right now is that they dynamically adjust the learning rate over time, so that it's easier to take big "jumps" downwards on the gradients when you start, and then smaller ones later. I believe they can also sometimes avoid local minima in the loss landscape -- a nice metaphor I read somewhere (lost the source, sadly) was that simple gradient descent was like rolling a ball down a hill, but (some?) optimisers give the ball a bit of momentum so that it can coast over a small uphill portion, so long as the general slope is downwards. Anyway, more investigation needed later. In practice, with AdamW, you initialise it before your training loop starts, with a learning rate (which I imagine is similar to the one my older code used, a scaling factor for gradients) and a weight decay (:shrug:). You also provide it with the parameters it's going to be managing. In the training loop, at the start of each input batch, you tell it to zero out the gradients it's managing with optimizer.zero_grad(), run the data through your model and calculate your loss, and then after calling loss.backward() to get your gradients, you just call optimizer.step(), and that does the parameter update. Again, I want to dig into how optimisers work in more detail in the future. But for now, I think that's all I need to know. The book tells you how to train on a public domain book, "The Verdict" by Edith Wharton. Full training on the hardware that people are likely to have to hand would be extremely expensive, so we just train on that short example, then later on learn how to download and use the weights that OpenAI made available for their GPT-2 models. But there was something that surprised me a little.
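As an aside on the optimiser discussion above, here is a toy sketch of the two update rules I have a mental picture of: the "multiply the gradient by the learning rate and subtract" rule, and the same thing with a bit of momentum added. To be clear, this is not AdamW and not the book's code; it is a one-parameter example on a made-up loss, f(x) = (x - 3)^2, with hyperparameters I chose arbitrarily.

```python
def grad(x):
    """Gradient of the toy loss f(x) = (x - 3)**2, whose minimum is at x = 3."""
    return 2 * (x - 3)


def plain_gradient_descent(x, lr=0.1, steps=500):
    """Classic update: scale the gradient by the learning rate and subtract it."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def gradient_descent_with_momentum(x, lr=0.1, momentum=0.9, steps=500):
    """Same idea, but each step keeps part of the previous step's direction."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad(x)  # the ball keeps some of its speed
        x -= lr * velocity
    return x


x_plain = plain_gradient_descent(0.0)
x_momentum = gradient_descent_with_momentum(0.0)
print(x_plain, x_momentum)
```

On a simple bowl like this one, both versions settle at the minimum; the momentum version overshoots and oscillates on the way there. The "coasting over a small uphill portion" behaviour would only show up on a bumpier loss than this.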
When talking about the training run on "The Verdict", Raschka says that it takes "about 5 minutes to complete on a MacBook Air". On my machine using CUDA on an RTX 3090, it took just less than eleven seconds. This makes perfect sense, of course -- there's a really good reason why AI training is normally done on GPUs or custom hardware, and the MacBook Air would presumably be training on the CPU. But I was a little surprised at how huge the difference was in this simple example! Now, while the book mentions that Llama 2 probably cost hundreds of thousands of dollars to train, I must admit that I do wonder how much it really would cost to train a 124M parameter model on my own hardware -- or, indeed, on the machines with 8x 80GiB A100 GPUs that I rented from Lambda Labs during my fine-tuning experiments. Andrej Karpathy was able to train a 124M GPT-2 model for $20 , using his hand-written C/CUDA LLM system . That is undoubtedly more efficient than the PyTorch code that we're working on in this book. But it really would be interesting to find out whether it would be doable for me at all! The training data he used is the 10B-token version of the FineWeb collection, which is freely available. 1 I think I have a good candidate for a next project when I've finished the book; see how many tokens/second I can train on locally -- that will allow me to estimate how long it would take to train one epoch over the whole training set. I imagine that will be longer than I'm willing to leave my desktop machine tied up doing this, but then I can try mixing in the lessons I learned doing fine-tuning, and see if I can get it up and running on Lambda Labs. If the cost is in the tens of dollars, or even a hundred or so, I really think it would be worthwhile! One thing I found a little confusing in this chapter -- and this is very much a nit -- was the section on preventing "memorisation"; I think this was due to a mismatch in the meaning I attach to the word, and the way it's used here. 
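The estimate I have in mind for that FineWeb idea is simple arithmetic, sketched below. The 10B token count is the one mentioned above; the throughput and the hourly price are pure placeholders, since measuring the real numbers is the whole point of the project, and the function names are my own.

```python
FINEWEB_TOKENS = 10_000_000_000  # the 10B-token FineWeb sample mentioned above


def training_time_hours(total_tokens, tokens_per_second):
    """Hours needed for one pass over the data at a steady throughput."""
    return total_tokens / tokens_per_second / 3600


def rental_cost(hours, dollars_per_hour):
    """Cost of renting a machine for that long."""
    return hours * dollars_per_hour


# Placeholder inputs only: neither number is a measurement or a real price.
placeholder_tokens_per_second = 50_000
placeholder_dollars_per_hour = 10.0

hours = training_time_hours(FINEWEB_TOKENS, placeholder_tokens_per_second)
cost = rental_cost(hours, placeholder_dollars_per_hour)
print(f"{hours:.1f} hours, ${cost:.2f}")
```

Swapping in a measured tokens/second figure and an actual hourly rate would show whether the cost lands in "tens of dollars" territory or not.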
To me, memorisation is something that the model does during training -- if you keep training a 124M-parameter model on a 20,000-character file, as we're doing here, then whatever happens the model is going to memorise it -- it's unavoidable. The only way to reduce memorisation in this sense would be to increase the amount of training data (and even then, as the findings in the lawsuit by the New York Times against OpenAI show, some stuff would be memorised). In the book, "memorisation" is being used to mean something more like what I'd call "parroting" -- issues with the model just repeating the stuff that it has memorised, because it was always choosing the most-probable next word. Avoiding this is super-important, of course! It's just the framing that confused me a little. The techniques are nifty, anyway. The first cut -- just use the softmaxed logits as a probability distribution and sample from it -- is obvious enough. Temperature is a clever trick on top of that -- just divide the logits by some number greater than one before softmax, and you can make the distribution that comes out flatter (or you can make it more "pointy" by dividing by a number less than 1). The graphs in the book showing how that works are great, but I asked Claude to knock together a temperature playground website, which I found made things even clearer to me. And finally, the top-k technique -- only consider the k most probable tokens, and then do the temperature/softmax calculations -- was a sensible addition to add on top of that. The code is clever: identify the top k logits, get the value of the lowest one of them, and then replace every logit less than that with minus infinity. When you run that through softmax, you get zeros for the ones that were replaced, and the probability distribution is based on the remainder. 
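Here is a minimal plain-Python sketch of those three sampling steps together: top-k filtering with minus infinity, temperature scaling, then softmax and sampling. The book does this with PyTorch tensors, so treat this as an illustration of the idea rather than its code; the function names and the logit values are made up.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities; exp(-inf) becomes exactly 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits, k):
    """Keep the k largest logits and replace everything smaller with -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]  # lowest of the top k
    return [x if x >= threshold else float("-inf") for x in logits]


def next_token_probs(logits, temperature=1.0, k=None):
    """Apply optional top-k, then divide by the temperature, then softmax."""
    if k is not None:
        logits = top_k_filter(logits, k)
    return softmax([x / temperature for x in logits])


def sample(probs, rng=random):
    """Draw one token index according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up scores for a five-token vocabulary
sharp = next_token_probs(logits, temperature=0.5)
flat = next_token_probs(logits, temperature=2.0)
top3 = next_token_probs(logits, temperature=1.0, k=3)
token = sample(top3, random.Random(123))

print(sharp, flat, top3, token)
```

Dividing by a temperature above 1 lowers the top probability and spreads it over the other tokens, while a temperature below 1 does the opposite. With k set to 3, the two lowest-scoring tokens get a probability of exactly zero, so they can never be sampled.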
So: excellent stuff, and very well explained in the book -- it just didn't feel like preventing "memorisation" specifically was what it was doing, at least based on what I take the word to mean. At the end of the chapter, we download the weights for the original GPT-2 model that OpenAI produced from their site, and load them into our own model. The code to download weights is (thankfully) something that you don't need to type in, as it's downloadable from GitHub. And in one specific related case, I'll also contradict what I said earlier about typing stuff in yourself -- I definitely recommend that you copy the function that copies the downloaded weights into our own model from GitHub too. I did actually type it all in and I don't think I gained anything from doing that. One thing I did notice while going through that section was that I'd been making a mistake as I wrote up this series; I'd thought that all GPT-2 models had 768 embedding dimensions. It turns out that this is only true of the 124M model in that series, and the larger ones have more. That makes a lot of sense -- and I've updated the older posts to reflect it. That's all I really have to add to what is in the rest of chapter 5. Like I said at the start, it feels almost like a let-down to be writing so little about a section of the book that has such amazing results! But now we have a working LLM, and at least the foundations that might allow us to train our own from scratch if we had the resources. Next up: using it to classify text. Will this be quick and easy? Or will it lead down another fascinating rabbit hole? Time will tell... Here's a link to the next post in this series.

His new nanochat -- a from-scratch trainable chatbot -- is even cooler. ↩
