Latest Posts (12 found)
Giles's blog 6 days ago

Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI: Here were the results I got, sorted by the loss: Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern. I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now. In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change. Let's spec out the problem first. The instruction fine-tuning test trains our model on the Alpaca dataset in order to let it know how to follow instructions; that comprises a series of sequences like this: More details in this post . In the version I've settled on , I fine-tune on a training set of 85% of the samples, epoch by epoch, bailing out when the loss on a separate validation set of 5% of the samples starts rising. I then use the weights from the previous epoch -- that is, before validation loss started rising -- to generate responses to the remaining 10% of the samples. Once that's done, the script hits the OpenAI API, using GPT-5.1, default parameters for all of the options (eg. no explicit temperature) with queries like this: We do that for every model-generated response in the test set, then take the average of the scores and use that as our result. To see why that's problematic, imagine this simple instruction with no separate input: One response I've seen from my models was this: That's obvious garbage, and should get a zero -- and GPT-5.1 consistently does that. Another response, from OpenAI's original weights for their "medium" model (larger than the ones I've been training), is this: That's correct, so it deserves 100, or perhaps 95 due to being unnecessarily wordy (the answer "Jane Austen" is the suggested response in the dataset). But now how about this one: One of my models came up with that gem during an earlier eval. It's completely wrong, so it deserves a 0, right? And normally the GPT-5.1 model does that -- but sometimes it's a little more generous, and gives it a low, but non-zero score. When asked for its reason for that, it makes the logical point that while it's the wrong answer, at least Sarah Palin is a real person. It's better than the "the book wrote itself" complete nonsense of the first response. The problem is that the different runs against the different models are not consistent, as they're all talking to GPT-5.1 separately. One model might find it in a harsh "mood", and get a lower rating than another model that found it at a more generous moment. 
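To make the scoring setup concrete, here's a minimal sketch of the kind of per-response judging call described above. It assumes the OpenAI Python SDK; the prompt wording and the `score_response` helper are my own illustration rather than the actual script -- the only details taken from the post are that it uses GPT-5.1 with default parameters and asks for a 0-100 score per response.

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def score_response(instruction: str, reference: str, model_output: str) -> int:
    """Ask the judge model for a 0-100 score for one model-generated response."""
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Reference answer:\n{reference}\n\n"
        "Score the following response on a scale of 0 to 100, where 100 is best. "
        "Reply with the integer only.\n\n"
        f"Response:\n{model_output}"
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",  # judge model named in the post; no temperature etc. set
        messages=[{"role": "user", "content": prompt}],
    )
    return int(completion.choices[0].message.content.strip())
```

Because each model's test set is judged in a completely separate series of calls like this, there's nothing tying the scores together across models -- which is exactly the inconsistency problem.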
I came to the conclusion that the best way to fix this is to do a "batch" -- that is, fine-tune each model on the Alpaca dataset that Raschka provides, and generate responses for the test set and store them in a file. Then, once we've done that for all models, we can score them all at once, prompting GPT-5.1 with something like this: The theory is that doing it that way will mean that each individual query/response pair is graded consistently between models, even if there might still be inconsistencies between query/response pairs. That hopefully means we'll get more consistent results and can compare the models better. Here's the code: Running the first against each of our models, and then the second against all of the output files, gives us this updated table (with links to the annotated JSON files in case anyone else wants to take a look): (Still sorted by loss so that you can compare it more easily with the one above.) That's really interesting! The IFT score is still not correlated with the loss. But there does appear to be a pattern. It looks like we have three groups of models: I tried running the LLM-as-a-judge scoring script a few times, just to make sure this wasn't some kind of random weirdness, but the pattern was always the same: the OpenAI weights, the cloud FineWeb 8x A100 40 GiB, and the two local Local FineWeb-Edu models always got the best IFT scores, though sometimes they swapped positions (apart from the OpenAI medium model, which was of course always at the top). The other cloud FineWeb models and the local FineWeb one were consistently scored much lower. A hypothesis: there are two things that contribute to how good a model is at these IFT tests: Or to put it another way -- some of these models are smart but not knowledgeable, while others are knowledgeable but not smart, and some are neither. I think that could explain what we're seeing here. While OpenAI never published their "WebText" dataset for GPT-2, the paper describes it as a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. Now, the FineWeb dataset is quite similar, though I think it's a tad more curated than that. But OpenAI trained their models for quite some time and did lots of tricks to get the loss as low as possible. By contrast, the FineWeb-Edu dataset is a carefully selected subset of FineWeb, with only the most "educational" data. Models trained on it, you might think, would know more facts for a given amount of training. So we can imagine the OpenAI models are smart but not knowledgeable, as we can our cloud FineWeb 8x A100 40 GiB model, which (I believe due to an accidentally-near-optimal batch size) worked out well in terms of loss. They were trained on relatively sloppy datasets but turned out reasonably well. Their intelligence makes up for some of their lack of knowledge. Our other cloud trains and the local FineWeb one are dumb and not knowledgeable; they were trained on the low-information FineWeb dataset, but they didn't wind up with a particularly amazing loss. So they get low scores. And finally, our local FineWeb-Edu models are still dumb, but they make up for it by knowing more because their training data was better. 
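Mechanically, one batch-scoring query per test item might look something like the sketch below -- the prompt wording, the JSON-reply convention and the `judge_one_item` name are mine, not the post's code; the real script also stores GPT-5.1's per-response comments in the annotated JSON files.

```python
import json
import random
from openai import OpenAI

client = OpenAI()

def judge_one_item(instruction: str, reference: str, responses: dict[str, str]) -> dict[str, int]:
    """Score every model's response to the same instruction in a single query.

    `responses` maps model name -> generated response; the order is shuffled per
    query so the judge can't systematically favour the first or last entry.
    """
    names = list(responses)
    random.shuffle(names)
    listing = "\n\n".join(
        f'Response {i + 1} (model "{name}"):\n{responses[name]}'
        for i, name in enumerate(names)
    )
    prompt = (
        f"Instruction:\n{instruction}\n\nReference answer:\n{reference}\n\n"
        "Score each response below from 0 to 100, grading them consistently "
        "relative to each other. Reply with a JSON object mapping model name "
        f"to integer score.\n\n{listing}"
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(completion.choices[0].message.content)
```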
Well, that hypothesis sounds plausible ;-) I'd like to spend some time digging in to see whether there's any indication that it's actually true. But after an afternoon of poking around the results, I can't really get a handle on whether it is, or indeed how you'd test that hypothesis in any real depth. TBH, I think this has zoomed so far past my "no side quests" limit that it's not even visible in the rear-view mirror, so it's probably best to shelve it as a "cool idea, bro" for now. Learning how to run sensible evals, and how to work out what they're saying, will have to be a task for another day. I will keep on doing these IFT tests for future models, though, just out of interest. So: let's get back to our regularly scheduled LLM training. Next up: how do we upload our models to Hugging Face quickly and easily so that other people can play with them?

The two ways I was comparing the models:

- A simple cross entropy loss over a fixed test set.
- The results for an instruction fine-tune test that's covered in the book.

The code for the new batch-scoring approach:

- A script to fine-tune a model, generate test responses, and dump them into a JSON file.
- The LLM-as-a-judge code to send a bunch of models' responses to GPT-5.1. It scrambles the order of the models in each query, to try to avoid any preference the model might have for the first one vs the last one, and it stores GPT-5.1's per-response scores and comments in a new "annotated" JSON file.

The three groups of models:

- The OpenAI weights and the cloud train on the 8x A100 40 GiB machine using FineWeb, which have low loss and high IFT scores.
- The other cloud models and the local train that used FineWeb, which have medium loss and low IFT scores.
- The FineWeb-Edu local trains, which have high loss, but IFT scores that are almost as good as the first group's.

The two things that contribute to how good a model is at these IFT tests:

- The loss. Models that are better at predicting the next token are inherently better at instruction-following after the fine-tuning.
- The amount of information in the dataset. It doesn't matter how clever a model is: if it never saw "Jane Austen wrote 'Pride and Prejudice'" as part of its training, it will never be able to get a good score on that question.

Giles's blog 1 week ago

Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that: In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way? Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing. So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?" Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. They can train the model in less than four hours, they happen to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation. If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results . Back when I was messing around with fine-tuning LLMs using the Hugging Face ecosystem -- their "Transformers" library and so on -- one of the experiments I did was to fine-tune a 0.5B Qwen model on an 8x GPU machine . As part of that, I came across this excellent HF page summarising different kinds of multi-GPU training techniques . The three that are relevant are: Now, from what I understand, due to all of the copying around of models, plus the issues inherent with the GIL in Python, DDP is actually better than DP despite being more complicated -- and more flexible! Per Hugging Face: DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine. It might be a while before I want to try multi-machine training, but it would be awesome to have code that's ready to do that without needing any extra work. Now, how to implement it? Hugging Face have a library called Accelerate , which does everything for you: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! That does sound very useful, but I worry that by using it I won't learn as much. It also rather ties you in to the HF ecosystem. That's not necessarily a bad thing -- I enjoyed using their stuff in my fine-tuning project -- but I'm trying for a somewhat lower-level view in this series. So, let's use the PyTorch-native stuff. 
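For contrast, here's roughly what the Accelerate route would have looked like -- this is my sketch of the standard pattern from the Hugging Face docs, not code from the post, and the model/optimizer/dataloader arguments stand in for whatever you'd already built for single-GPU training:

```python
import torch
from accelerate import Accelerator

def train(model, optimizer, dataloader, num_epochs=1):
    """Single-GPU training loop made distributed-capable with Accelerate."""
    accelerator = Accelerator()
    # prepare() wraps everything for whatever setup `accelerate launch` configures
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for _ in range(num_epochs):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), targets.flatten()
            )
            accelerator.backward(loss)   # replaces loss.backward()
            optimizer.step()
```

You'd then launch it with `accelerate launch train.py`, and the same script would run on one GPU, several GPUs, or several machines.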
There's a "getting started" tutorial , so we can follow that. It has two options for running using DDP, one with a bit of extra setup code -- the first example, under "Basic Use Case" -- and one that uses to make things easier. The second sounds best. The code changes actually look really simple; given a normal single-GPU training script, you need to do some setup at the start: ...then wrap the model itself in a object, which is what you actually do the train on: ...and a bit of teardown at the end: The way to look at this is that will spin off one process per GPU, each running exactly the same code. They have a "rank", which is an integer saying which of the per-GPU processes they are -- 0 for GPU 0, 1 for GPU 1, and so on. There's a bit of a gotcha here, though -- you can see that we're looking at an environment variable called at the start, but we then get a (non-"local") variable from a bit later on. This is due to the multi-machine possibilities with DDP -- if you have multiple machines, then the local rank will be "which GPU on the machine does this process relate to", but there will also be a "global" rank, which is unique across all machines. This distinction won't matter that much during this one-machine test, but it's worth keeping in mind if we want to keep the code in a shape where it could potentially scale to multiple machines. Anyway, after the processes are spun up, they will do their training, and the synchronisation and passing around of gradients during the backward pass will all happen invisibly in the background, so when we do our , it will have the full set of gradients. Now that means that we'll presumably also need to use the rank -- that is, which of the n per-GPU processes the current code is running in -- when selecting which dataset items to train on. More about that later. Let's start writing some code! I'll use a new repo , into which I can put just the code needed for this train. I'll also structure it a little better than last time, with separate "runs", each of which has a model config and training parameters, and will later on have its own checkpoints. You can think of these as being one per machine size that I'm trying out -- I'll create a run directory for each one. Here's a first cut , simply loading up a model config from a run's directory, using it to create the model, and then doing the wrapping above -- no training at all. Running it with (and , as I'm using that for all new projects): Promising. Now, unfortunately we only have one GPU locally, and the code assumes that it's one process per GPU (I believe that's a hard limitation for PyTorch's DDP), so running with blows up. So we can't do an in-depth test locally. But at least we know that the basic infra is there and working. Now let's move the other training code from the single-GPU script into that file, pretty much blindly. This is the result -- it's doing almost nothing beyond what the last train did, apart from wrapping the model in a object -- the only other changes are to use this "runs" directory that we've introduced. As a quick hack, we should try running it. It does a validation and checkpoint before it starts, and we can make that happen quickly by hacking the validation loop to only do a couple of iterations: (Foreshadowing: that hack will come back to haunt us later!) 
Running that, then hitting control-C after the validation completes, and it looks OK: ...and we have what look like solid checkpoints: However, loading one of those checkpoints fails: It turns out that the problem is this code when we save it: The that we're saving is the wrapper around our model; my guess is that it does actually include all of the weights for the model, hence the correct-looking size for the checkpoint file, but they're renamed -- the wrapper sees the underlying model as something called , so (for example) would be called . Fixing that, with this diff: ...sorts it out -- we can load our checkpoints again. Here's the updated file . I think we're going to have to revisit checkpointing and validation again; we don't want to do it in all of our processes, probably only on global rank 0, and we'll need to somehow synchronise everything so that the other processes don't carry on training while we're doing it. But before we get on to that, there are a couple of other things to change. At the top of the file we're defining some constants that look wrong: We'll handle the dumbest of these first; it was actually silly that in the old code we had a constant for sequence length. We're using the context length of the model for that, so it's duplicated information. Let's get it from the : ...and here's the updated file . That was nice and simple. The code that we have specifies the batch size for each GPU -- that is, with , we'll have six sequences in each batch on each one. Like I mentioned earlier, that's called a "micro-batch" in distributed training like this 1 -- a per-GPU batch, as opposed to the overall global size across all GPUs -- so we could just rename it, and then we'd have 6 × n gpus as a global batch size. However, it feels to me like this is a useful metaparameter to be able to tweak from outside the code. I can see machines with per-GPU VRAM varying from 40 GiB to 160 GiB on Lambda Labs, and pretty clearly that will mean there will be a varying largest micro-batch size on each type. So this is something we'll want to configure on a per-run basis, so let's add a new file to our run config, load that up, and pass it through. That's a simple enough fix; no need to note the diff, but here's the code . This one we'll need to think about. The size of our validation set is based on what one process running on my local RTX 3090 can validate in five minutes, and the interval (for which I fairly arbitrarily put 2000 in the code when copying it across) was calibrated for roughly every half-hour. Those numbers in turn were aimed at the 44 hours of training time I expected locally. For this train, we'll (hopefully!) be taking significantly less time. We'll have eight GPUs, so naively that's 5.5 hours of train time, and each will have more VRAM, so we should be able to bump up the batch size and potentially get even faster than that. Depending on which kind of cards we're using, they may be faster, too -- I found that an A100 is slower (with the same batch size) than the RTX 3090 in my fine-tuning experiments, but the H100 and B200 are likely faster. I think this is another thing for the train config; we should have the validation interval (in terms of iterations) and the number of batches to do for validation. Here's the updated code . Now, let's move on to the dataset. 
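One aside before the dataset work: that checkpoint fix boils down to saving (and later loading) the DDP wrapper's inner `module` rather than the wrapper itself. Here's a minimal sketch of the idea -- the repo's real checkpointing code stores more state than this:

```python
import torch

def save_checkpoint(ddp_model, optimizer, path):
    # ddp_model.state_dict() would prefix every key with "module.", because DDP
    # exposes the wrapped network as ddp_model.module -- so save the inner model.
    torch.save(
        {
            "model": ddp_model.module.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path):
    # The saved keys now match a plain, unwrapped model, so this works both for
    # single-GPU inference and before re-wrapping the model in DDP to resume training.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
```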
With the code as it is right now, all of our per-GPU processes are using this code to iterate over the same dataset: That means that they'll all be training on the same data; the synchronisation that is happening "magically" in the background means that they'll all train on the first item, work out gradients, and step their optimiser -- so they'll essentially (modulo randomness) have the same updates. Pretty pointless! What we want is for each of the n per-GPU processes to train on 1 / n of the data. We have two useful helpers in : , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) So, the simplest thing to do is to use the world size as a step, and the rank as an offset: Here's the code with that . Now, remember that the same code is running for every one of our per-GPU processes. That means that all of them will do the training with forward and backward passes, and their own optimiser steps, all synchronised by PyTorch DDP magic. But they will also do their own validations -- which is kind of pointless -- and they'll also try to save their own checkpoints, which would be messy because they could quite easily interfere with each other; after all, all of the processes are running on the same machine and would be writing to the same filesystem. So, as a first cut, let's just wrap an around the eval and checkpointing stuff -- we change this: ...to this: That line is getting bit long, so let's break it apart a bit: That looks OK, but there's an extra wrinkle: all of the processes are running the same code, so while the rank zero one will do the eval, the others will continue through the script, so they will go right back around our loop and start training on the next batches -- which is bad. We want our processes to be proceeding in lockstep, iteration-by-iteration. Luckily, the solution is simple: the function in basically says "stop here until all of our processes have reached this point". So we can use two of those -- one before the eval loop, to make sure that all of the processes have finished their training part of the iteration before we do the eval on rank zero, and one after the eval, so that the non-rank-zero processes will wait. One bit of complexity -- we want to do those barriers only if it's a eval iteration, but we want to do them for all processes. So we have to break up the statement, and we wind up with this: That seems to work OK ( code here ), but it does give a warning: So, we want to pass the device ID in when we call . Let's dig into that a bit. Here's the copypasta that I took from the PyTorch tutorial earlier in this post: Let's dig into what that is doing. The environment variable is being set by to 0, 1, 2, etc as appropriate to tell us which process we are on this machine. So the first line is telling PyTorch to use the device with that index for this process . The next line is getting the current accelerator -- that is, an object that represents which acceleration hardware we're using in this process. I think that the best way to see the combination of these two lines is that the first says "use " (or 1, or 2, or...), and then the second says "get the object describing the GPU you're using right now". So it's a slightly indirect way of getting the object containing the details of the GPU in question. 
Next, we call . A backend in this context is an abstraction of whatever system the device in question is programmed using -- in the case of an Nvidia GPU, it would be some kind of thing that encapsulates CUDA. Once that's done, we call , passing in the backend that we're using. We're saying "initialise the internal data structures for so that they're all set up properly to work with the backend we specified". After that, we can do stuff like getting the global rank with and so on, because has been properly initialized. Presumably at this point we're talking to any other machines in a multi-machine cluster, so we can find out what our world size is and that kind of thing. That extra line at the end, to get the : ...actually looks erroneous to me. All of our code is assuming one process per GPU. So I think we can just use the there as well. Let's rewrite it like this (with some useful comments): That seems to work well! Here's the code . However, I ran it past ChatGPT (largely to validate my understanding of what was going on), and it highlighted something slightly misleading about it. Right now, we're training on a single node, with one process per GPU. But again, one of the neat-o things about this DDP stuff is that it should be able to scale to multiple nodes. Now, remember that is just the rank of the current process on the specific node that it's running on -- hence the name. If we had two machines, each with 8 GPUs, then there would be a process with rank zero on each of them. The "real" rank -- that is, across all machines -- is the one that you can get from once it has been initialised. One of the things it does during that initialisation is to talk to all of the other nodes and work that kind of thing out -- which of the local rank zero processes across all of the machines is the global rank zero process. So we need to use the local rank when working out which GPU we should be running on and so on, but we should not treat it as a global rank. That's actually quite fine in this case, as we're calling inside the training loop when we actually need to use the global one (when indexing into the dataset, or when deciding if we're the process that should be doing evals and checkpoints). The only place where we might be confusing matters is in that print, which is not important anyway, as the training loop also prints out its rank. So, let's tweak it a little more for clarity: That seems to work well! Here's the code . Time to run it past ChatGPT to see if I've made any dumb errors. Turns out that (unsurprisingly) I have... Let's go back to our code that decides whether or not it's an iteration where we need to do a validation run and a checkpoint: The problem is that our index is different in the different processes! Remember, we have this in order to pick out the correct training items: So let's think about it; in the first run through the loop, with 8 GPUs, we would have In the next run through the loop, we'd have: So will give different results for each process. That might not sound like the end of the world -- will only be zero for one of them, so long as is larger than the number of GPUs -- but remember that our validation code looks like this: Now, if different processes have different values for , then will only be called in the one(s) for which it is . But means "wait until all processes have reached this barrier". So the ones that call it will lock up completely until other processes get there, and everything will at best get out-of-sync, and at worst will lock up completely. 
I think that the problem here is that I'm conflating two things: the index of the global step -- that is, one iteration across all GPUs -- and the dataset element that we want to use. In the original one-GPU case that made, sense; iteration 0 was on dataset element 0, iteration 1 was on element 1, and so on. But now the offset into the dataset, and the global step, are quite different things. This is quite deeply embedded in the code, but we can fix it! Let's start off by changing our checkpoint code, just to rename things. It keeps track of a variable called , our offset into the training dataset, and uses that both to index into the dataset, and to work out how far through the train we are. The latter is a much better thing to store in a checkpoint, so instead of saving , we'll store (and restore) . Basically, just a rename so that the variables and stored JSON match the new reality. Here's the updated code . Now we need to make a number of minor changes to the training loop just to match that rename of the value that we're checkpointing (eg. for the code to generate the training chart) but the most important change is to our loop. Instead of iterating over our dataset with a step and and offset so that we can index into it, we firstly work out how many global steps there will be: ...then we iterate from our initial global step -- zero if we're starting a fresh train, or whatever global step we were on in a loaded checkpoint plus one if we're doing a continued train from a checkpoint -- up to the : That means that we need to use the global step, the world size, and our current rank to work out which dataset item we should be training on for this process at this global step. Let's say that we have eight processes; on the 0th global step, we should have rank 0 training on dataset item 0, rank 1 on item 1, and so on. On the next global step, rank 0 should train on item 8, rank 1 on 9, and so on. So: That's actually much more elegant than the earlier code, and seems to work fine. Here it is . Phew, glad to have caught that before I started spending money on machines -- it would have been confusing if everything locked up. Thanks, ChatGPT! Another thing that raised by ChatGPT is about the validation. We don't want to validate across all of the validation dataset -- we're using a number from the . I have this code: This looked like a nice, quick way to get the first elements of the validation dataset. But ChatGPT told me it would raise. It didn't, though -- why? The problem is that I had set to in my training config for testing. Stepping through what that slice does, when we run : Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack" Nasty! AI code review certainly helped me dodge a bullet on that one. Let's fix it, it's not a big change: we can just do this: ...and that works! So here's the code now . So, I think we have one final issue, which is the training and validation datasets. 
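To recap that fix in code before tackling the datasets: the loop iterates over global steps, which are identical on every rank, and derives each rank's dataset index from the step, the world size and the rank. A sketch of the shape (the helper names and placeholder callables are mine):

```python
import torch.distributed as dist

def dataset_index_for(global_step, rank, world_size):
    # Step 0 with 8 ranks: rank 0 -> item 0, rank 1 -> item 1, ..., rank 7 -> item 7.
    # Step 1: rank 0 -> item 8, rank 1 -> item 9, and so on.
    return global_step * world_size + rank

def train_loop(ddp_model, train_dataset, train_step, run_validation, save_checkpoint,
               start_step, eval_interval):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    num_global_steps = len(train_dataset) // world_size

    for global_step in range(start_step, num_global_steps):
        item = train_dataset[dataset_index_for(global_step, rank, world_size)]
        train_step(ddp_model, item)      # DDP syncs gradients during backward()

        # global_step is the same on every rank, so every process agrees on whether
        # this is an eval iteration, and the barriers stay in lockstep.
        if global_step % eval_interval == 0:
            dist.barrier()               # let every rank finish the current step
            if rank == 0:
                run_validation(ddp_model)
                save_checkpoint(ddp_model, global_step)
            dist.barrier()               # other ranks wait here while rank 0 evaluates
```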
In our single-GPU train, we worked out ahead of time how much of FineWeb (or FineWeb-Edu) to train on -- the Chinchilla-optimal number -- and generated a dataset that contained a round number of 6-sequence, 1024-token batches that was the smallest such round number that was larger than our target. We also worked out exactly how large (in terms of batches) our validation dataset needed to be so that each validation run would take five minutes. There was one big issue with that system; when I decided to do an "extended" train on more of the FineWeb-Edu dataset, in order to see whether I could get the loss down further, I had to do some nasty hackery in order to generate a new one. So it would be nice to not have that problem this time around. Additionally, we're likely to be tweaking the batch size quite a lot in this experiment while we find what the appropriate level is to fit onto the cloud GPUs, and also varying how much validation we do -- and additionally, we have the world size to worry about. I think that the best way to give us the flexibility we need will be to pre-convert the complete FineWeb and FineWeb-Edu datasets into the format we need -- each sequence in the dataset converted to GPT-2 tokens, and then those sequences concatenated together, with the token 50257 separating them. It would be good to properly nail down the validation dataset at the same time. So we can have a script that loads up the original dataset as downloaded from Hugging Face, splits it into 99% train, 1% validation, does the conversion, and then saves them as safetensors files. If we use for those (which is just large enough for our 50,257-token vocab), we can fit the ~10B tokens in each dataset's train split into 20 GiB of disk. Not too bad. But there will still be the issue of getting them onto our cloud machines. Let's generate the data, and then work out how to handle that. I tried initially with the code I used last time, adapted to run through the entire dataset . It does the 99%/1% train/validation split, and then for each of those generates a single massive tensor of tokens like this: It almost worked! To my surprise, it got all the way to the end, and only blew up with an out-of-memory error when it was trying to save the result -- and it did that completely silently, so I thought it had worked right up until I tried to check the file on disk to see how large it was, and it wasn't there. The obvious tweak: set the list to just after the , to free up the memory it's using. Given that it was the save that triggered the OOM, you'd think that that would be enough -- but it turned out not to be so. Rather than mess around with this for much longer, I just decided to add on 128 GiB of swap to my machine temporarily: ...and that was enough to make it run. So I've now generated pre-tokenised, pre-concatenated train and validation sets for both FineWeb and FineWeb-Edu: Now, thinking about how to get it up to the Lambda Labs machines. I have normal 1 Gb residential broadband, so conceivably I could upload 20 GiB in about 200 seconds. But that's assuming that there's no network congestion, so I would expect it to take longer. The LL machines are quite expensive, and I don't want to waste money keeping them up while I'm just uploading data. There are possibilities here: I think the best option is to use option (1), but with the option of also doing (2). The HF dataset will still take time to download to LL, even over the faster network connection. 
That might not be a problem -- but if it is, I download it once on a cheap instance and use a persistent disk too. Essentially I'd be using the persistent disk as a "cache", and still get the benefits of the easily-shareable datasets on Hugging Face. So, that decided, let's find out how we can upload a whacking great 20 GiB safetensors file as a dataset on Hugging Face. It turns out that resources like datasets on HF are just Git repositories using the LFS (Large File System) plugin to be able to handle, well, large files. Conveniently, given that I'm using to manage my project, there's a plugin that allows me to use their CLI tools with minimal effort, so: Both datasets show up on my profile page on Hugging Face, so that's looking good. Now it's time to try to upload the data. We'll need to install Git's LFS support first: Now let's try the FineWeb one first: OK, so we need some kind of extra thing to tell it we can use large files on top of the LFS stuff: Right, now let's try again: Weird that it prompted for the credentials twice, but it did appear to try to do something there -- but obviously it didn't work. Let's see if Git over SSH is any better. ...then the same stuff to copy in the files and create the metadata file, then: Looks like the same error. Odd. Let's try using HF's upload tools rather than Git -- feels like a bit of a cop-out, but maybe it'll work better. That did indeed take about 200 seconds to run, but the upload speed was only about 10 MiB/s -- from the output, I think it must have been compressing it. Anyway, it looks like it succeeded, so let's upload the others! ...and that's done :-) Next, a bit of manual editing of the dataset cards on the Hugging Face website, and we have our two new public datasets: That looks solid. So, the next thing: change our codebase so that we have some quick and easy way to download them (I'm feeling a little wary of using Git for that after the upload issue), and then to use the downloaded files in our training code. We already have the code to download a dataset; the stuff that I wrote to download FineWeb and FineWeb-Edu originally. Here's the important bit: ...so we can adapt that to download all files in an arbitrary dataset: ...and call that from our , using a new command-line argument , and a new element in our train config JSON file: I was thinking that we'd need extra guard code to not download the dataset again if it's already there, but it looks like handles that all nicely for us. So we have a way to specify which dataset we should use for a training run, and code to download it. Now we just need to adjust the code that loads our datasets so that instead of looking in the , it looks in the directory returned by : ...and update the directory so that if just blindly uses the directory provided rather than trying to look in a subdirectory: That all works! We successfully download the datasets and try to use them. Here's the code . But now we have a problem; when the tries to reshape the huge tensor that we have as our inputs: ...it craps out: That makes perfect sense. Our original files were carefully sized for a batch size of six, and 1024-token sequences. We need some way to work out an appropriate slice of both the training and the validation data. Most of the trains are likely to be Chinchilla-optimal, or at least use a Chinchilla-optimal number of tokens -- rounded up appropriately to match our micro-batch size, sequence length, and world size. But I'd like it to be more configurable. 
What I'll do is add a key to the training config dictionary, along with a so that we can (for example) train on the first Chinchilla-optimal tokens, then do an extended train continuing on from there. The idea is that we can use as a base, and train on the smallest number of full batches that contains at least that many tokens. For validation, I think that the key that we already have is actually quite nice. Validation is time-bound, and the number of batches is the easiest lever to pull to handle that. However, a would be nice for symmetry. So, here are some numbers for debugging: Now let's use them. Initially, we have this to load the train dataset: Let's work through that one first then make appropriate changes to the validation one. The pieces of information we need to work out which tokens to use are: Let's update our function so that it takes those parameters in that order: ...and now we can write an updated that uses those numbers to get the right number of tokens: Validation is less obvious; I think that the best way to do this (given that the validation dataset is small) is just to have a "magic" value for , which means "just get a round number of full batches starting at . It's also worth remembering that we only do evals on the rank 0 process, so we could in theory pass in a world size of 1 -- but I think that passing in the real world size might be a good idea, because it gives us one fewer thing to change if, in the future, we move towards distributed evals. ...and we change to be able to handle the magic : I also added in a quick sanity check to make sure that we don't get weird behaviour if the is past the end of the original dataset. That all looks good! Running it kicks off training, and validation is running happily every ten global steps, but just with three samples, as configured in the JSON file. Here's the code . One thing that hasn't shown up while running this code locally is that our training loop has this: With one GPU, that's fine, but on a multi-GPU machine, that is going to happen in all of our per-GPU processes -- so they'll all be spamming out progress bars, which will be ugly. So, as a first cut: Now, in order to compare different machines (say, an 8x H100 vs an 8x A100) it would be nice to get tokens-per-second numbers while training. We can do that in the progress bar too! It has a method that adds stuff to the end of the bar, just after the elapsed time and iterations/second numbers. For that, we'll need to have the object available in a variable: ...and now we can count the total tokens seen in the training run, plus keep track of the start time -- just before the start of the training loop: ...then inside, after the training step: That will give us a running average of tokens per second over the train as a whole since the start. Running that, we get a nice progress bar like this (you'll need to scroll to the right): Note that the tokens per second is worse than the just less than 20k that we got when running the single-GPU test previously, but that's due to the testing setup I have -- I'm doing an eval every 10 global steps. Changing that to 1,000,000 so that we just get a single eval when we start, then letting it run for a while to settle down from the initial eval, we get this: ...which is close enough to what we had before. Finally, let's print out some summary information at the end: Ran that on a super-short train with about 50 iterations-worth of tokens, and: Looking good. Here's the code . 
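The tokens-per-second bookkeeping is only a few lines; here's a minimal sketch of the tqdm side of it (the `do_train_step` and `is_rank_zero` parameters are placeholders of mine):

```python
import time
from tqdm import tqdm

def train_with_throughput(num_global_steps, micro_batch_size, seq_len, world_size,
                          do_train_step, is_rank_zero):
    # Only rank 0 draws a bar; the other ranks get a disabled, silent one.
    progress = tqdm(range(num_global_steps), disable=not is_rank_zero)
    tokens_seen = 0
    start = time.time()

    for global_step in progress:
        do_train_step(global_step)
        # every rank processes micro_batch_size sequences of seq_len tokens per step
        tokens_seen += micro_batch_size * seq_len * world_size
        progress.set_postfix(tok_per_s=f"{tokens_seen / (time.time() - start):,.0f}")
```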
I think we now have something where it's worth spinning up a Lambda Labs machine to run it. Let's kick off a training run on the cheapest two-GPU machine that they have available right now. That's actually not all that cheap, it's a $6.38/hour 2x H100 80 GiB SXM5. But I'm not planning to do a full train on it yet, this is just a sanity test. I won't attach a filesystem this time, either -- let's see how things go without the caching of the datasets that I was considering. First thing: do we have ? Nope. OK, let's install it: Right, now let's clone our repo and set up our environment: And now I think we can just try running it! It took 18 seconds to download the dataset! I don't think we need to worry about the caching thing with persistent disks, at least at this point. But there are a couple of issues here. I didn't put the number of processes in the command line -- I should be using Also, we don't have the XKCD font family. I'll ignore that for now. OK, that's looking good! Let's make our validations happen less often, and see how high we can get the micro-batches with the 80 GiB VRAM we have on each of our two GPUs. Doing a binary chop, I set the micro-batch size to 100 (OOM), then to 50 (OOM), then to 25 (worked), then to 37 (OOM), then 31 (OOM), then 28 (worked), and finally 29 (OOM). So we have a batch size of 28 for our 80 GiB machines. Leaving it for a little while to settle down, and we get to about 142,000 tokens/second. Now, on the 3090, we were training at 20,000 tokens/second. That means that this machine is running at about 7 times the speed. Given that our original train finished in 48 hours, we'd expect the train to finish in about 6, which indeed is the estimated time on the tqdm progress bar. At $6.38 per hour, that comes to $38.28. Not bad! And this instance is actually quite pricey on a per-GPU basis -- it's $3.19 per GPU/hour, whereas there is an 8x H100 that costs $2.99 per GPU/hour. I'm almost tempted to let it run. But the purpose of this run was to work out the bugs. We're going to want to track the training chart -- remember that after every validation run, our training code generates a chart showing the training and validation loss so far, like this one . I ran the normal quick-and-dirty Python webserver command on the instance, inside the directory containing the training chart: My browser didn't connect to it, but looking at the Lambda Labs interface, there's a new "Firewall" section, where you configure rules for allowing incoming connections to your instances. That's sensible, and the default rules are just "allow SSH from any IP" and "allow ping from any IP". Adding one letting anyone access port 8000 fixed the problem, and I saw a directory listing; clicking on the chart showed exactly what I'd expect, but without the XKCD fonts. Nice. Let's work out how to fix that XKCD font thing. Looking around, it seems like there are approximately twenty thousand ways to do it. Here's one that seems to work; firstly, install the font on the system: Now, that installs a font that has the family name 'xkcd Script` (with that erratic capitalisation). So we need to change the code to pick up pretty much anything that looks like it's XKCD, so instead of this: ...we can do this: That seems to work OK. So, now, I think we have the beginnings of a script to set up a Lambda Labs machine so that we can use it. Let's write a with this: ...and give it another go on a fresh machine. Shut this one down -- total cost so far $7.28. Now there are no 2-GPU instances available. 
There is a super-cheap 1x A10 (basically the datacenter version of a 3090), though, so let's use that -- we're as certain as we can be that the multi-GPU stuff works, and the proof of the pudding will be whether we can train a model that works. After spinning up our 1x A10 machine: Looking good! I think we have something that (in theory) should work. That cost $0.05. I think it's time to do our first train on a big instance. There are four 8x instances available on Lambda Labs for me right now: I think I'm going to want to train on all of those, to try to work out some kind of metric (dollars per megatoken?) to compare them. But let's start with something reasonably low-end -- in fact, let's try the cheapest, and see what happens. Spin one up, and first thing; after the setup, we need to work out the micro-batch size. Last time we used 28, but this machine has GPUs with half as much VRAM. I did a binary chop again... it turns out to be 13. Now let's think about validation frequency. Let's try to get a feel for how long it will take. We can set the eval batches to (say) 100, so that we can see how fast evals are, but also set the interval to 10,000,000 so that it never does one after the first. It took 11 seconds to run 100 validation batches, and after a few minutes, it settles down at 254,000 tokens/second or so, and is estimating 3h15m to completion. Nice! The cards are an earlier generation to the H100s we used in the two-GPU test, so they're slower, and they have half the VRAM. So eight of them are, working together, about twice as fast as two H100s. Doesn't sound completely crazy. So, in our local train, we spent 5 minutes evaluating every 30 minutes. So our eval time was 16% of our train time. Probably a bit high, but let's run with it. If we're going to take 3 hours training time, then 16% of that is about 28 minutes. Previously we did about 88 evals (44 hours train time, with an eval after each half hour). That seems a bit too high. So let's say that we want to do 50 evals. 28 minutes eval time in total, with 50 of them, means about 30 seconds per eval. If 100 eval batches take 11 seconds, let's approximate it to 300 eval batches. As to the interval between them -- if we want to do 50 over 3h15m, or 195 minutes, then that's one every (let's approximate) 4 minutes. We seem to have settled down to 2.57 iterations per second, so that's about every 617 iterations. Let's bake those in and let it rip. After the run: OK, let's download everything. Looking at the checkpoints, the latest (that is, the last one at the end of the training) and best (the checkpoint that had the lowest validation loss) are the same one, meaning that validation loss kept falling consistently: So let's just download using the "best" symlink to get the weights for that checkpoint: And now we can shut the cloud machine down. Now that the clock is no longer ticking and we aren't spending money on an unused machine, here's the training chart: It looks like we had a couple of gradient spikes there. I'm going to add some gradient clipping code at some point, but I think I'll hold off for a little bit -- I want to do a few cloud trains first to work out the best instance sizes to use, and only then start exploring the possibilities for making the models better. Apart from that, it looks pretty normal. Looking at the billing page on Lambda Labs, that machine was up for about 4 hours and 35 minutes, costing US$10.32 per hour, for a total cost of US$47.35. 
Of that 4h35m, 13,904 seconds, or 3h52 was the actual training run -- somewhat more than the 3h15m that was predicted at the start of the run. The validation will have accounted for most of that -- we did 50 evals, at 30 seconds each, so that's 25 minutes. That means that 3h40m is accounted for, and the remainder can just be chalked up to noise, I guess. That leads to one question: do we actually need to be doing validation for these trains? I've been doing validation loops in these trains largely out of habit -- when you're training an ML model, it's just "what you do". The reason you'd normally hold out a validation set is simple: if you're training over multiple epochs, then eventually your model is going to start overfitting to the training data 2 . You validate as you go along so that you can spot any points where, while the training loss continues to drop, the validation loss -- which is loss on data that the model hasn't been trained on -- starts rising. That's the classic indicator of overfitting. But for these models we're not doing multiple epochs -- we're just training through a stream of constantly new tokens. So, in fact, there's no real difference between the training data and the validation data, apart from the fact that the validation data is constant. From the model's perspective, it's all new stuff (modulo any repetitions in the dataset, which is possible but I think not likely to be super-common in something as curated as FineWeb). Now, in this post I'm aiming to identify the best options for training in the cloud -- cost in terms of dollars and time. I don't want to change the model itself or the training strategy because I want whatever I come up with to be roughly equivalent to the models I trained on my own machine. Exploring enhancements is for the next post. (Of course, given that the batch size is one of the levers I want to experiment with, and training on larger machines is already meaning that I'm doing micro-batches larger than the batch size of 6 that I used locally, and then the overall batches are 8 times larger, that's not quite true.) Validation, however, doesn't actually affect the training runs in any direct way. I could in theory remove it. However, that is a relatively large change to the code, as I've kind of linked it in with my checkpointing code. I think that what I'll do for now is leave it in. Validation will scale at the same rate as training (so long as I leave the eval batches constant) so it leaving it there will give me a clean comparison between machine types. And I can keep notes on how much time was spent on validation for each train so that I can subtract it from the total time if that proves useful. However, when I start tweaking the training code with changes beyond the batch size, I should probably try removing validation first. Anyway, while validation during the training run might not be important, evaluating the model at the end and seeing how it compares to others is! Let's do that next. There were two important post-train evals that I did on the models that I trained locally: There was also a simple smoke test -- how does the model predict that the phrase ...should continue? I should do the same three tests here. A simple autoregressive generation script is easy enough to knock together, and: All we're looking for here is basic coherency, and I think this is good enough to pass that filter. Next, the loss-style testing. What I think I want to be able to do here is just take a file and run an eval against a standard dataset. 
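(Backing up to the smoke test for a moment: a greedy-decoding generation script of the sort mentioned above really is only a few lines. Here's a minimal sketch of mine, assuming the model returns logits shaped (batch, seq_len, vocab_size) like the book's GPT implementation.)

```python
import tiktoken
import torch

@torch.no_grad()
def generate(model, prompt, max_new_tokens=50, context_length=1024, device="cuda"):
    """Greedy autoregressive continuation -- just enough for a basic coherency check."""
    enc = tiktoken.get_encoding("gpt2")
    tokens = torch.tensor(enc.encode(prompt), device=device).unsqueeze(0)
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(tokens[:, -context_length:])        # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return enc.decode(tokens[0].tolist())

# print(generate(model, "<your smoke-test phrase here>"))
```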
I did not generate my own test set, but I did generate a much-larger-than-necessary eval set, 1% of both FineWeb and FineWeb-Edu -- that's 100 million tokens or so in both cases. In the validation that I was doing during the train just now, I did 300 batches of 1,024 tokens with a micro-batch size of 13. That only ran on the rank 0 process, so that's Not even 4% of the validation data. Now, for the local eval, I think it makes sense to make it run for about five minutes -- that's just for my own convenience, I don't want to spend very long -- and I know from the previous local train that I can do 3,200 batches of six 1,024-token sequences in that time: So, somewhat arbitrarily, let's use the 19,660,800 tokens starting at position 50,000,000 in the FineWeb validation dataset for our tests -- they'll never be used for training or validation during the training loop. It's kind of a hack, but it'll do for now. Here's the code . It should be easy enough to understand; it did require one tweak to our existing function, though: Originally, that function worked out out the actual number of tokens to use by working out the size of each global batch, dividing our requested minimum number of tokens by that size and taking the floor, adding on one, then multiplying that by the global batch size. That works fine in cases where the is not a multiple of the global batch size -- it gives us a round number of batches that contains at least . But if is already a multiple of the global batch size, it gives us an extra batch at the end. So I added that as a special case in to avoid that. Anyway, running that gives us a loss: That's actually quite a lot lower than we were seeing with the locally-trained models on the test dataset I was using then -- but, of course, it's a different dataset so it's not strictly comparable. Let's run the same test against them: That's really interesting! Those numbers are really close to the numbers I got in the last post. That does make some kind of sense, though -- while the numbers aren't strictly comparable, as I said, both the dataset that I was using then and the one I'm using now are essentially random stuff from FineWeb, so I guess they must be more similar than I thought. But, importantly, the loss on the newly-trained model is much lower -- 3.674 rather than > 3.9 for all three of the older locally-trained models. Now, the only big difference between this training run and the ones that I did locally is the batch size. As I said in the last post, while I felt that the difference between my batch size of six and the (reported) batch size of 512 for the original GPT-2 was the least-likely cause of the differences in the results, Gemini told me that it thought it was the most likely cause. It looks like Gemini (and, I should note, on Hacker News ) might have been right! Batch size is super-important. Let's do the same eval with the OpenAI weights. I wrote a quick script (in my old 'LLM from scratch' repo, which has the code used in the book) to load up the GPT-2 weights and save them as a safetensors file . When I ran that, I got an interesting error: That was easy enough to fix; in the book's code we assign the weights that have been loaded from the OpenAI TensorFlow checkpoint files with a function called that looks like this: Just adding a call to to the last line fixed the error: ...and as a result, I had safetensors files for the original OpenAI models: So now we can run our test against them: Excellent. 
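As an aside, that rounding logic -- including the new special case -- is easier to see as code. A sketch with my own function name:

```python
def tokens_to_use(min_tokens, micro_batch_size, seq_len, world_size):
    """Smallest whole number of global batches containing at least min_tokens."""
    tokens_per_global_batch = micro_batch_size * seq_len * world_size
    num_batches, remainder = divmod(min_tokens, tokens_per_global_batch)
    if remainder > 0:
        num_batches += 1   # round up to the next full global batch...
    # ...but if min_tokens was already an exact multiple, don't add an extra one.
    return num_batches * tokens_per_global_batch

# With 8 GPUs, micro-batch 13 and 1,024-token sequences (106,496 tokens per global batch):
#   tokens_to_use(200_000, 13, 1024, 8) == 212_992   # rounded up to 2 full batches
#   tokens_to_use(212_992, 13, 1024, 8) == 212_992   # exact multiple -> no extra batch
```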
Let's start putting together a table of these results: That's pretty amazing. Having a batch size of 13 micro-batches over eight GPUs, or 104 in total, seems to have massively improved the model -- it's much closer to the original weights. It will be interesting to see whether I get further improvements when I move to the larger machines, which (due to having more VRAM) will have larger possible micro-batches, so we'll get larger global batch sizes. It certainly makes me think that I could have got much better results locally by using gradient accumulation, which would mimic the effects of a larger batch size by running multiple smaller batches through, without doing an optimiser step each time, then doing one big update once enough has gone through. But all of that is for another day. Let's try the instruction fine-tuning test now. I decided to pretty much re-use my adapted version of the code from the book; that meant that I was borrowing quite a lot of Raschka's code, which he has released under the Apache 2 license . I normally use the MIT license for my code, but I'm not married to it, so I relicensed the whole repo as Apache 2 with some specific headers to say which parts came from "Build a Large Language Model (from Scratch)", and added this code . It downloads the Alpaca dataset from the site for the book, splits it into train/validation/test splits, trains on the training set, evaluating each epoch and bailing out (and restoring the previous epoch's weights) when validation loss starts rising, and then runs through the test set generating responses, and then sends them all off to the OpenAI API for GPT-5.1 to judge them. Running it against our new model gets a score of 17.09. Let's try the various other models and build out our table: Interesting! In the last run, I found the instruction fine-tune numbers came out as FineWeb-Edu extended > FineWeb > FineWeb-Edu, but here we have FineWeb-Edu > FineWeb > FineWeb-Edu extended -- exactly the opposite! I do have to wonder, though, how precise a measure this is. While the training should be fairly consistent (though I don't have a random seed in there to enforce it), the fact that we're using an LLM as a judge means that there is an element of randomness coming in here. Indeed, I re-ran the FineWeb-Edu extended train test again, just to see what I got, and it came up with an even-worse 12.12. So I don't think we can read a huge amount into these numbers -- well, unless we can get the numbers significantly up. While it looks like a 2.5-point difference might just be randomness, I doubt that a 10-point difference could be. I think we've done the tests that we need for this model now, and we have a testing procedure in place. So let's train some further models on different instance sizes, and gather numbers. This is the biggest machine available on Lambda Labs right now, and is only sporadically available; one happens to be there now, so let's to give it a go. First, we need to create the runs/8xb200m160 directory, initially with a that is a clone of the one I did for the last train, , then spin up the machine. As before, we need to log in, clone the repo, then in it run the script, run , and try to run the script: It crapped out because there was no datasets directory, which is an annoyance. We should create it if it doesn't exist. Create the directory, and run it again. It took a while to download the dataset, because every per-GPU process downloads it separately. 
That only took a minute or two, but it was a waste of time; I think we should only download it from the rank 0 process with some barriers to make the other processes pause. Next, we need to do a binary chop on the micro-batch size, starting with a low of 13 (which I know will be fine because it worked on the 40 GiB GPUs that we used last time), and a high of 100 (fairly random, just something I'm pretty sure will fail). While doing that, a few things are standing out, both to do with validation. When the script starts, it does one training iteration, then goes straight into validation. Then it starts the training run proper. However: We're going to need to work out some kind of fix for that, because it's taken me 17 minutes from spinning up the machine to getting a size for our micro-batches -- which happens to be 64. On a machine that costs US$39.92/hour, that's an expensive test! We'll look into that later. Anyway, a batch size of 64 is pretty neat, as with 8 GPUs, that means we have a global batch size of 512 -- exactly the same as in the original GPT-2 paper! So, let's kick off the train. It takes about 7 minutes to get to the first checkpoint, at which point it's averaging 801,221 tokens/second. That pattern repeats, and with about one minute to do the validation, we're spending about 12.5% of the time on this machine validating. Hmm. A further indication that we might want to remove the validation stuff if it's not adding on any value. Eventually, it finishes: So, that's 1h9m50s. The final validation loss is not as good as the previous run on the 8x A100 40 GiB machine, where we got down to 3.675. Given that we're using the same validation dataset as the previous, that's meaningful: this is not as good a model, it seems. Again, latest and best checkpoints are the same one: So we can download everything: ...and here's the training chart: OK, so that's smoother than the last one -- no loss spikes. Maybe the larger batch size smoothed them? Let's think a bit about the cost of this train. From Lambda Labs, we had that machine running for a little over 1h30m. At US$39.92/hour, the total cost was US$60.25. Yikes. So, knocking off the 1h10 or so for the train, we have 20m to allow for -- which matches up quite well to the 17 minutes of fiddling with batch sizes, and then 3 minutes to download all of the files. If this blog post isn't going to cost significantly more than it needs to, we need to get that down. Of the US$60.25, just over US$13 was spent on identifying the batch size. Only US$46.57 was spent on the train itself. We also did 11 validation runs as part of that; at a minute each, those cost US$7.32. So, excluding validation, we're below US$40 for the train. Now, let's run our tests. First, the smoke test: we get this: "...on all other website for..." is a bit rubbish. Still, on to the loss: That's in line with the training loss -- worse than the loss I got with the one trained on the smaller machine, with its corresponding smaller batch size, but still better than any of our local trains. Still interesting, though -- larger batches are not guaranteed to get bigger results. More investigation needed there! On to the instruction fine-tuning test. That gives us a score of 13.89 -- the worst that we've seen yet! I think I'll put together a full table including these results later; I want to try training on some other, differently sized machines first, and we can aggregate the results at the end. 
But before we do that, let's make some changes to the scripts to fix some of those QoL issues we encountered in that last train. The first irritation was that it errored out saying that was not a directory when it didn't exist. The script takes a datasets directory as one of its command-line options, and it's reasonable that it checks that it really is a directory (rather than, say, a file or a symlink): ...but if it doesn't exist, it might as well create it first. Now, I could just put this before the check: ...but remember, this code is run by multiple processes -- so they could easily trip over a race condition here. What I want is to have just one of them do this; I've deemed the rank 0 process the "special" one for validation, printing the progress bar, and so on, so we may as well treat it that way here. But -- there's a difference! Rank zero is the one that should be printing stuff out, it's true. And right now, we only have one node participating in this train. But I do want to avoid simple errors that would make it hard to run multi-node in the future. Now, if we have multiple nodes, then each one will have its own filesytem (unless we're using NFS or something like that), so we'll need a separate "datasets" directory for all of them. What we want is to do these checks on one process on each node. Usefully, we have the variable that is defined earlier in , which is per-node. Again, let's imagine we have two nodes with two GPUs each. Node 0 might be runnning the processes with global rank 0 and 1, and node 1 might have global ranks 2 and 3. On node 0, the processes would have local ranks 0 and 1 respectively, but on node 1, they'd also be local ranks 0 and 1. So, the full code becomes this: Note the barrier; we don't want the other processes to check whether is a directory until the local rank 0 process has had a chance to create it. (Of course, if we were running this on a setup where all of the nodes shared a filesystem, it wouldn't work -- in that case we'd want to use the global rank that we can get from instead. But we can burn that bridge if we ever come to it ;-) Phew, that was a bit more work than I expected! But it sets us up nicely for the next QoL fix on my to-do list. I don't like the fact that every process downloaded the whole dataset. The actually handled it pretty gracefully -- none of the processes tripped over any of the others. Indeed, it looks like there was some kind of global queueing going on, so they downloaded it one after the other. But it did take time -- maybe a minute or two in total, and with the clock ticking on that ~US$40/hour machine, that felt a bit stress-inducing. So: I think it would be best to only do that from the rank 0 process as well. The code that downloads the dataset is just after the bit we've been looking at: ...and looks like this: Now, the docs for say that the parameter is: If provided, the downloaded files will be placed under this directory. ...and the return value is this: We happen to be passing in a object for , and we're not in mode -- it defaults to . So all we're doing by returning that wrapped in a object is a slightly indirect way of returning the path that we're passing in as . For tidiness, I really want to gate the call to in with the same rank stuff as we did for the directory creation. So, let's change the setup so that takes the path to the directory where we want this specific dataset to be, not the generic "all datasets" directory. 
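For what it's worth, the rank-gated directory creation described above boils down to something like this -- a sketch that assumes torch.distributed has already been initialised and that datasets_dir comes from the command-line options:

```python
import os
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set per-node by torchrun

# Only one process per node creates the directory...
if local_rank == 0:
    os.makedirs(datasets_dir, exist_ok=True)

# ...and everyone else waits here until that's done before checking it.
dist.barrier()
if not os.path.isdir(datasets_dir):
    raise RuntimeError(f"{datasets_dir} is not a directory")
```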
And given that we're now passing this specific path into the function, we don't need to return it: Now it's just a wrapper around a single call to , which I'm not entirely sure about (it's a code smell that I'm probably creating an unnecessary level of abstraction) but I think I'm happiest leaving it that way for now, as it does hide away a bit of messiness in the HF hub API. 3 That means that we can now combine the directory-checking logic that we fixed above with download-on-local-rank-zero-only code like this: Here's the updated code with those fixes. Now, let's move on to validation. I'm increasingly of the opinion that the validation steps are just adding on to the cost without much in the way of benefit. Additionally, the validation is taking a different amount of time for each batch size, and happen a different number of times in each train -- remember, it's batches every global steps, and the batch size varies based on the micro-batch size, which is different for different amounts of GPU VRAM, and the total number of global steps in a train also varies based on the size of each batch. So that means that if we want to compare apples to apples in any final comparison of the time and money cost of training models on different kinds of Lambda Labs machines, we'll want to exclude the validation cost -- once we've settled on a machine type, we're going to want to fine-tune the validation size for that in much more detail than I have to date, assuming we don't drop it entirely. However: I'm loath to make such a fundamental change halfway through this comparison. It's tightly coupled to the checkpointing code, and the charting code, and so on. So I think that for this post, I'm just going to keep it there, and keep track of how much time (roughly) we're spending on each validation step for each train, so that we can remove it and get a "pure" train-time only comparison between the different kinds of machines. It's not pretty, but I think it's better than changing horses mid-stream. On the other hand, the validation is a real pain when doing the binary chop to find out the maximum micro-batch size for our VRAM before we start the training run. That's because we have to wait for one validation to run before we get into the full training loop, which makes it slower. On top of that, having to do a manual binary chop is a PITA. What I think would be a true QoL improvement for the future trains is something that does the binary chop for us, using a dummy training loop. We run it once on each new machine type, get a micro-batch size to plug into our training parameters, and then let it rip, This will re-use so much of the code from the training script that I think it actually is just an alternative way of running it. After a bit of hacking, I came up with this updated code -- the diff is a bit hairy, but essentially: That takes just over six seconds to find the correct batch size on my local machine; with multiple GPUs, I expect it will be slower (there's a spinup overhead to start all of the per-GPU processes), but I'm sure it won't be as bad as the manual binary chops with validation that I was doing, and will be less error-prone. Right! We've done some QoL stuff, let's try another machine size on Lambda Labs :-) These are the machines that Andrej Karpathy is recommending for training nanochat, so let's see how we do with them. They cost US$23.92/hour; let's see how it works out. 
Here are the steps: Now let's download our dataset and find our micro-batch size: That took less than a minute to run -- nice! Now we can put that micro-batch size in . It does seem a little small -- after all, we could fit a batch of 64 into 160 GiB -- but I'll do some analysis later. Actually, before we kick off the train, let's see how long all of the preparatory steps took to run before we can do that -- not just the micro-batch-size script, but also the installation of the dependencies, the clone, and any overhead from boot time etc: Five minutes total. Not bad. Let's start the train: The initial validation run took 38 seconds, and then we started off. At 4m37s in, we get the first real validation run; at that point, it's running at 493k tokens/second. Eventually, it finishes, having taken about 1h50 including all of the validations. Here's the training chart: Two things stand out here: Further evidence that gradient clipping is likely to be an excellent addition to our training loop! It's also worth noting that the train loss spikes at the same time as the validation loss, so getting rid of the latter would still allow us to get a "best" checkpoint to compare with the latest at the end of the train. The machine was up and running for 2h9m, costing US$23.92/hour, for a total cost of US$51.47. The train took 6,650.197 seconds, so about 1h50m. Allowing for five minutes setup time, that's 1h55m accounted for. There's an extra 14m there -- that was because downloading those two checkpoints to my machine took quite a long time due to local network issues. Might want to look into ways to avoid that later. And for later cost-accounting purposes, we should note that it took 38 seconds or so for each validation run, and we can see on the chart that there were 24 of them. So, firstly, let's give our two models -- the best one and the latest one -- a smoke test: Both of those look OK! Now let's try the loss test. I started running it, but when it started downloading the dataset, I realised that it needed updating to allow for the changes I made to -- ooops! That done, let's give it a run for both of our models: As you'd expect, the best checkpoint has somewhat better loss, at 3.725, than the last one, with 3.734. Once again, better than our local trains, but not quite as good as the result with the first cloud train on that 8x A100 40 GiB machine, which was 3.674. Again, I'll put together a table comparing all of these results at the end. Does that make any real difference with the instruction fine-tune test? The test prints a lot out, but the headline numbers: So that was interesting! However, I am getting ever less convinced that the IFT test is a useful one; the randomness of the LLM-as-a-judge responses means that I don't think it can be consistent. Perhaps a better way to do this would be to batch up all of the models, and then give GPT5.1 answers from "model A", "model B", and so on all in one query, and then to ask it to give them scores all at the same time. That would hopefully make things at least a bit more consistent. Something to ponder later, I think. In the meantime, one extra thing I wanted to dig into before going on to the last train for this post: I mentioned that I thought that the batch size for that last run, 27, was a bit small considering that we'd managed to fit a size of 64 into the 160 GiB/GPU machine. 
But after thinking about it for a bit, it occurs to me that during my experiments doing fine-tuning, I came to the conclusion that memory use scaled linearly with batch size , with a fixed amount per element in the batch (the activations for the model for that batch element), plus an overhead (the model itself, the optimiser, and perhaps other stuff). We have batch sizes for: Now, that is slightly messy data because each memory "measurement" is the size of the card's VRAM, not the amount of VRAM we actually used -- there might have been anything from zero to just less than one extra batch element's worth of "spare" space -- but we can see what we get with a simple linear regression: And if we plot that, we get this: Nice! That fits really well. So we have an overhead of about 11.5 GiB, then about 2.35 GiB per batch element on top of that. That is, of course, somewhat sad news for anyone trying to repro this on a GPU with 12 GiB -- looks like it would be just too small to even fit in a single-element batch after the overhead :-( Anyway, that's been a bit of a side quest. Let's try our last machine size for what has (once again) turned into a bit of a monster of a blog post... This is the same kind of instance as the first train in this post, except that it has double the VRAM per GPU. Let's see what we can do with it. Once again, we create the run file, commit and push, then spin up the machine. On it, we clone the repo, run then . Next, we can find our micro-batch size: Interesting, we managed to squeeze an extra one in compared to the H100's batch size of 27, despite having exactly the same amount of VRAM! Not sure what might have caused that. It took 4 minutes to get to this point, so let's get that batch size into the config and kick off the run. The initial validation takes 1m06s, which is consistent throughout the train. The first real val run at 8m15s in, and the estimated train time is 2h35m, with a tokens-per-second of 286,188. At the end: Again, the latest and the best global steps are the same (despite some loss spikes): ...so we just need to download that and shut down the machine. How much did that cost us? The machine was running for 3h25m, costing US$14.32 / hour, for a total of US$48.76. Our train took 11,532 seconds, which is 3h12m, and our setup took about 4 minutes -- maybe five including the time required to update the train config with the micro-batch size, so we have 7 minutes on top of that, which is about the amount of time it took to download the model. Let's run some evals! Our smoke test gives us this: Coherent enough, I think! Now the loss on our test dataset; it comes out as 3.730, so pretty similar to our other cloud trains, apart from the oddly-low one on the 40 GiB GPUs. Now let's see what GPT-5.1 thinks of the instruction fine-tuned version. It only needs two epochs of fine-tuning, and believes that "The author of 'Pride and Prejudice' is 'Pride and Prejudice'", which is not promising, and gets a score in the same kind of range as the other models, 11.71. So: we've trained four models on four different machine sizes. Let's see how they stack up against each other, against our locally-trained models, and the original OpenAI GPT-2 weights. So, I've trained four of my 163M-parameter GPT-2 models, using almost exactly the same dataset -- the Chinchilla-optimal number of tokens, rounded up to make an even number of batches. 
I did this on four different multi-GPU machines on Lambda Labs:

I've done some evals on each of the models, so let's put those results together in one table -- results for the trains in this blog post, alongside those for the original OpenAI GPT-2 weights, both small and medium, and for the models I got when training locally. For all models, I've provided:

I've sorted the models in order of increasing loss on the test set -- so, the best model by that measure is first. The instruction fine-tune results are kind of all over the place, and I'll look into that later 5 . For now, let's focus on the test loss. We have a pretty clear pattern, where the local trains are grouped together at around 4.0, and the cloud trains at around 3.7. For the local trains, as I noticed last time around, FineWeb is counter-intuitively better than FineWeb-Edu. There are two interesting things about the cloud trains:

I think that what we're seeing here is that larger batches are better, but only up to a point. It's as if there's some kind of curve like this:

I got that by taking the log of the batch size, then asking NumPy to do a polynomial regression -- that is, work out a, b and c so that the formula

loss ≈ a · (log batch size)² + b · (log batch size) + c

...fits it as well as possible. (There's a sketch of that fit below.) It's kind of interesting that it's such a good fit with such an ad-hoc formula! We have a nice smooth curve hitting almost all of the points, and our optimal batch size looks like it's just a little below that 104 we managed with the smaller cloud machine, at about 97. But it's certainly not something that I'd like to read too much into. Best to treat it as purely illustrative: "it might be something like this". I think digging into that might be an interesting experiment at some later point.

A bit of checking around the Internet (and a chat with ChatGPT) suggests that it's something people have looked into in some detail, unsurprisingly. An interesting point ChatGPT raised is that with our pretty much fixed "budget" of tokens -- we're always training on something close to the Chinchilla-optimal number -- then a larger batch size means that we're doing fewer optimiser steps. Intuitively, that sounds like a problem. The larger batches mean that each move across the loss landscape is "better", or at least more stable. But we're doing fewer of those moves over the course of the train. There's obviously a tension between those two. You can imagine a degenerate case where the batch is so large you can fit the entire run into one iteration, so you do just one update of the parameters; that obviously wouldn't work very well. Anyway, for the purposes of this post, let's flag it as interesting and move on.

Let's take a look at costs. Here's another table for those -- for each cloud model, I've listed:

What do these numbers tell us, given what we were trying to do here? Like I said at the start, this was a pretty expensive learning experience: I wound up spending US$215.16 on Lambda Labs instances over the course of putting this all together. But it was worth it! At the start of this post (if you can remember so far back), I said I wanted to achieve two things:

Yes, absolutely. The trains I did, if we exclude the validation time, each cost between US$35.56 and US$39.14. In time, also excluding validation, the slowest ran for about 3h25m, and the fastest just less than an hour.
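Here's roughly how that curve fit can be reproduced -- a sketch; pass in the global batch sizes and test losses from the table above:

```python
import numpy as np

def fit_loss_vs_batch_size(batch_sizes, losses):
    """Quadratic fit of loss against log(batch size), as described above."""
    log_b = np.log(np.asarray(batch_sizes, dtype=float))
    a, b, c = np.polyfit(log_b, np.asarray(losses, dtype=float), 2)
    # For an upward-opening parabola the minimum sits at log_b = -b / (2a),
    # so the "optimal" batch size is its exponential.
    optimal_batch = float(np.exp(-b / (2 * a)))
    return (a, b, c), optimal_batch
```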
Now, in a future post I want to try making the changes that I listed at the end of my last post to see if I can get the loss lower: If I'm to do those, what I'll need to do is start with a baseline train on one particular size of machine, and then try introducing each change separately to see what happens to loss. I'll want to use a fixed seed for random number generation, so that I start with the same initial weights each time. Given what these experiments have already shown about loss -- that the smallest, cheapest machine has better loss than the other more expensive ones due to what I assume is the batch size -- then that actually feels like exactly the right machine to choose for this. It does take a while to train anything, but three and a half hours is pretty acceptable, I think -- I can do a train or two per day. An 8x A100 with 40 GiB VRAM per GPU is the way forward. So: next steps. I want to: This is going to be fun. Stay tuned! I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit . The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.  ↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩ I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right. DataParallel (DP). With this: The default GPU (normally ) is in charge of the process. It gets a batch of data, divides it up into per-GPU "micro-batches", and sends each of those to a thread for each of the other GPUs. It then sends an up-to-date version of the model to each GPU. Next, all of the per-GPU threads do a forward pass on their replica using their specific micro-batch, and send their outputs to the thread for the default GPU. The default GPU thread aggregates all of those outputs (similarly to how the losses across all of our batches and the prefix sequences are aggregated in the normal single-GPU case ) to work out an overall loss. It then does a backward pass. This will start on the default GPU, as the aggregation step is the first thing that it will come to when going backwards through the steps that came up with that overall loss. However, it will then come to operations that happened on the other GPUs and those are (somehow) parallelised. Once that is done, each GPU has gradients that represent how their copies of the model contributed to the overall loss. 
Finally, they send those gradients back to the default GPU, which combines them (I think of this as just being an average, though I gather it's more complex) and applies them, producing an updated model. Then the process repeats; the updated model on the default GPU will be sent to the other GPUs in the second step of the next iteration. DistributedDataParallel (DDP). This does less work on the default GPU and does less copying around. Each GPU has its own process (rather than thread), and is essentially responsible for its own training loop. Right at the very start, the default GPU's process sends the model to all of the others. Then all processes go into their training loop: Firstly, each one works out its own micro-batch (which means you need to have code to make sure that the datasets are properly split across the GPUs) Each model does its own forward pass, then its own backward pass, working out its own independent gradients. As it comes up with those gradients, it broadcasts them to a "reducer", which handles the aggregation. This is done in a distributed way -- there's not just one reducer handling everything. When all models have completed the backward pass, the reducer has a set of combined gradients, which is visible from the per-GPU processes. Each GPU process does its own optimizer step using those combined gradients. That means that there's no model copy required -- each GPU has applied the same gradient update, so they already have in-sync models, assuming everything went well. ZeRO. This is a much more complex system, and I went into how it works in this blog post . , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) = 0 for the process with rank 0 = 1 for the process with rank 1 = 7 for the process with rank 7 = 8 for the process with rank 0 = 9 for the process with rank 1 = 15 for the process with rank 7 Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack" But if is set to 2, which it happened to be in my case, then it will silently fail -- our first eval loop will get the first X from the validation set as , and the second X as . Zoom through the records in the dataset in batches of 1,000. For each batch: Tokenising each batch, so we get a list of lists of tokens. Convert that list of lists into a single list tokens separating each item. Convert that list into a PyTorch tensor. Add the tensor to a list. After that's all done, use to convert the list into a single tensor, and then save that with . I can upload the datasets to Hugging Face; their network connection will be better than mine, so I can just pay the price in time of uploading everything from home once, and then I can download them faster from HF to LL. 
That also has the benefit of meaning that after this experiment I can safely delete the local files, but then download them again if I need them. And if anyone else wants to repro this experiment, the data will be easily available to them. Lambda Labs have persistent filesystems that you can use. They cost $0.20/GB/month, so that would be about $5/month for all of my datasets. So I could upload the data to a cheap instance with a persistent filesystem mounted, shut down that instance but keep the filesystem, and then mount it on each machine I use to run tests. . The world size -- that is, how many per-GPU processes are we running? The micro-batch size The sequence length An 8x B200, with 160 GiB per GPU, at $39.92/hour An 8x H100, with 80 GiB per GPU, at $23.92/hour An 8x A100, with 80 GiB per GPU, at $14.32/hour An 8x A100, with 40 GiB per GPU, at $10.32/hour The loss they got on the validation set from the first train. Strictly speaking, I was kind of cheating and using that as a test set. The score given by the OpenAI GPT 5.1 model for an instruction-following dataset. This was the one provided in the book -- an Alpaca-style Q&A dataset, with a well-defined train and test set. Each model was fine-tuned on a training set of 85% of the data until loss on a validation set of 5% of the data started rising, and then tested on the remaining 10%. Sebastian Raschka, being a pro, was splitting up the data properly :-) If we're going to do validation then it does make some sense to do one at the start -- but doing one training iteration first seems kind of arbitrary (though it's clear how that drops out of the existing code). The validation runs on this machine are taking longer than they were on the less-powerful A100 GPUs! That confused me for a bit, until I realised that I didn't notice that it was slower with the batch-size 13 test, only with the larger ones later in in the binary chop. If we're using larger batches, then there's more work to do for the validation. Doing this binary chop by hand is annoying and error-prone, and worse, we have to wait for one of those (long) validation runs before we get into proper training. The initial training iteration can succeed, while later ones hit memory limits -- it seems like we need to wait for three or four training iterations before we can be sure that we have a workable batch size. Not quite sure why that is, perhaps it's something in the optimiser or the scaler? If : Local snapshot path. If : A list of DryRunFileInfo objects containing download information. I updated the function so that it takes flags to tell it whether or not to do validation (default true) and an optional maximum number of steps, which is by default. With those default values, it does exactly the same as before, of course. I created a function, which does all of the dataset-loading stuff that the original function did, and then calls with a -wrapped model. So that maintains the current flow. Next, I added a flag to the script; if that's not set, it just calls . However, if it is set, it instead calls a new function, which determines the largest batch size we can fit onto the current hardware for the current run, and (on the rank 0 process only, to avoid log spam), prints it out. does what it says on the tin; it confirms that we can train with batch size of 1, and that we can't with batch size 70 (chosen because the limit was 64 on that massive B200 machine), then chops between them to find the largest batch size that doesn't OOM. 
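Here's roughly what that boils down to -- a sketch with a made-up run_short_train helper standing in for the real dummy training loop, not the repo's actual code:

```python
import torch

def fits_in_vram(micro_batch_size: int) -> bool:
    """True if a few dummy training steps run without hitting an OOM."""
    try:
        run_short_train(micro_batch_size, steps=3)  # hypothetical helper
        return True
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        return False

def find_max_micro_batch(low: int = 1, high: int = 70) -> int:
    # Assumes `low` fits and `high` doesn't; binary-chop between them.
    while high - low > 1:
        mid = (low + high) // 2
        if fits_in_vram(mid):
            low = mid
        else:
            high = mid
    return low
```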
Under the hood, it just constructs a dataset with the appropriate micro-batch size, then runs a three-step train with no validation to see if it raises an OOM. PyTorch rather messily just raises a generic exception for those, but we can look inside the exception's message to see if it is an OOM.

- Create the run file, commit and push.
- Spin up the machine.
- On it: clone the repo

- We had two nasty loss spikes.
- As a result of the second of those, the best iteration as per validation loss is not the last one.

- Best checkpoint: 4 epochs of fine-tuning, and a score of 11.98 -- another record low! Amusingly, it confidently said "The author of 'Pride and Prejudice' is Sarah Palin".
- Latest checkpoint: 5 epochs of fine-tuning, and a rather good score of 17.91.

- 24 GiB locally, which was 6
- 40 GiB in the first train in this series, which was 13
- 80 GiB in the last one, giving us 27
- 160 GiB in the one on the huge machine, giving us 64

- An 8x A100 40 GiB
- An 8x A100 80 GiB
- An 8x H100 80 GiB
- An 8x B200 160 GiB

- The loss on my test set.
- The results it got on an instruction fine-tune test based on Sebastian Raschka's.
- The global batch size (that is, for single GPU runs, just the batch size, but for the multi-GPU ones, where each batch is made up of per-GPU micro-batches, the per-GPU batch size times the number of GPUs). 4

- They're all consistently better than the local ones.
- The one on the smaller machine is better than the ones on the larger ones; indeed, it looks like the larger the machine, the worse.

- How long the training run took.
- How much the machine cost per hour.
- How much the training run cost.
- How much of that was doing validation (which I'm now thinking is pointless on single-epoch trains like this).
- How much it would have cost, and how long it would have taken, if it had been run without validation.

- I wanted to learn how to change a simple single-GPU training loop to make it multi-GPU.
- Could I get the training time for a full base model down from 48 hours to something more manageable -- and, hopefully, not too expensive?

- Removing dropout
- Tweaking the learning rate (and maybe adding the warmup and cosine learning-rate decay stuff I've read about).
- Reverting the architectural differences between our model and the original GPT-2: reintroducing weight tying between the token embeddings and the final linear layer, and also bias in the attention weights.
- Trying full-fat 32-bit precision.
- Fixing the exploding gradients issue with gradient clipping.

- Dig in to the instruction fine-tuning tests a little more -- as I've said above, I'm not 100% happy with how comparable it really is between models, at least given how I've been running it so far.
- Upload the models we have to Hugging Face. I have a new motherboard ready for my PC, and replacing the old one has a risk that I might mess up and break the NVMe drive I have them stored on. I was holding off on this because it would mean sharing Raschka's GPT code, but having noticed that he's already licensed it all under the Apache license, I can release them under the same one.
- Strip out the validation stuff. We can use training loss to track our progress, and losing evals during the train will help keep the cost down.
- Finally, do the trains to see how each of the levers above affects loss.

Giles's blog 1 month ago

Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090

Having worked through the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware? The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series , I did some naive scaling of numbers I'd got when fine-tuning LLMs and came to the conclusion that it would be impossible in a reasonable time. But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC. Additionally, Andrej Karpathy recently announced nanochat , "the best ChatGPT that $100 can buy". He mentions on the main page that he's trained a model called , with 32 Transformer layers, which has 1.9B parameters, for about $800. His smaller 20-layer model, with 561M parameters, he says should be trainable in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the $100 total price. What's even more interesting about nanochat is that it's built with PyTorch; initially I'd got the impression that it was based on his pure C/CUDA , which I would imagine would give a huge speedup. But no -- he's using the same stack as I have been in this series! Karpathy's models are both larger than 163M parameters, so it definitely sounded like this might be doable. Obviously, I'm nowhere near as experienced an AI developer, and he's using a larger machine (8 GPUs and each of them has > 3x more VRAM than mine), but he's also including the time to train a tokeniser and instruction fine-tune into that four hours -- and his smaller model is more than three times larger than mine. So that should all help. This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project. But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2 small sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs. Here's the full story. For this project, I want to use the exact same model code as Raschka presented in the LLM from scratch book -- my copy here . There have been a number of architectural improvements to LLMs since GPT-2, but for now it's best to keep things simple. But there are still some settings to decide on. The config dictionary for the models we've been using has these parameters: There's also the aspect of weight-tying -- the original GPT-2 reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits . There's nothing in the code we've been working with to enforce that, though -- when we do our small train in the book, we're using independent weights for each of those steps. The only time it is "enforced" is when we download the pretrained weights from OpenAI, where we put the same values into both the embedding matrix and the final output head. 
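The config dictionary's contents aren't shown here, so for reference these are the GPT-2-small-style settings the book uses -- quoted from memory, so double-check them against the book or repo:

```python
# GPT-2-small-style settings from the book (from memory -- treat as approximate).
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # size of the GPT-2 BPE vocabulary
    "context_length": 1024,  # maximum sequence length
    "emb_dim": 768,          # embedding / residual stream width
    "n_heads": 12,           # attention heads per layer
    "n_layers": 12,          # Transformer blocks
    "drop_rate": 0.1,        # dropout rate
    "qkv_bias": False,       # bias terms on the query/key/value projections
}
```

Without weight tying, the separate output head adds roughly another 50,257 × 768 ≈ 38.6M parameters on top of the "124M", which is roughly where the ~163M figure comes from.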
Given that Raschka says that it's in general better to avoid weight-tying, and actually doing it would be harder than not doing it, then it seems a no-brainer to not do it. So, what does that mean about our model? That matches what we got when working through the book; 163M parameters. Can we train it? It seems like every AI project starts with the question "what data can we use?" The original report on GPT-2, " Language Models are Unsupervised Multitask Learners ", is frustratingly lacking in details. However, it does say that they trained it on "8 million documents for a total of 40 GB of text". Now, according to OpenAI , it's reasonable to assume roughly four characters per token for typical English text. So 40 GB of text is ~10 billion tokens. That data was essentially gathered by scraping pages linked from Reddit that had more than three upvotes there, so was reasonably high quality. Can we get something similar? Conveniently, Hugging Face host a big dataset called FineWeb , and that has a 10 billion token "sample" dataset, randomly selected from the full 18.5 trillion tokens. So the sample feels like it's order-of-magnitude right. And while reading more about Karpathy's nanochat, I spotted that it uses FineWeb-Edu , which is a version of FineWeb that contains "only the most educational web pages". I wrote a script to download both of those , and kicked it off. It took about 20 minutes for each one (slow wifi in my study, I was getting < 5MB/s); FineWeb's 10B sample took up about 29 GiB, and FineWeb-Edu's about 27 GiB. Time to take a look at them. The Hugging Face function loads up all of the files you provide, and you can tell it how to split them up into train/validation/test sets. This command just loads up the whole FineWeb one and says "treat it all as the train split", which is good enough for now: Yikes. It took 1 minute, 53 seconds to generate the train split. However, that appears to be a one-off cost -- when I accessed it again later using the same code in a different Python session, it just did the second "Loading dataset shards" portion, taking three seconds, not the generation of the split. Presumably it caches it. Anyway, let's see what's in it: Great, so we have 14,868,862 rows, each of which has various bits of information. Checking the first one's text: Well, for FineWeb, that doesn't look particularly "fine", but I guess it's better than the stuff that Karpathy talked about in his recent interview with Dwarkesh Patel : When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet Let's take a look at FineWeb-Edu. That looks a lot better! Now let's take a look at the document lengths in terms of tokens. There's a column, but I don't know which tokeniser that's for, so to be safe we'll calculate it ourselves. How long would it take to tokenise every row in FineWeb 10B to check? Let's tokenise the first 10,000 of the 14,868,862 that we have, and see how long that would take -- then we can work out the estimated time for the whole thing. 2,160 seconds or about 36 minutes. Yikes! After a bit of digging, though, I found that tokenisers can handle batches (poorly documented, but it's there in the source ): Also, we can map a function over an entire HF dataset, and that can be made to run with multiple processes. 
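Putting those two together might look roughly like this -- a sketch rather than the post's actual code; the parquet path, batch size and num_proc value are made up:

```python
import tiktoken
from datasets import load_dataset

def add_token_counts(batch):
    # get_encoding caches internally, so creating it inside the mapped
    # function keeps things simple when num_proc spawns worker processes.
    enc = tiktoken.get_encoding("gpt2")
    # encode_ordinary_batch tokenises a whole list of texts at once, and
    # skips special-token handling (tiktoken otherwise refuses to encode
    # text that happens to contain strings like "<|endoftext|>").
    token_lists = enc.encode_ordinary_batch(batch["text"])
    return {"n_gpt2_tokens": [len(tokens) for tokens in token_lists]}

ds = load_dataset("parquet", data_files="fineweb-10BT/*.parquet", split="train")
ds = ds.map(add_token_counts, batched=True, batch_size=1000, num_proc=8)
```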
So, we can combine the two: Just over three minutes, not too bad! (The reason the command count above jumps from 47 to 53 was that in the first run I didn't have the in there -- one of the rows in the dataset had in it, and the tokenizer rejected it. I'm going to play fast and loose and ignore that for now.) Now let's see how it added it: Cool! We've added a column with the number of GPT-2 tokens for each row, and we can extract what amounts to a list of those values. Let's plot them as a histogram. Trying to do it directly -- that is, just doing ...seems to make MatPlotLib very unhappy, and my interpreter crashed with an OOM -- I think it might be trying to load all of the dataset -- text, IDs, etc -- into RAM in one go. So I started a fresh one and did the stuff to load it and annotate it with token lengths again -- weirdly, this time the mapping only took 10 seconds or so! That was strange, I'll need to look into that. Perhaps the earlier command added the column to the files on disk? To work around the memory issue, I converted the column from the dataset to an actual list: That took ten or twenty seconds. Let's then try the plot again (full code this time): That took about 11s to run, and the result is this: That's really promising! The bulk of them are less than our 1,024 token sequence length. 1 If we present each row in the dataset as a stand-alone training sample, cropping them when necessary, perhaps we won't lose too much data? Let's see. First step, how many tokens are there in total? Nice, about 10B, as expected. How many tokens would we have if we cropped them to the default GPT-2 context length of 1,024? Ouch, 7.3B. That's quite a reduction: So we're losing 29% of our tokens by that cropping. That's from curtailing just 16% of the sequences: That's not great. I feel that we have two options here: At this point in the experiment, I'm going to keep both options open. I'm inclined towards the latter (I believe it's closer to what the real GPT-2 train did), but I'm not sure. Anyway, we're scoping things out here, so let's move on. After looking at the data, I've thought a bit more about this. I'd previously been thinking in terms of training across all of the tokens in the dataset; we'd work our way through the 10B tokens, and then we'd be done. But when training a model, you do multiple epochs, normally -- you run through the dataset once, updating your gradients as you go, then run through it again likewise, and eventually you stop when your validation loss starts rising. I think that because I'd read that LLMs are normally trained on just one epoch these days, I'd kind of internalised that we only need to do one. But it wasn't the case in 2019 when GPT-2 came out. They had less data -- just 10B tokens or so, compared to insanely huge datasets like the full FineWeb (not the 10B one we've been looking at -- the 18.5T full one), so they would have trained it for some number of epochs. How many? That's another case where the GPT-2 paper is annoyingly light. This report says in the "Replicating GPT-2" section that OpenAI trained it for 800k iterations with a batch size of 512. Plugging in a sequence length of 1024, that gives us this many tokens: Over 419B tokens! 
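For the record, the arithmetic behind that figure:

```python
iterations = 800_000
batch_size = 512
seq_len = 1024
total_tokens = iterations * batch_size * seq_len
print(f"{total_tokens:,}")  # 419,430,400,000 -- just over 419B tokens
```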
Now, if we believe that their dataset was 10B tokens, then we can work out how many epochs that came to: The same report says that they -- as in, the report authors -- make that "around a total of 60 epochs through the training set" -- I believe that the training set they're talking about could well be slightly shorter than the original GPT-2 one -- the GPT-2 authors didn't release their own, which is called "WebText", so the report's author is using a different one that tries to replicate it, OpenWebText . That sounds expensive; even without knowing how many tokens per second we can train for, 40-odd epochs of 10B tokens each sounds like it would take a long time. Are there any other comparison points that might tell us how long to train for? Well, there's a "Chinchilla heuristic" that I've heard of, which says that you should train on about 20 tokens per model parameter. I spent some time reading into where that comes from; originally it's in " Training Compute-Optimal Large Language Models " from Google DeepMind, and it's an interesting paper, and is surprisingly easy to read, with a few bits of maths that get a bit hairy (but aren't required to get a good-enough feel for what they're saying). I recommend you take a look. It was written in 2022, and the authors felt that people were scaling up models a lot, but weren't increasing the number of tokens that they used for training enough. So, they trained a huge number of models, trying to answer the question: "given a particular budget in training FLOPs, what is the optimal balance of training tokens versus parameters to make sure you're using those FLOPs most efficiently?". They were arguing against the method taken in a particular paper, where another team had trained a model (called Gopher) on significantly fewer tokens than they thought optimal. The number of FLOPs used to train a model is linear with both the number of parameters and the number of tokens you train it on, so if you get 2x the number of FLOPs that you had before, you can either train the same model on twice as many tokens, or you can double its size. Which is better? Their conclusion was that you should actually scale both parameters and tokens up by the same amount -- that is, in the 2x case you'd want to have 2 times both the parameters and tokens, which would double your FLOPs and get you better performance. As you can probably see, by doing this they indirectly worked out an optimal number of tokens to train a particular size of model for. They don't state the "20x" heuristic themselves, but it's pretty clear in table 3 in the paper, where they give a number of model sizes and the optimal number of tokens for each. Now, this number is not the number of tokens you need to train for to get the best model you can for a particular number of parameters; a model of a given size can always be trained more and will (hopefully) get better. But it tells you when you've trained on enough tokens that you could get better results by training a larger model than you have right now. They're implicitly assuming that models can get as large as you want, which of course is not the case -- in reality, you're going to be targeting a particular model size, the size that can fit on your training hardware (or more likely with production models, the size that can fit on your planned inference hardware). But interestingly, looking at the README.md for Karpathy's nanochat project, he trained his 1.9B "d32" model on 38B tokens -- exactly 20x. 
And if you look at the script in the same repo, he explicitly says that he's training for 20x parameters for the smaller model: If Andrej Karpathy thinks that training for Chinchilla-optimality is the right way to go, then who am I to disagree? ;-) More seriously, perhaps the better quality of the dataset makes this a reasonable thing to do. From the GPT-2 paper, their description of how they got the data: ...we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. That's a clever trick, but I believe that FineWeb is much more carefully filtered and improved than the WebText dataset they got from that. Back in 2019, they had to do everything from scratch -- find appropriate ways to get data, filter it, and so on. Now we can just download stuff from Hugging Face. So maybe Chinchilla-optimal is enough. Anyway, we have 163,009,536 parameters, so on that basis, let's train for: ...tokens. (I'll just use 3.2B from now on, but that's the actual number I mean.) That's pretty cool! We have more than that number of tokens already in our FineWeb 10B sample, so we can do a single-epoch training run. So the question is -- is that even doable on my hardware? It all hinges on how many tokens per second we can train at. A good way to check this is to write a throwaway "trainer". We can use that to work out what our maximum batch size on the RTX 3090's 24 GiB of VRAM, then run a bunch of batches through -- a forward and backward pass for each -- and see how many we get. This won't estimate how much time we'll spend validating the model, of course. But my gut is telling me that we should spend no more than 5% of our training time running validations, so we can later on do a similar test, eval mode, forward pass only with no gradient tracking, and use that to work out how many tokens should be in the validation set. So, let's estimate training speed. This code gets an estimate of tokens/second at different batch sizes. Hopefully it's clear enough to not need an in-depth explanation. An outline: Here's what it prints out: So we can see that it gets faster as we increase the batch size, which makes sense because we're handling sequences in parallel, but it does flatten off a bit, which makes sense because there's a limit to how much parallelism we can do, even on a GPU. Let's see how that fits in with the different training sizes we looked at above: OK. We're definitely not going to be able to train this thing the GPT-2 way! I expected that to be the case, but now we have a solid proof of that. But the three-day Chinchilla-optimal train actually sounds doable! I'm heading to London to visit family soon, so won't be using my home PC. With a bit of help from Tailscale I'll be able to log into it from my laptop, though, so I can potentially nurse a run through. Can we make it any faster? Now, when doing the fine-tuning work, I found that you could generally speed things up by doing everything in 16-bit rather than 32-bit. Intuitively that makes sense -- lower-precision numbers, fewer bits, means less work for the GPU doing the various multiplications and additions that are involved in our train. 
Working with ChatGPT, I found a couple of ways to take advantage of that.

Firstly, using TF32. The normal float32 format uses 8 bits for the exponent, and 23 for the mantissa. If you haven't looked into how floats are represented in memory (or if you've forgotten), that means that, using m to mean the mantissa and x the exponent, the numbers are represented as roughly ±1.m × 2^x. TF32 is messier; it has the same exponent size -- and thus the same range -- as float32, but it essentially ignores the lower 13 bits of the mantissa. So it takes up the same amount of memory, but is lower-precision, which means that calculations can be faster. Most importantly, cards like the RTX 3090 have dedicated "tensor cores" -- as opposed to the normal CUDA cores that do normal matrix multiplications -- and they operate in TF32. Unsurprisingly, "TF32" is "tensor float 32-bit".

PyTorch has a setting to tell it what precision to use for matrix multiplications; the default, "highest", means "use float32 all of the time", so you're stuck using just the CUDA cores. If, instead, you set it to "high", then it will use TF32 if the hardware supports it and it has the appropriate kernels available. So that will let us use the tensor cores. I added that setting to the code above, just above the loop over the different batch sizes. Let it run, and: that's a 22% speedup!

Of course, the precision of the training isn't as good. But given that many modern models are trained at 16-bit (I've seen suggestions that some are even trained as low as 4-bit) then that shouldn't matter. Let's see whether we can train in 16-bit instead.

PyTorch has a smart mode where you can tell it "use 16-bit where it makes sense, otherwise use 32-bit" -- AMP, which stands for "Automatic Mixed Precision". There's a great recipe for how to use it in the docs , so let's use that. We need to create a scaler object to handle scaling parameters from 16-bit to 32-bit as needed -- we can re-use that across all batch sizes, so we can create it just before the loop. Then we need to replace the core part of our training loop with some code that uses AMP and that scaler: basically, we use a context manager to switch it on when we're doing the forward pass and working out the loss, and then use the scaler to manage the backward pass and the optimiser's step. (There's a rough sketch of what that looks like below.)

Running that gives us these results -- wow! With that we can train on 3.2B tokens in about 160,000 seconds, which is 44 hours. That's definitely doable. Now, what happens if we remove the matmul-precision tweak, so that we're using AMP but not the tensor cores? It's basically the same: 300tps slower at the start, down to 70 at the end. Still, it looks better to keep the "high" precision in place, rather than the "highest".

Right. We have the beginnings of a training loop that should be able to let us run a Chinchilla-optimal train on a GPT-2 small sized model in 44 hours, and I have the time to do it. And it looks like a batch size of six is what we can fit into the RTX 3090's 24 GiB of VRAM. What else are we going to need to build something to do this?

If I want to do a long training run, then stuff might go wrong -- it might crash for some reason. So we're going to need to save checkpoints as we go and be able to restart training from those checkpoints. In those, we're going to need to save the model and the optimiser's state, plus some kind of info about how far through the dataset we are.
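Here's that rough sketch -- illustrative only, with model, optimizer, train_loader and device assumed to exist, and not necessarily matching the post's exact code:

```python
import torch
import torch.nn.functional as F

torch.set_float32_matmul_precision("high")  # allow TF32 on the tensor cores

scaler = torch.cuda.amp.GradScaler()  # created once, reused across the run

for x, y in train_loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    # Run the forward pass and the loss calculation in mixed precision...
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(x)
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
    # ...then let the scaler handle the backward pass and optimiser step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```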
We should keep training and validation losses too, so that we can easily chart and recover our progress, and according to this forum post we're going to need to save the scaler (which makes me think that it actually has state in it, so we probably should have used a fresh scaler for each batch size in the above -- let's hope that doesn't prove to be a problem [note from later: it wasn't]). I wrote a script to create a model, train it for a bit, and then dump out all of that apart from the metadata (which I reckon is going to be less than 1kB). I wanted to use the safetensors format for all of it, but unfortunately I couldn't get it to work for the optimiser or the scaler, so had to use for those (which I don't like because it uses pickle , which introduces serious problems if you ever want to move files from machine to machine, as the Python and library versions need to match perfectly). Ah well. Here's what the test checkpoint looks like: That's huge! And it's almost all the optimiser. From what I read, that stores two numbers per parameter, so it makes sense that it's double the size of the model weights. And at 32-bit, 4 bytes per param, then 670MiB for the model is sane. Timing-wise, it takes about a second to save, the same to load, so that's fine. So that sounds reasonable in terms of timing, and disk space is pretty high, but not so huge that it can't be managed with careful planning -- don't checkpoint so much that we run out of disk during the train (I have a 2TiB disk, but it's far from empty). It's probably worth double-checking that it works, though! Because my checkpoint test already did some training, I changed it so that it does this: Looks sane! The numbers for loss are the same before and after, so I think it's vanishingly implausible that the checkpoint we restored is different from the one we saved. And the continued training seems to be working -- at least, loss is going down -- so that sounds reasonable too. OK, so, again, the time taken to checkpoint is negligible, but the disk space isn't. I reckon we can comfortably do 100 checkpoints over the train. That's roughly one every half-hour over 44 hours. We're going to want to do a validation run each time we checkpoint, so let's think about that next. How big should our validation set be? Let's say we only want to spend 5m per checkpoint period doing validation. How many batches can we get through in that time? I wrote a simple script to run a model (after a few hundred training steps) in eval mode on different numbers of iterations to see how long each one took. It used the same trick as the training loop above in order to use mixed precision, and I ran it with instead of the that I've used in the past (ChatGPT tells me it's a little faster). I also put in some calls to around the loop that I was timing, which should apparently help make sure that the numbers are precise. The code is here if you'd like to take a look. After some fiddling with the min/max numbers at the top: OK, so let's call it 3200. That's 3200 * 6 * 1024 tokens = 19,660,800 tokens. That's about 0.006144 of our training set. Pretty low, but we're talking about such a large training set that I think we're OK. And practically we can't do more -- we're already talking about 5 mins every half-hour, so we're bumping up our train time by 88 * 5 = 440 minutes, which is seven hours. Now let's start thinking about the datasets. We can split the HF thing into train and validation sets. 
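A quick sketch of that split using the datasets library's built-in helper -- the 99/1 proportions are the ones discussed just below, and the seed is my own addition; `ds` is the FineWeb sample loaded as a single "train" split, as earlier:

```python
splits = ds.train_test_split(test_size=0.01, seed=42)
train_ds = splits["train"]  # ~99% of the rows
val_ds = splits["test"]     # ~1% of the rows, to carve the validation set from
```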
I'm thinking it might be useful to load all of our training and validation data into RAM for the train loop. 3.2B tokens with four bytes per token should be about 13 GiB, after all, and I have 64 GiB RAM on the machine. ...but wait, int64 is the default for PyTorch for long ints -- that's what our token lists are in the original, and it's twice the size, so we're talking 26 GiB. I believe that PyTorch expects that format for the cross entropy loss. That's not the end of the world, though -- we can store the data as int32 in RAM (with 50,257 as our vocab size we could even use an unsigned 16-bit type if we wanted to) and then we'll need to make them the right type just before using them. We can do that when splatting them onto the GPU, converting to int64 as part of the move. First thought: can we store them as a Python list? Turns out they're not all that memory-efficient. How about PyTorch tensors? Promising! (Though ChatGPT pointed out when reviewing a draft of this post that I was using the default integer type rather than an explicit one here. Still, it's the same size.) Let's measure memory usage in a new interpreter. Yup, 12,801,474,560 bytes, so about 12 GiB. Can we save it? OK, let's try reloading it in a fresh session: Nice. So, I think we can write a quick script that splits our incoming dataset into say 99/1% train and validation, grabs the first 3.2B tokens from the training set, gloms them together into one big tensor with EOSes between them, and saves that; then it does likewise for the first 19,660,800 tokens from the validation set. We'll use FineWeb, with the possibility of switching to FineWeb-Edu later on. Doing it that way means that we're actually using the second of the two options I considered earlier: treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. I thought it would be harder than concatenating/padding rows, but it actually turns out to be simple enough. Let's give it a go. Here's the code. I wanted to have a round number of 6-sequence batches of 1,024 tokens each, so the number of training tokens worked out at 3,282,124,800 rather than the strict Chinchilla-optimal 3,260,190,720, but that's no biggie. Running it takes 5m55s, and then: Looks about the right size -- 19M * 4 for val, 3.2B * 4 for train. Cool! Let's finally write our training script. You can see the full training script here -- note that this is the final version from the repo, so isn't exactly what I'm running at this point in the post. The checkpointing code is (sensibly enough) in a separate file. It took two days to run, and... Both train and validation losses fall nicely! Training loss is a bit choppy, but that's because I erroneously only plotted the most recent iteration's training loss rather than an average over all iterations between the last and current validation run; the validation loss is correct because I did average all of the validation numbers. (The version of the code linked above fixes that error.) The best checkpoint for val loss is not the last one, but it was close. Looking at the last five validation runs, their val losses were: It's time to do some evals. Firstly, let's try the smoke test that we do in the book. What does our model think should come after the text "Every effort moves you"? With uninitialised weights we get gibberish, as expected. But with our best checkpoint we get this: Nice! The multiple mentions of protein are actually the kind of repetition that small models tend to do, so that's not bad news.
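For reference, the smoke test is nothing fancy -- it's just a greedy sampling loop over the model. Here's a sketch; the model argument is assumed to be the book's GPT model loaded from a checkpoint and returning logits of shape (batch, sequence, vocab), and the book's own generate function has a few more options (like top-k sampling), so treat this as the general shape rather than the exact code.

```python
import torch
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

@torch.no_grad()
def smoke_test(model, prompt="Every effort moves you", max_new_tokens=30, device="cuda"):
    # `model` is assumed to be the GPT model from the book, loaded from a
    # checkpoint, returning logits of shape (batch, seq_len, vocab_size).
    ids = torch.tensor([tokenizer.encode(prompt)], device=device)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])                            # crop to the 1,024-token context
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy decoding
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```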
Let's try with the last iteration's checkpoint: Also very nice, perhaps better! I think that both of those are qualitatively as good as the result we got when we loaded the pre-trained weights from OpenAI, which was: That's very reassuring. But is there something a bit more quantitative that we can do? Firstly, can we compare it to anything in the GPT-2 paper? In figure 4 they give their perplexity against their train and test sets for the different model sizes; for the small one it's a bit over 16. Let's assume that they're basing that on natural logarithms, so they mean that they have a loss of ln 16. That's about 2.77, which is much lower than our best loss of 3.9401. However, that is across different datasets, so while it makes me suspect that their model is better than ours, we can't really say for sure either way. The cool thing is, though, that we have their model -- so we can actually run it against our dataset. I wrote a script to do just that, and running it gives us this: Still better than ours :-( I considered doing the same thing against Qwen to see whether that was also better, but with a different tokeniser we couldn't really treat it as comparable. Loss and perplexity are both over next-token predictions, and if the meaning of "token" changes, then the numbers will change. 2 OK, so we have a model, but it's not as good as the original GPT-2 small. Our loss on our validation set is roughly 3.94, while the original weights get about 3.50. Expressing that in terms of perplexity gives our own model about 51.4, while the original has 33.1. That's actually still higher than the 16 that they had in the paper, which is interesting -- presumably it's related to the fact that they're validating over their own WebText test set rather than ours; they're both samples of web content, but there must be differences. At this point, my guess is that this shows that all of that extra training that the OpenAI team did beyond the Chinchilla-optimal number of tokens did have a real benefit -- and that's not surprising. Remember that the Chinchilla paper is about the best way to spend a FLOPs budget. They're not saying that you can't drive down loss by continuing to train your model further -- of course you can. They're saying that when you pass the optimal number of tokens, you should increase the model parameters and the tokens by the same ratio, and by doing that you'll get the best balance. But a Chinchilla-optimal model of 163M parameters might still be useful. What happens if we instruction fine-tune it like we did the original model in Chapter 7 of the book? In that post and its followup, we used some training samples using the "Alpaca" one-shot question-answering format: ...to get a model to which we then provided a test set of questions in the same format; we then used the Llama 3 7B model to judge the results on a scale of 0 to 100. We averaged the results and got a plausible-looking indicator of how useful the model was, as compared to the more narrowly technical loss number. One problem with that is that we ran those tests on the OpenAI weights for the medium-sized 355M-parameter GPT-2 model. If we don't want to be comparing apples to oranges, we'll need to re-run it on their weights for the small model. Let's see how we do. First, let's run it for five epochs just to see when/if it starts overfitting: OK, so two epochs looks like the right amount, just as it was with the medium model.
So we can train for that (because I'm using the original code I wrote when working through the chapter, I didn't checkpoint during training -- but it takes less than a minute to run the whole thing, so no biggie). Here's the loss chart: Validation loss at the end is 0.733, noticeably above the 0.649 that I got with the medium-sized model. And the sample outputs shown at the end aren't as good, either. With the medium-sized model, I got these: ...but with the small model (remember, this is with OpenAI's original weights) I get this: Definitely worse, especially the last one! Let's see what Llama 3 thinks of it, again using the code from the book: The medium model got an average of 50, so the OpenAI small model is definitely much worse, as the examples suggested. Makes sense. Let's see how our own base model performs when fine-tuned on the same data. After a bit of fiddling I found that validation loss settled down at the end of epoch 10: (It's hard to see from the chart, but validation loss was actually very slowly dropping even after epoch 5.) It's interesting that our own model took longer to train here, but it does make sense in terms of it being that little bit dumber. The samples it printed out at the end are also interesting: The simile is pretty good, I think better than the OpenAI original weights' one, but the storm clouds one is dreadful. It's fascinating that they both chose the same wrong answer for "Pride and Prejudice" -- my guess is that it's because the training set contained this question: ...so both models picked up on Robert Frost being a useful author to reference in answers. Anyway, what does Llama 3 think of the output? Yup, it's dumber than the original weights -- but, at least to my mind, closer to the original weights' score than you might have thought based on that loss/perplexity number alone. But, on the other hand, I'm not convinced that Llama 3 7B is smart enough to be doing a good job. In the stuff the eval script printed out, we have this: This is clearly completely wrong: the mention of cumulonimbus is coming from the dataset response, not the model response. Llama 3 7B is tripping up over what came from where, which is pretty normal for a small model. Of course, it's possible that the OpenAI GPT-2 small weights' responses have also been given higher ratings than they deserve -- or, indeed, that there were right answers that were incorrectly judged wrong. Conceivably it averages out. But there's no reason to assume it would, so it's essentially noise and is making the results less useful. Let's try using a much smarter LLM as a judge and run both of the models' responses through it -- the just-released OpenAI GPT-5.1 model. The code is here. Running that against our own model's answers: ...and against the model fine-tuned from the small OpenAI weights: ...and, of course, it didn't make the mistake of confusing the dataset response with the model's in any of the cases printed out. ChatGPT 5.1 in the chat interface is very smart, so I expect these results are much closer to a reasonable ground truth. Out of interest, what does it make of the model based on the GPT-2 medium weights that we train as part of the book? That's as compared to an average of about 50 from Llama 3 7B. It seems like GPT 5.1 is a tougher judge than the small local model -- and my guess is that that is because it's more accurate. 3 Anyway, the ranking remains the same; after fine-tuning on the same Alpaca dataset, GPT-2 medium > GPT-2 small > our model.
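Since the GPT-5.1 judging step comes up repeatedly from here on, it's worth showing the shape of it: one chat-completion call per test sample, asking for an integer score. This is a hedged sketch rather than the actual eval script (which is linked above) -- in particular, the prompt wording here is mine, not the book's.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(instruction: str, reference: str, model_response: str) -> int:
    # One judging call per test sample; the real prompt wording differs.
    prompt = (
        f"Given the instruction:\n{instruction}\n\n"
        f"the reference answer:\n{reference}\n\n"
        f"and the model's answer:\n{model_response}\n\n"
        "score the model's answer on a scale of 0 to 100. "
        "Respond with the integer only."
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",  # default parameters otherwise, as in the post
        messages=[{"role": "user", "content": prompt}],
    )
    return int(completion.choices[0].message.content.strip())
```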
But it's still a relatively close-run thing between our model and GPT-2 small. Can we close the gap without vast amounts of extra training? The results so far were from using 3.2B tokens of the FineWeb 10B corpus. Now, as I noted at the start of this post, Andrej Karpathy's nanochat project uses FineWeb-Edu, a separate corpus designed to be really informative. Indeed, back at the start when we were looking at the two datasets, the first row in the Edu dataset was about Jane Austen, so maybe we would wind up with a model that at least got that question right! That's going to take another two days to train, but that's no big deal. We first need to change our script that generates the train/validation splits to regenerate them using the Edu dataset; we'll move the old ones to one side, though -- it will be interesting to see what loss we get on the non-edu validation data with the new model. (Note to self: work out some way to split out different datasets and training runs for future experiments like this. The setup I had in my recent post on RNNs worked quite well. Throughout the remainder of this post I'm juggling directories of checkpoints and datasets, and I'm sure I got it right, but it was an error-prone process.) That being done, it's time to move the checkpoints we already have to one side, and to kick off the train! Here's what we have after two days on that -- oops, I forgot to add the code to average training loss across all of the batches, so again it's a bit spiky. But we got to a final eval loss of about 3.693 this time. Of course, that's on its own validation set, so it's not comparable with the numbers from before; loss is specific to a particular dataset. Let's see what it makes of the original run's validation set. Juggle some directories around (my messy file structure means that there is just one "datasets" directory and one "checkpoints" one, so I'm moving them around to make sure I'm using the right combination): We get 4.16! That's truly terrible, worse than both the original base model that we trained on FineWeb's non-edu dataset and the OpenAI GPT-2 small weights. Let's see what we get from the closer-to-real-world instruction fine-tuning test. Five epochs turns out to be best: I won't bother running it past Llama 3 7B, as that's proven unhelpful, so we'll go straight to GPT-5.1. Gosh! So it's judged slightly worse than our weights based on FineWeb. That does surprise me a bit. I was definitely expecting the Edu version of the dataset to give us a better model. So: OpenAI medium > OpenAI small > our FineWeb base model > our FineWeb-Edu base model. That last pairing is the real surprise. Handwaving wildly, perhaps the more "regular" nature of the Edu dataset meant that the model saw less variation in its training set, and that actually made it learn less? I think there's one more experiment I want to do before bringing this (very lengthy) post to a close. We've shown that Chinchilla-optimal training produces worse results than OpenAI's original -- and, we think, longer -- train. What would happen if we continued training for another two days? As I have it easily to hand, I want to use the FineWeb-Edu model for this. I want to start with the best checkpoint (which happens to be the last one), and train it on another 3.2B tokens from FineWeb-Edu. Let's see what we get.
Getting a dataset is going to be a bit messy, as our existing script to generate the safetensors datasets just grabs tokens from the original dataset until it gets 534,200 batches of 6 sequences, each of 1024 tokens (3,282,124,800 total). Might as well hack it (and note that this is something worth improving for any later experiments). I'll just loop round the code to do that twice, throwing away the first set of 3.2B tokens. I was pretty sure that the ordering of the datasets I'm getting is fixed, but perhaps not -- it spent time regenerating the train/val split at the start of the script, so there's no guarantee we have different data this time. That feels like a note-to-self about data pipeline hygiene -- if the train/val split is randomised by the infra I'm using, I should persist the raw data in case I need to use more data than I thought I would need to. Still, for this experiment, we can play relatively fast and loose. After all, GPT-2 small -- the original OpenAI weights -- was trained on multiple epochs, so it saw tokens multiple times. What we're trying to see here is what happens if you train for longer; a more scientific experiment can happen later (if at all...). Anyway, we have 3.2B tokens that should at least be reasonably different from the original 3.2B. Right, let's clean up some disk space so that we have enough for the new train (deleted some old optimiser checkpoints, keeping the metadata and the weights). Now, we create a new checkpoints directory, and we can copy the last/best checkpoint from the original FineWeb-Edu train there. Hack the stored iteration count in there to zero, set up the appropriate dataset symlinks, and then we can "restart" from that checkpoint. Due to the way the restart-from-checkpoint code works in the training script, that means that it will start with an offset of 1 into the dataset, so we're dropping one of about 530,000 iterations, but that's not exactly the end of the world. There are some interesting spikes on validation loss in there -- in particular that one at around iteration 300,000 where it goes up from 3.6 or so to 7.5 for two validation periods (which, remember, happen every ~30 minutes, or every 7020 iterations). My guess is that we got some kind of gradient spike prior to those, which led to a bad update to the parameters. However, it looks like the loss recovered really quickly after it, so while gradient clipping (that is, limiting the size of the gradients so that one-off spikes don't cause massive updates) might have prevented them, I don't think it would have improved matters much -- we might have "lost" an hour or so of training, but out of a 44-hour train (48 hours including breaks for validation), it's not the end of the world. But, looking at the raw numbers, after our second two days of training on a fresh sample from FineWeb-Edu 10B, we've managed to get the loss on our validation set down from 3.693 to... drumroll... 3.661. And that's on the "best" measurement, which was an hour before the end. The last validation number was 3.663. By spending twice the time, we've managed to get our loss down by 0.032, which is a touch less than 1%. Even measured in terms of perplexity (which, being an exponential, is more sensitive to this kind of change), we've gone from 40.2 to 38.9, which is hardly show-stopping. Let's see how this one measures up against the non-edu FineWeb validation dataset that we originally used to calibrate our first training run. Run it, and: ...we get 4.13 -- as opposed to 4.16 on the previous model, trained on half as much data.
Well, maybe it's a much better base model for instruction fine-tuning? Let's give that a go, again with the Alpaca training set from the book. 8 epochs turns out to be the right number: Certainly better than the 15.18 that we got on our Chinchilla-optimal FineWeb-Edu model, and a bit better than the 16.14 we got on the Chinchilla-optimal FineWeb one. So by training for double the time on twice the data, we've definitely got a better model. It's just not that much better. I think that's more -- significantly more -- than enough experimentation for one blog post, so let's do some analysis. I want to sanity-check the number of FLOPs spent on this train, just to make sure that I hadn't messed up. Feel free to skip this if you want to jump straight to the conclusion :-) In appendix F, the Chinchilla paper mentions a common approximation for how many FLOPs, C, you spend training a model with N parameters over D tokens: C ≈ 6ND. So based on that, each of those Chinchilla-optimal training runs cost us (using the exact numbers for N and D) about 3.19 × 10^18 FLOPs, and the extended train twice that. They also give a more carefully-worked out calculation; it doesn't look all that difficult -- it's just a case of plugging in the numbers from our architecture and pulling out a result 4 -- but the numbers they get from that are generally within 10% of the simpler calculations, so we may as well stick with the above. 5 Now, in terms of how many FLOPs we actually spent... well, manufacturers' datasheets for hardware are based on carefully-selected benchmarks and won't really be comparable to the code we were running (especially given that it's my crappy code based on top of a huge stack of PyTorch, CUDA kernels, CUDA itself, and so on), but we can do a Fermi estimate. From Wikipedia, the RTX 3090 has 35.58 TFLOPS performance on FP32. Way back earlier in this post, when I was measuring how many tokens per second I could get locally, the first experiment capped out at 12,599 tokens/second with FP32. The GPU monitor showed usage at 100%, so let's say (again, this is very approximate) that we were getting about 35.58 TFLOPS and that enabled 12,599 tokens/second. We wound up training at about 19,921 tokens/second after adding in mixed precision and using the tensor cores. So, hand-wavingly, we can say that we were getting about 35.58 × 19,921 / 12,599 ≈ 56.27 TFLOPS. Now, we trained for 44 hours (48 including validation), so the total number of training FLOPs should have been the number of seconds in that times the total FLOPS 6 of 56.27 × 10^12 -- which comes out at roughly 8.9 × 10^18. That's pleasingly close to the 3.19 × 10^18 above! I can easily imagine that the stack we're using could somewhat-more-than-halve performance from the theoretically optimal, or that we're running at 50% of the GPU's theoretical capacity, or some combination of the two. We're in the same order of magnitude, and for a Fermi approximation, that's what matters. Now, looking at figure 3 in the Chinchilla paper, their IsoFLOP curves (each one showing the loss they got on their training set for models of a range of sizes, using the same number of FLOPs for every point on a given curve), we can see that on the top one, which is for training runs of 6 × 10^18 FLOPs, the lowest point is pretty much bang-on the 168M point on the X axis. So that is at least reassuring that we did do a proper Chinchilla-optimal train here. (Their loss on that chart shows about 3, but they're using a different dataset, so I don't think it's comparable.)
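For reference, here's that back-of-envelope arithmetic as a few lines of Python, using the approximate numbers quoted above (the parameter count is the 163M figure implied by the Chinchilla-optimal token count of 3,260,190,720):

```python
# Chinchilla appendix F approximation: C ≈ 6ND
N = 163_009_536                 # model parameters
D = 3_282_124_800               # training tokens
print(f"FLOPs needed: {6 * N * D:.2e}")             # ~3.2e18

# Fermi estimate of what we actually had available
fp32_tps = 12_599               # tokens/sec measured at pure FP32
amp_tps = 19_921                # tokens/sec with TF32 + AMP
flops_per_sec = 35.58e12 * amp_tps / fp32_tps       # ~5.6e13
train_seconds = 44 * 3600
print(f"FLOPs available: {flops_per_sec * train_seconds:.2e}")  # ~8.9e18
```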
Apart from the obvious answer of "skill issue", let's see if there are any obvious reasons why the base model I've trained (and retrained) in this post is worse than the original OpenAI GPT-2 small. Let's review the results first: The first row is not super-interesting; it's the second and third that matter. OpenAI is clearly winning by quite some margin! Earlier on I assumed that the difference was that they trained on more data, but let's be a bit more systematic here. What specific differences do we have to the original train? Again, the amount of information in the paper is frustratingly limited, but: Right at the start, I estimated that the WebText dataset they trained on was about 10B tokens. We've trained on 3.2B tokens for two of our models, and 6.4B tokens for the extended train one. That could well have an effect. There's more information in their larger dataset, both in terms of raw facts like "Jane Austen wrote Pride and Prejudice", and in terms of information about the structure of language. On the other hand, their dataset, as they say, comprises the contents of web pages that were linked from Reddit posts with more than three upvotes. FineWeb (and even more so FineWeb-Edu) is a much more curated dataset, so you would expect it to have more facts, and better structure -- less of the slop and junk that Andrej Karpathy talked about in his interview with Dwarkesh Patel. So I'm not sure that this is it, but it's worth keeping in mind. Again, we don't know how many epochs they trained on, but the report I linked to right at the start of this post estimated that they trained for 60 epochs, while I calculated based on their numbers that it would be 41 epochs with WebText. It certainly makes sense that grinding along, epoch after epoch, will get your loss down, at least on the training set! And there's also a phenomenon with certain kinds of neural networks where if you keep training past the point where you're overfitting (that is, validation loss starts rising while training loss continues to fall), suddenly the model can have an "aha" moment and start generalising again. 8 It's not quite comparable -- it was not a second epoch, but rather continued training with more data -- but we were able to eke out an extra reduction of 0.032 in loss by training our FineWeb-Edu model for twice as long. If we'd trained it for 40 times as long, then we presumably would have managed to grind it down even further. I have no idea how much further we could get it, but I'd guess that it's going to be worse than linear (that is, each extra two days gets you less loss reduction than the previous) -- so we can bound the loss reduction at a maximum of 39 × 0.032 = 1.248. So... maybe? It would be a dull experiment to run, though, taking 78 days. If I want to do that, it would be better to find a way to do it quickly, so that I can get a better feedback loop going. The reason this post has taken so long has in part been because each training run has taken so long (as well as trips to London and other life stuff). The original GPT-2 model from OpenAI had bias on the W_q, W_k and W_v projections -- that is, they were normal biased NN linear layers rather than simple matrices, so they did a projection into their respective spaces followed by a translation. In the book, Raschka says that this is not normally done these days, which is why I didn't do it for this base model train. But perhaps it actually is valuable with this architecture or size?
Modern models presumably differ in multiple ways, and perhaps the bias would have been useful for this old design. Likewise, weight-tying -- the original GPT-2 re-used its embedding matrix to do the final projection from embedding space to vocab space, rather than having a separate one. That seems intuitively clever but not necessarily "right", given that it gives the model less flexibility in what it can output from the last layer. But perhaps with this size and architecture, it's the right thing to do? Contrariwise, having made those two changes to GPT-2 because I believed that modern models don't work that way, there was one "modern" change that I didn't make. In his post on the architectural changes since GPT-2, Raschka mentioned that dropout is normally not used nowadays. This looked to me like it was due to the move to single-epoch training. But single-epoch training was exactly what we were doing in this post! Perhaps I was holding myself back by keeping dropout in place. I don't have a good intuition as to what the right level is for this at the moment. My code blindly uses the optimiser setup from the book: I have at best a vague understanding of how those settings work, at least when using a full-blown optimiser (LR for simple gradient descent isn't too hard to understand, although it's hard to work out an intuition for what the right value might be in any given case). Additionally, in the Chinchilla paper, they talk about using a cosine function to vary the learning rate, which is something I'm completely unfamiliar with. I gained about a day in training time by using AMP and the TF32 tensor cores; however, I lost precision. I don't know for sure, but I suspect that the original weights were trained with pure full-fat FP32. Perhaps reducing precision lost something? I know that modern models are often trained with lower precisions, but perhaps that's balanced out by something else? This is the one that I think is least likely, but it's worth mentioning. The post that I linked to estimating the size of the training run for GPT-2 small mentioned that they used a batch size of 512, which (of course) is completely impossible on consumer hardware like mine. Indeed, I think you'd be lucky to get 512 onto a single 8-GPU node -- we're talking serious cluster training scale here. Larger batches lead to more stable gradient updates. So maybe that helped for OpenAI when they did their train? I suspect it did, but I'm pretty much certain that it's not a large part of the difference. (Counterpoint: Gemini thinks that this might actually be a big part of the problem! It recommends using gradient accumulation -- that is, not stepping the optimiser every iteration, but instead giving gradients time to build up -- as a way of getting a larger effective batch size.) While it doesn't look like we had any issues with gradient spikes on the original FineWeb and FineWeb-Edu trains, they definitely did kick in on the extended Edu train. The code to clip them is easy enough, and I think it's likely that the original GPT-2 trains would have had it. I doubt this was a major part of the difference, but it probably would have helped, at least a bit. Anyway, I think that's it in terms of differences that I can see between my train and OpenAI's (as always, comments welcome -- let me know if you spot any others!), so it's time to (finally) wrap this post up. At the start of this (ridiculously long) post, I asked the question: can we train a GPT-2 style base model at home on a single RTX 3090?
The answer is a resounding "yes we can", which is great! Training base models: not just for the GPU-rich. If you have a couple of days and a decent graphics card, you can train a Chinchilla-optimal GPT-2 pretty easily. But the model itself isn't quite as good as the original GPT-2 small one, and I have some ideas about why that might be. Testing any of those would take quite a long time, given that each training run takes two days. Now, my next planned step was to see whether I could work out how to move this up to the cloud and train the same model on an 8x A100 or similar machine on Lambda Labs. This still sounds like an excellent plan! With his nanochat project, Karpathy trains a larger model on more tokens in four hours; if we could get the experiment time down to one hour (plausible if training time is linear in both tokens and parameters) then it would be much easier to check out those hypotheses above. 9 So, I think that's still the right way to go: after training a base model at home for free (if you ignore the electricity costs -- and it's cold enough in Lisbon right now that the heat from the PC was probably saving me money on my home heating bill -- and the cost of having bought the RTX 3090 in the first place), the next step is to see how cheaply we can train it in the cloud. Stay tuned :-) It's useful here, but it does make me wonder how good FineWeb would be for training a base model with a longer context length, however.  ↩ There are ways to get comparable numbers even with a different tokeniser, using a bits-per-byte or nats-per-byte measure. Let's say we're using the normal cross entropy loss with the natural logarithm; that means that loss is expressed in nats. So you add up all of the per-token losses and divide it by the number of bytes across all of the inputs you've seen, and that would give you nats-per-byte. Likewise, if you used log2 for cross entropy, you'd get bits-per-byte. The latter is used in the Chinchilla paper (eg. table A5) as a way to compare their model with the Gopher model. I did consider digging into this a bit, but I think it's a bit of a side quest for now.  ↩ Those evals cost me $0.09 in API credits, which is actually a little more than I was expecting -- there were some responses which took quite a while to come back, though, and I believe that the GPT 5.1 model spends time thinking when it seems appropriate, so perhaps I spent a bit on thinking tokens.  ↩ Apart from a reference to a "dense layer", which I'm unsure about -- I believe it's the linear feed-forward layer after the attention calculations, though, as that doesn't appear elsewhere, and the calculation looks right. I also noticed that they don't have any terms in there for things like normalisation, which seems odd for such a carefully-worked-out formula; I assume they are small enough to vanish into the noise.  ↩ If you want a more careful calculation of the numbers -- and indeed a really nice explanation of some of the details of the Chinchilla paper -- I recommend this blog post from Tomek Korbak.  ↩ I hate that we appear to have settled on FLOPs with a lower-case "s" for "floating-point operations" when "FLOPS" (and equivalently MFLOPS, GFLOPS, TFLOPS) with an upper-case "S" already meant "floating-point operations per second" because the difference in capitalisation should really not change the units. But here we are. 
↩ I estimated the OpenAI weights loss on their own dataset by taking the perplexity number for the small model from figure 4, which is about 16.5, and then taking its natural log.  ↩ The authors of the paper call it "grokking", which is a great name, but is so overloaded in the context of LLMs (even if you disregard xAI's Grok) that I'm slightly loath to use it here. This phenomenon also looks somewhat more limited in scope than I thought -- I'd been under the impression that it happens a lot with LLMs, but it looks like it's more a thing that happens with small models trained on very structured datasets.  ↩ It would also be interesting to see how easy it is to offload the optimiser to the CPU: in my old fine-tuning experiments I found that doing so freed up a ton of VRAM, so we could benefit from that and maybe get the batch size up to something closer to the 512 that OpenAI apparently trained with.  ↩ vocab_size. This is determined by the tokenizer, and I want to use the GPT-2 one, so it will need to be 50,257. context_length. GPT-2 has a 1,024-token context length, so I'll stick with that. emb_dim, n_heads, n_layers --- these define which of the different GPT-2 model classes we're training, and I want to stick to the smallest one, so they will be 768, 12 and 12 respectively. drop_rate. One of the most surprising things to me in the "architectural improvements" post linked above was that dropout is no longer used so much. However, this appears to be tied in to the one-epoch training that has taken off since GPT-2, so I think it would be best to stick to the book's dropout setting here. qkv_bias. From what Raschka says in the book, this doesn't add on much value, even though the original GPT-2 used it, so let's set it to False. Crop all of the input sequences -- that is, each row in the dataset -- so that each one is no more than our 1,024-token sequence length. Then we can pad them out with end-of-sequence tokens (as is the standard) so that they're all 1,024 tokens. This will lose us quite a lot of tokens, but has the big benefit of being easy. Treat the corpus as, essentially, one long document, with end-of-sequence delimiters between each row, then split that up into 1,024-token sequences. Doing it this way would mean we'd use all of our training data. But it would be more complicated, especially if we hit memory constraints. We load enough GPT-2 tokens from FineWeb for a fixed number of batches of sequences, every one of those sequences being 1,024 tokens long (plus one extra token for the targets we're comparing them to). Note that we're not bothering to separate them with anything for this test. We then loop over a range of batch sizes. Then we create our model and put it on the CUDA device. We do this for each batch size rather than creating one and then using it for all of them so that they're all starting from the same point -- the fixed random seed should make sure that they're identical. For each batch size, we create input and output batches as tensors -- note that we're not putting these on CUDA yet; I wanted to do that in the training loop to mirror what a real training loop will have to do. When we're training with 3.2B tokens, having them all on CUDA will be a waste of VRAM, so we'll be pushing a batch there for each iteration. We do a stripped-down training loop -- for each batch, put the inputs and outputs onto CUDA, then a forward pass, work out the loss, backward pass, and optimiser step. We do the same number of iterations for each batch size. Finally, we print out the number of tokens we trained on for this batch size, how long it took, and the number of tokens per second. 
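A sketch of what that timing loop looks like. The tiny embedding-plus-linear model, the random "tokens", and the particular batch sizes here are all stand-ins -- the real script uses the book's GPT model and actual FineWeb tokens -- but the shape of the measurement is the same.

```python
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, ITERS = 50_257, 1024, 20
device = "cuda"

for batch_size in (2, 4, 6, 8):   # the actual range tried may differ
    torch.manual_seed(123)        # so every batch size starts from identical weights
    model = nn.Sequential(nn.Embedding(VOCAB, 256), nn.Linear(256, VOCAB)).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # Inputs stay on the CPU; each batch is pushed to the GPU per iteration,
    # just as a real training loop over 3.2B tokens would have to do.
    inputs = torch.randint(0, VOCAB, (ITERS, batch_size, SEQ_LEN))
    targets = torch.randint(0, VOCAB, (ITERS, batch_size, SEQ_LEN))

    torch.cuda.synchronize()
    start = time.time()
    for i in range(ITERS):
        x, y = inputs[i].to(device), targets[i].to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start

    tokens = ITERS * batch_size * SEQ_LEN
    print(f"batch {batch_size}: {tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.0f} tok/s")
```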
Chinchilla heuristic, 20x parameters -- 3.2B tokens: 247,850 seconds, which is just less than three days. Estimated GPT-2 train, 419B tokens: 32,452,947 seconds, which is just over a year. Create a model, optimiser and scaler. Train the model for a bit. Work out the loss. Save a checkpoint. Create a new model, optimiser, and scaler, and then restore the checkpoint into them. Work out the loss. Train for a bit more to check that the optimiser and scaler still work. On our own validation set from FineWeb, we have OpenAI > our FineWeb train > our FineWeb-Edu extended train > our FineWeb-Edu train. On the answers judged by GPT-5.1 after instruction fine-tuning, we have OpenAI > our FineWeb-Edu extended train > our FineWeb train > our FineWeb-Edu train.

Giles's blog 2 months ago

Why smart instruction-following makes prompt injection easier

Back when I first started looking into LLMs, I noticed that I could use what I've since called the transcript hack to get LLMs to work as chatbots without specific fine-tuning. It's occurred to me that this partly explains why protection against prompt injection is so hard in practice. The transcript hack involved presenting chat text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at a base LLM: ...you would instead prepare it with an introductory paragraph, like this: That means that "simple" next-token prediction has something meaningful to work with -- a context window that is something that a sufficiently smart LLM could potentially continue in a sensible fashion without needing to be trained. That worked really well with the OpenAI API, specifically with one particular model of theirs -- but didn't with their earlier models. It does appear to work with modern base models (I tried Qwen/Qwen3-0.6B-Base here). My conclusion was that that model had had some kind of instruction tuning (the OpenAI docs at the time said that it was good at "consistent instruction-following"), and that perhaps while the Qwen model might not have been specifically trained that way, it had been trained on so much data that it was able to generalise and learn to follow instructions anyway. The point in this case, though, is that this ability to generalise from either explicit or implicit instruction fine-tuning can actually be a problem as well as a benefit. Back in March 2023 I experimented with a simple prompt injection for ChatGPT 3.5 and 4. Firstly, I'd say: It would, of course, accept the challenge and tell me that it was thinking of a number. I would then send it, as one message, the following text: Both models told me that yes, I'd won -- the only way I can see to make sense of this is that they generalised from their expected chat formats and accepted the fake "transcript" that I sent in my message as part of the real transcript of our conversation. Somewhat to my amazement, this exact text still works with both the current ChatGPT-5 (as of 12 November 2025): ...and with Claude, as of the same date: This is a simple example of a prompt injection attack; it smuggles a fake transcript into the context via the user message. I think that the problem is actually the power and the helpfulness of the models we have. They're trained to be smart, so they find it easy to generalise from whatever chat template they've been trained with to the ad-hoc ones I used both in the transcript hack and in the guessing game. And they're designed to be helpful, so they're happy to go with the flow of the conversation they've seen. It doesn't matter if you use clever stuff -- special tokens to mean "start of user message" and "end of user message" are a popular approach these days -- because the model is clever enough to recognise differently-formatted stuff. Of course, this is a trivial example -- even back in the ChatGPT 3.5 days, when I tried to use the same trick to get it to give me terrible legal advice, the "safety" aspects of its training cut in and it shut me down pretty quickly. So that's reassuring. But it does go some way towards explaining why, however much work the labs put into preventing it, someone always seems to find some way to make the models say things that they should not.

Giles's blog 2 months ago

Writing an LLM from scratch, part 27 -- what's left, and what's next?

On 22 December 2024, I wrote: Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting. More than ten months and 26 blog posts later, I've reached the end of the main body of the book -- there's just the appendices to go. Even allowing for the hedging, my optimism was adorable. I don't want to put anyone else off the book by saying that, though! I expect most people will get through it much faster. I made a deliberate decision at the start to write up everything I learned as I worked through it, and that, I think, has helped me solidify things in my mind much better than I would have done if I'd only been reading it and doing the exercises. But on the other hand, writing things up does take a lot of time, much more than the actual learning does. It's worth it for me, but probably isn't for everyone. So, what next? I've finished the main body of the book, and built up a decent backlog as I did so. What do I need to do before I can treat my "LLM from scratch" journey as done? And what other ideas have come up while I worked through it that might be good bases for future, similar series? There are a few sources of ideas for this -- from the book itself and its supplementary material, from notes I've made as I went along, and from other things that I've kept on a mental checklist. There are five appendices: Raschka also gives a link at the end of chapter 7 to a notebook showing how to do further fine tuning using Direct Preference Optimization, which also looks fascinating, and he's working on a new project, " Build a reasoning model (from scratch) ". While working through the book, I've deliberately deferred various things. I'd kind of lost track of all of them, so I gave ChatGPT the source markdown for all of the posts in this series, and asked it to find where I'd done that. It did an amazing job! There were three categories: long context and attention efficiency, maths, and optimisers. The model we've built in the book has a context length of 1,024 tokens, and is O(n²) in both space and time with respect to the number of tokens you feed it. There are lots of things that people do to work around that. Things I need to learn: I really want to understand softmax at a better level than "it's a magic thing that turns logits into probabilities". I'd also like to learn more about higher-order tensor operations -- the ones that we use in the book are essentially treating the extra dimensions as the batch, but I believe that there's more to it than that. I really want to understand in reasonable depth what optimisers do. I know that they make gradient updates work better than they do with simple gradient descent. But how? That was the set of things I noted at the time I wrote the posts so far, but there are a few more that come to mind as I write this. In some comments that he made on posts in this series, Simon said that it seems like this book isn't really "from scratch", given that we rely on PyTorch's magic to handle the backward pass. He's 100% right! I think I understand why it is that way, though. There would be two different ways that I can see for the book to do it: I think I'd definitely like to revisit that at some point. 
Another one from Simon; while the book does explain how tokenisers work, even down to a high-level overview of byte-pair encoding, we don't write our own. Again, I can see why this is -- we load in the GPT-2 weights, so we need to use that model's tokeniser. And there's no point in writing our own if we're just going to throw it away. But perhaps a bit of time playing with one would be useful? The book, quite reasonably, shows you how to train your LLM, does a basic train on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. But given that I was getting a pretty good training speed on my own hardware, perhaps I could train a model really from scratch, perhaps using one of the smaller FineWeb datasets? Even if I can't do it locally, perhaps it might be doable on a rented cloud machine, like the Lambda Labs ones I used when fine-tuning Llama 3 ? After all, Andrej Karpathy is training a full model that you can chat with for $100 . I don't think I ever mentioned this on the blog, but one important plan for me is to try to build an LLM from scratch, only using my own blog posts and what I remember -- no looking at the book. If I can do that, then I can be reasonably sure that I really have learned it all. I'm also thinking that I'll do that using a different library -- that is, not PyTorch. That would stop me from regurgitating code that I've learned. If you're reading this within a day or so of the post's publication, I'm running a poll on X/Twitter about which framework to use . If you have an opinion, please do stop by and vote :-) It feels like almost every new model these days is an MoE. I have read a lot around the subject and would love to build on it. Essentially, instead of having just one feed-forward network after your attention heads, you have several. In front of them you have a router -- a trainable network of some kind -- that tells you which of these "expert" FFNs the token should be forwarded to. You then send it to the top (or top k ) experts, while leaving the others inactive. The result is that you have more space (in terms of parameters) for the LLM to know about things, but not all of those parameters are active during inference -- so your model is smarter but still fast. There's a bunch of interesting stuff there, from how you build it in the first place, to how you handle the fact that you're processing lots of tokens at once -- multiple tokens in each sequence and multiple sequences in a batch. It would be a pretty cool follow-on to the "my own LLM" series, thinking about it. I definitely don't think I need to do all of those things in order to wrap up this series. Here's the subset I'm planning on doing: For the other things, I think there are some potential future series to write. I'm certainly not promising that I'll write up all (or even any) of that second list, but they all seem really tempting to me right now. If you're particularly interested in seeing my take on any of them, please do leave a comment below. I think the next post in this series -- maybe the next several posts -- will be on trying to train the model code provided in the book from scratch to produce my own base model. Stay tuned! Here's a link to the next post in this series . 
A: An introduction to PyTorch B: References and further reading C: Exercise solutions D: Adding bells and whistles to the training loop E: Parameter-efficient fine-tuning with LoRA The KV cache . This is basic stuff and I feel I sorta-kinda understand it, but I haven't written about it so I can't be sure. It's a pretty obvious enhancement to avoid repeating work when generating autoregressively -- that is, the normal setup where in order to generate n tokens, we give the model its input, sample our first token from its predictions, then feed the whole thing -- the input and that first token -- back in for the second token, and so on. Obviously, because attention is causal, we're doing exactly the same work every time for all of the tokens in each round apart from the last one, so it makes sense to cache things. The result is that generating the first token is still O ( n 2 ) , but subsequent ones will be something more like O ( n ) each. That's why real-world modern models tend to take a while pondering before they generate the first token but then speed up -- they need to fill their cache. FlashAttention and related things: there are lots of ways people have found to reduce the cost of attention generally, but this seems to be the most popular one, or at least the best to get started with. Better positional embeddings : the context length of our GPT-2-style LLM is fixed in part because you need position embeddings for every possible input position. That means that we can never extend it. More modern LLMs use better ways to represent positions -- Rotary Position Embeddings (RoPE) look like they're very popular. Manually code a backward pass to go with the forward pass on each of our modules. Simon did this, and was kind enough to share his code with me -- it looks like one of those things (like attention) that is pretty hard to get your head around initially, but once it clicks it's super-clear. Definitely kudos to him for getting it all to work! The problem with this is that I don't think any ML practitioners do this nowadays, because automatic differentiation is there in every popular framework. So it might be a good learning experience, but also might nudge people into an unprofitable direction. Create our own automatic differentiation system. Andrej Karpathy pops up again when looking into this; he created micrograd , which handles back-propagation for scalar functions. That's really clever -- but it would be hard, and a bit of a side quest from the point of the book. Also, the most interesting stuff (at least from what little I know) for automatic differentiation is how you do it with non-scalars -- the matrices and higher-order tensors that our LLM uses. From what Simon says, this is where you need to use the mysterious Jacobian matrices I've heard about in the context of back-propagation. Training the full GPT-2 base model myself. I'm 100% going to try this. From the appendices -- anything that surprises me from the one on PyTorch, and perhaps from the "bells and whistles" in the training loop. The others I either won't do, or will pick up later. Building my own LLM from scratch in a different framework, without using the book. That is, I think, essential, and perhaps would be the crowning post of this series. It would be a nice way to end it, wouldn't it? Improving context length -- RoPE and other tricks -- sounds like an excellent series to start on when I'm done with this. 
AIs tell me that other interesting things to look into would be ALiBi, NTK/YaRN scaling, and positional interpolation. Improving performance: the KV cache, FlashAttention, and other performance enhancements likewise feel like they could make a good series. I also want to do a separate series on LoRA. In that, I'll draw on appendix E from this book, but also on other tutorials. Likewise DPO, along with other post-training that can be done to make models more useful as chatbots, like Reinforcement Learning. I'd really like to spend some time understanding that area. (And Raschka's upcoming reasoning model book might fit into that category too.) Optimisers: Adam, AdamW, maybe Muon (though the latter scares me a bit). The maths -- softmax and higher-order tensor calculations -- also seems to belong in another series, perhaps an extension of the various "maths for AI" posts I've done in the past. Automatic differentiation and the backward pass; that would make a great series. A mixture-of-experts model would be excellent fun, I think. Tokenisers would be a great stand-alone post, at least at the level that I can see myself covering it. Perhaps that would develop into a series if I found myself getting sucked in.

Giles's blog 2 months ago

Writing an LLM from scratch, part 26 -- evaluating the fine-tuned model

This post is on the second half of chapter 7 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In the last post I covered the part of the chapter that covers instruction fine-tuning; this time round, we evaluate our model -- particularly interestingly, we try using another, smarter, model to judge how good its responses are. Once again, Raschka's explanation in this section is very clear, and there's not that much that was conceptually new to me, so I don't have that many notes -- in fact, this post is probably the shortest one in my series so far! Unusually, when at the start of section 7.7 we generate some sample responses for the instructions in our test set, I got exactly the same results as in the book. For once, I guess, everything that uses randomness was happening in the same order as it did when Raschka ran it on his machine. The next step was to generate a file with all of the responses to all of the test instructions, which took 18.9 seconds on my RTX 3090 (compared to a minute on an A100, per the book -- that's quite surprising!) Once that was done, it was time to install Ollama so that I could use the Llama 3 model to evaluate my own. I've never used Ollama before -- when playing with other people's models, I've always used Hugging Face's Transformers library. It's a neat package, though. It wraps llama.cpp, which is a pure C/C++ inference framework (with CUDA support), and makes it easy to download and run models that have been packaged for it. Being written in C/C++, I would imagine that it's faster than PyTorch/Transformers -- though, being inference-only, it's less useful if you're planning to do things like training or fine-tuning the models. My desktop is running a fairly customised install of Arch Linux, and I didn't want to use the default install procedure (which puts it into your system-wide bin and lib directories). But it turns out that it's a very well-packaged app, and you don't need to do that. Using the manual install instructions for Linux, I just created a new directory, cd'ed there, and downloaded it. It was about 1.75 GiB. I then untarred it, and then I could run commands with their full paths -- one to start up the server, and another to start a session. Neat! It's always good to see pre-built binary packages that have no issues with their install location. The next step was to throw all of the generated test responses (and their associated targets) at Llama 3 and see what it thought about how close they were. Again, this all worked without trouble. I noted that the responses I was getting from Llama 3 were not the same as the ones in the book -- Raschka notes that Ollama is non-deterministic, so there's no surprise there (though it does make me wonder why it accepts a seed parameter in the API call). When I got on to the final eval, where you run the test results through Llama 3 and ask it to rate them compared to the target outputs, it took 11 seconds to run, and I got an average score of 48.95 / 100, which is close enough to the 50.32 that appears in the book. 1 I'd run an eval on my model, using a smarter model to judge its responses! Somewhat surprisingly, that number was stable over multiple runs. So perhaps there is some level of determinism in Ollama now that wasn't present when the book was written, and the seed parameter is of value. 
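For what it's worth, this is roughly what a call to the local Ollama server looks like, including that seed option -- a sketch using Ollama's REST API directly; the model name and the prompt here are illustrative rather than copied from the book's code.

```python
import json
import urllib.request

payload = {
    "model": "llama3",          # whichever model you've pulled into Ollama
    "prompt": "On a scale of 0 to 100, score the following response... (prompt elided)",
    "options": {"seed": 123},   # the seed option discussed above
    "stream": False,
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```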
Or perhaps Raschka's comment about it being non-deterministic was more of a "between machines" thing than about multiple runs on the same machine -- though then I'm not sure why he suggests re-running it for multiple results. Anyway -- that was it! Eval done. And, to my amazement, that was the end of the chapter -- and almost the end of the book. We've built an LLM from scratch, fine-tuned it, and evaluated it by using a smarter model to judge how well it was following instructions. ...or at least the end of the beginning. Having run the evaluation, I've reached the end of the main part of " Build a Large Language Model (from Scratch) ". But I don't think I've reached the end of this project; there's still more to do (not least working through the appendices). So, coming up next: a post summarising what I've got through so far in this series, and what the next steps are to wrap it up. Here's a link to the next post in this series. I also got 110 out of 110 scores -- that is, every response from Llama 3 was parseable as an integer. That actually kind of surprised me! Models like to be chatty and helpful. But looking into it, the famous X post by Riley Goodside where he had to "threaten" Bard to stop it from saying "Sure, no problem! Here's your JSON" was almost two years ago.  ↩

Giles's blog 2 months ago

Writing an LLM from scratch, part 25 -- instruction fine-tuning

This post is on the first part of chapter 7 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", which covers instruction fine-tuning. In my last post , I went through a technique which I'd found could sometimes make it possible to turn non-fine-tuned models into reasonable chatbots; perhaps unsurprisingly, the GPT-2 model isn't powerful enough to work that way. So, with that proven, it was time to do the work :-) This post covers the first half of the chapter, where we actually do the fine-tuning; I'll post later about the second part, where we start evaluating the model that we get. Just as with the last chapter , what we're doing here is essentially plugging together the various things we've built so far, and Raschka's explanation is very clear, so I don't have that much in the way of notes -- but here are the bits that made me pause.

For this part of the book, we use the "medium" variant of the GPT-2 open weights rather than the "small" ones that we've been using so far. I have to assume that this is because the small one really isn't very good at this kind of thing. [Update: it isn't! See the metrics from my mammoth post on training an LLM completely from scratch .]

This was quite interesting. In the past, all of the templates I've seen for instruction following have been designed for chatbots -- that's what we tend to use LLMs for, after all. There's a system prompt and then a format for "message from user", and another for "message from bot". In my series on fine-tuning , where I learned how to fine-tune an 8B-parameter Llama 3 base model to work as a chatbot, I used the format for Llama 2 , which is not dissimilar to the Phi3 one that's given as an example in the book. The Alpaca -style one is quite different; it is designed more for a one-shot interaction than for chat (there's a sketch of the format below).

Now, Alpaca dates from early 2023, and it looks like they used that prompt following a paper " Self-Instruct: Aligning Language Models with Self-Generated Instructions ". I had to think a bit about why one would use that, and I think the core reason is that this was early days (all of two years ago!) and LLMs had very short context lengths and weren't very smart. Chat uses a lot of tokens! You need the system prompt, and then every conversational turn so far. With our GPT-2 model we have just 1024 tokens to play with -- and Alpaca wasn't much better, as it was built as a fine-tune of Meta's original Llama model , which (according to the model card ) had a context length of 4096 tokens.

Chat is a good way to interact with a model, as the multiple conversational turns allow you to build up large amounts of context for the model to play with, meaning that (hopefully) it will be able to give good answers. But if that context doesn't fit into the context length, then it's not so good. Early chatbots, I believe, worked around this by replacing the "transcript" with a summary, but there's only so much you can fit into a 4k-token one. 1 Maybe modern ones do this too, but with GPT-5 having a 400,000-token context window it's not so important.

So, in Alpaca times, people were thinking in terms of one-shot interactions with LLMs, and the pattern they chose was targeted at that, so that you could get all of the interesting information and a reply into one sequence. An interesting bit of history! (Again, two years ago is history. Cripes.) This was explained well in the book, but it's an interesting enough point that I thought it was worth going over.
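Here's that sketch of the Alpaca-style template -- the standard wording as I remember it; the book's variant tweaks it slightly:

```python
def format_alpaca_prompt(instruction, input_text=""):
    # Fixed preamble, the instruction, an optional input section, and then the
    # response header that the model is expected to complete in one shot.
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
    )
    if input_text:
        prompt += f"\n\n### Input:\n{input_text}"
    return prompt + "\n\n### Response:\n"


print(format_alpaca_prompt("Name the author of 'Pride and Prejudice'."))
```

Everything -- instruction, optional input, and the expected response -- fits into a single sequence, which is exactly the one-shot pattern discussed above.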
Last time around we had a bunch of text messages as our inputs to the model. We found the longest one, and then padded them all out to the same length with end-of-sequence tokens, which meant that we could construct batches -- naturally, every input in a batch has to be the same size. This time around we're being a bit smarter. Although every item in a given batch needs to be the same length, batches themselves can be of different lengths -- that is, if our batch size was 8, and the longest sequence in our first batch was 73 tokens long, then we would make our first batch 8 × 73 -- but then, if the longest sequence in our second batch was only 60 tokens long, then the second batch could be 8 × 60 . We only need to pad out sequences to match the longest sequence in their batch, and that saves us time when running the model. That got me thinking about inference at scale -- the kind of thing that LLM providers like OpenAI or Anthropic do. They're going to be receiving very large numbers of sequences to complete, and of course they are going to be running them through in batches. But padding tokens are kind of a waste of inference GPU cycles. They'll have a bunch of different instances of their models running on different machines to handle all of these requests, and they almost certainly have some kind of code to try to route sequences of similar length to the same instances. To take a toy example, if you had a batch size of two and received six sequences, with lengths of 2, 9, 100, 11, 3 and 120, then you'd want to route them so that one instance received the (2, 3) pair, another the (9, 11), and another (100, 120) -- that minimises the amount of padding required and saves wasted cycles. Following on from that, it looks like we could actually improve the book's code by doing something similar here, grouping similarly-sized inputs together. That would be quite complicated, though, so probably not worth it in an educational context like this. Anyway, our collator needs to handle the variable-length batches, and through various drafts we converge on one that does it, with one tweak. This was a really important and interesting bit. Let's say that we're feeding in a 20-token input, in a batch where the longest of the other sequences is 30 tokens long. That means that we have ten padding tokens at the end. Let's represent that input sequence like this: The numbers are our token IDs, and I've used to represent the end-of-sequence token that we use for padding. Now, we need our target sequence to predict for that. The first version that we come up with in the book looks like this: So, we've just done the normal trick of shifting left by one character, and we've added an extra end-of-sequence token at the end to make the lengths match. But as the next step, we replace all of the padding tokens, apart from the one right at the end of the "real" part of the sequence, with an invalid token ID, . Using to represent that, we have: The core thing to remember here is that we honestly don't care what the model generates after it's done the real, unpadded sequence, plus an end-of-sequence token. It could generate random junk and it wouldn't matter, because it's already done the important part of predicting next tokens for all of the input sequence. The is a magic number that PyTorch's cross_entropy function uses in target sequences to say "ignore this position". I must admit that as a software engineer, it gives me a bit of an "ick" -- magic numbers are never nice -- but it does make sense. 
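To make that concrete, here's a rough sketch of a collate function along these lines -- my own stand-in rather than the book's version, using the 50256 end-of-text ID that the book's code uses for padding:

```python
import torch

PAD_ID = 50256     # the end-of-text token the book's code uses for padding
IGNORE_ID = -100   # the value PyTorch's cross_entropy ignores in targets

def collate(batch):
    """batch: a list of lists of token IDs, one per training example."""
    max_len = max(len(item) for item in batch)
    inputs, targets = [], []
    for item in batch:
        # Pad to the batch max, plus one extra EOS so the targets can be shifted left.
        padded = item + [PAD_ID] * (max_len + 1 - len(item))
        x = torch.tensor(padded[:-1])   # inputs: everything but the last token
        y = torch.tensor(padded[1:])    # targets: shifted left by one
        # Keep the first padding token in the targets (the model should learn to
        # emit end-of-sequence after the real text) and mask out the rest.
        pad_positions = (y == PAD_ID).nonzero().squeeze(-1)
        if pad_positions.numel() > 1:
            y[pad_positions[1:]] = IGNORE_ID
        inputs.append(x)
        targets.append(y)
    return torch.stack(inputs), torch.stack(targets)
```

PyTorch's cross_entropy skips any target position equal to -100 by default (it's the default ignore_index), so the masked positions contribute nothing to the loss.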
Negative numbers are invalid targets when you're comparing predictions across tokens -- which have indexes from zero up. In general, if you're predicting categories -- which essentially we are with tokens -- then the "minus one'th" token doesn't make sense. You could use a negative number other than -1, but -1 might cause confusion (being used heavily in ML code to get the last element of a sequence), and if you're going to use some other negative number, it might as well be -100.

"Purer" solutions would be hard, anyway. We're working with a PyTorch tensor here, so it has to be a number -- which rules out using something like None or some kind of special object. You could keep an "ignore after this index" number, but you'd need as many of them as you have items in the batch and it would be just another thing to keep track of. You could even keep a tensor of boolean "ignore these tokens" of the same size as your batch -- a mask -- but that would have the same problem of being something to pass around in your code. As I understand it, those last two solutions are actually used in some systems -- imagine that your outputs were not logits to create a probability distribution across categories or tokens, but were meaningful numbers in and of themselves. Pretty much any number you picked might be a valid output from the model. You wouldn't be using cross entropy loss in those cases anyway, of course, but you'd need to keep some record of where the padding starts so that you can ignore it.

One final thing worth noting is that we only add the -100s on to the targets. This makes sense, as all of the inputs will be fed into the LLM, so things that aren't valid tokens are going to make the embedding layer very unhappy. That also explains why we firstly add them on to the sequence as regular padding and then convert them to -100 for the targets: it allows us to add on the padding, then get the input sequence as all but the last token, then get the targets as tokens 1 up to the end. After that's done we run the code to replace all but the first end-of-sequence padding tokens with -100 on the targets.

As with the last chapter, I got different results to the ones in the book; something different about the order of execution in my version of the code when compared to Raschka's meant that despite all of the careful use of seeds, the numbers didn't quite match up. But, again as before, they were close and the trends in -- for example -- loss were the same, so the right things were happening.

When I finally ran the train on my RTX 3090, it took 48 seconds; I watched the GPU stats while it ran and saw that it was using 9 GiB of VRAM. Due to the usual differences in randomness, I got slightly different results to the book -- but similar enough to not be any cause for concern. Also, due to a typo, I accidentally ran it with five epochs -- that took two minutes. I noticed that validation loss started rising fairly steadily after epoch 2, with train loss dropping -- clearly overfitting. Presumably Raschka chose two epochs for exactly that reason :-)

A couple of things that I noticed while working through the code: when I first ran the download script, I got an error. That's because of a typo in one of the imports at the start of the script. The other thing that tripped me up was the original draft of the collate function. We add on a padding token, then pad out the sequence with more padding tokens, then remove the last one. I found that confusing -- why not just add on the required number in the first place rather than adding on an extra one and then deleting it?
It became clear later on; it's to make it mirror the next function, which adds on an extra end-of-sequence token for our targets, but having this anticipatory code in there with no explanation in the first draft made me start doubting my sanity for a little while... Minor points, though. So, that was it for the first half of chapter 7 in the book. The next bit looks like fun -- we're going to use a smart model to evaluate our relatively dumb one on how well it follows instructions. Definitely looking forward to that :-) Here's a link to the next post in this series .

I experimented with ChatGPT 3.5 at around the time Alpaca came out and came to the conclusion that it had a similar context length, of about 4k tokens. It looked like it worked around it by, when the transcript started reaching the context length, spinning off a separate instance to summarise it into a "story so far" kind of thing, which was then injected in to the start of the chat instead of the full context. My experiment was to say "my favourite colour is green, please remember that", then to send a quote of about 4,000 words from "Moby Dick", prefacing that with either "this is unimportant, please ignore" or "this is important, please remember". Next, I'd ask what my favourite colour was again. If I told it that the quote was unimportant, then it would remember, but if I told it that it was important, it would think my favourite colour was blue. Asking it for transcripts of the conversation so far would give a reasonable one, skipping the quote, if the quote was tagged as unimportant, but would give a completely hallucinated one if the quote was tagged important.  ↩

Giles's blog 2 months ago

Writing an LLM from scratch, part 24 -- the transcript hack

Chapter 7 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) " explains how we fine-tune our LLM to follow instructions -- essentially turning a model that can do next-token completion for text generation into something we can use for a chatbot. Back when I first started looking into LLMs , I used a setup that didn't require that, and got surprisingly good results, at least with later OpenAI models. The trick was to present the text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at the LLM: ...you would instead prepare it with an introductory paragraph, like this: Earlier OpenAI models couldn't do this when I accessed them through the API, but later ones could. How does our GPT-2 model stack up with this kind of thing -- and for comparison, how about a newer, more sophisticated base (as in, not instruction fine-tuned) model? I wrote a simple script that just allowed me to test the transcript above against different GPT-2 models, using the infrastructure that I'd already written for the code in the book. It uses the function -- that is, the basic one using greedy decoding (always pick the most likely next token) and just gets the next 23 tokens. Here's what I got with the different GPT-2 models: It looks like it has the concept of a transcript, at least -- but not very useful. Better in a way -- it's got some actual text in there. Hmm. Still not looking good, and that dodgy Unicode isn't great (unless the bot was trying to say ;-) OK, still not getting there. So it looks like the GPT-2 model can't do this. That actually makes a lot of sense -- at the time I wrote my post using this transcript hack, I found that the , and versions could not either, and nor could earlier versions of . These were all versions of GPT-3, so one generation later than the version we're working with here. The first version that "got it" was , which was GPT-3.5. Looking back at my blog post way back when, I spotted something about the model. At the time, the docs on the OpenAI site said that it can: do any language task with better quality, longer output, and consistent instruction-following than the curie, babbage, or ada models Did it perhaps have instruction-following built in -- that is, had it been instruction fine-tuned already? That doesn't quite make sense, though, because instruction fine-tuning is normally done using a particular format. Imagine that a model was fine-tuned on chats that looked like this Guanaco -format dataset: ...and then it was fed something more like this (which is from a Reddit post about the Llama 2 prompt format ): That is, at least intuitively, going to cause problems. What happens if we take a look at a more modern model, but make sure it's a base one? I've been playing with , so I wrote a script to use it. Here's what I got: That's actually not half-bad! The first response is OK apart from the "a room or a room". Its prediction of what the user would say as their second question isn't great, though. Let's see what happens if we prompt it with a more reasonable second round -- I modified the script to complete this sequence: The full result (including the input sequence) was: This is looking almost solid. And there's no mention of instruction training on the model card for . I think that my initial intuition was right -- a sufficiently advanced base model can operate as a chatbot without instruction fine-tuning. 
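Roughly speaking, the script boils down to something like the sketch below. The checkpoint name and the transcript wording here are illustrative stand-ins, but the shape of it -- a base model, greedy decoding, 23 new tokens -- matches what I described above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B-Base"  # illustrative -- any base (non-instruct) model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The "transcript hack": frame the chat as a document that a next-token
# predictor could plausibly continue.
transcript = (
    "The following is a transcript of a conversation between a helpful bot "
    "and a human user.\n\nHuman: What is the capital of France?\nBot:"
)
inputs = tokenizer(transcript, return_tensors="pt")
with torch.no_grad():
    # Greedy decoding: always pick the most likely next token, 23 tokens' worth.
    output_ids = model.generate(**inputs, max_new_tokens=23, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```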
However, I think I was actually wrong about it when I originally had the intuition! The OpenAI GPT-3.5 model that I believed was a base one, , had already had some kind of instruction-following tuning after it was initially trained. It was really impressive that it was able to generalise from that to the somewhat ad-hoc chat transcript format that I was using at the time, but it was not a base model. By contrast, a modern 600M-parameter model -- smaller than GPT-2 large (at 774M) and less than half the size of GPT-2 XL (at 1.5B) -- can actually work with a transcript without difficulty. My guess is that as well as the architectural improvements that have happened over the years since GPT-2, it's the size and quality of the dataset it was trained on. GPT-2 was trained on "8 million documents for a total of 40GB of text" according to the paper . It's not entirely clear how many tokens that is, but I've seen claims of 10B tokens (for example here ), and that seems in line with 40GB, as that would be 40 billion bytes, and 4 bytes/token seems reasonable. GPT-3, according to Wikipedia , was trained on around 500B tokens. The Qwen3 series, by contrast, according to the model card , was trained on 36 trillion tokens across 119 languages. That's 72 times as much data -- and the data was probably much more curated, too. That's a big difference! Perhaps it had seen lots of transcripts in there, and that was why it was able to mimic them? Or just lots of different kinds of text in general? I guess it's no surprise that training runs are getting ever-more expensive if that's the size of a frontier model run, though. So, that's all for the transcript trick -- base models actually can work as chatbots without instruction fine-tuning, if they're sufficiently advanced and trained on enough data. That's useful to know! Time to go back to the book; coming next, my notes on the actual fine-tuning I was meant to be doing rather than messing around with this :-) Here's a link to the next post in this series .

Giles's blog 2 months ago

A classifier using Qwen3

I wanted to build on what I'd learned in chapter 6 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". That chapter takes the LLM that we've built, and then turns it into a spam/ham classifier. I wanted to see how easy it would be to take another LLM -- say, one from Hugging Face -- and do the same "decapitation" trick on it: removing the output head and replacing it with a small linear layer that outputs class logits. Turns out it was really easy! I used Qwen3, and you can see the code here . The only real difference between our normal PyTorch LLMs and one based on Hugging Face is that the return value when you call your model is an output object with more to it than just the output from the model itself. But it has a field on it to get the raw output, and with that update, the code works largely unchanged. The only other change I needed to make was to change the padding token from the fixed 50256 that the code from the book uses to the one appropriate for the new model's tokeniser. ChatGPT wrote a nice, detailed README for it, so hopefully it's a useful standalone artifact.
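The core of the trick looks roughly like this -- a sketch rather than the repo's actual code, with the exact checkpoint name, the two-class head, and classifying on the last token all being assumptions on my part:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name -- any smallish Hugging Face causal LM works the same way.
model_name = "Qwen/Qwen3-0.6B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# "Decapitation": freeze the pretrained weights and swap the LM head for a
# tiny linear layer that emits two class logits (ham vs spam).
for param in model.parameters():
    param.requires_grad = False
model.lm_head = nn.Linear(model.config.hidden_size, 2)

text = "You have won a FREE cruise! Reply YES to claim."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Hugging Face models return an output object; .logits holds the raw output,
    # which with the new head is (batch, seq_len, 2) -- classify on the last token.
    logits = model(**inputs).logits[:, -1, :]
print("spam" if logits.argmax(dim=-1).item() == 1 else "ham")
```

In a real run you'd of course fine-tune the new head (and perhaps the last transformer block) on the labelled dataset before expecting sensible predictions.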

Giles's blog 2 months ago

Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch

I recently posted about Andrej Karpathy's classic 2015 essay, " The Unreasonable Effectiveness of Recurrent Neural Networks ". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about. This post is a bit more hands-on. To understand how these RNNs really work, it's best to write some actual code, so I've implemented a version of Karpathy's original code using PyTorch's built-in class -- here's the repo . I've tried to stay as close as possible to the original, but I believe it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising, given that he wrote it using Torch, the Lua-based predecessor to PyTorch.) In this post, I'll walk through how it works, as of commit . In follow-up posts, I'll dig in further, actually implementing my own RNNs rather than relying on PyTorch's. If you already have a basic understanding of what RNNs are and roughly how they work, you should be fine with this post. However, if you're coming directly from normal "vanilla" neural nets, or even Transformers-based LLMs (like the one I'm working through in my LLM from scratch series), then it's definitely worth reading through the last post , where I give a crash course in the important stuff. So with that said, let's get into the weirdest bit from a "normal" LLM perspective: the dataset. Every now and then on X/Twitter you'll see wry comments from practitioners along the lines of "AI is 5% writing cool models and 95% wrangling data". My limited experience bears this out, and for RNNs it's particularly weird, because the format of the data that you feed in is very different to what you might be used to for LLMs. With a transformers-based LLM, you have a fixed context length -- for the GPT-2 style ones I've posted about in the past, for example, you have a fixed set of position embeddings. More recent position encoding mechanisms exist that aren't quite so constraining, but even then, for a given training run you're going to be thinking in terms of a specific context length -- let's call it n -- that you want to train for. So: you split up your training data into independent chunks, each one n long. Then you designate some subset of those your validation set (and perhaps another bunch your test set), and train on them -- probably in a completely random order. You'll be training with batches of course; each batch would likely be a completely random set of chunks. To get to the core of how different RNNs are, it helps to start with an idealised model of how you might train one. Remember, an RNN receives an input, uses that to modify its internal hidden state , and then emits an output based on the updated hidden state. Then you feed in the next input, update the hidden state again, get the next output, and so on. Let's imagine that you wanted to train an RNN on the complete works of Shakespeare. A super-simple -- if impractical -- way to do that would be to feed it in, character by character. Each time you'd work out your cross-entropy loss . Once you'd run it all through, you'd use those accumulated per-character losses to work out an overall loss (probably just by averaging them). You would run a backward pass using that loss, and use that to adjust the parameters. If you're feeling all at sea with that backpropagation over multiple steps of a single neural network with hidden state, check out the " Training RNNs " section of the last post. 
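To make that idealised scheme concrete before we pull it apart, here's roughly what it would look like with PyTorch's built-in LSTM -- a sketch of the idea, not code from the repo, with a random stand-in "corpus":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One character at a time, hidden state carried the whole way through,
# a single backward pass at the very end.
vocab_size, hidden_size = 256, 128            # raw bytes in, for simplicity
rnn = nn.LSTM(vocab_size, hidden_size)
head = nn.Linear(hidden_size, vocab_size)

data = torch.randint(0, vocab_size, (1000,))  # stand-in for the Shakespeare corpus
hidden = None
losses = []
for i in range(len(data) - 1):
    x = F.one_hot(data[i], vocab_size).float().view(1, 1, -1)  # (seq=1, batch=1, vocab)
    out, hidden = rnn(x, hidden)              # hidden state carried from step to step
    logits = head(out.view(1, -1))
    losses.append(F.cross_entropy(logits, data[i + 1].view(1)))

loss = torch.stack(losses).mean()
loss.backward()   # backpropagate through *every* step of the whole sequence
```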
You can see that in this model, we don't have any kind of chunked data. The whole thing is just run through as a single sequence. But there are three problems:

Vanishing/exploding gradients. Let's say that we're training a three-layer network on the 5,617,124 characters of the Project Gutenberg "Complete Works of Shakespeare" . That's essentially backpropagation through a 16-million layer network. You won't get far through that before your gradients vanish to zero or explode to infinity. The only meaningful parameter updates will be for the last something-or-other layers.

Batching. Running multiple inputs through a model in parallel has two benefits: it's faster and more efficient, and it means that your gradient updates are informed by multiple inputs at the same time, which will make them more stable. 1

Validation. There's nothing set aside as a validation set, so we will have no way of checking whether our model is really learning, or just memorising the training set. (There's the same problem with the test set, but for this writeup I'll ignore that, as the solution is the same too.)

Let's address those -- firstly, those vanishing or exploding gradients. In the last post I touched on truncated backpropagation through time (TBPTT). The idea is that instead of backpropagating through every step we took while going through our batched input sequences, we run a number of them through, then backpropagate, and then continue. Importantly, we keep the hidden state going through the whole sequence -- but we detach it from the compute graph after each of these steps, which essentially means that we start accumulating gradients afresh, as if it was a new sequence, but because it started from a non-zero initial hidden state, we're still getting some training value from the stuff we've already been through. 2

Imagine we have this simple sequence: Let's say we're doing TBPTT of length 3: we can split up our training set so that it looks like this: So now, we just feed in "a", then "b", then "c", then do our TBPTT -- we calculate loss just over those items, update our gradients, and then detach the hidden state, but keep its raw, un-gradient-ed value. Then we start with that stored hidden state, and feed in "d", "e", "f". Rinse and repeat. In practice we'd probably throw away that short sequence at the end (because it would cause issues with gradient updates -- more here ), so we'd just get this:

Now, let's look into batching. It's a bit harder, but with a bit of thought it's clear enough. Let's say that you want b items in your batch. You can just split your data into b separate sequences, and then "stack them up", like this with b = 2 : So for training, we'd feed the first column in as a batch, calculate loss on both rows, then move on to the next column, and so on. The important thing is that each batch position -- each row, in that example -- is a consistent, continuous, meaningful sequence in and of itself. (There's a small standalone sketch of this scheme just below.)

Finally, for validation, you also need some real sequences. For that, you can just split up the batched subsequences, with a "vertical" slice. Let's take the rather extreme view that you want 50% of your data for validation (in reality it would be more like 10-20%, but using 50% here makes it clearer): Your training set would wind up being this: ...and the validation set this: And we're done!

So that's what we wind up feeding in. And it kind of looks a bit like what we might wind up feeding in to a regular LLM training loop! It's a set of fixed-length chunks. But there's one critically important difference -- they're not in an arbitrary order, and we can't randomise anything. The sequence of inputs in, for example, batch position one, needs to be a real sequence from our original data. This has been a lot of theoretical stuff for a post that is meant to be getting down and dirty with the code. But I think it's important to get it clear before moving on to the code because when you see it, it looks pretty much like normal dataset-wrangling -- so you need to know why it's really not.

Let's get into the code now. In the file , we define our dataset: The data that we pass in will be our complete training corpus -- eg. the complete works of Shakespeare -- and the other parameter is the limit we're going to apply to our truncated backpropagation through time -- that is, three in the example above. Karpathy's blog post mentions using 100, though he says that limiting it to 50 doesn't have any major impact.
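Before going further with the Dataset walkthrough, here's the standalone sketch promised above -- a minimal, self-contained version of the chunk-and-stack scheme, my own illustration rather than the repo's code:

```python
import torch

def make_tbptt_batches(token_ids, batch_size, seq_len):
    # Carve one long sequence into `batch_size` continuous rows, then slice those
    # rows into (batch_size x seq_len) chunks that follow straight on from each
    # other -- every row stays a genuine, in-order piece of the original text.
    n_per_row = len(token_ids) // batch_size
    rows = token_ids[: n_per_row * batch_size].view(batch_size, n_per_row)
    batches = []
    for start in range(0, n_per_row - seq_len, seq_len):
        x = rows[:, start : start + seq_len]
        y = rows[:, start + 1 : start + seq_len + 1]  # shifted-left targets
        batches.append((x, y))
    return batches

data = torch.arange(100)  # stand-in for an encoded corpus
for x, y in make_tbptt_batches(data, batch_size=2, seq_len=3):
    print(x.tolist(), y.tolist())  # each row carries straight on from the previous batch
```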
Next, we make sure that we have at least enough data to do one of those TBPTTs, plus one extra byte at the end (remember, we need our targets for the predictions -- the Ys are the Xs shifted left with an extra byte at the end). ...and we stash away the data, trimmed so that we have an exact number of these sequences, plus one extra byte for our shifted-left targets. Now we create a tokeniser. 3 This is related to something I mentioned in the last post. Karpathy's post talks about character-based RNNs, but the code works with bytes. The RNNs receive as their input a one-hot vector. Now, if we just used the bytes naively, that would mean we'd need 256 inputs (and accept 256 outputs) to handle that representation. That's quite a lot of inputs, and the network would have to learn quite a lot about them -- which would be wasteful, because real human-language text, at least in European languages, will rarely use most of them. His solution is to convert each byte into an ID; there are exactly as many possible IDs as there are different bytes in the training corpus, and they're assigned an ID based on their position in their natural sort order -- that is, if our corpus was just the bytes , and , then we'd have this mapping 4 : We just run the full dataset through to get the set of unique bytes, then sort it -- that gives us a Python list in the right order so that we can just do lookups into it to map from an ID to the actual byte. The class is defined in and is too simple to be worth digging into; it just defines quick and easy ways to get the vocab size (the number of IDs we have), and to encode sequences of bytes into PyTorch tensors of byte IDs and to decode them in the other direction. Because these byte IDs are so similar to the token IDs that we use in LLMs, I've adopted the name "tokens" for them just because it's familiar (I don't know if this is standard). So, at this point, we have our data and our tokenizer; we finish up by stashing away an encoded version of the data ready to go: Next we define a method to say how long our dataset is -- this is calculated in terms of how many TBPTT sequences it has: -- and a method: This works out the start and the end of the th subsequence of length in the data. It then returns four things: The code as it stands doesn't actually use the last two, the raw bytes -- but they did prove useful when debugging, and I've left them in just in case they're useful in the future. If you look back at the more theoretical examples above, what this Dataset is doing is essentially the first bit: the splitting into BPTT-length subsequences and dropping any short ones from the end -- the bit where we go from The only extra thing is that it also works out our target sequences, which will be a transformation like this: So that's our . Next we have a simple function to read in data; like the original code I just assume that input data is in some file called in a directory somewhere: Now we have the next step, the function : This looks a little more complicated than it actually is, because it's building up a list of tuples, each one of which is a set of , , and . If we imagine that it only did the , it would look like this: So, what it's doing is working out how many batches of size there are in the sequence. With our toy sequence ...and a batch size of two, there are . In this case, it would then loop from zero to 3 inclusive. Inside that loop it would create a list, then loop from zero to 1 inclusive. 
The first time round that loop it would get the item at , which is 0 + 0 * 4 = 0, so the subsequence . It would add that to the list. Then it would go round the inner loop again, and get the item at the new . is now 1, so that would be 0 + 1 * 4 = 4, so it would get the subsequence at index 4, which is , and add that to the list. We'd now have finished our first run through the inner loop, and we'd have the list [ , ], so we stack them up into a 2-D tensor: Hopefully it's now fairly clear that in our next pass around the outer loop, we'll pull out the items at index 1 and index 5 to get our next batch, and , and so on, so that at the end we have done the full calculation to get this: ...as a list of 2 × 3 PyTorch tensors. And equally hopefully, it's clear that the code in is just doing that, but not only for the but also for the , and . One thing to note before moving on is what happens if the number of items doesn't divide evenly into batches -- this code: ...means that we'll drop them. So, for example, if we wanted a batch size of three with our toy sequence ...then we'd get this: ...and the and would be dropped. And that's it for the dataset code! You might be wondering where the split to get the validation set comes -- that's actually later on, in the training code that actually uses this stuff. So let's move on to that! This is, logically enough, in the file train_rnn.py . There's quite a lot of code in there, but much of it is stuff I put in for quality-of-life (QoL) while using this. It's useful -- but I'll skip it for now and come back to it later. Initially, I want to focus on the core. We'll start with the function at the bottom. It starts like this: The -related stuff is QoL, so we'll come back to it later. All we need to know right now is that it's a way of getting information into the system about where its input data is, plus some other stuff -- in particular our TBPTT sequence length and our . So it uses that to read in some training data, then initialises one of our s with it and the , then uses to split it into batches. Next we have this: So our gives us a validation data percentage; we do some sanity checks and then just slice off an appropriate amount from the end of the we got to split the data into train and validation sets. That's the equivalent of the transform from the example earlier from To this training set: ...and this validation set: Now, we create our model: We're using a new class, which is an extension of the PyTorch built-in class -- we'll come back to that later. It's also getting parameters (things like the size of the hidden state and the number of layers) from the . Finally, we do the training in a function: So let's look at that now. It starts like this: That's fairly standard boilerplate to use CUDA if we have it, and to put the model onto whatever device we wind up using. Next: The class name for the optimiser is another one of those things from the , as are the learning rate and weight decay hyperparameters. So we just create an instance of it, and give it the model's parameters to work with along with those. Next, we get our patience: This is a QoL thing, but I think it's worth going into what it actually means. When we're training, we normally train for a fixed number of epochs. However, sometimes we might find that our model was overfitting -- say, at epoch 50 out of 100 we might see that the training loss was still decreasing, but our validation loss started rising. 
Any further training past that point might be pointless -- if we're doing things properly, we're saving checkpoints of the model periodically, so we'd be able to resurrect the model that we had at the point where validation loss was lowest, but we're still wasting time continuing training. A common solution to that is to have early stopping in the training loop. If the validation loss starts rising then we bail out early, and don't do the full number of epochs that we originally planned to do. Naively, we might keep track of the validation loss from the last epoch, and then if the current epoch has a higher loss, then we bail out. However, sometimes you find that validation loss rises a bit, but then starts going down again -- it's kind of like a meta version of finding a local minimum in the loss function itself. The solution to that is to use patience -- a measure of how many epochs of rising validation loss you're willing to put up with before you do your early exit. That's the number we're getting from our here -- it's a positive number (note the paranoid ), and if it's not defined we just assume that we have infinite patience. The next two lines are related to patience too -- before we go into our main training loop, we define the two variables we need to control early exit with patience: Pretty obviously, those are the best validation loss that we've seen so far, and the number of the epoch where we saw it. Right, finally we get to some training code! We have our epoch loop: We're using the rather nice module to get progress bars showing how far we are through the train (ignoring any early exits due to running out of patience, of course). We start the epoch by generating some random text from the model. This gives us a reasonably easy-to-understand indication of progress as we go. Next we put our model into training mode: ...set an initial empty hidden state: You might be wondering why the hidden state is getting a variable of its own, given that it's meant to be hidden -- it's right there in the name! Don't worry, we'll come to that. Next we initialise some variables we'll use to keep track of loss -- the total loss across all of the batches we've pushed through, plus the total number of tokens. The metric we track for each epoch is the loss per token, so we use those to work out an average. Now it's time to start the inner training loop over our batches: We're just unpacking those tuples that were created by into our and (I think I was being ultra-cautious about things here when I added to the start of ). And again we're using to have a sub-progress bar for this epoch. Next, we move our Xs and Ys to the device we have the model sitting on: And then run it through the model. The code to do this looks like this: ...and I think it's worth breaking down a bit. You can see that there's a branch at the top, if there's a hidden state then we need to pass it in and if there isn't, we don't. But let's focus on the no-hidden state option in the branch first, because there's something surprising there: Remember the description of an RNN from above: an RNN receives an input, uses that to modify its internal hidden state , and then emits an output based on the updated hidden state. Then you feed in the next input, update the hidden state again, get the next output, and so on. We can easily extend that to handle batches -- you'd give the RNN a batch of inputs (let's say a tensor b × 1 , and get a batch of results, also b × 1 . 
You'd also need the RNN to hold b hidden states, but that's not a big jump. But what we're doing in that code is something different -- we're feeding in a whole series of inputs -- that is, is of size b × n , where n is our desired TBPTT sequence length. What's worse, in our description above, the hidden state was just that -- something hidden in the model. Now it's being returned by the RNN! What's going on? Let's start off with that hidden state. We often need to do stuff with the hidden state from outside the RNN -- indeed, we're detaching it as an important part of our TBPTT. So the PyTorch RNN actually does work rather like the simplified model that I described in my last post , and treats the hidden state like an output, like in this pseudocode: That is, the hidden state is an input and a return value, like this: OK, so the hidden state thing makes sense. How about the fact that we're feeding in a whole set of inputs? This is actually just due to a quality of life thing provided by PyTorch's various RNN classes. Wanting to feed in a sequence is, of course, a super-common thing to want to do with an RNN. So instead of having to do something like the pseudocode above, it's baked in. When you run ...then because is b × n , it just runs the RNN n times, accumulating the outputs, then returns the outputs as another b × n tensor, along with the final from the last run through that loop. (There is a wrinkle there that we'll come to shortly.) With that explained, hopefully that branch is clear. We don't have a hidden state right now, so we run all of the inputs across all of our batch items through the RNN in one go, and we get the outputs plus the hidden state that the RNN had at the end of processing that batch of sequences. Now let's look at the other branch, where there is a pre-existing hidden state: Hopefully the last line is clear -- we're just doing the same as we did in the branch, but we're passing the hidden state in because in this case we actually have one. The first two lines are a bit more complex. As you know, we need to detach the hidden state from PyTorch's computation graph in order to truncate our backpropagation through time. We're doing that here at the start of the loop just to make sure that each batch that we're pushing through starts with a guaranteed-detached hidden state. So that explains those calls to the methods. The fact that our hidden state is a tuple of two things that we have to detach separately is a little deeper; for now, all we need to know is that the LSTM models that we're using are a variant of RNN that has two hidden states rather than one, and so we need to handle that. I'll go into that in more depth in a future post. Once we've done that, we've completed our forward pass for the epoch. Let's move on to the backward pass. Next, we have this: Pretty standard stuff. is defined further up in the file: It's exactly the same as the function we used to calculate loss in the LLM-from-scratch posts: I wrote more about that here if you're interested in the details. Next, we do something new: This is something that is generally very useful in RNNs. They are prone to vanishing and exploding gradients, and this code is to help handle the exploding case. What it says is, if we've defined a , we use it to clip gradients when they get too big, which means that training is going be better because we're not going to have updates swinging wildly up and down. Let's say that we set to 1.0. 
If, at the time this code is run, the norm of the gradients -- which is a measurement of their size 5 -- is, say, 10, then they would all be scaled down to 10% of their size, making the new norm 1.0. So that keeps them in check, and stops any wild variations in gradient updates. So, in short -- it's a stabilisation technique to stop exploding gradients leading to issues with training. Next, we have our normal code to update the parameters based on these (potentially clipped) gradients: And finally, we update our count of how many inputs we've seen and our total loss so far in this epoch: That's our training loop! Once we've done that code -- run our input through the model, calculated loss, worked out our gradients, clipped them if necessary, done our update and stored away our housekeeping data, we can move on to the next batch in our sequences. When we've gone through all of the batches that we have, our training for the epoch is complete. We print out our loss per-token: ...and then it's time for our validation loop. This is so similar to the training loop that I don't think it needs a detailed explanation: The only big difference (apart from the lack of a backward pass and parameter updates) is that we're not detaching the hidden state, which makes sense -- we're in a block with the model in mode, so there is no computation graph to detach them from. Validation done, it's time for a bit of housekeeping: All we're doing here is keeping track of whether this is the best epoch in terms of validation loss. The boolean is exactly what it says it is. If we're on our first run through the loop ( is None) then we record our current val loss as , and store this epoch's number into . Otherwise, we do have an existing , and if our current val loss is lower than that one, we also stash away our current loss and epoch as the best ones. Otherwise we are clearly not in the best epoch so we update to reflect that. Once we've done that, we save a checkpoint: I'll go into the persistence stuff -- saving and loading checkpoints -- later on. Next, a QoL thing -- we generate a chart showing how training and validation loss have been going so far: Again, I'll go into that later. Finally, we do our early stopping if we need to: If the current epoch is more than epochs past the one that had the best validation loss so far, then we stop. That's the end of the outside loop over epochs for our training! If we manage to get through all of that, we print out some sample text: ...and we're done! That's our training loop. Now let's move on to the model itself. I called my model class a , and you can see the code here . It's actually not a great name, as it implies there's something specifically Andrej Karpathy-like about it as a way of doing LSTMs, while what I was trying to express is that it wraps a regular PyTorch LSTM with some extra stuff to make it work more like his original Lua Torch implementation . I tried to come up with a more descriptive name, but they all started feeling like the kinds of class names you get in "Enterprise" Java code like so I gave up and named it after Karpathy. Hopefully he'll never find out, and won't mind if he does... 6 The Lua code does four things differently to PyTorch's built-in class: Let's look at the code now: You can see that it's doing 1 to 3 of those steps above -- the one-hot, the extra dropout, and the linear layer to project back to vocab space. 
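In case it's hard to picture without the listing, a wrapper along those lines looks roughly like this (a sketch of mine, not the repo's exact class):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharLSTM(nn.Module):
    """A sketch along the lines described above -- not the repo's actual class."""

    def __init__(self, vocab_size, hidden_size, num_layers, dropout=0.5):
        super().__init__()
        self.vocab_size = vocab_size
        self.lstm = nn.LSTM(
            vocab_size, hidden_size, num_layers,
            dropout=dropout, batch_first=True,  # inputs are (batch, seq, features)
        )
        self.dropout = nn.Dropout(dropout)      # extra dropout after the last LSTM layer
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids, hidden=None):
        x = F.one_hot(token_ids, self.vocab_size).float()  # IDs -> one-hot vectors
        out, hidden = self.lstm(x, hidden)                 # out: (batch, seq, hidden)
        logits = self.proj(self.dropout(out))              # back to (batch, seq, vocab)
        return logits, hidden                              # no softmax -- see below
```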
The only other oddity there is this kwarg: That's the wrinkle I was talking about when we went through the training loop and was discussing batches. The PyTorch LSTM by default expects the batch dimension to be the second one of the input tensors -- that is, instead of passing in a b × n tensor, it wants an n × b one. That's not what I'm used to (nor is it what the original Lua code uses, if I'm reading it correctly), but luckily it can be overridden by the logically-named option. The only step we don't do in this class is the softmaxing of the logits to convert them to probabilities. That's because PyTorch's built-in wants logits rather than probabilities, so it was easier to just call softmax on the outputs where necessary. So that's our model. Let's take a look at the code that we can use to run it and generate some text. The code for this is in . Ignoring the boilerplate that parses the command-line options, we can start here: So, we're taking the directory and run name that the QoL helpers that I'll be describing later, a specific checkpoint of a training run to use, the number of bytes that we want to generate, the temperature to use when sampling (more about temperature here ) and a "primer" text. That last one is because in order to get something out of our RNN, we need to feed something in. I tried using a single random byte from the vocab initially (that's still the default, as we'll see shortly), and that was OK, but the bytes aren't equally represented in the training data (eg. "z" is less common than "e", but weird bytes that only occur in occasional multibyte unicode characters are rarer still) -- and that means that we might be trying to get our RNN to start with something it hasn't seen very much, so we get bad results. Even worse, because some of the input text is unicode, there's no guarantee that a random byte is even valid on its own -- it might be something that only makes sense after some leader bytes. So I found that in general it's best to provide a fixed string to start with -- say, "ACT" for Shakespeare, or "He said" for "War and Peace". So, with those command-line flags, we start off by using the QoL stuff to get the metadata we need about the model: ...then we use our persistence code to load up the desired checkpoint: At this point we have the version of the model that was saved for that checkpoint, and its associated tokeniser. We move this to an appropriate device -- CUDA if we have it, CPU otherwise: ...and then use a helper function to generate some text: Once we have that, we print it out, after decoding it as UTF-8: If a primer was provided, we print it first, but if the primer was a random byte we don't. Also, because the generated bytes might include invalid Unicode, we just replace those with "?" when we decode (that kwarg). Let's look at the helper next. So, after a little bit of paranoia about our desired sequence length, we make sure we're not tracking gradients and put the model into eval mode (to disable dropout). Next, we work out our primer bytes -- either by picking a random one, or by decoding the string that we were provided into its constituent UTF-8 bytes: The primer needs to be converted to the byte token IDs that our tokeniser uses: The is something you might remember from the LLM posts -- we need to run a batch through our RNN, and the is just a tensor of n bytes. adds on an extra dimension so that it's 1 × n , as we want. 
Next, we put the primer onto the same device as the model: As an aside, I think I might start using code like that more often, I often find myself passing variables around and TBH it seems much more natural to just ask the model what device it's using. Next, we run it through the model: Now we use a helper function to sample from those logits to get our first generated byte: Note that we are explicitly taking the last item from . It is a b × n × v tensor, where b is our batch size (always one in this script), n is the length of the primer that we fed in, and v is our vocab size. The just extracts the last item along the n dimension so that we have the b × v logits that came out of the RNN for the last character of the primer, which is what we want. We'll get to the function later, but it returns a b × 1 tensor, so now, we just extract the byte ID from it and put it into a new list: Next comes our autoregressive loop -- we've already generated one byte, so we loop times to get the rest, each time running the model on the last byte we got, sampling from the distribution implied by the logits, and adding it onto our list: Once that's done, we have our generated byte IDs in , so we just use the tokeniser to turn them back into bytes and return the result: Easy, right? Now let's look at . The function takes logits and the temperature: Firstly, we handle the case where temperature is zero. By convention this means greedy sampling -- we just always return the highest-probability next token, so we can use for that: If the temperature is non-zero, we divide the logits by it and run softmax over the result: ...and then we just sample from the probability distribution that we get from that: And that's it! The only things to explain now are the quality of life stuff, and the persistence functions that handle saving and loading checkpoints. Let's look at our QoL things first. When I started building this code I knew I wanted to run RNNs on multiple input texts -- Shakespeare, "War and Peace", etc. I also realised that for each of those input texts, I'd want to try different model sizes. The underlying concept I came up with was to have "experiments", which would each have a particular training text. Each experiment would have multiple "runs", which would have particular training hyperparameters -- the model size, number of epochs, and so on. I decided to represent that with a directory structure, which you can see here . One subdirectory per experiment, and if you go into the one you'll see that it has two subdirectories, for the training data and for the different training runs I tried. The directory contains a file called , which is the training data itself. That one only exists in the experiment, though, because I was concerned with copyright for the other training sets. There is a file in all data directories for all experiments, though, which explains how to get the data. The directory has more in it. Each run is for a particular set of hyperparameters, so let's look at the ones for the run. We have two files, , which looks like this: It's essentially the model-specific hyperparameters, the ones we pass in when creating our -- for example, remember this from the training code: is this JSON dict loaded into Python. 
There's also , which has the training data: Hopefully these are all familiar from the training code; they all go into , so they're used in code like this: So, now when we look at the start of the and scripts, and see things like this: ...it should be clear that we're loading up those JSON dicts from those files. You can see that code at the start of . It looks like this: So, some basic sanity checking that we have the directories we expect. Next: ...we create a checkpoints directory if it doesn't exist, stashing away its path, then finally we load up those two JSON files: The rest of that file handles checkpointing, so let's move on to that. Remember, in the training loop, each epoch we saved a checkpoint: ..and at the start of the code to generate some text, we load one: Let's take a look at saving first. Each checkpoint is a directory with a filename based on the timestamp when it was saved, inside the directory for the run that it relates to, so firstly we work out the full path for that: (The directories inside experiments are explicitly ignored in our file so that we don't accidentally commit them.) Now, we don't want half-saved checkpoints due to crashes or anything like that, so we initially create a directory to write to using the path that we're going to use but with at the end: Next, we write a file (the path within the checkpoint's dir is worked out by a helper function) containing some useful information about the model's progress -- it's epoch number, the training and validation loss, and the mapping that its tokeniser uses (from which we can later construct a new tokeniser): Then we dump the model's current parameters into a file using function from the Hugging Face library (getting the file's path through another helper function): Now that our checkpoint is complete, we can rename our temporary directory to the real name for the checkpoint: Next, we do some symlinks. We want a symlink in the directory called , which links to the checkpoint that had the lowest validation loss. The training loop is tracking whether any given epoch had the lowest, and you can see it passed in an parameter, so if that's true, we create the symlink, removing any pre-existing one: For completeness, we also create one that points to the most recent checkpoint -- that will always be the one we're doing right now, so: And that's it for saving! Loading is even simpler (and note that we can just specify "best" as the checkpoint due to that symlink -- I pretty much always do): So, we've made sure that the checkpoint directory is indeed a directory. Next, we load up the model metadata: ...and we use ' to load our parameters: Now we can construct a tokeniser based on that mapping that we put into the metadata: ...and an based on the other metadata parameters: and load the parameters into the model: That's it! We can return the model and the tokeniser for use: So that's all the code needed for checkpointing. Now let's look at the final QoL trick, one that I left out of the earlier list because it needs the checkpoints to work: charting our progress. Remember this line from the training loop, which was called after we saved our checkpoint? It generates charts like this: The chart is updated every epoch, and saved into the root of the directory. There's also a helpful file placed there that reloads that generated chart every second, so you can just load it into a browser tab while you are training and watch it live. Let's look into the code. It's in . 
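As a rough preview before walking through it, a minimal version of that chart function with matplotlib might look like this -- my own sketch, with the data shapes (lists of (epoch, loss) tuples plus the best epoch's number) assumed from the description below:

```python
import matplotlib.pyplot as plt

def plot_losses(train_losses, val_losses, best_epoch, out_path="chart.png"):
    with plt.xkcd():  # my preferred styling, as mentioned below
        epochs_t, losses_t = zip(*train_losses)
        epochs_v, losses_v = zip(*val_losses)
        plt.figure()
        plt.plot(epochs_t, losses_t, label="train loss")
        plt.plot(epochs_v, losses_v, label="val loss")
        plt.axvline(best_epoch, color="red", label="best epoch")  # the patience marker
        plt.xlabel("epoch")
        plt.ylabel("loss per token")
        plt.legend()
        plt.savefig(out_path)
        plt.close()
```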
The function starts like this: So, we use a utility function (which we'll get into in a moment) to load up the data -- training and validation loss per epoch, and the specific epoch that was the best. Once we have that, we just use (with my preferred xkcd styling) to plot the two loss lines: We also plot a single vertical red line at the best epoch so that we can see if we're past that and running into the patience period: Then a bit more pyplot boilerplate... ...and we've got our chart, saved as . Finally, we just copy that useful auto-reloading into the same directory as the chart: ...and we're done. So, how do we get the data? Originally I was keeping lists of loss values over time, but eventually realised that the data was already there in the checkpoint metadata files. So, the helper function just iterates over the checkpoints, skipping the symlinks, creating lists of (epoch number, loss) tuples for both training and validation loss using the numbers in those metadata files, and for the symlink just storing its epoch number: Those loss lists will just be in whatever random order returned them in, so we sort them by epoch number: ...and we have something we can return to the charting code: That brings us to the end of the charting code -- and, indeed, to the end of all of the code in this repo! So let's wrap up. That was quite a long writeup, but I think it was worthwhile. Indeed, if you look at the commit history, you'll see that there were one or two things where while explaining the code I realised that it was doing things badly -- not so badly that it didn't work, or gave bad results, but doing things in a way that offended my sense of what's right as an engineer. Hopefully it was interesting, and has set things up well for the next step, where I'll use the same framework, but plug in my own RNN implementation so that we can see how it compares. Stay tuned :-) Intuitively: if you train on "I like bacon", then "I like cheese", then "I like wine", then you can imagine that they might have different effects -- maybe the first would have the largest impact, then the second, then the third -- or perhaps it might be the other way around. By comparison, if you trained on all three in parallel, you would expect them to be more evenly balanced in their effect.  ↩ I'm accumulating a never-ending list of things to dig into in the future, but let me add yet another one: it would be good to work through how PyTorch uses this compute graph in practice to do all of its automated differentiation magic! Andrej Karpathy will likely pop up again, as he did pretty much that in his micrograd project .  ↩ In case you're wondering: I tend to use UK spelling like "tokeniser" in writing, as it's much more natural to me. But in code I tend to standardise (or standardize) on the US spelling. For private projects like this, it doesn't matter much, but when collaborating with other people from various places in the world, it's helpful to use a standardised spelling just to make life easier when searching code.  ↩ Sharp-eyed readers might note that my token IDs start at zero, while Karpathy's start at 1. Zero-based indexing is the natural way to represent them in Python, one-based in Lua. Keeping things natural like that makes it a bit easier when we convert things into one-hot vectors later.  ↩ Remember that gradients are vectors in a high-dimensional space. So to work out a measurement of size, for each parameter we square all of the numbers in its gradient, then add them together. 
That brings us to the end of the charting code -- and, indeed, to the end of all of the code in this repo! So let's wrap up.

That was quite a long writeup, but I think it was worthwhile. Indeed, if you look at the commit history, you'll see that there were one or two things where, while explaining the code, I realised that it was doing things badly -- not so badly that it didn't work, or gave bad results, but doing things in a way that offended my sense of what's right as an engineer. Hopefully it was interesting, and has set things up well for the next step, where I'll use the same framework, but plug in my own RNN implementation so that we can see how it compares. Stay tuned :-)

Intuitively: if you train on "I like bacon", then "I like cheese", then "I like wine", then you can imagine that they might have different effects -- maybe the first would have the largest impact, then the second, then the third -- or perhaps it might be the other way around. By comparison, if you trained on all three in parallel, you would expect them to be more evenly balanced in their effect. ↩

I'm accumulating a never-ending list of things to dig into in the future, but let me add yet another one: it would be good to work through how PyTorch uses this compute graph in practice to do all of its automated differentiation magic! Andrej Karpathy will likely pop up again, as he did pretty much that in his micrograd project . ↩

In case you're wondering: I tend to use UK spelling like "tokeniser" in writing, as it's much more natural to me. But in code I tend to standardise (or standardize) on the US spelling. For private projects like this, it doesn't matter much, but when collaborating with other people from various places in the world, it's helpful to use a standardised spelling just to make life easier when searching code. ↩

Sharp-eyed readers might note that my token IDs start at zero, while Karpathy's start at 1. Zero-based indexing is the natural way to represent them in Python, one-based in Lua. Keeping things natural like that makes it a bit easier when we convert things into one-hot vectors later. ↩

Remember that gradients are vectors in a high-dimensional space. So to work out a measurement of size, for each parameter we square all of the numbers in its gradient, then add them together. We then add all of those squared numbers across all parameters together, and take the square root of the sum. ↩

Thanks to Claude for generating that monstrosity of a Java class name. It added: "For bonus points, imagine this is in a package like: And it probably has exactly one method: :-)" ↩

Vanishing/exploding gradients. Let's say that we're training a three-layer network on the 5,617,124 characters of the Project Gutenberg "Complete Works of Shakespeare" . That's essentially backpropagation through a 16-million layer network. You won't get far through that before your gradients vanish to zero or explode to infinity. The only meaningful parameter updates will be for the last something-or-other layers.

Batching. Running multiple inputs through a model in parallel has two benefits: it's faster and more efficient, and it means that your gradient updates are informed by multiple inputs at the same time, which will make them more stable. 1

Validation. There's nothing in there as a validation set, so we will have no way of checking whether our model is really learning, or just memorising the training set. (There's the same problem with the test set, but for this writeup I'll ignore that, as the solution is the same too.)

: the byte IDs of the bytes in that sequence -- these are the ones we'll run through the model, our Xs. Note that these are slices of the PyTorch tensors that were returned by the tokeniser, so they're tensors themselves.

: the shifted-left-by-one-plus-an-extra-byte target sequence as byte IDs -- the Ys for those Xs. These are likewise tensors.

: the raw bytes for the .

: the raw bytes for the .

It accepts the inputs as "token IDs", and maps them to a one-hot vector itself.

It applies dropout after the last layer of the LSTM (rather than just internally between the layers).

It expands the output vector back out to the vocab size with a linear layer after the LSTM so that we have logits across our vocab space. This is because an LSTM's output has the same dimensionality as the hidden state.

It runs those logits through softmax so that it returns probabilities.
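That last list is describing a wrapper around PyTorch's nn.LSTM; a sketch of the kind of thing it might look like is below. This is my own guess at the shape of it -- the class name, sizes and defaults are made up, not taken from the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=256, hidden_size=512, num_layers=3, dropout=0.2):
        super().__init__()
        self.vocab_size = vocab_size
        self.lstm = nn.LSTM(
            input_size=vocab_size,   # we feed in one-hot vectors
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,         # PyTorch only applies this *between* layers...
            batch_first=True,
        )
        self.dropout = nn.Dropout(dropout)  # ...so add one after the last layer too
        # Project the hidden-state-sized outputs back out to vocab-sized logits.
        self.to_logits = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: LongTensor of shape (batch, seq_len); map them to one-hot
        # vectors ourselves rather than using an embedding layer.
        one_hot = F.one_hot(token_ids, num_classes=self.vocab_size).float()
        output, state = self.lstm(one_hot, state)
        logits = self.to_logits(self.dropout(output))
        # Return probabilities rather than raw logits.
        return F.softmax(logits, dim=-1), state
```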

Giles's blog 2 months ago

Writing an LLM from scratch, part 23 -- fine-tuning for classification

In chapter 5 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", we finally trained our LLM (having learned essential aspects like cross entropy loss and perplexity along the way). This is amazing -- we've gone from essentially zero to a full pretrained model. But pretrained models aren't all that useful in and of themselves -- we normally do further training to specialise them on a particular task, like being a chatbot. Chapter 6 explains a -- to me -- slightly surprising thing that we can do with this kind of fine-tuning. We take our LLM and convert it into a classifier that assesses whether or not a given piece of text is spam. That's simple enough that I can cover everything in one post -- so here it is :-) The core idea in the chapter is the trick that we use to change an LLM that is designed to predict the next token into one that classifies texts into different categories. What we do is remove its final layer -- the one that maps from 768-dimensional embedding space to vocab space so that we get our logits to turn into a probability distribution -- and replace it with a much simpler one, which maps from the 768-dimensional embedding space to a 2-dimensional one, where logits -- after softmax -- represent the respective probabilities of spam vs ham (non-spam). As Raschka says, it's as if we're constraining our model to having an output vocabulary of two tokens. We're replacing our output head with a classification head instead. Because it involves removing the output head of the existing model, I'm calling it the decapitation technique. (ChatGPT tells me that it has a more prosaic name -- it's a linear probe .) This was something I recognised from Jeremy Howard's fast.ai course -- it looks like it might no longer be part of it (the trick in question has been absorbed into the library that the course uses), but I do remember doing an image classifier by removing the output head from an existing one and then training a new, simpler one to replace it. With an LLM, there's an extra tweak; we only consider the logits provided for the last token. That makes sense; with the original head, the last token's embedding, projected into vocab space, is the predicted next token for the sequence as a whole, and it is the only one with information from all of the other tokens blended into it by the attention layers. Being the richest representation of the sequence, it's obviously the best one to check when we're trying to classify that sequence. But that leads to another interesting thing about the training -- when we're calculating our loss, we only consider the cross-entropy loss between the logits vector for the last token and the target category. That makes sense simply because we don't have spam vs ham predictions for the shorter prefix sequences -- but also, we honestly don't care what its predictions for them might be. Which leads to the interesting possibility that they might wind up not being spam vs ham predictions at all -- the model has more "freedom" in how it uses them, so they could be anything. There is a lot of data-wrangling going on in this chapter -- all of which is relatively simple, but one thing that stood out for me was that we make sure that we have the same amount of spam and ham in our training, validation and test sets. Intuitively this makes sense. 
If we had a training set that was (say) 95% ham and only 5% spam, then a model that was essentially this: ...would be right 95% of the time, which is around the accuracy of the trained model that we wind up with. But it would also be pretty useless. Making sure that we have similar amounts of both is a good way to avoid the model getting trained in a dumb direction like that. I also spotted something a bit odd in the code that loads the different datasets in. Here it is: Look at those parameters -- drop_last is True for the training set, but False for the others. From the docs for PyTorch's DataLoader : drop_last (bool, optional ) -- set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. So: if the number of training samples isn't divisible by the batch size, and we had set drop_last to False for the training data, then our last batch would have fewer items in it. I think it makes intuitive sense that you'd want all of the batches going through during your training to be the same size. Let's imagine that we have ten items in most of our batches; loosely speaking, that means that on each gradient update, each one of those items will contribute 10% of the update. But if the last batch has only four items, then those four will each contribute 25%. So the items in the last, smaller batch will have an outsized impact on the model's training, which would obviously be a bad thing. For our validation and test sets, it's pretty clear that it doesn't matter -- all we want to do for them is work out how good the model is at classifying them, and they don't have a direct impact on the model's parameter updates. Before we start training, we see whether our model -- with its original projection-to-vocab head in place -- can classify text as spam or not. We feed it this: ...and it responds with this: So, no luck -- not a big surprise. Now, quite some time ago when I started playing with LLMs, I started a series trying to build an AI chatbot using the OpenAI APIs, which -- at the time -- were just text-completion based, rather than the chat-template ones that they are now. I found that even with some of the older models, you could get it to work like a chatbot by telling it what this text was meant to look like. So I tried that: Unfortunately it didn't help much -- here's what it came up with: Ah well, worth a try! I might give it another go once we've instruction-trained it. So, we know we need to train our LLM -- the next bit of the book explains how to do it. Just like last time , even though Raschka uses torch.manual_seed in lots of places to make the results reproducible, I got different numbers to his. Interestingly, in this chapter I got pretty much the same numbers for the first few steps, but after the accuracy functions were introduced, my results started differing. Again, I don't think this matters much so long as the numbers are similar. (With the caveat of some slightly disappointing results I got at the end.) I was interested in the trick of freezing the gradients on almost all of the LLM's layers -- only allowing the last one and the final layer-norm to be trained, plus the new output head. It kind of reminded me of LoRA, which reduces the amount of work you need to do to fine-tune a model by limiting the number of weights you change -- though the way that works is quite different (you introduce new, smaller weight matrices that "adapt" the results of the existing frozen ones, and then train those).
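To make the decapitation and freezing concrete, here's roughly the shape of it in code. This is a sketch from memory rather than the book's exact listing: the attribute names (out_head, trf_blocks, final_norm) follow my recollection of the book's GPTModel, and it assumes model has already been constructed with the 124M GPT-2 weights (and their 768-dimensional embeddings) loaded.

```python
import torch

num_classes = 2  # spam vs ham

# Freeze every parameter in the pretrained model...
for param in model.parameters():
    param.requires_grad = False

# ...then swap the vocab-sized output head for a tiny classification head.
# Freshly-created modules have requires_grad=True, so the new head is trainable.
model.out_head = torch.nn.Linear(in_features=768, out_features=num_classes)

# Also un-freeze the last transformer block and the final layer norm.
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True
for param in model.final_norm.parameters():
    param.requires_grad = True

# When training, only the last token's logits go into the loss.
def classification_loss(input_batch, target_labels):
    logits = model(input_batch)           # shape: (batch, seq_len, num_classes)
    last_token_logits = logits[:, -1, :]  # shape: (batch, num_classes)
    return torch.nn.functional.cross_entropy(last_token_logits, target_labels)
```

Everything else stays frozen, so the optimiser only ever updates the new head, the last transformer block, and the final layer norm.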
Raschka suggests an exercise where you try training all of the layers -- that is, with none of them frozen. I did that, and the results were really interesting -- more on that later... One thing that did surprise me a bit was that in this chapter we're training with dropout of zero. It seems strange, but I think it's because we're freezing those layers. The number of parameters that are actually being trained is small, so dropping them out might just throw away signal from the training, and make it converge more slowly. What's worse is that there's nothing stopping dropout from happening in those frozen layers, too -- we've set requires_grad to False, so they're not being updated by the training loop, but the model is still in training mode, so dropout would happen in them if it wasn't set to zero. Anyway, the code to actually run the training is pretty simple and I won't repeat what is already explained perfectly well in the book. My train (with whatever seed differences it had from Raschka's) came out with broadly similar results: It took 15 seconds on my RTX 3090 -- within the "less than half a minute" range he mentions for the V100 and A100 datacenter cards. Plotting the loss: ...you can see that it looks very similar to what happens in the book. The accuracy was kind of interesting, though: You can see that mine actually dropped off on the validation set near the start -- however, it did recover nicely, so no big deal. It looks like the best results were somewhere around epoch 4, but there's not a huge drop off from there to where we stop at the end of epoch 5. 1 The final results were a little disappointing, though. In the last pages of the chapter, we run two sentences through our trained classifier. Firstly, something that is very spammy: And secondly something that isn't spammy: Unfortunately, due to the differences in seeding between my model and the one Raschka was working with, I got the same result for both: not spam. After checking the code carefully, I decided to take a look at what the actual predictions looked like; in the function that we write to do these tests, I added this: For the ham case, this printed: The first element (index 0) is the probability that it is predicting for ham, and the second is the probability for spam. So it was 97% sure that the ham message was indeed ham. That's good! For the spam message, though, it was pretty much on the fence: It thought that there was a 59.87% chance that it was ham, but a 40.13% chance that it was spam. It would be interesting to know what Raschka's own train came up with, but I can imagine that it might be a similarly close-run thing, and I was just unlucky with the way the randomness in training fell out. But I decided to try training all of the layers to see if I got any different results. This took 42 seconds rather than 15 -- hardly a big deal at this scale, but you can see how not having to train everything and getting a 3x speedup would be worthwhile for larger models. The results were definitely better, though: Maybe a little bit of overfitting going on? But the important thing is that the validation and test samples' accuracy are higher too, so we're not just memorising the training set. The loss graph also looks solid: And accuracy is really interesting: You can see that validation accuracy got up to 100% sometime during the first epoch, lining up with the point where the loss plateaus! Perhaps a snapshot taken then would have been the best form of the model.
But anyway, I tried running the two samples from the last pages of the chapter through this new version of the model -- the results were exactly right! "You are a winner..." was classified as spam, and "Hey, just wanted to check..." as ham. Looking at the predictions that my modification to the classification code printed out (there's a rough sketch of that tweak at the end of this post), things were even clearer. The spam had this: A 97% chance that it was spam, which is excellent. Looking at the ham message, it said this: That's a 99.9% chance that it's ham! So that was a nice note to end on. Not only did I have results that showed that the classification worked, but the fact that it worked with just the 30 seconds of extra training that enabling gradients on all of the layers gave meant that it was unlikely that my original slightly-disappointing results were due to an error in the code -- it was, as I'd suspected, just a bit of bad luck with some randomness. So, that's it for chapter 6! Classification with a decapitated LLM done and dusted. Next time, it's on to instruction fine-tuning -- something that I spent quite a lot of time on last year , so let's see if all that work will pay off. Here's a link to the next post in this series . Update : I decided to see how easy it would be to do the same decapitation trick on a different LLM, one that I'd downloaded from Hugging Face. Turns out the answer is " pretty easy ". I've been running a number of separate experiments -- will be writing them up shortly -- in which I store a checkpoint after each validation check, and keep a symlink updated to point to the one with the best validation loss. This feels like a good practice to me.  ↩
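For reference, the probability-printing tweak mentioned above boils down to something like this -- a sketch of the core idea rather than the book's actual classification function (the real one also pads the input to a fixed length and respects the model's context size):

```python
import torch

def classify_with_probabilities(text, model, tokenizer, device, max_length):
    token_ids = tokenizer.encode(text)[:max_length]
    inputs = torch.tensor(token_ids, device=device).unsqueeze(0)  # add a batch dimension

    model.eval()
    with torch.no_grad():
        last_token_logits = model(inputs)[:, -1, :]  # shape: (1, 2)
    probabilities = torch.softmax(last_token_logits, dim=-1)
    print(probabilities)  # index 0 = ham, index 1 = spam

    return "spam" if torch.argmax(probabilities, dim=-1).item() == 1 else "not spam"
```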

Giles's blog 3 months ago

Writing an LLM from scratch, part 22 -- finally training our LLM!

This post wraps up my notes on chapter 5 of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Understanding cross entropy loss and perplexity were the hard bits for me in this chapter -- the remaining 28 pages were more a case of plugging bits together and running the code, to see what happens. The shortness of this post almost feels like a damp squib. After writing so much in the last 22 posts, there's really not all that much to say -- but that hides the fact that this part of the book is probably the most exciting to work through. All these pieces developed with such care, and with so much to learn, over the preceding 140 pages, with not all that much to show -- and suddenly, we have a codebase that we can let rip on a training set -- and our model starts talking to us! I trained my model on the sample dataset that we use in the book, the 20,000 characters of "The Verdict" by Edith Wharton, and then ran it to predict next tokens after "Every effort moves you". I got: Not bad for a model trained on such a small amount of data (in just over ten seconds). The next step was to download the weights for the original 124M-parameter version of GPT-2 from OpenAI, following the instructions in the book, and then to load them into my model. With those weights, against the same prompt, I got this: That's amazingly cool. Coherent enough that you could believe it's part of the instructions for a game. Now, I won't go through the remainder of the chapter in detail -- as I said, it's essentially just plugging together the various bits that we've gone through so far, even though the results are brilliant. In this post I'm just going to make a few brief notes on the things that I found interesting. One thing I really do recommend to anyone working through the book is that you type in all of the code, and run it yourself -- it really will help you remember how stuff fits together. There is one slight issue I found with that, however: the book has a number of examples where you get output from code that uses randomness -- for example, where you take a look at the loss it has on some sample text before you start training, or make it generate samples during the train. Now, in theory, because Raschka puts torch.manual_seed calls before all of these, the results you get should be exactly the same as the outputs in the book. However, the amount of code we're working with at this stage is quite large -- we have various helper functions that were created in earlier sections, for example. And some of these use randomness. That means that to get the same results as the ones in the book, you would need to ensure that all of the code that uses randomness was running in exactly the same order as it was when Raschka did it for the book. That turns out to be surprisingly hard! My instinct is that it doesn't actually matter all that much. So long as the loss numbers that you see are in the same ballpark as the ones in the book, and the outputs you see are roughly equally incoherent (before training) and become more coherent at what feels like the same kind of rate, you're fine. Probably the most important one to look out for is when the training run starts -- you should see loss on the training set decreasing steadily, just like in the book, and likewise as in the book, the validation loss should plateau out pretty early.
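Here's a tiny standalone illustration -- not from the book -- of why the ordering of those random calls matters even when the seed is fixed:

```python
import torch

torch.manual_seed(123)
a = torch.rand(2)
b = torch.rand(2)

torch.manual_seed(123)
b2 = torch.rand(2)  # same seed, but the two calls happen in the opposite order...
a2 = torch.rand(2)

print(torch.equal(a, a2), torch.equal(b, b2))  # False False -- the draws swapped places
```

Every helper that consumes random numbers shifts the stream along for everything that runs after it, which is why matching the book's outputs exactly would mean matching its call order exactly.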
When I have built simple backpropagation through neural networks in the past, I've generally updated parameters by multiplying the gradients by a small number, the learning rate , and then subtracting them from their respective parameters to get updated ones -- classic stochastic gradient descent . Non-trivial ML uses optimisers; I'd come across them while fine-tuning LLMs , and also used one in the RNN code I wrote last week . Instead of updating the parameters yourself, you ask the optimiser to do it for you, by calling its step function. AdamW appears to be the default optimiser in most textbooks, though Muon seems to be the most popular in use, if my AI X/Twitter feed is to be believed. I don't understand how optimisers work in any detail, and I'm going to have to dig into that in the future. However, my high-level simplified picture right now is that they dynamically adjust the learning rate over time, so that it's easier to take big "jumps" downwards on the gradients when you start, and then smaller ones later. I believe they can also sometimes avoid local minima in the loss landscape -- a nice metaphor I read somewhere (lost the source, sadly) was that simple gradient descent was like rolling a ball down a hill, but (some?) optimisers give the ball a bit of momentum so that it can coast over a small uphill portion, so long as the general slope is downwards. Anyway, more investigation needed later. In practice, with AdamW, you initialise it at the start of your training loop, with a learning rate (which I imagine is similar to the one my older code used, a scaling factor for gradients) and a weight decay (:shrug:). You also provide it with the parameters it's going to be managing. In the training loop, at the start of each input batch, you tell it to zero out the gradients it's managing with zero_grad, run the data through your model and calculate your loss, and then after calling backward to get your gradients, you just call step, and that does the parameter update. Again, I want to dig into how optimisers work in more detail in the future. But for now, I think that's all I need to know. The book tells you how to train on a public domain book, "The Verdict" by Edith Wharton. Full training on the hardware that people are likely to have to hand would be extremely expensive, so we just train on that short example, then later on learn how to download and use the weights that OpenAI made available for their GPT-2 models. But there was something that surprised me a little. When talking about the training run on "The Verdict", Raschka says that it takes "about 5 minutes to complete on a MacBook Air". On my machine using CUDA on an RTX 3090, it took just less than eleven seconds. This makes perfect sense, of course -- there's a really good reason why AI training is normally done on GPUs or custom hardware, and the MacBook Air would presumably be training on the CPU. But I was a little surprised at how huge the difference was in this simple example! Now, while the book mentions that Llama 2 probably cost hundreds of thousands of dollars to train, I must admit that I do wonder how much it really would cost to train a 124M parameter model on my own hardware -- or, indeed, on the machines with 8x 80GiB A100 GPUs that I rented from Lambda Labs during my fine-tuning experiments. Andrej Karpathy was able to train a 124M GPT-2 model for $20 , using his hand-written C/CUDA LLM system . That is undoubtedly more efficient than the PyTorch code that we're working on in this book.
But it really would be interesting to find out whether it would be doable for me at all! The training data he used is the 10B-token version of the FineWeb collection, which is freely available. 1 I think I have a good candidate for a next project when I've finished the book; see how many tokens/second I can train on locally -- that will allow me to estimate how long it would take to train one epoch over the whole training set. I imagine that will be longer than I'm willing to leave my desktop machine tied up doing this, but then I can try mixing in the lessons I learned doing fine-tuning, and see if I can get it up and running on Lambda Labs. If the cost is in the tens of dollars, or even a hundred or so, I really think it would be worthwhile! One thing I found a little confusing in this chapter -- and this is very much a nit -- was the section on preventing "memorisation"; I think this was due to a mismatch in the meaning I attach to the word, and the way it's used here. To me, memorisation is something that the model does during training -- if you keep training a 124M-parameter model on a 20,000-character file, as we're doing here, then whatever happens the model is going to memorise it -- it's unavoidable. The only way to reduce memorisation in this sense would be to increase the amount of training data (and even then, as the findings in the lawsuit by the New York Times against OpenAI show, some stuff would be memorised). In the book, "memorisation" is being used to mean something more like what I'd call "parroting" -- issues with the model just repeating the stuff that it has memorised, because it was always choosing the most-probable next word. Avoiding this is super-important, of course! It's just the framing that confused me a little. The techniques are nifty, anyway. The first cut -- just use the softmaxed logits as a probability distribution and sample from it -- is obvious enough. Temperature is a clever trick on top of that -- just divide the logits by some number greater than one before softmax, and you can make the distribution that comes out flatter (or you can make it more "pointy" by dividing by a number less than 1). The graphs in the book showing how that works are great, but I asked Claude to knock together a temperature playground website, which I found made things even clearer to me. And finally, the top-k technique -- only consider the k most probable tokens, and then do the temperature/softmax calculations -- was a sensible addition to add on top of that. The code is clever: identify the top k logits, get the value of the lowest one of them, and then replace every logit less than that with minus infinity. When you run that through softmax, you get zeros for the ones that were replaced, and the probability distribution is based on the remainder. (There's a rough sketch of those few lines at the end of this post.) So: excellent stuff, and very well explained in the book -- it just didn't feel like preventing "memorisation" specifically was what it was doing, at least based on what I take the word to mean. At the end of the chapter, we download the weights for the original GPT-2 model that OpenAI produced from their site, and load them into our own model. The code to download weights is (thankfully) something that you don't need to type in, as it's downloadable from GitHub. And in one specific related case, I'll also contradict what I said earlier about typing stuff in yourself -- I definitely recommend that you copy the function that copies the downloaded weights into our own model from GitHub too.
I did actually type it all in and I don't think I gained anything from doing that. One thing I did notice while going through that section was that I'd been making a mistake as I wrote up this series; I'd thought that all GPT-2 models had 768 embedding dimensions. It turns out that this is only true of the 124M model in that series, and the larger ones have more. That makes a lot of sense -- and I've updated the older posts to reflect it. That's all I really have to add to what is in the rest of chapter 5. Like I said at the start, it feels almost like a let-down to be writing so little about a section of the book that has such amazing results! But now we have a working LLM, and at least the foundations that might allow us to train our own from scratch if we had the resources. Next up: using it to classify text. Will this be quick and easy? Or will it lead down another fascinating rabbit hole? Time will tell... Here's a link to the next post in this series . His new nanochat -- a from-scratch trainable chatbot -- is even cooler.  ↩
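Since I mentioned the temperature and top-k tricks above, here's a rough sketch of just those few lines -- not the book's actual generate function, and the variable names are mine:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    # logits: a 1-D tensor of size vocab_size for the last position.
    if top_k is not None:
        top_logits, _ = torch.topk(logits, top_k)
        min_kept = top_logits[-1]  # the k-th largest logit
        # Everything below that becomes -inf, so it gets probability zero
        # after the softmax.
        logits = torch.where(logits < min_kept,
                             torch.tensor(float("-inf"), device=logits.device),
                             logits)
    # Temperature > 1 flattens the distribution; < 1 sharpens it.
    probabilities = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probabilities, num_samples=1)
```

With top_k=None and temperature=1.0 this degenerates to plain sampling from the softmaxed logits, which is the "first cut" described above.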
