Giles's blog 2 weeks ago

10Gb/s Ethernet: what I actually did to get it working in my home

Having learned enough about 10Gb/s Ethernet to be comfortable about setting it up in my house, it was time to bite the bullet: order it from the ISP, buy some kit, and get started.
I already had 2.5Gb/s working. The apartment has structured cabling -- each room has one or more RJ45 sockets in the wall, and there's a patch panel downstairs by our front door that has a matching patch socket for each wall socket. So when we moved in, I simply set things up so that there was a 2.5Gb/s switch down by the patch panel, and wired everything together there. Most of our stuff works over WiFi, of course, but I needed a wired backbone to connect the excessive number of computers in my study both to each other, and to the outside world.
What did I need to do?
Simplifying a bit, I had this 2.5Gb/s setup:
There are a few other things dotted around, of course -- extra APs and what-have-you -- but that's the core, and I'll focus on that to keep things simple. Would I be able to get it all upgraded to work with 10Gb/s?
The most important question was the structured cabling in the walls: was it CAT-5E or CAT-6, or even CAT-6A? Remember from the last post, 10GBASE-T might work over short runs of -5E (even though officially it's not meant to be able to). It probably would run over -6, because that's generally OK up to 55 metres or so, and I don't think any of the runs in the house are longer than that. And it would be fine over -6A, which is good for 100-metre runs. I was unable to find out exactly which type I had (the parts of the cables that are visible to me don't have any kind of marking to say), so I decided to do a staged rollout.
The first step was to set up the wired network within my study as 10Gb/s. There were two important things to wire up: my primary desktop, and a Proxmox cluster I have running in an 11" rack. The setup I had was just one 2.5Gb/s switch sitting on top of the rack, linked to the wall, to the cluster machines, and to the desktop.
Now, getting the Proxmox cluster up to high-speed internal networking was a non-starter. The machines there are all old ones -- it's essentially a retirement home for mini-PCs I used to use for other things [1]. They're mostly gigabit Ethernet, with one 2.5Gb/s one. But getting my desktop up to 10Gb/s was an important goal, as that's where I do most of my work. I also wanted to have space for a second machine that I'm planning to set up to do training/inference without tying up the desktop's GPU, and that would also need fast networking.
I wanted to have things running reasonably cool (after all, the PC itself and its GPU pump out quite enough heat already when doing a training run), so DAC felt like the right way to go. I bought a reasonably cheap managed 10Gb/s switch [2], a MikroTik CRS305-1G-4S+IN, with a single 10GBASE-T adapter to allow me to connect it to the wall socket. I tend to name anything on my network with its own IP, so the switch got a hostname of its own. Next, a 10Gb/s SFP+ PCIe card -- an Asus XG-C100F -- for the desktop, and a DAC cable to connect the two.
For the Proxmox cluster, I decided to stick with the old 2.5Gb/s unmanaged switch, a TRENDnet TEG-S5061. I'd originally bought that one because it was the cheapest 2.5Gb/s switch on Amazon with decent reviews, and had completely forgotten that it had one major feature -- an SFP+ 10Gb/s port for the uplink! So another short DAC to connect that to the MikroTik, and the study network "backbone" was 10Gb/s.
Of course, no two computers in there could actually communicate at that speed, as only the desktop was 10Gb/s-capable -- but I could have all of the Proxmox machines talking to it at the same time at full speed. I did some throughput tests to make sure that it was all working as expected; I couldn't test very thoroughly, but I was able to get about 4Gb/s total throughput, which was reassuring: two machines at 1Gb/s plus one at 2.5Gb/s should be a touch less than 4.5Gb/s.
The next step was to check the possibilities for the connection down to the patch panel.
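As a quick sanity check on the throughput arithmetic above, here is a minimal Python sketch. The function name and the idea of capping at the uplink speed are my own illustration rather than anything from the test itself; it ignores protocol overhead, which is why the real number should come in "a touch less" than the ceiling.

```python
def expected_aggregate_gbps(link_speeds_gbps, uplink_gbps):
    """Upper bound on combined throughput: the senders' link speeds summed,
    capped by the shared uplink they all go through.
    (Hypothetical helper; ignores protocol overhead.)"""
    return min(sum(link_speeds_gbps), uplink_gbps)


# Two gigabit machines plus one 2.5Gb/s machine, all sending over the 10Gb/s uplink.
ceiling = expected_aggregate_gbps([1.0, 1.0, 2.5], uplink_gbps=10.0)
measured = 4.0  # roughly the total reported in the test above
print(f"ceiling {ceiling} Gb/s, measured is {measured / ceiling:.0%} of that")
```

The cap only matters once the senders add up to more than the uplink can carry, which is not the case for this cluster.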
I bought a Ubiquiti 10G Ethernet dongle, and took my laptop [3] down there. The news was good! Running a throughput test between the laptop and my desktop down the structured cabling, I was able to get just less than 10Gb/s from the laptop to the desktop, and about 7Gb/s from the desktop to the laptop.
The slower receive speed at the laptop end worried me, but when I checked it became obvious what was going on. I could see a kernel process running at 100%, so some single-core thing was maxing out. The Ethernet dongle was connected over USB, of course, and that meant it needed to do much more work on the CPU for each incoming "data has arrived" interrupt than a PCIe card like the one in the desktop. That meant that the laptop could only receive data at a rate that one core could handle, which happened to be 7Gb/s. It's a ThinkPad optimised for lightness and long battery life, not CPU power, so single-core performance is not great, and it hit a wall.
But the 10Gb/s speed in the other direction was enough to make me comfortable that the structured cabling could handle that speed, which was excellent news -- probably I had either short runs of CAT-6, or CAT-6A in there, though conceivably I was just getting very lucky with CAT-5E.
The downside was the heat. The USB dongle got too hot to comfortably hold while it was running, and while I wasn't able to check the SFP+ module in the MikroTik during the test, when I came back upstairs again I touched it and it was even hotter. I decided that that was something to keep an eye on for later (and as you'll see, it did become a recurring theme).
For now, it was time to do the rest of the upgrade. Downstairs at the patch panel, it was a simple choice. All of the connections were RJ45, of course, and I only needed four. So the MikroTik CRS304-4XG-IN was the obvious choice.
The final place where I needed to do some upgrades was at the ISP end. The box that our provider gave us had just one 10Gb/s port -- a 10GBASE-T RJ45 one.
Now, I don't generally trust ISP routers that much, so I've always had my own router sitting between them and the home network -- a dual-port mini-PC running a locked-down Arch installation [4]. My old one was dual-2.5Gb/s, so that needed an upgrade. I settled on a Protectli VP2440, which has two SFP+ 10Gb/s cages, plus two normal 2.5Gb/s RJ45s. I didn't need the latter, but it was the cheapest option with 10Gb/s in their range, and I've always been very happy with their hardware and customer service.
However, I was a little concerned about thermals. As I mentioned, the SFP+ module in the MikroTik in the study got very hot when I did my test. I'd need dual SFP+ modules for the Protectli -- one for the WAN port connected to the ISP box, and the other for the wall socket to go down to the patch panel. Might it overheat?
The good thing about Protectli is that you can just ask them. I dropped them a line, and got a reply the next day from a customer support rep saying that he believed it would be fine, but he just wanted to double-check with one of their techs. The following day, he followed up to say that the tech had confirmed that it would be OK. Promising! And because of that, plus their 30-day money-back guarantee, I decided to go for it.
A few days later, the new router arrived. I gave it a hostname, set it up with my normal router Arch installation, plugged it into the ISP box and the wall... and it worked just fine! So the setup at this point was:
At the same time I decided to move the main WiFi AP (a Ubiquiti U6 Enterprise) that was previously next to the router over to my study -- so that was hanging off the TRENDnet switch. After a bit of bedding in, I decided I wanted to move it back to the same place as the router -- it's more central so it provides better WiFi coverage from there.
So I got another CRS304-4XG-IN -- the 10GBASE-T MikroTik switch, like the one by the patch panel -- so that the first part of the above topology became:
All of this is sitting in a sideboard next to the dining table with no ventilation. That's probably close to a pathological case for hot-running network infrastructure like this, so... how about those thermals?
I like to keep track of what is going on with my zoo of computers, so I run Telegraf on all of them. This collects stats like the CPU temperature, system load, disk space, CPU and network use, and so on. They send this to an InfluxDB instance on a Proxmox VM. When I set all of this up, I also wanted to monitor the switches. MikroTik switches expose their stats over SNMP, so with a bit of help from various LLMs I was able to augment the Telegraf config to also scrape that data and send it to the same InfluxDB instance. I use Grafana to get all of this stuff into various dashboards, and one of them is the temperatures of the networking hardware.
Firstly, the Protectli router with two SFP+ cages, each of which has a 10GBASE-T module. I receive separate temperatures for the CPU and for each SFP+ module:
That's not exactly running cool, but TBH it's not too bad! I believe that the SFP+ cages are thermally coupled to the case (which is essentially one giant heatsink). So they're running a bit hotter than the machine as a whole, but it's not baking. Let's see how that does as the weather warms -- you can see that it's been going up over the last week or so as we had a bit of a heatwave here in Lisbon.
How about the MikroTik CRS304-4XG-IN switch -- all native 10GBASE-T -- in the same sideboard as the router?
A bit hotter than I'd like -- above the tested ambient temperature of up to 70C, though of course this is an internal reading rather than an external one. The router, which is right next to the switch, has an internal temperature lower than 70C, and its internal temperature can't be lower than ambient, so the air in the sideboard must be below 70C too. That suggests we're probably still OK.
I think that both of those could be improved, though. The sideboard they're in is unventilated, and it has the Ubiquiti U6 Enterprise WiFi AP in there too -- that runs pretty hot. So a sensible first step is probably to move the AP elsewhere, and if that's not enough, perhaps to add a USB fan to bring cooler air in through the back of the sideboard.
Now, how about the switch downstairs by the patch panel? It's also in a cupboard with no airflow, and while it's not sharing it with a router, there is a PoE injector and another WiFi AP in there (albeit a cooler-running one, a Ubiquiti U7 Lite).
Not too bad at all! Plenty of headroom there.
Finally, let's go back upstairs to my study. If you remember, I have a MikroTik CRS305-1G-4S+IN there -- a four-port SFP+ switch. I only get data for the switch itself and for the 10GBASE-T module -- the DACs don't report numbers. Check this out -- the right hand chart especially:
Yikes! The switch itself is OK at a comfortable 48C, but that SFP+ module is hovering around 93C. That's internal rather than the "touch" temperature, but assuming they're close, it's definitely getting towards blistering temperatures if you touch it. I'm getting a stick-on mini-heatsink -- the type you can get for Raspberry Pis -- to see if that might help. It's also sitting on an 11" rack, so I might see if I can find a way to thermally couple it to that.
But despite those somewhat concerning numbers, it's all working fine! I have a periodic network test running, checking end-to-end out to Google's 8.8.8.8 nameservers, and I haven't seen a glitch. Tests between machines on the network show negligible numbers of errors.
It's a working system, so naturally I want to change things. What? TBH, I think I'll be able to limit my desire to tinker in the short term to just sorting those worrying thermal numbers. For the router and the switch in the sideboard, I think that moving the WiFi AP out again will help.
It's power-over-Ethernet, so I can just run one wire up the wall and hide the AP itself behind some art. For the almost-boiling-point SFP+ module on the study switch, a stick-on Raspberry Pi heatsink is, as I said, probably a good starting point. If that isn't enough, perhaps one with a cooling fan. The actual amount of power being used there isn't much, just 3W or so -- it's only reaching such a high temperature because it's in such a small space.
The more interesting question is, what will I do if and when it's time to take the next step up, to 40Gb/s or higher? As I said in my last post, 10GBASE-T is essentially the end of the RJ45, twisted pair world we've been in for the last 20+ years. CAT-8 cabling can, apparently, run up to 40Gb/s, but it comes with its own problems -- it's super-stiff, and hard to run around tight corners or to get into the limited space in the boxes behind wall sockets.
I think that the right thing to do would probably be to switch to optical fibre. I did some initial research around this while I was still unsure if the existing cabling would work, and it seems like replacing each cable drop (that is, the run from a wall socket to the patch panel) with at least a dual-fibre cable, one to send and one to receive, would work fine, potentially even up to 800Gb/s with the right setup. The wall sockets could be LC duplex, which are designed to be easy to connect (by fibre standards).
If I wanted to really future-proof things, it might even make sense to run four-fibre or even eight-fibre cables, and leave all but two of each "dark". That would potentially leave even more space for improvement, and would actually cost very little extra -- the installation cost would be way higher than the cost of the cable.
Still, at hundreds of Euros per cable drop, plus project overheads, I'm glad I don't have to do that now.
A good decision to be able to punt down the line; who knows what will change between now and whenever my ISP starts offering even faster speeds?
So let's wrap this up with the moment you've undoubtedly been waiting for...
Not bad! Not quite the 10Gb/s advertised, but it's close -- and I've seen it get up to 9Gb/s from time to time (but unfortunately not screenshotted it). And to be clear, that was from my desktop -- so the speed was through all three of the switches, and through the router. Direct tests from the router, using the CLI version of the Ookla app [5], get similar results -- in fact, oddly, they tend to be about 5% slower than the ones from the desktop. Not sure what to make of that. I'll have to investigate further, but if anyone has any ideas about what might cause it, I'd love to hear them.
So now, when I'm uploading models to Hugging Face and downloading others, syncing large environments, downloading the latest Arch ISO, and streaming music, while at the same time Sara is watching Netflix and my Dropbox is Dropboxing, everything can run smoothly. Nice! Mission accomplished.
I hope this was an interesting read, and perhaps helpful for other people who are considering a similar upgrade. Now, time for me to go back to your regularly-scheduled all-AI, all-the-time content ;-)
[1] My OpenClaw instance, which runs there, has dubbed it "the Island of Misfit Computers".
[2] I moved from a simple network to a multi-VLAN one at the same time as this upgrade, so managed switches have become useful -- if you're just doing an upgrade to 10Gb, you can do it all with unmanaged ones.
[3] In case you're wondering about the naming strategy for machines on the network: What can I say. It passes the time.
[4] It's largely old routers that populate the Proxmox cluster.
[5] Their own one, not the more commonly-used OSS Python one, which isn't fast enough to handle speeds over about 5Gb/s.
The 2.5Gb/s setup I started with:
The ISP connection came into the apartment in the living room.
It went through a router/firewall machine I'd set up myself (more on that later), then via a 2.5Gb/s switch to the main WiFi AP and also to a wall socket.
Down at the patch panel, I had a 2.5Gb/s switch, which was connected to the patch socket corresponding to the router's wall socket.
Another connection from that switch went to the patch socket corresponding to the wall socket in my study.
In the study, I had another 2.5Gb/s switch that handled internal networking.
The setup once the new router was in place:
ISP box to WAN on the router.
LAN on the router to wall socket.
Patch panel socket corresponding to that wall socket to port 0 on the downstairs RJ45-only switch.
Downstairs switch port 1 to the patch panel socket corresponding to my study's wall socket. (Other ports to other things I'm disregarding for simplicity.)
Wall socket in the study to the RJ45 SFP+ module in port 0 on the study switch.
Study switch port 1: DAC to an SFP+ network card on my workstation.
Study switch port 2: DAC to the SFP+ 10Gb/s uplink on the old TRENDnet 2.5Gb/s switch to handle the Proxmox cluster.
The first part of that topology after adding the second CRS304-4XG-IN:
ISP box to WAN on the router.
LAN on the router to port 0 on the new switch.
Port 1 on the new switch to the wall socket (thence down to the patch panel).
Port 2 on the new switch to the WiFi AP via a PoE injector.
In case you're wondering about the naming strategy for machines on the network:
PCs, desktops, etc: name starts with P.
Laptops: name starts with L (basically just the one). Sara named her own work laptop, unrestricted by my convention.
Routers: name starts with R.
Network infrastructure: name starts with N.
WiFi APs: name starts with W.
VMs on Proxmox: name starts with V.
I also have a bare metal server on Hetzner, which has its own name.

Giles's blog 2 weeks ago

10Gb Ethernet: what I had to (re)learn

My ISP recently started offering a 10Gb option, and my "shiny new thing!" Pavlovian response immediately kicked in. So of course, I had to upgrade the wired networking in my home -- which meant I had to learn a few things to get it all working, and relearn a bunch of stuff I'd forgotten over the years. Wired networking for home and small offices hasn't really moved forward that much in the last 20-odd years. Back in 2006, gigabit Ethernet was standard for businesses, and most home users moved to it not long after. Perhaps due to the rise of WiFi for most "last few metres" connections, it's pretty much stagnated there, perhaps with a bit of a push towards 2.5Gb/s more recently. But with faster ISP connections arriving, I think things are starting to become a bit more interesting. Even the fastest WiFi 7 connections are only able to get up to around 6Gb/s to a single device -- and that's in an ideal "super-fast machine sitting right next to the AP in a shielded lab" setup. Here's what I had to drag up from my memory, and the new stuff I had to learn, in order to get this all working. I'll write about the background in this post, and then tomorrow I'll post about what I actually put in place. Let's start with a bit of the backstory. Bear with me, it's not just self-indulgent reminiscing! When I first started using networked computers, back in the early 90s, the most popular standard was 10BASE2 . We had this in the first office that I worked in, and in the university computer labs. In the back of your computer, you'd have a T-shaped connector like this: © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons ) The end facing the camera in that photo was the bit that went into your computer. Computers were daisy-chained together; you might have a server connected to workstation one, workstation one to workstation two, and so on, until you reached the last workstation. 
You'd have to cap the unused end of the T connectors at each end of the chain with a special terminator. Essentially it was a single coaxial cable, so every computer saw every bit that was sent along the bus. In turn, that meant that everyone was sharing the same bandwidth, a meagre 10Mb/s. The cool thing about Ethernet (compared to older networking technologies) was that the computers shared it without any need for coordination -- if two of them started "speaking" at the same time, they'd notice, and stop. They would then start again after a random back-off, so one of them would randomly wait for less time than the other and start first. The other would notice that "the line was busy" and would wait again for another chance. Of course, this limited the number of computers you could have on one network, as past around 20 or so, they'd spend all of their time interrupting each other and never actually be able to send anything -- and anyway, sharing 10Mb/s across a large number of computers would be an issue. On top of that, there was a hard cap of 30 machines per network. You'd use more specialised networking equipment to link different networks together -- bridges, switches and routers. More about switches later. By the time we started setting up networking in a house that I shared with friends, in around 1996 or so 1 , the most popular option had changed: now people were using 10BASE-T. Still 10Mb/s, but using the RJ45 connectors and twisted-pair cables that we've come to know and love. All of the computers would have a single cable going to a hub, in a star topology. You might link multiple hubs together to build larger networks. However, these hubs were still little more than a convenient form factor to electrically link all of the wires together into a single bus. You still had the problem that every computer could see every bit on the bus, and the same bandwidth-sharing and limits with the number of computers that you could handle as a result. 
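The back-off-and-retry behaviour described above can be sketched as a toy simulation. This is a heavily simplified model of my own, not a faithful CSMA/CD implementation: time is divided into slots, every station always has something to send, and there is no carrier sensing or frame length. The function name and parameters are hypothetical. It uses a random wait that doubles its range after each collision, which is the general idea behind Ethernet's back-off.

```python
import random


def simulate_shared_bus(num_stations, num_slots, seed=0):
    """Toy slotted model of a shared bus where every station always wants to send.

    Returns (successful_slots, collision_slots).
    """
    rng = random.Random(seed)
    wait = [0] * num_stations      # slots each station still has to wait
    attempts = [0] * num_stations  # collisions suffered by the current frame
    successes = collisions = 0
    for _ in range(num_slots):
        # Stations whose wait has expired all try to transmit in this slot.
        ready = [s for s in range(num_stations) if wait[s] == 0]
        for s in range(num_stations):
            if wait[s] > 0:
                wait[s] -= 1
        if len(ready) == 1:
            successes += 1
            attempts[ready[0]] = 0  # frame sent; the next one starts afresh
        elif len(ready) > 1:
            collisions += 1
            for s in ready:
                attempts[s] += 1
                # Random back-off: wait 0..2^k - 1 slots, with k capped at 10.
                k = min(attempts[s], 10)
                wait[s] = rng.randrange(2 ** k)
    return successes, collisions


if __name__ == "__main__":
    for n in (2, 5, 20, 30):
        ok, clash = simulate_shared_bus(n, num_slots=10_000)
        print(f"{n:2d} stations: {ok} good slots, {clash} collisions")
```

Running it for a few station counts gives a feel for how much of the bus's time goes to collisions rather than useful traffic as the network grows, though the exact numbers depend entirely on the simplifications above.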
Over the years after that, things moved on. Switches had been relatively expensive things; they would be used to interlink hubs, or 10BASE2 networks. They would learn (from seeing the source MAC address on incoming packets) which machines were sending to each of their ports, and use that to know where to send packets that came in on other ports. If, say, a switch learned that addresses A, B, and C were on port 1, then if a packet for one of those machines came in on port 2, it would know it could just send it out on port 1 and not on the others. That helped to address the bandwidth-sharing and the problems with collisions.
Prices for switches got lower and lower, and eventually -- I think sometime between 2005 and 2010 -- they became so cheap that there was little point in bothering with hubs -- you'd just connect every computer directly to a switch. That meant that any two computers on the same switch could talk to each other at the full network speed, as packets would just be switched from port to port [2]. The connections between switches were still a bottleneck, of course, but that was much less of a problem.
At the same time, speeds increased, from 10Mb/s to 100Mb/s and then finally to 1Gb/s, which was standard for business machines by 2005 or so -- I remember that when we bought our first computers for Resolver Systems back then, that's what they came with by default. Home computers weren't far behind -- and that's where we've been ever since. [3]
Back to that bottleneck between the switches. Even back in the days of 10Mb/s networks, if you were managing a larger network, you would want a faster network to interlink them -- so, for example, if two computers on the same switch both wanted to access some external resource, they wouldn't be competing for the same 10Mb/s uplink. Once you went past small office-sized networks, that kind of thing started becoming important. ISPs and datacenters, of course, had the same problem in spades.
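The MAC-learning behaviour described above is simple enough to sketch in a few lines of Python. The class and method names are hypothetical. One thing it adds beyond the text above is what happens when the destination has not been seen yet: the usual behaviour is to flood the packet out of every other port, and that is what this sketch assumes.

```python
class LearningSwitch:
    """Minimal sketch of a switch that learns which MAC addresses are on which port."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}  # MAC address -> port it was last seen on

    def handle_frame(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame should be sent out of."""
        # Learn from the source address on the incoming frame.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the one it came in on.
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            # Destination is on the segment the frame arrived from; nothing to forward.
            return []
        return [out_port]


switch = LearningSwitch(num_ports=4)
print(switch.handle_frame(1, "A", "B"))  # B not seen yet, so the frame is flooded
print(switch.handle_frame(2, "B", "A"))  # A was learned on port 1
print(switch.handle_frame(1, "A", "B"))  # B has now been learned on port 2
```

Once both addresses are known, traffic between them only touches the two ports involved, which is why other machines on the switch keep their full bandwidth.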
What you would need was an uplink on the switch that could run at a faster data rate. So even when 1Gb/s Ethernet was too expensive for the connections to the computers themselves, you might have a switch with a 1Gb/s uplink to connect it to the larger network, and a bunch of 100Mb/s ports for the local stuff. Additionally, for larger networks you would have another problem -- physical distance. All of these RJ45-based networking technologies had a maximum cable length of 100m. You could extend that by putting a repeater (or even just a switch) every 100m or so as a "signal booster" -- but if, for example, you wanted to link two buildings, that could be tricky. You'd need to run both the data cable and power, and you'd need to have some way of getting access to the repeaters if they went wrong. Ethernet over fibre optic connections had been a standard thing for years, though, and it had much better range -- for single-mode, many kilometers. So while it was too fiddly for LANs, it made great sense as a backbone technology. What that meant, though, was that in order to set up some particular network topology, you might wind up having to get a whole bunch of different switches. For short connections between two of them, you might use an RJ45 uplink connection, while for longer ones you might want fibre. More complex topologies might need some entirely different mix of ports. To make this worse, there were a bunch of different fibre optic standards -- multi-mode and single mode fibres, different connectors, and so on. Rather than manufacturing a large range of different kinds of switches with all of the combinations that people needed, manufacturers separated out the physical layer of the transport from the switching hardware. A switch, instead of having specific RJ45 or fibre connectors for its ports, would have Small Form-factor Pluggable (SFP) "cages", essentially a new kind of socket. 
These allow people to mix and match different kinds of transceiver modules, which would slot into the cage to provide an actual usable interface -- one for RJ45 for gigabit Ethernet, or one for the particular kind of fibre connection they were using -- whatever configuration worked best for them. A typical switch for a larger network might have one or two of those for backbone connections, and then RJ45s for local connections. Over time, gigabit backbones were no longer enough, and SFP was followed by SFP+, which could handle 10Gb/s. Since then, there have been extensions for even faster speeds, way up to hundreds of Gb/s. Back in the day, this stuff was only important to network admins for medium-sized networks and larger, of course. But now, 10Gb Ethernet means that we've now hit the point where it matters even for home users, and that's because of thermals. Here's the problem. Somewhat loosely speaking, the faster a network connection on a particular kind of wiring, the hotter it runs. Over an RJ45/twisted pair connection, 10Mb/s Ethernet basically shed no heat, 100Mb/s a little more, even gigabit Ethernet just left your switches somewhat warm. The jump up to 10Gb over RJ45, called 10GBASE-T, makes things decidedly toasty -- you'll see just how toasty in tomorrow's post. There's also the issue of cabling. Because network speeds have been stable for some time -- Gigabit Ethernet being the standard for ~20 years -- most buildings with structured cabling (the kind of thing where there are RJ45 sockets in the walls wired together) will have the standard for that -- CAT-5E. Unfortunately 10Gb/s Ethernet won't officially work over it -- you might be lucky, especially with short cables, but in general it won't work, or if it does it won't be reliable. CAT-6 cabling helps -- it can handle 10Gb/s over runs up to about 55 metres. And the ideal is CAT-6A, which can handle 10Gb/s over the same 100 metre cable lengths that you'd expect for the older, slower setups. 
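The cable-category rules of thumb above can be written down as a small lookup. This is just a sketch using the figures quoted in this post; the dictionary and function names are hypothetical, and real-world results depend on cable quality, interference and so on.

```python
# Rough maximum run lengths for 10GBASE-T, using the figures quoted above.
MAX_10GBASE_T_RUN_M = {
    "CAT-5E": None,  # not officially supported; short runs might work
    "CAT-6": 55,
    "CAT-6A": 100,
}


def rate_10g_run(category, length_m):
    """Give a rough verdict on running 10Gb/s over one cable drop."""
    limit = MAX_10GBASE_T_RUN_M[category]
    if limit is None:
        return "unsupported (might work if you're lucky)"
    return "ok" if length_m <= limit else "too long"


for category, length in [("CAT-5E", 15), ("CAT-6", 40), ("CAT-6", 80), ("CAT-6A", 80)]:
    print(f"{category} at {length}m: {rate_10g_run(category, length)}")
```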
What this meant was that an interim standard was created. 10GBASE-T is hot and needs cables that people don't necessarily have, especially when you're talking about what's installed in the walls of their building. But if you run it a bit slower, you can do so over older cables and without melting them. That's why I didn't mention 2.5Gb/s Ethernet earlier (or indeed the rarer 5Gb/s). They were introduced as slowed-down versions of 10Gb/s to get it to work on existing infrastructure without major upgrades. And that's great, right up until the point your ISP emails you to say that they're offering 10Gb/s to your home now... So, what can you do to run 10Gb/s without melting things? Let's think about what an SFP or SFP+ module actually is. It slots into a cage on a switch. On one side, there's an electrical connection to the switch hardware, which is carrying the signal -- incoming and outgoing -- using a particular protocol 4 . The module does its magic, and on the other side we have -- say -- 10GBASE-T to an RJ45 socket, or a blinking laser with an appropriate interface for optical fibre. What would happen if you just had a dumb electrical cable to connect an SFP+ cage on one switch to another on another switch? That actually works pretty well! It's called a passive Direct Attach Copper (DAC) cable. The interfacing is a little more complicated than just a completely dumb wire -- the switch will want to query the module in the cage to find out some details about it, so you need a tiny bit of electronics -- but it's still really simple. On top of that, if you add a bit of amplification to the DAC, then you get an active DAC, which can double that kind of length (though these are relatively rare). The neat thing about DACs is that they run much cooler than 10GBASE-T, using about a third of the power. Of course, they lose out in terms of range. But for simple stuff within one room, and especially between switches in a rack, they work really well. 
The next step on top of DACs is that you can convert the underlying SFP(+) protocol directly to light, and send it down an optical fibre -- normally called an Active Optical Cable, or an AOC for short (though I've seen the rather confusing terminology "optical DAC" in various places). With that, you can normally get up to 100m. These are cheap and easy to use (because they're all-in-one units, so you don't have any fiddly alignment of the fibre to do), so they're the best option once you pass passive-DAC distances. After that, though, you really need to switch to the official standards, and go to more traditional fibre-optic setups. I've done much less research into those, so won't try to explain them. Either way, for the home, anything above this level is probably overkill right now... So: moving from the 2.5Gb/s networks that work smoothly with the same infrastructure we've been using for the last 20 years or so to 10Gb/s is a tricky step change. Suddenly, things that didn't matter -- thermal management, cable lengths, and so on -- become important. And there are solutions, but you need to start actually understanding things again rather than just plugging stuff in and assuming it will work. Fun! Time to put it into practice :-) In my next post, I'll show exactly the changes I had to make to get my existing 2.5Gb/s network ported over to 10Gb/s -- the hardware I wound up buying, how well it works, and (importantly) how hot it all runs. To share our blazingly fast bonded dual ISDN Internet connection -- 128Kb/s.  ↩ I remember feeling a little sad when that happened, because it meant that what I felt was coolest about Ethernet -- the back-off-and-retry thing -- was no longer all that important. And when the connections went full duplex (a single switch port could both send and receive at the same time over the same cable) it was finished.  ↩ If you're thinking "what about 2.5Gb/s?", I'll come back to that -- it's an interesting case.  
↩ SFF-8472 for SFP, then there's SFF-8431 and SFF-8432 for SFP+.  ↩

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 33 -- what I learned from finally getting round to the appendices

After finishing the main body of "Build a Large Language Model (from Scratch)", I set myself three follow-on goals. The first was training a full GPT-2-small-style base model myself. That was reasonably easy to do but unlocked a bunch of irresistible side quests; having finally got to the end of those, it's time to move on to the others: reading through the book's appendices, and building my own GPT-2 style model in JAX. This post is about the appendices. The TL;DR: there was stuff in there that could have saved me time in my side-questing, but I think that having to work those things out from scratch probably helped me learn them better. The first appendix is an excellent overview of PyTorch, and given that I'm writing for people who are reading the book too, all I can really say is that it's well worth reading, even if you have some experience in it. He gives an intro to what it is, some details on how to choose to use GPUs (or Apple Silicon) if you have them, and an overview of tensors. He then goes on to explain the basics of automatic differentiation and back-propagation, with a bit of background detail about the chain rule. I think this bit is useful at a "how-to" level, but the mathematical details felt like they were summarised too briefly to be all that useful. I can see why -- this is an appendix to a book on an adjacent subject, not a textbook on the mathematics of training ML models. But something this brief feels like it would be confusing for people who don't know it already, yet not really useful for those who do. Perhaps I'm underestimating the typical reader, but if and when I write up my own explanation of how this works (perhaps as a follow-up to "The maths you need to start understanding LLMs"), I'll go quite a lot slower and try to explain things in more detail. Anyway, as I said, the explanation is more of a bonus in this book, quite far from its main focus, so this is a nit. He then goes on to a high-level explanation of PyTorch's Datasets and DataLoaders.
This was quite useful for me. I must admit that I've been struggling a bit to see the value of DataLoaders -- indexing directly into Datasets has worked very nicely for me. I suspect this is a question of scale more than anything; even my big training runs, 44 hours of training a 163M-parameter model on 3 billion tokens, worked fine without a DataLoader. But after reading this section, I felt I was getting some way towards having more of a handle on how they might help. I'm not quite there yet, but hopefully soon... Next, there are sections on training loops, both with and without GPU support. Nothing new there for me, at least. Then came the real surprise: a really solid walkthrough on training models across multiple GPUs with DistributedDataParallel! That's something I learned from the documentation and various online tutorials back in January , and reading this appendix first would have saved some time. But thinking back on it, I think that the way I did it was better pedagogically for me. By having to grind through it from first principles -- following the docs, coding something, seeing it break, trying again, and eventually getting there -- I think I internalised the knowledge much better. It's a balance, really. If I read explanations, I learn faster, but the knowledge is shallower. Learning by doing is slower but deeper. Working out a good balance is hard. It feels like I've struck a good balance on this one, but I suppose it's difficult to know for sure. The one thing in the DDP section that did stand out for me, though, was the use of a for the . That might have made some of my DDP code a bit simpler! On to the next appendix. I won't go through this in detail; it does what it says on the tin, and there's a bunch of interesting stuff in there. I scanned through and nothing felt like a must-read right now, but I'll be checking it in the future if I'm looking for suggestions for things to read about. Another one that is exactly what it says it is. 
Once again, something I could have saved time by reading first! In it, he covers gradient clipping, which I went over back in February , and warming up and then doing a cosine decay on the learning rate, which was something I looked into in March . Just like with DDP, I think that having to learn about these from resources I could find on the Internet meant that I got to a deeper understanding than I would have if I'd just been following the book. This is not a point against the book, of course! Again, it's one of those balancing acts: do it yourself and learn more, or read about it and learn faster. Still well worth reading though. This was a really interesting read. I've been reading about LoRA on the side, but most treatments I've seen started with an explanation of the maths, but then essentially said "now, to do it, install PEFT" (or Unsloth, or something similar). Raschka gives the full code, showing how you can write your own LoRA stuff, and I think this is excellent. Digging into it right now would be a side quest, but I'm inspired by it and might do my own LoRA writeup after finishing this LLM from scratch arc. Let's see if I manage that or if I get distracted by something shiny first... The last page in the book. Well, the first page of the index. Done. Wow! But before I start the celebrations, there's one last step. As I said last November , I wanted to: [Build] my own LLM from scratch in a different framework, without using the book. That is, I think, essential, and perhaps would be the crowning post of this series. It would be a nice way to end it, wouldn't it? I think I was right, so that's what's next. I asked people on Twitter which framework I should use, and the winner was JAX -- and so that's what's coming next. Watch this space!

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32m -- Interventions: conclusion

Last November, when I finished the main body of "Build a Large Language Model (from Scratch)", I set myself a number of follow-on goals. One was "training the full GPT-2 base model myself". I've reached the end of that journey, with a model that is almost -- if not quite -- as good as GPT-2 small, trained in 44 hours on my own machine, so I thought it would be worth summarising how it went.

In December, I trained my first model, taking two days, but was disappointed to see that it was worse in terms of loss, and in terms of how well it could be fine-tuned to follow instructions, than the original GPT-2 model. I expected that a chunk of that difference was likely to be due to the original model having been trained for longer, but also noticed that there were a number of changes -- interventions -- that I could make to the model and the training run, and I thought they might help. In January, I got a DDP training system together that would allow me to iterate on those interventions without having to wait for two days for each result. In February, I got started by training a baseline model in the cloud, and I've since ground through all of the interventions, and come up with a set that lowered the loss nicely, both in the cloud, and locally.

Along the way, I've learned about, or refined my knowledge of, a bunch of ML concepts. In increasing order of how they helped with the loss (with the first two actually making it slightly worse):

- Weight tying, which I found made the loss worse, but it was interesting how simple it was to implement.
- PyTorch's Automatic Mixed Precision, which also harmed the loss a tiny bit, but had the benefit of making training twice as fast, and 66% cheaper in the cloud -- well worth the loss penalty.
- Gradient clipping -- a cheap, but (somewhat to my surprise) not particularly effective intervention for this model.
- QKV bias -- that is, adding bias to the attention weight matrices -- which also helped a tiny bit, though I later felt that this might have been in the noise.
- Weight decay -- more effective, and something that's simple enough to understand with simple gradient descent. I still need to learn more about it in the context of optimisers, though -- particularly with AdamW.
- Dropout, which seems to be less than useful for single-epoch training: removing it helped the model quite a lot.
- The learning rate, which I built up quite a lot of new knowledge about, and by both increasing it and scheduling it, I got the biggest bang for the buck.

I've also learned how to upload my custom models to Hugging Face, found out some interesting things about how random noise affects training, and come up with improvements in the setup I have for using an LLM as a judge for instruction fine-tuned models.

There was a bit of a mystery when I tried out the instruction fine-tuning tests, though. Although two of my models were very close to GPT-2 small in terms of loss, I found that while one of them had an instruction fine-tuning result that was likewise close to GPT-2 small, the other was much worse! A mystery to dig into later, I think. But it was still very satisfying that my best model -- trained locally in 44 hours -- was almost as good as GPT-2 small, even if it did fall somewhat short.

So on that positive note, I'm going to wrap up this "Interventions" series-within-a-series, and move on to the two other things I wanted to do before wrapping up the "LLM from scratch" series as a whole:

- Going through the appendices in the book to see if there's anything I want to highlight there.
- The final test as to whether I've really understood everything: building my own LLM from scratch without reference to the book. I want to do that in a different framework, not PyTorch, to minimise the risk of just regurgitating code -- I asked people on X/Twitter which one I should use, and the winner was JAX -- so it should be interesting to see how that goes!

The appendices first, I think -- I'll post about them shortly. But I think the big one will be the JAX implementation -- really looking forward to that.

Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32l -- Interventions: updated instruction fine-tuning results

I've been working on a GPT-2-small-style LLM based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", and have tried a bunch of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured in terms of loss on a held-back test dataset. After working through them, in my last post , I managed to train one that was almost (if not quite) there. Now, back before I started digging into these interventions, I was doing three evals for each model I built; a smoke test (to see if it could give a coherent completion to "Every effort moves you"), a test for that test set loss, and an instruction-following test that fine-tuned the model on the Alpaca dataset, got it to generate results for a test set of instructions, and then used an LLM as a judge to score them. The idea behind this was that the loss on the test set was an interesting technical measure of the quality of a model, but it didn't really tell us much about how useful it might be in reality. Unfortunately, in January, I realised that my methodology was bad ; because I was asking the LLM to score a model in isolation, the LLM's natural randomness would mean that results were not really comparable, at least for models that were reasonably close in quality. For example, if two models both replied to ...then one run of the instruction-following test might "find the judge LLM in a good mood" and get, say, 5% -- after all, the model tried to answer, and actually used a real person's name, even if the answer was totally wrong. But in another run, the judge might be in a "worse mood" and score it at 0%. My fix was to have two scripts: The details are here . Because doing it that way was significantly more work, I've not been doing these tests as part of the interventions mini-series. I felt it would make more sense to wait until I'd tried a bunch of interventions and got a number of models to try. Now I have those, so let's give it a go! 
At the end of the previous round of IFT tests, I had this table. It's sorted by the loss on the test set (shown to 3 decimal places), and has the score that the model got from an instruction fine-tuning run: There's a loose correlation where lower loss means a higher IFT score, with two weird exceptions: the two FineWeb-Edu training runs, where they got much higher results than you'd expect from the loss. My working hypothesis was that there were two components that led to a model getting a good score: So in those terms, the OpenAI models and Cloud FineWeb, 8x A100 40 GiB might be smart but not know very much, and the FineWeb-Edu ones might be dumb but knowledgeable. The ones in between, by contrast, could be relatively dumb too, but also not know very much. There was one other oddity: the Cloud FineWeb, 8x A100 40 GiB model seemed surprisingly good on the IFT results when considering its loss -- but perhaps there was some kind of step function, where as soon as a model got better than (say) 3.7 on the loss, it suddenly became smart in whatever way mattered. All very hand-wavy, of course, but it was a hypothesis of sorts. Would the new models fit that pattern? It was time to find out. I didn't think it was worth adding all 14 models that I've trained in my intervention-testing to that table, so I decided to just add four of them: Now, I already had files containing responses from fine-tuned versions of the other models, so I just needed to run the first of my two fine-tuning scripts against all four of the new models. I did that, and then also tweaked the judge script so that instead of using GPT-5.1, it used GPT-5.4. If you run the script multiple times, each time will normally give you different scores anyway; hopefully the ranking will remain roughly the same. 
So given that I was going to have to re-run the script to get new aggregate results, and those would not really be comparable to the original ones anyway, this seemed like a reasonable price to pay for (hopefully) a smarter judge. I ran that once, and got some results that surprised me -- so much that I decided to do three runs and see if the results stood up. They did; here's the new table, with scores for each run, the average, and the rank that each one got based on the average. You can see that relative rankings are fairly consistent across the IFT runs. But while in general the lower-loss runs get better IFT results, now there are even more exceptions to that trend than there were before. Let's look down the "IFT rank" column, which is based on the IFT average: That's a really odd situation. If the training runs using gradient accumulation rather than DDP had been consistently worse -- or vice versa -- then we could imagine some kind of connection. But in the first case, GA beat DDP, but in the second, it was the other way around. Apart from that, we do still see that the two FineWeb-Edu models are doing much better than the others. And the remaining models are all pretty close together, both in terms of loss and in terms of their ranking, apart from the Local FineWeb train, which is bad in both. It is, however, interesting that Local FineWeb-Edu extended train, which was trained on twice as much data as Local FineWeb-Edu train, is consistently worse in terms of the IFT numbers, though. That wasn't the case in my tests previously. All of this puzzled me. The "lots of knowledge makes a model better at this" idea seemed to be weakened by the relative ranks of the two FineWeb-Edu models (after all, if it was true, you'd expect the model trained on more data to be consistently better). And the "smart, low-loss models are better" side seemed to be contradicted by and 's bad results. What might be going on here? 
Looking at the training code, one thing stood out to me. The process was: In practice, the early-exit code always cut in pretty quickly. I'd noticed that during my original generation of the results for the new models: I decided to regenerate responses for all of the models, and then run the new responses past the LLM judge again. But this time I would keep a record of how many epochs of training we got before the exit: It was getting even harder to see any useful pattern! One thing that did stand out, though, was that the still oddly-high Cloud FineWeb, 8x A100 40 GiB model was being instruction-trained for seven epochs. It was also rather noticeable that the two FineWeb-Edu models had the same "advantage", if that's what it was. But the Local FineWeb train had seven epochs too, and got a poor score, the OpenAI models only got two each, and led the pack, and got a pretty poor result given its six epochs of training. Still, what would happen if we got rid of that confounder? I did yet another set of runs; this time, I changed the fine-tuning/generation script to always do four epochs -- no early exit. I chose four because it was the modal number in the previous trains -- no strong reason for it beyond that. Here's what came out at the end: Still no obvious pattern. What if we try seven epochs of training for all of them, so that they all get as much "benefit" (if that's what it is) as the FineWeb-Edu models? Just as confused as ever... Here's a table with all of the ranks we got from these tests: It's hard to draw much sense out of this, but a few things are clear: On the one hand, training different models for different numbers of epochs feels wrong for an evaluation like this, as they're being "treated differently". On the other hand, if it's meant to be a good evaluation of model usefulness in the real world, then individual models would be fine-tuned for different amounts of time, depending on validation loss. So perhaps it is better? 
But the differing results are still quite a puzzle. I figured that a modern AI could easily build me a data exploration interface, specifically for the original results and seven-epoch ones, so I asked Claude and got this rather nice one . After poring over that, though, I couldn't find a smoking gun -- for example, some kind of systematic error that was always making that pulled its score down. I think that the best -- albeit hand-wavy and incomplete -- mental model that I have right now is something like this. If we consider the loss landscape that these models are all in, they've all been trained to try to get to a place with as low loss as we could manage. When we do the instruction fine-tune on them, we're changing the landscape -- the objective of "be better at following instructions" is different to "be better at minimising loss". Now, those two landscapes could be completely different! You can imagine a task that we might set instead of instruction-following that could be completely uncorrelated with loss minimisation, or even inversely correlated. But instruction-following is relatively close; it at least shares features like "generate coherent text". So when we do the instruction fine-tuning, what we're trying to do is to move from the place where the model ended up after its pre-training, to a place where performance on the new goal -- instruction-following -- is best. Here's where I'm going to get more than a bit hand-wavy. You can easily imagine that some places where the loss was low, there might be downhill slopes pointing towards good locations in the new instruction-following landscape. With instruction fine-tuning, you'd be able to get a good IFT model. But other places with low loss might not have that advantage; maybe they're at or near a poor "local minimum" in the IFT landscape -- that is, a place where there is no downhill route to a better place. So simple fine-tuning like this might never get a good result! 
With this mindset, we might say that the OpenAI weights are pretty well-positioned, not just in the loss landscape but also in the IFT landscape. The FineWeb-Edu models happened to get lucky, and wind up in a place that (despite having poor loss), is well-positioned for the IFT objective. And by contrast, and were just unlucky: they got to a place where the loss landscape was not well-correlated with the IFT landscape. This seems plausible enough for me to use it as my working model for now, and see if I can work out some way to test it. Keeping track of the validation loss during the instruction fine-tuning process would certainly be a good start; unfortunately I only realised that after doing all of the tests above, and re-doing them would be quite a lot of work. One final thing is worth repeating. Our two "unlucky" models, and , each had a twin. The former was the DDP-trained counterpart of the gradient-accumulated , while the latter was the gradient-accumulated counterpart of . So while something odd clearly happened, it doesn't look like DDP or gradient accumulation by themselves are the culprit. I think that at this point, it's best for me to draw a line under this -- I have a bunch of other things I'd like to get to, and this is a bit of a side quest at this point. Still, I have one main takeaway from this: chasing lower loss is technically interesting but is not the only goal. In some cases, it seems likely that lower-loss models can be worse for actual use. Coming up next: I'm going to wrap up this "interventions" mini-series, and move on to the final steps in my LLM from scratch journey. See you then! One that fine-tuned the model then got it to generate responses, then saved those responses in a file. One that took a bunch of files generated by the above, one for each of a set of different models, and presented them to the LLM together, so that it would (hopefully) be consistent in how it rated them relative to each other. 
Its raw intelligence: lower-loss models were smarter, so they were better at instruction-following after the fine-tune. Its knowledge. All of the models -- mine and OpenAI's -- apart from the FineWeb-Edu ones were trained on what amounted to minimally-curated data from the Internet. But FineWeb-Edu is meant to be "the most educational" subset of FineWeb, so it presumably is more dense in useful facts. , the baseline cloud-trained model for all of the interventions . , the locally-trained version of the same -- the first model from this post . , the best model we managed to get in the cloud . , the best local model -- the second from this post . The first surprise is . It has the fourth-best loss, but it's the worst model out of all of them on the instruction fine-tuning test! It was trained on exactly the same data as all of the others apart from the OpenAI ones and the FineWeb-Edu ones. Even more perplexingly, it was as close a match to as I could make it, but got completely different results. You might remember from the post that those two runs started with the same weights and had exactly the same training config; the only difference was that they were trained on different architectures, and one used DDP with a real global batch size of 96, while the other used gradient accumulation to get the same batch size. also does much worse than you'd expect from its loss numbers; it's only a tiny bit worse than Cloud FineWeb, 8x A100 40 GiB in loss terms, but much worse on the IFT test. Again, this one is essentially a clone of another: , which was the same training run but using DDP rather than gradient accumulation. The same problem -- one of a pair of closely-matched models has worse results on the IFT test. But in this case, it's the gradient accumulation model that turned out bad. Fine-tune the model for a maximum of 100 epochs over the training set. 
If loss on a held-back validation set went above the result for the previous epoch, we did an early exit and used the previous epoch's model for the generation of the responses. took 6 epochs until validation loss started rising. Performance on this test is correlated with loss, but it's far from the only factor. The OpenAI weights consistently lead the pack. Of our own models, , Cloud FineWeb, 8x A100 40 GiB, and Local FineWeb-Edu train do pretty well. Strangely, Local FineWeb-Edu extended train, which is just Local FineWeb-Edu train that has been trained on a further 3B tokens of the FineWeb-Edu dataset, is consistently worse than the model it was based on. and are consistently bad. Cloud FineWeb, 8x A100 80 GiB is also not great.
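The early-exit fine-tuning procedure described in this post (train for up to 100 epochs, stop as soon as validation loss rises, and use the previous epoch's model) can be sketched roughly as follows. This is a minimal illustration, not the actual fine-tuning script: `fine_tune_with_early_exit`, `train_one_epoch`, and `validation_loss` are hypothetical names, and the convention of returning the number of epochs the kept model was trained for is an assumption made for the example.

```python
import copy


def fine_tune_with_early_exit(model, train_one_epoch, validation_loss, max_epochs=100):
    """Train until validation loss rises, then return the previous epoch's model.

    `train_one_epoch` and `validation_loss` are hypothetical callables supplied
    by the caller. Returns (kept_model, epochs_the_kept_model_was_trained_for).
    """
    kept_model = copy.deepcopy(model)
    previous_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss > previous_loss:
            # Validation loss went up: early exit, keeping last epoch's weights.
            return kept_model, epoch - 1
        previous_loss = loss
        kept_model = copy.deepcopy(model)
    return kept_model, max_epochs


# Toy demonstration: the "model" is a dict, and validation loss is scripted.
scripted_losses = [3.0, 2.5, 2.2, 2.3]
toy_model = {"epochs": 0}


def toy_train_one_epoch(model):
    model["epochs"] += 1


def toy_validation_loss(model):
    return scripted_losses[model["epochs"] - 1]


kept_model, epochs_trained = fine_tune_with_early_exit(
    toy_model, toy_train_one_epoch, toy_validation_loss
)
```

In the toy run, the loss rises at the fourth epoch, so the model from the third epoch is the one kept. Dropping the loss comparison and setting `max_epochs` to a fixed number would correspond to the fixed four-epoch and seven-epoch variants mentioned above.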

Giles's blog 3 weeks ago

How an LLM becomes more coherent as we train it

I remember finding it interesting when, back in 2015, Andrej Karpathy posted about RNNs and gave an example of how their output improves over the course of a training run . What might that look like for a (relatively) modern transformers-based LLM? I recently trained a GPT-2-small-style LLM, with 163 million parameters, on about 3.2 billion tokens (that's about 12.8 GiB of text) from the Hugging Face FineWeb dataset, and over the course of that training run, I saved the current model periodically -- 57 checkpoints over two days. Here's what it looked like -- the start, the end, and some interesting waypoints in between. For each checkpoint, I asked it to generate a completion to the words "Every effort moves you". 1 When the model was first created, before any training had been done, it came up with this: If you've read the Karpathy essay, you'll see one important difference -- it's already got words in there. His RNNs were generating complete noise at this stage. Even by the 100th iteration, he gives an example like this: That's an important difference between the RNNs he was talking about, which were character-based and had to learn about words and the like, and LLMs like this one, where the text is input and then output one token at a time. ( More info here ). Still, even though it has what looks like words, it's essentially content-free token salad with no structure or coherence 2 . Let's see what happens if we train it more. In my training loop, it sees 96 sequences of 1,024 tokens, and then we update it based on its loss (an index of how wrong it was at predicting next tokens), so that's 98,304 tokens for each step. After 617 of these 3 it seems to have mostly learned something about which tokens are most common: By the next checkpoint at step 1234, we've got something that's starting to come together. 
It doesn't make sense, but there's some kind of glimmering of meaning: And just a little while later, at the checkpoint at step 2468, we have something that actually makes some kind of sense (at least at the start)! Now, the training data I'm using was scraped from the Internet, and unsurprisingly there's a lot of somewhat cheesy business content there. By step 9255, we're starting to get a lot of stuff like this: ...or even more cheesy self-help stuff (step 10489): To be fair, the starting point of "Every effort moves you" is probably biasing things a bit there. But let's be clear: by this point it's seen 1,031,110,656 tokens -- that is, it's about one third trained. And it's coming up with pretty coherent text! The rest of the training run is more about refining things -- the loss chart for this training run looks like this: Loosely speaking, the lower the loss number, the better the model is, so you can see that the bulk of the improvement had happened by this point. From here on, I'll just give a few of the more interesting samples: By step 14191, it's started using bullet points... Step 24680 -- more motivational stuff: Step 25297 -- small models like this do like repeating themselves. You might remember seeing ChatGPT output back in 2023 or so that had tics like this: And again at step 26531 At step 27765 it decides that it has had enough after generating just a couple of words and tries to start a new document: But step 28382 is actually rather good. I particularly like the "however": And finally, the training run finishes at step 33164 with these wise words of caution: Well worth remembering, I'm sure we can all agree. I wonder what deep wisdom we'd have gained if I had asked it to generate more than 20 new tokens... What I found most surprising when I first started playing with this is how fast even simple LLMs got to a stage where they could generate plausible text. 
Just one third of the way through the training run, this model was making some kind of sense. The problem, of course, is that we don't just want generators of plausible content -- we want that content to make sense and be correct. And that's why it's worth grinding through the other two thirds -- in the hope that when you ask it to complete "The capital of France is", it will reply with "Paris" rather than a coherent but wrong answer like "Rouen". Technical details: 20 GPT-2 tokens generated on top of the initial text, with a temperature of 1. I've added line breaks to make it easier to read the samples.  ↩ Well, it mentions " despicable capitalists", but I suspect that's just randomness rather than some kind of primitive political consciousness. Including the space at the start, that's tokens 47034 and 32663 in the GPT-2 tokeniser.  ↩ So, 60,653,568 tokens seen.  ↩
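The token counts quoted in this post follow directly from the step size, and can be checked with a few lines of arithmetic. This is just a sanity-check sketch; the constant and function names are made up for the example, while the numbers (96 sequences, 1,024 tokens, steps 617, 10489 and 33164) come from the text.

```python
SEQUENCES_PER_STEP = 96
TOKENS_PER_SEQUENCE = 1024
TOKENS_PER_STEP = SEQUENCES_PER_STEP * TOKENS_PER_SEQUENCE  # 98,304

FINAL_STEP = 33164  # the last checkpoint mentioned in the post


def tokens_seen(step):
    """Number of tokens the model has been trained on after `step` steps."""
    return step * TOKENS_PER_STEP


# Fraction of the run completed at the "cheesy self-help" checkpoint.
fraction_at_10489 = tokens_seen(10489) / tokens_seen(FINAL_STEP)

print(f"{tokens_seen(617):,} tokens after step 617")
print(f"{tokens_seen(10489):,} tokens after step 10489")
print(f"{fraction_at_10489:.1%} of the run completed at step 10489")
```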

Giles's blog 4 weeks ago

Writing an LLM from scratch, part 32k -- Interventions: training a better model locally with gradient accumulation

I've been working on a GPT-2-small-style LLM based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". I've trained various versions of it in the cloud to work out which interventions to the model and training code had the best effects on the loss it gets on a specific test dataset, and now I wanted to do a training run locally to match the best of those. For that, I wanted to match the batch size I was using for the cloud training runs. When I first started learning this stuff, batching seemed like a performance thing -- with highly parallel systems like GPUs, it generally turned out that you could run a batch of (say) two inputs through a model in less than twice the time you could run one, so it made sense to batch them up. For inference, that is exactly the advantage you get, but when training, it's become increasingly clear to me that you can also get an improvement in the quality of the model from batching. The best intuitive model I have is that if you run inputs through one-by-one, adjusting parameters after each, then it's easy for the model to "overcorrect" each time. With batches, you get an average set of gradients across all of the items -- which smooths things out and stabilises the training. Of course, it's possible to overdo it. As an extreme example, imagine that you were somehow able to fit your whole training set into one batch -- then you could train by running that single batch through, doing a single backward pass, and then adjusting the parameters once. It's pretty clear that that would not work very well -- just one single update of the initially-random parameters. When training on my local machine, I could fit a batch of six sequences into my RTX 3090. I'd found that when I moved to cloud machines, it had a very positive effect on the loss I got out of the models when I tested them. 
From a quick-and-dirty bit of curve-fitting , I estimated that the optimal batch size for this model, with that training run, was somewhere around 97. Conveniently, that was close to the maximum I could fit onto an 8x A100 40 GiB/GPU machine, so I used a batch size of 96 to test the different interventions I was trying. And when I finally put all of the interventions that helped with training together , I found (somewhat to my surprise) that their combined effect -- an improvement in loss of 0.113765 -- was less than half of the loss improvement of 0.252474 that I had got from increasing the batch size. What that all made clear was that if I wanted to do a local training run that matched the quality of the cloud-trained model, I'd need to not only add on the interventions that I'd been testing in detail, but I'd need to match the cloud batch size. And for that, I needed to learn about gradient accumulation. Gradient accumulation is pretty much what it sounds like; instead of the normal technique of doing a forward pass, working out the loss, getting gradients with a backward pass, and then applying them by stepping the optimiser, you do multiple forward-backward phases, letting the gradients accumulate, and then do one optimiser step after that. When you do that, you're getting the training stabilisation benefits of a larger batch size, even though you're not getting the performance boost. Sounds simple enough, and it is, in theory, but implementation got a little more complicated. Let's work through it step-by-step. To start with, imagine you have a really simple training loop: Adding gradient accumulation to that is really simple! Let's assume that has a length divisible by , the number of steps we want to run through before we step the optimiser. As a first (not quite correct) cut, you could just do this: You can see that we're just stepping the optimiser every steps. 
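As a rough sketch of the two loops just described (the simple training loop, and the first, not-quite-correct cut at gradient accumulation that steps the optimiser every few batches), something like the following would work. The toy model and data, and the names `batches`, `accumulation_steps`, `train_basic` and `train_accumulating`, are placeholders invented for this example rather than the names used in the real training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny classifier and 8 random batches of 6 items.
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(6, 4), torch.randint(0, 3, (6,))) for _ in range(8)]


def train_basic(model, optimizer, batches):
    """The simple loop: one optimiser step per batch."""
    optimiser_steps = 0
    for inputs, targets in batches:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimiser_steps += 1
    return optimiser_steps


def train_accumulating(model, optimizer, batches, accumulation_steps):
    """First cut: step the optimiser only every `accumulation_steps` batches.

    Not quite correct yet: the loss is not scaled, so the gradients are
    summed rather than averaged across the accumulated batches.
    """
    assert len(batches) % accumulation_steps == 0
    optimiser_steps = 0
    optimizer.zero_grad()
    for ix, (inputs, targets) in enumerate(batches):
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()  # adds to whatever is already stored in .grad
        if (ix + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            optimiser_steps += 1
    return optimiser_steps


basic_steps = train_basic(model, optimizer, batches)
accumulated_steps = train_accumulating(model, optimizer, batches, accumulation_steps=4)
```

With eight batches, the basic loop steps the optimiser eight times, while the accumulating loop with `accumulation_steps=4` steps it only twice.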
An alternative way to do it would be with an inner loop: the outer loop runs once per optimiser step, and the inner one does a forward and backward pass for each accumulation step. Which of those is better would depend on the details of the training loop -- in general, if you wanted the "other stuff" in the loop to be done once per training batch, then you'd want to use the first option, whereas if you wanted it to be done once per optimiser step, the second would be easier. As you'll see in a bit, I went for the second one for my code.

However, there's one small correction that we need to make for either of these to work properly. Remember that when you calculate loss across a batch -- for example, with cross entropy loss -- you're getting the average loss across the batch, so when you do the backward pass, you're getting the average gradients. By contrast, in the accumulation loops above, we're doing a backward pass on the complete loss at each step, so the gradients that are being generated in each backward pass are being added to each other -- you wind up with the sum of all of them rather than the average. So the gradients that the optimiser applied would be larger than they should be by a factor of the number of accumulation steps -- it would be as if we'd multiplied the learning rate by that number (at least with plain gradient descent)!

But that's easy enough to fix. The average gradients over a number of steps are the sum divided by the number of steps, and we can do that division ahead of time just by scaling the loss down: divide the loss by the number of accumulation steps before each backward pass. And that's basically it; with those changes, the original basic training loop becomes one that uses gradient accumulation. The effective batch size is whatever the real batch size is, times the number of gradient accumulation steps.

However, the real training loop that I'm using for these experiments is a bit more complicated than that simple example. There's checkpointing, AMP, and -- most importantly -- it can handle multi-GPU training using DistributedDataParallel. That made things a little bit more complicated. The first thing was to look into the way I was selecting the data to train on.
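Before getting into that, the loss-scaling argument above can be checked numerically. This is a minimal sketch in plain Python with a made-up one-parameter model and made-up data. It divides each micro-batch gradient by the number of accumulation steps, which has the same effect as dividing the loss before the backward pass, because differentiation is linear. It assumes that all micro-batches are the same size.

```python
def mean_gradient(w, batch):
    """Mean gradient of the squared error (w*x - y)**2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
big_batch = [(float(x), 3.0 * x) for x in range(1, 13)]          # 12 samples
micro_batches = [big_batch[i:i + 3] for i in range(0, 12, 3)]    # 4 micro-batches of 3
accumulation_steps = len(micro_batches)

# Gradient from one pass over the whole batch.
full = mean_gradient(w, big_batch)

# Unscaled accumulation: the per-micro-batch averages simply add up.
summed = sum(mean_gradient(w, mb) for mb in micro_batches)

# Scaled accumulation: each contribution is divided by the number of steps.
scaled = sum(mean_gradient(w, mb) / accumulation_steps for mb in micro_batches)

print(abs(scaled - full) < 1e-9)                        # scaled version matches the big batch
print(abs(summed - accumulation_steps * full) < 1e-9)   # unscaled version is that many times too large
```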
My dataset was already in batches, but we had to split those batches up between GPUs. The solution in the code was to work out how many global steps there were -- each global step being one batch going through each GPU on the machine -- by dividing the number of batches by the world size. The world size, if you remember from the DDP post, is the number of processes running in a multi-GPU training run -- one per GPU. Next, in the training loop, I iterated over the global steps, for each one getting the appropriate batch out for the specific GPU that was running the code, based on that process's rank. The rank is a zero-indexed number, unique to each of the per-GPU processes. So this basically split the list of batches into chunks with a length equal to the world size, and then each GPU was fed the batch at its rank's offset into the chunk.

I wanted to keep things shaped such that when I was running with gradient accumulation locally, it would be similar to a cloud run with per-GPU batching. Specifically: when I was training in the cloud, I had eight GPUs with a per-GPU microbatch size of 12, giving a total batch size of 96. Locally, I could fit a batch size of six on my GPU, so I needed to do gradient accumulation over 96 / 6 = 16 steps.

To keep things as similar as possible, I decided that I wanted the concept of a "global step" to match between the runs. In other words, it would expand slightly, from meaning "one batch per GPU" to being "one optimiser step per GPU". So, each time through that loop, we'd do multiple forward-backward passes, and then one optimiser step. That would mean that the best way to do things would be with something much more like the second of the two bits of sample code above -- the one with the inner loop rather than the modulus.
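The original per-GPU batch lookup described above, along with the batch-size arithmetic, can be sketched like this. The names `batch_index`, `num_batches`, `world_size` and `rank` are assumptions chosen for the example rather than the names in the real training code, and the loop over ranks stands in for what would really be separate processes.

```python
def batch_index(global_step, rank, world_size):
    """Index of the batch handled by the process with this rank on this global step."""
    return global_step * world_size + rank


num_batches = 24
world_size = 8                                   # one process per GPU
total_global_steps = num_batches // world_size   # one batch per GPU per global step

seen = []
for global_step in range(total_global_steps):
    for rank in range(world_size):               # in real DDP, each rank is its own process
        seen.append(batch_index(global_step, rank, world_size))

# Every batch is used exactly once across all ranks and global steps.
assert sorted(seen) == list(range(num_batches))

# The arithmetic from the text: 8 GPUs x 12 per GPU in the cloud, 6 per step locally.
cloud_batch_size = 8 * 12
local_microbatch_size = 6
accumulation_steps = cloud_batch_size // local_microbatch_size
print(cloud_batch_size, accumulation_steps)      # prints: 96 16
```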
In code terms, that means an outer loop over the global steps, with an inner loop over the gradient accumulation steps inside it. That required a change to the data lookup; I decided that the batches would be split into one chunk per global step, and then each of those would be split into smaller chunks, with the batch for a given run through the inner loop being looked up from the global step, the process's rank, and the accumulation step. That required a corresponding change to make sure that the amount of training data was divisible by the world size, the per-GPU batch ("microbatch") size, and the number of gradient accumulation steps, but that was easy.

That was enough to get the gradient accumulation happening! Next, I needed to change the backward pass code to scale down the loss so that we got averaged rather than summed gradients. Because we might be using AMP with a scaler, the code wasn't just a simple backward call on the loss, but the change was obvious enough: divide the loss by the number of accumulation steps first. All of those changes put together, plus a bit of shuffling around of code, were enough to get a correct gradient accumulation training loop!

But there was one small tweak I needed to add. When you're using DDP, gradients need to be synchronised between the different per-GPU processes. As a reminder, what happens is that each process does a forward and a backward pass, the gradients are averaged across all of the processes, and then each process steps its optimiser with those averaged gradients. Now, with my first cut of the gradient accumulation code above, that averaging would have happened on every gradient accumulation step, with the optimisers then stepping based on the most recent average.

That would be correct, but not very efficient. We're sending out gradients and averaging on every accumulation step. But because each of our per-GPU processes is keeping its own "local" average (by accumulating the scaled-down gradients), we only really need to send those local averages out and get a global average once, just before we step the optimiser. If we do that, we can save quite a lot of work.

The trick to avoid the unnecessary synchronisation was to use the no_sync method on the DistributedDataParallel class that our own model is wrapped in. What we wanted to do was suppress the gradient synchronisation for each of the accumulation steps apart from the last one. It was easy to work out whether we were on the last gradient accumulation step. Now, what we needed to do was to wrap the backward pass in the no_sync context manager, but only if we were not on the last step.
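The accumulation loop described above, including the conditional suppression of gradient synchronisation, might be shaped roughly like this. It is a sketch under stated assumptions rather than the real training code: `FakeDDPModel` is a stand-in invented for this example that only counts synchronisations (it imitates the idea of DDP's `no_sync`, not its implementation), the index layout is one possible way of chunking the batches, and the standard library's `contextlib.nullcontext` is used as the do-nothing alternative on the last step.

```python
from contextlib import contextmanager, nullcontext


class FakeDDPModel:
    """Hypothetical stand-in for a DDP-wrapped model; counts gradient syncs."""

    def __init__(self):
        self.sync_enabled = True
        self.syncs = 0

    @contextmanager
    def no_sync(self):
        # Backward passes inside this block accumulate locally without syncing.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

    def backward(self):
        # Real code would scale the loss by 1 / accumulation_steps before this.
        if self.sync_enabled:
            self.syncs += 1


def run(model, total_global_steps, accumulation_steps, rank, world_size):
    """Return the batch indices this rank would use, syncing once per global step."""
    used = []
    for global_step in range(total_global_steps):
        for accum_step in range(accumulation_steps):
            # Assumed layout: one chunk per global step, then one sub-chunk per rank.
            ix = (global_step * world_size + rank) * accumulation_steps + accum_step
            used.append(ix)
            is_last = accum_step == accumulation_steps - 1
            # Only the last backward pass of each global step is allowed to sync.
            with nullcontext() if is_last else model.no_sync():
                model.backward()
        # The single optimiser step for this global step would go here.
    return used


model = FakeDDPModel()
used = run(model, total_global_steps=3, accumulation_steps=4, rank=0, world_size=2)
print(model.syncs, len(used))
```

In this toy run there are twelve backward passes but only three synchronisations, one per global step.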
Conditional with statements can be a little fiddly, but Python has a "do-nothing" context manager in contextlib.nullcontext -- that is, running code inside a with block that uses it is identical to just running the code on its own. So we can combine that with the ternary operator to choose between DDP's no_sync and the do-nothing context manager, which does exactly what we want [1]. With that change, I had something I was happy with; you can see the diff here.

So now it was time to do a training run! I'd originally been planning to jump right in and do a training run based on my last cloud run, with all of the interventions I'd decided were worth using, but locally with gradient accumulation. However, I decided that it would be interesting to try doing a new "baseline" train first. I'd done my local training runs, and then established a baseline version in the cloud by taking exactly the same configuration and doing the training run on an 8x A100 40 GiB with an overall batch size of 96. So I could repeat that locally with gradient accumulation, and that would show two things (or perhaps, the same thing but in different lights): whether the larger effective batch size helped the loss as much as the larger real batch size had in the cloud, and whether the resulting model's loss was similar to the cloud-trained one's. That would help confirm my understanding that it was the increased batch size that helped in the cloud, and not, say, some architectural difference -- and would also act as a good test of the gradient accumulation code.

Here's the training run config. I kicked it off, and the startup output showed what looked like the right number of global steps; it matched the numbers I saw when training in the cloud. And 44 hours for the training run seemed correct: my original local runs took 48, but with them I was spending quite a lot of time on validation, which this code didn't do. Just less than two days later, it finished, and that all looked good. The loss chart was similar, but not identical, to the one from the cloud training run with the same config (but using larger batches rather than gradient accumulation). That's pretty much what you'd expect!
The two training runs were on different architectures -- RTX 3090 vs A100 -- and so there will probably be differences in the CUDA kernels, and also PyTorch's AMP (which uses 16-bit instead of 32-bit in cases where it makes sense) might make different decisions. I think that if we'd run it on a machine with one A100, then the results of using gradient accumulation would be even closer (perhaps even identical) to a larger batch size, especially if we were training without AMP. I uploaded the model to Hugging Face and it was time for the evals. The smoke test first: As usual, reasonably coherent. But the important one was the loss on the test set: That's solid! The cloud-trained baseline model got 3.691526, so this local one was actually very slightly better, by 0.007691. But that's very close indeed, which is what we wanted to see :-) It was time to see what effect adding on the interventions would have. As a reminder, here are the changes I made to the config for this run: It did not include QKV bias. Here's the config . I kicked it off, and: It looked like it was going to take 40 hours; that matched what happened in the cloud runs, as removing dropout speeds things up quite a lot. Just less than two days later: The loss chart over the training run looked like this: That's very smooth, with no loss spikes. For comparison, here's the chart when we did the same training run in the cloud; you can see that it was a bit choppier than the local one. The gradient norm chart was also interesting: If you compare it to the one from the cloud training run below, you can see that the local one was actually noisier -- the cloud run has a few gradient spikes near the start but calms down from around global step 6,000 or so, whereas the local one is spiky up to about 3,000, then calm, but has a massive spike at around 10,000. The learning rate we don't need to compare, but it was worth sanity checking to make sure we really did train the right way: So that all looked good. 
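As a rough sketch of the schedule being sanity-checked here, using the figures from this run's interventions (a peak of 0.0014, warmup over 5% of the run, then a cosine decay to 0.00014): the function name, the per-step granularity and the linear shape of the warmup are assumptions for illustration, not details taken from the training code.

```python
import math


def learning_rate(step, total_steps, peak_lr=0.0014, min_lr=0.00014, warmup_fraction=0.05):
    """Learning rate for a zero-indexed step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Assumed linear warmup, reaching the peak on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


total_steps = 1000
for step in (0, 49, 50, 500, 999):
    print(step, learning_rate(step, total_steps))
```

Charting the output of a function like this over the whole run is one way to get the kind of learning rate plot used for the sanity check.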
The training run did have some differences to the cloud one, but (as with the previous baseline train) it looked similar enough. Architectural differences between the A100s in the cloud and the local RTX 3090 seemed like a plausible cause. I uploaded the model to Hugging Face , and it was time to run the evals. The smoke test first: Reasonably coherent -- and I think that's the first time I've seen an token in a smoke test output! But the important one is, as ever, the loss, and: Let's add both this one and the local baseline to the results table for all interventions: That's really weird! The local run with the interventions, , is 0.039600 points better than the cloud version of the same training run, . That's nice, in that lower loss is always better, but it's also rather confusing -- that's a bigger loss improvement than some of the interventions. In theory, all that we changed between the cloud version of this training run, and the local one was the architecture. I was expecting that to have an effect, but thought that it would be small -- as, indeed, it was with the baseline trains and , where you can see the loss difference was just 0.007691 -- about five times smaller. Now, when I was looking into the effects of noise on training loss , I found that changing the random seed that was used to initialise the weights (but starting the training run itself at the same random seed) had a much bigger effect on the resulting model quality than keeping the weights identical but varying the seed at the start of the post-initialisation phase of the training run. The standard deviation of the varied-weights, same-train models was about double the SD of the same-weights, varied-train. That was interesting, though not directly comparable -- those tests were done with the same training run, but the architecture held constant -- a 8x A100 40 GiB machine for each test. 
However, it felt like it would be a good idea to at least see whether we started with the same weights locally and when training in the cloud. My suspicion was that we probably would; the weight initialisation uses deterministic non-GPU code, so with the same seed we'd expect the same weights regardless of the computer. The similarity of the loss results for the local and cloud baseline training runs also seemed to point in that direction. But it was worth testing. I created a throwaway branch of the training code, which -- after creating the model -- just dumped the model weights to a file, then exited. I ran it locally using the config, and then I fired up yet another 8x A100 40 GiB machine on Lambda, ran the same code there, this time with the config, and then ed down the weights. Identical. That was reassuring! I considered doing more analysis on this; for example, in my investigations into noise, I found that keeping the same weights but altering the random seed for the rest of the training run, I got results with a standard deviation of 0.008672 -- more than four times smaller than the difference between the local and cloud trains with the interventions. Might that be a number I could use for some kind of comparison? However, I decided that it's not really comparable. That number was from varying the random seed, but keeping the same architecture. There's not really any solid reason to believe that keeping the seed constant but changing the architecture would cause the same kind of differences. They might be more similar, they might be less. I think that all we can really say here is that the change of machine changed some aspects of the training dynamics in a way that happened to get us a lower loss. I can easily imagine that if I'd done something slightly different -- used a local RTX 4090, for example -- it could equally well have gone in the other direction. 
And at least it's reassuring that the improvement was smaller than the interventions I was most convinced by; the only smaller ones were full-fat float32, gradient clipping, and QKV bias -- ones that I'd already decided might have only been beneficial due to noise. Most importantly, it was more than six times smaller than the 0.252474 improvement I originally saw when I moved from local training to larger-batch cloud training.

So, I think that that brings me to the end of this set of training experiments. We started with a locally-trained model that got a loss of 3.943522 on our test set, compared to the original GPT-2 small model, which got 3.499677 [2]. I've tried a bunch of interventions to try to get my model closer, and finally I've managed to get almost all of the way there, to 3.538161. That's really pleasing!

I think that there are two things to do before I can fully wrap up this "interventions" mini-series, and get back to the main-line LLM from scratch stuff. Firstly, I should revisit the instruction fine-tuning tests, which I put on hold while doing these training runs. That would give us some indication as to whether the loss improvement was just a technical improvement that made a number go down, or whether it actually improved the usefulness of the model. Secondly, I think I really need to write a wrap-up. I've been working on this stuff on and off since December, and I think a summary of what I did would be quite nice! I'll post soon; don't touch that dial :-)

[1] Thanks to this Stack Overflow answer for that trick.
[2] I'm going to switch to six decimal places from now on -- previously I was rounding it to three, hence 3.500.

Lists referred to in the post above:

What happens in a normal DDP training step:
1. Each process does a forward pass.
2. Each process does a backward pass.
3. When they have the gradients, they essentially share them so that each process has an average of the gradients from all of those backward passes.
4. Then they all step their optimisers to apply the average gradients to each process's copy of the model.

What would have happened with the first cut of the gradient accumulation code:
1. For each gradient accumulation step: each process does a forward pass, each process does a backward pass, and the average is worked out.
2. They all step their optimisers based on the most recent average.

The two things the local baseline run would show:
- Whether the increased effective batch size had as positive an effect on the loss as the increased real batch size did when I did my cloud runs.
- Whether the locally-trained gradient accumulation model was similar to the cloud-trained big-batch model in terms of its loss.

The changes made to the config for the run with the interventions:
- Gradient clipping at 3.5
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.
- Weight decay changed from 0.1 to 0.01
- Dropout removed

Giles's blog 1 month ago

Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud

Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090 , using code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that gap, and had a list of potential changes to the training setup, and to the model itself. Which of them would help? I found a list of solid-looking interventions, and in my last post I came to the conclusion that the improvements in loss I had seen with all of them -- with two possible exceptions -- seemed unlikely to be in the noise. What would happen if I tried to put them into a new model? Let's start by looking at the results that we have for the interventions so far -- this is the table I've been using as I go through them, but I've updated it to contain the loss figures for each model to six decimal places instead of three, and made each model name link to the associated post. I've also corrected the loss for the model, which was mistakenly using the training loss at the end of the run rather than the loss on the test set 1 . As I've mentioned before, simply moving to training in the cloud improved things markedly, getting loss down from 3.944 to 3.691526; I suspect this was due to having a closer-to-optimal batch size (more about that in my next post). What to do about the other interventions, though? It seemed clear that two of them were not helping: weight tying, and the one using the figure for weight decay that I'd (I suspect incorrectly) derived from a paper by Cerebras Research. The "no-AMP" run (which would be better described as "full-fat float32") had a small positive effect, but was so costly in terms of both time and money that it wasn't worthwhile. So we had five interventions to try: How would they stack up? 
It seemed pretty unlikely that their independent contributions would just sum up neatly so that we got a total improvement of 0.013209 + 0.022141 + 0.048586 + 0.050244 + 0.089609 = 0.223789 (though that would certainly be nice!). One question to consider was how independent they were. For any set of interventions, you can imagine them being independent and adding up nicely, or pulling in separate directions so that the combined effect is worse than the sum, or pulling in the same direction so that they amplify each other. My intuition was that gradient clipping and removing dropout were pretty independent, at least conceptually. They might affect other interventions indirectly (eg. via changing the training run's use of the random number generator) but they'd be unlikely to have a direct effect. QKV bias I was less sure about, but it seemed -- again, just intuitively -- at least reasonably independent of the others, with one important exception (which I'll get into below). By contrast, weight decay and the learning rate interact together quite strongly, at least in standard gradient descent, and I'd tested them in isolation. The result for changing the weight decay to 0.01 was based on a fixed learning rate of 0.0004, and the result for scheduling the learning rate was based on a weight decay of 0.1. That felt like an issue, and definitely needed some thought. Additionally, there were some issues with which interventions might have not had a real effect, and instead just been the results of the use of randomness. While my analysis of how that might have affected things was somewhat limited by the number of test runs I could afford to do, it did show up two plausible issues: After some thought, I came up with a plan. 
If I were doing this properly and scientifically, I suppose I'd try every combination of interventions, but that would be ruinously expensive 2 , so a sensible minimal set of training runs felt like this: When those completed, I'd find the test set loss for both models. I'd choose the best run, and then do another run with those settings, but with weight decay switched back to the original value of 0.1. I chose to revert weight decay rather than the learning rate stuff because this was the one I was least sure about -- the updated "GPT-2" value of 0.01 is very unusual by today's standards, and I'd come to it via a rather circuitous route -- see the post for more details. The best of the three runs would be the winning combination of interventions. Again, this was not an exhaustive plan 3 . But it seemed to make sense. Let's see how it turned out. Just to recap, this one had these interventions against the baseline: It did not have QKV bias. You can see the config here . Here's the loss chart over the course of the training run: As normal with learning rate scheduling, I also charted that to make sure it was doing the right thing (you can see that it was): And I also tracked the gradient norms -- you can see that there was some clipping happening near the start of the run: At the end of the run, it reported this: That's a slightly lower final train loss than normal, and it took 3h10m, which is faster than usual, but about the same as the other train we did without dropout -- that makes sense, as the process of zeroing out random activations isn't free. I downloaded the model -- here it is -- and then ran the smoke test: ...and got its loss on the test set: Not bad at all -- the best result we've had so far, albeit not quite up to the standard of the original GPT-2 weights. Now the next one, with QKV bias. This one had these interventions: You can see the config here . 
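Gradient clipping at 3.5 is one of the interventions used in these runs. As a rough sketch of what clipping by global norm means, here is a plain-Python version that works on a flat list of gradient values. The function name and the flat-list representation are assumptions for the example. In PyTorch this job is normally done with `torch.nn.utils.clip_grad_norm_`, though whether the training code here uses exactly that call is also an assumption.

```python
import math


def clip_by_global_norm(grads, max_norm=3.5):
    """Scale all gradients down together if their overall L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # Every gradient is scaled by the same factor, so the direction is unchanged.
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])      # norm 5.0, above the limit
untouched, small_norm = clip_by_global_norm([1.0, 2.0])     # norm below the limit
print(norm_before, clipped)
print(small_norm, untouched)
```

The norm returned before clipping is the kind of value that the gradient norm charts track, which is why spikes above the threshold show where clipping kicked in.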
Here's the loss chart: ...the learning rate: ...the gradient norms (note that we had more clipping, about halfway through): ...and the final printout at the end. That final train loss is slightly higher, which is normally an indicator that the test loss will be higher, but we'll have to see. Time to download the model -- here it is -- and on to the smoke test: ...and then the moment of truth -- what was its loss on the test set? As I suspected from the training loss at the end, slightly worse than the run without QKV bias. So, that meant that we should do the next run, with a weight decay of 0.1, with no QKV bias. Given the above results, this one had these interventions vs the baseline: Weight decay was back to the baseline value of 0.1, rather than the value of 0.01 used in the previous two runs, and QKV bias was switched back off. You can see the config here . Here's the loss chart: You can see that it's much choppier than the previous two runs; that initially surprised me, as the higher weight decay means that we're regularising the model more than we were with those, which I thought would "calm things down". But on reflection, I had it backward. Hand-waving a bit, a more regularised model is fitting less closely every detail to the data it has seen, considering the typical stuff more than it does the outliers. That means that when something a bit more out-of-distribution appears, it might not have yet learned how to integrate it into its model of the world. Well, it sounds plausible, anyway :-) On to the learning rate (just to double-check), and it's fine: And again, the gradient norms: ...which similarly to the loss chart show more occasions where gradients spiked and had to be clipped -- even towards the end of the training run this time. The final printout at the end: Once again, although the final train loss is not definitive, it tends to be indicative of the test loss. 
It's in between the last two runs, so we'd expect the test loss to be likewise in between theirs: Time to download the model -- here it is -- and on to the smoke test: Hmm. At least vaguely coherent, though I'm not 100% convinced. It looks like ads for personal injury lawyers have crept into FineWeb somehow... Still, it's time for the test loss (drumroll): As predicted from the train loss, it's in between the two runs above. Let's put these three runs into the results table: As a reminder: You can see that adding on QKV bias actually made the model worse than the learning-rate-only intervention. That pushes me slightly away from the "it's all about the initial weights" direction; perhaps instead the bias adds some kind of stability that the learning rate scheduling also provides, and they fight against each other? Unfortunately I think the only way to pick it apart would be to do a full set of runs, switching each intervention on and off independently, and that would be too costly. The fact that the weight decay change from 0.1 to 0.01 actually did help when combined with the learning rate change and scheduling was a bit of a surprise; because they're both coupled when we think about standard gradient descent, I was expecting them to be too intertwined for my tests of them in isolation to have been valid. Quite pleased that it didn't work out that way, though, because sweeping across values for different parameters is much easier than it would be if they were connected. However, at this point it occurs to me that it might be because we're using the AdamW optimiser. As I understand it, its big difference versus Adam is that it decouples weight decay. I don't have a solid mental model of what that means exactly (will read up and post about it eventually), but it certainly seems pertinent here. Anyway, I have to say, I'm both pleased with and disappointed by these results. 
Pleased because we got a result by putting interventions together that was better than any of them in isolation, but disappointed that the end result wasn't even better. The difference between the cloud baseline's loss, at 3.691526, and original GPT-2 small's, at 3.5, was 0.191526. Our best result was 3.577761, so an improvement of 0.113765. That's about 60% of the way there.

That said, by sheer chance, while trying out the different sizes of cloud machines, I'd got from a loss of 3.944 training locally to the baseline's value of 3.691526 -- I suspect due to the fact that training in the cloud meant that I could use batch sizes of 96. So a different way of looking at it is that we should include that in the calculations too. From 3.944 to 3.5, the gap with GPT-2 small was 0.444. And we went from 3.944 to 3.577761, an improvement of 0.366239. And that means that we managed to get 82% of the improvement we needed. On the other hand, it means that in terms of my improvements, 0.252474 came from a happy accident, while all of my careful work on interventions only got me 0.113765. :-(

Anyway, I think that for now, I'll have to rest happy with that as a result -- and next time around, let's see if we can get to the same level of improvement locally, using gradient accumulation.

[1] Luckily the difference was small enough that it doesn't change any of the conclusions I'd made about it.
[2] Because there are five interventions, and each can be on or off, then it's equivalent to a 5-digit binary number. So that's 2^5 trains, less the five ones I'd already done and the baseline, for a total of 32 − 6 = 26. At US$50-odd for a train, that's definitely a no-go.
[3] I did also consider changing the random seed at the start of the code to 67 rather than 42, given that it seemed to provide better initial weights when I was exploring the effects of random noise on the training. I even started the first two training runs with that in place. However, on reflection I realised that it would be one step too far away from scientific rigour. I'm not trying to be 100% rigorous in these posts, but it seemed like a step too far to diligently test all of the interventions against one seed, and then YOLO in a different one for the final training runs.

Lists referred to in the post above:

The five interventions to try:
- Gradient clipping.
- QKV bias (that is, adding bias to the attention weight matrices).
- Changing weight decay to the GPT-2 value (0.01 rather than the 0.1 that is typical nowadays).
- Removing dropout.
- Updating the learning rate from 0.0004 to 0.0014, but also scheduling it so that it varies over the course of the training run.

The two plausible issues that the noise analysis showed up:
- Adding gradient clipping looked like it might have been within the training run noise.
- Adding QKV bias would have had a large effect on the model's initial weights. All of the others would have started with essentially the same weights (apart from weight tying, though even that would have had the same values for the initial weights apart from the tied ones). But adding the bias would have completely changed them, and its effect size was comfortably within the range of differences you might expect from that.

The minimal set of training runs:
- Start a training run with all of the interventions apart from QKV bias.
- In parallel (Lambda instance availability permitting) run another one, with all of the interventions including QKV bias.

Interventions in the first run (without QKV bias):
- Gradient clipping at 3.5
- Weight decay changed from 0.1 to 0.01
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.

Interventions in the second run (with QKV bias):
- Gradient clipping at 3.5
- Weight decay changed from 0.1 to 0.01
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.
- QKV bias switched on.

Interventions in the third run (weight decay back at 0.1, no QKV bias):
- Gradient clipping at 3.5
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.

Reminder of what the three runs in the results table were:
- The first run was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01, dropout removed, and the learning rate intervention, but no QKV bias.
- The second run was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01, dropout removed, and the learning rate intervention, with QKV bias.
- The third run was gradient clipping at 3.5, dropout removed, and the learning rate intervention, but no QKV bias, and no change to weight decay.

Giles's blog 1 month ago

Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?

Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090 , using code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot more than the original GPT-2's 3.500 on the same dataset. I wanted to see whether I could train a model on my own hardware (or on something that didn't cost too much to rent in the cloud) that got closer to the original model's performance. So over the last few months, I've done a bunch of further training runs, each one testing a specific intervention -- a stand-alone change that I expected to change the loss, either for better or for worse. Specifically: At the end of all of that, I had this table showing the effect of each intervention in terms of loss on the test set. They're sorted from least-effective to most-effective, and you can see the baseline in there too: Winners and losers are reasonably clear: So, for an optimal train, we'd just use the effective interventions, right? Well, not quite. Full-fat float32 I decided wasn't worth the effort, as it meant that the train took more than twice as long, and (because it required a larger machine), cost more than three times as much. The others did look like solid changes, but there was one concern. The effect of each intervention is actually pretty small. For example, gradient clipping reduced the loss by 0.014, from 3.692 to 3.678. That's a 0.3% improvement. Even the best intervention, scheduling the learning rate, only improved things by 2%. Could it be that some or all of these improvements were not real, but just a result of the random nature of training deep neural networks? Could the differences just be in the noise? 
They seemed small enough for that to be possible. I've trained seven more models over the last few days to try to get a feel as to how big an effect noise has for this kind of training run. The results appear to show that variations in the initial weights matter quite a lot, but randomness in the training loop (given the same initial weights) actually has a fairly minimal impact. That surprised me a bit! Let's go through the details. When I did the original baseline training run -- creating the model that was the comparison point for all of the interventions -- I wanted to minimise the amount of random number-induced differences between the training runs in this interventions series. I did this by setting the random seed at the start -- specifically, I had this code: At the time I wrote it, this seemed pretty complete -- the seed is set on Python's own random number generator, on PyTorch's, and on the separate ones it uses for CUDA. However, in a separate project, where I was fine-tuning a Qwen model as a classifier, I'd found that this wasn't enough. In order to get full reproducibility, I'd had to lock things down a bit more, with this additional code: So: was my random number seed code enough for this case? Or would I get a different model if I ran the same code a second time? That was easy enough to do; I spun up a machine, and just ran the "baseline" train again. 3 hours 24 minutes later: Interestingly, that was exactly the same final train loss as the original baseline train. Here's the model . I ran my normal smoke test, asking it to complete "Every effort moves you" ...so that was OK -- the model was generating reasonably coherent text. Then I ran the eval to find its loss on the test set: Exactly the same as the original baseline! That was certainly promising. 
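The seeding code isn't reproduced above, so here is a minimal sketch of the kind of helper being described: one call that seeds Python's generator, PyTorch's, and PyTorch's CUDA ones. The name `seed_everything` and the `strict` flag are made up for illustration, the PyTorch import is optional so that the snippet still runs without it, and the stricter settings are an assumption about what "locking things down" could involve rather than the exact code from either project.

```python
import random

try:
    import torch  # optional: only needed for the PyTorch generators
except ImportError:
    torch = None


def seed_everything(seed: int, strict: bool = False) -> None:
    """Seed every random number generator a training run might draw from."""
    random.seed(seed)
    if torch is None:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if strict:
        # Assumed extra lock-down: prefer deterministic kernels everywhere.
        # On CUDA this may also need the CUBLAS_WORKSPACE_CONFIG env variable.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True)


# Same seed, same sequence: the property the re-run of the baseline relies on.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Calling the helper once, before the model is created, means that both the weight initialisation and everything that follows draw from a known starting state.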
Now, the use of three decimal places for the output from the loss eval is just a formatting thing, so I bumped it up to 6 dps, and the new model got this: Running that against the original baseline model: Again, exactly the same. Finally, more out of idle interest than anything else, I decided to see if the models were at least different: That is, quite frankly, amazing to me. I was expecting pretty close results, but what we're seeing here is that two separate models, trained on the same data, but on different machines more than a month apart, have weights that are bit-wise identical. No random noise at all. That's actually really reassuring! It makes me much more comfortable that we're standing on a stable foundation here. Now it was time to see what effect changing that random seed would have. Let's think about what the random seed does. When we call , we're initialising Python's pseudo-random number generator so that it will start at a particular point -- after we've called it, it will generate the same sequence of "random" numbers each time it's asked for a new one. So the effect of this code: ...is to initialise three separate pseudo-random number generators to be in a known deterministic state, so they'll all generate the same sequence in every run. So, the first thing to do was to see what happened if we changed that number. I decided to do two training runs, each with exactly the same code as the baseline, but with different random seeds. Firstly, I changed it from 42 to 22 1 : That training run completed: Here's the model . Time for the evals; the smoke test: ...and the loss test: So, that's 3.673453 compared to 3.691526, an improvement of 0.018 over the run with a seed of 42. That's more than the 0.014 improvement we got from gradient clipping (and indeed, the 0.013 from full-fat float32 training), and quite close to the 0.023 improvement from adding attention weight bias. Time for another training run: Another 3h24m later: Here's the model . 
The smoke test: ...and the test set loss: A further improvement! That's 0.038 better than our original baseline, which beats adding on attention weight bias (though it's worse than the weight decay update). Now, three data points is rather a small number for any kind of statistical analysis, but just out of interest, let's do the basics. GeeksForGeeks has a good refresher here if you're a bit rusty. Firstly, our mean is ...and our variance 2 is: If we take the square root of that, we get the standard deviation (SD): So, if we assume a normal distribution, what would that say about our results? Here's the results table again. If we assume that the results are on a normal distribution: That seemed a bit saddening -- were all of the results apart from scheduling the learning rate within the noise? Well, so as I said, three data points is too small a number to take those results without a fistful of salt. I was thinking of perhaps trying another few random seeds to see what would happen, and perhaps to tighten those numbers up a bit, but then something occurred to me -- randomness was being used in two different ways in the training run, and perhaps we could separate them? Where do we use the random numbers? Well, immediately after we set the seeds, we create our uninitialised model for training: One of the random number generators -- Python's, PyTorch's, or one of the CUDA ones -- will be used to generate the initial weights that we're going to start training. That means that for the same model setup , we'll always start with exactly the same weights. But if the model settings change such that we initialise different things in a different order, then we'll have different weights. After we've done that, we go into the training loop. That can have randomness in it; although the AdamW optimiser itself is deterministic, we are (in all but one of these training runs) using dropout, which drops a random bunch of activations at various points -- 10% of them with our config. 
And it seems entirely possible that each of the interventions could change the order of execution of different steps in non-obvious ways, which would lead to dropout being applied in different ways in different runs. So, the question was: what kinds of randomness -- in terms of the initial weights, or in terms of the training run -- did each intervention potentially change vs the baseline? Disregarding the full-fat float32 run: Given that, I wanted to get two measures of how sensitive to noise each phase of the training run was: the initialisation of weights at the start, and the training run itself. I decided to start by nailing down exactly what the training run started with. We already had a baseline training run with a specific state of the random number generator at the start; in our "real" baseline, we seeded with 42 at the start, and then initialised our weights. After that, the random number generator would have reached some specific state based on its initial seed and how many numbers had been generated so far. Now, in theory, we could get the RNG into that specific state by seeding it with some number A at that point. We don't know what A is, of course. But it seems vanishingly unlikely that it would be something we'd come up with -- specifically, we can be pretty sure that A ≠ 23 and A ≠ 67 . So, I put the old initial seed of 42 back in, but re-seeded after the model had been initialised: Firstly, with a re-seed value of 23: I let that run.... ...and got this model . Time for the normal evals: Next, I did another training run, the same as the previous one, but with 67 instead of 23 for the re-seed: That one ran: ...producing this model , which eval'ed like this 3 : Let's bring those together: That's a mean of ~3.684462, with a variance of ~0.0000752 and a standard deviation of ~0.008672. Those are tiny compared to the numbers from the two trains we did with the change of the seed prior to the model initialisation. 
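As a sanity check on those three numbers, here is a small sketch that recomputes them from the test-set losses quoted in this post (the baseline, then the re-seeds of 23 and 67). It assumes the same n denominator used for the earlier variance, and the variable names are just for illustration.

```python
from math import sqrt
from statistics import fmean, pvariance

# Test-set losses quoted in the post: the baseline, then re-seeds 23 and 67.
train_seed_losses = [3.691526, 3.681356, 3.680505]

mean = fmean(train_seed_losses)
# Sum of squared deviations from the mean, before dividing by anything.
sum_sq = sum((x - mean) ** 2 for x in train_seed_losses)
# Population variance: the sum of squares divided by n.
variance = pvariance(train_seed_losses, mu=mean)
sd = sqrt(variance)

print(f"mean={mean:.6f} sum_sq={sum_sq:.7f} variance={variance:.7f} sd={sd:.6f}")
```

Worked through this way, the mean matches, but the ~0.0000752 quoted above lines up with the undivided sum of squared deviations; dividing by n gives a variance nearer 0.0000251 and an SD of about 0.0050. So the exact SD quoted for these runs is best treated with some caution, although the direction of the conclusion is unchanged: the spread from varying only the training-run seed is small.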
That actually surprised me a bit; we're using dropout in all of these training runs, and it's dropping a random 10% of activations in every forward training pass. With our different training run starting seeds, they should be getting very different dropout patterns. Hand-wavingly, perhaps over the three million or so sequences we're training on, it averages out? Still a little counterintuitive, though. Anyway, let's take a look at the intervention results again, this time highlighting the ones that we believe will be starting with the same weights: Using the "99.7% should be within three SDs" heuristic, we get a range of 3.658446 - 3.710478. Of the intervention runs with (I believe) stable weights, only the no-AMP and the gradient clipping ones are within that range. That made me feel quite positive. If my beliefs are correct about which runs have the same weights, then noise in the training runs seems unlikely to be causing the differences -- that is, perhaps the results from the interventions for those same-weight training runs are real signal and not just noise. What would happen if instead of pinning the seed for generating the weights and varying the starting seed for the training run, we varied the weight seed and pinned the training one? We'd already done a training run with a seed of 42 before generating the weights and a re-seed to 23 after that: So I decided to see what would happen if I varied the pre-weights initialisation seed. Let that train: ...getting this model . Evals: Next, one with 67 as the weights initialisation seed: That trained: ...getting this model , and 4 : OK, so here we have: Compared to the SD we got when we varied just the initial seed, 0.0154919, it's not too far off. 
Using the 3-SD rule, we get a range of 3.637030 - 3.709400, and looking at the table again, this time with the ones that we don't expect to have the same weights highlighted: ...we can see that the QKV bias is well within that range (as are all of the interventions apart from the two negative-effect ones and scheduling the learning rate). Right, what does all of that tell us? This post obviously isn't even trying to be statistically rigorous. The number of training runs I've done and the amount of data is way too small for that. However, training runs are expensive (Lambda have raised their prices again, so these cost more than US$50 each!), so there's a limit to how much I can do. But even with the limited amount of data, something seems pretty clear: "One of these things is not like the others". Keeping the model weights stable and only allowing variation in randomness across the training run itself meant that almost all of the differences between training runs disappeared. Could this be a result of the small number of samples? I guess conceivably it might, but it seems vanishingly unlikely. So I feel reasonably confident in saying that the bulk of the variation in results that we can chalk up to random noise in these training runs comes from variations in the model weights' initialisation. Additionally, the first training run in this post -- the re-run of the baseline model with no changes -- gave exactly the same numbers as the original baseline run. So we can be confident that all of the models with no changes to the weight initialisation started with the same weights. Of course, I could be wrong about which models really did have the same weights, but given that they were running the same code with the same seed, I'm pretty much sure. That makes me fairly confident that the intervention runs that had the same initial weights gave a real signal about whether or not the intervention in question actually helped. 
The only exception is gradient clipping, which fell within the three-SD range for the same-weights tests -- and it's essentially free, adding just 100 seconds to a three hour training run. That's a really interesting result! As I said earlier, given that dropout is making us ignore a random 10% of activations during the training run, I would have thought that changing which random 10% were being ignored would have a much larger effect. And that's not even considering other sources of random noise in the training run. I was less surprised that model weight initialisation was important, though. It's pretty obvious that your starting position in the loss landscape is going to affect where you end up at the end of the training run. Still, we now have a reasonable level of trust that our interventions gave a real signal, so I think we have everything in place to see how they stack together, and do a best-effort training run. Can we approach the original GPT-2 small weights' performance on our test set loss? It should be fun to find out :-) Numbers chosen based on a misremembering of this XKCD . For some reason (perhaps because it rhymes) I thought that the old-timey funny number thing was "22 skidoo" rather than "23 skidoo".  ↩ On working through this later: with n samples from a dataset, it is (as I understand it) best to use n − 1 as the denominator here (Bessel's correction) for the "sample variance". If we had every possible value, then it would be correct to use n . However, while this changes a few details in the analysis, I don't think it changes the final conclusion of the post meaningfully (it would just bump up the SDs by 22% or so), so I've left it as-is.  ↩ I found it interesting that this model does the "you and I" hypercorrection that so many people do when trying to write formally! Based on the (correct) correction of "me and you move back home" to "you and I move back home", I think as a result of excessive pattern-matching.  
↩ Another grammatical error based on pattern-matching -- it would make sense that the possessive form of "it" in English was "it's", just like the possessive form of "John" is "John's".  ↩

I trained a baseline model on an 8x A100 40 GiB per GPU machine on Lambda (which was better than my original locally-trained model, I believe due to the larger batch size that the larger machine made possible).
I tried adding gradient clipping to see if that would help by limiting the effects of loss spikes.
I tried removing dropout, given that these days people tend not to use it (because we're doing single-epoch training runs).
I tried adding bias to the attention weight matrices -- something that was popular back in the GPT-2 era, and was used by the original weights, but which my code did not use.
Instead of just using the learning rate of 0.0004 that was used in the code from the book, I looked into what values people use these days, and learned how to schedule it over the course of the training run.
Similarly, I learned more about weight decay and tried some alternative values.
Then I tried making my model more like the original GPT-2 one by introducing weight tying to see if that would help.
Finally, I decided to try training in "full-fat" float32 instead of using PyTorch's AMP and TF32 matrix multiplication performance enhancements.

Weight tying and the number for weight decay I derived from a paper by Cerebras Research (probably without understanding it properly) were negatives.
Full-fat float32, gradient clipping, attention biases, the GPT-2 weight decay parameter, removing dropout, and scheduling (and updating) the learning rate were positives.

We would expect ~68.2% of results to be within one SD of the mean -- that is, between 3.6573651 and 3.6883489. Interestingly, our actual baseline result is outside that range! But it does include both the gradient clipping and the QKV bias results.
We would additionally expect ~95.4% of the results to be within two SDs, which is 3.6418732 to 3.7038408. That includes our baseline and our weight decay result (though not our experiment removing dropout -- the six-DP loss number for that is 3.641282).
Finally, we'd expect ~99.7% of results to be within three SDs, which is a range from 3.6263813 to 3.7193327. That covers all of our positive results apart from scheduling learning rate!

Gradient clipping: randomness only affected the training run -- the weights it started with would have been exactly the same as the baseline model's.
Removing dropout: although this is a parameter on the model, I don't think it changes the initial weights. But in the training run, it certainly does affect randomness by removing its use of the random number generator.
Adding bias to the attention weights. This will change both the initial weights -- because we have those bias weights, things will be initialised differently -- and as a result, the training run, as the random number generator will have been sampled a different number of times prior to the run.
Changing and scheduling the learning rate certainly should not change the initial weights, but it might conceivably have a non-obvious effect on training.
Likewise weight decay; no effect I can see on the initial weights, but it could well change training dynamics.
Weight-tying. When I added it to the code, I tried to do so in such a way that the other weights would be unaffected -- I created exactly the same weights as I would without weight tying, then threw away the output head and replaced it with a reference to the input embedding weights. So I think that in theory, this one won't have changed the other model weights (apart from ignoring the initialised-but-thrown-away output head), but it could well have changed the training run.
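The one-, two- and three-SD ranges quoted for the runs that varied the initial seed can be reproduced from the mean (3.672857) and SD (0.0154919) given for them. This is only a sketch of that arithmetic: `sd_band` and `within` are names invented for the example, and the losses checked are ones quoted in this post.

```python
def sd_band(mean: float, sd: float, k: int) -> tuple[float, float]:
    """Return the (low, high) range lying within k standard deviations of the mean."""
    return mean - k * sd, mean + k * sd


def within(loss: float, band: tuple[float, float]) -> bool:
    """True if the loss falls inside the band, bounds included."""
    low, high = band
    return low <= loss <= high


mean, sd = 3.672857, 0.0154919
bands = {k: sd_band(mean, sd, k) for k in (1, 2, 3)}

for k, (low, high) in bands.items():
    print(f"{k} SD: {low:.7f} to {high:.7f}")

# Losses quoted in the post, checked against the bands.
print("baseline within 1 SD:", within(3.691526, bands[1]))
print("gradient clipping within 1 SD:", within(3.678, bands[1]))
print("no dropout within 2 SDs:", within(3.641282, bands[2]))
```

With only three samples behind the mean and SD, the bands are rough guides rather than real confidence intervals, which is the caveat the post itself makes.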
Our normal baseline: weights initialised with seed 42, and training run starts with a "seed" of our imaginary A value from above: 3.691526
The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356
The second run above: weights initialised with seed 42, and training run starts with a seed of 67: 3.680505

The first run above: weights initialised with seed 42, and training run starts with a seed of 23: 3.681356
Mean: ~3.673215
Variance: ~0.000145
SD: ~0.012062

Varying the random seed at the start, prior to initialising weights, and not constraining the starting point for the training runs, gave a mean of 3.672857, with an SD of 0.0154919.
Keeping the same seed for model weights (so that they all started with the same weights), and varying the seed for the training run, gave a mean of 3.684462, with an SD of 0.008672.
Varying the seed for the model weights (so that they all started with different weights), and keeping the training run seed pinned, gave a mean of 3.673215 and an SD of 0.012062.

Giles's blog 1 month ago

Writing an LLM from scratch, part 32h -- Interventions: full fat float32

This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Back when I did my first training run for a base model, on my local RTX 3090 , I used two optimisations: The first of those boosted training speed from 12,599 tokens per second to 15,402 in my test harness, while AMP on its own boosted it to 19,921 tps (and also allowed me to increase the batch size from 5 to 6). Doing both appeared to hit some kind of diminishing returns -- it maxed out at 19,997 tps, only a little better than AMP on its own. But intuitively, you'd expect that might come at a cost. While I'm sure the PyTorch developers have solid understanding of where switching to 16-bit will have a minimal impact on training quality, it seems too good to be true that it would have no impact at all. Let's see what happens if we switch both of these optimisations off! I added a new flag to the config file for the training harness, with a default of 1 . The core implementation was pretty simple; where we had the call to , we needed to guard it: ...and where we did the forward pass and the loss calculation, we had to not wrap it in a : We also had to avoid unscaling when clipping gradients ; I did that by just not creating a scaler when in non-AMP mode, and then: ...and likewise, instead of using the scaler to step the optimiser, we step it directly if we don't have one: However, there was an issue: non-finite gradients. As I discovered when looking into gradient clipping , the scaler was actually doing something quite useful for us. Somewhat buried in the AMP recipes page is a comment: Now, from the gradient clipping train, I'd come to the conclusion that we were occasionally getting non-finite gradients, and the scaler was saving us from applying junk updates when that happened. 
If our new code was stepping the optimiser directly, we'd not have that safety net. We'd need something to save us from that. My first cut at this was to use the one other API feature I'd seen that handled non-finite gradients for you: has a parameter, so if we were using gradient clipping, we could set that to and use the exception to skip stepping the optimiser if it was raised. To avoid actually doing any gradient clipping when that happened, if we did not have gradient clipping explicitly enabled, we could set the to infinity. Here's the code for that version . I wasn't very happy with it, though. The use of a gradient clipping API just for its side-effect of telling us about non-finite gradients felt a bit ugly, and even worse, the exception it raised was just a generic , not a custom exception type, which meant that I had to distinguish between it and other by looking at the exception message -- not terribly safe, as that's something that could easily change in the future. So I switched to a more explicit, simpler version: scan through the parameters looking for non-finite gradients, and skip the optimiser step if any are found: I did have some concerns about the performance impact of that; on my local machine it took about 0.13 seconds to scan all of the parameters like that for one step. However, it's better than failing to train the model at all due to garbage updates! So with that, it was time to do the training run. It was pretty clear that I would not be able to run this with my normal microbatch size of 12 on the 8x A100 40 GiB machines that I'd been using so far for these intervention tests -- AMP and the lower-precision matrix multiplications save a bit of VRAM, and I was already pretty much at the limit of what would fit in there. 
Changing the batch size would make this a poor test of the effects of removing the FP precision stuff in isolation, so I decided that the safest minimal change was to use a machine with more VRAM -- specifically an 8x A100 80 GiB, as that was the closest to what I was using (switching to eg. H100s would add all kinds of confounding changes). The next problem was getting any kind of machine at all! Lambda (they appear to have rebranded away from "Lambda Labs") very rarely seemed to have any available instances, never mind the specific type that I wanted. Eventually, I put together a system to poll their API and launch an instance when one was available. At 3:25am today 2 , I got a Telegram message from the script saying that it had managed to find and start one. I kicked off the training run, and watched as it got started. I could see it was using 43.8 GiB/GPU, so it definitely did need the larger instance type. And it quickly became clear that this was going to be a long one -- it was estimating 8 hours to do the complete run! In a way that was good news, though, as I could just set an alarm and go to bed. When I woke up, it was done: That's 8h7m. For comparison, the baseline train took 3h24m, so we're taking more than double the time. Cost-wise, things were even worse -- more than US$135 in server costs, because as well as needing the server for much longer, being a larger machine it cost US$16.48/hour rather than $11.84. So that's more than three times as expensive as the US$42 that a typical recent train has cost me (Lambda raised their prices, so it went up from about US$35 in February). Still, at least it looked like a solid run: Very similar to the others we've seen in this series. Time to upload it to Hugging Face Hub , and on to the evals to see if all of this extra cost was worthwhile. Firstly, the smoke test -- how did it complete ? Not bad at all! But the important metric is the loss on the test set, and for that I got 3.679. 
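Going back to the time and cost figures for a moment, here is a rough sketch of the arithmetic behind them. It multiplies only the quoted training durations by the quoted hourly rates, so it comes out slightly below the totals above; the assumption is that the real bills also covered setup and eval time on the machine. The function name is made up for the example.

```python
def run_cost(hours: int, minutes: int, usd_per_hour: float) -> float:
    """Cost of renting a machine for the given duration at an hourly rate."""
    return (hours + minutes / 60) * usd_per_hour


# Durations and rates as quoted in the post.
float32_cost = run_cost(8, 7, 16.48)   # 8x A100 80 GiB, full-fat float32
baseline_cost = run_cost(3, 24, 11.84)  # 8x A100 40 GiB, AMP + TF32

time_ratio = (8 * 60 + 7) / (3 * 60 + 24)
cost_ratio = float32_cost / baseline_cost

print(f"float32 run: ${float32_cost:.2f}, baseline run: ${baseline_cost:.2f}")
print(f"time ratio: {time_ratio:.2f}x, cost ratio: {cost_ratio:.2f}x")
```

Either way the ratios agree with the summary: a bit more than twice the time, and more than three times the money.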
Let's add it to the table to see how that compares to the other training runs: So, a tiny improvement over our baseline. Taking more than twice as long on the training run, and spending three times as much, gained us a loss improvement that's smaller than any other successful intervention. The first question is, did removing AMP and lower-precision matrix multiplications lead to a better model? The answer appears to be "yes" -- but it's a tiny enough difference that it could well be in the noise. But the follow-up has to be, was it worth the extra cost in time and money? And for that I'm certain that the answer is "no". If we'd spent twice the time training with AMP -- on an extra 3B-odd tokens, or on a second epoch with the same 3B -- it seems implausible that the resulting loss would not have been better. And anyway, given that my goal with these interventions is to train the best model I can in two days locally (or 3h30m or so on an 8x A100 40 GiB), it's pretty clear that if we'd cut this run off about halfway through it would have been worse -- and that's not even accounting for it being more memory-hungry. So, I think the takeaway from this is that AMP appears to be a huge win, at least for this model. It has a tiny cost (if any) in model quality, and a huge benefit in training speed, plus a smallish but still useful benefit in training VRAM requirements. 3 And with that, I've reached the end of the interventions that I wanted to try ! Next, I'll need to think through what we need to do to try to stack them up. In particular, is there any easy way to work out whether any of the improvements I've seen might be due to random noise? After all, even though I've been carefully using explicit seeds, each intervention will have changed the way the training run uses the random number stream, and that could easily have an effect. Stay tuned! 
The name of the flag is not quite right, as of course we're switching off not just AMP but the matrix multiplication precision, but it's a decent shorthand.  ↩
I'm a night owl, so luckily I was still awake.  ↩
I have to admit that I'm very tempted to see what effect even bigger moves in the low-precision direction might have. What if I moved to some kind of 16-bit training, like ? After all, most of the open weights models like Qwen are at least released at that kind of bittedness. But that's one to look into later, I think.  ↩

Setting the 32-bit floating point matrix multiplication precision to "high" rather than to "highest", which means that it uses lower-precision (but still technically 32-bit) TF32 for those operations rather than normal float32.
Using PyTorch's Automated Mixed Precision (AMP), which allows it to use 16-bit calculations rather than 32-bit in places where it makes sense to do so.

Giles's blog 1 month ago

Automating starting Lambda Labs instances

I've been trying to get an 8x A100 instance on Lambda Labs to do a training run for my LLM from scratch series, but they're really busy at the moment, and it's rare to see anything. Thanks to the wonders of agentic coding, I spent an hour today getting something up and running to help, which I've called lambda-manager. It has three commands: one that prints which kinds of instances are available; one that prints out all of the possible instance types (available or not) with both their "friendly" names -- what you'd see on the website -- and the instance type names that the API uses; and one that polls the API until it sees a specified type of instance, at which point it starts one and sends a Telegram message. Let's see if that helps -- though it's been running for six hours now, with no luck...
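The polling command boils down to a simple loop. The sketch below is not lambda-manager's actual code and does not call the real Lambda API: the function name, the instance type string and the injected callables are all hypothetical stand-ins, chosen so that the loop can be run and tested without any network access.

```python
import time
from typing import Callable, Collection


def wait_and_launch(
    instance_type: str,
    list_available: Callable[[], Collection[str]],
    launch: Callable[[str], str],
    notify: Callable[[str], None],
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until instance_type is available, then launch it and send a message."""
    while True:
        if instance_type in list_available():
            instance_id = launch(instance_type)
            notify(f"Started {instance_type}: {instance_id}")
            return instance_id
        sleep(poll_seconds)


# Fake "API" for demonstration: nothing available twice, then our type appears.
responses = iter([set(), set(), {"example-8x-a100"}])
messages: list[str] = []

instance_id = wait_and_launch(
    "example-8x-a100",
    list_available=lambda: next(responses),
    launch=lambda instance_type: f"fake-id-for-{instance_type}",
    notify=messages.append,
    sleep=lambda seconds: None,  # don't actually wait in the demo
)
print(instance_id, messages)
```

Passing the API calls and the notifier in as functions keeps the loop itself trivial, and means the real HTTP and Telegram code can be swapped in without touching it.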

Giles's blog 1 month ago

Writing an LLM from scratch, part 32g -- Interventions: weight tying

In Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ", he writes that weight tying, while it reduces the parameter count of a model, in his experience makes it worse. As such, apparently people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post. But as I'm trying various interventions to see if I can get my model -- based on Raschka's code, but trained for a fraction of the time that the original GPT-2 model was -- to perform as well as the original in terms of the loss it gets on a test set, I thought it would be worth seeing if it really is a negative for this particular tiny model of 163M parameters. After all, the original weights use weight tying, and I did find that QKV bias appeared to help -- and that's another old-school technique that they used, which has since dropped out of fashion. Might this one help too? Worth a try! Let's give it a go. I'll start with a quick refresher on what weight tying is, and how it works. This is really targeted at people who've been reading along with this series -- if it's all new to you, you might find my post on Maths for LLMs a useful catch-up guide first. In our LLM code, right at the start, we use an embedding layer to take our input token IDs, and turn them into embeddings -- each token becomes a vector in a high-dimensional space (768 in our case), which we see as representing in some manner the "meaning" of the token. A useful way to think about that is that we could start with a one-hot vector for the token -- that is, with our 50,257-token vocabulary, it would be 50,257 items long, and have zeros in every position apart from the position corresponding to the token's ID. We'll treat that as being a vector in a "vocab space". 
The process of converting the token into an embedding turns out to be equivalent to multiplying that vocab space representation by an embedding matrix -- one with one row per possible token, the values in that row being the values for the appropriate embedding. 1 Because matrix multiplications can be seen as projections between different spaces, we can see that as a projection from our vocab space to the embedding space. Once we've projected our sequence of tokens into a sequence of embeddings, we do all of the steps required for the LLM -- we add in positional information, run it through the Transformers layers, normalise it, and then we have a new sequence of embeddings. The embedding at position n in that output sequence, if our model is working well, should be something that represents an appropriate next-token prediction for the portion of the input sequence from zero to position n . What we want as our final output is to map that back to the vocab space. We want logits: a list of numbers that (after being run through softmax) will represent the probability that our next token is a particular one. Just as we mapped from vocab space to embedding space with (conceptually) a matrix multiplication at the start of the process, we can map back with another one. More specifically, if we treat the embedding matrix as having the same number of rows as there are input tokens (which we'll call d vocab ) and columns as there are embedding dimensions ( d emb ), then the original vocab-space-to-embedding-space matrix will have this shape: So it's projecting from a d vocab -dimensional space to a d emb -dimensional one. Similarly, our matrix to do the projection at the end is just a matrix with the numbers of rows and columns swapped around: ...to do a projection in the other direction. The trick with weight tying is to see that these two projections can potentially be just the opposite of each other. 
If we assume that the embedding space on the way in to the LLM is essentially the same as the embedding space on the way out, then we can use one projection to go into it from vocab space, and the opposite to go back. The "opposite" in this case is the transpose -- that is, if we use $W_{emb}$ for our embedding matrix and $W_{out}$ for the output one, we have:

$W_{out} = W_{emb}^{\top}$

That means we can re-use all of the embedding parameters for the output projection matrix, and fewer parameters means not only a smaller model, but hopefully faster training. Sounds like a win! But of course, there's no such thing as a free lunch. By constraining the output head to be the transpose of the input one, we're essentially enforcing that assumption above: we're saying that the embedding space on the way out must be the same as the embedding space on the way in. That limits what the LLM can do -- if it were able to use different embedding spaces at each end, it would have more flexibility, which might help it learn to model things better.

That's the theory: what does it mean in practice? Let's take a quick look at the GPT-2 code -- just the `__init__` for the top level class: For our embedding layer, we use PyTorch's `nn.Embedding` class, and for the output head we use `nn.Linear`. Now, `nn.Embedding` provides us with access to the underlying matrix with a `weight` field:

"weight (Tensor) -- the learnable weights of the module of shape (num_embeddings, embedding_dim) initialized from $\mathcal{N}(0, 1)$."

So, that's exactly the $d_{vocab} \times d_{emb}$ matrix that we'd expect -- it's the input dimension as the rows, and the output dimension as the columns. If we look at `nn.Linear`, we see something very similar:

"weight (torch.Tensor) -- the learnable weights of the module of shape (out_features, in_features). The values are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{1}{\text{in\_features}}$"

That's actually the other way around, output dimension as the rows and input as the columns. If you're wondering why, remember that we transpose the weights matrix for a neural network before using it.
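Here is a small sketch of why those two storage layouts line up. It mimics, in plain Python, a bias-free linear layer whose weight is stored as (out_features, in_features), and then reuses the embedding matrix as that weight. The function names and the toy numbers are assumptions for illustration; no real PyTorch layers are involved.

```python
def linear(x, weight):
    """Bias-free linear layer with weight stored as (out_features, in_features):
    each output value is the dot product of x with one row of the weight."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def vec_mat(vec, mat):
    """Row vector (length r) times an r x c matrix."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]


# Toy embedding matrix of shape (d_vocab, d_emb) = (5, 3).
W_emb = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]

# Weight tying: the output head uses the very same object, not a copy.
W_out = W_emb

hidden = [2.0, -1.0, 0.5]                       # one output embedding
logits = linear(hidden, W_out)                  # one logit per vocabulary entry
explicit = vec_mat(hidden, transpose(W_emb))    # same thing, written as h · W_emb^T

print(logits)  # → [2.0, -1.0, 0.5, 1.0, -0.5]
```

Because the linear layer already applies the transpose when it is used, a (d_vocab, d_emb) embedding matrix can serve directly as a (out_features, in_features) output weight, which is what makes tying a matter of sharing one matrix.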
But that's actually really convenient in our situation, because if we want to use the same weights for both, they're already "compatible"! And that means that adding weight tying to our code above is as simple as adding two lines at the end: For the model code, it literally is just that! There is a tiny inefficiency in that PyTorch is going to spend a bit of time initialising the weights in to appropriately-sized random values, only to have them all replaced -- but that actually works in our favour, because it means that we'll use up the same amount of the random number stream when creating the LLM in both the weight-tying and non-weight-tying cases, which is a bit better for reproducibility. There is one other change needed, though. I ran a test train with that code, and checkpointing failed like this: Safetensors doesn't like it when you reuse weights like we're doing here. The good news is that the help page the error links to is exactly about this problem with weight tying, and the suggested fix -- to replace ...and similarly for loading -- appears to work fine. Saving and loading checkpoints works, and it's compatible with the old checkpoint files too. So that's good news :-) So, that's how we code it. How much actual saving do we get in terms of the parameter count by doing this? A quick-and-easy way to count the parameters is just to create an instance of the model and see: So, we've gone from a 163M-parameter model to a 124M-parameter one. That's certainly quite some saving -- 38,597,376 fewer parameters, which is a reduction of almost a quarter. We can also sanity check the size of that saving -- our output head was, as we know, a d emb × d vocab matrix, so it should have 50257 × 768 parameters -- which is, indeed, 38,597,376. Excellent. Now, there's one thing we should consider here. We're training on a Chinchilla-optimal number of tokens, 20x our parameter count. Is that what we want to keep stable? 
Or is the total number of training tokens the important bit, so we wind up technically overtraining? My instinct is that the total training tokens is the important thing. Chinchilla optimality is a training heuristic rather than a true aspect of the model, so sticking with it would mean that we're training a model with fewer parameters on less data. It seems very unlikely that would do anything other than produce a worse model! So: we'll keep the same number of training tokens, and just introduce weight tying.

How does it train? I kicked it off on the usual 8x A100 40 GiB machine, and after a little while I checked the loss chart. It looked like this:

Yikes! It started off with a loss of about 460. Normally, we start with a loss of about 11. The normal loss makes a lot of sense. If you consider it in terms of perplexity, that value of 11 comes out at $e^{11} \approx 59{,}874$ -- that is, the model is giving pretty much equal probabilities to every one of the 50,257 possible tokens (a perfectly uniform guess would give a loss of $\ln 50{,}257 \approx 10.8$). A loss of 460 means that the model is making incorrect predictions and is very certain about them. How could that be? Well, let's look at the documentation again.

"weight (Tensor) -- the learnable weights of the module of shape (num_embeddings, embedding_dim) initialized from $\mathcal{N}(0, 1)$."

"weight (torch.Tensor) -- the learnable weights of the module of shape (out_features, in_features). The values are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, where $k = \frac{1}{\text{in\_features}}$"

They're initialised completely differently. Embeddings are set to values in a normal distribution (that is, a Gaussian bell curve) with a mean of 0 and a standard deviation of 1. But linear layers are set to random values in a uniform distribution (that is, a completely flat one) within a range based on the number of input features. In particular, those numbers for the linear layer are really small!
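To get a feel for how much the initialisation scale matters, here is a rough pure-Python simulation. It is a sketch under stated assumptions, not a reproduction of the 460 figure: the sizes are tiny, the "hidden" vectors are just unit-variance Gaussian noise standing in for the model's final-layer output, and all function names are made up.

```python
import math
import random

random.seed(0)

d_emb, d_vocab = 64, 200   # toy sizes; the post's model uses 768 and 50257


def make_weights(sampler):
    """Build a (d_vocab, d_emb) output-head matrix from a per-element sampler."""
    return [[sampler() for _ in range(d_emb)] for _ in range(d_vocab)]


def cross_entropy(hidden, weights, target):
    """Loss for one position: -log softmax(logits)[target], computed stably."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in weights]
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum_exp - logits[target]


def mean_loss(weights, samples=100):
    total = 0.0
    for _ in range(samples):
        # Assumption: unit-variance activations and a random "correct" token.
        hidden = [random.gauss(0.0, 1.0) for _ in range(d_emb)]
        total += cross_entropy(hidden, weights, random.randrange(d_vocab))
    return total / samples


bound = 1 / math.sqrt(d_emb)   # linear-layer-style bound, sqrt(1 / in_features)
loss_linear_init = mean_loss(make_weights(lambda: random.uniform(-bound, bound)))
loss_embedding_init = mean_loss(make_weights(lambda: random.gauss(0.0, 1.0)))

print("uniform-guess loss for toy vocab:", math.log(d_vocab))
print("linear-style init:", loss_linear_init)
print("embedding-style init:", loss_embedding_init)

# The same two reference numbers at the post's real sizes:
print("ln(50257) =", math.log(50257), " sqrt(1/768) =", 1 / math.sqrt(768))
```

With the small uniform weights the loss sits close to the "know nothing" value of ln(vocab size), while the unit-variance weights produce much larger logits and a confidently wrong, much higher loss.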
Our output head has `in_features` set to 768, so that means that the bound $\sqrt{k}$ would be:

$\sqrt{k} = \sqrt{\frac{1}{768}} \approx 0.0361$

So instead of getting that kind of "ideal" linear layer initialisation within the range (−0.0361, 0.0361), we're getting numbers which roughly 2/3 of the time will be in the range (−1, 1), and the rest of the time will be even further from zero -- we could be getting -3 or +4, or potentially even crazier numbers! That means that the output logits (coming from a linear layer with higher weights) will be larger, which in turn will push softmax to come up with higher probabilities:

I considered changing things to initialise the weights differently, but given that the loss had fallen to 8 or so by the second checkpoint, I decided to just let the run complete. Here's the final loss chart, with the Y axis fixed to run from 0 to 12:

That's a nice smooth curve, at least! The output is:

Timing-wise, that's about 180 seconds faster than our baseline model training run, only a 1.5% speedup -- clearly the lower number of parameters doesn't actually save us much time. Loss-wise, the final train loss on the baseline model was 3.743, so that's not particularly promising. Still, the proof is, as ever, in the evals. Smoke test first:

Borderline coherent, but maybe worse than normal? Let's see what our test set loss looks like.

That's bad -- let's see it in our comparison table:

Our worst model so far :-( Weight tying certainly didn't help our train. It is worth noting that the GPT-2 small weights -- which do use it -- got 3.500 on the same test set as we're using for that table, so it is possible to get a better model with weight tying. But there was clearly something different about their train, and my suspicion, as I've said before, is that it was trained for many more epochs (I estimated 40), slowly grinding that loss down. But what I'm trying to do in this mini-series of interventions is find tricks that will allow us to approach the original weights' loss without a very long training run.
And for the purposes of that, I think we can safely say that weight-tying is not one of those. Next time around, our last intervention test! What happens if we switch off the use of automatic mixed precision (AMP)? That is something I added right back at the start as a performance enhancement; it means that PyTorch can do certain calculations in 16-bit rather than 32-bit if it thinks there's no harm in doing so. Might we get better loss by training without it?

[1] In reality we don't multiply a one-hot vector by a matrix, as that would be extremely inefficient -- PyTorch just does a lookup into the embedding matrix. If we get token ID 1234, then it just reads out the contents of row 1234, and that's our embedding. But for the purposes of this post, it's best to see that as more of an (extremely effective) performance tweak rather than what's happening conceptually.

Giles's blog 1 month ago

Writing an LLM from scratch, part 32f -- Interventions: weight decay

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In my training code, I have this code to create the optimiser: In my last post I looked into the learning rate, the parameter in that code, and found a value for that, plus some extra code to schedule it -- that is, to vary it over time -- which gave better training results. This time I want to go into the weight decay. What is it, what is it for, and is 0.1 really the best value? I was a little concerned going into this that in order to understand this hyperparameter, I'd need to have a good understanding of how the optimiser works; I've been building what I think is a solid mental model of optimisers, but I don't think I understand them well enough to explain them yet, and I've been hoping to delay posting about them to a separate blog post series after this one. The good news is that while weight decay is an important aspect of how optimisers work -- the "W" in AdamW, the thing that makes it different to the older Adam optimiser, is a nod to its different treatment of weight decay -- you don't need to know how the optimiser itself works to understand what weight decay is. Instead, you just need to consider an older and more fundamental aspect of building ML systems -- regularisation. In order to dig into that, let's start with overfitting. Let's imagine a simple classification task: we want to build a model that can -- for any point on this chart -- predict whether a cross or a circle should go there, training it using the sample data points that we already have: Let's say that we train a powerful model on this dataset, and it comes up with this: Now, ab initio we don't know whether that's a good result or not; we need to use our validation set to evaluate it. Let's say that the validation points are these blue ones: We can see that it looks like our powerful model has overfit. 
The training set is all nicely split by the boundary, but the validation points are not. A common solution to how to handle that kind of issue that you might see in introductory ML courses is to try using a less powerful model. A less powerful model in this case might come up with a less "wiggly" line to separate the two categories, perhaps because it didn't have enough parameters to make it wiggle so much, so you might find that it came up with a classifier that looked more like this: So: we use our validation set to detect overfitting, and we can adjust the complexity of our model to try to avoid it. Now, this is all very well, but it does require manual intervention. We had to do a training run, identify that we were overfitting, and then decide on parameters for the new simpler model (how many parameters should it have?). We could, perhaps have gone too far and wound up with something like this: ...and underfit. There's no way when we start out knowing what the right number of parameters is, so we need to try various values and then try to work out the optimum balance. Regularisation techniques are designed to try to automate this -- to prevent overfitting without all that tedious mucking about with the model. We've already looked at Dropout , which is one of the standard ways to do that. Although my own mental model of what it does goes some way beyond just helping to prevent overfitting, I may well be wrong -- and given that our LLM train is never seeing the same training data twice, being a single-epoch run, removing it turned out to improve our model . Another technique is just stopping the training run when you start seeing the validation loss rise, also known as "early stopping". That's such an obvious thing to do that I came up with it independently back when I was doing my early experiments with fine-tuning . 
Now, we don't have a separate validation set for these training runs, but because we're doing a single epoch, the training data it sees is just as "new to it" as a held-back validation set would be, so we could use a similar trick and treat "train loss starts rising" instead of validation loss rising as a reason to stop the train early. It's not exactly the same thing, but perhaps it would be close enough. But in all of the trains in this series, that's never happened -- while sometimes the train loss blips up for a bit, in the longer term it keeps going down. But there are other techniques that rely on a neat trick. Let's think back to the manual, boring way of trying to find how many parameters are appropriate for a modelling task. We tried one number, found that it overfit, then we might try a lower one, find that it underfit, then try something in the middle and find that it's better but still not perfect one way or the other, and rinse and repeat until we find something we're happy with. This kind of searching through a solution space to find an optimum is exactly what we're doing when training a model. It would be really nice to automate it in the same way. One trick is: if we want to minimise the complexity of our model so that it doesn't overfit, we can try adding a measure of the model's complexity to the loss function -- and then our normal process of gradient descent will try to minimise that, just like it will try to minimise the loss from the training results themselves. And that brings us on to weight decay. Regularisation by weight decay starts off with the hypothesis that the "size" of all of the model's weights, taken together, is a measure of the model's complexity. If the model's weights are small, then it's a simpler model than if they're large. 1 The "size" in this sense is the square of the L2 norm -- that's something we came across in gradient clipping . 
The L2 norm is basically all of the weights squared, added together and then the resulting sum square-rooted. You can think of it as the length of the vector that the weights represent -- that is, for our 163M-parameter model, it would be the length of the model's weights considered as a vector in 163-million dimensional space. [2] And by using its square, we get something that penalises larger values more (and we also save the time in calculating a square root).

To me, it's not intuitively obvious that that measure really does express the complexity of the model in any clear sense. After all, you'd think that doubling all parameters would leave it no more complex than it was before, but it would double the L2 norm. [3] But I imagine there is solid maths behind it to say that it does work in a more general way, so in the interests of not disappearing down a mathematical rabbit hole at this stage, I'll take it as given.

So: we're using the squared L2 norm as a measure of model complexity, and we're going to add that on to the training loss as a way to try to minimise both. The next question is, how do we balance between the two -- the training loss and the model complexity penalty? This is, in a somewhat hand-wavy way, similar to the decision of how much of the current loss function's gradient to use when adjusting the weights. For that, we use $\eta$, the learning rate, to scale the gradients before applying them:

$w \leftarrow w - \eta \cdot \nabla_w \mathcal{L}$

And the balance between the "real" loss and the model complexity penalty is done in a similar way -- we have a number, the weight decay, normally represented by a lower-case lambda, $\lambda$, and we multiply the squared L2 norm by that, something like this:

$\mathcal{L}' = \mathcal{L} + \lambda \cdot N^2$

...where I'm using $\mathcal{L}$ for the normal loss on the training inputs vs the targets, $N^2$ for the squared L2 norm of the weights, and $\mathcal{L}'$ for the combined loss. And $\mathcal{L}'$ is what we -- in theory -- actually try to minimise using our optimiser.
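As a quick illustration of those definitions, here is a minimal sketch that computes the L2 norm, its square, and the combined loss for a made-up two-weight "model". The function names and numbers are assumptions chosen so the arithmetic is easy to follow by eye.

```python
import math


def l2_norm(weights):
    """Length of the weight vector: square, sum, then square-root."""
    return math.sqrt(sum(w * w for w in weights))


def squared_l2_norm(weights):
    """The penalty term N^2: same sum, without the square root."""
    return sum(w * w for w in weights)


def combined_loss(training_loss, weights, weight_decay):
    """L' = L + lambda * N^2, as described in the text above."""
    return training_loss + weight_decay * squared_l2_norm(weights)


weights = [3.0, -4.0]
doubled = [2 * w for w in weights]

print(l2_norm(weights), l2_norm(doubled))                   # → 5.0 10.0
print(squared_l2_norm(weights), squared_l2_norm(doubled))   # → 25.0 100.0
print(combined_loss(2.0, weights, weight_decay=0.1))
```

Doubling every weight doubles the norm and quadruples its square, so the squared form punishes large weights disproportionately.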
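But there's actually a neat simplification that we can apply to make this even easier. Firstly, let's make one small change to the equation above: we'll halve the squared L2 norm before multiplying it by $\lambda$. That obviously doesn't change the underlying maths; it just means that we'd need to use a $\lambda$ twice as large to get the same effect. You'll see why that's useful in a bit.

Now let's think about normal gradient descent. Again, we work out the gradient of the loss function for each weight, and subtract that times the learning rate $\eta$ from the weight's value to update it:

$w \leftarrow w - \eta \cdot \nabla_w \mathcal{L}$

Let's reformulate that a bit. The gradient of the loss function for the weight is its partial derivative against that weight, so we can write the above like this for the version of the loss function including weight decay, $\mathcal{L}'$:

$w \leftarrow w - \eta \, \frac{\partial \mathcal{L}'}{\partial w}$

Now, we defined $\mathcal{L}'$ above as $\mathcal{L} + \frac{\lambda \cdot N^2}{2}$, so we can substitute that in there:

$w \leftarrow w - \eta \, \frac{\partial}{\partial w} \left( \mathcal{L} + \frac{\lambda \cdot N^2}{2} \right)$

Now, let's think about that L2 norm, $N$. It's the square root of the sum of all of the weights squared, or equivalently we can square it (like we do in the formula above) and say:

$N^2 = w_0^2 + w_1^2 + \dots + w_n^2$

Let's drop that in:

$w \leftarrow w - \eta \, \frac{\partial}{\partial w} \left( \mathcal{L} + \frac{\lambda \cdot (w_0^2 + w_1^2 + \dots + w_n^2)}{2} \right)$

Now, the derivative of a bunch of things added together is just each of them differentiated separately and then added together. Let's apply that to the two terms in the brackets:

$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\partial}{\partial w} \frac{\lambda \cdot (w_0^2 + w_1^2 + \dots + w_n^2)}{2} \right)$

...and now pull the constant $\lambda$ and the 2 out of the second partial derivative:

$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\lambda}{2} \cdot \frac{\partial}{\partial w} \left( w_0^2 + w_1^2 + \dots + w_n^2 \right) \right)$

Then we apply the rule for the derivative of a bunch of things added together again:

$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\lambda}{2} \left( \frac{\partial w_0^2}{\partial w} + \frac{\partial w_1^2}{\partial w} + \dots + \frac{\partial w_n^2}{\partial w} \right) \right)$

Now, we're doing a partial derivative versus one specific weight, $w$, which is one of the $w_0$, $w_1$, and so on in there. From that perspective, all of the other weights are constant -- which means that their derivative with respect to $w$ is zero. So we can just get rid of all of them apart from the one that actually is $w$, and we wind up with this:

$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \frac{\lambda}{2} \cdot \frac{\partial w^2}{\partial w} \right)$

The derivative of $w^2$ with respect to $w$ is just $2w$.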
Thanks to that crafty halving of the $N^2$ earlier, that means that we can go to this:

$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \lambda w \right)$

Multiplying that $-\eta$ across the bracketed terms, we get:

$w \leftarrow w - \eta \, \frac{\partial \mathcal{L}}{\partial w} - \eta \lambda w$

That's exactly the same as the normal gradient descent update, using the unmodified loss function without weight decay -- except that we're additionally subtracting the weight's original value scaled down by both the learning rate $\eta$ and the weight decay value $\lambda$. Much simpler :-)

(As an aside: the description above is correct for "traditional" simple gradient descent and -- loosely -- for Adam, but AdamW's trick is to do things somewhat differently. That's something I'll go into in more detail when I get round to writing my post on optimisers.)

So: weight decay is a regularisation technique that tries to prevent our model from getting any more complex than it needs to be. We have one number, $\lambda$, which determines how much to weight complexity against the normal training loss. And, as we can see from the code: ...right now we're setting $\lambda$ to 0.1. Is that the right value?

As usual, the GPT-2 paper is light on the details of the hyperparameters they used, but nostalgebraist wrote a really nice post on Tumblr where they dug into what the number might have been. As they say:

"It does say it follows the first GPT paper in most respects, and that paper used weight decay of 0.01."

Their link for the paper appears to be mistaken, as it's a different (albeit very interesting) paper from 2020, a year after the GPT-2 one, but I believe this is the paper normally called the GPT-1 one. They do indeed use 0.01 there:

"We also employed a modified version of L2 regularization proposed in [37], with w = 0.01 on all non bias or gain weights."
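Returning to the update rule derived above, its equivalence to plain gradient descent on the penalised loss can be checked numerically. The sketch below uses a made-up two-weight quadratic as the "real" loss, with hypothetical function names; it covers simple gradient descent only, not AdamW.

```python
def training_loss(w):
    """Hypothetical stand-in for the 'real' loss of a two-weight model."""
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2


def training_grad(w):
    """Analytic gradient of training_loss."""
    return [2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)]


def penalised_loss(w, lam):
    """L' = L + lambda * N^2 / 2, using the halved squared L2 norm."""
    return training_loss(w) + lam * sum(x * x for x in w) / 2


def numerical_grad(f, w, eps=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


eta, lam = 0.1, 0.1
w = [0.5, -1.5]

# Route 1: ordinary gradient descent on the penalised loss L'.
g_penalised = numerical_grad(lambda v: penalised_loss(v, lam), w)
route_1 = [wi - eta * gi for wi, gi in zip(w, g_penalised)]

# Route 2: gradient step on the unmodified loss, plus the decay term eta * lam * w.
route_2 = [wi - eta * gi - eta * lam * wi for wi, gi in zip(w, training_grad(w))]

print(route_1, route_2)
```

Both routes land on the same updated weights (up to floating-point noise), which is the practical payoff: the penalty never has to be computed as part of the loss at all.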
The link to the GPT-3 paper looks right, though, and as they say, it uses a weight decay of 0.1: All models use weight decay of 0.1 to provide a small amount of regularization They then do a bit of maths to work out whether the GPT-2 weights are likely to have been regularised by something like weight decay, and come to the conclusion that they probably used 0.01, just like the GPT-1 paper. It seems plausible, but of course not certain. But: tentatively, GPT-2 used 0.01, while we're using 0.1, perhaps because the GPT-3 paper does. What other data points do we have? The Hugging Face "Smol training playbook" has some interesting stuff (including not using weight decay on embeddings, which they say they found helped), but the value that they use is 0.1, which they call "a very vanilla setting". And: Interestingly, over the last few years the AdamW hyperparameters have barely moved: The same triplet is reused in Llama 1, 2, and 3 and DeepSeek-V1, V2, and V3-671B, with no changes. Anyway, assuming they're right about weight decay value for the models they mention (and I assume they've done the research -- I had the link to the DeepSeek paper to hand, and that one certainly says 0.1), it looks like 0.1 is pretty much standard these days. And a quick double-check of what a typical value would be -- asking ChatGPT, Claude, Gemini and Grok -- they all recommend 0.1 as a solid sensible default with AdamW (though they all also say that values between 0.01 and 0.1 are reasonable). So on that basis, I think we can say that 0.1 is a reasonable default, and has pretty much become the standard, but it might be worth trying 0.01 just to see if it does help with tiny models like ours. Are there any dissenting voices to the 0.1 orthodoxy? I came across a paper from a team at Cerebras Systems , " Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training ". 
It's essentially a Chinchilla-like attempt to get scaling laws, but rather than looking just at optimal tokens per parameter in order to work out what you should scale up when adding on more compute, they're trying to find optimal batch sizes and values for weight decay. That's certainly relevant to our interests :-) However, it is very dense and in-depth, and fully understanding it at this stage would need quite a lot of work -- very much a side quest. Definitely something to come back to later, but for now, I'll just try to extract the stuff we need. Let's start off with the optimal batch size, as they have it right there on the first page. We're not going to use it, but it will be interesting to compare with what we're using, and what the DeepSeek paper that I looked at in the last post suggested. They fit this formula: ...where D is the total number of tokens that you're training on. That's quite different to the formula in the DeepSeek paper, which was: ...where C is the number of FLOPs 4 . C scales up linearly with the number of tokens D , but also with the number of parameters in the model N , so you can see the DeepSeek formula as a function of N and D -- as your model gets bigger, so does B opt -- whereas this Cerebras paper is saying that it's just a function of D , unaffected by model size. They did train over a number of different sizes (from 111M parameters up to 1.7B) and their formula seems to hold, so it's not just that they didn't treat model size as relevant. Well, let's see what their formula comes up with. We have 3,260,252,160 tokens in our train, so their formula for B opt comes out as: That's much closer to the 97-or-so sequences that appeared to be optimal when I did some rough-and-ready curve-fitting than the 373 that the DeepSeek formula gave for our setup :-) OK, so what about the weight decay? They don't give a direct formula for that, but they do give a formula for the optimal τ , the AdamW timescale. 
Without going into exactly what that means right now (that's one for my optimisers post later), they relate it to other numbers that we do know with this formula: ...where B is the batch size, D is the amount of data, and of course λ and η are weight decay and learning rate respectively. So if we know the optimal τ we can work out the optimal λ for our training run; solving for λ , we get: So let's work out the τ opt . Their fitted formula is this: ...where TPP is tokens-per-parameter. For us, with our Chinchilla-optimal TPP of 20, we get: Now, we're using a batch size B of 96, and (as before) D is 3,260,252,160. Our learning rate η is 0.0004 for this train -- remember, although in the last post we found that a scheduled learning rate with a peak at 0.0014 was better, in this post we're testing changing weight decay in isolation. 5 So, we just need to plug our τ opt into this: Before we do: having a batch size and a number of tokens in the same formula feels like a unit mismatch. In particular, as part of the explanation of that formula, they tie it back to a value S , the total number of optimisation steps, which they define as D / B . For that to work, either both need to be in terms of tokens, or both need to be in terms of sequences They clearly say that "B is reported in units of sequences". I'm not sure how to explain this, except by saying that perhaps the D is also meant to be in terms of sequences too, even though I'm pretty sure that it's meant to be in terms of tokens in the equation for the batch size. 6 Well, let's assume that is the case, and plug in numbers for sequences. We have 3,260,252,160 training tokens split into 1,024-token sequences, which is 3,183,840 sequences, so that comes out as: (Note that we'd get the same numbers if we plugged in numbers for tokens in both cases, as it would just multiply the top and the bottom by 1,024.) That comes out as 0.33724. Wow! 
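The arithmetic here can be sketched in a few lines. One loud assumption: the relation used below, τ = B / (η · λ · D), is my reading of the prose description above (the formula itself is not shown), so treat it as a reconstruction rather than a quotation from the paper. Rather than relying on the paper's fitted constants, the sketch backs the timescale out of the 0.33724 figure and then checks the claim that tokens and sequences give the same answer. The function names are made up.

```python
def weight_decay_for(tau, batch_size, learning_rate, data_size):
    """lambda = B / (eta * tau * D), assuming tau = B / (eta * lambda * D)."""
    return batch_size / (learning_rate * tau * data_size)


def timescale_for(weight_decay, batch_size, learning_rate, data_size):
    """The same assumed relation, solved for tau instead."""
    return batch_size / (learning_rate * weight_decay * data_size)


tokens = 3_260_252_160
sequence_length = 1_024
sequences = tokens // sequence_length   # 3,183,840 sequences
batch_sequences = 96
eta = 0.0004

# Back out the timescale implied by the post's weight decay value.
tau = timescale_for(0.33724, batch_sequences, eta, sequences)

# Measure both B and D in sequences, then both in tokens.
lam_from_sequences = weight_decay_for(tau, batch_sequences, eta, sequences)
lam_from_tokens = weight_decay_for(tau, batch_sequences * sequence_length, eta, tokens)

print(sequences, tau, lam_from_sequences, lam_from_tokens)
```

Because only the ratio B / D appears, the units cancel as long as both are measured the same way; mixing sequences for B with tokens for D would shift the result by a factor of 1,024.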
That's even higher than the "traditional" 0.1, never mind the 0.01 that is the best guess we have for GPT-2. Even if I'm missing something here (I certainly can't say I've read the paper in as much detail as it deserves), that actually gives us a nice number to try out as an experiment. We already have a loss on our test set for a model trained with a weight decay of 0.1, as that was what we used in our baseline train. It looks like it might be worth doing two more, one with the GPT-2 estimate of 0.01, and one with this Cerebras-inspired 0.33724, neatly bracketing it. Let's give them a go! Firstly, the training run with λ = 0.01 : Looks like a nice smooth train -- one small loss spike near the start but it quickly recovered. The output was: That's not a bad final train loss (which does tend to indicate a good model). Let's look at the evals; firstly, the smoke test -- how would it complete "Every effort moves you"? Passably coherent. Let's take a look at the loss it gets on our test set: Not bad at all! Time to upload it to Hugging Face and to add it to the table so that we can compare it to the other interventions we've tried so far. So, it's better than gradient clipping and the QKV bias, but slightly worse than removing dropout and much worse than scheduling (and increasing) the learning rate. Now, that suggests to me that the much-higher Cerebras-inspired weight decay will be worse. My logic is this: if both decreasing it and increasing it improved loss, that would suggest that we have an inverted-U loss curve for weight decay like this: Now, it seems vanishingly unlikely that those downward trends on either side would continue so that you could get arbitrarily low loss by increasing or decreasing weight decay even more. 
So the curve would perhaps look a bit more like this W-shaped one: My intuition is that having multiple minima -- especially ones that just happen to be on either side of the "standard" value for weight decay -- seems less likely than the alternative -- that the higher number will be worse because we're actually on a U-shaped curve more like this: Of course, my intuition could be completely off on this, and it's definitely still worth doing the test! Here's the loss chart with that: You can see right away that it was a much choppier train, with quite a few loss spikes, some quite late on. The output at the end reflected this: ...a significantly worse loss at the end. Still, we should do the evals. Firstly the smoke test: Not too bad, but the loss test is the important one: That's terrible! Our first result for loss on the test set for an intervention that is actually worse than the baseline. Much worse: However, at this point I started wondering. When I was looking at the learning rate, the number I selected based on the DeepSeek paper worked well with learning rate scheduling, but failed to converge without. The weight decay number is multiplied by the current learning rate before it's used to reduce weights' values, so will be affected by both scheduling and η . It seemed likely that Cerebras used a learning rate schedule, and double-checking the paper: We present results with a single (standard) learning rate schedule ... For a given TPP, all models have the exact same warmup phase: a linear warmup of the learning rate from 0 to the maximum value. ... We use the µP-tuned and adjusted peak η , for 111M models. The learning rate increases linearly to the peak for the first 10% of steps, then decreases from the peak to 0 for the remainder of steps. Seems pretty certain. 
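To see why scheduling interacts with weight decay, here is a minimal sketch of a warmup-then-decay schedule together with the per-step quantity η · λ that actually shrinks the weights. The quoted passage gives the 10% warmup and the decay to zero but not the shape of the decay, so the linear decay below is an assumption, as are the function name, its parameters and the 1,000-step run length.

```python
def scheduled_lr(step, total_steps, peak_lr, warmup_fraction=0.1, final_fraction=0.0):
    """Linear warmup to peak_lr, then (assumed) linear decay to final_fraction * peak_lr."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = total_steps - warmup_steps
    progress = (step - warmup_steps) / max(1, decay_steps - 1)
    floor = peak_lr * final_fraction
    return peak_lr - (peak_lr - floor) * progress


total_steps = 1_000            # hypothetical run length
peak_lr = 0.0004
weight_decay = 0.33724

for step in (0, 99, 500, 999):
    lr = scheduled_lr(step, total_steps, peak_lr)
    # Each step shrinks a weight by roughly lr * weight_decay times its value.
    print(step, lr, lr * weight_decay)
```

The effective shrinkage is strongest at the peak and fades as the learning rate decays, so a weight decay value tuned for a scheduled run is not being applied in the same way when the learning rate is held constant.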
Now, I've been following a fairly strict rule of testing interventions in isolation; however, the learning rate and the weight decay parameters are so intertwined that perhaps that's just not reasonable here. I decided to do two more trains, both with learning rate scheduling. I'd use the same schedule as in the last blog post -- a warmup from pretty-much zero to the peak over 10% of the run, followed by a cosine decay to 10% of the peak. In the first, I'd use the same learning rate as our baseline model, 0.0004. In the second, I'd use the one we got from the DeepSeek paper, which did really well when scheduled: 0.0014. Well, that's less choppy, at least -- the scheduling calmed down the later parts of the run, as you'd expect given that the learning rate was dropping. The output: Still a kind of high training loss at the end, though. The smoke test: Not too bad, and the test set loss: Unfortunately still worse than the baseline of 3.692, albeit better than the one without learning rate scheduling. I'm not going to add it to the table, as this was more in the way of an exploratory training run. Let's see how we do with the larger DeepSeek-suggested learning rate. For this one, I kept the weight decay at 0.33724. (This was an error, as I realised later -- more on that shortly) Ouch, super-choppy loss -- and the loss at the end of the train isn't promising either Terrible loss at the end. The smoke test gives this: ...which is not too bad, but the test set loss: ...is still pretty terrible (though still a tad better than the one without the learning rate scheduling). Another one to throw away, I think. 
But then something occurred to me: the formula to go from the optimal AdamW time horizon τ opt to the optimal weight decay λ opt is this: It has the learning rate η in it -- I even made a footnote saying that I was going to have to remember to recalculate the weight decay value when that changed :-S Luckily, though, running the real numbers through that: ...which is almost exactly the same as the 0.1 that we've been using for all of our other experiments. So that actually suggests that the Cerebras equations come up with a reasonably usable number for weight decay if you use the DeepSeek-optimal level for the learning rate, and schedule it in a normal warmup-cosine decay manner. But it's still not as good -- for this model -- as using the GPT-2 number. 7 With that, I think it's time to wrap this intervention up! Let's look at our results table again: We've found that reducing the weight decay from the now-standard 0.1 to a GPT-2-inspired 0.01 improves the loss our model gets on the test set; it's the third-best intervention so far, after getting rid of dropout and updating our learning rate -- and the difference between it and the dropout intervention is pretty small. It did surprise me that the Cerebras-inspired number did so badly, though. To recap: I think that for now, I should not head any further down this rabbit hole and just take the win -- we have a weight decay parameter that works better than the one we had, and so that's something that can go into our set of working interventions. I can revisit the Cerebras paper later when I've spent more time studying optimisers. As to why this old-fashioned GPT-2 value might work better than the current default of 0.1: I think that could plausibly be due to scale. The 0.1 value appears to come from the GPT-3 paper, which essentially was an experiment in scaling up GPT-2. Perhaps larger models need larger weight decays? And the model we're working with here is really small, at 163M parameters. 
So, that's weight decay done! Of the list of planned interventions I wanted to try, only training in full-fat 32 bits (rather than AMP) and weight-tying remain. I think I'll look into the second of those next. Stay tuned! Here's a link to the next post in this series.

More precisely, from Deep Learning: "Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or that put weight on fewer of the features." ...where J(w) is the loss function we're trying to minimise in our training run, combining the "real" loss and a measure of the model's size. ↩
I can't decide whether that makes it easier or harder to understand ;-) ↩
Wild speculation: how about something using the Shannon entropy of the weights...? ↩
Specifically the non-embedding training FLOPs. ↩
Note to self: don't forget to adjust it if we do decide to combine this with the learning rate update. Also: I'm pretty sure from reading the paper that the η that they're using in these formulae is the peak -- they certainly are using learning rate scheduling, albeit with a decay-to-zero rather than the decay-to-10% we used. ↩
Plugging in the number of sequences into the batch size formula gives us an optimal value of 9.47, which definitely doesn't look right based on the trains I've done. ↩
Assuming that the GPT-2 value for weight decay "stacks up" well with the learning rate update and the scheduling from the last post. There may be some useful tests to do when we try to put this all together. ↩

β1 = 0.9, β2 = 0.95
Grad norm clipping = 1.0
Weight decay = 0.1 (Llama 3 405B drops this to 0.01)

With our too-low learning rate of 0.0004, it performed terribly.
When we added scheduling, it was a bit better but still not great.
When we used a DeepSeek-optimal learning rate (and actually did the right calculations to get the real value for weight decay based on that), we got a number which was very close to our baseline train, and seems very unlikely on the face of it to have a significantly different resulting test set loss.

Giles's blog 2 months ago

Writing an LLM from scratch, part 32e -- Interventions: the learning rate

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: The values in there -- for the learning rate, and for the weight decay -- were just copied from the tiny training run that we do in section 5.2 of the book. What do those values actually mean, and are those really the right values for them?

I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before. The weight decay was pretty much an unknown for me -- I know it is a parameter controlling the behaviour of the optimiser, but I don't know how it does that. In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later.

If you're reading this blog, you almost certainly know what the learning rate is, but let's go over it briefly to build a solid foundation. The way it's normally explained, using simple gradient descent, goes something like this. Let's assume that we're training a model with just one parameter, and it starts off set to −5. We run some training data through, and get a loss, let's say 44.44: We don't know what shape our loss curve is (if we did, we might be able to find the lowest loss algebraically), but we do know the gradient at the point we've measured -- that is, the derivative of the loss with respect to the parameter -- and it happens to be -13.
That is reasonably large and negative: We use that information to say that we want to move in the direction of a larger value for our parameter -- that is, in our case, where the gradient is negative and so we have a downhill slope towards the right, we want to increase the parameter to move rightwards on that chart, whereas if it were positive (an uphill slope) we'd want to decrease the parameter to move leftwards.

Simply subtracting the gradient from the parameter would lead to an update in the right direction, but it would be a very large one in this case -- we'd move 13 units to the right -- so we multiply the gradient by a small positive number, the learning rate (often written as a lower-case eta, like this: η), to move a small distance in that direction. Let's say η = 0.3. That means we want to update our parameter: $-5 - 0.3 \times (-13) = -1.1$. So now we run that through and get a new loss -- let's say it's 9.06 -- and a new gradient, which happens to be -5.2. Now we can do another update, and our parameter will become 0.46, so we use that and work out another loss and gradient, which come to 3.3816 and -2.08.

Let's plot that one, but this time we'll draw back the veil and show the actual loss curve. Now, it's worth reiterating that while we're training this model we don't know what that curve looks like -- we're just finding points on it, along with its gradient at those points, and using that information to work out which parameter value to explore next. But it's pretty clear that as we continue, if the learning rate is the right kind of size, we'll get to the minimum eventually, because, due to the nice smooth U-shape of the curve, the gradient gets smaller the closer we get to the minimum 1 .
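The updates above can be replayed in a few lines. This is a minimal sketch that assumes the loss curve is a simple quadratic with its minimum at a parameter value of 1.5, so that its gradient is 2(w − 1.5); that assumption is not stated in the text, but it reproduces the gradients quoted above (-13, -5.2 and -2.08). The function names are made up for illustration.

```python
def gradient(w: float) -> float:
    """Gradient of an assumed quadratic loss whose minimum is at w = 1.5."""
    return 2.0 * (w - 1.5)


def gradient_descent(w: float, learning_rate: float, steps: int) -> list[float]:
    """Plain gradient descent: repeatedly step against the gradient."""
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        history.append(w)
    return history


history = gradient_descent(w=-5.0, learning_rate=0.3, steps=3)
for step, w in enumerate(history):
    print(f"step {step}: parameter = {w:.3f}, gradient = {gradient(w):.3f}")
```

Each step moves the parameter closer to 1.5, and because the gradient shrinks as it gets there, the steps themselves get smaller without the learning rate changing at all.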
It's also pretty clear that if the learning rate is smaller than an optimal value, in this simple case we will still find the right point, but it will take more steps because each one is smaller: And, of course, if the learning rate is too high, we might never converge -- we'd "bounce out of" the dip, and wind up with a parameter value that endlessly cycles between increasingly smaller and increasingly larger values, zooming off to infinity: OK, that's the basics. Why might we want to change from something that seems so logical and simple? A few paragraphs back I said: due to the nice smooth U-shape of the curve, the gradient gets smaller the closer we get to the minimum What if it doesn't? Imagine if we had something more like a V-shaped curve, like this: The gradient does not decrease as we get closer to the minimum, and so while we're in the downward-sloping part, each update is exactly the same distance: Now, eventually we'll jump over the minimum: In this example, I've used a gradient of − 8.33 on the downward-sloping part of the curve, and + 8.33 on the upward-sloping part, so that means that our next update just bounces us back to where we were before! Because the gradient isn't decreasing the closer we get to the minimum, we wind up just oscillating around it. That's not very helpful. That's a slightly contrived example (though not entirely -- intuitively, with functions like ReLU or GELU in our real LLMs, it's easy to imagine crazy loss landscapes). But it does show that perhaps we might want to add in our own "artificial" way to decrease the size of the steps we take over the course of training our model rather than just relying on the gradients naturally flattening out for us. Another way of looking at things is that as the model gets trained, we don't want batches of very new-looking data to cause big updates, taking us away from what was a good part of the loss landscape in terms of what we've seen so far. 
For example, imagine you've been training an LLM on a bunch of documents, which have so far been in English. Halfway through, it encounters a document in Byzantine Greek, the loss skyrockets, and you do a big update. That would be a problem! You might want it to learn a bit from it to push it slightly in a "the world is multi-lingual" direction, but you don't want it to lose a big chunk of the value from its previous training. You might also see a kind of connection to the way that people learn over the course of their lives -- for babies, everything is new and they "update their parameters" constantly as they try to understand the world. Children are still pretty flexible, but as we get older we tend to update our beliefs less and less. That's not always optimal, but as a heuristic it's pretty adaptive. Anyway, in general: for most training runs, we're going to want the learning rate to adjust over time. Most of the time this will be by reducing it, though there can be cases for increasing it again for periods. The general case of doing this is called "learning rate scheduling". There are a bunch of ways that people adjust the learning rate over the course of a train; here are a few that cropped up a lot while I was researching this. If we want the learning rate to go down over time, and we know how many steps we're training for, we can just set it to (say) 0.0004 for the first quarter of our train, then 0.0002 for the next, then 0.0001, then finish off with 0.00005, like this: That can work pretty well! But there is one obvious oddity -- the big step changes in learning rate mean that the exact placement of the drops and the training data before and after can matter. Why are we treating the data and the state of the model immediately before and immediately after so differently? It would make more sense to have a smoother schedule. What functions decay smoothly like that? 
An exponential curve does: let's say we just multiply the learning rate by a number that is a little smaller than one every step, so that it drops smoothly like this: But there are lots of other curves like that, and one is particularly interesting: As you change θ from 0 to π , the value of cos θ goes smoothly from 1 to − 1 , so it's easy enough to rescale that so that our learning rate follows the same curve: This is called a "cosine annealing" or "cosine decay" schedule, and was apparently inspired by the algorithms used for simulated annealing (an optimisation algorithm that was in turn inspired by how the atomic structures form in metals as they cool -- another one for the list of things to look into in the future...) That solves the mystery from earlier: the cosine that the Chinchilla paper was talking about was exactly this. As it turns out, the cosine decay scheduling curve is quite popular in deep learning, because it has what amounts to two well-defined phases -- an initial high learning rate where lots of exploration of the loss landscape can happen, followed by a smooth transition to something more like fine-tuning to optimise the location in whatever part of the loss landscape we've wound up in. Now, all of the above are assuming that we want the learning rate to start high and finish low, so that we can mimic the textbook gradient descent that we had at the start of this post. Intuitively that feels nice, but on further thought, the important thing is really that we have a low learning rate at the end of the train, so that we can find as close a point as possible for the minimum at the part of the loss landscape we've found ourselves in. But perhaps there's a case for having both high and low periods during the train, so that we don't get stuck in a local minimum -- something to jolt us out of where we were every now and then? 
2 With a step function, that's easy: you could, for example, do this: With an exponential, you could do something like this: With cosine decay, of course, things are even easier, because the cosine function is inherently cyclical, so we can just do this: However, at least for our purposes, training an LLM using a Chinchilla-optimal number of training tokens, it makes sense to be guided by what the authors of the Chinchilla paper did. Appendix B says: We find that setting the cosine cycle length too much longer than the target number of training steps results in sub-optimally trained models, as shown in Figure A1. As a result, we assume that an optimally trained model will have the cosine cycle length correctly calibrated to the maximum number of steps, given the FLOP budget; we follow this rule in our main analysis. So, at this point, I think we have one important part of the intervention we want to make: we want to use a cosine learning rate scheduler, going from high near the start of the training run, down to low at the end over one cycle. Additionally, and also from appendix B in the paper: we use a 10x learning rate decay in line with Rae et al. (2021) ...which means that if our learning rate starts at η , then we want it to decay down to η / 10 by the end. So, we just need to work out an initial value for η , and let it rip, right? Well, not so fast... When our model is uninitialised, right at the start of the train, gradients are going to be pretty wild. It's going to be making random errors all of the time, and we'll be making huge jumps across the loss landscape. That sounds bad. Additionally those kind of wild jumps can get the optimiser into a -- well, sub-optimal -- state. I haven't read enough about optimisers yet to have a solid handle on that, but that can wait -- intuitively it makes some kind of sense that erratic gradient updates might confuse it. 
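Stepping back to the decay shapes for a moment: the step, exponential and cosine schedules described above can each be written as a small function of the step number. This is a sketch rather than anything from the training code; the function names, the 1,000-step run length and the choice of 0.0004 decaying to 0.00004 are all illustrative assumptions.

```python
import math


def step_schedule(step, total_steps, rates=(0.0004, 0.0002, 0.0001, 0.00005)):
    """Piecewise-constant schedule: one rate per equal-length segment of the run."""
    segment = min(step * len(rates) // total_steps, len(rates) - 1)
    return rates[segment]


def exponential_schedule(step, peak_lr, decay_per_step):
    """Multiply the rate by a number a little smaller than one at every step."""
    return peak_lr * decay_per_step ** step


def cosine_schedule(step, total_steps, peak_lr, min_lr):
    """Rescale cos(theta), for theta from 0 to pi, to run from peak_lr to min_lr."""
    theta = math.pi * step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(theta))


total_steps = 1000
decay_per_step = 0.1 ** (1.0 / total_steps)  # reaches one tenth of the peak at the end
for step in (0, 250, 500, 750, 1000):
    print(
        step,
        step_schedule(step, total_steps),
        round(exponential_schedule(step, 0.0004, decay_per_step), 7),
        round(cosine_schedule(step, total_steps, 0.0004, 0.00004), 7),
    )
```

The cosine version spends longer near the peak and longer near the minimum than the exponential one, which is the "two phases" behaviour described above.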
So, it makes a certain amount of sense to start off with a low learning rate so that we don't do that, and then to increase it gradually to the peak, and only then to schedule the gradual cosine decay. According to this (rather nice looking) masterclass on LLM training , it's typical to do this over "a few thousand steps or a small percentage (e.g., 1-10%) of the total training steps, depending on the dataset size and batch size", and we would just use a linear increase over that period: I think we should do that; a simple linear warmup at the start -- let's relatively arbitrarily say 5% of our training steps going up to our desired peak learning rate. So our learning rate schedule should look something like this: So far I've written a lot about how we vary the learning rate over time, and that's all been very useful. But we still need to know what the value should be initially! In smaller-scale experiments you might just try a bunch of different numbers to see what worked well, but at more than US$30 per train, that's not practical here. Unfortunately it's really quite hard to find good suggestions published anywhere. The GPT-2 paper is (as usual) reticent: The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText ...and if you search for "learning rate training llm", you'll see lots of results for when people are fine-tuning existing LLMs ( 2 × 10 − 4 comes up a lot), but almost nothing about when you're training one from scratch. I eventually came across this (long!) post from Hugging Face , which I definitely need to spend time going through in the future, because it covers a lot of the ground I've been going over in this post series. But for this post, I think the most relevant part is in the section " Scaling Laws for Hyperparameters ", where they include a figure from this DeepSeek paper . 
Here it is, with some of the (also relevant) surrounding text: In our trains we're using something like 5 × 10^18 total FLOPs. Now, they are specifically charting things in terms of non-embedding FLOPs, but I'm going to play a little fast and loose here and ignore that, so reading off their chart, it looks like we should be using about 1.4 × 10^-3 as our learning rate. We can double-check that against their formula, where C is the compute budget: Nice, a close match! However, it's definitely worth noting that we're using a simple GPT-2 architecture, and they are using something quite different -- RMSNorm instead of LayerNorm, SwiGLU as the activation function on the feed-forward networks, Rotary Position Embedding rather than the fixed ones we're using, and so on.

As a sanity check: you can see that they also give a formula for the optimal batch size in terms of tokens. For our FLOP budget, that comes in at 381,782, which is about 373 of our 1,024-token sequences. That is quite a lot higher than the 97-or-so sequences that appeared to be optimal in our earlier experiments. That is a little concerning, though of course the 97 number came out of a very ad-hoc bit of curve-fitting. For now, I'm going to hope that that doesn't matter too much for the learning rate. This may come back to bite me; if the results of a train with 1.4 × 10^-3 are radically worse than the existing rate of 4 × 10^-4, I'll have to do a bit more investigation.

So, now I think we have all of the theoretical pieces in place to do a train. Let's move on to the practicalities. We started by looking at this: What should we change -- disregarding the weight decay until the next post? Based on the above, we want to do a linear warmup of about 5% of our steps, going up to a learning rate of 1.4 × 10^-3, followed by a cosine decay down to one tenth of that, 1.4 × 10^-4. What does that look like in code?
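Before getting to the scheduling code itself, the numbers quoted above can be reproduced with a little arithmetic. This sketch assumes the DeepSeek power-law fits are η_opt = 0.3118 · C^(-0.1250) for the learning rate and B_opt = 0.2920 · C^(0.3271) for the batch size in tokens; those coefficients are an assumption about the paper's formulae, not something taken from the text here, although they do reproduce the figures quoted above. The function names are made up for illustration.

```python
def deepseek_optimal_lr(compute_flops: float) -> float:
    """Assumed DeepSeek fit for the optimal learning rate at a given compute budget."""
    return 0.3118 * compute_flops ** -0.1250


def deepseek_optimal_batch_tokens(compute_flops: float) -> float:
    """Assumed DeepSeek fit for the optimal batch size, measured in tokens."""
    return 0.2920 * compute_flops ** 0.3271


compute = 5e18  # approximate total training FLOPs, ignoring the non-embedding caveat
lr = deepseek_optimal_lr(compute)
batch_tokens = deepseek_optimal_batch_tokens(compute)

print(f"learning rate: {lr:.2e}")
print(f"batch size: {batch_tokens:,.0f} tokens")
print(f"            {batch_tokens / 1024:.0f} sequences of 1,024 tokens")
```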
The relevant API for scheduling the learning rate in PyTorch is, logically enough, in the module, and there are a bunch of different scheduling classes. You create your optimiser, then create a scheduler for the shape you want, and then you can call on the scheduler (after the on the optimiser) to adjust the optimiser's learning rate over time. Let's make that more concrete; one of the schedulers is , which is what we'll need for our linear warmup period. It takes as its parameters: Let's say that we want to go from almost-zero to our optimiser's learning rate over 1,600 steps -- we'd create our scheduler like this: ...then in our training loop, after we've done the scaled step of the optimiser, we'd also step the scheduler: This confused me a little bit the first time I saw it; after all, if the scheduler hasn't been "triggered" when we step the optimiser, how does the optimiser know what learning rate to use? Surely it would just use whatever it was initialised with? The answer is that when you create the optimiser, it stores away the learning rate that you give it in two places -- an "initial learning rate" and a "current learning rate". Next, when you create your scheduler, it uses the initial learning rate to work out the start and end values, and then sets the current one to the start value immediately. Just by creating a scheduler, you're changing the optimiser's current learning rate -- but not the initial one, which is important, as we'll see in a moment. So, we have a scheduler that handles our warmup period nicely. Another scheduler that's relevant to our interests is the CosineAnnealingLR . This takes: On creation, this scheduler will read in the optimiser's initial learning rate -- note, not the current one -- and then the first time it's stepped, it will set the current learning rate to that value, and then for steps after that it will reduce it so that it follows a nice cosine decay, reaching after steps. 
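Independently of the PyTorch classes, the shape that the two regimes are meant to produce can be written as one plain function of the step number, which is handy for checking what a scheduler should be reporting at any point. This is a sketch under assumptions: it does not use PyTorch, the function and parameter names are made up, the `start_factor` of 1e-5 is just a stand-in for "almost-zero", and the 1,600 warmup steps and 32,000 decay steps are illustrative.

```python
import math


def scheduled_lr(step, peak_lr, warmup_steps, decay_steps,
                 start_factor=1e-5, final_factor=0.1):
    """Linear warmup to peak_lr, then cosine decay to peak_lr * final_factor."""
    min_lr = peak_lr * final_factor
    if step < warmup_steps:
        # Warmup: straight line from peak_lr * start_factor up to peak_lr.
        start_lr = peak_lr * start_factor
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    # Decay: half a cosine cycle from peak_lr down to min_lr, then hold there.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 800, 1600, 17600, 33600):
    print(step, f"{scheduled_lr(step, 1.4e-3, 1600, 32000):.6e}")
```

Comparing values from a function like this against the learning rate the optimiser actually reports at the start, at the end of the warmup and at the end of the run is a cheap way to catch off-by-one mistakes in the step counts.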
So those two cover the two regimes that we want -- the warmup and then the cosine decay. But now we need to put them together; we want to do one and then the other. There's a very useful class, , which allows you to chain schedulers and tell it when each one takes over from the previous one. Let's sketch out some code to use that to do a train with our new peak learning rate of 1.4 × 10^-3, a warmup of 1,600 steps, followed by a cosine decay for the next 32,000 steps to one tenth of the peak learning rate: That actually works quite nicely! I wrote a dummy training loop to plot the current learning rate over a fake train using code like the above, and got this: ...with the output confirming that the values were good at the "milestone" point, the start and the end: I was initially a bit surprised by that, as at the time I ran it, I didn't realise that there was that split between the initial and the current learning rates on the optimiser, so I thought that the cosine scheduler would pick up whatever tiny starting value the warmup scheduler had overwritten the optimiser's learning rate with -- but that split saves the day.

That means that now we have the outline of how to schedule our learning rate. But before we can put that into the code, we need to think about how it affects our checkpoints. Just like the scaler and the optimiser, the learning rate scheduler -- or, indeed, our two schedulers here -- contains information about the state of the train. That means that if we recover from a checkpoint, we need to provide them with the information they need. If we just created them afresh, they'd start from the beginning -- for example, if we restarted from step 20,000 in a train like the one above, we'd start a new warmup from pretty much zero, and then start a fresh cosine decay. That would be bad: (Dummy test code here.) Now, we could use the parameter to initialize them with the correct current global step.
But they have a state dict, like most other PyTorch objects, so the simplest thing to do is just to write that to another checkpoint file: ...and then load it likewise: (Dummy test code here.) Conveniently, if you save the state dict of a , it will also include the state of all of its component schedulers, and likewise if you reload it, it will load the components' states back in too. The one thing you have to be careful about is what they warn about in the PyTorch docs: Initializing a scheduler overwrites its optimizer’s s. When restoring a checkpoint, initialize the scheduler before calling your optimizer's to avoid overwriting the loaded learning rates. Luckily enough, in our code as it stands, we create all of the things that are checkpointed -- the optimiser and the scaler so far, but shortly the scheduler as well -- before we load in the state dicts, so that drops out quite nicely.

So, we have some sketched-out code -- it's time to put it in place for the real training run. I won't go through the details of the changes to my existing DDP training code, though you can see the diff here if you're interested. Much of the complexity was due to keeping backward compatibility so that we don't have to always use a learning rate scheduler; remember that in this mini-series, I'm trying out various changes ("interventions") to the training loop in isolation, seeing whether each one improves things. So it's important to be able to easily train with or without learning rate scheduling; I did that with a flag in the Implementation-wise, initially I was thinking that it would be easiest to always have a scheduler, and in the "non-scheduled" case to just set it to a linear one that didn't change the value over the course of the train. But in the end it turned out to be easier to use as being the switch to tell the training loop which "mode" it was in.
The placement of the code to create the schedulers was also a little tricky; the "natural" place was just after the optimiser is created, like it is in the example code above. However, at that point, we don't know how many global steps we're going to have in the train, because we don't have the dataset -- which means that working out the numbers to pass in to the schedulers for the warmup and decay steps would be impossible. It turned out to be easiest to put it in the function , just after the datasets are loaded, as at that point we have all of the information we need. Anyway, that's the code done, so let's see what happens! I wanted to do two trains; one with the learning rate scheduling, and one with just the new value for the learning rate, instead of . I was expecting the updated learning rate alone to be too high and to cause a very choppy train, but had high hopes for the train with the scheduling. Here's how it did; the scheduled learning rate train first: Here's what the training loss looked like over that: Quite a few loss spikes early on in the train when the learning rate is at its peak, but nothing unmanageable -- and, as you'd expect, things calmed down quite a lot later on. I also charted the learning rate, to make sure it really was doing what I thought it was doing: So, a pretty smooth train, and we definitely did the right learning rate scheduling. Time to upload it to Hugging Face , and see what the evals look like. Firstly, the smoke test: Reasonably coherent, at least, though it's not super-impressive. On to the loss on our test set: That's our best loss so far! Let's put it into the table: So, it definitely looked like it was worth it. But was it the scheduling of the learning rate that helped, or just the change from 0.0004 to 0.0014? I kicked off a second run with no scheduling, just a learning rate of 0.0014, to see what would happen. After about an hour, I noticed that the loss chart had stopped updating. 
The last point had a maximum and minimum loss but no average -- but after that, nothing: However, the learning rate was still being charted, so the train was definitely running: Looking at the checkpoint metadata showed what had happened. At global step 1851, we had this 3 : ...and at the next checkpoint at step 2468, we had this: ...and the same for all checkpoints thereafter. Clearly the parameters had gone off the rails -- exactly what we'd expect with an excessive learning rate: There was no point in continuing the train, as it was pretty much certainly unrecoverable, so I stopped it. Out of interest, I downloaded the model, but I couldn't even run the smoke test on it: So it was pretty clear that just updating the learning rate to 0.0014 was actively harmful. No need to upload that one to HF! And time to wrap up this experiment. While this has been quite a long post, I've really only scratched the surface of how learning rates are set. If I were doing things in more detail, the best would probably be to do a "sweep" over multiple values to try to at least approximate the best possible rate for this model. That would be pretty expensive for me, though, so I decided to stick with the DeepSeek number. It might not be ideal for the specific architecture that I'm using, given how different that is to theirs, but given the results, it's a decent one compared to what I was using. 4 Something that I found interesting is that exactly how to schedule your learning rate is still an area being actively researched. Even in my relatively minimal research, I came across three alternatives to the mainstream warmup-cosine decay pattern: I'm sure there are many more. But for this train, I decided to stick to the mainstream, and the results were pretty good! To reiterate, this has been the most positive intervention so far: So I'll stick with that, and move on to the next thing: what is the parameter that we're passing in to the AdamW optimiser? 
Tune in next time :-)

Yes, I am foreshadowing here. ↩
To make my earlier analogy about learning rate decaying over time in people as they age even more dubious, we can imagine this as being rather like someone middle-aged going on an ayahuasca retreat ;-) ↩
If you're wondering how we had a valid maximum and minimum in that first checkpoint when the average was NaN, here's why: ↩
You might wonder how large labs work out the right learning rate given their training runs run to millions of dollars. The answer is there in that DeepSeek paper, as that's one of the things they were doing. They scaled their model down from the billions of parameters that they wanted to train to various smaller models, and worked out the optimal learning rate for each of the smaller models by doing full trains on them. Once they had a mapping from model size to the ideal learning rate for their architecture, they could extrapolate that to the large ones that they wanted to train. The problem is that those "smaller" models are actually quite a lot larger than the one we're training here! And while we could potentially scale it down even further, I suspect that such truly tiny models (say, 1M parameters) wouldn't train well enough to give any meaningful results. ↩
From the paper: "Specifically, the learning rate of the model reaches its maximum value after 2000 warmup steps, and then decreases to 31.6% of the maximum value after processing 80% of the training tokens. It further reduces to 10% of the maximum value after 90% of the tokens." ↩

, which is the optimiser we're applying it to.
, which the optimiser's learning rate is multiplied by to work out where we want to start up.
, which is likewise applied to the optimiser's learning rate to work out the value we're heading for.
, which is the number of steps over which it should go from the initial learning rate to the final one.
, which lets the scheduler know how many steps into its schedule it currently is -- this defaults to , meaning it hasn't started yet. This can be useful if you're resuming from a checkpoint, but for our purposes we can ignore it.

, which is the same as the 's.
, which is the number of steps before it reaches its minimum
, the minimum learning rate we want to get to.
, again the same as the 's.

Per the Hugging Face paper, some people do warmup, then pause at a set level for a while, then start the cosine decay (warmup-stable-decay).
DeepSeek use a relatively simple stepped function after a warmup. 5
I came across a 2025 paper "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs" which says that a linear decay (after a warmup) outperforms cosine.

Giles's blog 3 months ago

Writing an LLM from scratch, part 32d -- Interventions: adding attention bias

I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". This is the third intervention I'm trying: adding bias to the attention weight matrices. In the code from the book, we have this: So: we initialise the weights W q , W k and W v as linear layers rather than simple matrices of weights, and have a parameter to say whether or not we should add bias to those. In all of our trains so far we've set that to . Why do we have this parameter, and where did it come from? In Raschka's book, the use of the for these weights is introduced in section 3.4.2 with the wording: We can improve the implementation further by utilizing PyTorch's layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using instead of manually implementing is that has an optimized weight initialization scheme, contributing to more stable and effective model training. So, it's presented essentially as a way of getting better weights for our untrained model, which makes good sense in and of itself -- but, if that's the only reason, why don't we just hard-wire it to have ? That would be the sensible thing to do if the initialisation were the only reason, but clearly there's more to it than that. Section 4.1 has a bit more information: determines whether to include a bias vector in the layers of the multi-head attention ... We will initially disable this, following the norms of modern LLMs, but we will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model. That looks like a typo, as the real explanation is in chapter 5, section 5 (page 164 in my copy), where we do indeed load the OpenAI weights: OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key and value matrix computations. 
Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary. So, that all makes sense so far. QKV bias was part of the original GPT-2 models, perhaps just because it was standard at the time, inherited from something else, or perhaps for some other reason -- I can't find any reference to it in the actual paper. But people have found it doesn't help, so no-one uses it these days. But... is there some way in which an LLM of this specific size, or in some other way similar to the GPT-2 small model that we're training, might in some way benefit from having bias? That's what this experiment is for :-) One thing that occurred to me while setting this up is that we have been training on a Chinchilla-optimal number of tokens, 20x the number of parameters. Without QKV bias, we have 163,009,536 parameters, so we've been training on 3,260,190,720 tokens, rounded up to the nearest batch size, which is 3,260,252,160 in our current setup for these experiments (per-GPU micro-batches of 12, with 8 GPUs, so a total batch size of 96). These extra bias terms will be parameters, though! We're essentially making our model larger by adding them, which changes the Chinchilla calculation. How much? OK, that's essentially nothing -- 27,648 extra total parameters on top of 163 million. I make it less than two hundredths of a percent larger! The correct number of tokens goes up to 3,260,743,680, so if we wanted to be very pedantic, we're under-training. But I feel like training on a larger dataset is worse in terms of comparability between the baseline and our "intervened-on" model with QKV bias. So: we'll train a model with QKV bias on 3,260,252,160 tokens, accepting that it's a tiny bit less than Chinchilla-optimal. Let's see how it goes! Here's the config file for this train. Running it gives this training chart: Pretty standard, though the loss spikes look less prominent than they have been in the other trains.
Might QKV bias actually help with model stability in some way...? The train finished with these stats: Timing-wise, pretty much indistinguishable from the baseline train's 12,243.523 seconds. The final train loss looks a tad better, but we can't rely on that -- the test set loss is the important one. So it was time to download it, upload it to Hugging Face Hub, and then on to the evals. Firstly, our normal "how should you continue ": Not bad at all, borderline coherent! Next, the loss on the test set: Well, crap! Now that's a surprise. Let's look at that in the context of the other interventions to see how surprising that is, given Raschka's comments (which were undoubtedly backed up by serious research): So, adding QKV bias actually improved our test set loss by more than gradient clipping did! The loss spikes in the training chart look smaller than in the other trains 1 , so, speculating wildly, perhaps with a model of this size, the bias stabilises things somehow? Or perhaps what we're seeing is the model become that tiny bit smarter because it has some extra parameters -- albeit less than 0.02 percent more? I'm not going to spend time investigating things now, but this is a really interesting result. One extra thing that does occur to me is that the direction research has taken since GPT-2 has definitely been in the direction of larger models. The attention weight matrices are sized $d_{emb} \times d_{emb}$, so excluding bias they have $d_{emb}^2$ weights each. Bias adds on another $d_{emb}$. So, as a model scales up, the attention-related non-bias weights will scale quadratically -- doubling $d_{emb}$ will quadruple their number -- while the bias weights will scale linearly. So perhaps it's just that the effect -- whatever causes it -- gets rapidly swamped as you scale out of toy-model territory. That, at least, seems pretty plausible.
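The parameter arithmetic above is easy to check in a few lines of Python. This is a rough sketch that assumes the standard GPT-2 small shape (12 layers, embedding dimension 768), which the post doesn't spell out; the helper names are made up for illustration, and the base parameter count is the one quoted earlier.

```python
# Assumed GPT-2 small shape; the post does not state these explicitly.
N_LAYERS = 12
D_EMB = 768
BASE_PARAMS = 163_009_536  # parameter count without QKV bias, as quoted in the post


def qkv_bias_params(n_layers: int, d_emb: int) -> int:
    """One bias vector of length d_emb for each of W_q, W_k and W_v, per layer."""
    return n_layers * 3 * d_emb


def qkv_weight_params(n_layers: int, d_emb: int) -> int:
    """Three d_emb x d_emb weight matrices per layer, excluding bias."""
    return n_layers * 3 * d_emb * d_emb


extra = qkv_bias_params(N_LAYERS, D_EMB)
print(extra)                                  # 27648
print(f"{100 * extra / BASE_PARAMS:.4f}%")    # 0.0170%
print(20 * (BASE_PARAMS + extra))             # 3260743680

# How each count grows when d_emb is doubled.
weight_ratio = qkv_weight_params(N_LAYERS, 2 * D_EMB) // qkv_weight_params(N_LAYERS, D_EMB)
bias_ratio = qkv_bias_params(N_LAYERS, 2 * D_EMB) // qkv_bias_params(N_LAYERS, D_EMB)
print(weight_ratio, bias_ratio)               # 4 2
```

Under those assumptions the bias terms are a vanishing fraction of the model, and they shrink further relative to the weights with every increase in embedding size.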
One final note to self, though: these improvements are small enough that I do find myself wondering whether or not it might be some kind of noise, despite the setting of the random seeds I'm doing: I think that at the end of this, before I do a final train, it would be worth doing another baseline train and measuring the test set loss again, and doing another comparison. If it comes out exactly the same -- and I can bump up the number of significant figures in the output, it's just a formatting parameter -- then I don't need to worry. But if they vary to some degree, perhaps I'll need to update my mental model of what level of finding is significant, and what isn't. I think it goes without saying that QKV bias definitely goes onto the list of interventions we want to add when training our best-possible GPT-2 small-scale model, assuming that the random seed test goes well. That surprises me a bit, I was expecting it to have negligible impact! That, of course, is why it's worth doing these tests. Next up, I think, is trying to understand how we can tweak the learning rate, and its associated parameters like weight decay. This will need a bit of a deep dive, so you can expect the next post late next week, or perhaps even later. I'm sure you can't wait ;-) Note to self: is there some way I could quantitatively measure those?  ↩

Giles's blog 3 months ago

Writing an LLM from scratch, part 32c -- Interventions: removing dropout

This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.678. Not much, but it's something! This time, I wanted to see what happened if we trained without dropout. Would removing it make the test loss worse, or better? In a blog post last summer about architectural advances in LLMs since GPT-2, Sebastian Raschka wrote: Dropout (2012) is a traditional technique to prevent overfitting by randomly "dropping out" (i.e., setting to zero) a fraction of the layer activations or attention scores (Figure 3) during training. However, dropout is rarely used in modern LLMs, and most models after GPT-2 have dropped it (no pun intended). I assume that dropout was originally used in GPT-2 because it was inherited from the original transformer architecture. Researchers likely noticed that it does not really improve LLM performance (I observed the same in my small-scale GPT-2 replication runs). This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced. So, since LLMs see each token only once during training, there is little risk of overfitting. That makes quite a lot of sense. My own understanding of dropout was that it was a bit broader than just preventing overfitting -- it seemed to me to be similar to the mandatory vacation policies that financial firms use to prevent over-dependence on individuals. My instinct was that having knowledge distributed across different weights in the model was good in and of itself, even beyond its benefit on multiple-epoch training.
But it is quite a high price to pay. With the training parameters we've been using we're literally discarding 10% of our calculations' results -- attention weights, feed-forward neuron activations, and so on -- as we do the forward pass. It's easy to see why it would harm training. Let's give it a go. The nice thing about this one is that, unlike the gradient clipping experiment, I didn't have to write any new code. The dropout level was already controlled by a setting in the file , so by setting that to zero for this run, I could just kick it off and let it do its thing while I worked on something else: Here's what the training run chart looked like (please disregard the stuff about grad norms in the title and the axis -- I'll remove that for the next train): As you can see, we still have loss spikes, including one just after global step 20,000 that lasts for several checkpoint periods of 617 steps. I imagine gradient clipping might have helped with that, but I'm very deliberately testing each intervention in isolation. At the end of the training run, we got this: So, interestingly, it took 967 seconds -- about 16 minutes -- less time than the gradient clipping run, and about 15 minutes less than the baseline train. So while gradient clipping added on a small amount of time (or maybe that was just noise), dropping dropout certainly seems to speed things up! I guess there's quite a lot of work involved in generating and applying the random masks that drop things out as we're doing the forward pass. Anyway, with the model trained, it was time to download it, upload it to Hugging Face Hub, and run the evals. Firstly, the smoke test, where it just needs to continue the sequence , it came up with something reasonably coherent: ...but it was on the loss on the test set that it was most impressive: That's a bigger improvement on the baseline train's 3.692 than gradient clipping: 0.051, which is more than three times the improvement!
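For reference, here is roughly what the operation being switched off does during the forward pass. This is a plain-Python sketch of standard "inverted" dropout on a list of activations, not the model's PyTorch code; the 10% rate matches the setting described above, while the function name and the seeded `random.Random` are assumptions made for illustration.

```python
import random


def dropout(activations, p, rng):
    """Zero each activation with probability p and rescale survivors by 1 / (1 - p)."""
    if p == 0.0:
        return list(activations)  # dropout disabled: nothing is discarded
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(42)
activations = [1.0] * 10_000

dropped = dropout(activations, p=0.1, rng=rng)
zeroed_fraction = dropped.count(0.0) / len(dropped)
print(f"fraction zeroed: {zeroed_fraction:.3f}")  # close to 0.1

untouched = dropout(activations, p=0.0, rng=rng)
print(untouched == activations)  # True
```

Every call with a non-zero rate has to draw a random number per element and build a new masked result, which is at least consistent with the run getting faster once the rate is zero.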
Let's start keeping a table of these: Now, of course, we don't know how these different interventions combine together -- it would be naive to think that if we did both gradient clipping and dropout removal, we'd get a total loss reduction of 0.014 + 0.051 -- but, especially with that long-lived loss spike in our training run -- it does feel like they might play well together. So, that's dropout covered. Which one next? I think a nice easy one that I should be able to get done on a Friday will be adding bias to the attention weight calculations. Let's give that a go and see if it makes things worse or better! Stay tuned...

Giles's blog 3 months ago

Writing an LLM from scratch, part 32b -- Interventions: gradient clipping

I'm still working on training the best GPT-2 small sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the minimal training run in the book, just modified to run using DDP on an 8x A100 40 GiB/GPU machine in the cloud. There are a bunch of "interventions" I want to try to see if they'll make it better, as measured by the loss they get on a test set. I'll do a post for each intervention, and this is the first: gradient clipping. In the training chart for the baseline model, you can see that there are three places where the loss suddenly spiked up, at around global steps 4,200, 13,000, and 23,000: There are a number of things that could cause loss spikes like that: Exploding gradients are common in RNNs, and also happen in LLMs like this one. I spent a bit of time reading around to find out how they happen, and the ah-ha moment came when I came across this post from Wanshun Wong . Not only is the post itself a good intro in terms of how it affects RNNs, but in the "further reading" at the end, there's some gold: Chapter 10.11 of [1] has a good overview of how gradient clipping works. Now, I bought a copy of " Deep Learning " at the same time as I bought Raschka's book, but I'd only glanced through it. Now was the time to get it down from the shelf -- and, indeed, section 10.11.1 is all about clipping to handle exploding gradients. I'll put the explanation of how they happen into my own words, to see if I can clarify things (at least in my mind). Normally, when we learn about gradient descent, it's illustrated with nice smooth loss charts like this imaginary one for a single-parameter model: We're told that we might start at point A. 
The gradient is quite high and negative, so we multiply it by our learning rate and subtract it from our parameter. That gets us to point B. This time around, the gradient is smaller as the curve is flatter there, so when we do the same -- multiply by LR and subtract -- we take a smaller step, and wind up at C. Rinse and repeat and we'll wind up near the minimum. The problem is, what if the loss curve actually looks like this: We start at A, with a small gradient, move a little to the right, and now we're at B halfway down a cliff! The gradient is massive, and when we subtract it, even scaled by the learning rate, we can zoom off somewhere to the right -- maybe not even on the chart. Indeed, you can imagine a cliff that is so steep that it would have vertical portions -- negative infinite gradients in this case -- and no matter what your learning rate is, you'll wind up with an infinite parameter update and everything will break. It's hard to see how a model can continue training in a case like that. Now, what can cause steep cliffs like that? The book says "strongly nonlinear functions, such as those computed by a recurrent neural net over many time steps". If you know about RNNs (I wrote about them if you'd like a summary), you'll remember that a single RNN might be quite shallow -- maybe three or four layers -- but when you're doing backpropagation, you run a number of inputs through, one after the other, work out the overall loss, and then "unroll" it to something similar to a "vanilla" neural net to do the backward pass. To put that in concrete terms, a 3-layer neural network trained with a 100-element sequence would unroll to a 300-layer deep network. Every one of those layers has several operations, including (in the implementation I was looking at in my post above), a tanh. It's not surprising that there are cliffs in the loss landscape -- it's more surprising that there are any smooth bits!
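The cliff scenario described above can be reproduced numerically. What follows is a toy sketch with a made-up one-parameter loss (a gentle bowl centred on x = 3 with a steep tanh-shaped drop near x = 1); the function, the learning rate and the starting point are all assumptions chosen for illustration and are not taken from the charts in the post.

```python
import math


def loss(x):
    """Gentle quadratic bowl plus a steep cliff located around x = 1."""
    return 0.05 * (x - 3.0) ** 2 + 5.0 * (1.0 - math.tanh(50.0 * (x - 1.0)))


def loss_grad(x):
    """Analytic derivative of loss(); the tanh term produces the huge slope."""
    t = math.tanh(50.0 * (x - 1.0))
    return 0.1 * (x - 3.0) - 250.0 * (1.0 - t * t)


def descend(x, lr, n_steps):
    """Plain gradient descent, returning every parameter value visited."""
    path = [x]
    for _ in range(n_steps):
        x = x - lr * loss_grad(x)
        path.append(x)
    return path


path = descend(x=0.9, lr=0.1, n_steps=5)
steps = [b - a for a, b in zip(path, path[1:])]
for before, step in zip(path, steps):
    print(f"x = {before:8.3f}, gradient = {loss_grad(before):10.3f}, step = {step:+8.3f}")
```

The first couple of updates are tiny, and then one step taken from the cliff face throws the parameter far past the minimum at x = 3, which is the "zoom off somewhere to the right" behaviour.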
Now in LLMs, we don't have that unrolling through time -- but our network is deep enough as it is. For the GPT-2 small model, disregarding the embeddings and the final output head, we have 12 Transformer layers, each of which is multiple matrix multiplications for attention, then a softmax, then another layer, and then a feed-forward... mapping precisely to the equivalent vanilla NN is hard, but I think you can treat each one as at least four layers, so we've got 48. And there are GELUs and logs and exps 1 dotted around, so again -- we should expect cliffs. So if sometimes we'll get crazy gradients, what can we do about them? We clip them. Clipping gradients simply means that if they get larger than a particular number -- v , which we define -- we reduce them to that number. In other words, we have a cap on how big they can get. "Deep Learning" ("DL" from now on) suggests two ways to do it. Remember that while in the example above, we only had one parameter -- on the X axis -- for the GPT-2 small LLM we're training, we have 163 million of them. So the gradients, instead of being one number, will be a 163M-long vector, one per parameter. The two ways to clip are: The second feels more elegant -- we're scaling all of the elements of the gradient vector by the same amount, so it still points in the same direction. Interestingly, though, DL says that the two methods "work similarly", which I'll read as "are pretty much the same in practice". DL then goes on to say how infinite or not-a-number gradients should be handled. With the first way, clearly doing it naively would set every element in the gradient vector to v , which would make the total size (norm) of the update very large. With the second, it would be even worse -- we'd still wind up with completely junk gradients, because the norm would be infinite, and in Python is , so we'd be applying gradients with NaNs in them at best.
That would be likely to knock our model into unrecoverable territory, as any parameter that had that applied to it would be NaN forever. Their suggested solution is that if you get garbage gradients like that, you can take a random step -- that is, create a new gradient to apply that has the norm v but just points in a random direction. The idea is that this will move you away from the cliff-ridden part of the loss landscape where you've found yourself (more about that later), and things will continue nicely. So, anyway, how to do this in practice? PyTorch has a function, , and that's what's referenced in almost every bit of writing I've found about how to clip gradients. So I decided to use that, assuming it would do what was described in DL's second option and that it would do the random updates they suggest for non-finite gradients. (I was half-correct -- see later.) As to how to use it -- if we had a normal training loop, where we were just using a normal optimiser, we would go from: ...to something like ...where is the max value v from above. However, for our training code using Automatic Mixed Precision (AMP), it's a little more complicated -- but luckily, the AMP explainer we've been using has a section explaining what to do . Right now we have this: Per that explainer, we need to move to this: That looks a bit weird; we're "unscaling" the gradients, then clipping them, then using the scaler to step the optimiser. You'd think that you'd need to "re-scale" the scaler after clipping the gradients -- to get back to where you started from before the optimiser step. From the help page I gather it keeps track of whether or not the gradients it has right now are currently scaled and handles them appropriately based on that state in . Anyway, given that we know what the code looks like now, we need to implement it in a way that can be easily switched on for this experiment (and potentially in the future), but which also allows us to not use it if we don't want to. 
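Stepping back from the PyTorch specifics for a moment, the two clipping strategies from "Deep Learning" described earlier can be sketched in plain Python. This only illustrates the idea, with hypothetical helper names (`clip_by_value`, `clip_by_norm`) and the gradient treated as a flat list of floats; it is not how PyTorch implements its clipping function.

```python
import math


def clip_by_value(grad, v):
    """First way: cap every element of the gradient to the range [-v, v]."""
    return [max(-v, min(v, g)) for g in grad]


def clip_by_norm(grad, v):
    """Second way: if the L2 norm exceeds v, rescale the whole vector to norm v."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= v:
        return list(grad)
    scale = v / norm
    return [g * scale for g in grad]


grad = [3.0, -4.0]  # L2 norm is 5
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print(by_value)  # [1.0, -1.0]: the direction has changed
print(by_norm)   # roughly [0.6, -0.8]: same direction, norm 1

# A single infinite element makes the norm infinite, so the scale becomes 0.0
# and inf * 0.0 is NaN: the "junk gradients" problem described above.
broken = clip_by_norm([math.inf, 1.0], 1.0)
print(broken)
```

The last call shows why non-finite gradients need separate handling: scaling by the norm cannot rescue a vector whose norm is already infinite.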
The best way with our setup is to make it a training option, so we can do it this way: ...with extracted from the file where we call it in : ...and we can just pass in for it in our function that we use to find the maximum micro-batch size for our current hardware, as all we're testing for there is memory usage -- we don't care if we're doing good updates. Here's the code delta for that , plus a bugfix to allow for files without a in them. But it would also be useful to be able to track when it "fired" -- that is, when we had to clip our gradients. Then we can see two things: Now, the docs for say that it returns the "[t]otal norm of the parameter gradients (viewed as a single vector)". It doesn't say whether that's before or after the clipping, but given that the return value would always be if it was after, I'm going to guess that it returns the pre-clipping norm (ChatGPT agrees). So we can chart that; changes in these diffs: 1 , 2 , 3 , 4 . So we now have code to clip gradients to a given norm size and to chart the gradient norms so that we know what they were before clipping. The question is, what should that clipping norm be? Some googling around suggested that there was no standard way of saying "for such-and-such a kind of model, gradients should be clipped at around x ". For example, on this Reddit thread , says "Common values are 1, 3, 5, 8, 10", and likewise sample code in this tutorial . has 1, as does this one . So my initial thought was, let's just use 1. But then I wondered, what actually are the gradient norms that we're getting in normal training? I decided to run a local short train on 3m tokens (a thousandth of the full training set, taking just less than four minutes) with very frequent checkpointing, and gradient clipping set to 1, and see what happened. You can see that the "grad max" line is almost always above the "grad clip" -- we're almost always clipping. This doesn't sound right. 
It looked like the range of the grad max was generally between 1.1 and a little above 3, so I set the to 3.5 and did another train: Our loss is about the same, but we're no longer clipping -- and that's what we want; there was no evidence of exploding gradients for that short run -- just big updates near the start, as you'd expect. I then ran the same with no gradient clipping at all, and got exactly the same shape for the loss chart as I did with gradient clipping at 3.5, and the same final loss -- that's a good signal that clipping is not affecting the train when we stay inside the limit, which is exactly what we want. So, it was time to train our model! I kicked off the train, and after a little while, I looked at the training chart, which is updated dynamically as the model trains: You can see the dotted green lines, both the light one and the dark one -- that is, the "grad max" and the "grad avg" -- disappear starting just before global step 4,000, only coming back at about 5,500 -- that is, these were not plotted for global steps 4,319 and 4,936, even though the loss was. What was going on? I took a look at the checkpoint meta file for the first of those to see what the actual numbers were, and saw this: Aha! The PyPlot code I was using could not handle infinite values, which is entirely reasonable. That was easy enough to fix, though -- I just replaced positive infinity by 1,000,000 and negative infinity by -1,000,000, and then (in the interest of getting a proper from-scratch run) kicked everything off from the beginning. That training run completed with this chart: That's a little hard to read, but if you look closely at the green lines, you can see that there are seven periods where gradients were either very large or infinite. Weirdly, though, out of the seven, two of them were two checkpoint periods long (that is, two periods of 617 global steps).
That felt weird, though of course we're looking at the maximum gradient norm and the average gradient norm -- so two single infinite/high-gradient steps in successive 617-step periods would lead to that effect. What was even stranger, though, was that if you look at the training chart for the run with no gradient clipping, we have only three loss spikes rather than seven: ...though it's also very noticeable that the gradient-clipped run had only two small loss spikes, unlike the three larger ones in the unclipped run. The training loss the gradient-clipped run reported at the end was better, too: ...versus 3.743 at the end of the baseline train. So it was time to download it, and run the sequence-completion smoke test: Coherent enough! Next, we evaluate it against our held-back test set: So, the loss had gone down -- but only from 3.743 to 3.678, a reduction of 0.065, or about 1.7%. That's not actually all that bad! After all, in my initial experiments on my local machine, training for a Chinchilla-optimal number of tokens from FineWeb-Edu (rather than the regular FineWeb I'm using now) got a loss of 4.167 on the same dataset (weirdly worse with the more-curated training set), and training for a further Chinchilla-optimal number of tokens only brought that down to 4.135, for a difference of 0.032, or 0.7%. It's not strictly comparable due to the different training sets, but speaking very loosely, we could say that gradient clipping for this train had more effect than doubling the training time for the other one. That's pretty nifty. But the question remained: why those long periods of high gradients, even with gradient clipping? And why were there still loss spikes -- in particular the one just before global step 12,000, which lasted for two checkpoint periods? Remember that when I started the first run of this train, and got the chart with the missing bits, it was because the logged and were infinite. 
What happens when gets an infinite gradient -- either one that has an infinity as one of its components, or one that (due to numerical overflow) winds up with a norm of infinity anyway? I'd been kind of assuming that it did what the authors described in "Deep Learning" -- a random update of norm v -- given that the book stated pretty confidently that you "can" do it but then appeared to consider the topic closed. But it doesn't! If you check that link to the docs, you'll see that it has a parameter , which is by default. If it's set to , that will raise an exception if the norm is positive or negative infinity, or if it's not a number -- which catches both the infinite component and the norm overflow cases above. But if it's not set -- and we weren't setting it -- and the norm or the gradients are non-finite, then will essentially return garbage gradients. Depending on the exact cause, elements will either be infinities of one sign or another, or NaNs. And if these are added to parameters, then those parameters will become garbage too. Now that leads to the question, given that we know that somewhere in the period between the checkpoint at global step 4,319 and the previous one at 3,702 there was an infinite norm at some point, how on earth did the model manage to continue training after that? Loss went up at around the same time, but it wasn't completely broken as it would have been with NaNs or infinities in its parameters. Obscurely enough, the answer turned out to be in the AMP explainer , in a comment in one of the bits of example code. Regarding the class we're using: So what was happening was that the scaler -- something we introduced into our code to get a speedup by using 16-bit floats instead of 32-bit whenever PyTorch thought it would make sense -- was protecting us against infinite and NaN gradients as a side-effect. It was skipping updates that would have polluted our weights with bad values from non-finite gradients. 
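The skipping behaviour described above can be illustrated without PyTorch. The sketch below is a plain-Python stand-in and not the scaler's actual code: `safe_step` is a hypothetical helper that applies a basic SGD update only when every gradient element is finite, and leaves the parameters untouched otherwise.

```python
import math


def safe_step(params, grads, lr):
    """Plain SGD step that is skipped entirely if any gradient is inf or NaN."""
    if not all(math.isfinite(g) for g in grads):
        return list(params), False  # parameters unchanged, step skipped
    return [p - lr * g for p, g in zip(params, grads)], True


params = [1.0, 2.0]

# A normal update is applied.
params, applied = safe_step(params, [1.0, -2.0], lr=0.5)
print(params, applied)  # [0.5, 3.0] True

# An update containing an infinity is dropped, so the weights stay clean.
params, applied = safe_step(params, [0.5, math.inf], lr=0.5)
print(params, applied)  # [0.5, 3.0] False

# Without the check, the same gradient would poison a weight for good.
naive = [p - 0.5 * g for p, g in zip(params, [0.5, math.inf])]
print(naive)  # the second element is now -inf
```

Once a weight has become infinite or NaN, no later finite update can bring it back, which is why dropping the whole step is enough to keep training alive.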
If the above comes across as a little frustrated, then it's because I am a bit! From a software engineering viewpoint, this situation really does feel a bit like a rather messy part of the API. There are three things that it's reasonable for a library to do with infinite/NaN gradients: Now, if we look at that , we can see that the first two of those cases are handled there; and the developer can choose which option to follow. It's not where I'd personally put it (the function on the optimiser seems more natural) and I think I'd probably set the default to too, but I can also imagine good reasons for it being the way it is -- backward compatibility for one. But the "skip non-finite gradients" being a (not even optional!) behaviour that is on a class designed for handling mixed-precision training just seems outright bonkers. I would be surprised if there weren't people out there who've spent days trying to work out why their training runs failed catastrophically when they decided to switch from mixed-precision to "full fat" 32-bit floats, not realising that a hardly-even-documented feature of the scaler 3 had been saving them from gradient issues previously. Anyway, rant over. What does this all mean? There are three ways a gradient can explode: With both the baseline code and our new code, the was saving us from the last two of those, by skipping the optimiser steps with non-finite gradients. However, the baseline run was not protected against the first kind -- large but finite gradients with a finite norm -- while this run was protected. What I'm almost certain is happening here is that in all of my training runs so far, there have been all three kinds of issues with exploding gradients. The , which again, we introduced for faster training, happened to be saving us from the infinite gradients/norms. But we were still being bitten by the finite but excessively large ones. 
And that, I think, is why this training run had a positive -- not huge, but certainly worthwhile -- effect on the test set loss. If I had more time, I think I'd do another run, logging all three of those categories of error to see how frequent they are, and charting the result. That might go some way to explaining the final question I had here: why is it that the renowned "Deep Learning" suggests a random update to get away from the cliff where you've found yourself, while we seem to be getting away with just skipping the update, which is much simpler? Well, the book was written in 2016, and I guess rather a lot has changed in the last 10 years :-) My guess is that their solution might have been a solid default in the age of RNNs, but might not make so much sense with the kind of models we're training these days. I think I can see a way in which that makes sense. Think of the illustration of a loss "cliff" in a one-parameter world that we had at the start of this post: If you happen to wind up on that cliff, you're in trouble. But imagine a two-parameter model -- the line of the loss function becomes a surface. Just as in the real world you might be able to walk along the edge at the top of a cliff and find a nice easy slope down next to it, you can imagine that the cliff in the two-parameter case might be less of a problem because you don't need to be lucky enough to jump down it -- you can walk around it. Extrapolating examples like this to higher dimensions is risky, but I think it should hold that the more dimensions you're working with, the less likely it is that a cliff is an issue -- you're more likely to be able to find a way around it. I've heard a very similar argument made for why local minima are less of an issue with lots of parameters. It's certainly worth saying that this is far from a mathematical proof, but I think it's a decent grounding for intuition. Now think about an RNN. 
Although you're doing back-propagation through time over what amounts to a very deep network, there aren't actually all that many parameters, certainly compared to an LLM like this. Each parameter is involved in the back-propagation multiple times. So, thinking of it that way, the gradient vector for the RNNs they were dealing with was of much lower dimensionality than the ones we're dealing with, even for this tiny model. They say that the random step "will typically move away from the numerically unstable configuration". I'm probably playing fast and loose here, but I'll take that as something like: if you wound up on a cliff, you were likely in a very "cliffy" area of the loss landscape. "Teleporting" randomly to somewhere some distance away was a sensible way to handle that. In our situation, even if the area is "cliffy" in the direction that one particular batch might push us, we have so many extra dimensions that it may well be that it won't be so bad with the next one. So just skipping the problematic update -- under all of those assumptions -- seems a perfectly reasonable way to handle it. All of this, BTW, made me think back to validation loss. In our previous training runs, where we were measuring it just before each checkpoint, its spikes were in general correlated with but not identical to spikes in training loss: Now, of course, exploding gradients don't have to be related to high training loss -- there's enough non-linearity in there that we can treat them as being completely uncorrelated, I think. But you definitely would expect them to have an effect on validation loss if applied. Disregarding the infinite ones (which were being filtered out anyway), the very high ones that we are now clipping would, in the unclipped baseline train, seem very likely to have caused validation loss spikes. So: if I hadn't stripped that out, we would likely have been able to see a clear difference in the validation loss line between clipped and unclipped. 
That would have been useful! I'm not going to re-introduce it, though. Best to keep the number of code changes to a minimum if I'm trying to compare like with like over the course of these intervention tests. I think that's enough for gradient clipping. I may come back and do the experiment another time to see what the relative ratios of the different kinds of problematic gradients are. Are there parts of the train where we get lots of them as a percentage (ie. we're somewhere "cliffy" in the loss landscape)? How many infinite gradient vs infinite norm vs big-but-not-infinite instances do we have relative to each other, and to normal gradient updates? What do we see if we have validation loss? And so on. But for now: gradient clipping definitely helps, and goes on the positive interventions list! I'm thinking I'll see what happens with switching off dropout next. That should at least be a bit easier... Stay tuned! Oh my .  ↩ Technically the L2 norm -- if you used cubes/cube root it would be L3, and likewise for the power of four and L4 and so on. But the L2 is the one used for gradient clipping.  ↩ Shades of Douglas Adams , really: "But the plans were on display..." "On display? I eventually had to go down to the cellar to find them." “That’s the display department." “With a flashlight." “Ah, well, the lights had probably gone." “So had the stairs." “But look, you found the notice, didn’t you?" “Yes," said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard."  ↩ A "bad batch" -- that is, one batch, or even one sequence in a batch, was massively different in structure to the others that the model had seen, so it just had much worse loss. That doesn't seem likely in this case, though: the numbers on the chart are averages over 617 global steps each, and it would take a truly pathological sequence to move the needle that much. Something weird in the optimiser. 
That's not something I understand well, but according to the various LLMs I'm working with, it's a possibility. Exploding gradients. This is my working hypothesis, and so in this post I'll try out gradient clipping, the normal solution to that problem. I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning (2016), MIT Press. We clip element-wise. If any one of the gradients in the vector is larger than v, we reduce it to v. We clip based on the norm: the length of the gradient vector in -- in our case -- 163M-dimensional space. That sounds harder than it is -- it's really just an extension of the Pythagorean equation that a² + b² = c² to multiple dimensions. If you want to work out the length of a vector (a, b) then you can use Pythagoras to work out c = √(a² + b²), and that generalises to any number of dimensions. So for our model we'd just square all 163M elements of the vector, sum those, and take the square root of the result, and that's the norm. 2 If the norm is greater than v, we just divide every element of the gradient vector by the norm and multiply the result by v, to produce a new gradient vector whose norm is v. Whether we actually did wind up clipping them and fixing those loss spikes. Whether we were clipping at other times -- we don't want to be doing it unnecessarily. Blindly apply them and expect the developer to sanitise their inputs. Raise an error. Take some kind of default sane action, like skipping the update. It can get very large, still be finite, and have a finite norm. It can get very large, still be finite, but have an infinite norm (eg. due to numerical overflow). It can become infinite -- that is, at least one of the parameters' gradients is infinite (which of course means an infinite norm regardless of any numerical stuff).
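To make the norm-based clipping and the three "problem gradient" cases from this post concrete, here is a minimal pure-Python sketch. It is not the training code from the post, and not how any particular library implements clipping; the function name `clip_by_norm` and the status strings are made up for illustration, and it assumes the "skip the update" policy for anything non-finite.

```python
import math

def clip_by_norm(grads, v):
    """Clip a flat list of gradient values to L2 norm v.

    Returns (new_grads, status); new_grads is None when the update
    should be skipped. Hypothetical helper, for illustration only.
    """
    # Case 3: at least one gradient is itself infinite (or NaN).
    if any(not math.isfinite(g) for g in grads):
        return None, "skip: non-finite element"

    # L2 norm: square every element, sum, take the square root.
    norm = math.sqrt(sum(g * g for g in grads))

    # Case 2: every element is finite, but squaring them overflowed.
    if not math.isfinite(norm):
        return None, "skip: norm overflowed"

    # Small enough already: leave the gradients alone.
    if norm <= v:
        return list(grads), "unchanged"

    # Case 1: large but finite. Divide by the norm, multiply by v.
    scale = v / norm
    return [g * scale for g in grads], "clipped"

clipped, status = clip_by_norm([3.0, 4.0], v=1.0)
print(status, clipped, math.sqrt(sum(g * g for g in clipped)))
print(clip_by_norm([1e200, 1e200], v=1.0)[1])
print(clip_by_norm([float("inf"), 1.0], v=1.0)[1])
```

The (3, 4) vector has norm 5, so with v = 1 every element is scaled by 1/5 and the result has norm 1 (up to floating-point rounding). Logging the returned status per step would be one way to count how often each category turns up.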

Giles's blog 3 months ago

Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

I'm rounding out my series of posts on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) " by seeing how I could train the best base model I can from scratch on my own hardware. I started by training one in two days on my RTX 3090 , and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better. For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud . That led to some refinements in the prompt-following test I was using , and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub . Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to. I listed a number of possible interventions at the end of the RTX 3090 post; I'm not going to do them all, but for completeness, here's the full list: I'm going to work through each of those apart from the first two and the batch size (and will retrospectively add links to the posts when I do), trying a train with just that intervention and nothing else, on a cloud machine. Once that's done, I'll bake all of the things that helped into the training loop, and do another local train -- with gradient accumulation to make the batch size match the cloud instances'. The cloud machine size that I decided to use for this was the one that came out the most cost-effective (and due to its VRAM size, had the best loss) in my earlier cloud training test: an 8x A100 machine with 40 GiB VRAM per GPU. 
But first, we need a baseline model. I've already done a train on an 8x A100 40 GiB machine -- why do we need a new one? In my cloud training post, I came to the conclusion that the cost in terms of training time of running a periodic validation loop as we trained was not really worth it, at least in this case. Two of the biggest reasons to have validation during training are to work out when you're overfitting on a multi-epoch train, and to see how your model can handle datasets that it has not been trained on. In a single-epoch train like this, you're not going to overfit -- every sample it sees will be new to it -- and the training loss itself is over samples it's not been trained on at the time it was calculated, for the same reason (though of course it will be trained on them as soon as we do the backward pass starting with that loss). Of course, it's not perfect -- a big benefit of the validation loss is that it's over the same held-back dataset on every run -- and there are arguments for keeping it (albeit, perhaps doing full runs less frequently than I was). But for these experiments, I decided that I'd simply drop it. I also wanted to introduce a consistent random seed at the start of the training loop. I didn't have that in my cloud trains, and of course if we want to have solid results on whether each intervention really does improve matters, then we need one so that we can be sure they're all starting from the same point. Both of those meant that I couldn't use the earlier train on the 8x A100 40 GiB machine as a baseline; I'd need a new one, introducing those two changes: no validation during the training run (using training loss as a proxy), and setting a random seed at the start for reproducibility. So: what was the baseline train going to look like? 
The first step was to strip out the validation code and to replace it with code that just took periodic checkpoints, keeping track of which one had the best average training loss over the period since the previous one. Next, I decided to plot on the training chart that is generated during the run not just the training loss, but also an indicator of the maximum and minimum training loss over all of the steps in that period. Then I added the random seed, which I set to 42. A couple of bugfixes, and we were left with this version of the code. One thing to highlight: in the file that specifies the various training parameters, I set the per-GPU micro-batch size to 12 rather than the 13 I'd used on this size of machine earlier. Two reasons for that: Firstly, I'm going to want to do a local run with gradient accumulation later, using all of the helpful interventions. With gradient accumulation, you do a number of steps with batches that you can fit into your memory, but you don't update the parameters each time; you just let the gradients add up. After a number of those, you do one big parameter update based on the accumulated gradients -- hence the name. The full batch is all of those smaller batches taken together. If I want that to closely match the cloud train, I'll want the accumulated batches to be the same size as each global batch in the cloud. Now, on my local machine, I can fit a batch of 6 into VRAM. So that means that the full batch needs to be divisible by 6 1 . On the cloud train, with a micro-batch of 13 and 8 GPUs, we had an overall batch size of 104 in the previous train. 104 is not divisible by 6: no joy. But with a micro-batch size of 12, we have an overall batch of 12 × 8 = 96, which means we'd be able to do gradient accumulation and do a parameter update every 96 ÷ 6 = 16 steps. Secondly, while my estimate of the ideal overall batch size was based on a rather arbitrary bit of curve-fitting, it did say that 97 was the ideal size. So it could be interesting to see whether it did help!
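The batch-size arithmetic above, and the reason accumulation can stand in for a bigger batch, can be sketched in a few lines of plain Python. This is a toy illustration rather than the training loop from the repo: `accumulation_steps` and `grad` are hypothetical helpers, and the "model" is a single-parameter linear fit chosen only because its gradient is easy to write down.

```python
import math

def accumulation_steps(global_batch, local_micro_batch):
    """Micro-batches to accumulate per parameter update, or None if the
    global batch is not an exact multiple of the local micro-batch."""
    steps, remainder = divmod(global_batch, local_micro_batch)
    return steps if remainder == 0 else None

# 8 GPUs in the cloud, micro-batch of 6 locally.
for per_gpu in (13, 12):
    global_batch = per_gpu * 8
    print(per_gpu, global_batch, accumulation_steps(global_batch, 6))

def grad(w, batch):
    # d/dw of mean squared error for the toy model y_hat = w * x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(i), 2.0 * i + 1.0) for i in range(96)]
w = 0.5

full = grad(w, data)  # gradient over the whole batch of 96 at once

steps = accumulation_steps(len(data), 6)
accumulated = 0.0
for i in range(steps):
    micro = data[i * 6:(i + 1) * 6]
    # Accumulate only; no parameter update happens inside this loop.
    accumulated += grad(w, micro) / steps
# A single optimiser step would happen here, using `accumulated`.

print(math.isclose(full, accumulated))
```

Because all 16 micro-batches are the same size, averaging their mean gradients gives the same value as the mean gradient over all 96 samples, which is why the accumulated update can match a global batch of 96.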
So, having coded that up and set up the configuration, it was time to run it. Here's the training chart it came up with: Note the loss spikes at around global steps 4,200, 13,000 and 23,000. Those are important, I'll explain why later. The training run reported this at the end: So it took about 3h24m to train, even less than we expected from the previous cloud experiments' estimates of how long it would take excluding validation. About US$35 in cost. Here is the model on Hugging Face Hub . Let's see how it looks. For these intervention posts, I won't run the instruction-following tests, as they can only be run against a batch of models in one go to get results that are consistent with each other . But the smoke test -- how does it complete the sequence is worthwhile: Looks good! Reasonably coherent. Now we can find the loss on our held-back test set: That's a bit worse than the 3.674 we got for the original cloud train. Either the calculations of the optimal batch size I did were not quite right (entirely likely, they were very ad-hoc) or the model weights we started with, given the random seed we're using, just happened to lead us in a slightly worse direction (also plausible). Either way, it's in line with what we expected, and is still better than the test loss of 3.725 that we got with the second-best machine in the cloud comparison post (the 8x H100 80 GiB with a global batch size of 216). So: we have a solid baseline model -- before we wrap up, let's consider those spikes in the loss that I called out in the training chart. Random spikes in the loss are a Bad Thing, right? Certainly they're a bad thing for a train in general, especially if you don't know for sure what's causing them. But my working assumption has been that they're caused by exploding gradients -- for some specific sample in the dataset, the gradients have gone up to some insanely high value, and we've had a bad update to our parameters as a result. 
It hasn't completely knocked the model back to its starting point, but it does take some time to recover, so we lose the benefit of some of our training. If that is the case -- and it's not just something like a batch happening to have stuff that's wildly different to the rest of the training data, or something weird in the optimiser -- then gradient clipping is the solution. I wanted to see if it would help the model quality in general, but of course if we hadn't had any loss spikes in this baseline train it would have been hard to see if that was the case! So I was very glad to see them here, as if there had been none I would either have had to do a gradient clipping experiment with no real expectation of it helping -- or do another baseline train with a different random seed in the hope that that caused some spikes, which would have cost another US$35. All in all, it was good to see them there, as it sets us up well for that experiment. So, we've trained a baseline model that we can make changes to -- the interventions I listed at the start -- and get a pretty reliable understanding of whether or not they help the quality of the final model. With that in place, we're in a good position to start running those intervention tests! Given the loss spike situation in that chart, I think that a solid first one to go for -- even though it was the last in that list at the top of this post -- is gradient clipping. Where are those loss spikes coming from, and if it's exploding gradients, what happens if we limit the damage they do with gradient clipping? Stay tuned! I've already done the training run for that (while I wrote this one up), so I should be able to post about it tomorrow. Well, you could potentially do something with batches of different sizes, but that would be fiddly.  ↩ The amount of training data. 
I'm not going to dig into this one; it looks like it does help, but the returns diminish rapidly, so I think that in order to get any serious improvement we'd need to train for much more than two days locally. In the one "extended training" test I did, I managed to get the loss down from 4.167 to 4.135, which was... less-than-inspiring. The number of epochs. I'm going to stick to single-epoch training -- that is, I'll train on a single pass through an amount of non-repeating data chosen to take 48 hours to handle on my local machine. The bias on the W q , W k and W v matrices. This one definitely sounds worth looking into -- easy, as it's just a change to a config flag, and makes the model more like the original GPT-2. I'll give that a go. Dropout. I've read that for single-epoch training, dropout doesn't help (which doesn't quite work with my mental model of what it's for, but does sound plausible). Worth a look! The learning rate, and weight decay. The values I've used for these are basically copypasta from the book. I think I should learn to understand these and try to optimise them a bit. The precision. I'm using AMP , which means that some calculations are done in 16-bit rather than 32-bit, and calling with to let PyTorch choose to use the GPU's tensor cores, which use TF32, a kind of "32-bit float lite" (see the post on the local train for details). Those both (at least potentially) reduce the precision of the train below what you'd get if you trained with full-fat . Would reverting that be worth the longer train time? I should probably at least poke at that. The batch size. I've already, in effect, tried playing with that. The different cloud machines I played with had different amounts of per-GPU VRAM, so supported different per-GPU micro-batch sizes. So I wound up trying batch sizes from 512 (the same as the original GPT-2 was trained with) down to 104 in the cloud, plus my local trains with a batch size of 6. 
I did a rough-and-ready calculation at the end of the cloud training post where I estimated that the ideal batch size might be something like 97. So, probably not worth much more investigation. Exploding gradients. In one of my local trains, and in three out of the four cloud trains, I had sudden spikes in both training and validation loss. It generally took quite a bit of training -- maybe 10-15% of training time -- to get back on track after some of these, so we had what could be seen as wasted time in the training runs. Exploding gradients can be fixed by gradient clipping, which is relatively easy to do. Definitely worth investigating! Well, you could potentially do something with batches of different sizes, but that would be fiddly.  ↩

Giles's blog 3 months ago

Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)

I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone that was interested. I managed to get it done, but it was kind of tricky to get right. The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end. This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-) Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it? You could, of course, just dump the code on GitHub and share the weights somewhere.
If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on. That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library , using models that had been uploaded to their hub . What it would be nice to do is share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this: ...rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config then -- rather than like this , with its >100-line function. Here's what I had to do to get it working. To make it easier to follow along with this post, I've created a GitHub repo . As a starting point, I recommend you clone that, and then check out the tag: You'll see that there's a file, which contains my version of the GPT-2 style LLM code from Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". There's also a script called , which is some code to run a model and get it to predict the 20 next words after the string , and a config file for the LLM code called , which tells it the number of layers, attention heads, and so on. If you want to use it and see what it comes up with, you can download the model weights from one of my trains, and install the dependencies with (recommended) or by running it in a Python environment with the libraries listed in installed. 
You'll get something like this: Your output will probably vary (for this and the later examples), as you'd expect from sampled LLM output, but it should at least be reasonably coherent. So: let's get it on Hugging Face! Our goal of being able to run inference with Transformers' system relies on a couple of deeper levels of abstraction. The requires that the model be available for download -- complete with all of its code and weights -- using code like this: is the HF abstraction for models that generate text. If that flag is concerning you, it is indeed a bit scary-looking. But remember that our goal here is to share a model on HF that has its own code, and that means that anyone that downloads it will have to opt in to downloading and running the code -- the flag is how they do that opt-in. So it is, unfortunately, necessary. Now, that model will need a tokeniser in order to run. Perhaps not surprisingly, the HF system expects to be able to download that with similar code: With both of those working, appropriate code for our pretrained models, and a bit (well, to be fair, quite a lot of) configuration, we'll be all set. But that's quite a big jump. There is a more general class called ; it's much simpler, just wrapping a generic model that might be doing anything. If we support it, we'll still need to use all of that clunky inference code, but the model's code and weights will be on Hugging Face Hub, and can be downloaded and instantiated easily. So let's get that working first, just to work out the bugs and get the basic process down pat. Our goal is to be able to run this in a Python environment where we just have and installed: ...and then have a model that we can run inference on, just like the code in our repo , but without the hassle of having to download the weights ourselves. Definitely a QoL improvement, even if it's not the endgame. If you're following along with the git repo, the tag to check out for this section is . 
In this version, you'll see a new subdirectory to contain our HF wrapper code (which I've imaginatively called ); you'll see why we need that later. In there, I've added a symlink to the model code itself (also to be explained later), an empty file to make the directory a Python module, and two files with some Transformers code: Let's dig into what's going on in those two. The first thing to understand is that whole thing in the filenames. Transformers is designed to handle all kinds of different models -- for example, Meta's Llama models and Qwen's models have their own codebases. These widely-used public models have code that is already built in to the library, with "model types" like and or respectively -- but we don't have that advantage. Our code is not built in to the library. So we need a distinct name for our type of model, which will let the library know that it has its own code and it shouldn't try to rely on built-in stuff. I chose because my Hugging Face username is my initials, 1 , and this model is the implementation of the GPT-2 architecture I'm playing with. That feels like a solid pattern to me -- it's unlikely to clash with anything built in. But the format appears to be fairly free-form, so you can choose pretty much anything so long as you're consistent throughout your code, and so long as it doesn't clash with any of the built-ins. So, you need two files with those specific names: your-model-type , and your-model-type . Let's look at them now. They're really simple at this stage; here's the configuration one: Now, when Transformers is loading a model with , it's going to need to know how to configure it. At the very least, it will need to know what to pass into the . If you look at the code , it's taking a config dictionary with stuff like the number of layers, the number of attention heads, and what-have-you. That's going to be required to instantiate the model with the right setup so that it can load the weights that we're providing. 
There's other config stuff that will come there later, but that's all we have for now. It does this using the same pattern as the various methods we were looking at earlier: All we're doing here is defining what kind of thing that method will return when it's all set up properly. You can see that we're inheriting from a class -- this provides all of the infrastructure we're going to need to push things to HF. I don't think that the name of the config class technically matters, but it definitely seems like best practice to name it based on the model name -- so, we're using for our model. However, the is important -- it has to match the model type that we've chosen and used for our filenames. Apart from that, we're stashing away the config that we're provided on a field, and then calling our superclass , forwarding on any kwargs we got in our own . Now let's look at : Just as with the config, there's for us to inherit from 2 . We're defining the thing that will return when it's all set up properly. We tell transformers that this should be configured with the that we just defined using that class variable, but apart from that, we're basically just wrapping the that is defined in 3 . That is imported using a relative import using rather than : This is important -- it has to be that way, as we'll discover later. But for now: that's why we had to create the subdirectory and the symlink to -- a relative import in Python can only happen if you're not in the "root" module, so we would not have been able to do that kind of import if the files were at the top of our repo. Now, let's take a look at the . We're calling the superclass , as you'd expect, then we're creating an underlying wrapped . We're expecting a parameter, which has the underlying model's configuration stashed away in its field by its own , so we can pass that down to the wrapped model. 
Finally, we call this special function; that does some extra configuration, and prior to Transformers 5.0.0 you could get away without calling it, but now it's 100% necessary, as otherwise it will not initialise its internal fields relating to whether or not the model uses weight tying. Now let's take a look at how we actually use those to upload the model. That's back at the root of the repo, in the file . Before looking at the code, try running it: So, it takes a model config path -- that file we have to set the number of layers and so on -- and the path of a safetensors file containing the weights. It will then try to upload our HF-friendly wrapped version of the model -- code, weights and config -- to the Hub. Let's see how it works. We do some boilerplate imports, and then import our config and our model classes -- importantly, via the submodule. Don't worry, we're getting close to the explanation of why that is :-) A bit of argument-validating boilerplate and the loading of the model config file into a dictionary so that we can use it, and now we get to the meat of it: What this is doing is telling our to register itself so that it is a thing that will be returned by the call. This only applies locally for now, but by setting things up locally we're telling the library what it will need to push up to the hub later. Next: We're doing exactly the same for our model, saying that it should be returned from . We need to be explicit about which of the various model classes we want to register it for -- the config class can only be loaded from , whereas the model might be something we'd want to have returned from , or if it was a different kind of model, perhaps , or something else entirely. What we want to do here is expose the basic model using , so that's what we do. 
We're creating our config class, passing in that model configuration that we loaded from the file earlier, so that it will stash it on its field, then: ...we create our model wrapper using that config. We now have an instance of our custom model, but with uninitialised weights. So: ...we load in the weights that were specified on the command line. Note that we have to load them into the wrapped model. The file we have is specifically for the custom that we want to publish, not for the wrapped one. But that's easily done by using the field. Finally, the magic: This is where the Transformers library really shows its strength. It will push the model, which means it needs to push the weights that we loaded into its wrapped . Then it will look at the class that defines the model, and will push the file that has the source for that class. It will see that it also has a dependency on , and will push that and its source . It will also spot the setup we did with our two calls to the different methods above to register them for the and and push that too. And when it's pushing the source, it will try to push the source of any dependencies too. This is where we get the final explanation of why we had to put it in a submodule, and have a symlink to . The code doesn't want to upload loads of extra stuff -- for example, any libraries you're using. It wants to be sure that it's only uploading your model code. The logic it uses for deciding whether or not something is part of the uploadable set of files is "was it imported relatively from the or the file" -- that is, with a dot at the start of the module name, rather than . In order to do that kind of import, we needed to create a submodule. And in order to access our file we need a copy of it inside the submodule. I didn't want to have two actual copies of the file -- too easy to let them get out of sync -- so a symlink sorts that out. Hopefully that clears up any mystery about this slightly-strange file layout. 
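The layout described here (a subdirectory that is a Python package, a symlink to the real model file inside it, and a relative import from the wrapper file) can be tried out in isolation. The sketch below does not touch Transformers at all and does not use the real filenames from the repo: `core_model.py`, `hf_wrapper_demo` and `modeling_demo.py` are hypothetical names picked for this demo, and it assumes a platform where symlinks are available.

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

# All file and package names below are hypothetical, for illustration only.
root = Path(tempfile.mkdtemp())

# The "real" model code lives at the top of the repo.
(root / "core_model.py").write_text("LAYERS = 12\n")

# The wrapper code lives in a subdirectory that is a Python package.
pkg = root / "hf_wrapper_demo"
pkg.mkdir()
(pkg / "__init__.py").write_text("")  # empty file makes it a package

# Symlink rather than copy, so there is only one real file to maintain.
os.symlink("../core_model.py", pkg / "core_model.py")

# The wrapper pulls the model code in with a relative import (leading dot),
# which only works because this file sits inside a package.
(pkg / "modeling_demo.py").write_text("from .core_model import LAYERS\n")

sys.path.insert(0, str(root))
modeling = importlib.import_module("hf_wrapper_demo.modeling_demo")
print(modeling.LAYERS)
```

The same `from .core_model import ...` line in a file at the top of the repo would fail when run directly, because there would be no parent package for the dot to refer to.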
Let's give it a go and see what it creates! In order to upload a model to the HF Hub, you'll need an account, of course, so create one if you don't have one. Next, create an access token with write access -- the option is in the "Access Tokens" section of the "Settings". Then you need to authorize your local machine to access the hub using that token; if you're using , then you can just run: If you're not, you'll need to download and install the HF CLI and then run That will store stuff on your machine so that you don't need to log in again in the future -- if you're concerned about security, there's an you can call, and you can completely trash the session by deleting the associated token from the HF website. Now, let's run our upload script! You'll need to change the target HF model name at the end of the command to one with your username before the slash, of course. Once you've done that, take a look at the model on Hugging Face. You'll see a rather ugly default model card, but let's ignore that for now and take a look at the "Files and versions" tab. You should see the following files: Now, let's look into that . It will look like this: The bit is just showing the name of the class that was used in the call. This will become useful later when we get onto the pipeline code, but doesn't matter right now -- the next one is more important. The is essentially saying, if someone does on this model, then use the class from here, and likewise for should use . It's what that stuff we did in the upload script set up. The is just the parameters that we're threading down to our underlying custom class; nothing exciting there. The is, of course, the floating point type we're using for the model, and the is our unique name for this particular architecture. And the is the version of the library used to upload it, presumably used to determine compatibility when downloading models with earlier or later versions. 
So, it looks like there's enough information across those files on the hub to instantiate and use our model! Let's give that a go. The best way to check it out thoroughly is to create a completely fresh directory, away from our existing ones, and a fresh environment: and then to try to use the model: So we can see where Transformers has put the downloaded code, inside a submodule that appears to have a GUID-like name. Now let's try to run some inference on it: So there we go! We've gone from a situation where we would have to publish the code and the safetensors in some way and tell people how to combine them, to a neatly-packaged model that we can download, fully set up, with just one line: But that inference loop is still a pig; if you've been working with LLM code then it's not too bad -- a basic bit of autoregression with top-k and temperature -- but it's definitely holding us back. What next? One obvious issue with the code above is that we still have that dependency on . If we're going to run inference using the simple HF object, it's going to need to know how to encode the input and decode the outputs. And if you have your own tokeniser (which, if you have a truly custom model, you probably do) then you won't have the luxury of being able to just install it into the target runtime env -- you would still need to copy file around. Now, as I said at the start, I'm not going to go into this in as much detail, because my use case was really simple -- although I was using , the specific tokeniser I was using from that library was the standard GPT-2 one. Transformers has its own version of that installed. So here I'll explain how you do things for models that use a built-in Transformers tokeniser. After that I'll give some pointers that you might find useful if you're using something more custom. The good news if you're using a "standard" tokeniser that is already built into the Transformers library is that you can tell your model to use it. 
The downside is that you can't do it by using the trick that we did above -- that is, you can't just import it: ...and then add this below our previous calls to register the model and config as auto classes: That will essentially do nothing. However, tokenisers do have their own method, and the target that you specify can be your model. So, for my own models, I'm using this: That is, we get the tokeniser for the built-in GPT-2 implementation (specifically the "fast" one, written in Rust), set the padding token to the end-of-sequence one for tidiness (not sure why that's not the case by default), and then push it to the model. If you're following along with the code, you can check out the tag to see that. The code goes immediately after we've pushed the model itself to the hub. So, run the upload again: And now we can do a completely fresh env without tiktoken: In there, we can see that works: (Note that I had to use here -- that appears to be new in Transformers 5.0.0.) And do our inference test: It may not be much shorter than the code we had when we just had the , but it's an important step forward: we can now download and run inference on our custom model with none of the custom code -- neither the model itself nor the tokeniser -- on the machine where we're doing it. Everything is nicely packaged on the HF Hub. Now, what if you're using a tokeniser that's not already in Transformers? There are two possibilities here: As I said, I have not done either of these, but that's the direction I'd explore if I needed it. If you do either and want to share your experiences, then please do leave a comment below! And likewise, if and when I start writing things with custom tokenisers, I'll link to the details of how to upload them then. Anyway, we've got the tokeniser done to the level we need for this walkthrough, so let's do the QoL improvements so that we can run inference on the model using the nice HF abstraction. 
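For reference, the hand-rolled loop that the pipeline is meant to replace boils down to top-k sampling with temperature inside an autoregressive loop. Here's a dependency-free sketch of that idea in plain Python. The `toy_logits` function is a hypothetical stand-in for a real model, and a real loop would work on PyTorch tensors and tokeniser output rather than lists of integers.

```python
import math
import random

VOCAB_SIZE = 5


def toy_logits(ids):
    """Hypothetical stand-in for a model: favours (last token + 1)."""
    favoured = (ids[-1] + 1) % VOCAB_SIZE
    return [5.0 if i == favoured else 0.0 for i in range(VOCAB_SIZE)]


def sample_top_k(logits, k, temperature, rng):
    """Sample one token id from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature rescales the logits before the softmax.
    scaled = [logits[i] / temperature for i in top]
    biggest = max(scaled)
    weights = [math.exp(s - biggest) for s in scaled]  # unnormalised softmax
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, token_ids, max_new_tokens, k=3, temperature=1.0, seed=0):
    """Autoregressive loop: feed the whole sequence back in at every step."""
    rng = random.Random(seed)
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_top_k(next_logits(ids), k, temperature, rng))
    return ids


print(generate(toy_logits, [0], 4, k=1))  # k=1 is greedy: [0, 1, 2, 3, 4]
print(generate(toy_logits, [0], 4, k=3, temperature=2.0))
```

With `k=1` the sampling collapses to greedy decoding, which makes the loop easy to check; larger `k` and higher temperatures spread the probability over more candidates.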
Let's look at our target code for inference again: The version of the code that does this is in the repo on the tag , but I'll explain how it was put in place, with the logic behind each step. In order to run a text-generation pipeline, we're going to need to wrap our model in something that provides the interface for LLMs in the Hugging Face ecosystem: . So, our first step is to put the plumbing in place so that we can use the method on that class to download our wrapped model. IMO it's cleanest to have two separate models, one for "simple" inference that is just a regular model -- the we have right now -- and one with the richer interface that supports easy text generation. So we can start off by adding the basic structure to : We can then add code to register that to our script -- the last line in this snippet, just below the two that already exist. That feels like it should be enough, but for reasons I've not been able to pin down, it's not -- you also need to massage the "auto-map" in the object to make it all work properly. So after that code, after we've created the object, we need this: With that in place, we could just upload our model -- would work just fine. But the model that it would return would not be any different to the one we've been using so far. To get that to work, we need to update the model to say that it can generate text. That's actually pretty easy. Firstly, we need it to inherit from a mixin class provided by Transformers: Now, the semantics of the method on this class are a bit different to the ones we had previously; we were just returning the outputs of the last layer of the underlying model, the logits. For this kind of model, we need to put them in a wrapper -- the reasoning behind this will become clearer when we get on to training. So our forward pass needs to change to look like this: Finally, some changes to our config class. For text generation, Transformers needs to know how many hidden layers the model has [4].
In the case of the model I'm using to demonstrate, that's the parameter in the underlying configuration, so this can go inside the : Another change in the config that took me a while to puzzle out, and might catch you if you're in the same situation: Transformers, by default, assumes that the model caches previous inputs. So in an autoregressive loop starting with , the first run of the model will get the full input; let's say it returns . The next iteration of the loop, however, won't be passed the full new sequence , but rather just the token that was generated last time around, . So you'll get a series of predicted tokens where the first one might make sense but the rest degenerate into gibberish: All of the tokens generated after had just the previous token as their context. Luckily, you just need to specify that your model doesn't have a cache in the config class as well, after the call to the superclass : We're almost there! At this point, we actually have all of the code that we need for a working . But there's one final tweak. A model on the hub has a "default" model type, which is the one that we use when we do the original . You might remember that it appeared in the in that single-element list keyed on . Previously we had this in our upload script: That means that our default is the model. But when the pipeline creates a model for us, it will just use the default -- even for the text-generation task, it doesn't assume we want to use the . Luckily, that's a small change: we just upload our text-generation model instead of the basic one: With all of that in place, we can run the script, upload the model, and then in a fresh environment: Lovely! Now let's get it training. For this section, check out the tag. You'll see a new file, , which has the training loop from the notebook I linked to at the start of this post. It will train the model on this dataset, which is essentially a bunch of chatbot-style transcripts in the Llama 2 format.
Its goal is to help fine-tune a base model to become an instruction-following one, though of course the model I'm using here is too tiny for that to work well! It's still a useful way of checking that training works, though. To save time, it only does one training epoch, which should be enough to get the loss down a bit. If you run against one of my other models, you can see it working (you will need to tweak the batch size if you have less than 24 GiB of VRAM). You can see that it's at least trying to answer the question after training, even if its answer is completely wrong -- pretty much what you'd expect from the tiny model in question (163M parameters trained on about 3B tokens). In order to get it working with our custom models, we just need to return the loss as well as the logits from the method of our class: You can see that we're getting the targets for our predictions in , and an attention mask; we have to shift them ourselves (that is, if the inputs are , then the labels will be ), and also apply the attention mask manually, and then we can do the normal PyTorch cross-entropy calculation. This makes some kind of sense. The model on HF does need to package its own loss function somehow -- cross entropy is, of course, going to be the most likely option for a causal LM, but there's no guarantee. And while I think that personally I would have just had return logits and package up the loss calculation elsewhere so as not to muddy the interface, I can see the convenience of having it there. Anyway, having done that, we can upload the model one final time, and then use that training code to run it. We have a working training loop! Once again, it's replying, even if it has no idea what the answer is, and starts looping in a typical small-model fashion. And with that, we're done. We've gone from having a custom model that was hard for other people to discover and work with, to something that plays well with the Hugging Face ecosystem.
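Going back to the loss calculation for a moment: the shift-and-mask step is easy to get wrong, so here's a small plain-Python sketch of the idea. The logits at position t are scored against the label at position t + 1, and any position whose attention mask is zero is skipped. It deliberately avoids PyTorch so that it runs anywhere; the function names are hypothetical, and a real forward pass would do the same thing on tensors with PyTorch's cross-entropy.

```python
import math


def cross_entropy(logits_row, target):
    """Negative log-probability of `target` under softmax(logits_row)."""
    biggest = max(logits_row)
    log_sum = biggest + math.log(sum(math.exp(x - biggest) for x in logits_row))
    return log_sum - logits_row[target]


def shifted_masked_loss(logits, labels, attention_mask):
    """Mean next-token loss over the positions that are not masked out."""
    total, count = 0.0, 0
    # The logits at position t are a prediction for the token at t + 1.
    for t in range(len(labels) - 1):
        if attention_mask[t + 1] == 0:
            continue  # the target is padding, so it contributes nothing
        total += cross_entropy(logits[t], labels[t + 1])
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3, uniform logits, last position is padding.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
labels = [0, 1, 2, 0]
attention_mask = [1, 1, 1, 0]

loss = shifted_masked_loss(logits, labels, attention_mask)
print(loss, math.log(3))  # uniform logits give a loss of ln(3) per position
```

Note that the final row of logits is never scored, because there is no following token to compare it with.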
The final step is to write a decent model card so that people know what to do with it -- that, of course, depends very much on your model. I was uploading a bunch of very similar models in one go, so I wound up writing a Jinja2 template and using the class to upload it, but that's just simple plumbing code -- you can see it here if you're interested. As I said at the start, this isn't a full tutorial -- it's just the code I needed to upload my own models, so it doesn't cover tokenisers that aren't already baked into Transformers -- and there are probably other gaps too. But hopefully it's useful as-is. If you find gaps that matter for your model and work out how to solve them, then please do leave comments here -- if there are useful resources out there, either things I missed or things you've written, I'd be happy to link to them from this post. Thanks for reading! I'll be returning to my normal "LLM from scratch" series shortly...

It's a fun coincidence that my initials are so similar to the architecture. Someday I should do something with my domain ...  ↩
I'm not sure why the capitalisation of the "t" is different -- vs -- but it seems very deliberate in the Transformers codebase, at least as of version 4.57.6. Some kind of backward-compatibility cruft, I assume. 5.0.0 provides a alias as well, so it looks like they're making things consistent in the future.  ↩
You might reasonably suggest that we could inherit from rather than wrapping it. I've chosen to wrap it instead because I generally prefer composition to inheritance -- the code generally works out nicer, to my mind. I'd suggest starting this way and then refactoring to use inheritance if you prefer later on.  ↩
No idea why, but it does ¯\_(ツ)_/¯  ↩

-- a file telling git (which is used to manage the models on the hub) which file types should use the Large File Support plugin. Big binary files don't play nicely with git, so it uses LFS for them. We don't need to pay much more attention to that for our purposes.
-- that ugly model card. Updating that is useful, but out of scope for this post.
. We'll come back to that one in a moment.
-- a copy of the file we created locally with our class.
-- again, the same file as the local one, uploaded due to that clever dependency-finding stuff.
-- our weights. There should be an icon next to it to say that it's stored using the LFS system.
-- once more, a file that was just copied up from our local filesystem.

You're using the HF library. With that, you can save your tokeniser to a JSON file, then you could load that into a object, which provides a method to push it like I did with the one above.
You've got something completely custom. Just like there is a and a , I believe you can also add a that defines a subclass of , and then you can push that to the Hub just like we did our model wrapper class.

Working , , , and helpers.
A working text-generation .
Support for HF's abstraction for follow-on training and fine-tuning.

Giles's blog 3 months ago

Writing an LLM from scratch, part 31 -- the models are now on Hugging Face

As part of my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally, and four in the cloud. I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank. It makes sense to share these models somewhere, both so that other people can take a look if they like, and also to build the knowledge of how to do it so that if I produce something more interesting in the future, I'll know how to share that too. Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later). From the post where I trained the models locally, we have: Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs, four models (with two checkpoints from one of them): You can see how they compare on my evals at the bottom of this post. I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that. Here's the code I've been using as a smoke test after training a model to make sure it's not complete garbage. There's quite a lot of it. That's a lot of faffing about to generate a continuation of !
Disregarding the boilerplate with the argument parsing and validating, we have to load up the model, load up the tokeniser, encode our prompt, and then do a bunch of rather arcane stuff [1] to sample from the model to generate some tokens before we finally print out the result. With the HF Transformers library, there are extra levels of abstraction that allow you to do things much more simply: ...and I wanted what I published to work with that -- and, indeed, to be trainable further using the associated training library, like I did during my fine-tuning experiments. I managed to get that all to work, but it was quite a lot more effort than I expected. But at the end, both the pipeline code above, and the training code that you can see in this notebook worked fine. I'll write a follow-up blog post shortly about how to write the code to make a vanilla PyTorch model work within the Hugging Face ecosystem (probably not as part of this LLM from scratch series, as it's a bit of a tangent). But in the meantime, if you're using HF and want to take a look, have fun :-) I've put all of the models in a collection.

Of course, if you've been reading the posts in this series carefully I'm sure it's all as clear as day ;-)  ↩

-- the first model in that post, trained on a roughly Chinchilla-optimal number of tokens (20x the number of parameters) from FineWeb.
-- the second model, trained on the same number of tokens from FineWeb-Edu.
-- the third one, which is the model trained further on another roughly Chinchilla-optimal number of tokens from the same dataset.

-- trained on an 8x A100, 40 GiB/GPU machine.
-- trained on an 8x B200, 160 GiB/GPU machine.
-- trained on an 8x H100, 80 GiB/GPU machine. The best validation loss for this train was not in the last iteration, so this is the checkpoint with the best loss.
-- this one is the final checkpoint from the one above.
-- trained on an 8x A100, 80 GiB/GPU machine.
