Posts in Cloud (20 found)
Stratechery 1 week ago

Thin Is In

There was, in the early days of computing, no debate about thick clients versus thin: when a computer was the size of a room, there were no clients at all. You scheduled time or submitted jobs, and got back the results when it was your turn. A few years later, however, thin clients in the form of a monitor and keyboard arrived: a terminal connected to a mainframe contained no computer of its own. That's why it's called a "thin" client: it's just an interface, with all of the computing happening elsewhere (i.e. in another room).

By the 1980s, however, "thick" clients were the dominant form of computing, in the form of the PC. All of your I/O and compute were packaged together: you typed on a keyboard connected to a PC, which output to the monitor in front of you. A decade later, Sun Microsystems in particular tried to push the idea of a "network computer": a device that didn't really have a local operating system; you ran Java applications and Java applets from a browser, downloaded from a central server as they were used. Sun's pitch was that network computers would be much cheaper and easier to administer, but PCs were dropping in price so quickly that the value proposition rapidly disappeared, and Windows was so dominant that it was already the only platform that network administrators wanted to deal with. Thick clients won, and won decisively.

If you wanted to make a case for thin clients, you could argue that mobile devices are a hybrid; after all, the rise of mobile benefited from and drove the rise of the cloud: nearly every app on a phone connects to a server somewhere. Ultimately, however, mobile devices are themselves thick clients: they are very capable computers in their own right that certainly benefit from being connected to a server, but are useful without it. Critically, the server component is just data: the actual interface is entirely local. You can make the same argument about SaaS apps: on one hand, yes, they operate in the cloud and are usually accessed via a browser; on the other hand, the modern browser is basically an operating system in its own right, and the innovation that made SaaS apps possible was that interactive web apps could be downloaded and run locally. Granted, this isn't far off from Sun's vision (although the language ended up being JavaScript, not Java), but you still need a lot of local compute to make these apps work.

The thick-versus-thin debate felt, for many years, like a relic; that's how decisive the thick client victory was. One of the things that is fascinating about AI, however, is that the thin client concept is not just back, it's dominant. The clearest example of this is the interface that most people use to interact with AI: chat. There is no UI that matters other than a text field and a submit button; when you click that button the text is sent to a data center, where all of the computation happens, and an answer is sent back to you. The quality of the answer, and of the experience as a whole, is largely independent of the device you are using: it could be a browser on a PC, an app on a high-end smartphone, or the cheapest Android device you can find. The device could be a car, or glasses, or just an earpiece. The local capability that matters is not processing power, but rather connectivity. This interaction paradigm actually looks a lot like the interaction paradigm for mainframe computers: type text into a terminal, send it to the computer, and get a response back.
Unlike mainframe terminals, however, the user doesn't need to know a deterministic set of commands; you just say what you want in plain language and the computer understands. There is no pressure for local compute capability to drive a user interface that makes the computer easier to use, because a more complex user interface would artificially constrain the AI's capabilities. Nicolas Bustamante, in an X Article about the prospects for vertical software in an AI world, explained why this is threatening:

When the interface is a natural language conversation, years of muscle memory become worthless. The switching cost that justified $25K per seat per year dissolves. For many vertical software companies, the interface was most of the value. The underlying data was licensed, public, or semi-commoditized. What justified premium pricing was the workflow built on top of that data. That's over.

Bustamante's post is about much more than chat interfaces, but I think the user interface point is profound: it's less that AI user interfaces are different, and more that, for many use cases, they basically don't exist. This is even clearer when you consider the next big wave of AI: agents. The point of an agent is not to use the computer for you; it's to accomplish a specific task. Everything between the request and the result, at least in theory, should be invisible to the user. This is the concept of a thin client taken to the absolute extreme: it's not just that you don't need any local compute to get an answer from a chatbot; you don't need any local compute to accomplish real work. The AI on the server does it all.

Of course, most agentic workflows that work today tread a golden path, and stumble with more complex situations or edge cases. That, though, is changing rapidly, as models become better and the capabilities of the chips running them increase, particularly in terms of memory. When it comes to inference, memory isn't just important for holding the model weights, but also for retaining context about the task at hand. To date most of the memory that matters has been high-bandwidth memory attached to the GPUs, but future architectures will offload context to flash storage. At the same time, managing agents is best suited to CPUs, which themselves need large amounts of DRAM. In short, both the amount of compute we have and the capability of that compute still aren't good enough; once they cross that threshold, though, demand will only get that much stronger.

This combination of factors will only accentuate the dominance of the thin client paradigm: yes, you can run large language models locally, but you are limited in the size of the model, the size of the context window, and speed. Meanwhile, the superior models with superior context windows and faster speeds don't require a trip to the computer lab; just connect to the Internet from anywhere. Note that this reality applies even to incredible new local tools like OpenClaw: OpenClaw is an orchestration layer that runs locally, but the actual AI inference is, by default and in practice for most users, done by models in the cloud. To put it another way, to be competitive, local inference would need some combination of smaller-yet-sufficiently-capable models, a breakthrough in context management, and, critically, lots and lots of memory. It's that last one that might be the biggest problem of all.
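To make the memory point concrete, here is a rough back-of-the-envelope sketch; the model dimensions are illustrative assumptions (roughly a 70B-class open model with grouped-query attention), not figures from this article:

```python
# Back-of-the-envelope memory math for local LLM inference.
# All model dimensions are illustrative assumptions, not measurements.

params = 70e9                # assume a 70B-parameter model
bytes_per_weight = 2         # fp16/bf16 weights
weights_gb = params * bytes_per_weight / 1e9          # ~140 GB just for weights

layers = 80                  # transformer blocks (typical for this size)
kv_heads = 8                 # grouped-query attention
head_dim = 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, fp16
context_tokens = 128_000     # one long-running agentic session

kv_cache_gb = context_tokens * kv_bytes_per_token / 1e9    # ~42 GB of context

print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_cache_gb:.0f} GB")
# Either number alone exceeds the RAM in almost any consumer device,
# which is why the inference stays in the data center.
```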
From Bloomberg:

A growing procession of tech industry leaders including Elon Musk and Tim Cook are warning about a global crisis in the making: A shortage of memory chips is beginning to hammer profits, derail corporate plans and inflate price tags on everything from laptops and smartphones to automobiles and data centers — and the crunch is only going to get worse… Sony Group Corp. is now considering pushing back the debut of its next PlayStation console to 2028 or even 2029, according to people familiar with the company's thinking. That would be a major upset to a carefully orchestrated strategy to sustain user engagement between hardware generations. Close rival Nintendo Co., which contributed to the surplus demand in 2025 after its new Switch 2 console drove storage card purchases, is also contemplating raising the price of that device in 2026, people familiar with its plans said. Sony and Nintendo representatives didn't respond to requests for comment. A manager at a laptop maker said Samsung Electronics has recently begun reviewing its memory supply contracts every quarter or so, versus generally on an annual basis. Chinese smartphone makers including Xiaomi Corp., Oppo and Shenzhen Transsion Holdings Co. are trimming shipment targets for 2026, with Oppo cutting its forecast by as much as 20%, Chinese media outlet Jiemian reported. The companies did not respond to requests for comment.

The memory shortage has been looming for a while, and is arguably the place where consumers will truly feel the impact of AI; I wrote in January in the context of Nvidia's keynote at CES:

CES stands for "Consumer Electronics Show", and while Nvidia's gaming GPUs received some updates, they weren't a part of [Nvidia CEO Jensen] Huang's keynote, which was focused on that Vera Rubin AI system and self-driving cars. In other words, there wasn't really anything for the consumer, despite the location, because AI took center stage. This is fine as far as Nvidia goes: both the Vera Rubin announcement and its new Alpamayo self-driving system are big deals. It is, however, symbolic of the impact AI is having on technology broadly, and that impact is set to hit consumer electronics in a major way. Specifically, not only is all of the energy and investment in the tech sector going towards AI, but so is the supply chain. A big story over the last few months has been the dramatically escalating cost of memory as the major memory manufacturers shift their focus to high-bandwidth memory for AI chips in particular. What that means is that everything else is going to get a lot more expensive: memory is one of the most expensive components in nearly everything tech-related, and given the competitive and commoditized nature of the industry those costs will almost certainly be passed on to end users. This AI crowd-out dynamic arguably started with the hyperscalers, who diverted ever-increasing parts of their budgets to GPUs in place of CPU purchases, but now it's coming for everything from grid power to turbines and now to components, and it's only going to increase and become more impactful to end users. In other words, Nvidia may not have talked about consumer electronics at the Consumer Electronics Show, but they are having the biggest impact on the industry by far.

The downsides of this crowd-out effect are obvious; I pity anyone trying to build their own PC, for example, but soon their pain will be everyone's pain as prices inevitably rise on everything that needs RAM.
At the same time, I think the reported PlayStation delay is telling: apparently the PS5 is "good enough" for Sony to wait for more memory capacity to come online, and they're probably right! Thick clients — of which consoles like the PS5 are the ultimate example — have long since reached the point of diminishing returns when it comes to hardware improvements. I think you could make the same case for PCs and phones as well: what we already have is more than sufficient for almost any task we want to do. Moreover, the plateau in thick client capability is happening at the same time that the need for any capability at all is disappearing, thanks to these entirely new AI workflows that happen in the cloud. Yes, it sucks that AI is making memory scarce and personal computers of all kinds — from PCs to phones to consoles — more expensive; it's also making them less important than ever.

Of course thick clients could make a comeback, particularly since local inference is "free" (i.e. the user pays for their own electricity). As I noted above, however, I'm skeptical about local inference in the near term for performance reasons, and the memory crunch is going to make it uneconomical for the foreseeable future. And, by the time local inference is a viable alternative, path dependency downstream of these few years may have already led to many workflows moving to this new paradigm.

It will, to be clear, be a transition: UI isn't just about how to use a computer; it also, as Benedict Evans noted in a recent Interview, embeds critical aspects of how a business works. Open-ended text prompts in particular are a terrible replacement for a well-considered UI button that both prompts the right action and ensures the right thing happens. That's why the agent space will be the one to watch: what workflows will transition from UI to AI, and thus from a thick client architecture to a thin one? Current workflows are TBD; future workflows seem inevitable.

First, if compute isn't yet good enough, then workloads will flow to wherever compute is best, which is going to be in large data centers. Second, if larger models and more context make for better results, then workloads will flow to wherever there is the most memory available. Third, the expense of furnishing this level of compute means that it will be far more economical to share the cost of that compute amongst millions of users, guaranteeing high utilization and maximizing leverage on up-front costs.

0 views
neilzone 2 weeks ago

Moving away from Nextcloud

I have used Nextcloud for a long time. In fact, I have used Nextcloud from before it was Nextcloud - before the fork from Owncloud. And while I have only used a few of its features - sync, calendar, and contacts - I've been a very happy user for a long time. Until a year or so ago, at least.

I've had a worry, at the back of my mind, for a while, that Nextcloud is trying to do too much. A collaborative document editor. An email client. A voice/video conferencing tool, and so on. I'm sure that, in some contexts, this is amazing, and convenient. For me, as someone who typically prefers a piece of software to do one thing well, it left me a bit uneasy. But that was not, in itself, enough of a reason for me to switch.

A year or so ago, I had problem after problem keeping files in sync. I routinely got error messages about the database (or files; I don't quite remember) being locked. And, for me, that was the mainstay of Nextcloud, and indeed the reason why I started to use it in the first place. I tried all sorts of things, including setting up redis, and trying other memcache options, even though I am the only regular user. I could not get it to sync reliably. And I really did try, using the voluminous logs to try to determine what was going wrong. But I failed. And so I started considering other options. Did I actually need Nextcloud at all?

I've moved to Syncthing for syncing, and so far, that has been working fine. It is fast, and appears to be reliable. I should probably write about it at some point. Using Nextcloud to sync photos from my phone was not too bad, but from Sandra's iPhone, it did not work well. I have switched to Immich for photo sync / gallery, and I've been very happy with it.

For contacts and calendar sync - DAV - I am using Radicale. The main annoyance is that Sandra cannot invite me (or anyone) to appointments using the iOS or macOS calendar. For me, I've just given Sandra write access to my calendar, so that she can add events directly, but it is far from ideal. I've tried using Radicale's server-side email functionality, and that is not suitable for my needs, as it sends out far too many emails. But, for now, Radicale is tolerable, even if I might try to find another option at some point.

And that just leaves the directories which I share via Nextcloud and mount in my file browser. Stuff that I don't need on my computer, but still want to access. For that, I'm going back to Samba. It works. And so, once I've finalised this and tested it and given it some time to bed in, I will turn off the Nextcloud server.

0 views
Stratechery 2 weeks ago

Amazon Earnings, CapEx Concerns, Commodity AI

Amazon's massive CapEx increase makes me much more nervous than Google's, but it is understandable.

0 views
Stratechery 2 weeks ago

Google Earnings, Google Cloud Crushes, Search Advertising and LLMs

Google announced a massive increase in CapEx that blew away expectations; the company's earnings results explain why the increase is justified.

0 views
matklad 3 weeks ago

CI In a Box

I wrote a thin wrapper around ssh for running commands on remote machines. I want a box-shaped interface for CI: that is, the controlling CI machine runs a user-supplied script, whose status code will be the ultimate result of a CI run. The script doesn't run the project's tests directly. Instead, it shells out to a proxy binary that forwards the command to a runner box with whichever OS, CPU, and other environment is required. The hard problems are in that last part.

CI discourse amuses me — everyone complains about bad YAML, and it is bad, but most of the YAML (and the associated reproducibility and debugging problems) is avoidable. Pick an appropriate position on a dial that includes: writing a bash script, writing a script in the language you already use, using a small build system, using a medium-sized one, or using a large one. What you can't get just by writing a smidgen of text is the heterogeneous fleet of runners. And you need a heterogeneous fleet of runners if some of the software you are building is cross-platform: one of the major OSes is not UNIX; one has licensing & hardware constraints that make per-minute billed VMs tricky (but not impossible, as GitHub Actions does that); and all of them are moving targets, requiring someone to do the OS upgrade work, which might involve pointing and clicking.

If you go that way, be mindful that the SSH wire protocol only takes a single string as the command, with the expectation that it will be passed to a shell by the remote end. In other words, while SSH supports argument-list syntax, it just blindly intersperses all the arguments with spaces. Amusing to think that our entire cloud infrastructure is built on top of shell injection! This, and the need to ensure no processes are left behind unintentionally after executing a remote command, means that you can't "just" use SSH here if you are building something solid.
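That single-string behavior is easy to demonstrate. A sketch in Python (not the post's own tooling) of why naive joining is injection-prone, and how per-argument quoting avoids it:

```python
import shlex

# ssh sends one command string; the remote shell re-parses it.
# Naively joining arguments with spaces is exactly the shell
# injection the post warns about:
args = ["echo", "hello; rm -rf /tmp/scratch"]
naive = " ".join(args)
print(naive)   # echo hello; rm -rf /tmp/scratch  <- second arg became a command

# Quoting each argument before joining preserves argument boundaries
# through the remote shell's re-parse:
safe = " ".join(shlex.quote(a) for a in args)
print(safe)    # echo 'hello; rm -rf /tmp/scratch'

# subprocess.run(["ssh", "host", safe])  # one safely-built string over the wire
```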

0 views
Giles's blog 3 weeks ago

Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

I'm rounding out my series of posts on Sebastian Raschka's book "Build a Large Language Model (from Scratch)" by seeing how good a base model I can train from scratch on my own hardware. I started by training one in two days on my RTX 3090, and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better. For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud. That led to some refinements in the prompt-following test I was using, and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub.

Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare against. I listed a number of possible interventions at the end of the RTX 3090 post; I'm not going to do them all, but for completeness, here's the full list:

- The amount of training data. I'm not going to dig into this one; it looks like it does help, but the returns diminish rapidly, so I think that in order to get any serious improvement we'd need to train for much more than two days locally. In the one "extended training" test I did, I managed to get the loss down from 4.167 to 4.135, which was... less-than-inspiring.
- The number of epochs. I'm going to stick to single-epoch training -- that is, I'll train on a single pass through an amount of non-repeating data chosen to take 48 hours to handle on my local machine.
- The bias on the W_q, W_k and W_v matrices. This one definitely sounds worth looking into -- easy, as it's just a change to a config flag, and makes the model more like the original GPT-2. I'll give that a go.
- Dropout. I've read that for single-epoch training, dropout doesn't help (which doesn't quite work with my mental model of what it's for, but does sound plausible). Worth a look!
- The learning rate, and weight decay. The values I've used for these are basically copypasta from the book. I think I should learn to understand these and try to optimise them a bit.
- The precision. I'm using AMP, which means that some calculations are done in 16-bit rather than 32-bit, and I'm letting PyTorch use the GPU's tensor cores, which use TF32, a kind of "32-bit float lite" (see the post on the local train for details). Those both (at least potentially) reduce the precision of the train below what you'd get if you trained at full 32-bit precision. Would reverting that be worth the longer train time? I should probably at least poke at that.
- The batch size. I've already, in effect, tried playing with that. The different cloud machines I played with had different amounts of per-GPU VRAM, so supported different per-GPU micro-batch sizes. So I wound up trying batch sizes from 512 (the same as the original GPT-2 was trained with) down to 104 in the cloud, plus my local trains with a batch size of 6. I did a rough-and-ready calculation at the end of the cloud training post where I estimated that the ideal batch size might be something like 97. So, probably not worth much more investigation.
- Exploding gradients. In one of my local trains, and in three out of the four cloud trains, I had sudden spikes in both training and validation loss. It generally took quite a bit of training -- maybe 10-15% of training time -- to get back on track after some of these, so we had what could be seen as wasted time in the training runs. Exploding gradients can be fixed by gradient clipping, which is relatively easy to do. Definitely worth investigating!

I'm going to work through each of those apart from the first two and the batch size (and will retrospectively add links to the posts when I do), trying a train with just that intervention and nothing else, on a cloud machine. Once that's done, I'll bake all of the things that helped into the training loop, and do another local train -- with gradient accumulation to make the batch size match the cloud instances'. The cloud machine size I decided to use for this was the one that came out the most cost-effective (and, due to its VRAM size, had the best loss) in my earlier cloud training test: an 8x A100 machine with 40 GiB VRAM per GPU.

But first, we need a baseline model. I've already done a train on an 8x A100 40 GiB machine -- why do we need a new one? In my cloud training post, I came to the conclusion that the cost in terms of training time of running a periodic validation loop as we trained was not really worth it, at least in this case. Two of the biggest reasons to have validation during training are to work out when you're overfitting on a multi-epoch train, and to see how your model handles data it has not been trained on. In a single-epoch train like this, you're not going to overfit -- every sample it sees will be new to it -- and the training loss itself is over samples it's not been trained on at the time it's calculated, for the same reason (though of course it will be trained on them as soon as we do the backward pass starting with that loss). It's not perfect, of course -- a big benefit of the validation loss is that it's over the same held-back dataset on every run -- and there are arguments for keeping it (albeit, perhaps, doing full runs less frequently than I was). But for these experiments, I decided I'd simply drop it.

I also wanted to introduce a consistent random seed at the start of the training loop. I didn't have that in my cloud trains, and of course if we want solid results on whether each intervention really does improve matters, we need one so that we can be sure all runs start from the same point.
Both of those meant that I couldn't use the earlier train on the 8x A100 40 GiB machine as a baseline; I'd need a new one, introducing those two changes: no validation during the training run (using training loss as a proxy), and setting a random seed at the start for reproducibility.

So: what was the baseline train going to look like? The first step was to strip out the validation code and replace it with code that just took periodic checkpoints, keeping track of which one had the best average training loss over the period since the previous one. Next, I decided to plot on the training chart that is generated during the run not just the training loss, but also an indicator of the maximum and minimum training loss over all of the steps in that period. Then I added the random seed, which I set to 42. A couple of bugfixes, and we were left with this version of the code.

One thing to highlight: in the file that specifies the various training parameters, I set the per-GPU micro-batch size to 12 rather than the 13 I'd used on this size of machine earlier. Two reasons for that:

Firstly, I'm going to want to do a local run with gradient accumulation later, using all of the helpful interventions. With gradient accumulation, you do a number of steps with batches that you can fit into your memory, but you don't update the model's parameters each time. After a number of those, you do one big update based on the accumulated gradients -- hence the name. The full batch is all of those smaller batches taken together. If I want that to closely match the cloud train, I'll want the accumulated batches to be the same size as each global batch in the cloud. Now, on my local machine, I can fit a batch of 6 into VRAM. So that means that the full batch needs to be divisible by 6 [1]. On the cloud train, with a micro-batch of 13 and 8 GPUs, we had an overall batch size of 104 in the previous train. 104 is not divisible by 6: no joy. But with a micro-batch size of 12, we have an overall batch of 12 × 8 = 96, which means we'd be able to do gradient accumulation and do a parameter update every 96 ÷ 6 = 16 steps.

Secondly, while my estimate of the ideal overall batch size was based on a rather arbitrary bit of curve-fitting, it did say that 97 was the ideal size. So it could be interesting to see whether a batch of 96 really did help!

So, having coded that up and set up the configuration, it was time to run it. In the training chart it produced, note the loss spikes at around global steps 4,200, 13,000 and 23,000. Those are important; I'll explain why later.

The training run took about 3h24m, even less than we expected from the previous cloud experiments' estimates of how long it would take excluding validation, and cost about US$35. Here is the model on Hugging Face Hub.

Let's see how it looks. For these intervention posts, I won't run the instruction-following tests, as they can only be run against a batch of models in one go to get results that are consistent with each other. But the smoke test -- how the model completes a familiar test sequence -- is worthwhile, and the result looks good: reasonably coherent. Now we can find the loss on our held-back test set: it's a bit worse than the 3.674 we got for the original cloud train. Either the calculations of the optimal batch size I did were not quite right (entirely likely, they were very ad-hoc) or the model weights we started with, given the random seed we're using, just happened to lead us in a slightly worse direction (also plausible).
Either way, it's in line with what we expected, and is still better than the test loss of 3.725 that we got with the second-best machine in the cloud comparison post (the 8x H100 80 GiB with a global batch size of 216).

So: we have a solid baseline model. Before we wrap up, let's consider those spikes in the loss that I called out in the training chart. Random spikes in the loss are a Bad Thing, right? Certainly they're a bad thing for a train in general, especially if you don't know for sure what's causing them. But my working assumption has been that they're caused by exploding gradients -- for some specific sample in the dataset, the gradients have gone up to some insanely high value, and we've had a bad update to our parameters as a result. It hasn't completely knocked the model back to its starting point, but it does take some time to recover, so we lose the benefit of some of our training. If that is the case -- and it's not just something like a batch happening to contain stuff that's wildly different from the rest of the training data, or something weird in the optimiser -- then gradient clipping is the solution.

I wanted to see if gradient clipping would help the model quality in general, but of course if we hadn't had any loss spikes in this baseline train it would have been hard to tell. So I was very glad to see them here: if there had been none, I would either have had to do a gradient clipping experiment with no real expectation of it helping, or do another baseline train with a different random seed in the hope that that caused some spikes, which would have cost another US$35. All in all, it was good to see them there, as it sets us up well for that experiment.

So, we've trained a baseline model that we can make changes to -- the interventions I listed at the start -- and get a pretty reliable understanding of whether or not they help the quality of the final model. With that in place, we're in a good position to start running those intervention tests! Given the loss spike situation in that chart, I think that a solid first one to go for -- even though it was the last in the list at the top of this post -- is gradient clipping. Where are those loss spikes coming from, and if it's exploding gradients, what happens if we limit the damage they do with gradient clipping? Stay tuned! I've already done the training run for that (while I wrote this one up), so I should be able to post about it tomorrow.
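As a rough sketch of the mechanics discussed in this post (the fixed seed, gradient accumulation at micro-batch 6 with a parameter update every 16 steps, and the gradient clipping to be tested next), here's what they might look like in a generic PyTorch loop; the toy model and synthetic data are illustrative stand-ins, not this project's actual code:

```python
import torch
from torch import nn

torch.manual_seed(42)                      # fixed seed: every run starts from
                                           # the same weights and data order

# Stand-ins for the real model and data (illustrative only).
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

MICRO_BATCH = 6                            # what fits in local VRAM
ACCUM_STEPS = 16                           # 6 x 16 = 96, matching the cloud batch

optimizer.zero_grad()
for step in range(160):                    # pretend mini training run
    x = torch.randn(MICRO_BATCH, 64)       # synthetic batch
    y = torch.randint(0, 10, (MICRO_BATCH,))
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / ACCUM_STEPS).backward()        # accumulate scaled gradients

    if (step + 1) % ACCUM_STEPS == 0:
        # Gradient clipping: the intervention to test against loss spikes.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()                   # one "big batch" parameter update
        optimizer.zero_grad()
```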
[1] Well, you could potentially do something with batches of different sizes, but that would be fiddly.

0 views
DHH 3 weeks ago

Cloud gaming is kinda amazing

I fully understand the nostalgia for real ownership of physical-media games. I grew up on cassette tapes (C64 + Amstrad 464!), floppy disks (C64 5-1/4" then Amiga 3-1/2"), cartridges, and CDs. I occasionally envy the retro gamers on YouTube with an entire wall full of such physical media. But do you know what I like more than collecting? Playing! Anywhere. Anything. Anytime.

We went through the same coping phases with movies and music. Yes, vinyl had a resurgence, but it's still a tiny sliver of hours listened. Same too with 4K Blu-rays. Almost everyone just listens to Spotify or watches Netflix these days. It's simply cheaper, faster, and, thus, better. Not "better" in some abstract philosophical way (ownership vs rent) or even in a concrete technical way (bit rates), but in a practical way. Paying $20/month for unlimited music and the same again for a broad selection of shows and movies is clearly a deal most consumers are happy to make.

So why not video games? Well, because it just wasn't good enough! Netflix tried casual gaming, but I didn't hear much about it after the announcement. Google Stadia appears to have been just a few years ahead of reality (eerie how often that happens for big G, like with both AI and AR!), as the service has already been shut down. NVIDIA, though, kept working, and its GeForce NOW service is actually, finally, kinda amazing!

I had tried it back in the late 2010s, and just didn't see anything worth using back then. Maybe my internet was too slow; maybe the service just wasn't good enough yet. But then I tried it again a few days ago, just after NVIDIA shipped the native GFN client for Linux, and holy smokes!! You can legitimately play Fortnite in 2880x1800 at 120 fps through a remote 4080, and it looks incredible. Yes, there's a little input lag, but it's shockingly, surprisingly playable on a good internet connection. And that's with the hardest possible genre: competitive shooters! If you play racing games like Forza Horizon or story-mode games like Warhammer 40K: Space Marine 2, you can barely tell!

This is obviously a great option for anyone with a modest computer that can't run the latest triple-A titles, but also for Linux gamers who don't have access to the anti-cheat software required for Fortnite and a few other games.

And, like Spotify and Netflix, it's pretty competitively priced: $20/month for access to that 4080 tier. You'd quickly spend $2,000+ on a gaming rig with a 4080, so this isn't a half bad deal: that's a payback period of 100 months, and by then you'd probably want a 6080 anyway. Funny how NVIDIA is better at offering the promise of cheap cloud costs than the likes of AWS!

Anyway, I've been very impressed with NVIDIA GeForce NOW. We're going to bake the Linux installer straight into the next version of Omarchy, so you can just go to Install > Gaming > NVIDIA GeForce NOW to get going (just like we have such options for Steam and Minecraft).

But of course seeing Fortnite running in full graphics on that remote 4080 made me hungry for even more. I've been playing Fortnite every week for the last five years or so with the kids, but the majority of my gameplay has actually been on tablet. A high-end tablet, like an iPad M5, can play the game with good-for-mobile graphics at 120 Hz. It's smooth, it's easy, and the kids and I can lounge on the couch and play together. Good Family Fun! Not peak visual fidelity, though.
So after the NVIDIA GeForce NOW experience, I found a way to use the same amazing game-streaming technology at home, through a local-server solution called Apollo and a client called Moonlight. This allowed me to turn my racing-sim PC that's stuck downstairs into a cloud-like remote gaming service that I can access anywhere on the local network, so I can borrow its 4090 to play 120-fps, ultra-settings Fortnite with zero perceivable input lag on any computer in the house.

The NVIDIA cloud streaming is very impressive, but the local-server version of the same is mind-blowing. I'm mostly using the Asus G14 laptop as a client, so Fortnite looks incredible with those ultra, high-resolution settings on its OLED, but unlike when you use that laptop's built-in graphics card, the machine stays perfectly cool and silent, pulling a meager 18 watts. And the graphics are of course a lot nicer.

The Moonlight client is available for virtually every platform: Mac, iOS, Android, and of course Linux. That means no need to dual-boot to enjoy the best games at the highest fidelity. No need for a honking big PC on my primary desk. I did not know this was an option!!

Whether you give NVIDIA's cloud gaming setup a try or repurpose a local gaming PC for the same, you're in for a real treat of what's possible with streaming Fortnite on ultra settings at 120 fps on Linux (or even Mac!). GG, NVIDIA!

0 views
Lalit Maganti 1 month ago

The surprising attention on sprites, exe.dev, and shellbox

Over the last few weeks, three new products have announced themselves on Hacker News to great success, each making the frontpage:

- Sprites (fly.io), with 508 votes, hit #7
- exe.dev, with 457 votes, hit #2
- shellbox.dev, with 316 votes, hit #4

All three have a very simple pitch: they will give you full access to Linux virtual machines to act as a sandboxed developer environment in the cloud. At first glance, the attention these have gotten is very head-scratching. The idea of a Linux VPS has been around for more than 20 years at this point, and VPS providers like DigitalOcean and Hetzner are widely known and used in the industry. From a technological standpoint, there's very little revolutionary here. Is it price then? Well, no: the hardware specs are pretty awful for what you pay. For example, exe.dev gives you 2 CPUs and 8GB RAM shared across your whole account for $20/month. For comparison, at Hetzner for roughly that price, you can get a single VPS with 16 CPUs and 32GB RAM…
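To put rough numbers on that comparison (list prices as quoted above; the Hetzner figure is the post's "roughly that price" approximation):

```python
# Per-unit cost arithmetic for the two offerings described above.
offers = {
    "exe.dev": {"usd_month": 20, "cpus": 2,  "ram_gb": 8},
    "Hetzner": {"usd_month": 20, "cpus": 16, "ram_gb": 32},
}

for name, o in offers.items():
    print(
        f"{name}: ${o['usd_month'] / o['cpus']:.2f}/CPU, "
        f"${o['usd_month'] / o['ram_gb']:.2f}/GB RAM per month"
    )
# exe.dev: $10.00/CPU, $2.50/GB RAM per month
# Hetzner: $1.25/CPU, $0.62/GB RAM per month  (roughly 8x the CPU per dollar)
```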

0 views
matklad 1 month ago

Vibecoding #2

I feel like I got substantial value out of Claude today, and want to document it. I am at the tail end of AI adoption, so I don't expect to say anything particularly useful or novel. However, I am constantly complaining about the lack of boring AI posts, so it's only proper if I write one.

At TigerBeetle, we are big on deterministic simulation testing. We even use it to track performance, to some degree. Still, it is crucial to verify performance numbers on a real cluster in its natural high-altitude habitat. To do that, you need to procure six machines in a cloud, get your custom version of the binary onto them, connect the cluster's replicas together, and hit them with load. It feels like, a quarter of a century into the third millennium, "run stuff on six machines" should be a problem just a notch harder than opening a terminal, but I personally don't know how to solve it without wasting a day. So, I spent a day vibecoding my own square wheel.

The general shape of the problem is that I want to spin up a fleet of ephemeral machines with given specs on demand and run ad-hoc commands in a SIMD fashion on them. I don't want to manually type slightly different commands into a six-way terminal split, but I also do want to be able to ssh into a specific box and poke around it. My idea for the solution comes from these three sources: rsyscall, Peter Bourgon's post on remote Mac-to-Linux development, and the dax shell-scripting library (links at the end).

The big idea of rsyscall is that you can program a distributed system in direct style. When programming locally, you do things by issuing syscalls. This API works for doing things on remote machines too, if you specify which machine you want to run the syscall on. Direct manipulation is the most natural API, and it pays to extend it over the network boundary.

Peter's post is an application of a similar idea to a narrow, mundane task: developing on Mac and testing on Linux. Peter suggests two scripts. The first synchronizes a local project with a remote one: if you run it inside a folder, that folder materializes on the remote machine (rsync does the heavy lifting, with the wrapper script implementing the behaviors). It is typically followed by the second, which runs a command on the remote machine in the matching directory, forwarding output back to you. So, when I want to test local changes on my Linux box, the killer feature is that shell-completion works: I first type the command I want to run, taking advantage of the fact that local and remote commands are the same, paths and all, then prepend the run wrapper (in reality, I have an alias that combines sync & run).

The big thing here is not the commands per se, but the shift in the mental model. In a traditional ssh & vim setup, you have to juggle two machines with separate state, the local one and the remote one. With this setup, the state is the same across the machines; you only choose whether you want to run commands here or there. With just two machines, the difference feels academic. But if you want to run your tests across six machines, the ssh approach fails — you don't want to re-vim your changes to source files six times; you really do want to separate the place where the code is edited from the place(s) where the code is run. This is a general pattern — if you are not sure about a particular aspect of your design, try increasing the cardinality of the core abstraction from 1 to 2.

The third component, the dax library, is pretty mundane — just a JavaScript library for shell scripting. The notable aspects there are: JavaScript's template literals, which allow implementing command interpolation in a safe-by-construction way.
When processing a command, a string is never materialized; it's arrays all the way to the syscall (more on the topic elsewhere). Then there's JavaScript's async/await, which makes managing concurrent processes (local or remote) natural. Additionally, deno specifically valiantly strives to impose process-level structured concurrency, ensuring that no processes spawned by the script outlive the script itself, unless explicitly marked — a sour spot of UNIX.

Combining the three ideas, I now have a deno script that provides a multiplexed interface for running ad-hoc code on ad-hoc clusters. I like this! I haven't used it in anger yet, but this is something I wanted for a long time, and now I have it.

The problem with implementing the above is that I have zero practical experience with the modern cloud. I only created my AWS account today, and just looking at the console interface ignited the urge to re-read The Castle. Not my cup of pu-erh. But I had a hypothesis that AI should be good at wrangling baroque cloud APIs, and it mostly held.

I started with a couple of paragraphs of rough, super-high-level description of what I wanted to get. Not a specification at all, just a general gesture towards unknown unknowns. Then I asked ChatGPT to expand those two paragraphs into a more or less complete spec to hand down to an agent for implementation. This phase surfaced a bunch of unknowns for me. For example, I wasn't thinking at all about how to identify machines; ChatGPT suggested using random hex numbers, and I realized that I need a 0, 1, 2 naming scheme to concisely specify batches of machines. While thinking about this, I realized that a sequential numbering scheme also has the advantage that I can't have two concurrent clusters running, which is a desirable property for my use-case: if I forgot to shut down a machine, I'd rather get an error on trying to re-create a machine with the same name than silently avoid the clash. Similarly, it turns out the questions of permissions and network access rules are something to think about, as well as what region and what image I need.

With the spec document in hand, I turned to Claude Code for the actual implementation work. The first step was to further refine the spec, asking Claude if anything was unclear. There were a couple of interesting clarifications there. First, the original ChatGPT spec didn't get what I meant with my "current directory mapping" idea: that I want to materialize a local directory as a matching remote one, even if the machines are different. ChatGPT generated an incorrect description and an incorrect example. I manually corrected the example, but wasn't able to write a concise and correct description. Claude fixed that, working from the example. I feel like I need to internalize this more — for the current crop of AI, examples seem to be far more valuable than rules. Second, the spec included my desire to auto-shutdown machines once I no longer use them, just to make sure I don't forget to turn the lights off when leaving the room. Claude grilled me on what precisely I wanted there, and I asked it to DWIM the thing.

The spec ended up being 6KiB of English prose. The final implementation was 14KiB of TypeScript. I wasn't keeping the spec and the implementation perfectly in sync, but I think they ended up pretty close in the end. Which means that prose specifications are somewhat more compact than code, but not much more compact. My next step was to try to just one-shot this.
Ok, this is embarrassing, and I usually avoid swearing in this blog, but I just typoed that as "one-shit", and, well, that is one flavorful description I won't be able to improve upon. The result was just not good (more on why later), so I almost immediately decided to throw it away and start a more incremental approach.

In my previous vibe-post, I noticed that LLMs are good at closing the loop. A variation here is that LLMs are good at producing results, but not necessarily good code. I am pretty sure that, if I had let the agent iterate on the initial script and actually run it against AWS, I would have gotten something working. I didn't want to go that way, for three reasons:

- Spawning VMs takes time, and that significantly reduces the throughput of agentic iteration.
- No way I let the agent run with a real AWS account, given that AWS doesn't have a fool-proof way to cap costs.
- I am fairly confident that this script will be a part of my workflow for at least several years, so I care more about long-term code maintenance than the immediate result.

And, as I said, the code didn't feel good, for these specific reasons:

- It wasn't the code that I would have written; it lacked my character, which made it hard for me to understand it at a glance.
- The code lacked any character whatsoever. It could have worked, and it wasn't "naively bad", like the first code you write when you are learning programming, but there wasn't anything good there.
- I never know what the code should be up-front. I don't design solutions, I discover them in the process of refactoring. Some of my best work was spending a quiet weekend rewriting large subsystems implemented before me, because, with an implementation at hand, it was possible for me to see the actual, beautiful core of what needed to be done. With a slop-dump, I just don't get to see what could be wrong. In particular, while you are working the code (as in "wrought iron"), you often go back to requirements and change them. Remember the ambiguity of my request to auto-shutdown idle machines? Claude tried to DWIM and created some horrific mess of bash scripts, timestamp files, PAM policy and systemd units. But the right answer there was "let's maybe not have that feature?" (in contrast, simply shutting the machine down after 8 hours is a one-liner).

The incremental approach worked much better; Claude is good at filling in the blanks. The very first thing I did was manually type in the skeleton myself. Then I asked Claude to complete the function, and I was happy with the result. Note the Show, Don't Tell here: I am not asking Claude to avoid throwing an exception and fail fast instead; I just give it the skeleton, and it code-completes the rest. I can't say that the code inside is top-notch. I'd probably have written something more spartan. But the important part is that, at this level, I don't care. The abstraction for parsing CLI arguments feels right to me, and the details I can always fix later.

This is how the overall vibe-coding session transpired — I was providing structure, Claude was painting by the numbers. In particular, with that CLI parsing structure in place, Claude had little problem adding new subcommands and new arguments in a satisfactory way. The only snag was that, when I asked it to add an optional path argument, it went with null, while I strongly prefer undefined. Obviously, it's better to pick your null in JavaScript and stick with it, and the fact that undefined is unavoidable predetermines the winner. Given that the argument was added as an incremental small change, course-correcting was trivial. The null vs undefined issue perhaps illustrates my complaint about the code lacking character: null is the default non-choice; undefined is an insight, which I personally learned from the VS Code LSP implementation.

The hand-written skeleton/vibe-coded guts approach worked not only for the CLI. I wrote the skeleton and then asked Claude to write the body of a particular function according to the SPEC.md. Unlike with the CLI, Claude wasn't able to follow this pattern itself. With one example it's not obvious, but the overall structure is that one function is the AWS-level operation on a single box, and another is the CLI-level control flow that deals with looping and parallelism. When I asked Claude to implement a command without myself doing that split, Claude failed to notice it and needed a course correction. However, Claude was massively successful with the actual AWS logic; it would have taken me hours to acquire the specific, non-reusable knowledge needed to write it. I want to be careful — I can't vouch for the correctness and especially the completeness of that code. However, given that the nature of the problem is such that I can just run the code and see the result, I am fine with it. If I were writing this myself, trial-and-error would totally be my approach as well.

Then there's synthesis — with several instance commands implemented, I noticed that many started by querying AWS to resolve a symbolic machine name, like "1", to the AWS name/IP.
At that point I realized that resolving symbolic names is a fundamental part of the problem, and that it should only happen once, which resulted in a refactored shape of the code. Claude was ok with extracting the logic, but messed up the overall code layout, so the final code motions were on me: "context" arguments go first, not last, and a common prefix is more valuable than a common suffix because of visual alignment. The original "one-shotted" implementation also didn't do up-front querying. This is an example of the shape of a problem I only discover when working with the code closely.

Of course, the script didn't work perfectly the first time, and we needed quite a few iterations on the real machines, both to fix coding bugs and to fill gaps in the spec. That was an interesting experience of speed-running rookie mistakes. Claude made naive bugs, but was also good at fixing them. For example, when I first tried to connect right after creating an instance, I got an error. Pasting it into Claude immediately showed the problem: the code was waiting for the instance to be logically created, not for the OS to be booted. It makes sense that these two states exist, and the difference is clear (and it's also clear that OS booted != SSH daemon started). Claude's value here is in providing specific names for the concepts I already know to exist.

Another fun one was about the disk. I noticed that, while the instance had an SSD, it wasn't actually used. I asked Claude to mount it as home, but that didn't work. Claude immediately asked me to pull up a particular log, and that log immediately showed the problem. This is remarkable! 50% of my typical Linux debugging day is wasted not knowing that a useful log exists, and the other 50% is spent searching for the log I know should exist somewhere. After the fix, I lost the ability to SSH. Pasting the error immediately gave the answer — by mounting over the home directory, we were overwriting the ssh keys configured prior. There were a couple more iterations like that. Rookie mistakes were made, but they were debugged and fixed much faster than my personal knowledge allows (and again, I feel that this is trivia knowledge, rather than deep reusable knowledge, so I am happy to delegate it!).

It worked satisfactorily in the end, and, what's more, I am happy to maintain the code, at least to the extent that I personally need it. It's kinda hard to measure the productivity boost here, but, given just the sheer number of CLI flags required to make this work, I am pretty confident that time was saved, even factoring in the writing of the present article!

I've recently read The Art of Doing Science and Engineering by Hamming (of Hamming distance and Hamming code fame), and one story stuck with me:

A psychologist friend at Bell Telephone Laboratories once built a machine with about 12 switches and a red and a green light. You set the switches, pushed a button, and either you got a red or a green light. After the first person tried it 20 times they wrote a theory of how to make the green light come on. The theory was given to the next victim and they had their 20 tries and wrote their theory, and so on endlessly. The stated purpose of the test was to study how theories evolved. But my friend, being the kind of person he was, had connected the lights to a random source! One day he observed to me that no person in all the tests (and they were all high-class Bell Telephone Laboratories scientists) ever said there was no message.
I promptly observed to him that not one of them was either a statistician or an information theorist, the two classes of people who are intimately familiar with randomness. A check revealed I was right!

https://github.com/catern/rsyscall
https://peter.bourgon.org/blog/2011/04/27/remote-development-from-mac-to-linux.html
https://github.com/dsherret/dax
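The final script itself is deno/TypeScript and isn't shown in this excerpt, but as a rough illustration of the "resolve names once, then fan the command out" shape described above, here is a minimal Python sketch; all hosts and commands are hypothetical:

```python
import asyncio

async def run_on(host: str, command: str) -> int:
    # One remote command; ssh carries a single pre-built command string.
    proc = await asyncio.create_subprocess_exec("ssh", host, command)
    return await proc.wait()

async def run_everywhere(hosts: list[str], command: str) -> list[int]:
    # Names were resolved to addresses once, up front; now fan out
    # the same command to every box concurrently.
    return await asyncio.gather(*(run_on(h, command) for h in hosts))

# Hypothetical usage: symbolic names "0" and "1" already resolved to IPs.
# asyncio.run(run_everywhere(["10.0.0.10", "10.0.0.11"], "uname -a"))
```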

2 views
Robin Moffatt 1 month ago

Alternatives to MinIO for single-node local S3

In late 2025 the company behind MinIO decided to abandon it to pursue other commercial interests. As well as upsetting a bunch of folk, it also put the cat amongst the pigeons for many software demos that relied on MinIO to emulate S3 storage locally, not to mention build pipelines that used it for validating S3 compatibility. In this blog post I'm going to look at some alternatives to MinIO.

Whilst MinIO is a lot more than 'just' a glorified tool for emulating S3 when building demos, my focus here is going to be on the simplest replacement. In practice that means the following:

- Must have a Docker image. So many demos are shipped as Docker Compose, and no-one likes brewing their own Docker images unless really necessary.
- Must provide S3 compatibility. The whole point of MinIO in these demos is to stand in for writing to actual S3.
- Must be free to use, with a strong preference for an Open Source (per OSI definition) licence, e.g. Apache 2.0.
- Should be simple to use for a single-node deployment.
- Should have a clear and active community and/or commercial backer. Any fule can vibe-code some abandon-ware slop, or fork a project in a fit of enthusiasm—but MinIO stood the test of time until now, and we don't want to be repeating this exercise in six months' time.
- Bonus points for excellent developer experience (DX), smooth configuration, good docs, etc.

What I'm not looking at is, for example, multi-node deployments, distributed storage, production support costs, GUI capabilities, and so on. That is, this blog post is not aimed at folk who were using MinIO as self-managed S3 in production. Feel free to leave a comment below though if you have useful things to add in this respect :)

My starting point for this is a very simple Docker Compose stack: DuckDB to read and write Iceberg data that's stored on S3, provided by MinIO to start with. You can find the code here. The Docker Compose is pretty straightforward:

- DuckDB, obviously, along with an Iceberg REST Catalog
- MinIO (S3 local storage)
- The MinIO CLI, used to automagically create a bucket for the data

When I insert data into DuckDB, it ends up in Iceberg format on S3, here in MinIO. In each of the samples I've built you can run the same check to verify it. Let's now explore the different alternatives to MinIO, and how easy it is to switch MinIO out for each. I've taken the above project and tried to re-implement it with as few changes as possible. I've left the MinIO S3 client in place, since that's no big deal to replace if you want to rip out MinIO completely (s3cmd, the AWS CLI, etc etc).

S3Proxy
💾 Example Docker Compose
Version tested:
✅ Docker image (5M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 👍👍 Very easy to implement, and seems like a nice lightweight option.

RustFS
💾 Example Docker Compose
Version tested:
✅ Docker image (100k+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: ✅✅ RustFS also includes a GUI.

SeaweedFS
💾 Example Docker Compose
Version tested:
✅ Docker image (5M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 👍 The quickstart is useful for getting bare-minimum S3 functionality working. (That said, I still just got Claude to do the implementation…). Overall there's not too much to change here; a fairly straightforward switchout of Docker images, but the auth does need its own config file (which, as with Garage, I inlined in the Docker Compose).
SeaweedFS comes with its own basic UI, which is handy. The SeaweedFS website is surprisingly sparse, and at a glance you'd be forgiven for missing that it's an OSS project, since there's a "pricing" option and the title of the front page is "SeaweedFS Enterprise" (and no GitHub link that I could find!). But an OSS project it is, and a long-established one: SeaweedFS has been around with S3 support since its 0.91 release in 2018. You can also learn more about SeaweedFS from these slides, including a comparison chart with MinIO.

CloudServer
💾 Example Docker Compose
Version tested:
✅ Docker image (also outdated ones on Docker Hub with 5M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 👍 Formerly known as S3 Server, CloudServer is part of a toolset called Zenko, published by Scality. It drops in to replace MinIO pretty easily, but I did find it slightly tricky at first to disentangle the set of names (cloudserver/zenko/scality) and what the actual software I needed to run was. There's also a slightly odd feel in that the docs link to an outdated Docker image.

Garage
💾 Example Docker Compose
Version tested:
✅ Docker image (1M+ pulls)
✅ Licence: AGPL
✅ S3 compatibility
Ease of config: 😵 I had to get a friend to help me with this one. As well as the container, I needed another to do the initial configuration, as well as a TOML config file which I've inlined in the Docker Compose to keep things concise. Could I have sat down and RTFM'd to figure it out myself? Yes. Do I have better things to do with my time? Also yes. So, Garage does work, but gosh… it is not just a drop-in replacement in terms of code changes. It requires different plumbing for initialisation, and it's not simple at that either: excellent for production hygiene, but overkill for local demos, and in fact somewhat of a hindrance, TBH.

Apache Ozone
💾 Example Docker Compose
Version tested:
✅ Docker images (1M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 😵 Ozone was spun out of Apache Hadoop (remember that?) in 2020, having been initially created as part of the HDFS project back in 2015. It does work as a replacement for MinIO, but it is not a lightweight alternative; neither I nor Claude could figure out how to deploy it with any fewer than four nodes. It gives heavy Hadoop vibes, and I wouldn't be rushing to adopt it for my use case here.

As for the remaining candidates, I took one look at the installation instructions and noped right out! Ozone (above) is heavyweight enough; I'm sure both are great at what they do, but they are not a lightweight container to slot into my Docker Compose stack for local demos.

Everyone loves a bake-off chart, right?

- gaul/s3proxy (Git repo): single contributor (Andrew Gaul)
- RustFS (Git repo): fancy website, but not much detail about the company
- SeaweedFS (Git repo): single contributor (Chris Lu); Enterprise option available
- Zenko CloudServer (Git repo): Scality (commercial company); 5M+ Docker pulls [1] (outdated version)
- Garage (Git repo): NGI/NLnet grants
- Apache Ozone (Git repo): Apache Software Foundation

[1] Docker pulls is a useful signal but not an absolute one, given that a small number of downstream projects using the image in a frequently-run CI/CD pipeline could easily distort this figure.

I got side-tracked into writing this blog because I wanted to update a demo in which MinIO is currently used. So, having tried them out, which of the options will I actually use?

- SeaweedFS - yes.
- S3Proxy - yes.
- RustFS - maybe, but it's a very new project & alpha release.
- CloudServer - yes, maybe?
Honestly, put off by it being part of a suite and worrying I’d need to understand other bits of it to use it—probably unfounded though.
Garage - no, config too complex for what I need.
Apache Ozone - lol no.

I mean to cast no shade on those options against which I’ve not recorded a ; they’re probably excellent projects, but just not focussed on my primary use case (simple & easy to configure single-node local S3).

A few parting considerations to bear in mind when choosing a replacement for MinIO:

Governance. Whilst all the projects are OSS, only Ozone is owned by a foundation (ASF). All the others could, in theory, change their licence at the drop of a hat (just like MinIO did).
Community health. What’s the "bus factor"? A couple of the projects above have a very long and healthy history—but from a single contributor. If they were to abandon the project, would someone in the community fork and continue to actively develop it?
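For what it’s worth, when I talk about "S3 compatibility" above, the bar for these demos is roughly this: a client pointed at the local endpoint can create a bucket and round-trip an object. Here is a minimal smoke-test sketch using boto3; the endpoint port, bucket name, and credentials are placeholders rather than any particular project’s defaults:

```python
# Hypothetical smoke test: point boto3 at whichever S3-compatible container is
# running locally and round-trip an object. Endpoint/credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # placeholder port for the local container
    aws_access_key_id="admin",               # placeholder credentials
    aws_secret_access_key="password",
)

bucket = "demo-bucket"
s3.create_bucket(Bucket=bucket)
s3.put_object(Bucket=bucket, Key="hello.txt", Body=b"hello, s3")

print(s3.get_object(Bucket=bucket, Key="hello.txt")["Body"].read())
print([o["Key"] for o in s3.list_objects_v2(Bucket=bucket).get("Contents", [])])
```

Any candidate that passes this kind of round-trip check is probably good enough to back a local demo stack like the one above.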

0 views

Most Code is Just Cache

Claude Code has systematically begun to consume many of the SaaS apps I used to pay for (or planned to). Why pay a subscription when I can "vibe code" a personal MVP in twenty minutes? I don’t worry about maintenance or vendor lock-in because, frankly, the code is disposable. If I need a new feature tomorrow, I don’t refactor—I just rebuild it. 1 Code is becoming just an ephemeral cache of my intent.

Cartoon via Nano Banana.

In this model, the ‘Source Code’ is the prompt and the context; the actual Python or JavaScript that executes is just the binary. We still run the code because it’s thermodynamically efficient and deterministic, but we treat it as disposable. If the behavior needs to change, we don’t refactor the binary; we re-compile the intent.

This shift has made me intolerant of static interfaces. I have stopped caring about software that doesn’t let me dump massive amounts of context into Gemini or Claude to just do the thing . If a product forces me to click buttons to execute a process that an LLM could intuit from a prompt, that product is already legacy.

It forces us to question the permanence of the current model. We often make the mistake of assuming software—as we know it today—is a permanent fixture of human productivity. But if you zoom out, the era of SaaS is a blink of an eye in modern history. It is easy to overestimate how core it is to the future. In this post, I want to extrapolate these thoughts a bit and write out what could be the final stages of software.

The stages here might not necessarily be chronological or mutually exclusive. Instead, they are ordered from static to dynamic code generation — where more and more the intent of a customer is the software they use.

This is the baseline where software is a static artifact sold as a service, built on the assumption that user problems are repetitive and predictable enough to be solved by rigid workflows. To the consumer, this looks like dashboards, CRUD forms, and hardcoded automations. The intelligence here is sourced mainly from the SaaS founder and hired domain experts, hard-coded into business logic years before the user ever logs in.

When: We recognized that distributing software via the cloud was more efficient than on-premise installations.
Value Loop: Customer Problem → Product Manager writes PRD → Engineers write Static Code → Deploy → Customer adapts their workflow to the tool. (Time: Months to Years | Fit: Generic / One-size-fits-none)

We are seeing this now with companies adopting Forward Deployed Engineering (FDE). In this stage, the SaaS company hires humans to manually use AI to build bespoke solutions for the client. For the consumer, this feels like a concierge service; they don’t get a login to a generic tool, they get a custom-built outcome delivered by a human who used AI to write the glue code. The intelligence is hybrid: the human provides the architecture, the AI writes the implementation code in weeks to days.

When: Companies realize AI allows their employees to build custom apps for clients faster than the clients can learn or adapt a generic tool.
Value Loop: Customer Problem → SaaS Employee (FDE) Prompts AI → AI generates Custom Script/App → Employee Deploys for Customer. (Time: Days | Fit: High / Tailored to specific customer edge cases)

This is the current “safe space” for most tech companies, where they bolt an LLM onto an existing application to handle unstructured data.
Consumers experience this as a “Draft Email” button in their CRM or a “Chat” sidebar in their UI—the platform is still the main product, but AI is a feature that (hopefully) reduces friction and/or provides some extra functionality customization 2 . The intelligence comes from a constrained model of product design and LLM scaffolding, providing content within a structure still strictly dictated by the SaaS platform’s code.

When: People start to see AI is good at summarizing, generating content, or taking actions within existing workflows.
Value Loop: Customer Problem → Static SaaS Interface AI Feature Text Box → Stochastic Result → Human Review. (Time: Minutes | Fit: Medium / Constrained by the platform’s UI)

This is the tipping point where the software interface starts to disappear, because the “interface” was just a way to collect context that the model can now ingest directly. Consumers move to a “Do this for me” interface where intent maps directly to an outcome rather than a button click, often realized as an agent calling a database or MCP servers 3 . The intelligence is the model and its engineered input context, relegating the SaaS role, in some sense, to providing clean proprietary data via an agent-friendly interface: Software as a Service for Agents.

When: People start to see AI is good at orchestrating complex decisions and using tools—across SaaS platforms—autonomously.
Value Loop: Customer Problem (Prompt as ~PRD) → Runtime Code Generation → Dynamic Outcome. (Time: Real-time | Fit: Very High / Dynamically generated for the specific context)

Critically, this doesn't mean the LLM acts as the CPU for every single user interaction (which would be latency-poor and energy-inefficient). Instead, the model almost acts as a Just-In-Time compiler. It generates the necessary code to execute the user’s intent, runs that code for the session, and then potentially discards it.

This is the end game in some cases. If code is just a cache for intent, eventually we bypass the cache and bake the intuition directly into the model. To the consumer, the “tool” is invisible; the expert system simply exists and provides answers or actions without a login or workflow. The intelligence is in the model itself; the software platform exists solely as a distillation mechanism—a gym to train the vertical AI—and once the model learns the domain, the software is no longer needed. A company in this stage is not really even SaaS anymore; it is maybe more of an AI-gyms-aaS company.

When: People start to see AI is good at absorbing the entire vertical’s intuition.
Value Loop: Raw Domain Data → Reinforcement Learning / Fine-Tuning → Model Weights. (Time: Instant / Pre-computed | Fit: Very High / Intuitive domain mastery)

This might feel unintuitive as a stage — like how could you bake some proprietary data lake into a model? How can our juicy data not be the moat? My conclusion is that most (but not all) data is a transformation of rawer upstream inputs, and that these transformations (data pipelines, cross-tenant analysis, human research, etc.) are all “cache” that can be distilled into a more general model that operates on its intuition and upstream platform inputs.

“But can agents run a bank?” Reliability and safety come down to distinguishing between guardrails (deterministic interfaces and scaffolding) and runtime execution (LLM code). For now, you don’t let the LLM invent the concept of a transaction ledger or rewrite the core banking loop on the fly.
In XX years, maybe we do trust AI to write core transaction logic; after all, fail-able humans wrote the code for most mission-critical software that exists today. The line between human-defined determinism and agent symbolic interfaces will gradually move over time.

“But enterprise SaaS is actually super complex.” Yes, but that complexity is mostly just unresolved ambiguity. Your “deep enterprise understanding” is often a collection of thousands of edge cases—permissions, policy exceptions, region-specific rules—that humans had to manually hard-code into IF/ELSE statements over a decade. Distilled to the core, this complexity collapses. The model doesn’t need 500 hard-coded features; it needs the raw data and the intent. An app built for one can also make a lot of simplifications compared to one that acts as a platform.

“Customers don’t want to prompt features.” I agree. I don’t think the future looks like a chatbot. “Chat” is a skeuomorphic bridge we use because we haven’t figured out the consistent native interface yet. It might be a UI that pre-emptively changes based on your role, or it might feel like hiring a really competent employee who just “takes care of it” without you needing to specify the . Or, as we see in Stage 2, the user never prompts at all—an FDE does it for them, and the user just gets a bespoke app that works perfectly.

Stage 1, where most companies are stuck today, definitely is. Why? Because the sheer overhead of traditional SaaS—the learning curve, the rigid workflows, the "click tax" to get work done—is becoming unacceptable in a world where intent can be executed directly. It feels increasingly archaic when flexible solutions can be generated on demand. The value is moving away from the workflow logic itself and toward two specific layers that sandwich it:

The Data Layer: Proprietary data, trust, and the “agentic scaffolding” that allows models to act safely within your domain.
The Presentation Layer: Brand and UI. While I suspect trying to control the presentation layer long-term is futile (as users will eventually bring their own “interface agents” to interact with your data), for now, it remains a differentiator.

We are going to see companies move through these tiers. The winners IMO will be the ones who realize that the "Service" part of SaaS is being replaced by model intelligence. The SaaS that remains will be the infrastructure of truth and the engine of agency. We are transitioning from a world of static artifacts (code that persists for years) to dynamic generations (code that exists for milliseconds or for a single answer).

Of course, I could be wrong. Maybe AI capability plateaus before it can fully integrate into complex verticals. Maybe traditional SaaS holds the line at Stage 2 or 3, protecting its moat through sheer inertia. Maybe the world ends up more decentralized. Some of my open questions:

Which stage should you work on today? Is there alpha in skipping straight to Stage 4, or do you need to build the Stage 2 “vibe coding” service to bootstrap for now?
What are the interfaces of the future? Is it MCP, curated compute sandboxes, or a yet-to-be-defined agent-to-agent-to-human protocol? What interface wins out, or does each company or consumer bring their own agentic worker?
How fast does this happen? Are we looking at a multi-decade-long transition, or do companies today rapidly start dropping lower-stage SaaS tools?
Does AI have a similar impact beyond software? Does medicine move from “static protocols” to “on-demand, patient-specific treatments”?
Even more so than me you can see Geoffrey Huntley’s ralph-powered rampage of GitHub and many other tools .
I liked this tweet by Harj Taggar, “moved away from the FDE playbook that’s become the default for fast growing AI startups. Instead they’ve built AI to covert plain English from the customer into Python code to make the product work for their use cases” .
Similar to Karpathy’s “LLMs not as a chatbot, but the kernel process of a new Operating System” (2023)

12 views
Simon Willison 1 months ago

Fly's new Sprites.dev addresses both developer sandboxes and API sandboxes at the same time

New from Fly.io today: Sprites.dev . Here's their blog post and YouTube demo . It's an interesting new product that's quite difficult to explain - Fly call it "Stateful sandbox environments with checkpoint & restore" but I see it as hitting two of my current favorite problems: a safe development environment for running coding agents and an API for running untrusted code in a secure sandbox. Disclosure: Fly sponsor some of my work. They did not ask me to write about Sprites and I didn't get preview access prior to the launch. My enthusiasm here is genuine. I predicted earlier this week that "we’re due a Challenger disaster with respect to coding agent security" due to the terrifying way most of us are using coding agents like Claude Code and Codex CLI. Running them in mode (aka YOLO mode, where the agent acts without constantly seeking approval first) unlocks so much more power, but also means that a mistake or a malicious prompt injection can cause all sorts of damage to your system and data. The safe way to run YOLO mode is in a robust sandbox, where the worst thing that can happen is the sandbox gets messed up and you have to throw it away and get another one. That's the first problem Sprites solves: That's all it takes to get SSH connected to a fresh environment, running in an ~8GB RAM, 8 CPU server. And... Claude Code and Codex and Gemini CLI and Python 3.13 and Node.js 22.20 and a bunch of other tools are already installed. The first time you run it neatly signs you in to your existing account with Anthropic. The Sprites VM is persistent so future runs of will get you back to where you were before. ... and it automatically sets up port forwarding, so you can run a localhost server on your Sprite and access it from on your machine. There's also a command you can run to assign a public URL to your Sprite, so anyone else can access it if they know the secret URL. In the blog post Kurt Mackey argues that ephemeral, disposable sandboxes are not the best fit for coding agents: The state of the art in agent isolation is a read-only sandbox. At Fly.io, we’ve been selling that story for years, and we’re calling it: ephemeral sandboxes are obsolete. Stop killing your sandboxes every time you use them. [...] If you force an agent to, it’ll work around containerization and do work . But you’re not helping the agent in any way by doing that. They don’t want containers. They don’t want “sandboxes”. They want computers. [...] with an actual computer, Claude doesn’t have to rebuild my entire development environment every time I pick up a PR. Each Sprite gets a proper filesystem which persists in between sessions, even while the Sprite itself shuts down after inactivity. It sounds like they're doing some clever filesystem tricks here, I'm looking forward to learning more about those in the future. There are some clues on the homepage : You read and write to fast, directly attached NVMe storage. Your data then gets written to durable, external object storage. [...] You don't pay for allocated filesystem space, just the blocks you write. And it's all TRIM friendly, so your bill goes down when you delete things. The really clever feature is checkpoints. You (or your coding agent) can trigger a checkpoint which takes around 300ms. This captures the entire disk state and can then be rolled back to later. 
For more on how that works, run this in a Sprite: Here's the relevant section: Or run this to see the for the command used to manage them: Which looks like this:

I'm a big fan of Skills , the mechanism whereby Claude Code (and increasingly other agents too) can be given additional capabilities by describing them in Markdown files in a specific directory structure. In a smart piece of design, Sprites uses pre-installed skills to teach Claude how Sprites itself works. This means you can ask Claude on the machine how to do things like open up ports and it will talk you through the process. There's all sorts of interesting stuff in the folder on that machine - digging in there is a great way to learn more about how Sprites works.

Also from my predictions post earlier this week: "We’re finally going to solve sandboxing" . I am obsessed with this problem: I want to be able to run untrusted code safely, both on my personal devices and in the context of web services I'm building for other people to use. I have so many things I want to build that depend on being able to take untrusted code - from users or from LLMs or from LLMs-driven-by-users - and run that code in a sandbox where I can be confident that the blast radius if something goes wrong is tightly contained.

Sprites offers a clean JSON API for doing exactly that, plus client libraries in Go and TypeScript and coming-soon Python and Elixir . From their quick start: You can also checkpoint and rollback via the API, so you can get your environment exactly how you like it, checkpoint it, run a bunch of untrusted code, then roll back to the clean checkpoint when you're done. Managing network access is an important part of maintaining a good sandbox. The Sprites API lets you configure network access policies using a DNS-based allow/deny list like this:

Sprites have scale-to-zero baked into the architecture. They go to sleep after 30 seconds of inactivity, wake up quickly when needed and bill you for just the CPU hours, RAM hours and GB-hours of storage you use while the Sprite is awake. Fly estimate a 4 hour intensive coding session as costing around 46 cents, and a low traffic web app with 30 hours of wake time per month at ~$4. (I calculate that a web app that consumes all 8 CPUs and all 8GBs of RAM 24/7 for a month would cost ((7 cents * 8 * 24 * 30) + (4.375 cents * 8 * 24 * 30)) / 100 = $655.2 per month, so don't necessarily use these as your primary web hosting solution for an app that soaks up all available CPU and RAM!)

I was hopeful that Fly would enter the developer-friendly sandbox API market, especially given other entrants from companies like Cloudflare and Modal and E2B . I did not expect that they'd tackle the developer sandbox problem at the same time, and with the same product! My one concern here is that it makes the product itself a little harder to explain. I'm already spinning up some prototypes of sandbox-adjacent things I've always wanted to build, and early signs are very promising. I'll write more about these as they turn into useful projects.

Update : Here's some additional colour from Thomas Ptacek on Hacker News: This has been in the works for quite awhile here. We put a long bet on "slow create fast start/stop" --- which is a really interesting and useful shape for execution environments --- but it didn't make sense to sandboxers, so "fast create" has been the White Whale at Fly.io for over a year.
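As an aside, the worst-case pricing arithmetic quoted above is easy to re-derive. A small sketch, treating the 7 cents figure as the per-CPU-hour rate and 4.375 cents as the per-GB-of-RAM-hour rate (my reading of the quoted sums, not official pricing):

```python
# Re-deriving the always-on worst case: 8 CPUs and 8 GB of RAM, awake 24/7 for a
# 30-day month, at the per-hour rates quoted in the post (assumed split: CPU vs RAM).
cpu_cents_per_hour = 7       # per CPU-hour (assumed)
ram_cents_per_hour = 4.375   # per GB-of-RAM-hour (assumed)
hours = 24 * 30

total_cents = (cpu_cents_per_hour * 8 * hours) + (ram_cents_per_hour * 8 * hours)
print(f"${total_cents / 100:.2f} per month")   # -> $655.20
```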

1 views
Giles's blog 1 months ago

Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that: In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way? Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing. So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?" Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. They can train the model in less than four hours, they happen to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation. If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results . Back when I was messing around with fine-tuning LLMs using the Hugging Face ecosystem -- their "Transformers" library and so on -- one of the experiments I did was to fine-tune a 0.5B Qwen model on an 8x GPU machine . As part of that, I came across this excellent HF page summarising different kinds of multi-GPU training techniques . The three that are relevant are: Now, from what I understand, due to all of the copying around of models, plus the issues inherent with the GIL in Python, DDP is actually better than DP despite being more complicated -- and more flexible! Per Hugging Face: DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine. It might be a while before I want to try multi-machine training, but it would be awesome to have code that's ready to do that without needing any extra work. Now, how to implement it? Hugging Face have a library called Accelerate , which does everything for you: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! That does sound very useful, but I worry that by using it I won't learn as much. It also rather ties you in to the HF ecosystem. That's not necessarily a bad thing -- I enjoyed using their stuff in my fine-tuning project -- but I'm trying for a somewhat lower-level view in this series. So, let's use the PyTorch-native stuff. 
There's a "getting started" tutorial , so we can follow that. It has two options for running using DDP, one with a bit of extra setup code -- the first example, under "Basic Use Case" -- and one that uses to make things easier. The second sounds best. The code changes actually look really simple; given a normal single-GPU training script, you need to do some setup at the start: ...then wrap the model itself in a object, which is what you actually do the train on: ...and a bit of teardown at the end: The way to look at this is that will spin off one process per GPU, each running exactly the same code. They have a "rank", which is an integer saying which of the per-GPU processes they are -- 0 for GPU 0, 1 for GPU 1, and so on. There's a bit of a gotcha here, though -- you can see that we're looking at an environment variable called at the start, but we then get a (non-"local") variable from a bit later on. This is due to the multi-machine possibilities with DDP -- if you have multiple machines, then the local rank will be "which GPU on the machine does this process relate to", but there will also be a "global" rank, which is unique across all machines. This distinction won't matter that much during this one-machine test, but it's worth keeping in mind if we want to keep the code in a shape where it could potentially scale to multiple machines. Anyway, after the processes are spun up, they will do their training, and the synchronisation and passing around of gradients during the backward pass will all happen invisibly in the background, so when we do our , it will have the full set of gradients. Now that means that we'll presumably also need to use the rank -- that is, which of the n per-GPU processes the current code is running in -- when selecting which dataset items to train on. More about that later. Let's start writing some code! I'll use a new repo , into which I can put just the code needed for this train. I'll also structure it a little better than last time, with separate "runs", each of which has a model config and training parameters, and will later on have its own checkpoints. You can think of these as being one per machine size that I'm trying out -- I'll create a run directory for each one. Here's a first cut , simply loading up a model config from a run's directory, using it to create the model, and then doing the wrapping above -- no training at all. Running it with (and , as I'm using that for all new projects): Promising. Now, unfortunately we only have one GPU locally, and the code assumes that it's one process per GPU (I believe that's a hard limitation for PyTorch's DDP), so running with blows up. So we can't do an in-depth test locally. But at least we know that the basic infra is there and working. Now let's move the other training code from the single-GPU script into that file, pretty much blindly. This is the result -- it's doing almost nothing beyond what the last train did, apart from wrapping the model in a object -- the only other changes are to use this "runs" directory that we've introduced. As a quick hack, we should try running it. It does a validation and checkpoint before it starts, and we can make that happen quickly by hacking the validation loop to only do a couple of iterations: (Foreshadowing: that hack will come back to haunt us later!) 
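As a rough sketch of the shape being described here (a placeholder model and fake data standing in for the real training script, not the post's actual code), the torchrun-launched DDP skeleton looks something like this:

```python
# Minimal torchrun-style DDP skeleton (placeholder model/data, not the post's code).
# Launch with e.g.: torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun, one value per process
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # torchrun supplies the rendezvous env vars

    rank = dist.get_rank()                       # global rank, unique across all machines
    world_size = dist.get_world_size()

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the real GPT-2-style model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(8, 1024, device=f"cuda:{local_rank}")   # stand-in micro-batch
        loss = ddp_model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()          # DDP synchronises gradients across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The key points are that torchrun sets LOCAL_RANK per process, and that the gradient averaging happens invisibly inside backward() on the DDP-wrapped model.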
Running that, then hitting control-C after the validation completes, and it looks OK: ...and we have what look like solid checkpoints: However, loading one of those checkpoints fails: It turns out that the problem is this code when we save it: The that we're saving is the wrapper around our model; my guess is that it does actually include all of the weights for the model, hence the correct-looking size for the checkpoint file, but they're renamed -- the wrapper sees the underlying model as something called , so (for example) would be called . Fixing that, with this diff: ...sorts it out -- we can load our checkpoints again. Here's the updated file . I think we're going to have to revisit checkpointing and validation again; we don't want to do it in all of our processes, probably only on global rank 0, and we'll need to somehow synchronise everything so that the other processes don't carry on training while we're doing it. But before we get on to that, there are a couple of other things to change. At the top of the file we're defining some constants that look wrong: We'll handle the dumbest of these first; it was actually silly that in the old code we had a constant for sequence length. We're using the context length of the model for that, so it's duplicated information. Let's get it from the : ...and here's the updated file . That was nice and simple. The code that we have specifies the batch size for each GPU -- that is, with , we'll have six sequences in each batch on each one. Like I mentioned earlier, that's called a "micro-batch" in distributed training like this 1 -- a per-GPU batch, as opposed to the overall global size across all GPUs -- so we could just rename it, and then we'd have 6 × n gpus as a global batch size. However, it feels to me like this is a useful metaparameter to be able to tweak from outside the code. I can see machines with per-GPU VRAM varying from 40 GiB to 160 GiB on Lambda Labs, and pretty clearly that will mean there will be a varying largest micro-batch size on each type. So this is something we'll want to configure on a per-run basis, so let's add a new file to our run config, load that up, and pass it through. That's a simple enough fix; no need to note the diff, but here's the code . This one we'll need to think about. The size of our validation set is based on what one process running on my local RTX 3090 can validate in five minutes, and the interval (for which I fairly arbitrarily put 2000 in the code when copying it across) was calibrated for roughly every half-hour. Those numbers in turn were aimed at the 44 hours of training time I expected locally. For this train, we'll (hopefully!) be taking significantly less time. We'll have eight GPUs, so naively that's 5.5 hours of train time, and each will have more VRAM, so we should be able to bump up the batch size and potentially get even faster than that. Depending on which kind of cards we're using, they may be faster, too -- I found that an A100 is slower (with the same batch size) than the RTX 3090 in my fine-tuning experiments, but the H100 and B200 are likely faster. I think this is another thing for the train config; we should have the validation interval (in terms of iterations) and the number of batches to do for validation. Here's the updated code . Now, let's move on to the dataset. 
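To make that checkpoint fix concrete, here is a self-contained sketch of the idea (a stand-in linear model rather than the real GPT, and it assumes the process group from the earlier sketch is already initialised):

```python
# Sketch of the checkpoint fix described above: save the underlying module's
# state_dict so the keys don't carry the DDP wrapper's "module." prefix.
# Assumes dist.init_process_group() has already run (e.g. via the earlier skeleton).
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(10, 10).cuda()        # stand-in for the real model
ddp_model = DDP(model, device_ids=[0])

# The wrapper's keys are prefixed, e.g. "module.weight", so a bare model can't load them:
print(list(ddp_model.state_dict())[:2])

# Save the inner module's weights instead, so they load back into a plain model:
torch.save(ddp_model.module.state_dict(), "checkpoint.pt")
model.load_state_dict(torch.load("checkpoint.pt"))
```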
With the code as it is right now, all of our per-GPU processes are using this code to iterate over the same dataset: That means that they'll all be training on the same data; the synchronisation that is happening "magically" in the background means that they'll all train on the first item, work out gradients, and step their optimiser -- so they'll essentially (modulo randomness) have the same updates. Pretty pointless! What we want is for each of the n per-GPU processes to train on 1 / n of the data.

We have two useful helpers in : , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) So, the simplest thing to do is to use the world size as a step, and the rank as an offset: Here's the code with that .

Now, remember that the same code is running for every one of our per-GPU processes. That means that all of them will do the training with forward and backward passes, and their own optimiser steps, all synchronised by PyTorch DDP magic. But they will also do their own validations -- which is kind of pointless -- and they'll also try to save their own checkpoints, which would be messy because they could quite easily interfere with each other; after all, all of the processes are running on the same machine and would be writing to the same filesystem. So, as a first cut, let's just wrap an around the eval and checkpointing stuff -- we change this: ...to this: That line is getting a bit long, so let's break it apart a bit: That looks OK, but there's an extra wrinkle: all of the processes are running the same code, so while the rank zero one will do the eval, the others will continue through the script, so they will go right back around our loop and start training on the next batches -- which is bad. We want our processes to be proceeding in lockstep, iteration-by-iteration.

Luckily, the solution is simple: the function in basically says "stop here until all of our processes have reached this point". So we can use two of those -- one before the eval loop, to make sure that all of the processes have finished their training part of the iteration before we do the eval on rank zero, and one after the eval, so that the non-rank-zero processes will wait. One bit of complexity -- we want to do those barriers only if it's an eval iteration, but we want to do them for all processes. So we have to break up the statement, and we wind up with this: That seems to work OK ( code here ), but it does give a warning: So, we want to pass the device ID in when we call . Let's dig into that a bit.

Here's the copypasta that I took from the PyTorch tutorial earlier in this post: Let's dig into what that is doing. The environment variable is being set by to 0, 1, 2, etc as appropriate to tell us which process we are on this machine. So the first line is telling PyTorch to use the device with that index for this process . The next line is getting the current accelerator -- that is, an object that represents which acceleration hardware we're using in this process. I think that the best way to see the combination of these two lines is that the first says "use " (or 1, or 2, or...), and then the second says "get the object describing the GPU you're using right now". So it's a slightly indirect way of getting the object containing the details of the GPU in question.
Next, we call . A backend in this context is an abstraction of whatever system the device in question is programmed using -- in the case of an Nvidia GPU, it would be some kind of thing that encapsulates CUDA. Once that's done, we call , passing in the backend that we're using. We're saying "initialise the internal data structures for so that they're all set up properly to work with the backend we specified". After that, we can do stuff like getting the global rank with and so on, because has been properly initialized. Presumably at this point we're talking to any other machines in a multi-machine cluster, so we can find out what our world size is and that kind of thing. That extra line at the end, to get the : ...actually looks erroneous to me. All of our code is assuming one process per GPU. So I think we can just use the there as well. Let's rewrite it like this (with some useful comments): That seems to work well! Here's the code . However, I ran it past ChatGPT (largely to validate my understanding of what was going on), and it highlighted something slightly misleading about it. Right now, we're training on a single node, with one process per GPU. But again, one of the neat-o things about this DDP stuff is that it should be able to scale to multiple nodes. Now, remember that is just the rank of the current process on the specific node that it's running on -- hence the name. If we had two machines, each with 8 GPUs, then there would be a process with rank zero on each of them. The "real" rank -- that is, across all machines -- is the one that you can get from once it has been initialised. One of the things it does during that initialisation is to talk to all of the other nodes and work that kind of thing out -- which of the local rank zero processes across all of the machines is the global rank zero process. So we need to use the local rank when working out which GPU we should be running on and so on, but we should not treat it as a global rank. That's actually quite fine in this case, as we're calling inside the training loop when we actually need to use the global one (when indexing into the dataset, or when deciding if we're the process that should be doing evals and checkpoints). The only place where we might be confusing matters is in that print, which is not important anyway, as the training loop also prints out its rank. So, let's tweak it a little more for clarity: That seems to work well! Here's the code . Time to run it past ChatGPT to see if I've made any dumb errors. Turns out that (unsurprisingly) I have... Let's go back to our code that decides whether or not it's an iteration where we need to do a validation run and a checkpoint: The problem is that our index is different in the different processes! Remember, we have this in order to pick out the correct training items: So let's think about it; in the first run through the loop, with 8 GPUs, we would have In the next run through the loop, we'd have: So will give different results for each process. That might not sound like the end of the world -- will only be zero for one of them, so long as is larger than the number of GPUs -- but remember that our validation code looks like this: Now, if different processes have different values for , then will only be called in the one(s) for which it is . But means "wait until all processes have reached this barrier". So the ones that call it will lock up completely until other processes get there, and everything will at best get out-of-sync, and at worst will lock up completely. 
I think that the problem here is that I'm conflating two things: the index of the global step -- that is, one iteration across all GPUs -- and the dataset element that we want to use. In the original one-GPU case that made sense; iteration 0 was on dataset element 0, iteration 1 was on element 1, and so on. But now the offset into the dataset, and the global step, are quite different things. This is quite deeply embedded in the code, but we can fix it!

Let's start off by changing our checkpoint code, just to rename things. It keeps track of a variable called , our offset into the training dataset, and uses that both to index into the dataset, and to work out how far through the train we are. The latter is a much better thing to store in a checkpoint, so instead of saving , we'll store (and restore) . Basically, just a rename so that the variables and stored JSON match the new reality. Here's the updated code .

Now we need to make a number of minor changes to the training loop just to match that rename of the value that we're checkpointing (eg. for the code to generate the training chart) but the most important change is to our loop. Instead of iterating over our dataset with a step and an offset so that we can index into it, we firstly work out how many global steps there will be: ...then we iterate from our initial global step -- zero if we're starting a fresh train, or whatever global step we were on in a loaded checkpoint plus one if we're doing a continued train from a checkpoint -- up to the : That means that we need to use the global step, the world size, and our current rank to work out which dataset item we should be training on for this process at this global step. Let's say that we have eight processes; on the 0th global step, we should have rank 0 training on dataset item 0, rank 1 on item 1, and so on. On the next global step, rank 0 should train on item 8, rank 1 on 9, and so on. So: That's actually much more elegant than the earlier code, and seems to work fine. Here it is . Phew, glad to have caught that before I started spending money on machines -- it would have been confusing if everything locked up. Thanks, ChatGPT!

Another thing raised by ChatGPT is about the validation. We don't want to validate across all of the validation dataset -- we're using a number from the . I have this code: This looked like a nice, quick way to get the first elements of the validation dataset. But ChatGPT told me it would raise. It didn't, though -- why? The problem is that I had set to in my training config for testing. Stepping through what that slice does, when we run : Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack". Nasty! AI code review certainly helped me dodge a bullet on that one. Let's fix it, it's not a big change: we can just do this: ...and that works! So here's the code now .

So, I think we have one final issue, which is the training and validation datasets.
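Putting those pieces together, a sketch of the corrected loop might look like the following; the helper names (train_step, validate_and_checkpoint, eval_interval, start_global_step and so on) are placeholders rather than the post's actual functions, and it assumes the DDP setup from the earlier sketch has already run:

```python
# Sketch of the corrected loop: every rank computes the same global_step, so the
# eval/checkpoint decision (and the barriers around it) stay in lockstep.
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

total_global_steps = len(train_dataset) // world_size   # train_dataset is a placeholder

for global_step in range(start_global_step, total_global_steps):
    dataset_index = global_step * world_size + rank      # which item this rank trains on
    train_step(train_dataset[dataset_index])              # placeholder training step

    if global_step % eval_interval == 0:                  # same answer on every rank
        dist.barrier()                     # let every rank finish this step first
        if rank == 0:
            validate_and_checkpoint(global_step)           # rank 0 only
        dist.barrier()                     # non-zero ranks wait for rank 0 to finish
```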
In our single-GPU train, we worked out ahead of time how much of FineWeb (or FineWeb-Edu) to train on -- the Chinchilla-optimal number -- and generated a dataset that contained a round number of 6-sequence, 1024-token batches that was the smallest such round number that was larger than our target. We also worked out exactly how large (in terms of batches) our validation dataset needed to be so that each validation run would take five minutes. There was one big issue with that system; when I decided to do an "extended" train on more of the FineWeb-Edu dataset, in order to see whether I could get the loss down further, I had to do some nasty hackery in order to generate a new one. So it would be nice to not have that problem this time around. Additionally, we're likely to be tweaking the batch size quite a lot in this experiment while we find what the appropriate level is to fit onto the cloud GPUs, and also varying how much validation we do -- and additionally, we have the world size to worry about. I think that the best way to give us the flexibility we need will be to pre-convert the complete FineWeb and FineWeb-Edu datasets into the format we need -- each sequence in the dataset converted to GPT-2 tokens, and then those sequences concatenated together, with the token 50257 separating them. It would be good to properly nail down the validation dataset at the same time. So we can have a script that loads up the original dataset as downloaded from Hugging Face, splits it into 99% train, 1% validation, does the conversion, and then saves them as safetensors files. If we use for those (which is just large enough for our 50,257-token vocab), we can fit the ~10B tokens in each dataset's train split into 20 GiB of disk. Not too bad. But there will still be the issue of getting them onto our cloud machines. Let's generate the data, and then work out how to handle that. I tried initially with the code I used last time, adapted to run through the entire dataset . It does the 99%/1% train/validation split, and then for each of those generates a single massive tensor of tokens like this: It almost worked! To my surprise, it got all the way to the end, and only blew up with an out-of-memory error when it was trying to save the result -- and it did that completely silently, so I thought it had worked right up until I tried to check the file on disk to see how large it was, and it wasn't there. The obvious tweak: set the list to just after the , to free up the memory it's using. Given that it was the save that triggered the OOM, you'd think that that would be enough -- but it turned out not to be so. Rather than mess around with this for much longer, I just decided to add on 128 GiB of swap to my machine temporarily: ...and that was enough to make it run. So I've now generated pre-tokenised, pre-concatenated train and validation sets for both FineWeb and FineWeb-Edu: Now, thinking about how to get it up to the Lambda Labs machines. I have normal 1 Gb residential broadband, so conceivably I could upload 20 GiB in about 200 seconds. But that's assuming that there's no network congestion, so I would expect it to take longer. The LL machines are quite expensive, and I don't want to waste money keeping them up while I'm just uploading data. There are possibilities here: I think the best option is to use option (1), but with the option of also doing (2). The HF dataset will still take time to download to LL, even over the faster network connection. 
That might not be a problem -- but if it is, I download it once on a cheap instance and use a persistent disk too. Essentially I'd be using the persistent disk as a "cache", and still get the benefits of the easily-shareable datasets on Hugging Face. So, that decided, let's find out how we can upload a whacking great 20 GiB safetensors file as a dataset on Hugging Face.

It turns out that resources like datasets on HF are just Git repositories using the LFS (Large File Storage) plugin to be able to handle, well, large files. Conveniently, given that I'm using to manage my project, there's a plugin that allows me to use their CLI tools with minimal effort, so: Both datasets show up on my profile page on Hugging Face, so that's looking good. Now it's time to try to upload the data. We'll need to install Git's LFS support first: Now let's try the FineWeb one first: OK, so we need some kind of extra thing to tell it we can use large files on top of the LFS stuff: Right, now let's try again: Weird that it prompted for the credentials twice, but it did appear to try to do something there -- but obviously it didn't work. Let's see if Git over SSH is any better. ...then the same stuff to copy in the files and create the metadata file, then: Looks like the same error. Odd.

Let's try using HF's upload tools rather than Git -- feels like a bit of a cop-out, but maybe it'll work better. That did indeed take about 200 seconds to run, but the upload speed was only about 10 MiB/s -- from the output, I think it must have been compressing it. Anyway, it looks like it succeeded, so let's upload the others! ...and that's done :-) Next, a bit of manual editing of the dataset cards on the Hugging Face website, and we have our two new public datasets: That looks solid.

So, the next thing: change our codebase so that we have some quick and easy way to download them (I'm feeling a little wary of using Git for that after the upload issue), and then to use the downloaded files in our training code. We already have the code to download a dataset; the stuff that I wrote to download FineWeb and FineWeb-Edu originally. Here's the important bit: ...so we can adapt that to download all files in an arbitrary dataset: ...and call that from our , using a new command-line argument , and a new element in our train config JSON file: I was thinking that we'd need extra guard code to not download the dataset again if it's already there, but it looks like handles that all nicely for us. So we have a way to specify which dataset we should use for a training run, and code to download it. Now we just need to adjust the code that loads our datasets so that instead of looking in the , it looks in the directory returned by : ...and update the directory so that it just blindly uses the directory provided rather than trying to look in a subdirectory: That all works! We successfully download the datasets and try to use them. Here's the code .

But now we have a problem; when the tries to reshape the huge tensor that we have as our inputs: ...it craps out: That makes perfect sense. Our original files were carefully sized for a batch size of six, and 1024-token sequences. We need some way to work out an appropriate slice of both the training and the validation data. Most of the trains are likely to be Chinchilla-optimal, or at least use a Chinchilla-optimal number of tokens -- rounded up appropriately to match our micro-batch size, sequence length, and world size. But I'd like it to be more configurable.
What I'll do is add a key to the training config dictionary, along with a so that we can (for example) train on the first Chinchilla-optimal tokens, then do an extended train continuing on from there. The idea is that we can use as a base, and train on the smallest number of full batches that contains at least that many tokens. For validation, I think that the key that we already have is actually quite nice. Validation is time-bound, and the number of batches is the easiest lever to pull to handle that. However, a would be nice for symmetry. So, here are some numbers for debugging: Now let's use them. Initially, we have this to load the train dataset: Let's work through that one first then make appropriate changes to the validation one. The pieces of information we need to work out which tokens to use are: Let's update our function so that it takes those parameters in that order: ...and now we can write an updated that uses those numbers to get the right number of tokens: Validation is less obvious; I think that the best way to do this (given that the validation dataset is small) is just to have a "magic" value for , which means "just get a round number of full batches starting at . It's also worth remembering that we only do evals on the rank 0 process, so we could in theory pass in a world size of 1 -- but I think that passing in the real world size might be a good idea, because it gives us one fewer thing to change if, in the future, we move towards distributed evals. ...and we change to be able to handle the magic : I also added in a quick sanity check to make sure that we don't get weird behaviour if the is past the end of the original dataset. That all looks good! Running it kicks off training, and validation is running happily every ten global steps, but just with three samples, as configured in the JSON file. Here's the code . One thing that hasn't shown up while running this code locally is that our training loop has this: With one GPU, that's fine, but on a multi-GPU machine, that is going to happen in all of our per-GPU processes -- so they'll all be spamming out progress bars, which will be ugly. So, as a first cut: Now, in order to compare different machines (say, an 8x H100 vs an 8x A100) it would be nice to get tokens-per-second numbers while training. We can do that in the progress bar too! It has a method that adds stuff to the end of the bar, just after the elapsed time and iterations/second numbers. For that, we'll need to have the object available in a variable: ...and now we can count the total tokens seen in the training run, plus keep track of the start time -- just before the start of the training loop: ...then inside, after the training step: That will give us a running average of tokens per second over the train as a whole since the start. Running that, we get a nice progress bar like this (you'll need to scroll to the right): Note that the tokens per second is worse than the just less than 20k that we got when running the single-GPU test previously, but that's due to the testing setup I have -- I'm doing an eval every 10 global steps. Changing that to 1,000,000 so that we just get a single eval when we start, then letting it run for a while to settle down from the initial eval, we get this: ...which is close enough to what we had before. Finally, let's print out some summary information at the end: Ran that on a super-short train with about 50 iterations-worth of tokens, and: Looking good. Here's the code . 
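For the progress bar, here is a sketch of the rank-0-only tqdm with a running tokens-per-second readout, using tqdm's set_postfix_str (the sizes, the fake training step, and the hard-coded rank/world size are placeholders, not the post's real code):

```python
# Sketch: rank-0-only tqdm progress bar with a running tokens/sec readout.
import time
from tqdm import tqdm

rank, world_size = 0, 8              # really dist.get_rank() / dist.get_world_size()
micro_batch_size, seq_len = 13, 1024
total_global_steps = 1000

steps = range(total_global_steps)
progress = tqdm(steps) if rank == 0 else steps   # only rank 0 draws a bar

tokens_seen = 0
start_time = time.time()
for global_step in progress:
    time.sleep(0.001)                # stand-in for the real training step
    tokens_seen += micro_batch_size * seq_len * world_size
    if rank == 0:
        tokens_per_sec = tokens_seen / (time.time() - start_time)
        progress.set_postfix_str(f"{tokens_per_sec:,.0f} tok/s")
```

This gives the running average since the start of the run, which is what the post describes wanting for comparing machine types.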
I think we now have something where it's worth spinning up a Lambda Labs machine to run it. Let's kick off a training run on the cheapest two-GPU machine that they have available right now. That's actually not all that cheap, it's a $6.38/hour 2x H100 80 GiB SXM5. But I'm not planning to do a full train on it yet, this is just a sanity test. I won't attach a filesystem this time, either -- let's see how things go without the caching of the datasets that I was considering. First thing: do we have ? Nope. OK, let's install it: Right, now let's clone our repo and set up our environment: And now I think we can just try running it! It took 18 seconds to download the dataset! I don't think we need to worry about the caching thing with persistent disks, at least at this point. But there are a couple of issues here. I didn't put the number of processes in the command line -- I should be using Also, we don't have the XKCD font family. I'll ignore that for now. OK, that's looking good! Let's make our validations happen less often, and see how high we can get the micro-batches with the 80 GiB VRAM we have on each of our two GPUs. Doing a binary chop, I set the micro-batch size to 100 (OOM), then to 50 (OOM), then to 25 (worked), then to 37 (OOM), then 31 (OOM), then 28 (worked), and finally 29 (OOM). So we have a batch size of 28 for our 80 GiB machines. Leaving it for a little while to settle down, and we get to about 142,000 tokens/second. Now, on the 3090, we were training at 20,000 tokens/second. That means that this machine is running at about 7 times the speed. Given that our original train finished in 48 hours, we'd expect the train to finish in about 6, which indeed is the estimated time on the tqdm progress bar. At $6.38 per hour, that comes to $38.28. Not bad! And this instance is actually quite pricey on a per-GPU basis -- it's $3.19 per GPU/hour, whereas there is an 8x H100 that costs $2.99 per GPU/hour. I'm almost tempted to let it run. But the purpose of this run was to work out the bugs. We're going to want to track the training chart -- remember that after every validation run, our training code generates a chart showing the training and validation loss so far, like this one . I ran the normal quick-and-dirty Python webserver command on the instance, inside the directory containing the training chart: My browser didn't connect to it, but looking at the Lambda Labs interface, there's a new "Firewall" section, where you configure rules for allowing incoming connections to your instances. That's sensible, and the default rules are just "allow SSH from any IP" and "allow ping from any IP". Adding one letting anyone access port 8000 fixed the problem, and I saw a directory listing; clicking on the chart showed exactly what I'd expect, but without the XKCD fonts. Nice. Let's work out how to fix that XKCD font thing. Looking around, it seems like there are approximately twenty thousand ways to do it. Here's one that seems to work; firstly, install the font on the system: Now, that installs a font that has the family name 'xkcd Script` (with that erratic capitalisation). So we need to change the code to pick up pretty much anything that looks like it's XKCD, so instead of this: ...we can do this: That seems to work OK. So, now, I think we have the beginnings of a script to set up a Lambda Labs machine so that we can use it. Let's write a with this: ...and give it another go on a fresh machine. Shut this one down -- total cost so far $7.28. Now there are no 2-GPU instances available. 
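(Stepping back to that font fix for a moment: one way the family lookup can work is to scan matplotlib's font manager for anything whose name mentions "xkcd", whatever the capitalisation. A sketch of the idea, not necessarily the exact code used here:)

```python
# Pick up any installed font whose family name contains "xkcd", regardless
# of how it's capitalised, and make it the default family.
import matplotlib.pyplot as plt
from matplotlib import font_manager

xkcd_families = sorted({
    font.name
    for font in font_manager.fontManager.ttflist
    if "xkcd" in font.name.lower()
})

if xkcd_families:
    # e.g. ["xkcd Script"]; matplotlib falls back through the list in order.
    plt.rcParams["font.family"] = xkcd_families
```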
There is a super-cheap 1x A10 (basically the datacenter version of a 3090), though, so let's use that -- we're as certain as we can be that the multi-GPU stuff works, and the proof of the pudding will be whether we can train a model that works. After spinning up our 1x A10 machine: Looking good! I think we have something that (in theory) should work. That cost $0.05. I think it's time to do our first train on a big instance. There are four 8x instances available on Lambda Labs for me right now: I think I'm going to want to train on all of those, to try to work out some kind of metric (dollars per megatoken?) to compare them. But let's start with something reasonably low-end -- in fact, let's try the cheapest, and see what happens. Spin one up, and first thing; after the setup, we need to work out the micro-batch size. Last time we used 28, but this machine has GPUs with half as much VRAM. I did a binary chop again... it turns out to be 13. Now let's think about validation frequency. Let's try to get a feel for how long it will take. We can set the eval batches to (say) 100, so that we can see how fast evals are, but also set the interval to 10,000,000 so that it never does one after the first. It took 11 seconds to run 100 validation batches, and after a few minutes, it settles down at 254,000 tokens/second or so, and is estimating 3h15m to completion. Nice! The cards are an earlier generation to the H100s we used in the two-GPU test, so they're slower, and they have half the VRAM. So eight of them are, working together, about twice as fast as two H100s. Doesn't sound completely crazy. So, in our local train, we spent 5 minutes evaluating every 30 minutes. So our eval time was 16% of our train time. Probably a bit high, but let's run with it. If we're going to take 3 hours training time, then 16% of that is about 28 minutes. Previously we did about 88 evals (44 hours train time, with an eval after each half hour). That seems a bit too high. So let's say that we want to do 50 evals. 28 minutes eval time in total, with 50 of them, means about 30 seconds per eval. If 100 eval batches take 11 seconds, let's approximate it to 300 eval batches. As to the interval between them -- if we want to do 50 over 3h15m, or 195 minutes, then that's one every (let's approximate) 4 minutes. We seem to have settled down to 2.57 iterations per second, so that's about every 617 iterations. Let's bake those in and let it rip. After the run: OK, let's download everything. Looking at the checkpoints, the latest (that is, the last one at the end of the training) and best (the checkpoint that had the lowest validation loss) are the same one, meaning that validation loss kept falling consistently: So let's just download using the "best" symlink to get the weights for that checkpoint: And now we can shut the cloud machine down. Now that the clock is no longer ticking and we aren't spending money on an unused machine, here's the training chart: It looks like we had a couple of gradient spikes there. I'm going to add some gradient clipping code at some point, but I think I'll hold off for a little bit -- I want to do a few cloud trains first to work out the best instance sizes to use, and only then start exploring the possibilities for making the models better. Apart from that, it looks pretty normal. Looking at the billing page on Lambda Labs, that machine was up for about 4 hours and 35 minutes, costing US$10.32 per hour, for a total cost of US$47.35. 
Of that 4h35m, 13,904 seconds, or 3h52m, was the actual training run -- somewhat more than the 3h15m that was predicted at the start of the run. The validation will have accounted for most of that difference -- we did 50 evals, at 30 seconds each, so that's 25 minutes. That means that 3h40m is accounted for, and the remainder can just be chalked up to noise, I guess. That leads to one question: do we actually need to be doing validation for these trains? I've been doing validation loops in these trains largely out of habit -- when you're training an ML model, it's just "what you do". The reason you'd normally hold out a validation set is simple: if you're training over multiple epochs, then eventually your model is going to start overfitting to the training data 2 . You validate as you go along so that you can spot any points where, while the training loss continues to drop, the validation loss -- which is loss on data that the model hasn't been trained on -- starts rising. That's the classic indicator of overfitting. But for these models we're not doing multiple epochs -- we're just training through a stream of constantly new tokens. So, in fact, there's no real difference between the training data and the validation data, apart from the fact that the validation data is constant. From the model's perspective, it's all new stuff (modulo any repetitions in the dataset, which is possible but I think not likely to be super-common in something as curated as FineWeb). Now, in this post I'm aiming to identify the best options for training in the cloud -- cost in terms of dollars and time. I don't want to change the model itself or the training strategy because I want whatever I come up with to be roughly equivalent to the models I trained on my own machine. Exploring enhancements is for the next post. (Of course, given that the batch size is one of the levers I want to experiment with, and training on larger machines already means that I'm doing micro-batches larger than the batch size of 6 that I used locally, and then the overall batches are 8 times larger, that's not quite true.) Validation, however, doesn't actually affect the training runs in any direct way. I could in theory remove it. However, that is a relatively large change to the code, as I've kind of linked it in with my checkpointing code. I think that what I'll do for now is leave it in. Validation will scale at the same rate as training (so long as I leave the eval batches constant), so leaving it there will give me a clean comparison between machine types. And I can keep notes on how much time was spent on validation for each train so that I can subtract it from the total time if that proves useful. However, when I start tweaking the training code with changes beyond the batch size, I should probably try removing validation first. Anyway, while validation during the training run might not be important, evaluating the model at the end and seeing how it compares to others is! Let's do that next. There were two important post-train evals that I did on the models that I trained locally: There was also a simple smoke test -- how does the model predict that the phrase ...should continue? I should do the same three tests here. A simple autoregressive generation script is easy enough to knock together, and: All we're looking for here is basic coherency, and I think this is good enough to pass that filter. Next, the loss-style testing. What I think I want to be able to do here is just take a file and run an eval against a standard dataset.
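Going back to the smoke test for a second: the generation loop behind it is the standard sample-one-token-at-a-time pattern. The sketch below assumes a GPT-2-style model that maps a (batch, seq_len) tensor of token IDs to (batch, seq_len, vocab_size) logits, and uses the GPT-2 BPE tokeniser from tiktoken; none of this is the post's actual script.

```python
import tiktoken
import torch

enc = tiktoken.get_encoding("gpt2")

@torch.no_grad()
def generate(model, prompt, max_new_tokens=50, temperature=0.8, device="cuda"):
    model.eval()
    ids = torch.tensor([enc.encode(prompt)], device=device)
    for _ in range(max_new_tokens):
        logits = model(ids)                        # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :] / temperature
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return enc.decode(ids[0].tolist())
```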
I did not generate my own test set, but I did generate a much-larger-than-necessary eval set, 1% of both FineWeb and FineWeb-Edu -- that's 100 million tokens or so in both cases. In the validation that I was doing during the train just now, I did 300 batches of 1,024 tokens with a micro-batch size of 13. That only ran on the rank 0 process, so that's just under four million tokens. Not even 4% of the validation data. Now, for the local eval, I think it makes sense to make it run for about five minutes -- that's just for my own convenience, I don't want to spend very long -- and I know from the previous local train that I can do 3,200 batches of six 1,024-token sequences in that time: So, somewhat arbitrarily, let's use the 19,660,800 tokens starting at position 50,000,000 in the FineWeb validation dataset for our tests -- they'll never be used for training or validation during the training loop. It's kind of a hack, but it'll do for now. Here's the code . It should be easy enough to understand; it did require one tweak to our existing function, though: Originally, that function worked out the actual number of tokens to use by working out the size of each global batch, dividing our requested minimum number of tokens by that size and taking the floor, adding on one, then multiplying that by the global batch size. That works fine in cases where the is not a multiple of the global batch size -- it gives us a round number of batches that contains at least . But if is already a multiple of the global batch size, it gives us an extra batch at the end. So I added that as a special case in to avoid that. Anyway, running that gives us a loss: That's actually quite a lot lower than we were seeing with the locally-trained models on the test dataset I was using then -- but, of course, it's a different dataset so it's not strictly comparable. Let's run the same test against them: That's really interesting! Those numbers are really close to the numbers I got in the last post. That does make some kind of sense, though -- while the numbers aren't strictly comparable, as I said, both the dataset that I was using then and the one I'm using now are essentially random stuff from FineWeb, so I guess they must be more similar than I thought. But, importantly, the loss on the newly-trained model is much lower -- 3.674 rather than > 3.9 for all three of the older locally-trained models. Now, the only big difference between this training run and the ones that I did locally is the batch size. As I said in the last post, while I felt that the difference between my batch size of six and the (reported) batch size of 512 for the original GPT-2 was the least-likely cause of the differences in the results, Gemini told me that it thought it was the most likely cause. It looks like Gemini (and, I should note, people on Hacker News) might have been right! Batch size is super-important. Let's do the same eval with the OpenAI weights. I wrote a quick script (in my old 'LLM from scratch' repo, which has the code used in the book) to load up the GPT-2 weights and save them as a safetensors file . When I ran that, I got an interesting error: That was easy enough to fix; in the book's code we assign the weights that have been loaded from the OpenAI TensorFlow checkpoint files with a function called that looks like this: Just adding a call to to the last line fixed the error: ...and as a result, I had safetensors files for the original OpenAI models: So now we can run our test against them: Excellent.
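Back to that rounding tweak for a moment: the logic described there looks something like the sketch below. The function name and signature are illustrative rather than the real ones.

```python
# Round a requested minimum number of tokens up to a whole number of global
# batches, including the special case where it's already an exact multiple.
def tokens_to_use(min_tokens: int, micro_batch_size: int,
                  sequence_length: int, world_size: int) -> int:
    tokens_per_global_batch = micro_batch_size * sequence_length * world_size
    if min_tokens % tokens_per_global_batch == 0:
        # Already a round number of batches -- don't add an extra one.
        return min_tokens
    # Otherwise take the floor, then add one more batch so that we cover at
    # least min_tokens.
    whole_batches = min_tokens // tokens_per_global_batch
    return (whole_batches + 1) * tokens_per_global_batch
```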
Let's start putting together a table of these results: That's pretty amazing. Having micro-batches of 13 over eight GPUs, or a global batch size of 104, seems to have massively improved the model -- it's much closer to the original weights. It will be interesting to see whether I get further improvements when I move to the larger machines, which (due to having more VRAM) will have larger possible micro-batches, so we'll get larger global batch sizes. It certainly makes me think that I could have got much better results locally by using gradient accumulation, which would mimic the effects of a larger batch size by running multiple smaller batches through, without doing an optimiser step each time, then doing one big update once enough has gone through. But all of that is for another day. Let's try the instruction fine-tuning test now. I decided to pretty much re-use my adapted version of the code from the book; that meant that I was borrowing quite a lot of Raschka's code, which he has released under the Apache 2 license . I normally use the MIT license for my code, but I'm not married to it, so I relicensed the whole repo as Apache 2 with some specific headers to say which parts came from "Build a Large Language Model (from Scratch)", and added this code . It downloads the Alpaca dataset from the site for the book, splits it into train/validation/test splits, trains on the training set, evaluating each epoch and bailing out (and restoring the previous epoch's weights) when validation loss starts rising, then runs through the test set generating responses, and then sends them all off to the OpenAI API for GPT-5.1 to judge them. Running it against our new model gets a score of 17.09. Let's try the various other models and build out our table: Interesting! In the last run, I found the instruction fine-tune numbers came out as FineWeb-Edu extended > FineWeb > FineWeb-Edu, but here we have FineWeb-Edu > FineWeb > FineWeb-Edu extended -- exactly the opposite! I do have to wonder, though, how precise a measure this is. While the training should be fairly consistent (though I don't have a random seed in there to enforce it), the fact that we're using an LLM as a judge means that there is an element of randomness coming in here. Indeed, I re-ran the FineWeb-Edu extended test, just to see what I got, and it came up with an even-worse 12.12. So I don't think we can read a huge amount into these numbers -- well, unless we can get the numbers significantly up. While it looks like a 2.5-point difference might just be randomness, I doubt that a 10-point difference could be. I think we've done the tests that we need for this model now, and we have a testing procedure in place. So let's train some further models on different instance sizes, and gather numbers. This is the biggest machine available on Lambda Labs right now, and is only sporadically available; one happens to be there now, so let's give it a go. First, we need to create the runs/8xb200m160 directory, initially with a that is a clone of the one I did for the last train, , then spin up the machine. As before, we need to log in, clone the repo, then in it run the script, run , and try to run the script: It crapped out because there was no datasets directory, which is an annoyance. We should create it if it doesn't exist. Create the directory, and run it again. It took a while to download the dataset, because every per-GPU process downloads it separately.
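As a quick aside, gradient accumulation is simple enough to sketch. This is the general pattern rather than anything from this repo's code:

```python
# Push several micro-batches through the model, accumulating gradients, and
# only step the optimiser once per "virtual" large batch.
import torch

def train_with_accumulation(model, optimizer, loader, accumulation_steps=16):
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        logits = model(x)                                  # (B, T, vocab)
        loss = loss_fn(logits.flatten(0, 1), y.flatten())
        # Scale so the accumulated gradient matches the average over the big
        # batch rather than the sum of the micro-batch gradients.
        (loss / accumulation_steps).backward()

        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```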
The download only took a minute or two, but it was a waste of time; I think we should only download it from the rank 0 process with some barriers to make the other processes pause. Next, we need to do a binary chop on the micro-batch size, starting with a low of 13 (which I know will be fine because it worked on the 40 GiB GPUs that we used last time), and a high of 100 (fairly random, just something I'm pretty sure will fail). While doing that, a few things are standing out, both to do with validation. When the script starts, it does one training iteration, then goes straight into validation. Then it starts the training run proper. However: We're going to need to work out some kind of fix for that, because it's taken me 17 minutes from spinning up the machine to getting a size for our micro-batches -- which happens to be 64. On a machine that costs US$39.92/hour, that's an expensive test! We'll look into that later. Anyway, a batch size of 64 is pretty neat, as with 8 GPUs, that means we have a global batch size of 512 -- exactly the same as in the original GPT-2 paper! So, let's kick off the train. It takes about 7 minutes to get to the first checkpoint, at which point it's averaging 801,221 tokens/second. That pattern repeats, and with about one minute to do the validation, we're spending about 12.5% of the time on this machine validating. Hmm. A further indication that we might want to remove the validation stuff if it's not adding any value. Eventually, it finishes: So, that's 1h9m50s. The final validation loss is not as good as the previous run on the 8x A100 40 GiB machine, where we got down to 3.675. Given that we're using the same validation dataset as the previous run, that's meaningful: this is not as good a model, it seems. Again, latest and best checkpoints are the same one: So we can download everything: ...and here's the training chart: OK, so that's smoother than the last one -- no loss spikes. Maybe the larger batch size smoothed them? Let's think a bit about the cost of this train. From Lambda Labs, we had that machine running for a little over 1h30m. At US$39.92/hour, the total cost was US$60.25. Yikes. So, knocking off the 1h10m or so for the train, we have 20m to allow for -- which matches up quite well to the 17 minutes of fiddling with batch sizes, and then 3 minutes to download all of the files. If this blog post isn't going to cost significantly more than it needs to, we need to get that down. Of the US$60.25, just over US$13 was spent on identifying the batch size. Only US$46.57 was spent on the train itself. We also did 11 validation runs as part of that; at a minute each, those cost US$7.32. So, excluding validation, we're below US$40 for the train. Now, let's run our tests. First, the smoke test: we get this: "...on all other website for..." is a bit rubbish. Still, on to the loss: That's in line with the training loss -- worse than the loss I got with the one trained on the smaller machine, with its corresponding smaller batch size, but still better than any of our local trains. Still interesting, though -- larger batches are not guaranteed to get better results. More investigation needed there! On to the instruction fine-tuning test. That gives us a score of 13.89 -- the worst that we've seen yet! I think I'll put together a full table including these results later; I want to try training on some other, differently sized machines first, and we can aggregate the results at the end.
But before we do that, let's make some changes to the scripts to fix some of those QoL issues we encountered in that last train. The first irritation was that it errored out saying that was not a directory when it didn't exist. The script takes a datasets directory as one of its command-line options, and it's reasonable that it checks that it really is a directory (rather than, say, a file or a symlink): ...but if it doesn't exist, it might as well create it first. Now, I could just put this before the check: ...but remember, this code is run by multiple processes -- so they could easily trip over a race condition here. What I want is to have just one of them do this; I've deemed the rank 0 process the "special" one for validation, printing the progress bar, and so on, so we may as well treat it that way here. But -- there's a difference! Rank zero is the one that should be printing stuff out, it's true. And right now, we only have one node participating in this train. But I do want to avoid simple errors that would make it hard to run multi-node in the future. Now, if we have multiple nodes, then each one will have its own filesystem (unless we're using NFS or something like that), so we'll need a separate "datasets" directory for all of them. What we want is to do these checks on one process on each node. Usefully, we have the variable that is defined earlier in , which is per-node. Again, let's imagine we have two nodes with two GPUs each. Node 0 might be running the processes with global rank 0 and 1, and node 1 might have global ranks 2 and 3. On node 0, the processes would have local ranks 0 and 1 respectively, but on node 1, they'd also be local ranks 0 and 1. So, the full code becomes this: Note the barrier; we don't want the other processes to check whether is a directory until the local rank 0 process has had a chance to create it. (Of course, if we were running this on a setup where all of the nodes shared a filesystem, it wouldn't work -- in that case we'd want to use the global rank that we can get from instead. But we can burn that bridge if we ever come to it ;-) Phew, that was a bit more work than I expected! But it sets us up nicely for the next QoL fix on my to-do list. I don't like the fact that every process downloaded the whole dataset. The actually handled it pretty gracefully -- none of the processes tripped over any of the others. Indeed, it looks like there was some kind of global queueing going on, so they downloaded it one after the other. But it did take time -- maybe a minute or two in total, and with the clock ticking on that ~US$40/hour machine, that felt a bit stress-inducing. So: I think it would be best to only do that from the rank 0 process as well. The code that downloads the dataset is just after the bit we've been looking at: ...and looks like this: Now, the docs for say that the parameter is: If provided, the downloaded files will be placed under this directory. ...and the return value is this: We happen to be passing in a object for , and we're not in mode -- it defaults to . So all we're doing by returning that wrapped in a object is a slightly indirect way of returning the path that we're passing in as . For tidiness, I really want to gate the call to in with the same rank stuff as we did for the directory creation. So, let's change the setup so that takes the path to the directory where we want this specific dataset to be, not the generic "all datasets" directory.
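A sketch of that local-rank-0-plus-barrier pattern is below. It assumes the per-node rank is available via the LOCAL_RANK environment variable (which torchrun sets); the real code may get hold of it differently, and the function name is illustrative.

```python
import os
import torch.distributed as dist

def ensure_datasets_dir(datasets_dir: str) -> None:
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # Only one process per node creates the directory...
    if local_rank == 0:
        os.makedirs(datasets_dir, exist_ok=True)

    # ...and everyone else waits here until it has had a chance to do so.
    if dist.is_initialized():
        dist.barrier()

    if not os.path.isdir(datasets_dir):
        raise RuntimeError(f"{datasets_dir} is not a directory")
```

The same gating works for the dataset download described next: local rank 0 does the download, and the other processes wait at a barrier and then find the files already on disk.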
Given that we're now passing the specific dataset's path into the download function, we don't need to return it: Now it's just a wrapper around a single call to , which I'm not entirely sure about (it's a code smell that I'm probably creating an unnecessary level of abstraction) but I think I'm happiest leaving it that way for now, as it does hide away a bit of messiness in the HF hub API. 3 That means that we can now combine the directory-checking logic that we fixed above with download-on-local-rank-zero-only code like this: Here's the updated code with those fixes. Now, let's move on to validation. I'm increasingly of the opinion that the validation steps are just adding to the cost without much in the way of benefit. Additionally, the validation is taking a different amount of time for each batch size, and happens a different number of times in each train -- remember, it's batches every global steps, and the batch size varies based on the micro-batch size, which is different for different amounts of GPU VRAM, and the total number of global steps in a train also varies based on the size of each batch. So that means that if we want to compare apples to apples in any final comparison of the time and money cost of training models on different kinds of Lambda Labs machines, we'll want to exclude the validation cost -- once we've settled on a machine type, we're going to want to fine-tune the validation size for that in much more detail than I have to date, assuming we don't drop it entirely. However: I'm loath to make such a fundamental change halfway through this comparison. It's tightly coupled to the checkpointing code, and the charting code, and so on. So I think that for this post, I'm just going to keep it there, and keep track of how much time (roughly) we're spending on each validation step for each train, so that we can remove it and get a "pure" train-time only comparison between the different kinds of machines. It's not pretty, but I think it's better than changing horses mid-stream. On the other hand, the validation is a real pain when doing the binary chop to find out the maximum micro-batch size for our VRAM before we start the training run. That's because we have to wait for one validation to run before we get into the full training loop, which makes it slower. On top of that, having to do a manual binary chop is a PITA. What I think would be a true QoL improvement for the future trains is something that does the binary chop for us, using a dummy training loop. We run it once on each new machine type, get a micro-batch size to plug into our training parameters, and then let it rip. This will re-use so much of the code from the training script that I think it actually is just an alternative way of running it. After a bit of hacking, I came up with this updated code -- the diff is a bit hairy, but essentially: That takes just over six seconds to find the correct batch size on my local machine; with multiple GPUs, I expect it will be slower (there's a spinup overhead to start all of the per-GPU processes), but I'm sure it won't be as bad as the manual binary chops with validation that I was doing, and will be less error-prone. Right! We've done some QoL stuff, let's try another machine size on Lambda Labs :-) These are the machines that Andrej Karpathy is recommending for training nanochat, so let's see how we do with them. They cost US$23.92/hour; let's see how it works out.
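Before moving on to that machine, here is roughly the shape of the automated batch-size search described above. try_short_train is a hypothetical stand-in for the real no-validation dummy training loop, and the rest is a sketch rather than the actual diff:

```python
import torch

def fits_in_memory(try_short_train, batch_size: int) -> bool:
    try:
        try_short_train(batch_size)     # a handful of steps, no validation
        return True
    except RuntimeError as e:
        # PyTorch raises a plain RuntimeError for CUDA OOMs, so we have to
        # look at the message to distinguish them from other failures.
        if "out of memory" in str(e).lower():
            torch.cuda.empty_cache()
            return False
        raise

def find_max_batch_size(try_short_train, low: int = 1, high: int = 70) -> int:
    # Assumes `low` fits and `high` doesn't; chop until they're adjacent.
    while high - low > 1:
        mid = (low + high) // 2
        if fits_in_memory(try_short_train, mid):
            low = mid
        else:
            high = mid
    return low
```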
Here are the steps: Now let's download our dataset and find our micro-batch size: That took less than a minute to run -- nice! Now we can put that micro-batch size in . It does seem a little small -- after all, we could fit a batch of 64 into 160 GiB -- but I'll do some analysis later. Actually, before we kick off the train, let's see how long all of the preparatory steps took to run before we can do that -- not just the micro-batch-size script, but also the installation of the dependencies, the clone, and any overhead from boot time etc: Five minutes total. Not bad. Let's start the train: The initial validation run took 38 seconds, and then we started off. At 4m37s in, we get the first real validation run; at that point, it's running at 493k tokens/second. Eventually, it finishes, having taken about 1h50 including all of the validations. Here's the training chart: Two things stand out here: Further evidence that gradient clipping is likely to be an excellent addition to our training loop! It's also worth noting that the train loss spikes at the same time as the validation loss, so getting rid of the latter would still allow us to get a "best" checkpoint to compare with the latest at the end of the train. The machine was up and running for 2h9m, costing US$23.92/hour, for a total cost of US$51.47. The train took 6,650.197 seconds, so about 1h50m. Allowing for five minutes setup time, that's 1h55m accounted for. There's an extra 14m there -- that was because downloading those two checkpoints to my machine took quite a long time due to local network issues. Might want to look into ways to avoid that later. And for later cost-accounting purposes, we should note that it took 38 seconds or so for each validation run, and we can see on the chart that there were 24 of them. So, firstly, let's give our two models -- the best one and the latest one -- a smoke test: Both of those look OK! Now let's try the loss test. I started running it, but when it started downloading the dataset, I realised that it needed updating to allow for the changes I made to -- ooops! That done, let's give it a run for both of our models: As you'd expect, the best checkpoint has somewhat better loss, at 3.725, than the last one, with 3.734. Once again, better than our local trains, but not quite as good as the result with the first cloud train on that 8x A100 40 GiB machine, which was 3.674. Again, I'll put together a table comparing all of these results at the end. Does that make any real difference with the instruction fine-tune test? The test prints a lot out, but the headline numbers: So that was interesting! However, I am getting ever less convinced that the IFT test is a useful one; the randomness of the LLM-as-a-judge responses means that I don't think it can be consistent. Perhaps a better way to do this would be to batch up all of the models, and then give GPT5.1 answers from "model A", "model B", and so on all in one query, and then to ask it to give them scores all at the same time. That would hopefully make things at least a bit more consistent. Something to ponder later, I think. In the meantime, one extra thing I wanted to dig into before going on to the last train for this post: I mentioned that I thought that the batch size for that last run, 27, was a bit small considering that we'd managed to fit a size of 64 into the 160 GiB/GPU machine. 
But after thinking about it for a bit, it occurs to me that during my experiments doing fine-tuning, I came to the conclusion that memory use scaled linearly with batch size , with a fixed amount per element in the batch (the activations for the model for that batch element), plus an overhead (the model itself, the optimiser, and perhaps other stuff). We have batch sizes for: Now, that is slightly messy data because each memory "measurement" is the size of the card's VRAM, not the amount of VRAM we actually used -- there might have been anything from zero to just less than one extra batch element's worth of "spare" space -- but we can see what we get with a simple linear regression: And if we plot that, we get this: Nice! That fits really well. So we have an overhead of about 11.5 GiB, then about 2.35 GiB per batch element on top of that. That is, of course, somewhat sad news for anyone trying to repro this on a GPU with 12 GiB -- looks like it would be just too small to even fit in a single-element batch after the overhead :-( Anyway, that's been a bit of a side quest. Let's try our last machine size for what has (once again) turned into a bit of a monster of a blog post... This is the same kind of instance as the first train in this post, except that it has double the VRAM per GPU. Let's see what we can do with it. Once again, we create the run file, commit and push, then spin up the machine. On it, we clone the repo, run then . Next, we can find our micro-batch size: Interesting, we managed to squeeze an extra one in compared to the H100's batch size of 27, despite having exactly the same amount of VRAM! Not sure what might have caused that. It took 4 minutes to get to this point, so let's get that batch size into the config and kick off the run. The initial validation takes 1m06s, which is consistent throughout the train. The first real val run at 8m15s in, and the estimated train time is 2h35m, with a tokens-per-second of 286,188. At the end: Again, the latest and the best global steps are the same (despite some loss spikes): ...so we just need to download that and shut down the machine. How much did that cost us? The machine was running for 3h25m, costing US$14.32 / hour, for a total of US$48.76. Our train took 11,532 seconds, which is 3h12m, and our setup took about 4 minutes -- maybe five including the time required to update the train config with the micro-batch size, so we have 7 minutes on top of that, which is about the amount of time it took to download the model. Let's run some evals! Our smoke test gives us this: Coherent enough, I think! Now the loss on our test dataset; it comes out as 3.730, so pretty similar to our other cloud trains, apart from the oddly-low one on the 40 GiB GPUs. Now let's see what GPT-5.1 thinks of the instruction fine-tuned version. It only needs two epochs of fine-tuning, and believes that "The author of 'Pride and Prejudice' is 'Pride and Prejudice'", which is not promising, and gets a score in the same kind of range as the other models, 11.71. So: we've trained four models on four different machine sizes. Let's see how they stack up against each other, against our locally-trained models, and the original OpenAI GPT-2 weights. So, I've trained four of my 163M-parameter GPT-2 models, using almost exactly the same dataset -- the Chinchilla-optimal number of tokens, rounded up to make an even number of batches. 
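Incidentally, the VRAM fit from that side quest is easy to reproduce with NumPy, using the micro-batch sizes from this series and the VRAM of the GPUs they ran on (6 on 24 GiB, 13 on 40 GiB, 27 on 80 GiB, 64 on 160 GiB):

```python
import numpy as np

batch_sizes = np.array([6, 13, 27, 64])    # 3090, A100 40GiB, H100 80GiB, B200
vram_gib = np.array([24, 40, 80, 160])

per_element, overhead = np.polyfit(batch_sizes, vram_gib, 1)
print(f"~{overhead:.1f} GiB overhead, ~{per_element:.2f} GiB per batch element")
# -> roughly 11.5 GiB of overhead and 2.35 GiB per element, as above
```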
I trained these four models on four different multi-GPU machines on Lambda Labs: I've done some evals on each of the models, so let's put those results together in one table -- results for the trains in this blog post, alongside those for the original OpenAI GPT-2 weights, both small and medium, and for the models I got when training locally. For all models, I've provided: I've sorted the models in order of increasing loss on the test set -- so, the best model by that measure is first. The instruction fine-tune results are kind of all over the place, and I'll look into that later 5 . For now, let's focus on the test loss. We have a pretty clear pattern, where the local trains are grouped together at around 4.0, and the cloud trains at around 3.7. For the local trains, as I noticed last time around, FineWeb is counter-intuitively better than FineWeb-Edu. There are two interesting things about the cloud trains: I think that what we're seeing here is that larger batches are better, but only up to a point. It's as if there's some kind of curve like this: I got that by taking the log of the batch size, then asking NumPy to do a polynomial regression -- that is, work out a, b and c so that the formula ...fits it as well as possible: It's kind of interesting that it's such a good fit with such an ad-hoc formula! We have a nice smooth curve hitting almost all of the points, and our optimal batch size looks like it's just a little below that 104 we managed with the smaller cloud machine, at about 97. But it's certainly not something that I'd like to read too much into. Best to treat it as purely illustrative: "it might be something like this". I think digging into that might be an interesting experiment at some later point. A bit of checking around the Internet (and a chat with ChatGPT) suggests that it's something people have looked into in some detail, unsurprisingly. An interesting point ChatGPT raised is that with our pretty much fixed "budget" of tokens -- we're always training on something close to the Chinchilla-optimal number -- then a larger batch size means that we're doing fewer optimiser steps. Intuitively, that sounds like a problem. The larger batches mean that each move across the loss landscape is "better", or at least more stable. But we're doing fewer of those moves over the course of the train. There's obviously a tension between those two. You can imagine a degenerate case where the batch is so large you can fit the entire run into one iteration, so you do just one update of the parameters; that obviously wouldn't work very well. Anyway, for the purposes of this post, let's flag it as interesting and move on. Let's take a look at costs. Here's another table for those -- for each cloud model, I've listed: What do these numbers tell us, given what we were trying to do here? Like I said at the start, this was a pretty expensive learning experience: I wound up spending US$215.16 on Lambda Labs instances over the course of putting this all together. But it was worth it! At the start of this post (if you can remember so far back), I said I wanted to achieve two things: Yes, absolutely. The trains I did, if we exclude the validation time, each cost between US$35.56 and US$39.14. In time, also excluding validation, the slowest ran for about 3h25m, and the fastest just less than an hour.
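For the curious, the curve-fitting itself is only a couple of lines of NumPy. The formula isn't reproduced above, but with three coefficients and the log of the batch size it is presumably a quadratic in log(batch size); the sketch below bakes no data in, so pass in the global batch sizes and the measured test losses from the table:

```python
# Fit loss ~= a*log(B)^2 + b*log(B) + c and read off where the curve bottoms
# out. Illustrative only -- the post's exact formula and data aren't shown.
import numpy as np

def fit_loss_vs_batch_size(global_batch_sizes, test_losses):
    log_b = np.log(np.asarray(global_batch_sizes, dtype=float))
    a, b, c = np.polyfit(log_b, np.asarray(test_losses, dtype=float), 2)
    # A quadratic a*x^2 + b*x + c has its minimum at x = -b / (2a), so the
    # batch size at the bottom of the fitted curve is exp of that.
    optimal_batch_size = float(np.exp(-b / (2 * a)))
    return (a, b, c), optimal_batch_size
```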
Now, in a future post I want to try making the changes that I listed at the end of my last post to see if I can get the loss lower: If I'm to do those, what I'll need to do is start with a baseline train on one particular size of machine, and then try introducing each change separately to see what happens to loss. I'll want to use a fixed seed for random number generation, so that I start with the same initial weights each time. Given what these experiments have already shown about loss -- that the smallest, cheapest machine has better loss than the other more expensive ones due to what I assume is the batch size -- then that actually feels like exactly the right machine to choose for this. It does take a while to train anything, but three and a half hours is pretty acceptable, I think -- I can do a train or two per day. An 8x A100 with 40 GiB VRAM per GPU is the way forward. So: next steps. I want to: This is going to be fun. Stay tuned! I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit . The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.  ↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩ I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right. DataParallel (DP). With this: The default GPU (normally ) is in charge of the process. It gets a batch of data, divides it up into per-GPU "micro-batches", and sends each of those to a thread for each of the other GPUs. It then sends an up-to-date version of the model to each GPU. Next, all of the per-GPU threads do a forward pass on their replica using their specific micro-batch, and send their outputs to the thread for the default GPU. The default GPU thread aggregates all of those outputs (similarly to how the losses across all of our batches and the prefix sequences are aggregated in the normal single-GPU case ) to work out an overall loss. It then does a backward pass. This will start on the default GPU, as the aggregation step is the first thing that it will come to when going backwards through the steps that came up with that overall loss. However, it will then come to operations that happened on the other GPUs and those are (somehow) parallelised. Once that is done, each GPU has gradients that represent how their copies of the model contributed to the overall loss. 
Finally, they send those gradients back to the default GPU, which combines them (I think of this as just being an average, though I gather it's more complex) and applies them, producing an updated model. Then the process repeats; the updated model on the default GPU will be sent to the other GPUs in the second step of the next iteration. DistributedDataParallel (DDP). This does less work on the default GPU and does less copying around. Each GPU has its own process (rather than thread), and is essentially responsible for its own training loop. Right at the very start, the default GPU's process sends the model to all of the others. Then all processes go into their training loop: Firstly, each one works out its own micro-batch (which means you need to have code to make sure that the datasets are properly split across the GPUs) Each model does its own forward pass, then its own backward pass, working out its own independent gradients. As it comes up with those gradients, it broadcasts them to a "reducer", which handles the aggregation. This is done in a distributed way -- there's not just one reducer handling everything. When all models have completed the backward pass, the reducer has a set of combined gradients, which is visible from the per-GPU processes. Each GPU process does its own optimizer step using those combined gradients. That means that there's no model copy required -- each GPU has applied the same gradient update, so they already have in-sync models, assuming everything went well. ZeRO. This is a much more complex system, and I went into how it works in this blog post . , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) = 0 for the process with rank 0 = 1 for the process with rank 1 = 7 for the process with rank 7 = 8 for the process with rank 0 = 9 for the process with rank 1 = 15 for the process with rank 7 Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack" But if is set to 2, which it happened to be in my case, then it will silently fail -- our first eval loop will get the first X from the validation set as , and the second X as . Zoom through the records in the dataset in batches of 1,000. For each batch: Tokenising each batch, so we get a list of lists of tokens. Convert that list of lists into a single list tokens separating each item. Convert that list into a PyTorch tensor. Add the tensor to a list. After that's all done, use to convert the list into a single tensor, and then save that with . I can upload the datasets to Hugging Face; their network connection will be better than mine, so I can just pay the price in time of uploading everything from home once, and then I can download them faster from HF to LL. 
That also has the benefit of meaning that after this experiment I can safely delete the local files, but then download them again if I need them. And if anyone else wants to repro this experiment, the data will be easily available to them. Lambda Labs have persistent filesystems that you can use. They cost $0.20/GB/month, so that would be about $5/month for all of my datasets. So I could upload the data to a cheap instance with a persistent filesystem mounted, shut down that instance but keep the filesystem, and then mount it on each machine I use to run tests. . The world size -- that is, how many per-GPU processes are we running? The micro-batch size The sequence length An 8x B200, with 160 GiB per GPU, at $39.92/hour An 8x H100, with 80 GiB per GPU, at $23.92/hour An 8x A100, with 80 GiB per GPU, at $14.32/hour An 8x A100, with 40 GiB per GPU, at $10.32/hour The loss they got on the validation set from the first train. Strictly speaking, I was kind of cheating and using that as a test set. The score given by the OpenAI GPT 5.1 model for an instruction-following dataset. This was the one provided in the book -- an Alpaca-style Q&A dataset, with a well-defined train and test set. Each model was fine-tuned on a training set of 85% of the data until loss on a validation set of 5% of the data started rising, and then tested on the remaining 10%. Sebastian Raschka, being a pro, was splitting up the data properly :-) If we're going to do validation then it does make some sense to do one at the start -- but doing one training iteration first seems kind of arbitrary (though it's clear how that drops out of the existing code). The validation runs on this machine are taking longer than they were on the less-powerful A100 GPUs! That confused me for a bit, until I realised that I didn't notice that it was slower with the batch-size 13 test, only with the larger ones later in in the binary chop. If we're using larger batches, then there's more work to do for the validation. Doing this binary chop by hand is annoying and error-prone, and worse, we have to wait for one of those (long) validation runs before we get into proper training. The initial training iteration can succeed, while later ones hit memory limits -- it seems like we need to wait for three or four training iterations before we can be sure that we have a workable batch size. Not quite sure why that is, perhaps it's something in the optimiser or the scaler? If : Local snapshot path. If : A list of DryRunFileInfo objects containing download information. I updated the function so that it takes flags to tell it whether or not to do validation (default true) and an optional maximum number of steps, which is by default. With those default values, it does exactly the same as before, of course. I created a function, which does all of the dataset-loading stuff that the original function did, and then calls with a -wrapped model. So that maintains the current flow. Next, I added a flag to the script; if that's not set, it just calls . However, if it is set, it instead calls a new function, which determines the largest batch size we can fit onto the current hardware for the current run, and (on the rank 0 process only, to avoid log spam), prints it out. does what it says on the tin; it confirms that we can train with batch size of 1, and that we can't with batch size 70 (chosen because the limit was 64 on that massive B200 machine), then chops between them to find the largest batch size that doesn't OOM. 
It uses for that -- that just constructs a dataset with the appropriate batch size, then runs a three-step train with no validation to see if it raises an OOM. PyTorch rather messily just raises a generic for those, but we can look inside the exception's message to see if it is an OOM. Create the run file, commit and push. Spin up the machine. On it: Clone the repo We had two nasty loss spikes. As a result of the second of those, the best iteration as per validation loss is not the last one. Best checkpoint: 4 epochs of fine-tuning, and a score of 11.98 -- another record low! Amusingly, it confidently said "The author of 'Pride and Prejudice' is Sarah Palin". Latest checkpoint: 5 epochs of fine-tuning, and a rather good score of 17.91. 24 GiB locally, which was 6 40 GiB in the first train in this series, which was 13 80 GiB in the last one, giving us 27 160 GiB in the one on the huge machine, giving us 64 An 8x A100 40 GiB An 8x A100 80 GiB An 8x H100 80 GiB An 8x B200 160 GiB The loss on my test set. The results it got on an instruction fine-tune test based on Sebastian Raschka's. The global batch size (that is, for single GPU runs, just the batch size, but for the multi-GPU ones, where each batch is made up of per-GPU micro-batches, the per-GPU batch size times the number of GPUs). 4 They're all consistently better than the local ones. The one on the smaller machine is better than the ones on the larger ones; indeed, it looks like the larger the machine, the worse. How long the training run took. How much the machine cost per hour. How much the training run cost. How much of that was doing validation (which I'm now thinking is pointless on single-epoch trains like this). How much it would have cost, and how long it would have taken if it had been run without validation. I wanted to learn how to change a simple single-GPU training loop to make it multi-GPU. Could I get the training time for a full base model down from 48 hours to something more manageable -- and, hopefully, not too expensive? Removing dropout Tweaking the learning rate (and maybe adding the warmup and cosine learning-rate decay stuff I've read about). Reverting the architectural differences between our model and the original GPT-2: reintroducing weight tying between the token embeddings and the final linear layer, and also bias in the attention weights. Trying full-fat 32-bit precision. Fixing the exploding gradients issue with gradient clipping. Dig in to the instruction fine-tuning tests a little more -- as I've said above, I'm not 100% happy with how comparable it really is between models, at least given how I've been running it so far. Upload the models we have to Hugging Face. I have a new motherboard ready for my PC, and replacing the old one has a risk that I might mess up and break the NVMe drive I have them stored on. I was holding off on this because it would mean sharing Raschka's GPT code, but having noticed that he's already licensed it all under the Apache license, I can release them under the same one. Strip out the validation stuff. We can use training loss to track our progress, and losing evals during the train will help keep the cost down. Finally, do the trains to see how each of the levers above affects loss.

A Room of My Own 1 months ago

My Digital Workflow (Jan 2026 Edition)

My digital workflow has evolved quite a bit—really, it’s simplified a lot. I use far fewer apps now. My last post on this was in October 2024 (edited in Mar 2025), before I adopted Bear .  Since then, Bear has launched a web app beta , which means I can access my notes anywhere—especially at work, where we’re PC all the way. https://spasic.me/posts/a-digital-workflow-to-run-my-life I also just posted My App Defaults (as of Jan 2026) . (I struggle with this, so I had to write it down for myself and I’m genuinely getting better at following these rules.) Limit the amount of information I take in and process. Just because I can capture everything doesn’t mean I should. Don’t rush to save every interesting idea; if it’s truly important, it’ll come back to me. Be selective about what I consume, especially online. Avoid organizing and exploring new tools. Focus on capturing my own thoughts and ideas and summarizing concepts in my own words. Don’t save everything—let things go. Write, write, write (don’t just consume - create) My “best practice” (but fluid) workflows for processing ideas, information, interests, documentation. Main Documents storage and backup Photographs, videos and their backup All current documents  (all documentation and scans, ebooks, writing, anything that would go into a computer hard drive) backup of old, unused documents and mementoes (notes apps backups, old word doc backups, old work doc backups, old email backups, various mementos) Photographs from my phone upload directly into Dropbox (although for permanent storage I upload manually at the end of the month and delete this automatic backup when I don’t need it anymore) - I have a separate post on how I manage my Memory Keeping and Photographs I use Dropbox’s  “selective sync”  on my laptops (I only sync folders that I currently use) Cost: 120 USD a year What used to live across three or four different apps now lives almost entirely in Bear . Bear is my: Central hub for personal projects and current activities , where I store: Tasks and goals Quarterly and monthly plans Narratives and ongoing notes Central hub for admin and resources
 (attachments mostly live in Dropbox and are linked back to Bear) : Personal information (some password-protected) Frequently accessed info (school/work details, admin notes, medical info, various records - anything I need to look up occasionally) Resources such as links, apps, wishlists, recipes, travel info, etc. Commonplace book and thinking space , where I: Make notes on topics I care about Store notes on topics of interest (old and new) Collect ideas, concepts, and connections in a non-linear way Think freely and explore without structure getting in the way Store my writing (essays, blog posts, stories) Jot things down at random and on the go Dump ideas and brainstorm Cost: 50 NZD a year A single source of truth for all my journaling and mementos. Digital Journal/Diary: My personal journal and diary (with photos). I use it daily most of the time. Mementos and Memory Keeping: This includes text notes, screenshots, random photos, voice messages, things my kids said, audio, video, photos, messages, locations, screenshots, various stats, anything I want to preserve. Logs of books , movies and TV and quotes/wisdom, etc. Some of this is still a work in progress; I am exploring it and it is constantly evolving. I back up Day One periodically in both JSON and PDF formats and store the backups in my Archive folder in Dropbox as well as on external hard drives. Note: I’m considering moving some of my personal journal entries into Bear for reference and long-term safekeeping, but I haven’t decided yet. Cost: 50 NZD per year I’ve written more about my current Trello setup. I organize my lists using the Eisenhower Matrix , along with a backlog for things I want to clear from my mind but may never actually do. All my specific To Dos and projects/tasks Anything that has a date but doesn’t belong in Google Calendar
If it doesn’t need to happen on a specific day, it’s a task rather than an event, so it lives in Trello. Recurring tasks and reminders 
Things like document expiry dates, subscriptions, and periodic check-ins. Small personal projects 
Ideas I’d like to get to at some point, but that don’t need active scheduling yet. Trello is also great for on-the-go capture: I can email tasks directly into Trello On my phone for quick to-dos and relevant info (goes straight into the inbox widget) The email reminders are a bonus Dabble Writer Long-form writing. Novel in progress Memoir (snippets and fragments) After months of research I have settled on Dabble Writer to replace Scrivener for my long-form writing. While I loved Scrivener, I needed something that syncs seamlessly across multiple computers (and on the go) without requiring downloads or worrying about syncing my work. I hope to write about Dabble Writer in another post. COST: One-time purchase (subscription options available too) Kindle and article highlights Article dump for things I might read later (or delete if I don’t). Archive articles only if they’re genuinely worth keeping. Sends articles directly to my Kindle, which is where I prefer to read them. Cost: $40/year I know I could use Readwise Reader for RSS, but it doesn’t feel as casual or as easy to process as Feeder. I subscribe to a lot of personal blogs, and while I don’t read everything all the time, Feeder lets me quickly scan and dip into whatever catches my eye. It feels nice and low-pressure.
Take it or leave it. And it’s free.
All appointments and events; important dates and birthdays; recurring events (like my yoga classes, kids’ sports, group meetings I regularly attend, etc.); syncing with my husband’s and son’s calendars.
Cost: Free.
My main personal email account since March 2000. I also use it as a kind of archive—emails are such an overlooked record of life and work. NOTE: I do use Gmail for Chrome, YouTube, and similar things, but I genuinely prefer Yahoo to Gmail, even though Google Calendar is my main calendar (am I the only one?). I use Inbox Zero across all my email accounts.
Related posts: How I Finally Settled on Bear for My Notes; My One-Board Trello Task Management System; How I Use Day One to Track What I Read; I Journaled My TV and Movie Watching for a Year; Why Did I Wait So Long to Start Using Day One?; A Digital Workflow to Run My Life; The Eisenhower Matrix I Forgot About (But Still Followed); My App Defaults (Jan 2026 Edition); My App Defaults (Mar 2025 Edition); A Digital Workflow to Run My Life (Mar 2025 Edition)

1 views
The Tymscar Blog 2 months ago

Automating What Backblaze Lifecycle Rules Don't Do Instantly

I recently moved from Synology to TrueNAS and set up cloud backups to Backblaze B2. I have two buckets: one for important files like documents, and one for homelab services. The services bucket backs up things like qcow2 disk images for my VMs, some of which are hundreds of gigabytes. When I created the buckets, I set the lifecycle rule to “Keep only the last version of the file.” I assumed this meant Backblaze would automatically replace old versions when new ones arrived. It doesn’t work that way.
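If you want old versions gone immediately rather than on Backblaze’s own schedule, a small script can walk the bucket and delete every non-current version itself. Here is a minimal sketch using the b2sdk Python package (my own illustration, not the author’s script; the bucket name and credentials are placeholders):

```python
"""Delete every superseded version of each file in a B2 bucket right away."""
from collections import defaultdict

from b2sdk.v2 import B2Api, InMemoryAccountInfo

api = B2Api(InMemoryAccountInfo())
api.authorize_account("production", "KEY_ID", "APPLICATION_KEY")  # placeholder credentials
bucket = api.get_bucket_by_name("homelab-services")  # hypothetical bucket name

# Group every stored version by file name.
versions = defaultdict(list)
for file_version, _folder in bucket.ls(latest_only=False, recursive=True):
    versions[file_version.file_name].append(file_version)

# Keep only the newest version of each file and delete the rest immediately.
for name, file_versions in versions.items():
    file_versions.sort(key=lambda v: v.upload_timestamp, reverse=True)
    for old in file_versions[1:]:
        api.delete_file_version(old.id_, old.file_name)
```

Run against a bucket full of multi-hundred-gigabyte qcow2 images, something like this would reclaim the space held by superseded versions right away instead of waiting for the next lifecycle pass.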

0 views
ava's blog 2 months ago

trusting the little guys: issues with 'big tech' alternatives

A while ago, my brother-in-law asked around friends and family if anyone wanted to join the (private) cloud/file service he spun up. Practical, right? Many outside the corporate web believe in smaller services within friend groups, families, and local organizations as the way forward. Instead of trusting big companies who could (or rather, will) enshittify and become too big and bloated (Google, Meta, Microsoft...), we should trust smaller maintainers within our circles. The offer made me ponder what I would upload to the file service, and how much I would trust my brother-in-law with the files. Not just the integrity, but the uptime, the availability when issues arise, how swiftly severe bugs or security issues would be patched, and the uncomfortable question about confidentiality: Should I only upload files I don't mind him seeing, or should I trust that he wouldn't look at them? [1]
That made me think: How much do we trust alternatives to big tech? When we host our various things like emails, image backups, blogs, social media accounts etc. with these big companies, a certain professionalism is expected. You're dealing with a corporate entity, so you probably have the following expectations:
I'm a consumer, and I have consumer rights against this corporation. I don't feel bad about potentially suing them, because I'm suing the company, not one individual.
While messaging their support, they (or nowadays, their AI chatbot?) keep it professional and are available in a reasonable time.
I can lodge complaints and expect a fix fairly fast, and downtime is usually resolved within an hour or few.
No one person or one department has access to absolutely everything, and especially not unchecked. Lots of eyes, control mechanisms, logs, and separation; limited rights and access on a need-to-know basis.
People working there get paid for this, which affects how they treat the service or what they cannot risk doing. There are internal consequences for non-compliance, and there are internal workflows on how to deal with specific cases the same every time, even just deletion requests or requests for personal information.
There are far too many employees, and far too many users; why should I, of all people, be interesting enough to have my privacy violated by an employee?
I have a right to request my data, and a right to data portability.
Due to financial interest in keeping the company going, they're future-proofing.
All of these (whether they are actually realistic and enforceable or not) can give us a sense of security. A cold, sterile business relationship, like the one to our water provider. If we want to switch away from these data-harvesting giants to smaller solutions, we are confronted with the fact that usually, it's a small group of people, or even just one person. Some try to build up a smaller service professionally, but many just do it on the side, as a hobby. A Mastodon or PixelFed instance, another social media alternative, or media sharing. That poses some challenges and questions for the average user:
Do I still have consumer rights, even just rights like the GDPR, or not?
Would I be comfortable pursuing this person legally if shit went sideways or they abuse or leak my data? Can I even, if this is just an internet stranger with a nickname and an email address?
Can I expect a professional relationship about this service? If this is just done on the side as a hobby or experiment, will the person actually continue it after the first few weeks? What do I do if I lose this?
Will they have the time and energy to continue to update it and care for it, and keep my data safe?
If I need a quick fix or tech help, would they be able to respond in a timely manner? Depending on what it is, it might be urgent.
Can I trust this person not to abuse their admin power and look into everything? Even if it's SFW, maybe I wouldn't want a stranger to click through my image files (... and use them for AI training or to make deepfake nudes I don't know about?).
Is data portability a thing at all with their service? Can I export the data in any meaningful and useful way?
Does the maintainer do any sort of future-proofing?
These concerns make smaller services feel less reliable and trustworthy. A big corporation can (and will) obviously mess up as well, and the data breaches and downtimes are a lot more impactful, but: the roles are clear, legal identities are divulged publicly if needed (like their data protection officer!), and someone is responsible for an issue. With a small group of strangers or even just one person online that you don't know, this is more opaque and there are not necessarily any consequences, quality control, workflows or customer service. There is often not even a real name offered that you can use for any sort of complaint or legal action.
I think I might have talked about this in another blog post or alluded to it, but there is a creator of a variety of indie web services that just refuses to delete my accounts since at least 2023. It started with just one I wanted gone, but nowadays I want all of them gone. After multiple fruitless attempts at asking for deletion via email and having no full account deletion in the settings page, I filed an official complaint at the Data Protection Authority responsible for my area. Unfortunately, they were almost entirely useless, because as long as I do not have the full legal name of the person behind all those services, they say they cannot do anything. These fossils do not want to send out an email reprimanding them for being non-compliant despite processing an EU citizen's data and even taking money for it; they insist on sending an actual letter to the person's residence and don't want to put effort into getting that address from the hoster.
Their feedback ended with the great advice that next time, I shouldn't sign up to websites that don't have a privacy policy, proper account deletion process, or a responsible person named. Well, geez, wish I could time travel and tell 2021 me that, who had rose-tinted glasses about indie web alternatives. Nowadays, I indeed don't sign up, and I make sure to remind every project I see that necessitates user accounts to please fulfill at least the PP and the deletion process. I know I cannot make any of them share their full name if they don't want to. Being better than the big players doesn't just involve not doing the excessive data harvesting they do, but also handling the little bit of data you get with care, having processes in place that make dealing with user data easier and give a lot of control to the user, and, ideally, letting them know who they're dealing with.
And that's where it really differs from case to case, because at Bearblog, I am really happy with how things are and have turned out so far, despite it only being one person. It is professional, I get amazing customer support, I know the legal identity, and I can find out exactly how data is collected and processed. Plus: there is an account deletion I can initiate on my own without having to message someone and hope for the best. For comparison, it took Cohost (which was run by a small group of people) about 4 months or so to delete my account, which I had to request via email, and it took someone I know over a year. That means constantly checking back in on whether the deletion has gone through and the profile is still up, and that is not only annoying, but it can also threaten the safety of people who get found by stalkers, family members and others. Some of these things are time-sensitive, and it's irresponsible and non-compliant not to have a better system in place.
Strangers are simply hit or miss. Could be a creep that reads all your DMs to other people on the instance, or not. What about a friend? If your friendship breaks apart, do you lose the service and the data accumulated on there? If it's a family member and something really bad happens with your data and account, do you want to risk the family peace by holding them accountable? Honestly, no one wants to set up a formal contract for something like this as it feels silly, and many won't. So what basis do you have? If you are lucky, the indie project you want to use has open-source code, transparent incident logs, and community reviews and PRs that serve as proxies for professionalism and quality control, but in my view, that is rather uncommon.
I don't want to badmouth smaller alternatives, as I am still a big fan of them and rely on them. I just want to discuss these fears and risks, and some of my good and bad experiences. I want them to thrive and do better in these topics. Trust sadly isn't purely rational, and familiarity, perceived competence, contracts, incentives, and consequences play important roles.
Published 22 Dec, 2025
[1] For the record, I trust him not to look at them, but it's still a thought I had, since I never had to decide that before.

3 views
Jeremy Daly 2 months ago

We’re Already Living in a Post-Serverless World

This isn’t really about serverless. It’s about what happens when infrastructure stops asking humans to guess the future.

0 views
Nelson Figueroa 2 months ago

GitHub Actions for Pulumi with an AWS S3 Backend

This is a quick guide to set up GitHub Actions for Pulumi with an AWS S3 Backend. There are some differences compared to running commands locally. This guide assumes you have the following:
An AWS S3 Bucket created and ready to be used with Pulumi
An IAM User that has permissions to read/write to the S3 bucket
The Access Key and Secret Access Key for the IAM User to use for authenticating to AWS within GitHub Actions
A passphrase of your choosing that will be used to encrypt secrets in the Pulumi stack
A GitHub repository
First, set up secrets on your GitHub repository. These will be used by GitHub Actions once we create a workflow YAML. You’ll need to create 3 secrets. You can name them whatever you want, but I’ll be naming them: You can create these by browsing to your GitHub repository > Settings > Secrets and variables > Actions. There are “Environment secrets” and “Repository secrets”. In this case go with “Repository secrets”. Create the three secrets and fill in their respective values. can be whatever you want and doesn’t come from AWS. Make sure you don’t change this after the fact though, or your Pulumi state may break. The end result should look like this:
Next, clone the repository locally with . We’ll need to create a few files for a minimum viable Pulumi program. Run and it’ll guide you through the creation of a basic Pulumi program. The language you choose doesn’t matter. Then we can create the YAML file to set up GitHub Actions. Create a YAML file under . I’ll call it in my example. Fill it in with the following YAML, changing values as needed. This runs a so it’ll verify that everything is set up correctly without actually deploying anything.
Now push your code to GitHub and see if the GitHub Action workflow ran successfully. You should see output from a successful . If this works, you should be good to go. You can update your Pulumi code and change to in the GitHub Actions YAML file and run it again to actually deploy some infrastructure.
https://github.com/pulumi/actions
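For context, the “minimum viable Pulumi program” can be as small as a single resource. A hedged sketch in Python (my own, with a hypothetical bucket name; the scaffold that gets generated for you may differ, and the language you pick doesn’t matter):

```python
"""__main__.py: a minimal Pulumi program that provisions a single S3 bucket."""
import pulumi
import pulumi_aws as aws

# Hypothetical logical name; Pulumi appends a random suffix to the real bucket name.
bucket = aws.s3.Bucket("pulumi-demo-bucket")

# Export the generated bucket name so it shows up in preview/up output.
pulumi.export("bucket_name", bucket.id)
```

Locally, you would point Pulumi at the S3 backend with `pulumi login s3://<state-bucket>` before previewing or deploying; in CI, the passphrase secret is typically exposed to those commands via the PULUMI_CONFIG_PASSPHRASE environment variable.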

0 views
Blog System/5 2 months ago

From Azure Functions to FreeBSD

On Thanksgiving morning, I woke up to one of my web services being unavailable. All HTTP requests failed with a “503 Service unavailable” error. I logged into the console, saw a simplistic “Runtime version: Error” message, and was not able to diagnose the problem. I did not spend a lot of time trying to figure the issue out and I didn’t even want to contact the support black hole. Because… there was something else hidden behind an innocent little yellow warning at the top of the dashboard: Migrate your app to Flex Consumption as Linux Consumption will reach EOL on September 30 2028 and will no longer be supported.
I had known for a few weeks now, while trying to set up a new app, that all of my Azure Functions apps were on death row. The free plan I was using was going to be decommissioned and the alternatives I tried didn’t seem to support custom handlers written in Rust. I still had three years to deal with this, but hitting a showstopper error pushed me to take action. All of my web services are now hosted by the FreeBSD server in my garage with just a few tweaks to their codebase. This is their migration story.
Blog System/5 and the open source projects described below are all made in my limited free time. Subscribe now to show your support; it goes a long way!
How did I get here?
Back in 2021, I had been developing my EndBASIC language for over a year and I wanted to create a file sharing service for it. Part of this was to satisfy my users, but another part was to force myself into the web services world as I felt “behind”. At that time, I had also been at Microsoft for a few months already working on Azure Storage. One of the perks of the job was something like $300 of yearly credit to deploy stuff on Azure for learning purposes. It was only “natural” that I’d pick Azure for what I wanted to do with EndBASIC.
Now… $300 can be plentiful for a simple app, but it can also be paltry. Running a dedicated VM would eat through this in a couple of months, but the serverless model offered by Azure Functions with its “infinite” free tier would go a long way. I looked at their online documentation, found a very good guide on how to deploy Rust-native functions onto a Linux runtime, and… I was sold. I quickly got a bare bones service up and running on Azure Functions and I built it up from there. Based on these foundations, I later developed a separate service for my own site analytics (poorly named EndTRACKER), and I recently started working on a new service to provide secure auto-unlock of encrypted ZFS volumes (stay tuned!).
And, for the most part, the experience with Azure has been neat. I learned a bunch and I got to a point where I had set up “push on green” via GitHub Actions and dual staging vs. prod deployments. The apps ran completely on their own for the last three years, a testament to the stability of the platform and to the value of designing for testability. Until now that is.
The cloud database
Compute-wise, I was set: Azure Functions worked fine as the runtime for my apps’ logic and it cost pennies to run, so the $300 was almost untouched. But web services aren’t made of compute alone: they need to store data, which means they need a database. My initial research in 2021 concluded that the only option for a database instance with a free plan was to go with, no surprise, serverless Microsoft SQL Server (MSSQL). I had never used Microsoft’s offering but it couldn’t be that different from PostgreSQL or MySQL, could it? Maybe so, but I didn’t get very far in that line of research.
The very first blocker I hit was that the MSSQL connection required TLS and this hadn’t been implemented in the connector I chose to use for my Rust-based functions. I wasted two weeks implementing TLS support in (see PR #1200 and PR #1203) and got it to work, but that code was not accepted upstream because it conflicted with their business strategy. Needless to say, this was disappointing because getting that to work was a frigging nightmare. In any case, once I passed that point, I started discovering more missing features and bugs in the MSSQL connector, and then I also found some really weird surprises in MSSQL’s dialect of SQL. TL;DR, this turned into a dead end.
On the left, the default instance and cost selected by Azure when choosing to create a managed PostgreSQL server today. On the right, minimum possible cost after dialing down CPU, RAM, disk, and availability requirements.
I had no choice other than to provision a full PostgreSQL server on Azure. Their onboarding wizard tried to push me towards a pretty beefy and redundant instance that would cost over $600 per month when all I needed was the lowest machine you could get for the amount of traffic I expected. Those options were hidden under a “for development only” panel and riddled with warnings about no redundancy, but after I dialed all the settings down and accepted the “serious risks”, I was left with an instance that’d cost $15 per month or so. This fit well within the free yearly credit I had access to, so that was it.
The outage and trigger
About two months ago, I started working on a new service to securely auto-unlock ZFS encrypted volumes (more details coming). For this, I had to create a new Azure Functions deployment… and I started seeing the writing on the wall. I don’t remember the exact details, but it was really difficult to get the creation wizard to provision me the same flex plan I had used for my other services, and it was warning me that the selected plan was going to be axed in 2028. At the time of this writing, 2028 is still three years out and this warning was for a new service I was creating. I didn’t want to consider migrating either EndBASIC or EndTRACKER to something else just yet. Until Thanksgiving, that was.
On Thanksgiving morning, I noticed that my web analytics had stopped working. All HTTP API requests failed with a “503 Service unavailable.” error but, interestingly, the cron-triggered APIs were still running in the background just fine and the staging deployment slot of the same app worked fine end-to-end as well. I tried redeploying the app with a fresh binary, thinking that a refresh would fix the problem, but that was of no use. I also poked through the dashboard trying to figure out what “Runtime version: Error” would be about, making sure the version spec in was up-to-date, and couldn’t figure it out either.
Summary state of my problematic Azure Functions deployment. Note the cryptic runtime error along with the subtle warning at the top about upcoming deprecations.
So… I had to get out of Azure Functions, quick. Not accidentally, I had bought a second-hand, over-provisioned ThinkStation (2x36-core Xeon E5-2697, 64 GB of RAM, a 2 TB NVMe drive, and a 4x4 TB HDD array) just two years back. The justification I gave myself was to use it as my development server, but I had this idea in the back of my mind to use it to host my own services at some point. The time to put it to serving real-world traffic with FreeBSD 14.x had come.
From serverless to serverful
The way you run a serverless Rust (or Go) service on Azure Functions is by creating a binary that exposes an HTTP server on the port provided to it by the environment variable. Then, you package the binary along a set of metadata JSON files that tell the runtime what HTTP routes the binary serves and push the packaged ZIP file to Azure. From there on, the Azure Functions runtime handles TLS termination for those routes, spawns your binary server on a micro VM on demand, and redirects the requests to it.
By removing the Azure Functions runtime from the picture, I had to make my server binary stand alone. This was actually pretty simple because the binary was already an HTTP server: it just had to be coerced into playing nicely with FreeBSD’s approach to running services. In particular, I had to:
Inject configuration variables into the server process at startup time. These used to come from the Azure Functions configuration page, and are necessary to tell the server where the database lives and what credentials to use.
Make the service run as an unprivileged user, easily.
Create a PID file to track the execution of the process so that the framework could handle restarts and stop requests.
Store the logs that the service emits via stderr to a log file, and rotate the log to prevent local disk overruns.
Most daemons implement all of the above as features in their own code, but I did not want to have to retrofit all of these into my existing HTTP service in a rush. Fortunately, FreeBSD provides this little tool, daemon(8), which wraps an existing binary and offers all of the above. This incantation was enough to get me going:
I won’t dive into the details of each flag, but to note: specifies which PID file to create; specifies where to store the stdout and stderr of the process; is required for log rotation (much more below); drops privileges to the given user; and specifies the “title” of the process to display in output. The trick was sufficient to inject configuration variables upon process startup, simulating the same environment that my server used to see when spawned by the Azure Functions runtime. Hooking that up into an service script was then trivial: And with that: Ta-da! I had the service running locally and listening to a local port determined in the configuration file.
As part of the migration out of Azure Functions, I switched to self-hosting PostgreSQL as well. This was straightforward but required a couple of extra improvements to my web framework: one to stop using a remote PostgreSQL instance for tests (something I should have done eons ago), and another to support local peer authentication to avoid unnecessary passwords.
In the call to above, I briefly mentioned the need for the flag to support log rotation. What’s that about? You see, in Unix-like systems, when a process opens a file, the process holds a handle to the open file. If you delete or rename the file, the handle continues to exist exactly as it was. This has two consequences:
If you rename the file, all subsequent reads and writes go to the new file location, not the old one.
If you delete the file, all subsequent reads and writes continue to go to disk but to a file you cannot reference anymore. You can run out of disk space and, while will confirm the fact, will not let you find what file is actually consuming it!
For a long-running daemon that spits out verbose logs, writing them to a file can become problematic because you can end up running out of disk space.
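To make those handle semantics concrete, here is a tiny sketch of my own (hypothetical file names, not from the post): after the rename, writes keep landing in the renamed file until the process closes and reopens the path.

```python
"""Show that renaming a file does not affect an already-open handle."""
import os

log = open("demo.log", "a")
log.write("first line\n")
log.flush()

os.rename("demo.log", "demo.log.0")  # "rotate" the file out of the way

log.write("second line\n")  # still lands in demo.log.0, the renamed file
log.flush()

# Only after closing and reopening the path do writes reach a fresh demo.log.
log.close()
log = open("demo.log", "a")
log.write("third line\n")
log.close()
```

A long-running daemon would do that close-and-reopen step in a signal handler rather than inline, which is where the signalling described next comes in.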
To solve this problem, daemons typically implement log rotation: a mechanism to keep log sizes in check by moving them aside when a certain period of time passes or when they cross a size threshold, and then only keeping the last N files around. Peeking into one of the many examples in my server, note how is the “live” log where writes go to but there is a daily archive for up to a week:
Having all daemons implement log rotation logic on their own would be suboptimal because you’d have duplicate logic throughout the system and you would not be able to configure policy easily for them all. This is where newsyslog(8) on FreeBSD (or on Linux) comes into play. is a tool that rotates log files based on criteria such as size or time and optionally compresses them. But remember: the semantics of open file handles mean that simply renaming log files is insufficient! Once takes action and moves a log file aside, it must ensure that the process that was writing to that file closes the file handle and reopens it so that writes start going to the new place. This is typically done via sending a to the daemon, and is why we need to pass to the call. To illustrate the sequence: The system starts a service via and redirects logs to . runs and determines that needs to be rotated because a day has passed. renames to and creates a new and empty . At this point is still writing to ! sends a to the process. The process closes its file handle for the log, reopens (which is the fresh new log file), and resumes writing. compresses the file for archival now that it’s quiesced.
Configuring is easy, but cryptic. We can create a service-specific configuration file under that provides entries for our service, such as: I’ll leave you to the manpage to figure out what the magic is (but in short, it controls retention count, rotation schedule, and compression).
As I briefly mentioned earlier, the Azure Functions runtime was responsible for TLS termination in my previous setup. Without such a runtime in place, I had to configure TLS on my own in my HTTP server… or did I? I had been meaning to play with Cloudflare Tunnels for a while given that I already use Cloudflare for DNS. Zero Trust Tunnels allow you to expose a service without opening inbound ports in your firewall. The way this works is by installing the tunnel daemon on your machine and configuring the tunnel to redirect certain URL routes to an internal address (typically ). Cloudflare then acts as the frontend for the requests, handles TLS termination and DDoS protection, and then redirects the request to your local service.
Interactions between client machines, Cloudflare servers, the cloudflared tunnel agent, and the actual HTTP servers I wrote.
The obvious downside of relying on someone else to do TLS termination instead of doing it yourself on your own server is that they can intercept and modify your traffic. For the kinds of services I run this isn’t a big deal for me, and the simplicity of having others deal with certificates is very welcome. Note that I was already offloading TLS termination to Azure Functions anyway, so this isn’t a downgrade in security or privacy.
CORS
But using Cloudflare as the frontend came with a little annoyance: CORS handling. You see: the services I run require configuring extra allowed origins, and as soon as I tried to connect to them via the Cloudflare tunnel, I’d get the dreaded “405 Method not allowed” error in the requests.
Before, I used to configure CORS origins from the Azure Functions console, but no amount of peeking through the Cloudflare console showed me how to do this for my tunneled routes. At some point during the investigation, I assumed that I had to configure CORS on my own server. I’m not sure how I reached that bogus conclusion, but I ended up wasting a few hours implementing a configuration system for CORS in my web framework. Nice addition… but ultimately useless. I had not accounted for the fact that because Cloudflare acts as the frontend for the services, it is the one responsible for handling the pre-flight HTTP requests necessary for CORS. In turn, this means that Cloudflare is where CORS needs to be configured but there is nothing “obvious” about configuring CORS in the Cloudflare portal.
AI to the rescue! As skeptical as I am of these tools, it’s true that they work well to get answers to common problems—and figuring out how to deal with CORS in Cloudflare was no exception. They told me to configure a transformation rule that explicitly sets CORS response headers for specific subdomains, and that did the trick:
Sample rule configuration on the Cloudflare portal to rewrite CORS response headers.
Even though AI was correct in this case, the whole thing looked fishy to me, so I did spend time reading about the inner workings of CORS to make sure I understood what this proposed solution was about and to gain my own confidence that it was correct.
Results of the transition
By now, my web services are fully running on my FreeBSD machine. The above may have seemed complicated, but in reality it was all just a few hours of work on Thanksgiving morning. Let’s conclude by analyzing the results of the transition. On the plus side, here is what I’ve gained:
Predictability: Running in the cloud puts you at the mercy of the upgrade and product discontinuation treadmill of big cloud providers. It’s no fun to have to be paying attention to deprecation messages and adjust to changes no matter how long the deadlines are. FreeBSD also evolves, of course, but it has remained pretty much the same over the last 30 years and I have no reason to believe it’ll significantly change in the years to come.
Performance: My apps are so much faster now it’s ridiculous. The serverless runtime of Azure Functions starts quickly for sure, but it just can’t beat a server that’s continuously running and that has hot caches at all layers. That said, I bet the real difference in performance for my use case comes from colocating the app servers with the database, duh.
Ease of management: In the past, having automated deployments via GitHub Actions to Azure Functions was pretty cool, not gonna lie. But… being able to deploy with a trivial , perform PostgreSQL administration tasks with just a , and inspect logs trivially and quickly by looking at beats any sort of online UI and distributed system. “Doesn’t scale” you say, but it scales up my time.
Cost: My Azure bill has gone from $20/month, the majority of which was going into the managed PostgreSQL instance, to almost zero. Yes, the server I’m running in the garage is probably costing me the same or more in electricity, but I was running it anyway already for other reasons.
And here is what I’ve lost (for now):
Availability (and redundancy): The cloud gives you the chance of very high availability by providing access to multiple regions. Leveraging these extra availability features is not cheap and often requires extra work, and I wasn’t taking advantage of them in my previous setup.
So, I haven’t really decreased redundancy, but it’s funny that the day right after I finished the migration, I lost power for about 2 hours. Hah, I think I hadn’t suffered any outages with Azure other than the one described in this article.
A staging deployment: In my previous setup, I had dual prod and staging deployments (via Azure Functions slots and separate PostgreSQL databases—not servers) and it was cool to deploy first to staging, perform some manual validations, and then promote the deployment to prod. In practice, this was rather annoying because the deployment flow was very slow and not fully automated (see “manual testing”), but it indeed saved me from breaking prod a few times.
Auto-deployments: Lastly and also in my previous setup, I had automated the push to staging and prod by simply updating tags in the GitHub repository. Once again, this was convenient, but the biggest benefit of it all was that the prod build process was “containerized” and not subject to environmental interference. I could very well set up a cron job or webhook-triggered local service that rebuilt and deployed my services on push… but it’s now hard to beat the simplicity of .
None of the above losses are inherent to self-hosting, of course. I could provide alternatives for them all and at some point I will; consider them to-dos!

0 views
Stratechery 2 months ago

AWS re:Invent, Agents for AWS, Nova Forge

AWS re:Invent sought to present AI solutions in the spirit of AWS' original impact on startups; the real targets may be the startups from that era, not the current one.

0 views