Posts in Cloud (20 found)
Robin Moffatt Yesterday

Alternatives to MinIO for single-node local S3

In late 2025 the company behind MinIO decided to abandon it to pursue other commercial interests. As well as upsetting a bunch of folk, it also put the cat amongst the pigeons of many software demos that relied on MinIO to emulate S3 storage locally, not to mention build pipelines that used it for validating S3 compatibility. In this blog post I'm going to look at some alternatives to MinIO. Whilst MinIO is a lot more than 'just' a glorified tool for emulating S3 when building demos, my focus here is going to be on what is the simplest replacement. In practice that means the following:
Must have a Docker image. So many demos are shipped as Docker Compose, and no-one likes brewing their own Docker images unless really necessary.
Must provide S3 compatibility. The whole point of MinIO in these demos is to stand in for writing to actual S3.
Must be free to use, with a strong preference for an Open Source (per OSI definition) licence, e.g. Apache 2.0.
Should be simple to use for a single-node deployment.
Should have a clear and active community and/or commercial backer. Any fule can vibe-code some abandon-ware slop, or fork a project in a fit of enthusiasm—but MinIO stood the test of time until now and we don't want to be repeating this exercise in six months' time.
Bonus points for excellent developer experience (DX), smooth configuration, good docs, etc.
What I'm not looking at is, for example, multi-node deployments, distributed storage, production support costs, GUI capabilities, and so on. That is, this blog post is not aimed at folk who were using MinIO as self-managed S3 in production. Feel free to leave a comment below though if you have useful things to add in this respect :)
My starting point for this is a very simple Docker Compose stack: DuckDB to read and write Iceberg data that's stored on S3, provided by MinIO to start with. You can find the code here. The Docker Compose is pretty straightforward:
DuckDB, obviously, along with
Iceberg REST Catalog
MinIO (S3 local storage)
, which is a MinIO CLI and used to automagically create a bucket for the data.
When I insert data into DuckDB: it ends up in Iceberg format on S3, here in MinIO: In each of the samples I've built you can run the to verify it.
Let's now explore the different alternatives to MinIO, and how easy it is to switch MinIO out for each of them. I've taken the above project and tried to implement it with as few changes as possible to use the replacement for MinIO. I've left the MinIO S3 client in place, since that's no big deal to replace if you want to rip out MinIO completely (s3cmd, CLI, etc etc).
💾 Example Docker Compose
Version tested:
✅ Docker image (5M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 👍👍
Very easy to implement, and seems like a nice lightweight option.
💾 Example Docker Compose
Version tested:
Ease of config: ✅✅
✅ Docker image (100k+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
RustFS also includes a GUI:
💾 Example Docker Compose
Version tested:
✅ Docker image (5M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 👍
This quickstart is useful for getting bare-minimum S3 functionality working. (That said, I still just got Claude to do the implementation…). Overall there's not too much to change here; a fairly straightforward switchout of Docker images, but the auth does need its own config file (which as with Garage, I inlined in the Docker Compose).
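To make the S3-compatibility requirement concrete: whichever store you pick, a stock S3 client should work against it with nothing changed except the endpoint and credentials. Here is a minimal sketch using boto3; the endpoint, credentials, and bucket name are illustrative placeholders rather than the values from the demo repo.

```python
# Minimal S3-compatibility smoke test against a local store.
# Endpoint, credentials, and bucket name are placeholders -- substitute
# whatever your Docker Compose stack actually exposes.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO's default; other stores differ
    aws_access_key_id="admin",
    aws_secret_access_key="password",
    region_name="us-east-1",
)

# List whatever the Iceberg writes from DuckDB landed in the bucket
for obj in s3.list_objects_v2(Bucket="warehouse").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

If a candidate store can't pass something this simple without special-casing, it isn't really a drop-in replacement for MinIO in these demos.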
SeaweedFS comes with its own basic UI which is handy: The SeaweedFS website is surprisingly sparse and at a glance you'd be forgiven for missing that it's an OSS project, since there's a "pricing" option and the title of the front page is "SeaweedFS Enterprise" (and no GitHub link that I could find!). But an OSS project it is, and a long-established one: SeaweedFS has been around with S3 support since its 0.91 release in 2018. You can also learn more about SeaweedFS from these slides, including a comparison chart with MinIO.
💾 Example Docker Compose
Version tested:
✅ Docker image (also outdated ones on Docker Hub with 5M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 👍
Formerly known as S3 Server, CloudServer is part of a toolset called Zenko, published by Scality. It drops in to replace MinIO pretty easily, but I did find it slightly tricky at first to disentangle the set of names (cloudserver/zenko/scality) and what the actual software I needed to run was. It also feels slightly odd that the docs link to an outdated Docker image.
💾 Example Docker Compose
Ease of config: 😵
Version tested:
✅ Docker image (1M+ pulls)
✅ Licence: AGPL
✅ S3 compatibility
I had to get a friend to help me with this one. As well as the container, I needed another to do the initial configuration, as well as a TOML config file which I've inlined in the Docker Compose to keep things concise. Could I have sat down and RTFM'd to figure it out myself? Yes. Do I have better things to do with my time? Also, yes. So, Garage does work, but gosh…it is not just a drop-in replacement in terms of code changes. It requires different plumbing for initialisation, and it's not simple at that either. A simple example: . Excellent for production hygiene…overkill for local demos, and in fact somewhat of a hindrance TBH.
💾 Example Docker Compose
Version tested:
✅ Docker images (1M+ pulls)
✅ Licence: Apache 2.0
✅ S3 compatibility
Ease of config: 😵
Ozone was spun out of Apache Hadoop (remember that?) in 2020, having been initially created as part of the HDFS project back in 2015. It does work as a replacement for MinIO, but it is not a lightweight alternative; neither I nor Claude could figure out how to deploy it with any fewer than four nodes. It gives heavy Hadoop vibes, and I wouldn't be rushing to adopt it for my use case here.
I took one look at the installation instructions and noped right out of this one! Ozone (above) is heavyweight enough; I'm sure both are great at what they do, but they are not a lightweight container to slot into my Docker Compose stack for local demos.
Everyone loves a bake-off chart, right?
gaul/s3proxy ( Git repo ) Single contributor ( Andrew Gaul )
( Git repo ) Fancy website but not much detail about the company
( Git repo ) Single contributor ( Chris Lu ), Enterprise option available
Zenko CloudServer ( Git repo ) Scality (commercial company) 5M+ (outdated version)
( Git repo ) NGI/NLnet grants
Apache Ozone ( Git repo ) Apache Software Foundation
1 Docker pulls is a useful signal but not an absolute one given that a small number of downstream projects using the image in a frequently-run CI/CD pipeline could easily distort this figure.
I got side-tracked into writing this blog because I wanted to update a demo that currently uses MinIO. So, having tried them out, which of the options will I actually use?
SeaweedFS - yes.
S3Proxy - yes.
RustFS - maybe, but very new project & alpha release.
CloudServer - yes, maybe?
Honestly, put off by it being part of a suite and worrying I'd need to understand other bits of it to use it—probably unfounded though.
Garage - no, config too complex for what I need.
Apache Ozone - lol no.
I mean to cast no shade on those options against which I've not recorded a ; they're probably excellent projects, but just not focussed on my primary use case (simple & easy to configure single-node local S3). A few parting considerations to bear in mind when choosing a replacement for MinIO: Governance . Whilst all the projects are OSS, only Ozone is owned by a foundation (ASF). All the others could, in theory , change their licence at the drop of a hat (just like MinIO did). Community health . What's the "bus factor"? A couple of the projects above have a very long and healthy history—but from a single contributor. If they were to abandon the project, would someone in the community fork and continue to actively develop it?

0 views

Most Code is Just Cache

Claude Code has systematically begun to consume many of the SaaS apps I used to (or plan to) pay for. Why pay a subscription when I can "vibe code" a personal MVP in twenty minutes? I don't worry about maintenance or vendor lock-in because, frankly, the code is disposable. If I need a new feature tomorrow, I don't refactor—I just rebuild it. 1 Code is becoming just an ephemeral cache of my intent. Cartoon via Nano Banana. In this model, the 'Source Code' is the prompt and the context; the actual Python or Javascript that executes is just the binary. We still run the code because it's thermodynamically efficient and deterministic, but we treat it as disposable. If the behavior needs to change, we don't refactor the binary; we re-compile the intent. This shift has made me intolerant of static interfaces. I have stopped caring about software that doesn't let me dump massive amounts of context into Gemini or Claude to just do the thing. If a product forces me to click buttons to execute a process that an LLM could intuit from a prompt, that product is already legacy. It forces us to question the permanence of the current model. We often make the mistake of assuming software—as we know it today—is a permanent fixture of human productivity. But if you zoom out, the era of SaaS is a blink of an eye in modern history. It is easy to overestimate how core it is to the future. In this post, I want to extrapolate these thoughts a bit and write out what could be the final stages of software.
Software Evolution
The stages here might not necessarily be chronological or mutually exclusive. Instead, they are ordered from static to dynamic code generation — where more and more the intent of a customer is the software they use.
Stage 1. Traditional SaaS
This is the baseline where software is a static artifact sold as a service, built on the assumption that user problems are repetitive and predictable enough to be solved by rigid workflows. To the consumer, this looks like dashboards, CRUD forms, and hardcoded automations. The intelligence here is sourced mainly from the SaaS founder and hired domain experts, hard-coded into business logic years before the user ever logs in. When: We recognized that distributing software via the cloud was more efficient than on-premise installations. Value Loop: Customer Problem → Product Manager writes PRD → Engineers write Static Code → Deploy → Customer adapts their workflow to the tool. (Time: Months to Years | Fit: Generic / One-size-fits-none)
We are seeing this now with companies adopting Forward Deployed Engineering (FDE). In this stage, the SaaS company hires humans to manually use AI to build bespoke solutions for the client. For the consumer, this feels like a concierge service; they don't get a login to a generic tool, they get a custom-built outcome delivered by a human who used AI to write the glue code. The intelligence is hybrid: the human provides the architecture, the AI writes the implementation code in weeks to days. When: Companies realize AI allows their employees to build custom apps for clients faster than the clients can learn or adapt a generic tool. Value Loop: Customer Problem → SaaS Employee (FDE) Prompts AI → AI generates Custom Script/App → Employee Deploys for Customer. (Time: Days | Fit: High / Tailored to specific customer edge cases)
This is the current "safe space" for most tech companies, where they bolt an LLM onto an existing application to handle unstructured data.
Consumers experience this as a "Draft Email" button in their CRM or a "Chat" sidebar in their UI—the platform is still the main product, but AI is a feature that (hopefully) reduces friction and/or provides some extra functionality or customization 2 . The intelligence comes from a constrained model of product design and LLM scaffolding, providing content within a structure still strictly dictated by the SaaS platform's code. When: People start to see AI is good at summarizing, generating content, or taking actions within existing workflows. Value Loop: Customer Problem → Static SaaS Interface AI Feature Text Box → Stochastic Result → Human Review. (Time: Minutes | Fit: Medium / Constrained by the platform's UI) This is the tipping point where the software interface starts to disappear because the "interface" was just a way to collect context that the model can now ingest directly. Consumers move to a "Do this for me" interface where intent maps directly to an outcome rather than a button click, often realized as an agent calling a database or MCP servers 3 . The intelligence is the model and its engineered input context, relegating the SaaS role, in some sense, to providing clean proprietary data via an agent-friendly interface. Software as a Service for Agents. When: People start to see AI is good at orchestrating complex decisions and using tools—across SaaS platforms—autonomously. Value Loop: Customer Problem (Prompt as ~PRD) → Runtime Code Generation → Dynamic Outcome. (Time: Real-time | Fit: Very High / Dynamically generated for the specific context) Critically, this doesn't mean the LLM acts as the CPU for every single user interaction (which would be latency-poor and energy-inefficient). Instead, the model almost acts as a Just-In-Time compiler. It generates the necessary code to execute the user's intent, runs that code for the session, and then potentially discards it. This is the end game in some cases. If code is just a cache for intent, eventually we bypass the cache and bake the intuition directly into the model. To the consumer, the "tool" is invisible; the expert system simply exists and provides answers or actions without a login or workflow. The intelligence is in the model itself; the software platform exists solely as a distillation mechanism—a gym to train the vertical AI—and once the model learns the domain, the software is no longer needed. A company in this stage is not really even SaaS anymore, maybe more so an AI-gyms-aaS company. When: People start to see AI is good at absorbing the entire vertical's intuition. Value Loop: Raw Domain Data → Reinforcement Learning / Fine-Tuning → Model Weights. (Time: Instant / Pre-computed | Fit: Very High / Intuitive domain mastery) This might feel unintuitive as a stage — like how could you bake some proprietary data lake into a model? How can our juicy data not be the moat? My conclusion is that most (but not all) data is a transformation of rawer upstream inputs and that these transformations (data pipelines, cross-tenant analysis, human research, etc.) are all "cache" that can be distilled into a more general model that operates on its intuition and upstream platform inputs. "But can agents run a bank?" Reliability and safety come down to distinguishing between guardrails (deterministic interfaces and scaffolding) and runtime execution (LLM code). For now, you don't let the LLM invent the concept of a transaction ledger or rewrite the core banking loop on the fly.
In XX years, maybe we do trust AI to write core transaction logic; after all, fallible humans wrote the code for most mission-critical software that exists today. The line between human-defined determinism and agent symbolic interfaces will gradually move over time. "But enterprise SaaS is actually super complex." Yes, but that complexity is mostly just unresolved ambiguity. Your "deep enterprise understanding" is often a collection of thousands of edge cases—permissions, policy exceptions, region-specific rules—that humans had to manually hard-code into IF/ELSE statements over a decade. Distilled to the core, this complexity collapses. The model doesn't need 500 hard-coded features; it needs the raw data and the intent. An app built for one can also make a lot of simplifications compared to one that acts as a platform. "Customers don't want to prompt features." I agree. I don't think the future looks like a chatbot. "Chat" is a skeuomorphic bridge we use because we haven't figured out the consistent native interface yet. It might be a UI that pre-emptively changes based on your role, or it might feel like hiring a really competent employee who just "takes care of it" without you needing to specify the . Or, as we see in Stage 2, the user never prompts at all—an FDE does it for them, and the user just gets a bespoke app that works perfectly. Stage 1, where most companies are stuck today, definitely is. Why? Because the sheer overhead of traditional SaaS—the learning curve, the rigid workflows, the "click tax" to get work done—is becoming unacceptable in a world where intent can be executed directly. It feels increasingly archaic when flexible solutions can be generated on demand. The value is moving away from the workflow logic itself and toward two specific layers that sandwich it:
The Data Layer: Proprietary data, trust, and the "agentic scaffolding" that allows models to act safely within your domain.
The Presentation Layer: Brand and UI. While I suspect trying to control the presentation layer long-term is futile (as users will eventually bring their own "interface agents" to interact with your data), for now, it remains a differentiator.
We are going to see companies move through these tiers. The winners IMO will be the ones who realize that the "Service" part of SaaS is being replaced by model intelligence. The SaaS that remains will be the infrastructure of truth and the engine of agency. We are transitioning from a world of static artifacts (code that persists for years) to dynamic generations (code that exists for milliseconds or for a single answer). Of course, I could be wrong. Maybe AI capability plateaus before it can fully integrate into complex verticals. Maybe traditional SaaS holds the line at Stage 2 or 3, protecting its moat through sheer inertia. Maybe the world ends up more decentralized. Some of my open questions:
Which stage should you work on today? Is there alpha in skipping straight to Stage 4, or do you need to build the Stage 2 "vibe coding" service to bootstrap for now?
What are the interfaces of the future? Is it MCP, curated compute sandboxes, or a yet-to-be-defined agent-to-agent-to-human protocol? What interface wins out, or does each company or consumer bring their own agentic worker?
How fast does this happen? Are we looking at a multi-decade-long transition, or do companies today rapidly start dropping lower-stage SaaS tools?
Does AI have a similar impact beyond software? Does medicine move from "static protocols" to "on-demand, patient-specific treatments"?
Even more so than me, you can see Geoffrey Huntley's ralph-powered rampage of GitHub and many other tools. I liked this tweet by Harj Taggar, "moved away from the FDE playbook that's become the default for fast growing AI startups. Instead they've built AI to convert plain English from the customer into Python code to make the product work for their use cases". Similar to Karpathy's "LLMs not as a chatbot, but the kernel process of a new Operating System" (2023).

12 views
Simon Willison 5 days ago

Fly's new Sprites.dev addresses both developer sandboxes and API sandboxes at the same time

New from Fly.io today: Sprites.dev . Here's their blog post and YouTube demo . It's an interesting new product that's quite difficult to explain - Fly call it "Stateful sandbox environments with checkpoint & restore" but I see it as hitting two of my current favorite problems: a safe development environment for running coding agents and an API for running untrusted code in a secure sandbox. Disclosure: Fly sponsor some of my work. They did not ask me to write about Sprites and I didn't get preview access prior to the launch. My enthusiasm here is genuine. I predicted earlier this week that "we’re due a Challenger disaster with respect to coding agent security" due to the terrifying way most of us are using coding agents like Claude Code and Codex CLI. Running them in mode (aka YOLO mode, where the agent acts without constantly seeking approval first) unlocks so much more power, but also means that a mistake or a malicious prompt injection can cause all sorts of damage to your system and data. The safe way to run YOLO mode is in a robust sandbox, where the worst thing that can happen is the sandbox gets messed up and you have to throw it away and get another one. That's the first problem Sprites solves: That's all it takes to get SSH connected to a fresh environment, running in an ~8GB RAM, 8 CPU server. And... Claude Code and Codex and Gemini CLI and Python 3.13 and Node.js 22.20 and a bunch of other tools are already installed. The first time you run it neatly signs you in to your existing account with Anthropic. The Sprites VM is persistent so future runs of will get you back to where you were before. ... and it automatically sets up port forwarding, so you can run a localhost server on your Sprite and access it from on your machine. There's also a command you can run to assign a public URL to your Sprite, so anyone else can access it if they know the secret URL. In the blog post Kurt Mackey argues that ephemeral, disposable sandboxes are not the best fit for coding agents: The state of the art in agent isolation is a read-only sandbox. At Fly.io, we’ve been selling that story for years, and we’re calling it: ephemeral sandboxes are obsolete. Stop killing your sandboxes every time you use them. [...] If you force an agent to, it’ll work around containerization and do work . But you’re not helping the agent in any way by doing that. They don’t want containers. They don’t want “sandboxes”. They want computers. [...] with an actual computer, Claude doesn’t have to rebuild my entire development environment every time I pick up a PR. Each Sprite gets a proper filesystem which persists in between sessions, even while the Sprite itself shuts down after inactivity. It sounds like they're doing some clever filesystem tricks here, I'm looking forward to learning more about those in the future. There are some clues on the homepage : You read and write to fast, directly attached NVMe storage. Your data then gets written to durable, external object storage. [...] You don't pay for allocated filesystem space, just the blocks you write. And it's all TRIM friendly, so your bill goes down when you delete things. The really clever feature is checkpoints. You (or your coding agent) can trigger a checkpoint which takes around 300ms. This captures the entire disk state and can then be rolled back to later. 
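The checkpoint feature is also what makes the untrusted-code story work: set a machine up once, snapshot it, and restore that snapshot after every risky run. The sketch below is purely illustrative of that pattern; SandboxClient and its methods are hypothetical stand-ins, not the actual Sprites SDK (the Python client is still listed as coming soon).

```python
# Illustrative pattern only: "checkpoint, run untrusted code, roll back".
# SandboxClient and its methods are hypothetical stand-ins, not the real
# Sprites API; the point is the shape of the flow, not the names.
class SandboxClient:
    def checkpoint(self) -> str:
        raise NotImplementedError   # would snapshot the sandbox's disk (~300ms)

    def exec(self, command: str) -> str:
        raise NotImplementedError   # would run a command inside the sandbox

    def restore(self, checkpoint_id: str) -> None:
        raise NotImplementedError   # would roll the disk back to the snapshot


def run_untrusted(sandbox: SandboxClient, command: str) -> str:
    clean = sandbox.checkpoint()      # capture a known-good state first
    try:
        return sandbox.exec(command)  # blast radius confined to the sandbox
    finally:
        sandbox.restore(clean)        # always return to the clean snapshot
```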
For more on how that works, run this in a Sprite: Here's the relevant section: Or run this to see the for the command used to manage them: Which looks like this: I'm a big fan of Skills , the mechanism whereby Claude Code (and increasingly other agents too) can be given additional capabilities by describing them in Markdown files in a specific directory structure. In a smart piece of design, Sprites uses pre-installed skills to teach Claude how Sprites itself works. This means you can ask Claude on the machine how to do things like open up ports and it will talk you through the process. There's all sorts of interesting stuff in the folder on that machine - digging in there is a great way to learn more about how Sprites works. Also from my predictions post earlier this week: "We’re finally going to solve sandboxing" . I am obsessed with this problem: I want to be able to run untrusted code safely, both on my personal devices and in the context of web services I'm building for other people to use. I have so many things I want to build that depend on being able to take untrusted code - from users or from LLMs or from LLMs-driven-by-users - and run that code in a sandbox where I can be confident that the blast radius if something goes wrong is tightly contained. Sprites offers a clean JSON API for doing exactly that, plus client libraries in Go and TypeScript and coming-soon Python and Elixir . From their quick start: You can also checkpoint and rollback via the API, so you can get your environment exactly how you like it, checkpoint it, run a bunch of untrusted code, then roll back to the clean checkpoint when you're done. Managing network access is an important part of maintaining a good sandbox. The Sprites API lets you configure network access policies using a DNS-based allow/deny list like this: Sprites have scale-to-zero baked into the architecture. They go to sleep after 30 seconds of inactivity, wake up quickly when needed and bill you for just the CPU hours, RAM hours and GB-hours of storage you use while the Sprite is awake. Fly estimate a 4 hour intensive coding session as costing around 46 cents, and a low traffic web app with 30 hours of wake time per month at ~$4. (I calculate that a web app that consumes all 8 CPUs and all 8GBs of RAM 24/7 for a month would cost ((7 cents * 8 * 24 * 30) + (4.375 cents * 8 * 24 * 30)) / 100 = $655.2 per month, so don't necessarily use these as your primary web hosting solution for an app that soaks up all available CPU and RAM!) I was hopeful that Fly would enter the developer-friendly sandbox API market, especially given other entrants from companies like Cloudflare and Modal and E2B . I did not expect that they'd tackle the developer sandbox problem at the same time, and with the same product! My one concern here is that it makes the product itself a little harder to explain. I'm already spinning up some prototypes of sandbox-adjacent things I've always wanted to build, and early signs are very promising. I'll write more about these as they turn into useful projects. Update : Here's some additional colour from Thomas Ptacek on Hacker News: This has been in the works for quite awhile here. We put a long bet on "slow create fast start/stop" --- which is a really interesting and useful shape for execution environments --- but it didn't make sense to sandboxers, so "fast create" has been the White Whale at Fly.io for over a year. You are only seeing the long-form articles from my blog. 
Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.

1 view
Giles's blog 1 week ago

Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that: In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way? Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing. So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?" Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. They can train the model in less than four hours, they happen to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation. If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results . Back when I was messing around with fine-tuning LLMs using the Hugging Face ecosystem -- their "Transformers" library and so on -- one of the experiments I did was to fine-tune a 0.5B Qwen model on an 8x GPU machine . As part of that, I came across this excellent HF page summarising different kinds of multi-GPU training techniques . The three that are relevant are: Now, from what I understand, due to all of the copying around of models, plus the issues inherent with the GIL in Python, DDP is actually better than DP despite being more complicated -- and more flexible! Per Hugging Face: DDP is recommended because it reduces communication overhead between GPUs, efficiently utilizes each GPU, and scales to more than one machine. It might be a while before I want to try multi-machine training, but it would be awesome to have code that's ready to do that without needing any extra work. Now, how to implement it? Hugging Face have a library called Accelerate , which does everything for you: Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! That does sound very useful, but I worry that by using it I won't learn as much. It also rather ties you in to the HF ecosystem. That's not necessarily a bad thing -- I enjoyed using their stuff in my fine-tuning project -- but I'm trying for a somewhat lower-level view in this series. So, let's use the PyTorch-native stuff. 
There's a "getting started" tutorial , so we can follow that. It has two options for running using DDP, one with a bit of extra setup code -- the first example, under "Basic Use Case" -- and one that uses to make things easier. The second sounds best. The code changes actually look really simple; given a normal single-GPU training script, you need to do some setup at the start: ...then wrap the model itself in a object, which is what you actually do the train on: ...and a bit of teardown at the end: The way to look at this is that will spin off one process per GPU, each running exactly the same code. They have a "rank", which is an integer saying which of the per-GPU processes they are -- 0 for GPU 0, 1 for GPU 1, and so on. There's a bit of a gotcha here, though -- you can see that we're looking at an environment variable called at the start, but we then get a (non-"local") variable from a bit later on. This is due to the multi-machine possibilities with DDP -- if you have multiple machines, then the local rank will be "which GPU on the machine does this process relate to", but there will also be a "global" rank, which is unique across all machines. This distinction won't matter that much during this one-machine test, but it's worth keeping in mind if we want to keep the code in a shape where it could potentially scale to multiple machines. Anyway, after the processes are spun up, they will do their training, and the synchronisation and passing around of gradients during the backward pass will all happen invisibly in the background, so when we do our , it will have the full set of gradients. Now that means that we'll presumably also need to use the rank -- that is, which of the n per-GPU processes the current code is running in -- when selecting which dataset items to train on. More about that later. Let's start writing some code! I'll use a new repo , into which I can put just the code needed for this train. I'll also structure it a little better than last time, with separate "runs", each of which has a model config and training parameters, and will later on have its own checkpoints. You can think of these as being one per machine size that I'm trying out -- I'll create a run directory for each one. Here's a first cut , simply loading up a model config from a run's directory, using it to create the model, and then doing the wrapping above -- no training at all. Running it with (and , as I'm using that for all new projects): Promising. Now, unfortunately we only have one GPU locally, and the code assumes that it's one process per GPU (I believe that's a hard limitation for PyTorch's DDP), so running with blows up. So we can't do an in-depth test locally. But at least we know that the basic infra is there and working. Now let's move the other training code from the single-GPU script into that file, pretty much blindly. This is the result -- it's doing almost nothing beyond what the last train did, apart from wrapping the model in a object -- the only other changes are to use this "runs" directory that we've introduced. As a quick hack, we should try running it. It does a validation and checkpoint before it starts, and we can make that happen quickly by hacking the validation loop to only do a couple of iterations: (Foreshadowing: that hack will come back to haunt us later!) 
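For reference, the torchrun-launched boilerplate being described looks roughly like this. It's a sketch of the standard PyTorch DDP tutorial pattern in its classic CUDA/NCCL form, not the repo's exact code.

```python
# Sketch of the standard DDP scaffolding, launched with something like:
#   torchrun --nproc-per-node=8 train.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK (0..n_gpus-1) for each per-GPU process it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = nn.Linear(1024, 1024).to(local_rank)   # stand-in for the GPT model
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... training loop goes here: forward/backward on ddp_model; the gradient
    # all-reduce across ranks happens automatically during backward() ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```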
Running that, then hitting control-C after the validation completes, and it looks OK: ...and we have what look like solid checkpoints: However, loading one of those checkpoints fails: It turns out that the problem is this code when we save it: The that we're saving is the wrapper around our model; my guess is that it does actually include all of the weights for the model, hence the correct-looking size for the checkpoint file, but they're renamed -- the wrapper sees the underlying model as something called , so (for example) would be called . Fixing that, with this diff: ...sorts it out -- we can load our checkpoints again. Here's the updated file . I think we're going to have to revisit checkpointing and validation again; we don't want to do it in all of our processes, probably only on global rank 0, and we'll need to somehow synchronise everything so that the other processes don't carry on training while we're doing it. But before we get on to that, there are a couple of other things to change. At the top of the file we're defining some constants that look wrong: We'll handle the dumbest of these first; it was actually silly that in the old code we had a constant for sequence length. We're using the context length of the model for that, so it's duplicated information. Let's get it from the : ...and here's the updated file . That was nice and simple. The code that we have specifies the batch size for each GPU -- that is, with , we'll have six sequences in each batch on each one. Like I mentioned earlier, that's called a "micro-batch" in distributed training like this 1 -- a per-GPU batch, as opposed to the overall global size across all GPUs -- so we could just rename it, and then we'd have 6 × n gpus as a global batch size. However, it feels to me like this is a useful metaparameter to be able to tweak from outside the code. I can see machines with per-GPU VRAM varying from 40 GiB to 160 GiB on Lambda Labs, and pretty clearly that will mean there will be a varying largest micro-batch size on each type. So this is something we'll want to configure on a per-run basis, so let's add a new file to our run config, load that up, and pass it through. That's a simple enough fix; no need to note the diff, but here's the code . This one we'll need to think about. The size of our validation set is based on what one process running on my local RTX 3090 can validate in five minutes, and the interval (for which I fairly arbitrarily put 2000 in the code when copying it across) was calibrated for roughly every half-hour. Those numbers in turn were aimed at the 44 hours of training time I expected locally. For this train, we'll (hopefully!) be taking significantly less time. We'll have eight GPUs, so naively that's 5.5 hours of train time, and each will have more VRAM, so we should be able to bump up the batch size and potentially get even faster than that. Depending on which kind of cards we're using, they may be faster, too -- I found that an A100 is slower (with the same batch size) than the RTX 3090 in my fine-tuning experiments, but the H100 and B200 are likely faster. I think this is another thing for the train config; we should have the validation interval (in terms of iterations) and the number of batches to do for validation. Here's the updated code . Now, let's move on to the dataset. 
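(Before moving on, here's the essence of that checkpoint fix as a minimal sketch; the dict keys and helper signatures are illustrative, not the repo's actual ones.)

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def save_checkpoint(ddp_model: DDP, optimizer: torch.optim.Optimizer, path: str) -> None:
    # The DDP wrapper exposes the underlying model as .module; saving the
    # wrapper's own state_dict instead prefixes every key with "module.",
    # which then won't load into a plain, un-wrapped model.
    torch.save(
        {
            "model": ddp_model.module.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_model_weights(model: torch.nn.Module, path: str) -> None:
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
```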
With the code as it is right now, all of our per-GPU processes are using this code to iterate over the same dataset: That means that they'll all be training on the same data; the synchronisation that is happening "magically" in the background means that they'll all train on the first item, work out gradients, and step their optimiser -- so they'll essentially (modulo randomness) have the same updates. Pretty pointless! What we want is for each of the n per-GPU processes to train on 1 / n of the data. We have two useful helpers in : , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) So, the simplest thing to do is to use the world size as a step, and the rank as an offset: Here's the code with that . Now, remember that the same code is running for every one of our per-GPU processes. That means that all of them will do the training with forward and backward passes, and their own optimiser steps, all synchronised by PyTorch DDP magic. But they will also do their own validations -- which is kind of pointless -- and they'll also try to save their own checkpoints, which would be messy because they could quite easily interfere with each other; after all, all of the processes are running on the same machine and would be writing to the same filesystem. So, as a first cut, let's just wrap an around the eval and checkpointing stuff -- we change this: ...to this: That line is getting bit long, so let's break it apart a bit: That looks OK, but there's an extra wrinkle: all of the processes are running the same code, so while the rank zero one will do the eval, the others will continue through the script, so they will go right back around our loop and start training on the next batches -- which is bad. We want our processes to be proceeding in lockstep, iteration-by-iteration. Luckily, the solution is simple: the function in basically says "stop here until all of our processes have reached this point". So we can use two of those -- one before the eval loop, to make sure that all of the processes have finished their training part of the iteration before we do the eval on rank zero, and one after the eval, so that the non-rank-zero processes will wait. One bit of complexity -- we want to do those barriers only if it's a eval iteration, but we want to do them for all processes. So we have to break up the statement, and we wind up with this: That seems to work OK ( code here ), but it does give a warning: So, we want to pass the device ID in when we call . Let's dig into that a bit. Here's the copypasta that I took from the PyTorch tutorial earlier in this post: Let's dig into what that is doing. The environment variable is being set by to 0, 1, 2, etc as appropriate to tell us which process we are on this machine. So the first line is telling PyTorch to use the device with that index for this process . The next line is getting the current accelerator -- that is, an object that represents which acceleration hardware we're using in this process. I think that the best way to see the combination of these two lines is that the first says "use " (or 1, or 2, or...), and then the second says "get the object describing the GPU you're using right now". So it's a slightly indirect way of getting the object containing the details of the GPU in question. 
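(As a quick aside before digging further into that setup code: the eval-on-rank-zero-with-barriers pattern described above boils down to something like the sketch below; the helper name and arguments are assumed for illustration.)

```python
import torch.distributed as dist
import torch.nn as nn

def maybe_evaluate(step: int, eval_interval: int, model: nn.Module) -> None:
    """Run eval/checkpointing on rank 0 only, keeping all ranks in lockstep.

    Note: `step` must be identical on every rank, otherwise some ranks skip
    the barriers and things fall out of sync.
    """
    if step % eval_interval != 0:
        return
    dist.barrier()                  # everyone finishes this step's training first
    if dist.get_rank() == 0:
        # ... run the validation batches and save a checkpoint here ...
        pass
    dist.barrier()                  # other ranks wait until rank 0 is done
```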
Next, we call . A backend in this context is an abstraction of whatever system the device in question is programmed using -- in the case of an Nvidia GPU, it would be some kind of thing that encapsulates CUDA. Once that's done, we call , passing in the backend that we're using. We're saying "initialise the internal data structures for so that they're all set up properly to work with the backend we specified". After that, we can do stuff like getting the global rank with and so on, because has been properly initialized. Presumably at this point we're talking to any other machines in a multi-machine cluster, so we can find out what our world size is and that kind of thing. That extra line at the end, to get the : ...actually looks erroneous to me. All of our code is assuming one process per GPU. So I think we can just use the there as well. Let's rewrite it like this (with some useful comments): That seems to work well! Here's the code . However, I ran it past ChatGPT (largely to validate my understanding of what was going on), and it highlighted something slightly misleading about it. Right now, we're training on a single node, with one process per GPU. But again, one of the neat-o things about this DDP stuff is that it should be able to scale to multiple nodes. Now, remember that is just the rank of the current process on the specific node that it's running on -- hence the name. If we had two machines, each with 8 GPUs, then there would be a process with rank zero on each of them. The "real" rank -- that is, across all machines -- is the one that you can get from once it has been initialised. One of the things it does during that initialisation is to talk to all of the other nodes and work that kind of thing out -- which of the local rank zero processes across all of the machines is the global rank zero process. So we need to use the local rank when working out which GPU we should be running on and so on, but we should not treat it as a global rank. That's actually quite fine in this case, as we're calling inside the training loop when we actually need to use the global one (when indexing into the dataset, or when deciding if we're the process that should be doing evals and checkpoints). The only place where we might be confusing matters is in that print, which is not important anyway, as the training loop also prints out its rank. So, let's tweak it a little more for clarity: That seems to work well! Here's the code . Time to run it past ChatGPT to see if I've made any dumb errors. Turns out that (unsurprisingly) I have... Let's go back to our code that decides whether or not it's an iteration where we need to do a validation run and a checkpoint: The problem is that our index is different in the different processes! Remember, we have this in order to pick out the correct training items: So let's think about it; in the first run through the loop, with 8 GPUs, we would have In the next run through the loop, we'd have: So will give different results for each process. That might not sound like the end of the world -- will only be zero for one of them, so long as is larger than the number of GPUs -- but remember that our validation code looks like this: Now, if different processes have different values for , then will only be called in the one(s) for which it is . But means "wait until all processes have reached this barrier". So the ones that call it will lock up completely until other processes get there, and everything will at best get out-of-sync, and at worst will lock up completely. 
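To see that divergence numerically, here's a tiny standalone sketch, assuming 8 GPUs and an example eval interval of 4:

```python
# Why "index % eval_interval == 0" diverges across ranks under the
# step/offset indexing scheme (8 GPUs, eval_interval=4 as an example).
world_size, eval_interval = 8, 4

for iteration in range(2):                      # first two passes round the loop
    for rank in range(world_size):
        index = iteration * world_size + rank   # rank as offset, world size as step
        hits_eval = index % eval_interval == 0
        print(f"iter {iteration} rank {rank}: index={index}, eval={hits_eval}")

# On each pass only ranks 0 and 4 see index % 4 == 0, so only they reach the
# barrier before the eval -- the other six ranks sail past, the barrier never
# completes, and training deadlocks.
```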
I think that the problem here is that I'm conflating two things: the index of the global step -- that is, one iteration across all GPUs -- and the dataset element that we want to use. In the original one-GPU case that made sense; iteration 0 was on dataset element 0, iteration 1 was on element 1, and so on. But now the offset into the dataset, and the global step, are quite different things. This is quite deeply embedded in the code, but we can fix it! Let's start off by changing our checkpoint code, just to rename things. It keeps track of a variable called , our offset into the training dataset, and uses that both to index into the dataset, and to work out how far through the train we are. The latter is a much better thing to store in a checkpoint, so instead of saving , we'll store (and restore) . Basically, just a rename so that the variables and stored JSON match the new reality. Here's the updated code . Now we need to make a number of minor changes to the training loop just to match that rename of the value that we're checkpointing (eg. for the code to generate the training chart) but the most important change is to our loop. Instead of iterating over our dataset with a step and an offset so that we can index into it, we firstly work out how many global steps there will be: ...then we iterate from our initial global step -- zero if we're starting a fresh train, or whatever global step we were on in a loaded checkpoint plus one if we're doing a continued train from a checkpoint -- up to the : That means that we need to use the global step, the world size, and our current rank to work out which dataset item we should be training on for this process at this global step. Let's say that we have eight processes; on the 0th global step, we should have rank 0 training on dataset item 0, rank 1 on item 1, and so on. On the next global step, rank 0 should train on item 8, rank 1 on 9, and so on. So: That's actually much more elegant than the earlier code, and seems to work fine. Here it is . Phew, glad to have caught that before I started spending money on machines -- it would have been confusing if everything locked up. Thanks, ChatGPT! Another thing raised by ChatGPT is about the validation. We don't want to validate across all of the validation dataset -- we're using a number from the . I have this code: This looked like a nice, quick way to get the first elements of the validation dataset. But ChatGPT told me it would raise. It didn't, though -- why? The problem is that I had set to in my training config for testing. Stepping through what that slice does, when we run : Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack". Nasty! AI code review certainly helped me dodge a bullet on that one. Let's fix it; it's not a big change: we can just do this: ...and that works! So here's the code now . So, I think we have one final issue, which is the training and validation datasets.
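Here's a sketch of that restructured loop with the arithmetic spelled out; the names are indicative rather than the repo's exact ones:

```python
import torch.distributed as dist

def train_loop(dataset_len: int, start_step: int = 0) -> None:
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each global step consumes world_size dataset items (one per rank)
    total_global_steps = dataset_len // world_size

    for global_step in range(start_step, total_global_steps):
        # Global step 0: rank 0 -> item 0, rank 1 -> item 1, ...
        # Global step 1 (with 8 GPUs): rank 0 -> item 8, rank 1 -> item 9, ...
        dataset_index = global_step * world_size + rank

        # ... fetch dataset[dataset_index], forward/backward, optimizer.step() ...
        # Eval/checkpoint checks now key off global_step, which is identical on
        # every rank, so the barriers around them stay in lockstep.
```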
In our single-GPU train, we worked out ahead of time how much of FineWeb (or FineWeb-Edu) to train on -- the Chinchilla-optimal number -- and generated a dataset that contained a round number of 6-sequence, 1024-token batches that was the smallest such round number that was larger than our target. We also worked out exactly how large (in terms of batches) our validation dataset needed to be so that each validation run would take five minutes. There was one big issue with that system; when I decided to do an "extended" train on more of the FineWeb-Edu dataset, in order to see whether I could get the loss down further, I had to do some nasty hackery in order to generate a new one. So it would be nice to not have that problem this time around. Additionally, we're likely to be tweaking the batch size quite a lot in this experiment while we find what the appropriate level is to fit onto the cloud GPUs, and also varying how much validation we do -- and additionally, we have the world size to worry about. I think that the best way to give us the flexibility we need will be to pre-convert the complete FineWeb and FineWeb-Edu datasets into the format we need -- each sequence in the dataset converted to GPT-2 tokens, and then those sequences concatenated together, with the token 50257 separating them. It would be good to properly nail down the validation dataset at the same time. So we can have a script that loads up the original dataset as downloaded from Hugging Face, splits it into 99% train, 1% validation, does the conversion, and then saves them as safetensors files. If we use for those (which is just large enough for our 50,257-token vocab), we can fit the ~10B tokens in each dataset's train split into 20 GiB of disk. Not too bad. But there will still be the issue of getting them onto our cloud machines. Let's generate the data, and then work out how to handle that. I tried initially with the code I used last time, adapted to run through the entire dataset . It does the 99%/1% train/validation split, and then for each of those generates a single massive tensor of tokens like this: It almost worked! To my surprise, it got all the way to the end, and only blew up with an out-of-memory error when it was trying to save the result -- and it did that completely silently, so I thought it had worked right up until I tried to check the file on disk to see how large it was, and it wasn't there. The obvious tweak: set the list to just after the , to free up the memory it's using. Given that it was the save that triggered the OOM, you'd think that that would be enough -- but it turned out not to be so. Rather than mess around with this for much longer, I just decided to add on 128 GiB of swap to my machine temporarily: ...and that was enough to make it run. So I've now generated pre-tokenised, pre-concatenated train and validation sets for both FineWeb and FineWeb-Edu: Now, thinking about how to get it up to the Lambda Labs machines. I have normal 1 Gb residential broadband, so conceivably I could upload 20 GiB in about 200 seconds. But that's assuming that there's no network congestion, so I would expect it to take longer. The LL machines are quite expensive, and I don't want to waste money keeping them up while I'm just uploading data. There are possibilities here: I think the best option is to use option (1), but with the option of also doing (2). The HF dataset will still take time to download to LL, even over the faster network connection. 
That might not be a problem -- but if it is, I can download it once on a cheap instance and use a persistent disk too. Essentially I'd be using the persistent disk as a "cache", and still get the benefits of the easily-shareable datasets on Hugging Face.

So, that decided, let's find out how we can upload a whacking great 20 GiB safetensors file as a dataset on Hugging Face. It turns out that resources like datasets on HF are just Git repositories using the LFS (Large File Storage) plugin to be able to handle, well, large files. Conveniently, given that I'm using to manage my project, there's a plugin that allows me to use their CLI tools with minimal effort, so: Both datasets show up on my profile page on Hugging Face, so that's looking good.

Now it's time to try to upload the data. We'll need to install Git's LFS support first: Now let's try the FineWeb one first: OK, so we need some kind of extra thing to tell it we can use large files on top of the LFS stuff: Right, now let's try again: Weird that it prompted for the credentials twice, but it did appear to try to do something there -- but obviously it didn't work. Let's see if Git over SSH is any better. ...then the same stuff to copy in the files and create the metadata file, then: Looks like the same error. Odd.

Let's try using HF's upload tools rather than Git -- feels like a bit of a cop-out, but maybe it'll work better. That did indeed take about 200 seconds to run, but the upload speed was only about 10 MiB/s -- from the output, I think it must have been compressing it. Anyway, it looks like it succeeded, so let's upload the others! ...and that's done :-) Next, a bit of manual editing of the dataset cards on the Hugging Face website, and we have our two new public datasets: That looks solid.

So, the next thing: change our codebase so that we have some quick and easy way to download them (I'm feeling a little wary of using Git for that after the upload issue), and then to use the downloaded files in our training code. We already have the code to download a dataset; the stuff that I wrote to download FineWeb and FineWeb-Edu originally. Here's the important bit: ...so we can adapt that to download all files in an arbitrary dataset: ...and call that from our , using a new command-line argument , and a new element in our train config JSON file: I was thinking that we'd need extra guard code to not download the dataset again if it's already there, but it looks like handles that all nicely for us.

So we have a way to specify which dataset we should use for a training run, and code to download it. Now we just need to adjust the code that loads our datasets so that instead of looking in the , it looks in the directory returned by : ...and update the directory so that it just blindly uses the directory provided rather than trying to look in a subdirectory: That all works! We successfully download the datasets and try to use them. Here's the code.

But now we have a problem; when the tries to reshape the huge tensor that we have as our inputs: ...it craps out: That makes perfect sense. Our original files were carefully sized for a batch size of six, and 1024-token sequences. We need some way to work out an appropriate slice of both the training and the validation data. Most of the trains are likely to be Chinchilla-optimal, or at least use a Chinchilla-optimal number of tokens -- rounded up appropriately to match our micro-batch size, sequence length, and world size. But I'd like it to be more configurable.
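Going back to the download side for a moment: a sketch of what "download all files in an arbitrary dataset" can look like with huggingface_hub (the function name and the repo id are mine, purely illustrative):

```python
from pathlib import Path
from huggingface_hub import snapshot_download

def download_dataset(repo_id: str, datasets_dir: Path) -> Path:
    """Fetch every file in a Hugging Face dataset repo into datasets_dir/<name>."""
    target = datasets_dir / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=target)
    return target

# e.g. download_dataset("someuser/fineweb-tokenised", Path("datasets"))
```

snapshot_download skips files that are already present and complete, which would explain why no extra "is it already there?" guard code was needed.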
What I'll do is add a key to the training config dictionary, along with a so that we can (for example) train on the first Chinchilla-optimal tokens, then do an extended train continuing on from there. The idea is that we can use as a base, and train on the smallest number of full batches that contains at least that many tokens. For validation, I think that the key that we already have is actually quite nice. Validation is time-bound, and the number of batches is the easiest lever to pull to handle that. However, a would be nice for symmetry. So, here are some numbers for debugging: Now let's use them. Initially, we have this to load the train dataset: Let's work through that one first then make appropriate changes to the validation one. The pieces of information we need to work out which tokens to use are: Let's update our function so that it takes those parameters in that order: ...and now we can write an updated that uses those numbers to get the right number of tokens: Validation is less obvious; I think that the best way to do this (given that the validation dataset is small) is just to have a "magic" value for , which means "just get a round number of full batches starting at . It's also worth remembering that we only do evals on the rank 0 process, so we could in theory pass in a world size of 1 -- but I think that passing in the real world size might be a good idea, because it gives us one fewer thing to change if, in the future, we move towards distributed evals. ...and we change to be able to handle the magic : I also added in a quick sanity check to make sure that we don't get weird behaviour if the is past the end of the original dataset. That all looks good! Running it kicks off training, and validation is running happily every ten global steps, but just with three samples, as configured in the JSON file. Here's the code . One thing that hasn't shown up while running this code locally is that our training loop has this: With one GPU, that's fine, but on a multi-GPU machine, that is going to happen in all of our per-GPU processes -- so they'll all be spamming out progress bars, which will be ugly. So, as a first cut: Now, in order to compare different machines (say, an 8x H100 vs an 8x A100) it would be nice to get tokens-per-second numbers while training. We can do that in the progress bar too! It has a method that adds stuff to the end of the bar, just after the elapsed time and iterations/second numbers. For that, we'll need to have the object available in a variable: ...and now we can count the total tokens seen in the training run, plus keep track of the start time -- just before the start of the training loop: ...then inside, after the training step: That will give us a running average of tokens per second over the train as a whole since the start. Running that, we get a nice progress bar like this (you'll need to scroll to the right): Note that the tokens per second is worse than the just less than 20k that we got when running the single-GPU test previously, but that's due to the testing setup I have -- I'm doing an eval every 10 global steps. Changing that to 1,000,000 so that we just get a single eval when we start, then letting it run for a while to settle down from the initial eval, we get this: ...which is close enough to what we had before. Finally, let's print out some summary information at the end: Ran that on a super-short train with about 50 iterations-worth of tokens, and: Looking good. Here's the code . 
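As a sketch of the rank-0-only progress bar and the running tokens-per-second figure described above (the variable values are illustrative; in the real script they come from the run config and torch.distributed):

```python
import time
from tqdm import tqdm

rank, world_size = 0, 8                      # illustrative; really from torch.distributed
micro_batch_size, sequence_length = 13, 1024
initial_step, total_steps = 0, 100

progress = tqdm(range(initial_step, total_steps), disable=(rank != 0))
tokens_seen, start_time = 0, time.time()
for global_step in progress:
    time.sleep(0.01)                          # stand-in for the real training step
    tokens_seen += micro_batch_size * sequence_length * world_size
    # set_postfix appends to the bar, after the elapsed-time and it/s figures
    progress.set_postfix(tok_per_s=f"{tokens_seen / (time.time() - start_time):,.0f}")
```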
I think we now have something where it's worth spinning up a Lambda Labs machine to run it. Let's kick off a training run on the cheapest two-GPU machine that they have available right now. That's actually not all that cheap, it's a $6.38/hour 2x H100 80 GiB SXM5. But I'm not planning to do a full train on it yet, this is just a sanity test. I won't attach a filesystem this time, either -- let's see how things go without the caching of the datasets that I was considering. First thing: do we have ? Nope. OK, let's install it: Right, now let's clone our repo and set up our environment: And now I think we can just try running it! It took 18 seconds to download the dataset! I don't think we need to worry about the caching thing with persistent disks, at least at this point. But there are a couple of issues here. I didn't put the number of processes in the command line -- I should be using Also, we don't have the XKCD font family. I'll ignore that for now. OK, that's looking good! Let's make our validations happen less often, and see how high we can get the micro-batches with the 80 GiB VRAM we have on each of our two GPUs. Doing a binary chop, I set the micro-batch size to 100 (OOM), then to 50 (OOM), then to 25 (worked), then to 37 (OOM), then 31 (OOM), then 28 (worked), and finally 29 (OOM). So we have a batch size of 28 for our 80 GiB machines. Leaving it for a little while to settle down, and we get to about 142,000 tokens/second. Now, on the 3090, we were training at 20,000 tokens/second. That means that this machine is running at about 7 times the speed. Given that our original train finished in 48 hours, we'd expect the train to finish in about 6, which indeed is the estimated time on the tqdm progress bar. At $6.38 per hour, that comes to $38.28. Not bad! And this instance is actually quite pricey on a per-GPU basis -- it's $3.19 per GPU/hour, whereas there is an 8x H100 that costs $2.99 per GPU/hour. I'm almost tempted to let it run. But the purpose of this run was to work out the bugs. We're going to want to track the training chart -- remember that after every validation run, our training code generates a chart showing the training and validation loss so far, like this one . I ran the normal quick-and-dirty Python webserver command on the instance, inside the directory containing the training chart: My browser didn't connect to it, but looking at the Lambda Labs interface, there's a new "Firewall" section, where you configure rules for allowing incoming connections to your instances. That's sensible, and the default rules are just "allow SSH from any IP" and "allow ping from any IP". Adding one letting anyone access port 8000 fixed the problem, and I saw a directory listing; clicking on the chart showed exactly what I'd expect, but without the XKCD fonts. Nice. Let's work out how to fix that XKCD font thing. Looking around, it seems like there are approximately twenty thousand ways to do it. Here's one that seems to work; firstly, install the font on the system: Now, that installs a font that has the family name 'xkcd Script` (with that erratic capitalisation). So we need to change the code to pick up pretty much anything that looks like it's XKCD, so instead of this: ...we can do this: That seems to work OK. So, now, I think we have the beginnings of a script to set up a Lambda Labs machine so that we can use it. Let's write a with this: ...and give it another go on a fresh machine. Shut this one down -- total cost so far $7.28. Now there are no 2-GPU instances available. 
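For reference, the font-matching change described above can be done with matplotlib's font manager: pick up anything whose family name mentions "xkcd", whatever its capitalisation. A sketch; the author's exact code may differ:

```python
from matplotlib import font_manager
import matplotlib.pyplot as plt

xkcd_families = sorted({f.name for f in font_manager.fontManager.ttflist
                        if "xkcd" in f.name.lower()})
if xkcd_families:                      # e.g. ["xkcd Script"] once the font is installed
    plt.rcParams["font.family"] = xkcd_families
```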
There is a super-cheap 1x A10 (basically the datacenter version of a 3090), though, so let's use that -- we're as certain as we can be that the multi-GPU stuff works, and the proof of the pudding will be whether we can train a model that works. After spinning up our 1x A10 machine: Looking good! I think we have something that (in theory) should work. That cost $0.05. I think it's time to do our first train on a big instance. There are four 8x instances available on Lambda Labs for me right now: I think I'm going to want to train on all of those, to try to work out some kind of metric (dollars per megatoken?) to compare them. But let's start with something reasonably low-end -- in fact, let's try the cheapest, and see what happens. Spin one up, and first thing; after the setup, we need to work out the micro-batch size. Last time we used 28, but this machine has GPUs with half as much VRAM. I did a binary chop again... it turns out to be 13. Now let's think about validation frequency. Let's try to get a feel for how long it will take. We can set the eval batches to (say) 100, so that we can see how fast evals are, but also set the interval to 10,000,000 so that it never does one after the first. It took 11 seconds to run 100 validation batches, and after a few minutes, it settles down at 254,000 tokens/second or so, and is estimating 3h15m to completion. Nice! The cards are an earlier generation to the H100s we used in the two-GPU test, so they're slower, and they have half the VRAM. So eight of them are, working together, about twice as fast as two H100s. Doesn't sound completely crazy. So, in our local train, we spent 5 minutes evaluating every 30 minutes. So our eval time was 16% of our train time. Probably a bit high, but let's run with it. If we're going to take 3 hours training time, then 16% of that is about 28 minutes. Previously we did about 88 evals (44 hours train time, with an eval after each half hour). That seems a bit too high. So let's say that we want to do 50 evals. 28 minutes eval time in total, with 50 of them, means about 30 seconds per eval. If 100 eval batches take 11 seconds, let's approximate it to 300 eval batches. As to the interval between them -- if we want to do 50 over 3h15m, or 195 minutes, then that's one every (let's approximate) 4 minutes. We seem to have settled down to 2.57 iterations per second, so that's about every 617 iterations. Let's bake those in and let it rip. After the run: OK, let's download everything. Looking at the checkpoints, the latest (that is, the last one at the end of the training) and best (the checkpoint that had the lowest validation loss) are the same one, meaning that validation loss kept falling consistently: So let's just download using the "best" symlink to get the weights for that checkpoint: And now we can shut the cloud machine down. Now that the clock is no longer ticking and we aren't spending money on an unused machine, here's the training chart: It looks like we had a couple of gradient spikes there. I'm going to add some gradient clipping code at some point, but I think I'll hold off for a little bit -- I want to do a few cloud trains first to work out the best instance sizes to use, and only then start exploring the possibilities for making the models better. Apart from that, it looks pretty normal. Looking at the billing page on Lambda Labs, that machine was up for about 4 hours and 35 minutes, costing US$10.32 per hour, for a total cost of US$47.35. 
Of that 4h35m, 13,904 seconds, or 3h52m, was the actual training run -- somewhat more than the 3h15m that was predicted at the start of the run. The validation will have accounted for most of that -- we did 50 evals, at 30 seconds each, so that's 25 minutes. That means that 3h40m is accounted for, and the remainder can just be chalked up to noise, I guess.

That leads to one question: do we actually need to be doing validation for these trains? I've been doing validation loops in these trains largely out of habit -- when you're training an ML model, it's just "what you do". The reason you'd normally hold out a validation set is simple: if you're training over multiple epochs, then eventually your model is going to start overfitting to the training data 2 . You validate as you go along so that you can spot any points where, while the training loss continues to drop, the validation loss -- which is loss on data that the model hasn't been trained on -- starts rising. That's the classic indicator of overfitting. But for these models we're not doing multiple epochs -- we're just training through a stream of constantly new tokens. So, in fact, there's no real difference between the training data and the validation data, apart from the fact that the validation data is constant. From the model's perspective, it's all new stuff (modulo any repetitions in the dataset, which are possible but I think not likely to be super-common in something as curated as FineWeb).

Now, in this post I'm aiming to identify the best options for training in the cloud -- cost in terms of dollars and time. I don't want to change the model itself or the training strategy, because I want whatever I come up with to be roughly equivalent to the models I trained on my own machine. Exploring enhancements is for the next post. (Of course, given that the batch size is one of the levers I want to experiment with, and training on larger machines already means that I'm doing micro-batches larger than the batch size of 6 that I used locally, and then the overall batches are 8 times larger, that's not quite true.)

Validation, however, doesn't actually affect the training runs in any direct way. I could in theory remove it. However, that is a relatively large change to the code, as I've kind of linked it in with my checkpointing code. I think that what I'll do for now is leave it in. Validation will scale at the same rate as training (so long as I leave the eval batches constant), so leaving it in will give me a clean comparison between machine types. And I can keep notes on how much time was spent on validation for each train so that I can subtract it from the total time if that proves useful. However, when I start tweaking the training code with changes beyond the batch size, I should probably try removing validation first.

Anyway, while validation during the training run might not be important, evaluating the model at the end and seeing how it compares to others is! Let's do that next. There were two important post-train evals that I did on the models that I trained locally: There was also a simple smoke test -- how does the model predict that the phrase ...should continue? I should do the same three tests here. A simple autoregressive generation script is easy enough to knock together, and: All we're looking for here is basic coherency, and I think this is good enough to pass that filter.

Next, the loss-style testing. What I think I want to be able to do here is just take a file and run an eval against a standard dataset.
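The generation script mentioned above might look something like this minimal sampling loop -- it assumes a GPT-style model that maps a (batch, seq) tensor of token ids to (batch, seq, vocab) logits, and a GPT-2 tokeniser; the details of loading both are omitted:

```python
import torch

@torch.no_grad()
def generate(model, tokeniser, prompt: str, max_new_tokens: int = 50,
             temperature: float = 0.8, device: str = "cuda") -> str:
    """Sample one continuation token at a time from the model's logits."""
    ids = torch.tensor([tokeniser.encode(prompt)], device=device)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature   # logits for the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokeniser.decode(ids[0].tolist())
```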
I did not generate my own test set, but I did generate a much-larger-than-necessary eval set, 1% of both FineWeb and FineWeb-Edu -- that's 100 million tokens or so in both cases. In the validation that I was doing during the train just now, I did 300 batches of 1,024 tokens with a micro-batch size of 13. That only ran on the rank 0 process, so that's 300 × 13 × 1,024 ≈ 4 million tokens. Not even 4% of the validation data.

Now, for the local eval, I think it makes sense to make it run for about five minutes -- that's just for my own convenience, I don't want to spend very long -- and I know from the previous local train that I can do 3,200 batches of six 1,024-token sequences in that time: So, somewhat arbitrarily, let's use the 19,660,800 tokens starting at position 50,000,000 in the FineWeb validation dataset for our tests -- they'll never be used for training or validation during the training loop. It's kind of a hack, but it'll do for now. Here's the code. It should be easy enough to understand; it did require one tweak to our existing function, though: Originally, that function worked out the actual number of tokens to use by working out the size of each global batch, dividing our requested minimum number of tokens by that size and taking the floor, adding on one, then multiplying that by the global batch size. That works fine in cases where the is not a multiple of the global batch size -- it gives us a round number of batches that contains at least . But if is already a multiple of the global batch size, it gives us an extra batch at the end. So I added a special case to avoid that.

Anyway, running that gives us a loss: That's actually quite a lot lower than we were seeing with the locally-trained models on the test dataset I was using then -- but, of course, it's a different dataset so it's not strictly comparable. Let's run the same test against them: That's really interesting! Those numbers are really close to the numbers I got in the last post. That does make some kind of sense, though -- while the numbers aren't strictly comparable, as I said, both the dataset that I was using then and the one I'm using now are essentially random stuff from FineWeb, so I guess they must be more similar than I thought.

But, importantly, the loss on the newly-trained model is much lower -- 3.674 rather than > 3.9 for all three of the older locally-trained models. Now, the only big difference between this training run and the ones that I did locally is the batch size. As I said in the last post, while I felt that the difference between my batch size of six and the (reported) batch size of 512 for the original GPT-2 was the least-likely cause of the differences in the results, Gemini told me that it thought it was the most likely cause. It looks like Gemini (and, I should note, on Hacker News) might have been right! Batch size is super-important.

Let's do the same eval with the OpenAI weights. I wrote a quick script (in my old 'LLM from scratch' repo, which has the code used in the book) to load up the GPT-2 weights and save them as a safetensors file. When I ran that, I got an interesting error: That was easy enough to fix; in the book's code we assign the weights that have been loaded from the OpenAI TensorFlow checkpoint files with a function called that looks like this: Just adding a call to to the last line fixed the error: ...and as a result, I had safetensors files for the original OpenAI models: So now we can run our test against them: Excellent.
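To make that rounding rule concrete, here's a sketch of the calculation with the special case included (the names and signature are mine):

```python
def tokens_to_use(min_tokens: int, micro_batch_size: int, world_size: int,
                  sequence_length: int) -> int:
    """Round min_tokens up to a whole number of global batches, without adding
    an extra batch when it is already an exact multiple."""
    global_batch_tokens = micro_batch_size * world_size * sequence_length
    n_batches, remainder = divmod(min_tokens, global_batch_tokens)
    if remainder:                      # the special case: no +1 when it divides exactly
        n_batches += 1
    return n_batches * global_batch_tokens

# 300 eval batches of 13 x 1,024 tokens on the rank 0 process:
assert tokens_to_use(3_993_600, 13, 1, 1024) == 3_993_600   # exact multiple, unchanged
assert tokens_to_use(4_000_000, 13, 1, 1024) == 4_006_912   # otherwise rounded up
```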
Let's start putting together a table of these results: That's pretty amazing. Having a batch size of 13 micro-batches over eight GPUs, or 104 in total, seems to have massively improved the model -- it's much closer to the original weights. It will be interesting to see whether I get further improvements when I move to the larger machines, which (due to having more VRAM) will have larger possible micro-batches, so we'll get larger global batch sizes. It certainly makes me think that I could have got much better results locally by using gradient accumulation, which would mimic the effects of a larger batch size by running multiple smaller batches through, without doing an optimiser step each time, then doing one big update once enough has gone through. But all of that is for another day.

Let's try the instruction fine-tuning test now. I decided to pretty much re-use my adapted version of the code from the book; that meant that I was borrowing quite a lot of Raschka's code, which he has released under the Apache 2 license. I normally use the MIT license for my code, but I'm not married to it, so I relicensed the whole repo as Apache 2 with some specific headers to say which parts came from "Build a Large Language Model (from Scratch)", and added this code. It downloads the Alpaca dataset from the site for the book, splits it into train/validation/test splits, trains on the training set, evaluating each epoch and bailing out (and restoring the previous epoch's weights) when validation loss starts rising, then runs through the test set generating responses, and finally sends them all off to the OpenAI API for GPT-5.1 to judge them.

Running it against our new model gets a score of 17.09. Let's try the various other models and build out our table: Interesting! In the last run, I found the instruction fine-tune numbers came out as FineWeb-Edu extended > FineWeb > FineWeb-Edu, but here we have FineWeb-Edu > FineWeb > FineWeb-Edu extended -- exactly the opposite! I do have to wonder, though, how precise a measure this is. While the training should be fairly consistent (though I don't have a random seed in there to enforce it), the fact that we're using an LLM as a judge means that there is an element of randomness coming in here. Indeed, I re-ran the FineWeb-Edu extended train test, just to see what I got, and it came up with an even-worse 12.12. So I don't think we can read a huge amount into these numbers -- well, unless we can get the numbers significantly up. While it looks like a 2.5-point difference might just be randomness, I doubt that a 10-point difference could be.

I think we've done the tests that we need for this model now, and we have a testing procedure in place. So let's train some further models on different instance sizes, and gather numbers. This is the biggest machine available on Lambda Labs right now, and is only sporadically available; one happens to be there now, so let's give it a go. First, we need to create the runs/8xb200m160 directory, initially with a that is a clone of the one I did for the last train, , then spin up the machine. As before, we need to log in, clone the repo, then in it run the script, run , and try to run the script: It crapped out because there was no datasets directory, which is an annoyance. We should create it if it doesn't exist. Create the directory, and run it again. It took a while to download the dataset, because every per-GPU process downloads it separately.
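The GPT-5.1 judging step a little further up is, at its core, one API call per generated test response. A heavily simplified sketch -- the prompt wording and the 0-100 scale here are my guesses at the shape of it, not Raschka's or the author's actual prompt:

```python
from openai import OpenAI

client = OpenAI()   # expects OPENAI_API_KEY in the environment

def judge(instruction: str, reference: str, response: str) -> int:
    """Ask the judge model for a 0-100 score for one generated response."""
    prompt = (
        f"Instruction:\n{instruction}\n\nReference answer:\n{reference}\n\n"
        f"Model response:\n{response}\n\n"
        "Score the model response from 0 to 100. Reply with the number only."
    )
    completion = client.chat.completions.create(
        model="gpt-5.1",   # the judge model named in the post
        messages=[{"role": "user", "content": prompt}],
    )
    return int(completion.choices[0].message.content.strip())
```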
That only took a minute or two, but it was a waste of time; I think we should only download it from the rank 0 process, with some barriers to make the other processes pause. Next, we need to do a binary chop on the micro-batch size, starting with a low of 13 (which I know will be fine because it worked on the 40 GiB GPUs that we used last time), and a high of 100 (fairly random, just something I'm pretty sure will fail). While doing that, a few things are standing out, both to do with validation. When the script starts, it does one training iteration, then goes straight into validation. Then it starts the training run proper. However: We're going to need to work out some kind of fix for that, because it's taken me 17 minutes from spinning up the machine to getting a size for our micro-batches -- which happens to be 64. On a machine that costs US$39.92/hour, that's an expensive test! We'll look into that later.

Anyway, a batch size of 64 is pretty neat, as with 8 GPUs, that means we have a global batch size of 512 -- exactly the same as in the original GPT-2 paper! So, let's kick off the train. It takes about 7 minutes to get to the first checkpoint, at which point it's averaging 801,221 tokens/second. That pattern repeats, and with about one minute to do the validation, we're spending about 12.5% of the time on this machine validating. Hmm. A further indication that we might want to remove the validation stuff if it's not adding any value. Eventually, it finishes: So, that's 1h9m50s. The final validation loss is not as good as the previous run on the 8x A100 40 GiB machine, where we got down to 3.675. Given that we're using the same validation dataset as the previous run, that's meaningful: this is not as good a model, it seems. Again, latest and best checkpoints are the same one: So we can download everything: ...and here's the training chart: OK, so that's smoother than the last one -- no loss spikes. Maybe the larger batch size smoothed them?

Let's think a bit about the cost of this train. From Lambda Labs, we had that machine running for a little over 1h30m. At US$39.92/hour, the total cost was US$60.25. Yikes. So, knocking off the 1h10 or so for the train, we have 20m to allow for -- which matches up quite well to the 17 minutes of fiddling with batch sizes, and then 3 minutes to download all of the files. If this blog post isn't going to cost significantly more than it needs to, we need to get that down. Of the US$60.25, just over US$13 was spent on identifying the batch size. Only US$46.57 was spent on the train itself. We also did 11 validation runs as part of that; at a minute each, those cost US$7.32. So, excluding validation, we're below US$40 for the train.

Now, let's run our tests. First, the smoke test: we get this: "...on all other website for..." is a bit rubbish. Still, on to the loss: That's in line with the training loss -- worse than the loss I got with the one trained on the smaller machine, with its corresponding smaller batch size, but still better than any of our local trains. Still interesting, though -- larger batches are not guaranteed to give better results. More investigation needed there! On to the instruction fine-tuning test. That gives us a score of 13.89 -- the worst that we've seen yet! I think I'll put together a full table including these results later; I want to try training on some other, differently sized machines first, and we can aggregate the results at the end.
But before we do that, let's make some changes to the scripts to fix some of those QoL issues we encountered in that last train. The first irritation was that it errored out saying that was not a directory when it didn't exist. The script takes a datasets directory as one of its command-line options, and it's reasonable that it checks that it really is a directory (rather than, say, a file or a symlink): ...but if it doesn't exist, it might as well create it first. Now, I could just put this before the check: ...but remember, this code is run by multiple processes -- so they could easily trip over a race condition here. What I want is to have just one of them do this; I've deemed the rank 0 process the "special" one for validation, printing the progress bar, and so on, so we may as well treat it that way here.

But -- there's a difference! Rank zero is the one that should be printing stuff out, it's true. And right now, we only have one node participating in this train. But I do want to avoid simple errors that would make it hard to run multi-node in the future. Now, if we have multiple nodes, then each one will have its own filesystem (unless we're using NFS or something like that), so we'll need a separate "datasets" directory for all of them. What we want is to do these checks on one process on each node. Usefully, we have the variable that is defined earlier in , which is per-node. Again, let's imagine we have two nodes with two GPUs each. Node 0 might be running the processes with global rank 0 and 1, and node 1 might have global ranks 2 and 3. On node 0, the processes would have local ranks 0 and 1 respectively, but on node 1, they'd also be local ranks 0 and 1.

So, the full code becomes this: Note the barrier; we don't want the other processes to check whether is a directory until the local rank 0 process has had a chance to create it. (Of course, if we were running this on a setup where all of the nodes shared a filesystem, it wouldn't work -- in that case we'd want to use the global rank that we can get from instead. But we can burn that bridge if we ever come to it ;-) Phew, that was a bit more work than I expected! But it sets us up nicely for the next QoL fix on my to-do list.

I don't like the fact that every process downloaded the whole dataset. The actually handled it pretty gracefully -- none of the processes tripped over any of the others. Indeed, it looks like there was some kind of global queueing going on, so they downloaded it one after the other. But it did take time -- maybe a minute or two in total, and with the clock ticking on that ~US$40/hour machine, that felt a bit stress-inducing. So: I think it would be best to only do that from the rank 0 process as well. The code that downloads the dataset is just after the bit we've been looking at: ...and looks like this: Now, the docs for say that the parameter is: If provided, the downloaded files will be placed under this directory. ...and the return value is this: We happen to be passing in a object for , and we're not in mode -- it defaults to . So all we're doing by returning that wrapped in a object is a slightly indirect way of returning the path that we're passing in as . For tidiness, I really want to gate the call to in with the same rank stuff as we did for the directory creation. So, let's change the setup so that takes the path to the directory where we want this specific dataset to be, not the generic "all datasets" directory.
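A compact sketch of that create-on-local-rank-0-then-barrier dance -- it assumes the process group is already initialised and that LOCAL_RANK comes from torchrun; variable names are mine:

```python
import os
from pathlib import Path
import torch.distributed as dist

datasets_dir = Path("datasets")
local_rank = int(os.environ["LOCAL_RANK"])     # per-node rank, set by torchrun

if local_rank == 0:
    datasets_dir.mkdir(parents=True, exist_ok=True)
dist.barrier()   # nobody checks the directory until local rank 0 has had a chance to create it

if not datasets_dir.is_dir():
    raise SystemExit(f"{datasets_dir} is not a directory")
```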
And given that we're now passing this specific path into the function, we don't need to return it: Now it's just a wrapper around a single call to , which I'm not entirely sure about (it's a code smell that I'm probably creating an unnecessary level of abstraction) but I think I'm happiest leaving it that way for now, as it does hide away a bit of messiness in the HF hub API. 3 That means that we can now combine the directory-checking logic that we fixed above with download-on-local-rank-zero-only code like this: Here's the updated code with those fixes.

Now, let's move on to validation. I'm increasingly of the opinion that the validation steps are just adding to the cost without much in the way of benefit. Additionally, the validation takes a different amount of time for each batch size, and happens a different number of times in each train -- remember, it's batches every global steps, and the batch size varies based on the micro-batch size, which is different for different amounts of GPU VRAM, and the total number of global steps in a train also varies based on the size of each batch. So that means that if we want to compare apples to apples in any final comparison of the time and money cost of training models on different kinds of Lambda Labs machines, we'll want to exclude the validation cost -- once we've settled on a machine type, we're going to want to fine-tune the validation size for that in much more detail than I have to date, assuming we don't drop it entirely. However: I'm loath to make such a fundamental change halfway through this comparison. It's tightly coupled to the checkpointing code, and the charting code, and so on. So I think that for this post, I'm just going to keep it there, and keep track of how much time (roughly) we're spending on each validation step for each train, so that we can remove it and get a "pure" train-time only comparison between the different kinds of machines. It's not pretty, but I think it's better than changing horses mid-stream.

On the other hand, the validation is a real pain when doing the binary chop to find out the maximum micro-batch size for our VRAM before we start the training run. That's because we have to wait for one validation to run before we get into the full training loop, which makes it slower. On top of that, having to do a manual binary chop is a PITA. What I think would be a true QoL improvement for the future trains is something that does the binary chop for us, using a dummy training loop. We run it once on each new machine type, get a micro-batch size to plug into our training parameters, and then let it rip. This will re-use so much of the code from the training script that I think it actually is just an alternative way of running it. After a bit of hacking, I came up with this updated code -- the diff is a bit hairy, but essentially: That takes just over six seconds to find the correct batch size on my local machine; with multiple GPUs, I expect it will be slower (there's a spinup overhead to start all of the per-GPU processes), but I'm sure it won't be as bad as the manual binary chops with validation that I was doing, and will be less error-prone.

Right! We've done some QoL stuff, let's try another machine size on Lambda Labs :-) These are the machines that Andrej Karpathy is recommending for training nanochat, so let's see how we do with them. They cost US$23.92/hour; let's see how it works out.
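Before trying the next machine, here's roughly the shape of that automated binary chop. The OOM check mirrors the "look inside the RuntimeError's message" approach described later in the post, and try_train_steps stands in for the real short, validation-free training run -- names are mine:

```python
import torch

def is_oom(err: RuntimeError) -> bool:
    # PyTorch raises a plain RuntimeError for CUDA OOMs; inspect the message.
    return "out of memory" in str(err).lower()

def fits(micro_batch_size: int, try_train_steps) -> bool:
    """Run a few dummy training steps at this size; False if we hit an OOM."""
    try:
        try_train_steps(micro_batch_size, max_steps=3, validate=False)
        return True
    except RuntimeError as err:
        if not is_oom(err):
            raise
        torch.cuda.empty_cache()
        return False

def find_max_micro_batch_size(try_train_steps, low: int = 1, high: int = 70) -> int:
    """Binary chop between a size known to fit and one known to fail."""
    assert fits(low, try_train_steps) and not fits(high, try_train_steps)
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid, try_train_steps):
            low = mid
        else:
            high = mid
    return low
```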
Here are the steps: Now let's download our dataset and find our micro-batch size: That took less than a minute to run -- nice! Now we can put that micro-batch size in . It does seem a little small -- after all, we could fit a batch of 64 into 160 GiB -- but I'll do some analysis later. Actually, before we kick off the train, let's see how long all of the preparatory steps took to run before we can do that -- not just the micro-batch-size script, but also the installation of the dependencies, the clone, and any overhead from boot time etc: Five minutes total. Not bad. Let's start the train: The initial validation run took 38 seconds, and then we started off. At 4m37s in, we get the first real validation run; at that point, it's running at 493k tokens/second. Eventually, it finishes, having taken about 1h50 including all of the validations. Here's the training chart: Two things stand out here: Further evidence that gradient clipping is likely to be an excellent addition to our training loop! It's also worth noting that the train loss spikes at the same time as the validation loss, so getting rid of the latter would still allow us to get a "best" checkpoint to compare with the latest at the end of the train. The machine was up and running for 2h9m, costing US$23.92/hour, for a total cost of US$51.47. The train took 6,650.197 seconds, so about 1h50m. Allowing for five minutes setup time, that's 1h55m accounted for. There's an extra 14m there -- that was because downloading those two checkpoints to my machine took quite a long time due to local network issues. Might want to look into ways to avoid that later. And for later cost-accounting purposes, we should note that it took 38 seconds or so for each validation run, and we can see on the chart that there were 24 of them. So, firstly, let's give our two models -- the best one and the latest one -- a smoke test: Both of those look OK! Now let's try the loss test. I started running it, but when it started downloading the dataset, I realised that it needed updating to allow for the changes I made to -- ooops! That done, let's give it a run for both of our models: As you'd expect, the best checkpoint has somewhat better loss, at 3.725, than the last one, with 3.734. Once again, better than our local trains, but not quite as good as the result with the first cloud train on that 8x A100 40 GiB machine, which was 3.674. Again, I'll put together a table comparing all of these results at the end. Does that make any real difference with the instruction fine-tune test? The test prints a lot out, but the headline numbers: So that was interesting! However, I am getting ever less convinced that the IFT test is a useful one; the randomness of the LLM-as-a-judge responses means that I don't think it can be consistent. Perhaps a better way to do this would be to batch up all of the models, and then give GPT5.1 answers from "model A", "model B", and so on all in one query, and then to ask it to give them scores all at the same time. That would hopefully make things at least a bit more consistent. Something to ponder later, I think. In the meantime, one extra thing I wanted to dig into before going on to the last train for this post: I mentioned that I thought that the batch size for that last run, 27, was a bit small considering that we'd managed to fit a size of 64 into the 160 GiB/GPU machine. 
But after thinking about it for a bit, it occurs to me that during my experiments doing fine-tuning, I came to the conclusion that memory use scaled linearly with batch size , with a fixed amount per element in the batch (the activations for the model for that batch element), plus an overhead (the model itself, the optimiser, and perhaps other stuff). We have batch sizes for: Now, that is slightly messy data because each memory "measurement" is the size of the card's VRAM, not the amount of VRAM we actually used -- there might have been anything from zero to just less than one extra batch element's worth of "spare" space -- but we can see what we get with a simple linear regression: And if we plot that, we get this: Nice! That fits really well. So we have an overhead of about 11.5 GiB, then about 2.35 GiB per batch element on top of that. That is, of course, somewhat sad news for anyone trying to repro this on a GPU with 12 GiB -- looks like it would be just too small to even fit in a single-element batch after the overhead :-( Anyway, that's been a bit of a side quest. Let's try our last machine size for what has (once again) turned into a bit of a monster of a blog post... This is the same kind of instance as the first train in this post, except that it has double the VRAM per GPU. Let's see what we can do with it. Once again, we create the run file, commit and push, then spin up the machine. On it, we clone the repo, run then . Next, we can find our micro-batch size: Interesting, we managed to squeeze an extra one in compared to the H100's batch size of 27, despite having exactly the same amount of VRAM! Not sure what might have caused that. It took 4 minutes to get to this point, so let's get that batch size into the config and kick off the run. The initial validation takes 1m06s, which is consistent throughout the train. The first real val run at 8m15s in, and the estimated train time is 2h35m, with a tokens-per-second of 286,188. At the end: Again, the latest and the best global steps are the same (despite some loss spikes): ...so we just need to download that and shut down the machine. How much did that cost us? The machine was running for 3h25m, costing US$14.32 / hour, for a total of US$48.76. Our train took 11,532 seconds, which is 3h12m, and our setup took about 4 minutes -- maybe five including the time required to update the train config with the micro-batch size, so we have 7 minutes on top of that, which is about the amount of time it took to download the model. Let's run some evals! Our smoke test gives us this: Coherent enough, I think! Now the loss on our test dataset; it comes out as 3.730, so pretty similar to our other cloud trains, apart from the oddly-low one on the 40 GiB GPUs. Now let's see what GPT-5.1 thinks of the instruction fine-tuned version. It only needs two epochs of fine-tuning, and believes that "The author of 'Pride and Prejudice' is 'Pride and Prejudice'", which is not promising, and gets a score in the same kind of range as the other models, 11.71. So: we've trained four models on four different machine sizes. Let's see how they stack up against each other, against our locally-trained models, and the original OpenAI GPT-2 weights. So, I've trained four of my 163M-parameter GPT-2 models, using almost exactly the same dataset -- the Chinchilla-optimal number of tokens, rounded up to make an even number of batches. 
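That linear fit is easy to reproduce from the numbers in the post (24/40/80/160 GiB cards giving micro-batch sizes of 6, 13, 27 and 64):

```python
import numpy as np

batch_sizes = np.array([6, 13, 27, 64])    # 3090, A100 40 GiB, H100 80 GiB, B200 160 GiB
vram_gib    = np.array([24, 40, 80, 160])

per_element, overhead = np.polyfit(batch_sizes, vram_gib, 1)
print(f"~{overhead:.1f} GiB overhead + ~{per_element:.2f} GiB per batch element")
# -> roughly 11.5 GiB overhead and 2.35 GiB per element, matching the text
```

With those coefficients, a 12 GiB card would indeed be under water even at a micro-batch size of 1 (11.5 + 2.35 ≈ 13.9 GiB), which is the sad news mentioned above.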
I did this on four different multi-GPU machines on Lambda Labs: I've done some evals on each of the models, so let's put those results together in one table -- results for the trains in this blog post, alongside those for the original OpenAI GPT-2 weights, both small and medium, and for the models I got when training locally. For all models, I've provided: I've sorted the models in order of increasing loss on the test set -- so, the best model by that measure is first. The instruction fine-tune results are kind of all over the place, and I'll look into that later 5 . For now, let's focus on the test loss. We have a pretty clear pattern, where the local trains are grouped together at around 4.0, and the cloud trains at around 3.7. For the local trains, as I noticed last time around, FineWeb is counter-intuitively better than FineWeb-Edu. There are two interesting things about the cloud trains: I think that what we're seeing here is that larger batches are better, but only up to a point. It's as if there's some kind of curve like this: I got that by taking the log of the batch size, then asking NumPy to do a polynomial regression -- that is, work out a , b and c so that the formula ...fits it as well as possible: It's kind of interesting that it's such a good fit with such an ad-hoc formula! We have a nice smooth curve hitting almost all of the points, and our optimal batch size looks like it's just a little below that 104 we managed with the smaller cloud machine, at about 97. But it's certainly not something that I'd like to read too much into. Best to treat it as purely illustrative: "it might be something like this". I think digging into that might be an interesting experiment at some later point. A bit of checking around the Internet (and a chat with ChatGPT) suggests that it's something people have looked into in some detail, unsurprisingly. An interesting point ChatGPT raised is that with our pretty much fixed "budget" of tokens -- we're always training on something close to the Chinchilla-optimal number -- then a larger batch size means that we're doing fewer optimiser steps. Intuitively, that sounds like a problem. The larger batches mean that each move across the loss landscape is "better", or at least more stable. But we're doing fewer of those moves over the course of the train. There's obviously a tension between those two. You can imagine a degenerate case where the batch is so large you can fit the entire run into one iteration, so you do just one update of the parameters; that obviously wouldn’t work very well. Anyway, for the purposes of this post, let's flag it as interesting and move on. Let's take a look at costs. Here's another table for those -- for each cloud model, I've listed: What do these numbers tell us, given what we were trying to do here? Like I said at the start, this was a pretty expensive learning experience: I wound up spending US$215.16 on Lambda Labs instances over the course of putting this all together. But it was worth it! At the start of this post (if you can remember so far back), I said I wanted to achieve two things: Yes, absolutely. The trains I did, if we exclude the validation time, each cost between US$35.56 and US$39.14. In time, also excluding validation, the slowest ran for about 3h25m, and the fastest just less than an hour. 
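For the curious, the curve-fitting itself is a one-liner with NumPy; the fit is a quadratic in log batch size, i.e. loss ≈ a·ln(B)² + b·ln(B) + c. Two of the loss values below are illustrative stand-ins (I don't have the exact figures for the local train or the B200 run), so treat the output as ballpark only -- it lands around 100, in the same region as the post's ~97:

```python
import numpy as np

batch = np.array([6, 104, 216, 224, 512])            # global batch sizes from the post
loss  = np.array([4.00, 3.674, 3.725, 3.730, 3.78])  # 4.00 and 3.78 are illustrative

a, b, c = np.polyfit(np.log(batch), loss, 2)   # loss ~= a*ln(B)^2 + b*ln(B) + c
optimal = np.exp(-b / (2 * a))                 # vertex of the parabola in ln(B)
print(f"Fitted optimum at a global batch size of about {optimal:.0f}")
```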
Now, in a future post I want to try making the changes that I listed at the end of my last post to see if I can get the loss lower: If I'm to do those, what I'll need to do is start with a baseline train on one particular size of machine, and then try introducing each change separately to see what happens to loss. I'll want to use a fixed seed for random number generation, so that I start with the same initial weights each time. Given what these experiments have already shown about loss -- that the smallest, cheapest machine has better loss than the other more expensive ones due to what I assume is the batch size -- then that actually feels like exactly the right machine to choose for this. It does take a while to train anything, but three and a half hours is pretty acceptable, I think -- I can do a train or two per day. An 8x A100 with 40 GiB VRAM per GPU is the way forward. So: next steps. I want to: This is going to be fun. Stay tuned! I erroneously called this a "mini-batch" in earlier versions of this post and in the code -- fixed in this commit . The code in this post reflects the correct terminology, but if you follow the links to the earlier versions you will, of course, see the mistaken name.  ↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩ I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right. DataParallel (DP). With this: The default GPU (normally ) is in charge of the process. It gets a batch of data, divides it up into per-GPU "micro-batches", and sends each of those to a thread for each of the other GPUs. It then sends an up-to-date version of the model to each GPU. Next, all of the per-GPU threads do a forward pass on their replica using their specific micro-batch, and send their outputs to the thread for the default GPU. The default GPU thread aggregates all of those outputs (similarly to how the losses across all of our batches and the prefix sequences are aggregated in the normal single-GPU case ) to work out an overall loss. It then does a backward pass. This will start on the default GPU, as the aggregation step is the first thing that it will come to when going backwards through the steps that came up with that overall loss. However, it will then come to operations that happened on the other GPUs and those are (somehow) parallelised. Once that is done, each GPU has gradients that represent how their copies of the model contributed to the overall loss. 
Finally, they send those gradients back to the default GPU, which combines them (I think of this as just being an average, though I gather it's more complex) and applies them, producing an updated model. Then the process repeats; the updated model on the default GPU will be sent to the other GPUs in the second step of the next iteration. DistributedDataParallel (DDP). This does less work on the default GPU and does less copying around. Each GPU has its own process (rather than thread), and is essentially responsible for its own training loop. Right at the very start, the default GPU's process sends the model to all of the others. Then all processes go into their training loop: Firstly, each one works out its own micro-batch (which means you need to have code to make sure that the datasets are properly split across the GPUs) Each model does its own forward pass, then its own backward pass, working out its own independent gradients. As it comes up with those gradients, it broadcasts them to a "reducer", which handles the aggregation. This is done in a distributed way -- there's not just one reducer handling everything. When all models have completed the backward pass, the reducer has a set of combined gradients, which is visible from the per-GPU processes. Each GPU process does its own optimizer step using those combined gradients. That means that there's no model copy required -- each GPU has applied the same gradient update, so they already have in-sync models, assuming everything went well. ZeRO. This is a much more complex system, and I went into how it works in this blog post . , which gets the global rank of this process. In our one-machine case, it returns 0 for the process on , 1 for the one on , and so on. We're already using it in that setup code we looked at earlier: , which tells us how many GPU processes there are (globally -- it would be across all machines if we had more than one) = 0 for the process with rank 0 = 1 for the process with rank 1 = 7 for the process with rank 7 = 8 for the process with rank 0 = 9 for the process with rank 1 = 15 for the process with rank 7 Python calls the on the dataset, passing in a object as , so this code is called with it: Now, because that code doesn't do anything clever with s, they're passed straight down to the tensors that make up and . So it's actually equivalent to this: Or, to rewrite the whole loop (omitting the for clarity): So, the first time through the loop, we try to bind our loop variables like this: That is clearly wrong! It's equivalent to this: ...with code to blow up if has more than two elements -- the normal Python "ValueError: too many values to unpack" But if is set to 2, which it happened to be in my case, then it will silently fail -- our first eval loop will get the first X from the validation set as , and the second X as . Zoom through the records in the dataset in batches of 1,000. For each batch: Tokenising each batch, so we get a list of lists of tokens. Convert that list of lists into a single list tokens separating each item. Convert that list into a PyTorch tensor. Add the tensor to a list. After that's all done, use to convert the list into a single tensor, and then save that with . I can upload the datasets to Hugging Face; their network connection will be better than mine, so I can just pay the price in time of uploading everything from home once, and then I can download them faster from HF to LL. 
That also has the benefit of meaning that after this experiment I can safely delete the local files, but then download them again if I need them. And if anyone else wants to repro this experiment, the data will be easily available to them. Lambda Labs have persistent filesystems that you can use. They cost $0.20/GB/month, so that would be about $5/month for all of my datasets. So I could upload the data to a cheap instance with a persistent filesystem mounted, shut down that instance but keep the filesystem, and then mount it on each machine I use to run tests.

- The world size -- that is, how many per-GPU processes are we running?
- The micro-batch size
- The sequence length

- An 8x B200, with 160 GiB per GPU, at $39.92/hour
- An 8x H100, with 80 GiB per GPU, at $23.92/hour
- An 8x A100, with 80 GiB per GPU, at $14.32/hour
- An 8x A100, with 40 GiB per GPU, at $10.32/hour

- The loss they got on the validation set from the first train. Strictly speaking, I was kind of cheating and using that as a test set.
- The score given by the OpenAI GPT-5.1 model for an instruction-following dataset. This was the one provided in the book -- an Alpaca-style Q&A dataset, with a well-defined train and test set. Each model was fine-tuned on a training set of 85% of the data until loss on a validation set of 5% of the data started rising, and then tested on the remaining 10%. Sebastian Raschka, being a pro, was splitting up the data properly :-)

- If we're going to do validation then it does make some sense to do one at the start -- but doing one training iteration first seems kind of arbitrary (though it's clear how that drops out of the existing code).
- The validation runs on this machine are taking longer than they were on the less-powerful A100 GPUs! That confused me for a bit, until I realised that I hadn't noticed it being slower with the batch-size 13 test, only with the larger ones later in the binary chop. If we're using larger batches, then there's more work to do for the validation.
- Doing this binary chop by hand is annoying and error-prone, and worse, we have to wait for one of those (long) validation runs before we get into proper training.
- The initial training iteration can succeed, while later ones hit memory limits -- it seems like we need to wait for three or four training iterations before we can be sure that we have a workable batch size. Not quite sure why that is, perhaps it's something in the optimiser or the scaler?

- If : Local snapshot path.
- If : A list of DryRunFileInfo objects containing download information.

- I updated the function so that it takes flags to tell it whether or not to do validation (default true) and an optional maximum number of steps, which is by default. With those default values, it does exactly the same as before, of course.
- I created a function, which does all of the dataset-loading stuff that the original function did, and then calls with a -wrapped model. So that maintains the current flow.
- Next, I added a flag to the script; if that's not set, it just calls . However, if it is set, it instead calls a new function, which determines the largest batch size we can fit onto the current hardware for the current run, and (on the rank 0 process only, to avoid log spam), prints it out.
- does what it says on the tin; it confirms that we can train with a batch size of 1, and that we can't with batch size 70 (chosen because the limit was 64 on that massive B200 machine), then chops between them to find the largest batch size that doesn't OOM.
It uses for that -- that just constructs a dataset with the appropriate batch size, then runs a three-step train with no validation to see if it raises an OOM. PyTorch rather messily just raises a generic for those, but we can look inside the exception's message to see if it is an OOM.

- Create the run file, commit and push.
- Spin up the machine.
- On it: Clone the repo

- We had two nasty loss spikes. As a result of the second of those, the best iteration as per validation loss is not the last one.

- Best checkpoint: 4 epochs of fine-tuning, and a score of 11.98 -- another record low! Amusingly, it confidently said "The author of 'Pride and Prejudice' is Sarah Palin".
- Latest checkpoint: 5 epochs of fine-tuning, and a rather good score of 17.91.

- 24 GiB locally, which was 6
- 40 GiB in the first train in this series, which was 13
- 80 GiB in the last one, giving us 27
- 160 GiB in the one on the huge machine, giving us 64

- An 8x A100 40 GiB
- An 8x A100 80 GiB
- An 8x H100 80 GiB
- An 8x B200 160 GiB

- The loss on my test set.
- The results it got on an instruction fine-tune test based on Sebastian Raschka's.
- The global batch size (that is, for single GPU runs, just the batch size, but for the multi-GPU ones, where each batch is made up of per-GPU micro-batches, the per-GPU batch size times the number of GPUs). 4

- They're all consistently better than the local ones.
- The one on the smaller machine is better than the ones on the larger ones; indeed, it looks like the larger the machine, the worse.

- How long the training run took.
- How much the machine cost per hour.
- How much the training run cost.
- How much of that was doing validation (which I'm now thinking is pointless on single-epoch trains like this).
- How much it would have cost, and how long it would have taken if it had been run without validation.

- I wanted to learn how to change a simple single-GPU training loop to make it multi-GPU.
- Could I get the training time for a full base model down from 48 hours to something more manageable -- and, hopefully, not too expensive?

- Removing dropout
- Tweaking the learning rate (and maybe adding the warmup and cosine learning-rate decay stuff I've read about).
- Reverting the architectural differences between our model and the original GPT-2: reintroducing weight tying between the token embeddings and the final linear layer, and also bias in the attention weights.
- Trying full-fat 32-bit precision.
- Fixing the exploding gradients issue with gradient clipping.

- Dig in to the instruction fine-tuning tests a little more -- as I've said above, I'm not 100% happy with how comparable it really is between models, at least given how I've been running it so far.
- Upload the models we have to Hugging Face. I have a new motherboard ready for my PC, and replacing the old one has a risk that I might mess up and break the NVMe drive I have them stored on. I was holding off on this because it would mean sharing Raschka's GPT code, but having noticed that he's already licensed it all under the Apache license, I can release them under the same one.
- Strip out the validation stuff. We can use training loss to track our progress, and losing evals during the train will help keep the cost down.
- Finally, do the trains to see how each of the levers above affects loss.
↩ Disregarding the "grokking" phenomenon where continued training after overfitting, in some cases, can apparently make it start generalising again.  ↩ Of course, people always say that when they add on unnecessary levels of abstraction...  ↩ The GPT-2 paper is annoyingly short on concrete numbers, but they do at least explicitly state that they used a batch size of 512.  ↩ To be strictly honest here, I've already dug into it, but adding a writeup of that to this already absurdly long blog post felt like something adjacent to sadism. Update shortly.  ↩

0 views
A Room of My Own 1 week ago

My Digital Workflow (Jan 2026 Edition)

My digital workflow has evolved quite a bit—really, it’s simplified a lot. I use far fewer apps now. My last post on this was in October 2024 (edited in Mar 2025), before I adopted Bear .  Since then, Bear has launched a web app beta , which means I can access my notes anywhere—especially at work, where we’re PC all the way. https://spasic.me/posts/a-digital-workflow-to-run-my-life I also just posted My App Defaults (as of Jan 2026) . (I struggle with this, so I had to write it down for myself and I’m genuinely getting better at following these rules.) Limit the amount of information I take in and process. Just because I can capture everything doesn’t mean I should Don’t rush to save every interesting idea; if it’s truly important, it’ll come back to me. Be selective about what I consume, especially online. Avoid organizing and exploring new tools. Focus on capturing my own thoughts and ideas and summarizing concepts in my own words. Don’t save everything—let things go. Write, write, write (don’t just consume - create) My “best practice” (but fluid) workflows for processing ideas, information, interests, documentation. Main Documents storage and backup Photographs, videos and their backup All current documents  (all documentation and scans, ebooks, writing, anything that would go into a computer hard drive) backup of old, unused documents and mementoes (notes apps backups, old word doc backups, old work doc backups, old email backups, various mementos) Photographs from my phone upload directly into Dropbox (although for permanent storage I upload manually at the end of the month and delete this automatic backup when I don’t need it anymore) - I have a separate post on how I manage my Memory Keeping and Photographs I use Dropbox’s  “selective sync”  on my laptops (I only sync folders that I currently use) Cost: 120 USD a year What used to live across three or four different apps now lives almost entirely in Bear . Bear is my: Central hub for personal projects and current activities , where I store: Tasks and goals Quarterly and monthly plans Narratives and ongoing notes Central hub for admin and resources
 (attachments mostly live in Dropbox and are linked back to Bear) : Personal information (some password-protected) Frequently accessed info (school/work details, admin notes, medical info, various records - anything I need to look up occasionally) Resources such as links, apps, wishlists, recipes, travel info, etc. Commonplace book and thinking space , where I: Make notes on topics I care about Store notes on topics of interest (old and new) Collect ideas, concepts, and connections in a non-linear way Think freely and explore without structure getting in the way Store my writing (essays, blog posts, stories) Jot things down at random and on the go Dump ideas and brainstorm Cost: 50 NZD a year A single source of truth for all my journaling and mementos. Digital Journal/Diary: My personal journal and diary (with photos). I use it daily most of the time. Mementos and Memory Keeping: This includes text notes, screenshots, random photos, voice messages, things my kids said, audio, video, photos, messages, locations, screenshots, various stats, anything I want to preserve. Logs of books , movies and TV and quotes/wisdom, etc. Some of this is still a work in progress; I am exploring it and it is constantly evolving. I back up Day One periodically in both JSON and PDF formats and store the backups in my Archive folder in Dropbox as well as on external hard drives. Note: I’m considering moving some of my personal journal entries into Bear for reference and long-term safekeeping, but I haven’t decided yet. Cost: 50 NZD per year I’ve written more about my current Trello setup. I organize my lists using the Eisenhower Matrix , along with a backlog for things I want to clear from my mind but may never actually do. All my specific To Dos and projects/tasks Anything that has a date but doesn’t belong in Google Calendar
If it doesn’t need to happen on a specific day, it’s a task rather than an event, so it lives in Trello. Recurring tasks and reminders 
Things like document expiry dates, subscriptions, and periodic check-ins. Small personal projects 
Ideas I’d like to get to at some point, but that don’t need active scheduling yet. Trello is also great for on-the-go capture: I can email tasks directly into Trello On my phone for quick to-dos and relevant info (goes straight into the inbox widget) The email reminders are a bonus Dabble Writer Long-form writing. Novel in progress Memoir (snippets and fragments) After months of research I have settled on Dabble Writer to replace Scrivener for my long-form writing. While I loved Scrivener, I needed something that syncs seamlessly across multiple computers (and on the go) without requiring downloads or worrying about syncing my work. I hope to write about Dabble Writer in another post. COST: One-time purchase (subscription options available too) Kindle and article highlights Article dump for things I might read later (or delete if I don’t). Archive articles only if they’re genuinely worth keeping. Sends articles directly to my Kindle, which is where I prefer to read them. Cost: $40/year I know I could use Readwise Reader for RSS, but it doesn’t feel as casual or as easy to process as Feeder. I subscribe to a lot of personal blogs, and while I don’t read everything all the time, Feeder lets me quickly scan and dip into whatever catches my eye. It feels nice and low-pressure.
Take it or leave it. And it’s free. all appointments and events important dates and birthdays recurring events (like my yoga classes, kids’ sports, group meetings I regularly attend, etc.) syncing with my husband’s and son’s calendars COST: Free My main personal email account since March 2000. I also use it as a kind of archive—emails are such an overlooked record of life and work. NOTE: I do use Gmail for Chrome, YouTube, and similar things, but I genuinely prefer Yahoo to Gmail, even though Google Calendar is my main calendar (am I the only one?). I use Inbox Zero across all my email accounts. How I Finally Settled on Bear for My Notes My One-Board Trello Task Management System How I Use Day One to Track What I Read I Journaled My TV and Movie Watching for a Year Why Did I Wait So Long to Start Using Day One? A Digital Workflow to Run My Life The Eisenhower Matrix I Forgot About (But Still Followed) My App Defaults (Jan 2026 Edition) My App Defaults (Mar 2025 Edition) A Digital Workflow to Run My Life (Mar 2025 Edition)

1 view
The Tymscar Blog 2 weeks ago

Automating What Backblaze Lifecycle Rules Don't Do Instantly

I recently moved from Synology to TrueNAS and set up cloud backups to Backblaze B2. I have two buckets: one for important files like documents, and one for homelab services. The services bucket backs up things like qcow2 disk images for my VMs, some of which are hundreds of gigabytes. When I created the buckets, I set the lifecycle rule to “Keep only the last version of the file.” I assumed this meant Backblaze would automatically replace old versions when new ones arrived. It doesn’t work that way.
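For anyone hitting the same behaviour, a hedged sketch of one way to prune superseded versions yourself via B2's S3-compatible API follows. The endpoint URL, bucket name, and credentials are placeholders, and this assumes B2's versioned buckets behave like standard S3 object versions; it is an illustration, not the post's solution.

```python
# Hypothetical cleanup script for a versioned Backblaze B2 bucket (S3-compatible API).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # placeholder endpoint
    aws_access_key_id="KEY_ID",                              # placeholder key ID
    aws_secret_access_key="APPLICATION_KEY",                 # placeholder app key
)

BUCKET = "homelab-services"  # placeholder bucket name

paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=BUCKET):
    for version in page.get("Versions", []):
        if not version["IsLatest"]:
            # Delete superseded versions right away instead of waiting for
            # the lifecycle rule's daily pass.
            s3.delete_object(
                Bucket=BUCKET, Key=version["Key"], VersionId=version["VersionId"]
            )
    for marker in page.get("DeleteMarkers", []):
        # Hidden ("deleted") files leave markers behind; remove those too.
        s3.delete_object(
            Bucket=BUCKET, Key=marker["Key"], VersionId=marker["VersionId"]
        )
```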

0 views
ava's blog 3 weeks ago

trusting the little guys: issues with 'big tech' alternatives

A while ago, my brother-in-law asked around friends and family if anyone wanted to join the (private) cloud/file service he spun up. Practical, right? Many outside the corporate web believe in smaller services within friend groups, families, and local organizations as the way forward. Instead of trusting big companies who could (or rather, will) enshittify and become too big and bloated (Google, Meta, Microsoft...), we should trust smaller maintainers within our circles. The offer made me ponder what I would upload to the file service, and how much I would trust my brother-in-law with the files. Not just the integrity, but the uptime, the availability when issues arise, how swiftly severe bugs or security issues would be patched, and the uncomfortable question about confidentiality: Should I only upload files I don't mind him seeing, or should I trust that he wouldn't look at them? 1 That made me think: How much do we trust alternatives to big tech? When we host our various things like emails, image backups, blogs, social media accounts etc. with these big companies, a certain professionalism is expected. You're dealing with a corporate entity, so you probably have the following expectations: All of these (whether they are actually realistic and enforceable or not) can give us a sense of security. A cold, sterile business relationship, like the one to our water provider. If we want to switch away from these data-harvesting giants to smaller solutions, we are confronted with the fact that usually, it's a small group of people, or even just one person. Some try to build up a smaller service professionally, but many just do it on the side, as a hobby. A Mastodon or PixelFed instance, another social media alternative, or media sharing. That poses some challenges and questions for the average user: These concerns make smaller services feel less reliable and trustworthy. A big corporation can (and will) obviously mess up as well and the data breaches and downtimes are a lot more impactful, but: The roles are clear, legal identities are divulged publicly if needed (like their data protection officer!), and someone is responsible for an issue. With a small group of strangers or even just one person online that you don't know, this is more opaque and there are not necessarily any consequences, quality control, workflows or customer service. There is often not even a real name offered that you can use for any sort of complaint or legal action. I think I might have talked about this in another blog post or alluded to it, but there is a creator of a variety of indie web services who has refused to delete my accounts since at least 2023. It started with just one I wanted gone, but nowadays I want all of them gone. After multiple fruitless attempts at asking for deletion via email and having no full account deletion in the settings page, I filed an official complaint at the Data Protection Authority responsible for my area. Unfortunately, they were almost entirely useless, because as long as I do not have the full legal name of the person behind all those services, they say they cannot do anything. These fossils do not want to send out an email reprimanding them for being non-compliant despite processing an EU citizen's data and even taking money for it; they insist on sending an actual letter to the person's residence and don't want to put effort into getting that address from the hoster.
Their feedback ended with the great advice that next time, I shouldn't sign up to websites that don't have a privacy policy, proper account deletion process, or a responsible person named. Well, geez, wish I could time travel and tell 2021 me that, who had rose-tinted glasses about indie web alternatives. Nowadays, I indeed don't sign up, and I make sure to remind every project I see that necessitates user accounts to please fulfill at least the PP and the deletion process. I know I cannot make any of them share their full name if they don't want to. Being better than the big players doesn't just involve not doing the excessive data harvesting they do, but also handling the little bit of data you get with care, and having processes in place that make dealing with user data easier, give a lot of control to the user, and, ideally, let them know who they're dealing with. And that's where it really differs from case to case, because at Bearblog, I am really happy with how things are and have turned out so far, despite it only being one person. It is professional, I get amazing customer support, I know the legal identity, and I can find out exactly how data is collected and processed. Plus: There is an account deletion I can initiate on my own without having to message someone and hope for the best. For comparison, it took Cohost (which was run by a small group of people) about 4 months or so to delete my account that I had to request via email, and it took someone I know over a year. That means constantly checking back in whether the deletion has gone through and the profile is still up, and that is not only annoying, but it can also threaten the safety of people who get found by stalkers, family members and others. Some of these things are time-sensitive, and it's irresponsible and non-compliant to not have a better system in place. Strangers are simply hit or miss. Could be a creep that reads all your DMs to other people on the instance, or not. What about a friend? If your friendship breaks apart, do you lose the service and the data accumulated on there? If it's a family member and something really bad happens with your data and account, do you want to risk the family peace by holding them accountable? Honestly, no one wants to set up a formal contract for something like this as it feels silly, and many won't. So what basis do you have? If you are lucky, the indie project you want to use has open‑source code, transparent incident logs, and community reviews and PRs that serve as proxies for professionalism and quality control, but in my view, that is rather uncommon. I don't want to badmouth smaller alternatives, as I am still a big fan of them and rely on them. I just want to discuss these fears and risks, and some of my good and bad experiences. I want them to thrive and do better in these topics. Trust sadly isn't purely rational, and familiarity, perceived competence, contracts, incentives, and consequences play important roles. Reply via email Published 22 Dec, 2025 For the record, I trust him not to look at them, but it's still a thought I had, since I never had to decide that before. ↩ I'm a consumer, and I have consumer rights against this corporation. I don't feel bad about potentially suing them, because I'm suing the company, not one individual. While messaging their support, they (or nowadays, their AI chatbot?) keep it professional and are available in a reasonable time.
I can lodge complaints and expect a fix fairly fast, and downtime is usually resolved within an hour or few. No one person or one department has access to absolutely everything, and especially not unchecked. Lots of eyes, control mechanisms, logs, and separation, limited rights and access on a need-to-know basis. People working there get paid for this, which affects how they treat the service or what they cannot risk doing. There are internal consequences for non-compliance, and there are internal workflows on how to deal with specific cases the same every time; even just deletion requests or requests for personal information. There are far too many employees, and far too many users; why should I, of all people, be interesting enough to have my privacy violated by an employee? I have a right to request my data, and a right to data portability. Due to financial interest in keeping the company going, they're future-proofing. Do I still have consumer rights, even just rights like the GDPR, or not? Would I be comfortable pursuing this person legally if shit went sideways or they abused or leaked my data? Can I even, if this is just an internet stranger with a nickname and an email address? Can I expect a professional relationship about this service? If this is just done on the side as a hobby or experiment, will the person actually continue it after the first few weeks? What do I do if I lose this? Will they have the time and energy to continue to update it and care for it, and keep my data safe? If I need a quick fix or tech help, would they be able to respond in a timely manner? Depending on what it is, it might be urgent. Can I trust this person not to abuse their admin power and look into everything? Even if it's SFW, maybe I wouldn't want a stranger to click through my image files (... and use them for AI training or to make deepfake nudes I don't know about?). Is data portability a thing at all with their service? Can I export the data in any meaningful and useful way? Does the maintainer do any sort of future-proofing?

3 views
Jeremy Daly 1 month ago

We’re Already Living in a Post-Serverless World

This isn’t really about serverless. It’s about what happens when infrastructure stops asking humans to guess the future.

0 views
Nelson Figueroa 1 month ago

GitHub Actions for Pulumi with an AWS S3 Backend

This is a quick guide to set up GitHub Actions for Pulumi with an AWS S3 Backend. There are some differences compared to running commands locally. This guide assumes you have the following: First, set up secrets on your GitHub repository. These will be read by GitHub Actions once we create a workflow YAML. You’ll need to create 3 secrets. You can name them whatever you want, but I’ll be naming them: You can create these by browsing to your GitHub repository > Settings > Secrets and variables > Actions. There are “Environment secrets” and “Repository secrets”. In this case go with “Repository secrets”. Create the three secrets and fill in their respective values. can be whatever you want and doesn’t come from AWS. Make sure you don’t change this after the fact though, or your Pulumi state may break . The end result should look like this: Next, clone the repository locally with . We’ll need to create a few files for a minimum viable Pulumi program. Run and it’ll guide you through the creation of a basic Pulumi program. The language you choose doesn’t matter. Then we can create the YAML file to set up GitHub Actions. Create a YAML file under . I’ll call it in my example. Fill it in with the following YAML, changing values as needed. This runs a so it’ll verify that everything is set up correctly without actually deploying anything. Now push your code to GitHub and see if the GitHub Action workflow ran successfully. You should see output from a successful . If this works, you should be good to go. You can update your Pulumi code and change to in the GitHub Actions YAML file and run it again to actually deploy some infrastructure. An AWS S3 Bucket created and ready to be used with Pulumi An IAM User that has permissions to read/write to the S3 bucket The Access Key and Secret Access Key for the IAM User to use for authenticating to AWS within GitHub Actions A passphrase of your choosing that will be used to encrypt secrets in the Pulumi stack A GitHub repository https://github.com/pulumi/actions
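Since the workflow YAML itself didn't survive the excerpt, here is a hedged sketch of what such a workflow might look like. The secret names, bucket, and stack name are placeholders, and this version installs the Pulumi CLI with the official install script and plain run steps rather than an action; it is an illustration of the idea, not the post's exact file.

```yaml
# Hypothetical workflow sketch; adjust secret names, bucket, and stack to your setup.
name: pulumi-preview
on: [push]

jobs:
  preview:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      PULUMI_CONFIG_PASSPHRASE: ${{ secrets.PULUMI_CONFIG_PASSPHRASE }}
    steps:
      - uses: actions/checkout@v4
      - name: Install the Pulumi CLI
        run: |
          curl -fsSL https://get.pulumi.com | sh
          echo "$HOME/.pulumi/bin" >> "$GITHUB_PATH"
      - name: Log in to the S3 backend
        run: pulumi login s3://my-pulumi-state-bucket
      - name: Preview changes
        run: pulumi preview --stack dev
```

Swapping pulumi preview for pulumi up in the last step is the "actually deploy some infrastructure" change the post describes.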

0 views
Blog System/5 1 month ago

From Azure Functions to FreeBSD

On Thanksgiving morning, I woke up to one of my web services being unavailable. All HTTP requests failed with a “503 Service unavailable” error. I logged into the console, saw a simplistic “Runtime version: Error” message, and was not able to diagnose the problem. I did not spend a lot of time trying to figure the issue out and I didn’t even want to contact the support black hole. Because… there was something else hidden behind an innocent little yellow warning at the top of the dashboard: Migrate your app to Flex Consumption as Linux Consumption will reach EOL on September 30 2028 and will no longer be supported. I had known for a few weeks now, while trying to set up a new app, that all of my Azure Functions apps were on death row. The free plan I was using was going to be decommissioned and the alternatives I tried didn’t seem to support custom handlers written in Rust. I still had three years to deal with this, but hitting a showstopper error pushed me to take action. All of my web services are now hosted by the FreeBSD server in my garage with just a few tweaks to their codebase. This is their migration story. Blog System/5 and the open source projects described below are all made in my limited free time. Subscribe now to show your support; it goes a long way! Back in 2021, I had been developing my EndBASIC language for over a year and I wanted to create a file sharing service for it. Part of this was to satisfy my users, but another part was to force myself into the web services world as I felt “behind”. At that time, I had also been at Microsoft for a few months already working on Azure Storage. One of the perks of the job was something like $300 of yearly credit to deploy stuff on Azure for learning purposes. It was only “natural” that I’d pick Azure for what I wanted to do with EndBASIC. Now… $300 can be plentiful for a simple app, but it can also be paltry. Running a dedicated VM would eat through this in a couple of months, but the serverless model offered by Azure Functions with its “infinite” free tier would go a long way. I looked at their online documentation, found a very good guide on how to deploy Rust-native functions onto a Linux runtime , and… I was sold. I quickly got a bare bones service up and running on Azure Functions and I built it up from there. Based on these foundations, I later developed a separate service for my own site analytics (poorly named EndTRACKER ), and I recently started working on a new service to provide secure auto-unlock of encrypted ZFS volumes (stay tuned!). And, for the most part, the experience with Azure has been neat. I learned a bunch and I got to a point where I had set up “push on green” via GitHub Actions and dual staging vs. prod deployments. The apps ran completely on their own for the last three years, a testament to the stability of the platform and to the value of designing for testability . Until now that is. Compute-wise, I was set: Azure Functions worked fine as the runtime for my apps’ logic and it cost pennies to run, so the $300 was almost untouched. But web services aren’t made of compute alone: they need to store data, which means they need a database. My initial research in 2021 concluded that the only option for a database instance with a free plan was to go with, no surprise, serverless Microsoft SQL Server (MSSQL). I had never used Microsoft’s offering but it couldn’t be that different from PostgreSQL or MySQL, could it? Maybe so, but I didn’t get very far in that line of research. 
The very first blocker I hit was that the MSSQL connection required TLS and this hadn’t been implemented in the connector I chose to use for my Rust-based functions. I wasted two weeks implementing TLS support in (see PR #1200 and PR #1203 ) and got it to work, but that code was not accepted upstream because it conflicted with their business strategy. Needless to say, this was disappointing because getting that to work was a frigging nightmare. In any case, once I passed that point, I started discovering more missing features and bugs in the MSSQL connector, and then I also found some really weird surprises in MSSQL’s dialect of SQL. TL;DR, this turned into a dead end. On the left, the default instance and cost selected by Azure when choosing to create a managed PostgreSQL server today. On the right, minimum possible cost after dialing down CPU, RAM, disk, and availability requirements. I had no choice other than to provision a full PostgreSQL server on Azure. Their onboarding wizard tried to push me towards a pretty beefy and redundant instance that would cost over $600 per month when all I needed was the lowest machine you could get for the amount of traffic I expected. Those options were hidden under a “for development only” panel and riddled with warnings about no redundancy, but after I dialed all the settings down and accepted the “serious risks”, I was left with an instance that’d cost $15 per month or so. This fit well within the free yearly credit I had access to, so that was it.
The way you run a serverless Rust (or Go) service on Azure Functions is by creating a binary that exposes an HTTP server on the port provided to it by the environment variable. Then, you package the binary along with a set of metadata JSON files that tell the runtime what HTTP routes the binary serves and push the packaged ZIP file to Azure. From there on, the Azure Functions runtime handles TLS termination for those routes, spawns your binary server on a micro VM on demand, and redirects the requests to it. By removing the Azure Functions runtime from the picture, I had to make my server binary stand alone. This was actually pretty simple because the binary was already an HTTP server: it just had to be coerced into playing nicely with FreeBSD’s approach to running services. In particular, I had to: Inject configuration variables into the server process at startup time. These used to come from the Azure Functions configuration page, and are necessary to tell the server where the database lives and what credentials to use. Make the service run as an unprivileged user, easily. Create a PID file to track the execution of the process so that the framework could handle restarts and stop requests. Store the logs that the service emits via stderr to a log file, and rotate the log to prevent local disk overruns. Most daemons implement all of the above as features in their own code, but I did not want to have to retrofit all of these into my existing HTTP service in a rush. Fortunately, FreeBSD provides this little tool, daemon(8) , which wraps an existing binary and offers all of the above. This incantation was enough to get me going: I won’t dive into the details of each flag, but to note: specifies which PID file to create; specifies where to store the stdout and stderr of the process; is required for log rotation (much more below); drops privileges to the given user; and specifies the “title” of the process to display in output. The trick was sufficient to inject configuration variables upon process startup, simulating the same environment that my server used to see when spawned by the Azure Functions runtime. Hooking that up into an service script was then trivial: And with that: Ta-da! I had the service running locally and listening to a local port determined in the configuration file. As part of the migration out of Azure Functions, I switched to self-hosting PostgreSQL as well. This was straightforward but required a couple of extra improvements to my web framework: one to stop using a remote PostgreSQL instance for tests (something I should have done eons ago), and another to support local peer authentication to avoid unnecessary passwords. In the call to above, I briefly mentioned the need for the flag to support log rotation. What’s that about? You see, in Unix-like systems, when a process opens a file, the process holds a handle to the open file. If you delete or rename the file, the handle continues to exist exactly as it was . This has two consequences: If you rename the file, all subsequent reads and writes go to the new file location, not the old one. If you delete the file, all subsequent reads and writes continue to go to disk but to a file you cannot reference anymore. You can run out of disk space and, while will confirm the fact, will not let you find what file is actually consuming it! For a long-running daemon that spits out verbose logs, writing them to a file can become problematic because you can end up running out of disk space.
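The post's actual incantation didn't survive the excerpt, so here is a hypothetical daemon(8) invocation of the general shape being described; the service name, paths, and environment variables are all made up, and the flags are the ones the flag descriptions above appear to be referring to.

```sh
# Hypothetical example, not the author's exact command.
# -P: supervisor PID file (what rc and newsyslog can signal)
# -o: file to append the child's stdout/stderr to
# -H: reopen the output file on SIGHUP, so newsyslog can rotate it
# -u: run the child as an unprivileged user
# -t: process title to show in ps(1) output
/usr/sbin/daemon -P /var/run/myservice.pid \
    -o /var/log/myservice/daemon.log -H \
    -u myservice -t myservice \
    /usr/bin/env DATABASE_URL="postgres://myservice@localhost/myservice" \
    /usr/local/libexec/myservice
```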
To solve this problem, daemons typically implement log rotation : a mechanism to keep log sizes in check by moving them aside when a certain period of time passes or when they cross a size threshold, and then only keeping the last N files around. Peeking into one of the many examples in my server, note how is the “live” log where writes go to but there is a daily archive for up to a week: Having all daemons implement log rotation logic on their own would be suboptimal because you’d have duplicate logic throughout the system and you would not be able to configure policy easily for them all. This is where newsyslog(8) on FreeBSD (or on Linux) comes into play. is a tool that rotates log files based on criteria such as size or time and optionally compresses them. But remember: the semantics of open file handles mean that simply renaming log files is insufficient! Once takes action and moves a log file aside, it must ensure that the process that was writing to that file closes the file handle and reopens it so that writes start going to the new place. This is typically done via sending a to the daemon, and is why we need to pass to the call. To illustrate the sequence: The system starts a service via and redirects logs to . runs and determines that needs to be rotated because a day has passed. renames to and creates a new and empty . At this point is still writing to ! sends a to the process. The process closes its file handle for the log, reopens (which is the fresh new log file), and resumes writing. compresses the file for archival now that it’s quiesced. Configuring is easy, but cryptic. We can create a service-specific configuration file under that provides entries for our service, such as: I’ll leave you to the manpage to figure out what the magic is (but in short, it controls retention count, rotation schedule, and compression). As I briefly mentioned earlier, the Azure Functions runtime was responsible for TLS termination in my previous setup. Without such a runtime in place, I had to configure TLS on my own in my HTTP server… or did I? I had been meaning to play with Cloudflare Tunnels for a while given that I already use Cloudflare for DNS. Zero Trust Tunnels allow you to expose a service without opening inbound ports in your firewall. The way this works is by installing the tunnel daemon on your machine and configuring the tunnel to redirect certain URL routes to an internal address (typically ). Cloudflare then acts as the frontend for the requests, handles TLS termination and DDOS protection, and then redirects the request to your local service. Interactions between client machines, Cloudflare servers, the cloudflared tunnel agent, and the actual HTTP servers I wrote. The obvious downside of relying on someone else to do TLS termination instead of doing it yourself on your own server is that they can intercept and modify your traffic. For the kinds of services I run this isn’t a big deal for me, and the simplicity of others dealing with certificates is well welcome. Note that I was already offloading TLS termination to Azure Functions anyway, so this isn’t a downgrade in security or privacy. But using Cloudflare as the frontend came with a little annoyance: CORS handling. You see: the services I run require configuring extra allowed origins, and as soon as I tried to connect to them via the Cloudflare tunnel, I’d get the dreaded “405 Method not allowed” error in the requests. 
Before, I used to configure CORS origins from the Azure Functions console, but no amount of peeking through the Cloudflare console showed me how to do this for my tunneled routes. At some point during the investigation, I assumed that I had to configure CORS on my own server. I’m not sure how I reached that bogus conclusion, but I ended up wasting a few hours implementing a configuration system for CORS in my web framework . Nice addition… but ultimately useless. I had not accounted for the fact that because Cloudflare acts as the frontend for the services, it is the one responsible for handling the pre-flight HTTP requests necessary for CORS. In turn, this means that Cloudflare is where CORS needs to be configured but there is nothing “obvious” about configuring CORS in the Cloudflare portal. AI to the rescue! As skeptical as I am of these tools, it’s true that they work well to get answers to common problems—and figuring out how to deal with CORS in Cloudflare was no exception. They told me to configure a transformation rule that explicitly sets CORS response headers for specific subdomains, and that did the trick: Sample rule configuration on the Cloudflare portal to rewrite CORS response headers. Even though AI was correct in this case, the whole thing looked fishy to me, so I did spend time reading about the inner workings of CORS to make sure I understood what this proposed solution was about and to gain my own confidence that it was correct. By now, my web services are fully running on my FreeBSD machine. The above may have seemed complicated, but in reality it was all just a few hours of work on Thanksgiving morning. Let’s conclude by analyzing the results of the transition. On the plus side, here is what I’ve gained: Predictability: Running in the cloud puts you at the mercy of the upgrade and product discontinuation treadmill of big cloud providers. It’s no fun to have to be paying attention to deprecation messages and adjust to changes no matter how long the deadlines are. FreeBSD also evolves, of course, but it has remained pretty much the same over the last 30 years and I have no reason to believe it’ll significantly change in the years to come. Performance: My apps are so much faster now it’s ridiculous. The serverless runtime of Azure Functions starts quickly for sure, but it just can’t beat a server that’s continuously running and that has hot caches at all layers. That said, I bet the real difference in performance for my use case comes from collocating the app servers with the database, duh. Ease of management: In the past, having automated deployments via GitHub Actions to Azure Functions was pretty cool, not gonna lie. But… being now able to deploy with a trivial , perform PostgreSQL administration tasks with just a , and inspect logs trivially and quickly by looking at beats any sort of online UI and distributed system. “Doesn’t scale” you say, but it scales up my time . Cost: My Azure bill has gone from $20/month, the majority of which was going into the managed PostgreSQL instance, to almost zero. Yes, the server I’m running in the garage is probably costing me the same or more in electricity, but I was running it anyway already for other reasons. And here is what I’ve lost (for now): Availability (and redundancy): The cloud gives you the chance of very high availability by providing access to multiple regions. Leveraging these extra availability features is not cheap and often requires extra work, and I wasn’t taking advantage of them in my previous setup.
So, I haven’t really decreased redundancy, but it’s funny that the day right after I finished the migration, I lost power for about 2 hours. Hah, I think I hadn’t suffered any outages with Azure other than the one described in this article. A staging deployment: In my previous setup, I had dual prod and staging deployments (via Azure Functions slots and separate PostgreSQL databases—not servers) and it was cool to deploy first to staging, perform some manual validations, and then promote the deployment to prod. In practice, this was rather annoying because the deployment flow was very slow and not fully automated (see “manual testing”), but it indeed saved me from breaking prod a few times. Auto-deployments: Lastly and also in my previous setup, I had automated the push to staging and prod by simply updating tags in the GitHub repository. Once again, this was convenient, but the biggest benefit of it all was that the prod build process was “containerized” and not subject to environmental interference. I could very well set up a cron job or webhook-triggered local service that rebuilt and deployed my services on push… but it’s now hard to beat the simplicity of . None of the above losses are inherent to self-hosting, of course. I could provide alternatives for them all and at some point I will; consider them to-dos!

0 views
Stratechery 1 month ago

AWS re:Invent, Agents for AWS, Nova Forge

AWS re:Invent sought to present AI solutions in the spirit of AWS' original impact on startups; the real targets may be the startups from that era, not the current one.

0 views
Stratechery 1 month ago

OpenAI Code Red, AWS and Google Cloud Networking

OpenAI is declaring code red and doubling down on ChatGPT, highlighting the company's bear case. Then, AWS makes it easier to run AI workloads on other clouds.

0 views
Kev Quirk 1 month ago

Local vs Cloud

I was listening to the Waveform podcast on my commute this morning when they started talking about cloud vs local computing. The discussion quickly drifted into hypotheticals about unlimited storage and choosing one world or the other. But the whole debate felt off to me, because it rests on a bad assumption: that “cloud” and “local” are two totally separate things. Before we can argue about cloud vs local, we need to be clear about what we’re comparing. People talk about the cloud like it’s some mystical ether, but as the saying goes, it’s really just someone else’s computer. If it’s a machine you don’t own, sitting in a datacentre somewhere, it’s cloud. By contrast, local doesn’t just mean “the laptop you’re holding”. It includes anything you own and control: your PC, a server in a cupboard, or a NAS on your home network. Once you see it this way, the Waveform question becomes more interesting. Because I think local can include your own private cloud. At home I use a Synology NAS as the centre of my own little ecosystem. It runs all the services I rely on daily, but with the convenience you’d usually expect from big cloud providers. A few examples: Plex for media; Synology Photos for backing up images from my phone; Calendar and Contacts, all synced via DAV; Synology Drive for documents; my journal app; and a notes app I use for Fediverse posts, so I have local copies of everything I post. Everything lives on hardware I control, but it’s still available wherever I am. Backups are handled locally (to a USB drive connected to the Synology) and off-site (to Backblaze B2, encrypted before upload). The result is a system that behaves like a cloud service, but where I hold the keys. Here’s my extremely high-quality architectural diagram: No, I never studied art. I’m not going to get into specifics for obvious reasons, but the short version is that my Synology isn’t exposed to the Internet at all. My router only accepts traffic from specific networks, so I connect over VPN. It’s always-on for me and my wife, so the experience is completely transparent. If you’re thinking of building something like this, I’d strongly recommend not exposing any part of your home network directly to the Internet. Back to the original question. The unlimited storage bit doesn’t matter; you only need enough storage, not infinite. Given the choice between 100% cloud or 100% local, I’d choose local every time. Not because I want to avoid cloud-like features, but because local gives me the same benefits without giving away control. My photos sync automatically, I can share links to files, edit documents anywhere, and my data is backed up properly. The truth is that the whole premise of cloud vs local is a false choice. You don’t have to pick one at the expense of the other. You can have the convenience of the cloud running entirely on hardware you own. The real choice isn’t cloud or local, it’s whose cloud you want to rely on. What do you think? Do you lean toward cloud, local, or something in between? Feel free to leave a comment or drop me an email, I’d love to hear how you approach it.

0 views
Taranis 1 month ago

Datacenters in space are a terrible, horrible, no good idea.

In the interests of clarity, I am a former NASA engineer/scientist with a PhD in space electronics. I also worked at Google for 10 years, in various parts of the company including YouTube and the bit of Cloud responsible for deploying AI capacity, so I'm quite well placed to have an opinion here. The short version: this is an absolutely terrible idea, and really makes zero sense whatsoever. There are multiple reasons for this, but they all amount to saying that the kind of electronics needed to make a datacenter work, particularly a datacenter deploying AI capacity in the form of GPUs and TPUs, is exactly the opposite of what works in space. If you've not worked specifically in this area before, I'll caution against making gut assumptions, because the reality of making hardware actually function in space is not necessarily intuitive. The first reason for doing this that seems to come up is abundant access to power in space. This really isn't the case. You basically have two options: solar and nuclear. Solar means deploying a solar array with photovoltaic cells – something essentially equivalent to what I have on the roof of my house here in Ireland, just in space. It works, but it isn't somehow magically better than installing solar panels on the ground – you don't lose that much power through the atmosphere, so intuition about the area needed transfers pretty well. The biggest solar array ever deployed in space is that of the International Space Station (ISS), which at peak can deliver a bit over 200kW of power. It is important to mention that it took several Shuttle flights and a lot of work to deploy this system – it measures about 2500 square metres, about half the size of an American football field. Taking the NVIDIA H200 as a reference, the per-GPU-device power requirements are on the order of 0.7kW per chip. These won't work on their own, and power conversion isn't 100% efficient, so in practice 1kW per GPU might be a better baseline. A huge, ISS-sized, array could therefore power roughly 200 GPUs. This sounds like a lot, but let's keep some perspective: OpenAI's upcoming Norway datacenter is intending to house 100,000 GPUs, probably each more power hungry than the H200. To equal this capacity, you'd need to launch 500 ISS-sized satellites. In contrast, a single server rack (as sold by NVIDIA preconfigured) will house 72 GPUs, so each monster satellite is only equivalent to roughly three racks. (The arithmetic here and in the thermal section below is collected in a short sketch at the end of this post.) Nuclear won't help. We are not talking nuclear reactors here – we are talking about radioisotope thermoelectric generators (RTGs), which typically have a power output of about 50W - 150W. So not enough to even run a single GPU, even if you can persuade someone to give you a subcritical lump of plutonium and not mind you having hundreds of chances to scatter it across a wide area when your launch vehicle explosively self-disassembles. Thermal Regulation I've seen quite a few comments about this concept where people are saying things like, "Well, space is cold, so that will make cooling really easy, right?" Really, really no. Cooling on Earth is relatively straightforward. Air convection works pretty well – blowing air across a surface, particularly one designed to have a large surface-area-to-volume ratio like a heatsink, transfers heat to the air quite effectively.
If you need more power density than can be directly cooled in this way (and higher power GPUs are definitely in that category), you can use liquid cooling to transfer heat from the chip to a larger radiator/heatsink elsewhere. In datacenters on Earth, it is common to set up cooling loops where machines are cooled via chilled coolant (usually water) that is pumped around racks, with the heat extracted and cold coolant returned to the loop. Typically the coolant is cooled via convective cooling to the air, so one way or another this is how things work on Earth. In space, there is no air. The environment is close enough to a hard, total vacuum as makes no practical difference, so convection just doesn't happen. On the space engineering side, we typically think about thermal management, not just cooling. Thing is, space doesn't really have a temperature as such. Only materials have a temperature. It may come as a surprise, but in the Earth-Moon system the average temperature of pretty much anything settles close to the average temperature of Earth, because the same balance of absorbed sunlight and radiated heat is what sets Earth's temperature in the first place. If a satellite is rotating, a bit like a chicken on a rotisserie, it will tend toward having a consistent temperature that's roughly similar to that of the Earth's surface. If it isn't rotating, the side pointing away from the sun will tend to get progressively colder, with a limit due to the cosmic microwave background, a few kelvin, just a little bit above absolute zero. On the sunward side, things can get a bit cooked, hitting hundreds of degrees Celsius. Thermal management therefore requires very careful design, making sure that heat is carefully directed where it needs to go. Because there is no convection in a vacuum, this can only be achieved by conduction, or via some kind of heat pump. I've designed space hardware that has flown in space. In one particular case, I designed a camera system that needed to be very small and lightweight, whilst still providing science-grade imaging capabilities. Thermal management was front and centre in the design process – it had to be, because power is scarce in small spacecraft, and thermal management has to be achieved whilst keeping mass to a minimum. So no heat pumps or fancy stuff for me – I went in the other direction, designing the system to draw a maximum of about 1 watt at peak, dropping to around 10% of that when the camera was idle. All this electrical power turns into heat, so if I can draw 1 watt only while capturing an image, then turn the image sensor off as soon as the data is in RAM, I can halve the consumption, then when the image has been downloaded to the flight computer I can turn the RAM off and drop the power down to a comparative trickle. The only thermal management needed was bolting the edge of the board to the chassis so the internal copper planes in the board could transfer any heat generated. Cooling even a single H200 will be an absolute nightmare. Clearly a heatsink and fan won't do anything at all, but there is a liquid cooled H200 variant. Let's say this was used. The heat would then need to be transferred to a radiator panel – this isn't like the radiator in your car, no convection, remember? – which needs to radiate heat into space. Let's assume that we can point this away from the sun. The Active Thermal Control System (ATCS) on the ISS is an example of such a thermal control system. This is a very complex system, using an ammonia cooling loop and a large thermal radiator panel system.
It has a dissipation limit of 16kW, so roughly 16 H200 GPUs, a bit under a quarter of a ground-based rack. The thermal radiator panel system measures 13.6m x 3.12m, i.e., roughly 42.5 square metres. If we use 200kW as a baseline and assume all of that power will be fed to GPUs, we'd need a system 12.5 times bigger, i.e., roughly 531 square metres, or roughly a fifth of the area of the ISS-sized solar array described above. This is now going to be a very large satellite, dwarfing the ISS in area, all for the equivalent of three standard server racks on Earth. Radiation Tolerance This is getting into my PhD work now. Assuming you can both power and cool your electronics in space, you have the further problem of radiation tolerance. The first question is where in space? If you are in low Earth orbit (LEO), you are below the inner radiation belt, where the radiation dose is similar to that experienced by high altitude aircraft – more than an airliner, but not terrible. Further out, in mid Earth orbit (MEO), where the GPS satellites live, there is no protection from the Van Allen belts – worse, this orbit is literally inside them. Outside the belts, you are essentially in deep space (details vary with how close to the Sun you happen to be, but the principles are similar). There are two main sources of radiation in space – from our own star, the Sun, and from deep space. This basically involves charged particles moving at a substantial percentage of the speed of light, from electrons to the nuclei of atoms with masses up to roughly that of oxygen. These can cause direct damage, by smashing into the material from which chips are made, or indirectly, by travelling through the silicon die without hitting anything but still leaving a trail of charge behind them. The most common consequence of this happening is a single-event upset (SEU), where a direct impact or (more commonly) a particle passing through a transistor briefly (approx 600 picoseconds) causes a pulse to happen where it shouldn't have. If this causes a bit to be flipped, we call this an SEU. Other than corrupting data, SEUs don't cause permanent damage. Worse is single-event latch-up. This happens when a pulse from a charged particle causes a voltage to go outside the power rails powering the chip, causing a transistor essentially to turn on and stay on indefinitely. I'll skip the semiconductor physics involved, but the short version is that if this happens in a bad way, you can get a pathway connected between the power rails that shouldn't be there, burning out a gate permanently. This may or may not destroy the chip, but without mitigation it can make it unusable. For longer duration missions, which would be the case with space-based datacenters because they would be so expensive that they would have to fly for a long time in order to be economically viable, it's also necessary to consider total dose effects. Over time, the performance of chips in space degrades, because repeated particle impacts make the tiny field-effect transistors switch more slowly and turn on and off less completely. In practice, this causes maximum viable clock rates to decay over time, and power consumption to increase. Though not the hardest issue to deal with, this must still be mitigated or you tend to run into a situation where a chip that was working fine at launch stops working because either the power supply or cooling has become inadequate, or the clock is running faster than the chip can cope with.
It's therefore necessary to have a clock generator that can throttle down to a lower speed as needed – this can also be used to control power consumption, so rather than a chip ceasing to function it will just get slower. The next FAQ is, can't you just use shielding? No, not really, or maybe up to a point. Some kinds of shielding can make the problem worse – an impact to the shield can cause a shower of particles that then cause multiple impacts at once, which is far harder to mitigate. The very strongest cosmic rays can go through an astonishing amount of solid lead – since mass is always at a premium, it's rarely possible to deploy significant amounts of shielding, so radiation tolerance must be built into the system (this is often described as Radiation Hardness By Design, RHBD). GPUs and TPUs and the high bandwidth RAM they depend on are absolutely worst case for radiation tolerance purposes. Small geometry transistors are inherently much more prone both to SEUs and latch-up. The very large silicon die area also makes the frequency of impacts higher, since that scales with area. Chips genuinely designed to work in space are taped out with different gate structures and much larger geometries. The processors typically used have roughly the performance of a PowerPC from 2005. Bigger geometries are inherently more tolerant, both to SEUs and total dose, and the different gate topologies are immune to latch-up, whilst providing some degree of SEU mitigation via fine-grained redundancy at the circuit level. Taping out a GPU or TPU with this kind of approach is certainly possible, but the performance would be a tiny fraction of that of a current generation Earth-based GPU/TPU. There is a you-only-live-once (my terminology!) approach, where you launch the thing and hope for the best. This is commonplace in small cubesats, and also why small cubesats often fail after a few weeks on orbit. Caveat emptor! Communications Most satellites communicate with the ground via radio. It is difficult to get much more than about 1Gbps reliably. There is some interesting work using lasers to communicate with satellites, but this depends on good atmospheric conditions to be feasible. Contrast this with a typical server rack on Earth, where 100Gbps rack-to-rack interconnect would be considered the low end, and it's easy to see that this is also a significant gap. Conclusions I suppose this is just about possible if you really want to do it, but I think I've demonstrated above that it would be extremely difficult to achieve, disproportionately costly in comparison with Earth-based datacenters, and would offer mediocre performance at best. If you still think this is worth doing, good luck, space is hard. Myself, I think it's a catastrophically bad idea, but you do you.
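As promised above, here is the power and thermal arithmetic from this post collected into one runnable sketch. The figures are the ones quoted in the post; only the code structure and variable names are mine.

```c
/* Back-of-the-envelope numbers: an ISS-sized solar array powering H200-class
 * GPUs, and the radiator area needed to reject that much heat in vacuum. */
#include <stdio.h>

int main(void)
{
    const double array_power_kw   = 200.0;        /* ISS-sized solar array, peak */
    const double power_per_gpu_kw = 1.0;          /* ~0.7 kW per H200 plus conversion losses */
    const double target_gpus      = 100000.0;     /* the Norway datacenter figure */
    const double gpus_per_rack    = 72.0;         /* NVIDIA preconfigured rack */
    const double atcs_limit_kw    = 16.0;         /* ISS ATCS heat rejection limit */
    const double atcs_area_m2     = 13.6 * 3.12;  /* ~42.5 m^2 of radiator panels */
    const double array_area_m2    = 2500.0;       /* ISS solar array area */

    double gpus_per_sat  = array_power_kw / power_per_gpu_kw;              /* ~200 GPUs */
    double sats_needed   = target_gpus / gpus_per_sat;                     /* ~500 satellites */
    double racks_per_sat = gpus_per_sat / gpus_per_rack;                   /* ~2.8 racks */
    double radiator_m2   = atcs_area_m2 * (array_power_kw / atcs_limit_kw);/* ~531 m^2 */

    printf("GPUs per ISS-sized satellite : %.0f\n", gpus_per_sat);
    printf("Satellites for 100k GPUs     : %.0f\n", sats_needed);
    printf("Rack equivalents per sat     : %.1f\n", racks_per_sat);
    printf("Radiator area needed         : %.0f m^2 (%.0f%% of the solar array area)\n",
           radiator_m2, 100.0 * radiator_m2 / array_area_m2);
    return 0;
}
```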

0 views
Dangling Pointers 1 month ago

Oasis: Pooling PCIe Devices Over CXL to Boost Utilization

Oasis: Pooling PCIe Devices Over CXL to Boost Utilization Yuhong Zhong, Daniel S. Berger, Pantea Zardoshti, Enrique Saurez, Jacob Nelson, Dan R. K. Ports, Antonis Psistakis, Joshua Fried, and Asaf Cidon SOSP'25 If you are like me, you’ve dabbled with software prefetching but never had much luck with it. Even if you care nothing about sharing of PCIe devices across servers in a rack, this paper is still interesting because it shows a use case where software prefetching really matters. I suppose it is common knowledge that CXL enables all of the servers in a rack to share a pool of DRAM . The unique insight from this paper is that once you’ve taken this step, sharing of PCIe devices (e.g., NICs, SSDs) can be implemented “at near-zero extra cost.” Fig. 2 shows how much SSD capacity and NIC bandwidth are stranded in Azure. A stranded resource is one that is underutilized because some other resource (e.g., CPU or memory) is the bottleneck. Customers may not be able to precisely predict the ratios of CPU:memory:SSD:NIC resources they will need. Even if they could, a standard VM size may not be available that exactly matches the desired ratio. Source: https://dl.acm.org/doi/10.1145/3731569.3764812 Additionally, servers may have redundant components which are used in case of a hardware failure. The paper cites servers containing redundant NICs to avoid the server disappearing off the network if a NIC fails. Pooling of PCIe devices could help both of these problems. The VM placement problem is easier if resources within a rack can be dynamically allocated, rather than at the server level. Similarly, a rack could contain redundant devices which are available to any server in the rack which experiences a hardware failure. Fig. 4 shows the Oasis architecture: Source: https://dl.acm.org/doi/10.1145/3731569.3764812 In this example, a VM or container running on host A uses the NIC located on host B. An Oasis frontend driver on host A and a backend driver on host B make the magic happen. The communication medium is a shared pool of memory that both hosts have access to over CXL. The shared memory pool stores both the raw network packets, and message queues which contain pointers to the network packets. A tricky bit in this design is the assumption that the CPU caches in the hosts do not have a coherent view of the shared memory pool (i.e., there is no hardware cache coherence support). This quote sums up the reasoning behind this assumption: Although the CXL 3.0 specification introduces an optional cross-host hardware coherence flow [11], the implementation requires expensive hardware changes on both the processor and the device [74, 105, 143, 145]. To make Oasis compatible with hardware available today, we do not assume cache-coherent CXL devices. Here is the secret sauce that Oasis uses to efficiently send a message from the frontend driver to the backend driver. Note that this scheme is used for the message channels (i.e., descriptors, packet metadata). The shared memory pool is mapped by both drivers as cacheable. The frontend driver writes the message into shared memory, increments a tail pointer (also stored in shared CXL memory) and then forces the containing cache lines to be written to the shared memory pool by executing the instruction. The backend driver polls the tail pointer. If polling reveals that there are no new messages, the driver invalidates the cache line containing the tail pointer (with followed by ). 
This handles the case where there actually are new messages available, but the backend driver is reading a cached (stale) copy of the tail pointer. The backend driver then speculatively prefetches 16 cache lines of message data (with ). When the backend driver detects that the tail pointer has been incremented, it processes all new messages. Hopefully there is more than one message, and the software prefetch instructions will overlap computation with transfer from the shared memory pool. After processing the message(s), the backend driver invalidates the memory where those messages are stored. This is critical, because it allows subsequent prefetch instructions to work. A prefetch instruction does nothing if the target cache line is already cached (even though it may be stale). The speculative 16-cache-line prefetch also suffers from the same issue. Say 4 of the 16 prefetched lines had valid messages, and 12 did not. Those 12 cache lines are now in the backend CPU cache, and future prefetch instructions targeting them will do nothing. To solve this problem, the backend driver also invalidates speculatively prefetched cache lines that did not contain any messages. (A toy sketch of this channel appears at the end of this post.) Fig. 7 illustrates the end-to-end packet transmit flow: Source: https://dl.acm.org/doi/10.1145/3731569.3764812 Here are the steps: The network stack running in the VM/container on host A writes packet data into the packet buffer in CXL memory. Note that the network stack doesn’t “know” that it is writing network packets to shared memory. The frontend driver writes a message into the queue stored in shared CXL memory. The frontend driver uses to flush the cache lines associated with the network packet data, the message, and the message queue tail pointer. The backend driver polls the tail pointer for new messages in the queue (using the prefetching tricks described previously). The backend driver uses DPDK to cause the NIC on host B to transmit the packet. Note that the CPU cores on host B do not need to actually read the network packet data; the NIC uses DMA to read it directly from the shared memory pool. The steps to receive a packet are similar: The NIC on host B writes the packet data (via DMA) into the shared memory pool. The backend driver uses DPDK to learn that a new packet has arrived. The backend driver writes a message into the message queue in shared memory. The frontend driver polls the message queue (using the prefetch tricks). The network stack running in the VM/container on host A reads the packet data from shared memory. One trick used here is flow tagging. This is a DPDK feature that enables the NIC to determine which host the message is destined for, without the backend driver having to inspect network packet headers. Fig. 8 shows measurements of the overhead added by Oasis. The solid lines are the baseline; the dotted lines are Oasis. Each color represents a different latency bucket. The baseline uses a NIC which is local to the host running the benchmark. The overhead is measurable, but not excessive. Source: https://dl.acm.org/doi/pdf/10.1145/3731569.3764812 Dangling Pointers The paper doesn’t touch on the complexities related to network virtualization in a pooled device scheme. It seems to me that solving these problems wouldn’t affect performance but would require significant engineering.
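To make the flush/poll/prefetch dance concrete, here is a toy sketch of the message channel in C. To be clear, this is not the paper's code: the paper's exact instructions are elided above, so the x86 clflushopt/clflush/prefetch intrinsics, the struct layout, and the single-producer/single-consumer simplification are all my own stand-in assumptions (and real compiler/memory-barrier and CXL-mapping details are glossed over).

```c
/* Toy model of an Oasis-style message channel over non-coherent shared memory.
 * Compile with: gcc -O2 -mclflushopt toy_channel.c (plus a main() and deliver()). */
#include <stdint.h>
#include <immintrin.h>

#define CACHE_LINE     64
#define RING_SLOTS     256   /* one message per cache line */
#define PREFETCH_AHEAD 16    /* speculative prefetch depth from the paper */

struct msg {                 /* one descriptor, padded to a cache line */
    uint64_t packet_addr;
    uint32_t packet_len;
    uint8_t  pad[CACHE_LINE - sizeof(uint64_t) - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE)));

struct ring {                /* lives in the shared CXL memory pool */
    struct msg slots[RING_SLOTS];
    volatile uint64_t tail __attribute__((aligned(CACHE_LINE)));
};

static void flush_line(const void *p)      { _mm_clflushopt((void *)p); }
static void invalidate_line(const void *p) { _mm_clflush((void *)p); _mm_mfence(); }

/* Frontend (host A): publish one message. */
void frontend_send(struct ring *r, uint64_t addr, uint32_t len)
{
    uint64_t t = r->tail;
    struct msg *m = &r->slots[t % RING_SLOTS];
    m->packet_addr = addr;
    m->packet_len  = len;
    flush_line(m);                      /* push the descriptor out to the pool */
    r->tail = t + 1;
    flush_line((const void *)&r->tail); /* push the new tail out too */
    _mm_sfence();
}

/* Backend (host B): poll for new messages and consume them. */
void backend_poll(struct ring *r, uint64_t *head,
                  void (*deliver)(uint64_t addr, uint32_t len))
{
    if (r->tail == *head) {
        /* Possibly a stale cached tail: drop our copy so the next read goes back
         * to the pool, and prefetch the slots we expect to see next so their
         * transfer overlaps with other work. */
        invalidate_line((const void *)&r->tail);
        for (int i = 0; i < PREFETCH_AHEAD; i++)
            _mm_prefetch((const char *)&r->slots[(*head + i) % RING_SLOTS], _MM_HINT_T0);
        return;
    }
    uint64_t first = *head;
    while (*head != r->tail) {
        struct msg *m = &r->slots[*head % RING_SLOTS];
        deliver(m->packet_addr, m->packet_len);
        invalidate_line(m);             /* so future prefetches of this slot aren't no-ops */
        (*head)++;
    }
    /* Also invalidate speculatively prefetched slots that held no message yet. */
    for (uint64_t i = *head; i < first + PREFETCH_AHEAD; i++)
        invalidate_line(&r->slots[i % RING_SLOTS]);
}
```

The property this tries to mirror is the one the paper leans on: every cached copy of shared data must be explicitly written back or invalidated, because the two hosts get no hardware coherence over the CXL pool.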

0 views
Martin Alderson 1 month ago

I Finally Found a Use for IPv6

Using IPv6 with Cloudflare to run multiple services on a single server without a reverse proxy

1 views
DHH 1 month ago

No backup, no cry

I haven't done a full-system backup since back in the olden days before Dropbox and Git. Every machine I now own is treated as a stateless, disposable unit that can be stolen, lost, or corrupted without consequences. The combination of full-disk encryption and distributed copies of all important data means there's just no stress if anything bad happens to the computer. But don't mistake this for just an "everything is in the cloud" argument. Yes, I use Dropbox and GitHub to hold all the data that I care about, but the beauty of these systems is that they work with local copies of that data, so with a couple of computers here and there, I always have a recent version of everything, in case either syncing service should go offline (or away!). The trick to making this regime work is to stick with it. This is especially true for Dropbox. It's where everything of importance needs to go: documents, images, whatever. And it's instantly distributed on all the machines I run. Everything outside of Dropbox is essentially treated as a temporary directory that's fully disposable. It's from this principle that I built Omarchy too. Given that I already had a way to restore all data and code onto a new machine in no time at all, it seemed so unreasonable that the configuration needed for a fully functional system still took hours on end. Now it's all encoded in an ISO setup that installs in two minutes on a fast computer. Now it's true that this method relies on both multiple computers and a fast internet connection. If you're stuck on a rock in the middle of nowhere, and you somehow haven't discovered the glory of Starlink, maybe just stick to your old full-disk backup ways. But if you live in the modern world, there ought to be no reason why a busted computer is a calamity of data loss or a long restore process.

2 views
iDiallo 1 month ago

The real cost of Compute

Somewhere along the way, we stopped talking about servers. The word felt clunky, industrial, too tied to physical reality. Instead, we started saying "the cloud". It sounds weightless, infinite, almost magical. Your photos live in the cloud. Your documents sync through the cloud. Your company's entire infrastructure runs in the cloud. I hated the term cloud. I wasn't alone, someone actually created a "cloud to butt" browser extension that was pretty fun and popular. But the world has adopted the term, and I had no choice but to oblige. So what is the actual cloud? Why is it hiding behind this abstraction? Well, the cloud is rows upon rows of industrial machines, stacked in massive data centers, consuming electricity at a scale most of us can't even imagine. The cloud isn't floating above us. It's bolted to concrete floors, surrounded by cooling systems, and plugged into power grids that strain under its appetite. I'm old enough to remember the crypto boom and the backlash that followed. Critics loved to point out that Bitcoin mining consumed as much electricity as entire countries. Argentina, the Netherlands, and so many nations were picked for comparison. But I was not outraged by it at all. My reaction at the time was simpler. Why does it matter if they pay their electric bill? If you use electricity and compensate for it, isn't that just... how markets work? Turns out, I was missing the bigger picture. And the AI boom has made it impossible to ignore. When new data centers arrive in a region, everyone's electric bill goes up. Even if your personal consumption stays exactly the same. It has nothing to do with fairness and free markets. Infrastructure is not free. The power grids weren't designed for the sudden addition of facilities that consume megawatts continuously. When demand surges beyond existing capacity, utilities pass those infrastructure costs onto everyone. New power plants get built, transmission lines get upgraded, and residential customers help foot the bill through rate increases. The person who never touches AI, never mines crypto, never even knows what a data center does, this person is now subsidizing the infrastructure boom through their monthly utility payment. The cloud, it turns out, has a very terrestrial impact on your wallet. We've abstracted computing into its purest conceptual form: "compute." I have to admit, it's my favorite term in tech. "Let's buy more compute." "We need to scale our compute." It sounds frictionless, almost mathematical. Like adjusting a variable in an equation. Compute feels like a slider you can move up and down in your favorite cloud provider's interface. Need more? Click a button. Need less? Drag it down. The interface is clean, the metaphor is seamless, and completely disconnected from the physical reality. But in the real world, "buying more compute" means someone is installing physical hardware in a physical building. It means racks of servers being assembled, hard drives being mounted, cables being routed. The demand has become so intense that some data center employees have one job and one job only: installing racks of new hard drives, day in and day out. It's like an industrial assembly line. Every gigabyte of "cloud storage" occupies literal space. Every AI query runs on actual processors that generate actual heat. The abstraction is beautiful, but the reality is concrete and steel. The cloud metaphor served its purpose. It helped us think about computing as a utility. 
It's always available, scalable, detached from the messy details of hardware management. But metaphors shape how we think, and this one has obscured too much for too long. Servers are coming out of their shells. The foggy cloud is lifting, and we're starting to see the machinery underneath: vast data centers claiming real estate, consuming real water for cooling, and drawing real power from grids shared with homes, schools, and hospitals. This isn't an argument against cloud computing or AI. There's nothing to go back to. But we need to acknowledge their physical footprint. The cloud isn't a magical thing in the sky. It's industry. And like all industry, it needs land, resources, and infrastructure that we all share.

0 views
Dangling Pointers 1 month ago

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu SOSP'25 Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn’t that what an OS is for? My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to try to harden a large OS such that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor. This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first way involves widespread changes to legacy code, and the other way involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions. The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3 ) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS. The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling. Some cores are reserved for data-plane tasks, while other cores are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dynamic Scheduling Headaches Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren’t many packets flowing. Section 3.2 says this is a no-go. It would be great if there was more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured such that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock, and then synchronously communicate with the accelerator. The authors take this design as immutable. 
It seems plausible that the communication with the accelerator could be done in an asynchronous manner, but that would likely have ramifications for the entire control-plane software stack. This quote is telling: Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification. Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU), the other two are physical CPUs (pCPUs). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality, a Linux kernel module (5,800 lines of code) is allowing the vCPU to run at times of low data-plane activity. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing of packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This enables the system to hide the latency of the context switch from vCPU to pCPU. Think of it like a group of students in a classroom without any teachers (the teachers being network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees the teacher (a network packet) returning, they shout “act responsible”, and everyone returns to their schoolwork (running data-plane code). (A toy model of this lookout idea appears at the end of this post.) Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Results Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dangling Pointers To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important.
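Here is the promised toy model of the lookout idea, in plain C. It is nothing like the 5,800-line kernel module in the paper: the thread structure, timings, and names are purely illustrative, and the real system gates a vCPU inside the scheduler rather than spinning in user space.

```c
/* Toy "lookout" model: a probe thread warns of incoming data-plane work, and the
 * control-plane worker yields the core whenever the warning flag is set.
 * Compile with: gcc -O2 -pthread toy_lookout.c */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool packets_incoming = false; /* set by the stand-in workload probe */
static atomic_bool stop = false;

/* Stand-in for the hardware probe: it "sees" packets a little before the core
 * would, giving time to park the control-plane work. */
static void *probe_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) {
        atomic_store(&packets_incoming, true);   /* pretend a packet burst starts */
        usleep(200);                             /* ...and lasts 200 us */
        atomic_store(&packets_incoming, false);  /* coast is clear */
        usleep(800);
    }
    return NULL;
}

/* Control-plane "vCPU": only makes progress while no packets are on the way. */
static void *control_plane_thread(void *arg)
{
    (void)arg;
    unsigned long slices = 0;
    while (!atomic_load(&stop)) {
        if (atomic_load(&packets_incoming)) {
            sched_yield();  /* give the physical core back to the data plane */
            continue;
        }
        slices++;           /* e.g., a slice of VM-creation bookkeeping */
    }
    printf("control-plane slices executed: %lu\n", slices);
    return NULL;
}

int main(void)
{
    pthread_t probe, cp;
    pthread_create(&probe, NULL, probe_thread, NULL);
    pthread_create(&cp, NULL, control_plane_thread, NULL);
    sleep(1);               /* run the toy for one second */
    atomic_store(&stop, true);
    pthread_join(probe, NULL);
    pthread_join(cp, NULL);
    return 0;
}
```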

0 views

brownouts reveal system boundaries

One of the many basic tenets of internal control is that a banking organization ensure that employees in sensitive positions be absent from their duties for a minimum of two consecutive weeks. Such a requirement enhances the viability of a sound internal control environment because most frauds or embezzlements require the continual presence of the wrongdoer. Failure free operations require experience with failure. Yesterday, Cloudflare’s global edge network was down across the world. This post is not about why that happened or how to prevent it. It’s about the fact that this was inevitable. Infinite uptime does not exist. If your business relies on it, sooner or later, you will get burned. Cloudflare’s last global edge outage was on July 2, 2019. They were down yesterday for about 3 hours (with a long tail extending about another 2 and a half hours). That’s an uptime of roughly 99.99% over those six years. Hyperscalers like Cloudflare, AWS, and Google try very, very hard to always be available, to never fail. This makes it easy to intertwine them in your architecture, so deeply you don’t even know where. This is great for their business. I used to work at Cloudflare, and being intertwined like this is one of their explicit goals. My company does consulting, and one of our SaaS tools is a time tracker. It was down yesterday because it relied on Cloudflare. I didn’t even know until it failed! Businesses certainly don’t publish their providers on their homepage. The downtime exposes dependencies that were previously hidden. This is especially bad for “cascading” dependencies, where a partner of a partner of a partner has a dependency on a hyperscaler you didn’t know about. Failures like this really happen in real life; Matt Levine writes about one such case where a spectacular failure in a fintech caused thousands of families to lose their life savings. What I want to do here is make a case that cascading dependencies are bad for you, the business depending on them. Not just because you go down whenever everyone else goes down, but because depending on infinite uptime hides error-handling issues in your own architecture. By making failures frequent enough to be normal, organizations are forced to design and practice their backup plans. Backup plans don’t require running your own local cloud. My blog is proxied through cloudflare; my backup plan could be “failover DNS from cloudflare to github when cloudflare is down”. Backup plans don’t have to be complicated. A hospital ER could have a backup plan of “keep patient records for everyone currently in the hospital downloaded to an offline backup sitting in a closet somewhere”, or even just “keep a printed copy next to the hospital bed”. The important thing here is to have a backup plan, to not just blithely assume that “the internet” is a magic and reliable thing. One way to avoid uptime reliance is brownouts, where services are down or only partially available for a predetermined amount of time. Google intentionally browns out its internal infrastructure so that nothing relies on another service being up 100% of the time.[1] This forces errors to be constantly tested, and exposes dependency cycles. Another way is Chaos Monkey, pioneered at Netflix, where random things just break and you don’t know which ahead of time. This requires a lot of confidence in your infrastructure, but reveals kinds of failures you didn’t even think were possible.
I would like to see a model like this for the Internet, where all service providers are required to have at least 24 hours of outages in a year. That works out to a bit under three nines of uptime (about 4 minutes a day): enough that the service is usually up, but not so much that you can depend on it to always be up. (A toy sketch of such a scheduled brownout window follows at the end of this post.) In my experience, and according to studies about failure reporting, both people and organizations tend to chronically underestimate tail risks. Maybe you’re just a personal site and you don’t need 100% reliability. That’s ok. But if other people depend on you, and others depend on them, and again, eventually we end up with hospitals and fire stations and water treatment plants depending on the internet. The only way I see to prevent this is to make the internet unreliable enough that they need a backup plan. People fail. Organizations fail. You can’t control them. What you can control is whether you make them a single point of failure. You have backups for your critical data. Do you have backups for your critical infrastructure? [1] Of course, they don't brown-out their external-facing infra. That would lose them customers.
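Here is the promised sketch of what a scheduled brownout window could look like. The 4-minute daily window (roughly 24 hours a year) matches the proposal above, but the function names and the fixed 03:27 UTC slot are arbitrary assumptions, not anyone's production policy.

```c
/* Refuse requests for a few minutes every day so callers must exercise fallbacks. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define BROWNOUT_START_MIN (3 * 60 + 27)  /* 03:27 UTC, an arbitrary quiet hour */
#define BROWNOUT_LEN_MIN   4              /* 4 min/day is roughly 24 hours/year */

static bool in_brownout(time_t now)
{
    struct tm utc;
    gmtime_r(&now, &utc);
    int minute_of_day = utc.tm_hour * 60 + utc.tm_min;
    return minute_of_day >= BROWNOUT_START_MIN &&
           minute_of_day <  BROWNOUT_START_MIN + BROWNOUT_LEN_MIN;
}

/* Call this at the top of a request handler. */
static bool handle_request(time_t now)
{
    if (in_brownout(now)) {
        puts("503 Service Unavailable: scheduled brownout, exercise your fallback");
        return false;
    }
    puts("200 OK");
    return true;
}

int main(void)
{
    handle_request(time(NULL));
    return 0;
}
```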

0 views