Posts in Cloud (20 found)
Stratechery 2 days ago

Mythos, Muse, and the Opportunity Cost of Compute

In January 2025, Doug O’Laughlin at Fabricated Knowledge declared that o1 and reasoning models marked the end of Aggregation Theory:

I believe that there is no practical limit to the improvements of models other than economics, and I think that will be the real constraint in the future. It is reasonable that if we spent infinite dollars on a model, it would be improved. The problem is whether infinite dollars would make sense for a business. That is going to be the key question for 2025. How do the economics of AI make this work? One of the core assumptions about the internet has just been broken. Marginal costs now exist again, meaning that most hyperscalers will become increasingly capital-intensive. The era of Aggregation Theory is behind us, and AI is again making technology expensive. This relation of increased cost from increased consumption is anti-internet era thinking. And this will be the big problem that will be reckoned with this year. Hyperscalers’ business models are mainly underpinned by the marginal cost being zero. So, as long as you set up the infrastructure and fill an internet-scale product with users, you can make money. This era will soon be over, and the future will be much weirder and more compute-intensive. Looking back on the 2010s, we will probably consider them a naive time in the long arc of technology. One of our fundamental assumptions about this period is unraveling. This will be the single most significant change in the technology landscape going forward.

Aggregation Theory was, if I may say so myself, the single best way to understand the 2010s, particularly consumer tech. It explained the dynamics undergirding Google and Facebook’s dominance, as well as the App Store and Amazon’s e-commerce business; it was also a useful (albeit incomplete) framework to understand an entire host of consumer services like Uber, Airbnb, and Netflix. It’s worth pointing out, however, that some of the critical insights undergirding Aggregation Theory are much older, and are embedded in the fundamental nature of tech itself. They are, as O’Laughlin notes, rooted in the concept of zero marginal costs.

Marginal costs are how much it costs to make one more unit of a good. Consider a widget-making factory:

- You need land for the factory
- You need machines for the factory
- You need electricity to operate the machines
- You need humans to operate the machines
- You need the raw material for the widgets

Land and machines are clearly fixed costs; you have to have both to get started, and you are paying for both whether or not you make one more widget. Raw material, on the other hand, is clearly a marginal cost: if you make one more widget, you need one more widget’s worth of raw material. When it comes to physical goods, electricity and humans are also marginal costs: you need more or fewer of them depending on whether you make more or fewer widgets.

Where marginal costs matter is that they provide a price floor. Companies will operate unprofitably because profit and loss is an accounting concept that incorporates depreciation, i.e. your fixed costs. For example, imagine that a company spent $1,000 on a factory to make widgets that have a marginal cost of $10: as long as the price of widgets is >$10 the company will make them even if they don’t earn enough money to cover their depreciation costs (i.e. they operate at a loss), because at least they are still making a marginal profit on each widget (what the company may not do is invest in any more fixed costs, and, eventually, will probably go bankrupt from interest on the debt that likely financed those fixed costs).
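To make that arithmetic concrete, here is the widget hypothetical as a toy calculation; the $1,000 fixed cost and $10 marginal cost are from the example above, while the selling price and volume are assumed for illustration:

```python
# Toy model of the widget factory above. Fixed and marginal costs come
# from the article's hypothetical; price and volume are assumptions.
fixed_cost = 1_000      # the factory (sunk, shows up as depreciation)
marginal_cost = 10      # raw material etc. per widget
price = 12              # assumed selling price, > $10
units = 200             # assumed volume

marginal_profit = (price - marginal_cost) * units   # $400 earned above marginal cost
accounting_profit = marginal_profit - fixed_cost    # -$600: a loss on paper

# The company keeps making widgets (marginal_profit > 0) even though it
# is unprofitable in accounting terms (accounting_profit < 0).
print(marginal_profit, accounting_profit)
```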
I explain all of this precisely because it’s almost completely immaterial to tech. First, there generally are no raw material costs, because the outputs are digital. Second, because there are no raw material costs, and because the fixed costs are so large, electricity and humans are generally treated as fixed costs, not marginal costs: of course you will run your servers all of the time and at full capacity, because every scrap of additional revenue you can generate is worth it.

AI very much fits in this paradigm: the output is digital, and while AI chips use a lot of electricity, the cost is a fraction of the cost of the chips themselves, which is to say that no one with AI chips is making marginal cost calculations in terms of utilizing them. They’re going to be used! Rather, the decision that matters is what they will be used for. Consider Microsoft: last quarter the company missed the Street’s Azure growth expectations not because there wasn’t demand, but because the company decided to use its capacity for its own products. CFO Amy Hood said on the company’s earnings call:

I think it’s probably better to think about the Azure guidance that we give as an allocated capacity guide about what we can deliver in Azure revenue. Because as we spend the capital and put GPUs specifically, it applies to CPUs, the GPUs more specifically, we’re really making long-term decisions. And the first thing we’re doing is solving for the increased usage in sales and the accelerating pace of M365 Copilot as well as GitHub Copilot, our first-party apps. Then we make sure we’re investing in the long-term nature of R&D and product innovation. And much of the acceleration that I think you’ve seen from us and products over the past a bit is coming because we are allocating GPUs and capacity to many of the talented AI people we’ve been hiring over the past years. Then, when you end up, is that, you end up with the remainder going towards serving the Azure capacity that continues to grow in terms of demand. And a way to think about it, because I think, I get asked this question sometimes, is if I had taken the GPUs that just came online in Q1 and Q2 in terms of GPUs and allocated them all to Azure, the KPI would have been over 40. And I think the most important thing to realize is that this is about investing in all the layers of the stack that benefit customers. And I think that’s hopefully helpful in terms of thinking about capital growth, it shows in every piece, it shows in revenue growth across the business and shows as OpEx growth as we invest in our people.

The cost that Microsoft is contending with here is not marginal cost, but rather opportunity cost: compute spent in one area cannot be used in another area; in the case of these earnings, Microsoft was admitting that they could have made their Azure number if they wanted to, but chose to prioritize their own workloads because, as CEO Satya Nadella noted later in the call, those have higher gross margin profiles and higher lifetime value.

It’s opportunity costs, not marginal costs, that are the challenge facing hyperscalers. How much compute should go to customers, and which ones? How much should be reserved for internal workloads? Microsoft needs to balance Azure — both for its enterprise customers and OpenAI — and its software business; Amazon needs to balance its e-commerce business, AWS, and its strategic investments in both Anthropic and OpenAI. Google has to balance GCP, its own strategic investment in Anthropic, and its consumer businesses.

Last week Anthropic announced Mythos, its most advanced model.
And, in somewhat typical Anthropic fashion, it did so by focusing on its dangers; from the introductory post for Project Glasswing, the company’s initiative for leveraging Mythos to address security:

We formed Project Glasswing because of capabilities we’ve observed in a new frontier model trained by Anthropic that we believe could reshape cybersecurity. Claude Mythos Preview is a general-purpose, unreleased frontier model that reveals a stark fact: AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities. Mythos Preview has already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser. Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. The fallout—for economies, public safety, and national security—could be severe. Project Glasswing is an urgent attempt to put these capabilities to work for defensive purposes.

In an Update last week I analogized Anthropic’s “disaster-porn-as-marketing-tool” approach to The Boy Who Cried Wolf; what’s important about that analogy is not just that the boy raised false alarms, but also that, in the end, the wolf did come. To that end, I wrote two weeks ago about the myriad of security issues that underpin all software, and my optimism that AI would solve these issues in the long run, even if it made things much worse in the short run. In other words, it’s actually not important whether or not Mythos represents a major security threat: if this model doesn’t, a future model will; to that end, I do support leveraging Mythos to proactively find and fix bugs before bad actors can find and exploit them.

At the same time, it’s also worth noting that there are other reasons for Anthropic to not make Mythos widely available, limiting access to a finite number of companies with a high capacity and willingness to pay. The first are those opportunity costs: Anthropic is already short on compute serving its current models; X was overrun with complaints and debates this weekend about Anthropic allegedly dumbing down Claude over the last month or so. Making Mythos more widely available — particularly to subscription plans that don’t pay per usage — would make the situation much worse. In other words, Anthropic isn’t facing a marginal cost problem, but an opportunity cost problem: where to allocate its compute.

Of course this could become a margin problem: I suspect that Anthropic is going to overcome its conservatism in terms of compute by acquiring more compute from hyperscalers and neoclouds, and paying dearly for the privilege. The key to handling those costs will be to charge more for Claude going forward; that, by extension, means maintaining pricing power, which leads to a second benefit of not releasing Mythos broadly.

Anthropic certainly faces competition from OpenAI; for both frontier labs, however, the real competition in the long run are open source models. Right now those primarily come from China, and a key ingredient in fast-following frontier models is distillation; from Anthropic’s blog:

We have identified industrial-scale campaigns by three AI laboratories—DeepSeek, Moonshot, and MiniMax—to illicitly extract Claude’s capabilities to improve their own models.
These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions. These labs used a technique called “distillation,” which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.

I absolutely believe this is a real problem, and wrote as much when DeepSeek R1 was released last year. I also think it’s in the interest of everyone other than the frontier labs to pretend that it isn’t; open source models are not subject to the frontier labs’ markup or compute constraints, which is exactly why it benefits most companies to have them available, whether or not they are distilled. Of course that doesn’t mean they are free to run: you still need to provide the compute. Notice, however, how that makes stopping distillation even more of a priority for the frontier labs: first, they want to protect their margins. Second, however, their biggest cost is opportunity cost: the customers they can’t serve because they don’t have enough compute. To the extent they can make compute less useful for their potential customers — by stopping open source models from distilling their models — is the extent to which they can acquire that compute for themselves at more favorable rates.

Mythos wasn’t the only new model announced last week: Meta released the first fruit of their new frontier lab as well. From the company’s blog post:

Today, we’re excited to introduce Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration. Muse Spark is the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts. To support further scaling, we are making strategic investments across the entire stack — from research and model training to infrastructure, including the Hyperion data center… Muse Spark offers competitive performance in multimodal perception, reasoning, health, and agentic tasks. We continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows.

Muse Spark isn’t state of the art, but it’s in the game, and overall a positive first impression from Meta Superintelligence Labs. What is most notable to me, however, is the extent to which the last nine months of AI have made clear that CEO Mark Zuckerberg made the right call to embark on that “ground-up overhaul of [Meta’s] AI efforts”. The trigger for O’Laughlin’s post that I opened this Article with was reasoning, where models using more tokens led to better answers; since then agents have exponentially increased token demand, as they can use LLMs continuously without a human in the loop. This is a huge driver in sky-rocketing demand for Claude, as well as OpenAI’s Codex. Moreover, this use case is so potentially profitable that not only is Anthropic’s revenue sky-rocketing, but OpenAI is pivoting its focus to enterprise.
Indeed, you can make the argument that one of OpenAI’s biggest challenges is the fact it has such a popular consumer product in ChatGPT. I, with my Aggregation Theory lens, have long maintained that that userbase was a big advantage for OpenAI, but that assumed that the company could effectively monetize it, which is why I have argued so vociferously for an advertising model. OpenAI has big projections for exactly that, but until that materializes, that big consumer base is a big opportunity cost in terms of OpenAI’s focus and compute. The company has, to its credit and in the face of widespread skepticism, made significant investments in more compute, but the temptation to allocate more and more compute to agentic use cases that enterprises will pay for, even at the expense of the consumer business, will be very large.

This puts Meta in a unique position relative to everyone else in the industry: unlike any of the hyperscalers or the frontier labs, Meta does not have an enterprise or cloud business to worry about. That means that serving the consumer market comes with no opportunity costs. Of course those opportunity costs would be much smaller anyways, given that Meta already has an at-scale advertising business to monetize usage. In other words, Meta may actually face less competition in winning the consumer space than it might have seemed a few months ago, simply because that is their primary focus — and because they have their own model, which means they don’t need to worry about not having access to the frontier labs (much of this analysis applies to Google, of course). This, by the same token, is why Meta should open source Muse, just like they did Llama. The entities that will be the most hurt by widespread availability of a frontier model are other frontier labs, who will see their pricing power reduced and face increased competition for compute. This will make it even harder for them to bear the opportunity cost of pursuing the consumer market, leaving it for Meta.

So is “the era of Aggregation Theory…behind us”? On one hand, the insight that the way to create and maintain value will come from owning the customer is almost certainly going to continue to be the case. On the consumer side owning customers leads to advertising which provides the revenue to provide services to customers. On the enterprise side — which, I would note, has never been an arena where Aggregation Theory was meant to be applied — I think it’s likely that both Anthropic and OpenAI continue to move up the stack and deliver features that compete with software providers directly (an approach that is also in line with not making leading edge models publicly available). On the other hand, O’Laughlin’s observation that we are and will continue to be compute constrained is an important one: companies will not be able to assume they can serve everyone, because serving one set of customers imposes the opportunity cost of not serving another. This won’t, at least in theory, last forever: at some point AI will be “good enough” for enough use cases that there will be enough compute capacity to take advantage of the fact that there really aren’t meaningful marginal costs entailed in serving AI; that theoretical future, however, feels further away than ever.

OpenAI is betting that this compute constraint — and the deals they have made to overcome it — will matter more than Anthropic’s current momentum with end users.
From Bloomberg:

OpenAI told investors this week that its early push to dramatically increase computing resources gives it a key advantage over Anthropic PBC at a moment when its longtime rival is gaining ground and mulling a potential public offering. The ChatGPT maker said it has outpaced Anthropic by “rapidly and consistently” adding computing capacity to support wider adoption of its software, according to a note the company sent to some of its investors after Anthropic announced a more powerful AI model called Mythos. The ambitious infrastructure build-out, criticized by some as too costly, has enabled OpenAI to better keep pace with rising demand for AI products, the memo states.

I’m less certain that this will be dispositive. When it comes to AI, distribution and transaction costs are still free — the two preconditions for Aggregators — which means that the winners should be those with the most compelling products. Those products will win the most users, providing the money necessary to source the compute to serve them; consider Anthropic’s deal to secure a meaningful portion of TPU supply, which, given the capacity constraints at TSMC, is ultimately an example of taking supply from Google. I suspect that Anthropic can take more, including already built hyperscaler and neocloud capacity. Yes, that compute will be more expensive, but if demand is high enough the necessary cash flow will be there. In other words, my bet is that owning demand will ultimately trump owning supply, suggesting that the underlying principles of Aggregation Theory live on.

To put it another way, I think that OpenAI will need to win with better products, not just more compute; then again, if more compute is the key to better products, then does supply matter most? Regardless, they’ll certainly be focused on delivering both to the enterprise customers who are driving Anthropic’s astonishing growth. The real cost may be the consumer market they currently dominate, given that Meta has nothing to lose and everything to gain.

iDiallo 3 days ago

Your AWS Certificate Makes You an AWS Salesman

I must have been the last developer still confused by the AWS interface. I knew how to access DynamoDB; that was the only tool I needed for my daily work. But everything else was a mystery. How do I access web hosting? If I needed a small server to host a static website, what service would I use? Searching for "web hosting" inside the AWS console yielded nothing. After digging through the web, I found the answer: an Elastic Compute Cloud instance, better known as EC2. I learned that I could use it under the "Free Tier." Amazon offers free tiers for many services, but figuring out the actual cost beyond that introductory period requires elaborate calculation tools. In fact, I’ve often seen independent developers build tools specifically to help people decipher AWS pricing.

If you want to use AWS effectively, it seems the only path is to get certified. Companies send employees to conferences and courses to learn the platform. I took some of those courses and they taught me how to navigate the interface and build very specific things. But that skill isn't transferable. In the course, I wasn't exactly learning a new engineering skill. Instead, I was learning Amazon.

Amazon has created a complex suite of tools that has become the industry standard. Hidden within its moat of confusion, we are trained to believe it is the only option. Its complexity justifies the high cost, and the Free Tier lures in new users who settle into the idea that this is just "the way" to do web development. When you are presented with a simple interface like DigitalOcean or Linode and a much cheaper price tag, you tend to think that something is missing. Surely, a cheaper, simpler service must lack half the features, right? The reality is, you don't need half the stuff AWS offers.

Where other companies create tutorials to help you build, Amazon offers certificates. It is a powerful signal for enterprise legitimacy, but for most developers, it is overkill. This isn't to say AWS is "bad," but it obscures the reality of running a web service. It is much easier than it seems. There are hundreds of alternatives for hosting. You can run your services reliably on a VPS without ever breaking the bank. Most web programming is free, or at the very least, affordable.

Giles's blog 6 days ago

Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud

Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that gap, and had a list of potential changes to the training setup, and to the model itself. Which of them would help? I found a list of solid-looking interventions, and in my last post I came to the conclusion that the improvements in loss I had seen with all of them -- with two possible exceptions -- seemed unlikely to be in the noise. What would happen if I tried to put them into a new model?

Let's start by looking at the results that we have for the interventions so far -- this is the table I've been using as I go through them, but I've updated it to contain the loss figures for each model to six decimal places instead of three, and made each model name link to the associated post. I've also corrected the loss for the model, which was mistakenly using the training loss at the end of the run rather than the loss on the test set [1].

As I've mentioned before, simply moving to training in the cloud improved things markedly, getting loss down from 3.944 to 3.691526; I suspect this was due to having a closer-to-optimal batch size (more about that in my next post). What to do about the other interventions, though? It seemed clear that two of them were not helping: weight tying, and the one using the figure for weight decay that I'd (I suspect incorrectly) derived from a paper by Cerebras Research. The "no-AMP" run (which would be better described as "full-fat float32") had a small positive effect, but was so costly in terms of both time and money that it wasn't worthwhile.

So we had five interventions to try:

- Gradient clipping.
- QKV bias (that is, adding bias to the attention weight matrices).
- Changing weight decay to the GPT-2 value (0.01 rather than the 0.1 that is typical nowadays).
- Removing dropout.
- Updating the learning rate from 0.0004 to 0.0014, but also scheduling it so that it varies over the course of the training run.
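The actual training code for each run is linked from the configs below; as a rough illustration only, here's a minimal PyTorch sketch (not the post's code) of the two loop-level interventions from that list -- clipping at a max norm of 3.5 and the warmup-then-cosine learning rate schedule -- with a toy model standing in for the real one:

```python
import math
import torch

# Toy model/optimizer so the snippet runs standalone.
model = torch.nn.Linear(10, 10)
opt = torch.optim.AdamW(model.parameters(), lr=0.0014, weight_decay=0.01)

total_steps = 1000
warmup_steps = int(0.05 * total_steps)   # warmup over 5% of the run
max_lr, min_lr = 0.0014, 0.00014

def lr_at(step):
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the rest of the run.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in range(total_steps):
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    # Gradient clipping at a max norm of 3.5, as in the runs below.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 3.5)
    opt.step()
```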
How would they stack up? It seemed pretty unlikely that their independent contributions would just sum up neatly so that we got a total improvement of 0.013209 + 0.022141 + 0.048586 + 0.050244 + 0.089609 = 0.223789 (though that would certainly be nice!). One question to consider was how independent they were. For any set of interventions, you can imagine them being independent and adding up nicely, or pulling in separate directions so that the combined effect is worse than the sum, or pulling in the same direction so that they amplify each other. My intuition was that gradient clipping and removing dropout were pretty independent, at least conceptually. They might affect other interventions indirectly (eg. via changing the training run's use of the random number generator) but they'd be unlikely to have a direct effect. QKV bias I was less sure about, but it seemed -- again, just intuitively -- at least reasonably independent of the others, with one important exception (which I'll get into below). By contrast, weight decay and the learning rate interact together quite strongly, at least in standard gradient descent, and I'd tested them in isolation. The result for changing the weight decay to 0.01 was based on a fixed learning rate of 0.0004, and the result for scheduling the learning rate was based on a weight decay of 0.1. That felt like an issue, and definitely needed some thought.

Additionally, there were some issues with which interventions might have not had a real effect, and instead just been the results of the use of randomness. While my analysis of how that might have affected things was somewhat limited by the number of test runs I could afford to do, it did show up two plausible issues:

- Adding gradient clipping looked like it might have been within the training run noise.
- Adding QKV bias would have had a large effect on the model's initial weights. All of the others would have started with essentially the same weights (apart from weight tying, though even that would have had the same values for the initial weights apart from the tied ones). But adding the bias would have completely changed them, and its effect size was comfortably within the range of differences you might expect from that.

After some thought, I came up with a plan. If I were doing this properly and scientifically, I suppose I'd try every combination of interventions, but that would be ruinously expensive [2], so a sensible minimal set of training runs felt like this:

- Start a training run with all of the interventions apart from QKV bias.
- In parallel (Lambda instance availability permitting) run another one, with all of the interventions including QKV bias.

When those completed, I'd find the test set loss for both models. I'd choose the best run, and then do another run with those settings, but with weight decay switched back to the original value of 0.1. I chose to revert weight decay rather than the learning rate stuff because this was the one I was least sure about -- the updated "GPT-2" value of 0.01 is very unusual by today's standards, and I'd come to it via a rather circuitous route -- see the post for more details. The best of the three runs would be the winning combination of interventions. Again, this was not an exhaustive plan [3]. But it seemed to make sense. Let's see how it turned out.

Just to recap, this one had these interventions against the baseline:

- Gradient clipping at 3.5
- Weight decay changed from 0.1 to 0.01
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.

It did not have QKV bias. You can see the config here. Here's the loss chart over the course of the training run: As normal with learning rate scheduling, I also charted that to make sure it was doing the right thing (you can see that it was): And I also tracked the gradient norms -- you can see that there was some clipping happening near the start of the run: At the end of the run, it reported this: That's a slightly lower final train loss than normal, and it took 3h10m, which is faster than usual, but about the same as the other train we did without dropout -- that makes sense, as the process of zeroing out random activations isn't free. I downloaded the model -- here it is -- and then ran the smoke test: ...and got its loss on the test set: Not bad at all -- the best result we've had so far, albeit not quite up to the standard of the original GPT-2 weights.

Now the next one, with QKV bias. This one had these interventions:

- Gradient clipping at 3.5
- Weight decay changed from 0.1 to 0.01
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.
- QKV bias switched on.

You can see the config here. Here's the loss chart: ...the learning rate: ...the gradient norms (note that we had more clipping, about halfway through): ...and the final printout at the end. That final train loss is slightly higher, which is normally an indicator that the test loss will be higher, but we'll have to see. Time to download the model -- here it is -- and on to the smoke test: ...and then the moment of truth -- what was its loss on the test set? As I suspected from the training loss at the end, slightly worse than the run without QKV bias.

So, that meant that we should do the next run, with a weight decay of 0.1, with no QKV bias. Given the above results, this one had these interventions vs the baseline:

- Gradient clipping at 3.5
- Dropout removed
- Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.

Weight decay was back to the baseline value of 0.1, rather than the value of 0.01 used in the previous two runs, and QKV bias was switched back off. You can see the config here. Here's the loss chart: You can see that it's much choppier than the previous two runs; that initially surprised me, as the higher weight decay means that we're regularising the model more than we were with those, which I thought would "calm things down". But on reflection, I had it backward. Hand-waving a bit, a more regularised model is fitting less closely every detail to the data it has seen, considering the typical stuff more than it does the outliers.
That means that when something a bit more out-of-distribution appears, it might not have yet learned how to integrate it into its model of the world. Well, it sounds plausible, anyway :-)

On to the learning rate (just to double-check), and it's fine: And again, the gradient norms: ...which similarly to the loss chart show more occasions where gradients spiked and had to be clipped -- even towards the end of the training run this time. The final printout at the end: Once again, although the final train loss is not definitive, it tends to be indicative of the test loss. It's in between the last two runs, so we'd expect the test loss to be likewise in between theirs: Time to download the model -- here it is -- and on to the smoke test: Hmm. At least vaguely coherent, though I'm not 100% convinced. It looks like ads for personal injury lawyers have crept into FineWeb somehow... Still, it's time for the test loss (drumroll): As predicted from the train loss, it's in between the two runs above.

Let's put these three runs into the results table. As a reminder:

- The first run was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01, dropout removed, and the learning rate intervention, but no QKV bias.
- The second was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01, dropout removed, and the learning rate intervention, with QKV bias.
- The third was gradient clipping at 3.5, dropout removed, and the learning rate intervention, but no QKV bias, and no change to weight decay.

You can see that adding on QKV bias actually made the model worse than the learning-rate-only intervention. That pushes me slightly away from the "it's all about the initial weights" direction; perhaps instead the bias adds some kind of stability that the learning rate scheduling also provides, and they fight against each other? Unfortunately I think the only way to pick it apart would be to do a full set of runs, switching each intervention on and off independently, and that would be too costly.

The fact that the weight decay change from 0.1 to 0.01 actually did help when combined with the learning rate change and scheduling was a bit of a surprise; because they're both coupled when we think about standard gradient descent, I was expecting them to be too intertwined for my tests of them in isolation to have been valid. Quite pleased that it didn't work out that way, though, because sweeping across values for different parameters is much easier than it would be if they were connected. However, at this point it occurs to me that it might be because we're using the AdamW optimiser. As I understand it, its big difference versus Adam is that it decouples weight decay. I don't have a solid mental model of what that means exactly (will read up and post about it eventually), but it certainly seems pertinent here.
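For what it's worth, here's a tiny self-contained sketch (not from the post) of where the weight decay term enters in each optimiser: Adam folds it into the gradient, so it passes through the adaptive rescaling, while AdamW applies it directly to the weights:

```python
import torch

# One parameter each, one step, to show where the decay term enters.
lr, wd = 1e-3, 0.01
p_adam = torch.tensor([1.0], requires_grad=True)
p_adamw = torch.tensor([1.0], requires_grad=True)

# Adam + L2: weight_decay is added to the gradient (grad += wd * p),
# so it then passes through Adam's per-parameter rescaling.
opt_adam = torch.optim.Adam([p_adam], lr=lr, weight_decay=wd)

# AdamW: the decay is applied directly to the weights each step
# (p <- p - lr * wd * p), outside the adaptive machinery.
opt_adamw = torch.optim.AdamW([p_adamw], lr=lr, weight_decay=wd)

for p, opt in [(p_adam, opt_adam), (p_adamw, opt_adamw)]:
    loss = (p ** 2).sum()
    loss.backward()
    opt.step()
    opt.zero_grad()

print(p_adam.item(), p_adamw.item())  # the two updates differ slightly
```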
Anyway, I have to say, I'm both pleased with and disappointed by these results. Pleased because we got a result by putting interventions together that was better than any of them in isolation, but disappointed that the end result wasn't even better. The difference between the baseline's loss, at 3.691526, and original GPT-2 small's, at 3.5, was 0.191526. Our best result, for the first run, was 3.577761, so an improvement of 0.113765. That's about 60% of the way there.

That said, by sheer chance, while trying out the different sizes of cloud machines, I'd got from a loss of 3.944 training locally to the baseline's value of 3.691526 -- I suspect due to the fact that training in the cloud meant that I could use batch sizes of 96. So a different way of looking at it is that we should include that in the calculations too. From 3.944 to 3.5, the gap with GPT-2 small was 0.444. And we went from 3.944 to 3.577761, an improvement of 0.366239. And that means that we managed to get 82% of the improvement we needed.

On the other hand, it means that in terms of my improvements, 0.252474 came from a happy accident, while all of my careful work on interventions only got me 0.113765. :-(

Anyway, I think that for now, I'll have to rest happy with that as a result -- and next time around, let's see if we can get to the same level of improvement locally, using gradient accumulation.

[1] Luckily the difference was small enough that it doesn't change any of the conclusions I'd made about it.

[2] Because there are five interventions, and each can be on or off, it's equivalent to a 5-digit binary number. So that's 2^5 trains, less the five ones I'd already done and the baseline, for a total of 32 − 6 = 26. At US$50-odd for a train, that's definitely a no-go.

[3] I did also consider changing the random seed at the start of the code to 67 rather than 42, given that it seemed to provide better initial weights when I was exploring the effects of random noise on the training. I even started the first two training runs with that in place. However, on reflection I realised that it would be one step too far away from scientific rigour. I'm not trying to be 100% rigorous in these posts, but it seemed like a step too far to diligently test all of the interventions against one seed, and then YOLO in a different one for the final training runs.

Jim Nielsen 6 days ago

Fewer Computers, Fewer Problems: Going Local With Builds & Deployments

Me, in 2025, on Mastodon:

I love tools like Netlify and deploying my small personal sites with . But i'm not gonna lie, 2025 might be the year I go back to just doing builds locally and pushing the deploys from my computer. I'm sick of devops'ing stupid stuff because builds work on my machine and I have to spend that extra bit of time to ensure they also work on remote linux computers. Not sure I need the infrastructure of giant teams working together for making a small personal website.

It's 2026 now, but I finally took my first steps towards this. One of the ideas I really love around the "local-first" movement is this notion that everything canonical is done locally, then remote "sync" is an enhancement. For my personal website, I want builds and deployments to work that way. All data, build tooling, deployment, etc., happens first and foremost on my machine. From there, having another server somewhere else do it is purely a "progressive enhancement". If it were to fail, fine. I can resort back to doing it locally very easily because all the tooling is optimized for local build and deployment first (rather than being dependent on fixing some remote server to get builds and deployments working).

It's amazing how many of my problems come from the struggle to get one thing to work identically across multiple computers. I want to explore a solution that removes the cause of my problem, rather than trying to stabilize it with more time and code. "The first rule of distributed computing is don't distribute your computing unless you absolutely have to" — especially if you're just building personal websites. So I un-did stuff I previously did (that's right, my current predicament is self-inflicted — imagine that).

My notes site used to work like this:

- Content lives in Dropbox
- Code is on GitHub
- Netlify's servers pull both, then run a build and deploy the site

It worked, but sporadically. Sometimes it would fail, then start working again, all without me changing anything. And when it did work, it often would take a long time — like five, six minutes to run a build/deployment. I never could figure out the issue. Some combination of Netlify's servers (which I don't control and don't have full visibility into) talking to Dropbox's servers (which I also don't control and don't have full visibility into). I got sick of trying to make a simple (but distributed) build process work across multiple computers when 99% of the time, I really only need it to work on one computer. So I turned off builds in Netlify, and made it so my primary, local computer does all the work. Here are the trade-offs:

- What I lose: I can no longer make edits to notes, then build/deploy the site from my phone or tablet.
- What I gain: I don't have to troubleshoot build issues on machines I don't own or control. Now, if it "works on my machine", it works period.

The change was pretty simple. First, I turned off builds in Netlify. Now when I push, Netlify does nothing. Next, I changed my build process to stop pulling markdown notes from the Dropbox API and instead pull them from a local folder on my computer. Simple, fast. And lastly, as a measure to protect myself from myself, I cloned the codebase for my notes to a second location on my computer. This way I have a "working copy" version of my site where I do local development, and I have a clean "production copy" of my site which is where I build/deploy from. This helps ensure I don't accidentally build and deploy my "working copy" which I often leave in a weird, half-finished state. In my I have a command that looks like this. That's what I run from my "clean" copy. It pulls down any new changes, makes sure I have the latest deps, builds the site, then lets Netlify's CLI deploy it.
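The actual command isn't shown in the post; a sketch of a script matching that description (pull changes, refresh dependencies, build, deploy via Netlify's CLI) might look something like this, where the build script and output directory names are assumptions:

```sh
# Pull new changes, install exact deps, build, then deploy with the Netlify CLI.
git pull \
  && npm ci \
  && npm run build \
  && npx netlify deploy --prod --dir=dist
```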
As extra credit, I created a macOS shortcut. So I can do , type "Deploy notes.jim-nielsen.com" to trigger a build, then watch the little shortcut run to completion in my Mac's menubar. I've been living with this setup for a few weeks now and it has worked beautifully. Best part is: I've never had to open up Netlify's website to check the status of a build or troubleshoot a deployment. That's an enhancement I can have later — if I want to.

Nelson Figueroa 1 weeks ago

Proxying GoatCounter Requests for a Hugo Blog on CloudFront to bypass Ad Blockers

I’ve been running GoatCounter on my site using the script. The problem is that ad blockers like uBlock Origin block it (understandably). To get around this, I set up proxying so that the GoatCounter requests go to an endpoint under my domain, and then from there CloudFront handles it and sends it to GoatCounter. Most ad blockers work based on domain, and GoatCounter is on the blocklists. Since the browser is now sending requests to the same domain as my site, it shouldn’t trigger any ad blockers. This post explains how I did it in case it’s useful for anyone else.

It’s possible to self-host GoatCounter, but my approach was easier to do and less infrastructure to maintain. Perhaps in the future. I know there are concerns around analytics being privacy-invasive. GoatCounter is privacy-respecting. I care about privacy. I am of the belief that GoatCounter is harmless. I just like to keep track of the visitors on my site. Read the GoatCounter developer’s take if you want another opinion: Analytics on personal websites.

Clicking through the AWS console to configure CloudFront distributions is a pain in the ass. I took the time to finally get the infrastructure for my blog managed as infrastructure-as-code with Pulumi and Python. So while you can click around the console and do all of this, I will be showing how to configure everything with Pulumi. If you don’t want to use IaC, you can still find all of these options/settings in AWS itself.

To set up GoatCounter proxying via CloudFront, we’ll need to:

- Create a new CloudFront function resource
- Add a second origin to the distribution
- Add an ordered cache behavior to the distribution (which references the CloudFront function using its ARN)
- Update the GoatCounter script to point to this new endpoint

CloudFront functions are JavaScript scripts that run before a request reaches a CloudFront distribution’s origin. In this case, the function strips the prefix from the request URI. We need to strip it for two reasons:

- I chose to proxy requests that hit the endpoint on my site to make sure there’s no collision with post titles/slugs. I’ll never use the path for posts.
- GoatCounter accepts requests under , not .

Here is the code for the function, the CloudFront function resource defined in Pulumi (using Python) that includes that JavaScript (a new resource defined in the same Python file where my existing distribution already exists), and my existing CloudFront distribution being updated with a new origin and cache behavior:
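(The post's original code blocks didn't survive here; the following is a sketch of the shape it describes, not the author's actual code. The /goat prefix, the GoatCounter domain, and the managed cache policy ID are assumptions worth verifying:)

```python
import pulumi_aws as aws

# Hypothetical path prefix -- the post doesn't show the real one.
PREFIX = "/goat"

# CloudFront Function (JavaScript) that strips the prefix from the URI
# before the request is forwarded to the GoatCounter origin.
strip_prefix_js = """
function handler(event) {
    var request = event.request;
    request.uri = request.uri.replace(/^\\/goat/, '');
    return request;
}
"""

goat_fn = aws.cloudfront.Function(
    "goatcounter-strip-prefix",
    runtime="cloudfront-js-1.0",
    publish=True,
    code=strip_prefix_js,
)

# Added to the existing aws.cloudfront.Distribution: a second origin
# pointing at GoatCounter...
goat_origin = aws.cloudfront.DistributionOriginArgs(
    origin_id="goatcounter",
    domain_name="yoursite.goatcounter.com",  # hypothetical
    custom_origin_config=aws.cloudfront.DistributionOriginCustomOriginConfigArgs(
        http_port=80,
        https_port=443,
        origin_protocol_policy="https-only",
        origin_ssl_protocols=["TLSv1.2"],
    ),
)

# ...and an ordered cache behavior matching the prefix, with the
# function attached on viewer-request.
goat_behavior = aws.cloudfront.DistributionOrderedCacheBehaviorArgs(
    path_pattern=f"{PREFIX}/*",
    target_origin_id="goatcounter",
    viewer_protocol_policy="redirect-to-https",
    # GoatCounter's script sends POST, so all seven verbs are required.
    allowed_methods=["GET", "HEAD", "OPTIONS", "PUT", "POST", "PATCH", "DELETE"],
    cached_methods=["GET", "HEAD"],
    # AWS managed "CachingDisabled" policy ID -- double-check this value.
    cache_policy_id="4135ea2d-6df8-44a3-9df3-4b5a84be39ad",
    function_associations=[
        aws.cloudfront.DistributionOrderedCacheBehaviorFunctionAssociationArgs(
            event_type="viewer-request",
            function_arn=goat_fn.arn,
        )
    ],
)
```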
At the time of writing, CloudFront only allows the behavior's allowed methods to be one of three specific combinations: GET and HEAD; GET, HEAD, and OPTIONS; or all seven HTTP verbs (GET, HEAD, OPTIONS, PUT, POST, PATCH, DELETE). Since the GoatCounter JavaScript sends a POST request, and the third option is the only one that includes POST, we’re forced to use all HTTP verbs. It should be harmless though.

Now that my Pulumi code has both the CloudFront function defined and the CloudFront distribution has been updated, I ran pulumi up to apply changes. Finally, I updated goatcounter.js to use the new endpoint. So instead of the GoatCounter domain, I changed it to my own domain at the very top of the snippet. After this, I built my site with Hugo and deployed it on S3/CloudFront by updating the freshly built HTML/CSS/JS in my S3 Bucket and then invalidating the existing CloudFront cache.

Now, GoatCounter should no longer be blocked by uBlock Origin. I tested by loading my site on an incognito browser window and checked that uBlock Origin was no longer blocking anything on my domain. Everything looks good! If you’re using GoatCounter you should consider sponsoring the developer. It’s a great project.

Stratechery 1 weeks ago

Anthropic’s New TPU Deal, Anthropic’s Computing Crunch, The Anthropic-Google Alliance

Anthropic needs compute, and Google has the most: it's a natural partnership, particularly for Google.

Martin Alderson 1 weeks ago

What next for the compute crunch?

I thought it'd be a good time to continue on the same theme as my previous two articles, The Coming AI Compute Crunch and Is the AI Compute Crunch Here?, given that both OpenAI and Anthropic are now publicly agreeing they are (very?) compute starved.

I came across this really interesting tweet from the COO of GitHub which really underlines the scale of change that the world is seeing now: This shows that GitHub in the last 3 months (!) has seen a ~14x annualised increase in the number of commits. Commits are a crude proxy for inference demand - but even directionally, if we assume that most of the increase is due to coding agents hitting the mainstream, it points to an outrageously large increase in compute requirements for inference. If anything, this is probably a huge undercount - many people new to "vibe coding" are unlikely to get their heads round Git(Hub) quickly - distributed source control is quite confusing to non engineers (and, at least for me, took longer than I'd like to admit to get totally fluent with it as an engineer). Plus this doesn't include all the Cowork usage which is very unlikely to go anywhere near GitHub.

OpenAI's Thibault Sottiaux (head of the Codex team) also tweeted recently that AI companies are going through a phase of demand outstripping supply: It's been rumoured - and indeed in my opinion highly likely given how compute intensive video generation is - that Sora was shut down to free up compute for other tasks. All AI companies are feeling this intensely. Even worse, there is a domino effect with this - when Claude Code starts tightening usage limits or experiencing compute-related outages, people start switching to e.g. Codex or OpenCode, putting increased pressure on them.

As I mentioned in my last articles, I believe everyone was looking at these "crazy" compute deals that OpenAI, Anthropic, Microsoft etc were signing like they were going out of fashion back in ~2025 the wrong way. Signing a $100bn "commitment" to buy a load of GPU capacity does not suddenly create said capacity. Concrete needs to be poured, power needs to be connected, natural gas turbines need to be ordered [1] and GPUs need to be fabricated, racked and networked. All of these products are in short supply, as is the labour required.

One of the key points I think worth highlighting that often gets overlooked is how difficult the rollout of GB200 (NVidia's latest chips) has been. Unlike previous generations of GPUs from NVidia, the GB200-series is fully liquid cooled - not air cooled as before. Liquid cooling at gigawatt scale just hasn't really been done in datacentres before. From what I've heard it's been unbelievably painful. Liquid cooling significantly increases the power density/m², which makes the electrical engineering required harder - plus a real shortage of skilled labour [2] to plumb it all together - and even shortages of various high end plumbing components have led to most (all?) of the GB200 rollout being vastly behind schedule. While no doubt these issues will get resolved - and the supply chains will gain experience and velocity in delivering liquid cooled parts - this has no doubt put even more pressure on what compute is available in the short to medium term.

Even worse, Stargate's 1GW under-construction datacentre in the UAE is now a chess piece in the current geopolitical tensions in the recent US/Iran conflict, with the Iranian government putting out a video featuring the construction site.
The longer term issue I wrote about in my previous articles on this subject is the hard constraints on DRAM fabrication. While SK Hynix recently signed an $8bn deal for more EUV production equipment from ASML, it's unlikely to come online for another couple of years. Indeed I noticed Sundar Pichai specifically called out memory as a significant constraint on his recent appearance on the Stripe podcast. While recent innovations like TurboQuant are extremely promising in driving memory requirements down via KV cache compression, given the pace at which AI usage is growing it at best buys a small window of breathing room.

I believe the next 18-24 months are going to be defined by compute shortages. When you have exponential demand increases and ~linear additions on the supply side, the market is going to be pretty volatile, to say the least. The cracks are already showing. Anthropic's uptime is famously now at "1" nine reliability, and doesn't seem to be getting any better. I don't envy the pressure on SRE teams trying to scale these systems dramatically while deploying new models and efficiency strategies.

We've seen Anthropic introduce increasingly more heavy handed measures on the Claude subscription side - starting with "peak time" usage limits being cut significantly, and now moving to ban even usage from 3rd party agent harnesses - no doubt to try and reduce demand. The issue is that if my guesswork at the start of the article is correct and Anthropic is seeing ~10x Q/Q inference demand, there is only so much you can do by banning 3rd party use of the product - 1st party use will quickly eat that up. And time based rationing - while extremely useful to smooth out the peaks and troughs - can only go so far. Eventually you incentivise it enough that you max out your compute 24/7. That's not to say there isn't a lot more they can (and will) do here, but when you are facing those kind of demand increases it doesn't end up getting you to a steady state.

That really only leaves one lever to pull - price. I was hesitant in my previous articles to suggest major price increases, as gaining marketshare is so important to everyone involved in this trillion dollar race, but if all AI providers are compute starved then I think the game theory involved changes. The paradox of this though is that as models get better and better - and the rumours around the new "Spud" and "Mythos" models from OpenAI and Anthropic point that way - users get less price sensitive. While spending $200/month when ChatGPT first brought out their Pro subscription seemed almost comically expensive for the value you could get out of it, I class my $200/month Anthropic subscription as some of the best value going and would probably pay a lot more for it if I had to, even with current models.

We're in completely uncharted territory as far as I can tell. I've recently been doing a lot of reading about the initial electrification of Europe and North America in the late 1800s/early 1900s, but the analogy quickly breaks down - the demand growth is so much steeper and the supply issues were far less concentrated. So, we're about to find out what people will actually pay for intelligence on tap. My guess is a lot more than most expect - which is both extremely bullish for the industry and going to be extremely painful for users in the short term. [3]

Fundamentally, I believe there is a near infinite demand for machines approaching or surpassing human cognition, even if that capability is spread unevenly across domains.
The supply will catch up eventually. But it's the "eventually" that's going to hurt.

[1] Increasingly large AI datacentres are skipping grid connections (too slow to come online) and connecting straight to natural gas pipelines and installing their own gas turbines and generation sets.

[2] I've also read that various manufacturing problems from NVidia have led to parts leaking, which famously does not combine well with high voltage electrical systems.

[3] One flip side of this is how much better the small models have got. I'll be writing a lot more on this, but Gemma 4 26b-a4b running locally is hugely impressive for software engineering. It's not quite good enough, but perhaps we are only a few months off local models on consumer hardware being "good enough". Maybe it's worth buying that Mac or GPU you were thinking about as a hedge?

Giles's blog 1 weeks ago

Automating starting Lambda Labs instances

I've been trying to get an 8x A100 instance on Lambda Labs to do a training run for my LLM from scratch series, but they're really busy at the moment, and it's rare to see anything. Thanks to the wonders of agentic coding, I spent an hour today getting something up and running to help, which I've called lambda-manager. It has three commands:

- , which prints which kinds of instances are available.
- , which prints out all of the possible instance types (available or not) with both their "friendly" names -- what you'd see on the website -- and the instance type names that the API uses.
- , which polls the API until it sees a specified type of instance, at which point it starts one and sends a Telegram message.

Let's see if that helps -- though it's been running for six hours now, with no luck...
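lambda-manager's code isn't shown in the post, but the core of that third command is simple enough to sketch. This is an illustration against Lambda's public cloud API and Telegram's bot API as I understand them; the endpoint shapes, SSH key name, and instance type name are assumptions worth verifying:

```python
import os
import time
import requests

API = "https://cloud.lambdalabs.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['LAMBDA_API_KEY']}"}
WANTED = "gpu_8x_a100"  # instance type name as the API knows it (assumed)

def poll_until_available(interval=60):
    """Poll the instance-types endpoint until a region has capacity."""
    while True:
        r = requests.get(f"{API}/instance-types", headers=HEADERS, timeout=30)
        r.raise_for_status()
        info = r.json()["data"][WANTED]
        regions = info.get("regions_with_capacity_available", [])
        if regions:
            return regions[0]["name"]
        time.sleep(interval)

def launch(region):
    """Start one instance of the wanted type in the given region."""
    payload = {
        "region_name": region,
        "instance_type_name": WANTED,
        "ssh_key_names": ["my-key"],  # hypothetical key name
    }
    r = requests.post(f"{API}/instance-operations/launch",
                      headers=HEADERS, json=payload, timeout=30)
    r.raise_for_status()

def notify(text):
    """Send a Telegram message via the bot API."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]
    chat = os.environ["TELEGRAM_CHAT_ID"]
    requests.get(f"https://api.telegram.org/bot{token}/sendMessage",
                 params={"chat_id": chat, "text": text}, timeout=30)

if __name__ == "__main__":
    region = poll_until_available()
    launch(region)
    notify(f"Started a {WANTED} in {region}")
```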


‘CanisterWorm’ Springs Wiper Attack Targeting Iran

A financially motivated data theft and extortion group is attempting to inject itself into the Iran war, unleashing a worm that spreads through poorly secured cloud services and wipes data on infected systems that use Iran’s time zone or have Farsi set as the default language.

Experts say the wiper campaign against Iran materialized this past weekend and came from a relatively new cybercrime group known as TeamPCP. In December 2025, the group began compromising corporate cloud environments using a self-propagating worm that went after exposed Docker APIs, Kubernetes clusters, Redis servers, and the React2Shell vulnerability. TeamPCP then attempted to move laterally through victim networks, siphoning authentication credentials and extorting victims over Telegram.

A snippet of the malicious CanisterWorm that seeks out and destroys data on systems that match Iran’s timezone or have Farsi as the default language. Image: Aikido.dev.

In a profile of TeamPCP published in January, the security firm Flare said the group weaponizes exposed control planes rather than exploiting endpoints, predominantly targeting cloud infrastructure over end-user devices, with Azure (61%) and AWS (36%) accounting for 97% of compromised servers. “TeamPCP’s strength does not come from novel exploits or original malware, but from the large-scale automation and integration of well-known attack techniques,” Flare’s Assaf Morag wrote. “The group industrializes existing vulnerabilities, misconfigurations, and recycled tooling into a cloud-native exploitation platform that turns exposed infrastructure into a self-propagating criminal ecosystem.”

On March 19, TeamPCP executed a supply chain attack against the vulnerability scanner Trivy from Aqua Security, injecting credential-stealing malware into official releases on GitHub Actions. Aqua Security said it has since removed the harmful files, but the security firm Wiz notes the attackers were able to publish malicious versions that snarfed SSH keys, cloud credentials, Kubernetes tokens and cryptocurrency wallets from users.

Over the weekend, the same technical infrastructure TeamPCP used in the Trivy attack was leveraged to deploy a new malicious payload which executes a wiper attack if the user’s timezone and locale are determined to correspond to Iran, said Charlie Eriksen, a security researcher at Aikido. In a blog post published on Sunday, Eriksen said if the wiper component detects that the victim is in Iran and has access to a Kubernetes cluster, it will destroy data on every node in that cluster. “If it doesn’t it will just wipe the local machine,” Eriksen told KrebsOnSecurity.

Aikido refers to TeamPCP’s infrastructure as “CanisterWorm” because the group orchestrates their campaigns using an Internet Computer Protocol (ICP) canister — a system of tamperproof, blockchain-based “smart contracts” that combine both code and data. ICP canisters can serve Web content directly to visitors, and their distributed architecture makes them resistant to takedown attempts. These canisters will remain reachable so long as their operators continue to pay virtual currency fees to keep them online.

Eriksen said the people behind TeamPCP are bragging about their exploits in a group on Telegram and claim to have used the worm to steal vast amounts of sensitive data from major companies, including a large multinational pharmaceutical firm.
"When they compromised Aqua a second time, they took a lot of GitHub accounts and started spamming these with junk messages," Eriksen said. "It was almost like they were just showing off how much access they had. Clearly, they have an entire stash of these credentials, and what we've seen so far is probably a small sample of what they have."

Security experts say the spammed GitHub messages could be a way for TeamPCP to ensure that any code packages tainted with their malware will remain prominent in GitHub searches. In a newsletter published today titled GitHub is Starting to Have a Real Malware Problem, Risky Business reporter Catalin Cimpanu writes that attackers often are seen pushing meaningless commits to their repos or using online services that sell GitHub stars and "likes" to keep malicious packages at the top of the GitHub search page.

This weekend's outbreak is the second major supply chain attack involving Trivy in as many months. At the end of February, Trivy was hit as part of an automated threat called HackerBot-Claw, which mass-exploited misconfigured workflows in GitHub Actions to steal authentication tokens. Eriksen said it appears TeamPCP used access gained in the first attack on Aqua Security to perpetrate this weekend's mischief. But he said there is no reliable way to tell whether TeamPCP's wiper actually succeeded in trashing any data from victim systems, and that the malicious payload was only active for a short time over the weekend.

"They've been taking [the malicious code] up and down, rapidly changing it and adding new features," Eriksen said, noting that when the malicious canister wasn't serving up malware downloads it was pointing visitors to a Rick Roll video on YouTube.

"It's a little all over the place, and there's a chance this whole Iran thing is just their way of getting attention," Eriksen said. "I feel like these people are really playing this Chaotic Evil role here."

Cimpanu observed that supply chain attacks have increased in frequency of late as threat actors begin to grasp just how efficient they can be, and his post documents an alarming number of these incidents since 2024.

"While security firms appear to be doing a good job spotting this, we're also gonna need GitHub's security team to step up," Cimpanu wrote. "Unfortunately, on a platform designed to copy (fork) a project and create new versions of it (clones), spotting malicious additions to clones of legitimate repos might be quite the engineering problem to fix."

Update, 2:40 p.m. ET: Wiz is reporting that TeamPCP also pushed credential-stealing malware to the KICS vulnerability scanner from Checkmarx, and that the scanner's GitHub Action was compromised between 12:58 and 16:50 UTC today (March 23rd).

alikhil 3 weeks ago

What is a CDN and Why It Matters?

With the rapid growth of GenAI solutions and the continuous launch of new applications, understanding the fundamental challenges and solutions of the web is becoming increasingly important. One of the core challenges is delivering content quickly to the end user. This is where a CDN comes into play.

CDN stands for Content Delivery Network. Let's break it down. (Note: modern CDN providers often bundle additional services such as WAF, DDoS protection, and bot management. Here, we focus on static content delivery.) Content refers to any asset that needs to be loaded on the user's device: images, audio/video files, JavaScript, CSS, and more. Delivery means that this content is not only available but also delivered efficiently and quickly. A CDN is a network of distributed nodes that cache content. Instead of fetching files directly from the origin server, users receive them from the nearest node, minimizing latency.

Consider an online marketplace for digital assets, such as a photo stock or NFT platform. The application stores thousands of images on a central server. Whenever users open the app, those images must load quickly. If the application server is hosted in Paris, users in Paris will experience minimal ping. However:

Users in Spain may see about 2× the ping time.
Users in the USA may see 6× the ping time.
Users in Australia may see 12× the ping time.

These numbers only reflect simple ICMP ping times. Actual file delivery involves additional overhead such as TCP connections and TLS handshakes, which increase delays even further.

With a CDN, each user connects to the nearest edge node instead of the origin server. This is typically achieved via GeoDNS. Importantly, only the CDN knows the actual address of the origin server, which also improves security by reducing exposure to direct DDoS attacks. CDN providers usually operate edge nodes in major world cities. When a request is made:

If the requested file is already cached on the edge node (cache hit), it is delivered instantly.
If not (cache miss), the edge node requests it from the CDN shield. If the shield has the file cached, it is returned to the edge and then served to the user. If not, the shield fetches it from the origin server, caching it along the way.

For popular websites, the cache hit rate approaches but rarely reaches 100% due to purges, new files, or new users. The shield node plays a critical role: without it, each cache miss from any edge node would hit the origin server directly, increasing load. Many providers offer shields as an optional feature, and enabling them can significantly reduce origin stress.

Beyond cache hits and misses, performance can be measured with concrete indicators:

Time to First Byte (TTFB): how long it takes for the first data to arrive after a request. CDNs usually reduce TTFB by terminating connections closer to the user.
Latency reduction: the difference in round-trip time between delivery from the origin versus delivery from an edge node.
Cache hit ratio: the percentage of requests served directly from edge caches.

These KPIs provide a real, measurable view of CDN efficiency rather than theoretical assumptions. The closer the edge node is to the end user, the faster the content loads. The key questions are: where are the users located, and which CDN providers have the best edge coverage for those locations? But don't rely on maps alone: measure real performance with Real User Monitoring (RUM) using metrics like TTFB and Core Web Vitals. There are plenty of ready-made tools available. If you're interested in building your own RUM system, leave a comment or reaction – I can cover that in a follow-up post.

Stratechery 1 month ago

Oracle Earnings, Oracle’s Cloud Growth, Oracle’s Software Defense

Oracle crushed earnings in a way that speaks not only to the secular AI wave they are riding but also to Oracle's strong position.

Dangling Pointers 1 month ago

Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds

I learned a lot about the LLC configuration and monitoring capabilities of modern CPUs from this paper; I bet you will too. The problem this paper addresses is: how do you avoid performance variability in cloud applications due to cross-VM contention for the last-level cache (e.g., the L3 cache on a Xeon)?

In a typical CPU, the L1 and L2 caches are private to a core, but the L3 is shared. In a cloud environment, the L3 is shared by multiple tenants, and is an avenue for a "noisy neighbor" to annoy its neighbors. The work described by this paper builds upon Intel CMT and CAT. Cache Monitoring Technology allows the hypervisor to track how much of the L3 cache is occupied by each VM. Cache Allocation Technology allows the hypervisor to restrict a VM to only use a subset of the L3. CAT allows a VM to be assigned to a class of service (CLOS), which defines the set of L3 ways accessible to the VM (this page defines the term "ways" if you are unfamiliar).

A typical CPU used by a cloud service provider has more CPU cores than L3 ways. If a cloud server hosts many small VMs, then L3 ways must be shared amongst VMs. The key problem solved by this paper is how to reduce performance variability given this constraint.

Fig. 1 illustrates the assignments of CLOS levels to LLC ways advocated by this paper. Each row is a level of service, and each column is a way of the LLC cache. CLOS[0] can access all ways, CLOS[1] can access all LLC ways except for one, and CLOS[7] can only access a single way of the LLC. Source: https://dl.acm.org/doi/10.1145/3774934.3786415

The hypervisor uses Intel CMT to monitor how much of the LLC is occupied by each VM. Every 6 seconds the hypervisor uses this information to change the CLOS that each VM is assigned to. The hypervisor computes a target LLC occupancy for each VM based on the number of cores assigned to the VM. This target is compared against the measured LLC occupancy to classify each VM into one of three categories:

Poor (the VM is starved for space)
Adequate (the VM is using just the right amount of cache)
Excess (the VM is hogging too much)

VMs in the poor category are de-suppressed (i.e., assigned to a CLOS with access to more LLC ways). VMs in the excess category are suppressed (i.e., assigned to a CLOS with access to fewer ways), but this suppression only occurs when there are VMs in the poor category. This policy means that cache-hungry VMs can use more than their fair share of the L3 during periods of low server utilization. This can lead to higher mean performance, at the cost of a wider standard deviation.

The paper describes a fourth state (overflow), which is only applied to VMs that wish to be held back even if there is plenty of L3 space available. These VMs are suppressed when they are found to be using too much L3, even if all other VMs on the system are getting enough cache space.

Fig. 5 shows a case where this strategy works well compared to static allocation. The server in question is running 5 VMs, each running a different application:

VM1 - 32 cores
VM2 - 16 cores (but doesn't fully utilize those cores)
VM3 - 8 cores
VM4 - 4 cores
VM5 - 4 cores

The top of figure 5 shows a simple static partitioning of LLC ways. VM1 is assigned to 6 ways, VM2 to 3 ways, VM3 to 2 ways, and VMs 4 and 5 must share 1 way. They have to share because way-based partitioning of the LLC is inherently coarse-grained.

The two charts show measured LLC utilization over 10 minutes. Notice the Y-axis. The technique described in this paper (Cacheman) allows VM4 and VM5 to use far more aggregate LLC capacity than the static partitioning. Also notice that in the static partitioning, VM5 always uses more LLC than VM4 (because they are running different applications), whereas Cacheman allows for a more even balance between them. Source: https://dl.acm.org/doi/10.1145/3774934.3786415

Dangling Pointers: While the L3 cache is logically a monolithic shared resource, it is physically partitioned across the chip (with a separate slice near each core). It seems like it could be more efficient if VMs could be assigned to nearby L3 slices rather than L3 ways.

Stratechery 1 month ago

Copilot Cowork, Anthropic’s Integration, Microsoft’s New Bundle

Microsoft is seeking to commoditize its complements, but Anthropic has a point of integration of their own; it's good enough that Microsoft is making a new bundle on top of it.

Stratechery 1 month ago

MacBook Neo, The (Not-So) Thin MacBook, Apple and Memory

The MacBook Neo was built to be cheap; that it is still good is not only a testament to Apple Silicon, but also the fact that the most important software runs in the cloud.

David Bushell 1 month ago

Bunny.net shared storage zones

Whilst moving projects off Cloudflare and migrating to Bunny I discovered a neat 'Bunny hack' to make life easier. I like to explicitly say "no" to AI bots using AI robots.txt†. Updating this file across multiple websites is tedious. With Bunny it's possible to use a single file.

† I'm no fool, I know the AI industry has a consent problem but the principle matters.

My solution was to create a new storage zone as a single source of truth. In the screenshot above I've uploaded my common robots.txt file to its own storage zone. This zone doesn't need any "pull zone" (CDN) connected. The file doesn't need to be publicly accessible by itself here.

With that ready I next visited each pull zone that will share the file. Under "CDN > Edge rules" in the menu I added the following rule. I chose the action: "Override Origin: Storage Zone" and selected the new shared zone. Under conditions I added a "Request URL" match for the robots.txt path. Using a wildcard makes it easier to copy & paste. I tried dynamic variables but they don't work for conditions.

I added an identical edge rule for all websites I want to use the shared robots.txt. Finally, I made sure the CDN cache was purged for those URLs. This technique is useful for other shared assets like a favicon, for example. Neat, right?

One downside to this approach is vendor lock-in. If or when Bunny hops the shark and I migrate elsewhere I must find a new solution. My use case for robots.txt is not critical to my websites functioning so it's fine if I forget. Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.

./techtipsy 1 month ago

The cloud just stopped scaling

It has happened. The cloud just stopped scaling. Hetzner's cloud, for now. At this rate, my home server will actually have to become production at work, and my gaming PC has to be converted to a server because it has a whopping 32 GB of RAM and 6 good CPU cores. With Forza Horizon 6 on the horizon, it is time for some difficult decisions… Shortly after publishing this post, AWS had a different type of issue with availability zones going down due to… Iranian missile strikes. I've wanted decentralized hosting to be more popular, but not like this.

Stratechery 1 month ago

Thin Is In

Listen to this post: There was, in the early days of computing, no debate about thick clients versus thin. When a computer was the size of a room, there were no clients: you scheduled time or submitted jobs, and got back the results when it was your turn. A few years later, however, thin clients in the form of a monitor and keyboard arrived:

There is no computer in this image; rather, this is a terminal connected to a mainframe. That's why it's called a "thin" client: it's just an interface, with all of the computing happening elsewhere (i.e. in another room).

By the 1980s, however, "thick" clients were the dominant form of computing, in the form of the PC. All of your I/O and compute were packaged together: you typed on a keyboard connected to a PC, which output to the monitor in front of you. A decade later, Sun Microsystems in particular tried to push the idea of a "network computer": this was a device that didn't really have a local operating system; you ran Java applications and Java applets from a browser, downloaded as they were used from a central server. Sun's pitch was that network computers would be much cheaper and easier to administer, but PCs were dropping in price so quickly that the value proposition rapidly disappeared, and Windows was so dominant that it was already the only platform that network administrators wanted to deal with. Thick clients won, and won decisively.

If you wanted to make a case for thin clients, you could argue that mobile devices are a hybrid; after all, the rise of mobile benefited from and drove the rise of the cloud: nearly every app on a phone connects to a server somewhere. Ultimately, however, mobile devices are themselves thick clients: they are very capable computers in their own right that certainly benefit from being connected to a server, but are useful without it. Critically, the server component is just data: the actual interface is entirely local. You can make the same argument about SaaS apps: on one hand, yes, they operate in the cloud and are usually accessed via a browser; on the other hand, the modern browser is basically an operating system in its own right, and the innovation that made SaaS apps possible was the fact that interactive web apps could be downloaded and run locally. Granted, this isn't far off from Sun's vision (although the language ended up being JavaScript, not Java), but you still need a lot of local compute to make these apps work.

The thick-versus-thin debate felt, for many years, like a relic; that's how decisive the thick client victory was. One of the things that is fascinating about AI, however, is that the thin client concept is not just back, it's dominant. The clearest example of this is the interface that most people use to interact with AI: chat. There is no UI that matters other than a text field and a submit button; when you click that button the text is sent to a data center, where all of the computation happens, and an answer is sent back to you. The quality of the answer, and of the experience as a whole, is largely independent of the device you are using: it could be a browser on a PC, an app on a high-end smartphone, or the cheapest Android device you can find. The device could be a car, or glasses, or just an earpiece. The local compute that matters is not processing power, but rather connectivity. This interaction paradigm actually looks a lot like the interaction paradigm for mainframe computers: type text into a terminal, send it to the computer, and get a response back.
Unlike mainframe terminals, however, the user doesn't need to know a deterministic set of commands; you just say what you want in plain language and the computer understands. There is no pressure for local compute capability to drive a user interface that makes the computer easier to use, because a more complex user interface would artificially constrain the AI's capabilities. Nicolas Bustamante, in an X Article about the prospects for vertical software in an AI world, explained why this is threatening:

When the interface is a natural language conversation, years of muscle memory become worthless. The switching cost that justified $25K per seat per year dissolves. For many vertical software companies, the interface was most of the value. The underlying data was licensed, public, or semi-commoditized. What justified premium pricing was the workflow built on top of that data. That's over.

Bustamante's post is about much more than chat interfaces, but I think the user interface point is profound: it's less that AI user interfaces are different, and more that, for many use cases, they basically don't exist. This is even clearer when you consider the next big wave of AI: agents. The point of an agent is not to use the computer for you; it's to accomplish a specific task. Everything between the request and the result, at least in theory, should be invisible to the user. This is the concept of a thin client taken to the absolute extreme: it's not just that you don't need any local compute to get an answer from a chatbot; you don't need any local compute to accomplish real work. The AI on the server does it all.

Of course, most agentic workflows that work today tread a golden path, and stumble with more complex situations or edge cases. That, though, is changing rapidly, as models become better and the capabilities of the chips running them increase, particularly in terms of memory. When it comes to inference, memory isn't just important for holding the model weights, but also for retaining context about the task at hand. To date most of the memory that matters has been high-bandwidth memory attached to the GPUs, but future architectures will offload context to flash storage. At the same time, managing agents is best suited to CPUs, which themselves need large amounts of DRAM. In short, both the amount of compute we have and the capability of that compute still aren't good enough; once they cross that threshold, though, demand will only get that much stronger.

This combination of factors will only accentuate the dominance of the thin client paradigm: yes, you can run large language models locally, but you are limited in the size of the model, the size of the context window, and speed. Meanwhile, the superior models with superior context windows and faster speeds don't require a trip to the computer lab; just connect to the Internet from anywhere. Note that this reality applies even to incredible new local tools like OpenClaw: OpenClaw is an orchestration layer that runs locally, but the actual AI inference is, by default and in practice for most users, done by models in the cloud. To put it another way, to be competitive, local inference would need some combination of smaller-yet-sufficiently-capable models, a breakthrough in context management, and, critically, lots and lots of memory. It's that last one that might be the biggest problem of all.
From Bloomberg:

A growing procession of tech industry leaders including Elon Musk and Tim Cook are warning about a global crisis in the making: A shortage of memory chips is beginning to hammer profits, derail corporate plans and inflate price tags on everything from laptops and smartphones to automobiles and data centers — and the crunch is only going to get worse… Sony Group Corp. is now considering pushing back the debut of its next PlayStation console to 2028 or even 2029, according to people familiar with the company's thinking. That would be a major upset to a carefully orchestrated strategy to sustain user engagement between hardware generations. Close rival Nintendo Co., which contributed to the surplus demand in 2025 after its new Switch 2 console drove storage card purchases, is also contemplating raising the price of that device in 2026, people familiar with its plans said. Sony and Nintendo representatives didn't respond to requests for comment. A manager at a laptop maker said Samsung Electronics has recently begun reviewing its memory supply contracts every quarter or so, versus generally on an annual basis. Chinese smartphone makers including Xiaomi Corp., Oppo and Shenzhen Transsion Holdings Co. are trimming shipment targets for 2026, with Oppo cutting its forecast by as much as 20%, Chinese media outlet Jiemian reported. The companies did not respond to requests for comment.

The memory shortage has been looming for a while, and is arguably the place where consumers will truly feel the impact of AI; I wrote in January in the context of Nvidia's keynote at CES:

CES stands for "Consumer Electronics Show", and while Nvidia's gaming GPUs received some updates, they weren't a part of [Nvidia CEO Jensen] Huang's keynote, which was focused on that Vera Rubin AI system and self-driving cars. In other words, there wasn't really anything for the consumer, despite the location, because AI took center stage. This is fine as far as Nvidia goes: both the Vera Rubin announcement and its new Alpamayo self-driving system are big deals. It is, however, symbolic of the impact AI is having on technology broadly, and that impact is set to impact consumer electronics in a major way. Specifically, not only is all of the energy and investment in the tech sector going towards AI, but so is the supply chain. A big story over the last few months has been the dramatically escalating cost of memory as the major memory manufacturers shift their focus to high-bandwidth memory for AI chips in particular. What that means is that everything else is going to get a lot more expensive: memory is one of the most expensive components in nearly everything tech-related, and given the competitive and commoditized nature of the industry those costs will almost certainly be passed on to the end users. This AI crowd-out dynamic arguably started with the hyperscalers, who diverted ever increasing parts of their budget to GPUs in place of CPU purchases, but now it's coming for everything from grid power to turbines and now to components, and it's only going to increase and become more impactful to end users. In other words, Nvidia may not have talked about consumer electronics at the Consumer Electronics Show, but they are having the biggest impact on the industry by far.

The downsides of this crowd-out effect are obvious; I pity anyone trying to build their own PC, for example, but soon their pain will be everyone's pain as prices inevitably rise on everything that needs RAM.
At the same time, I think the reported PlayStation delay is telling: apparently the PS5 is "good enough" for Sony to wait for more memory capacity to come online, and they're probably right! Thick clients — of which consoles like the PS5 are the ultimate example — have long since reached the point of diminishing returns when it comes to hardware improvements. I think you could make the same case for PCs and phones as well: what we already have is more than sufficient for almost any task we want to do.

Moreover, the plateau in thick client capability is happening at the same time that the need for any capability at all is disappearing, thanks to these entirely new AI workflows that happen in the cloud. Yes, it sucks that AI is making memory scarce and personal computers of all kinds — from PCs to phones to consoles — more expensive; it's also making them less important than ever.

Of course thick clients could make a comeback, particularly since local inference is "free" (i.e. the user pays for their own electricity). As I noted above, however, I'm skeptical about local inference in the near term for performance reasons, and the memory crunch is going to make it uneconomical for the foreseeable future. And, by the time local inference is a viable alternative, path dependency downstream of these few years may have already led many workflows to move to this new paradigm.

It will, to be clear, be a transition: UI isn't just about how to use a computer; it also, as Benedict Evans noted on a recent Interview, embeds critical aspects of how a business works. Open-ended text prompts in particular are a terrible replacement for a well-considered UI button that both prompts the right action and ensures the right thing happens. That's why it's the agent space that will be the one to watch: which workflows will transition from UI to AI, and thus from a thick client architecture to a thin one? Current workflows are TBD; future workflows seem inevitable. First, if compute isn't yet good enough, then workloads will flow to wherever compute is the best, which is going to be in large data centers. Second, if larger models and more context make for better results, then workloads will flow to wherever there is the most memory available. Third, the expense of furnishing this level of compute means that it will be far more economical to share the cost of that compute amongst millions of users, guaranteeing high utilization and maximizing leverage on your up-front costs.

neilzone 2 months ago

Moving away from Nextcloud

I have used Nextcloud for a long time. In fact, I have used Nextcloud from before it was Nextcloud - before the fork of Owncloud. And while I have not used many of its features - just sync, calendar, and contacts - I've been a very happy user for a long time. Until a year or so ago, at least.

I've had a worry, at the back of my mind, for a while, that Nextcloud is trying to do too much. A collaborative document editor. An email client. A voice/video conferencing tool, and so on. I'm sure that, in some contexts, this is amazing and convenient. For me, as someone who typically prefers a piece of software to do one thing well, it left me a bit uneasy. But that was not, in itself, enough of a reason for me to switch.

A year or so ago, I had problem after problem keeping files in sync. I routinely got error messages about the database (or files; I don't quite remember) being locked. And, for me, that was the mainstay of Nextcloud, and indeed the reason why I started to use it in the first place. I tried all sorts of things, including setting up redis and trying other memcache options, even though I am the only regular user. I could not get it to sync reliably. And I really did try, using the voluminous logs to try to determine what was going wrong. But I failed. And so I started considering other options. Did I actually need Nextcloud at all?

I've moved to Syncthing for syncing, and so far, that has been working fine. It is fast, and appears to be reliable. I should probably write about it at some point. Using Nextcloud to sync photos from my phone was not too bad, but from Sandra's iPhone, it did not work well. I have switched to Immich for photo sync / gallery, and I've been very happy with it.

For contacts and calendar sync - DAV - I am using Radicale. The main annoyance is that Sandra cannot invite me (or anyone) to appointments using the iOS or macOS calendar. For me, I've just given Sandra write access to my calendar, so that she can add events directly, but it is far from ideal. I've tried using Radicale's server-side email functionality, and that is not suitable for my needs, as it sends out far too many emails. But, for now, Radicale is tolerable, even if I might try to find another option at some point.

And that just leaves the directories which I share via Nextcloud and mount in my file browser. Stuff that I don't need on my computer, but still want to access. For that, I'm going back to samba. It works. And so, once I've finalised this and tested it and given it some time to bed in, I will turn off the Nextcloud server.

Stratechery 2 months ago

Amazon Earnings, CapEx Concerns, Commodity AI

Amazon's massive CapEx increase makes me much more nervous than Google's, but it is understandable.
