Latest Posts (20 found)

re: The commodification of travel

I relate a lot to Herman's recent post, in more ways than one. His thoughts on travel are very similar to mine, and the fact that he's writing from Kyoto is even better. You see, I spent a semester studying in Japan 14 years ago (good grief, has it been that long?). I went to a business school in Tokyo (武蔵大学), despite being a computer science major. I honestly had never been interested in traveling until then. The only reason I found myself in Japan was having been lured into the International Studies office by a sign saying "Free Pizza (Learn About Exchange Programs)", and hey, my college town did have good pizza!

While in Japan, I visited Kyoto during the freezing cold of November. I went with another American in my program; we stayed in a hostel, hit the 銭湯 (public bathhouse) every night, and hopped between destinations by bus. We had little plan outside of a small list of things we wanted to see. Most of what we ended up doing was the result of recommendations from our hostel host. And we had a blast. Kyoto was near empty at that time. For example, when we visited 伏見稲荷大社 (the fox shrine that consists of thousands of torii gates leading up a mountain), we hiked for hours and saw only a handful of people. To escape the crisp air, we stepped into a tea house and had a wonderful conversation with the owner (who was just happy to have company). It's one of the most memorable times of my life.

I absolutely love Japan. I've been twice, the second time in 2017. But now, like Herman, I don't know if I would return. At least not to Kyoto. To explain why, let's switch countries.

The first time I was in Taiwan, my wife and I went to a famous museum. We went by city bus and a whole lotta walking. Most people arrived by chartered tour buses. As we browsed the aisles at a leisurely pace, we were constantly interrupted. First would come a person wearing a speaker, holding a little flag, and talking into a microphone. Then the herd of tourists would follow, each cramming their bodies in front of whatever object so as to get a selfie. They'd stay just long enough at each object to grab a picture, then shuttle off to the next. Not a person read the plaques or admired the details. It ruined the experience; I remember leaving that museum very frustrated.

Unfortunately, I've heard that's the state of Kyoto these days. For example, the Geisha district my friend and I randomly found ourselves walking through while looking for dinner was closed due to unruly tourists. People travel for the photo, the checkbox, the badge of honor. They don't dare do or eat as the locals do; instead they're pin-balled between selfie spots, only to end the day at a buffet of their home cuisine. When they get home, they post their photos, rake in the likes, and forget the whole experience.

And look, I've been part of these groups before. My wife is Chinese, and this is the normal way Chinese travel. A big reason for this is visas: a lot of countries (Japan included) don't allow Chinese tourists who are not part of a tour group. The first time I saw the Great Wall was as part of a tour group, the same with the China/Mongolia border. They shuttled us from selfie spot to selfie spot. We had very little time to explore. Each meal was a buffet. Most of my memories are of the inside of the charter bus, not the attractions themselves. I hated it.

People are surprised when I mention I'm a repeat visitor to a small set of countries. I've visited Taiwan twice, Japan twice, China twice, and Mexico five times. I'll get comments of "you really should do X country instead of places you've already done". I fall in love with cultures. And food (same thing).

The last time I was in Taiwan, I was there for a month, living in an Airbnb and working out of a coworking spot. I met so many amazing people. There was the guy who designed the paint for subway cars, who took me to a local noodle shop. The American who asked to join me for a day of temple adventures. And the guy whose wife insisted he invite me to sit with them at the theater and share the beer they snuck in. The second time in Japan I stayed in the same suburb as my dorm from my student days. In Mexico, I worked remotely from cafes and explored the incredible food malls of Mérida.

I'm not into tourist spots. I'm into finding out what stores are in the alley, discovering what small town is at the end of the Tobu Tojo line, going behind the scenes on a TV show set at the top of a skyscraper, or discovering where the random boat I found that accepts my transit card takes me (it was an island with a bar on the beach, awesome night).

That's what travel should be. Stepping into the unknown, excited, and a little afraid. Discovering the local culture, food, and people. Making connections and memories that last a lifetime, so when things get tough you can sit in the shower, remembering the water in the public bathhouse washing over you.

14 years ago I sat on a futon, resting my worn luggage against the wall. After 20+ hours of travel I had made it to my new home in Saitama, Japan (さいたま市). As I sat there, I had a panic attack. Where the fuck was I? What was I doing in a country I knew nothing about? Hell, I didn't even speak a single word of the language. How could I possibly survive on my own for the next semester? An hour or so later I calmed myself down and went on a walk to a department store with a few of the other exchange students. I marveled at how similar, but different, everything was. I also gawked at the insane price of fruit. Three hours later I unexpectedly found myself in a public bath. Talk about culture shock! The next day was the start of the best four months of my life.

That's what travel is.


Is Claude Code going to cost $100/month? Probably not - it's all very confusing

Anthropic today quietly (as in silently, no announcement anywhere at all) updated their claude.com/pricing page (but not their Choosing a Claude plan page, which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, and it's already reverted): The Internet Archive copy from yesterday shows a checkbox there.

Claude Code used to be a feature of the $20/month Pro plan, but according to the new pricing page it is now exclusive to the $100/month or $200/month Max plans.

Update: don't miss the update to this post - they've already changed course a few hours after this change went live.

So what the heck is going on? Unsurprisingly, Reddit and Hacker News and Twitter all caught fire. I didn't believe the screenshots myself when I first saw them - aside from the pricing grid I could find no announcement from Anthropic anywhere. Then Amol Avasare, Anthropic's Head of Growth, tweeted:

For clarity, we're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected.

And that appears to be the closest we have had to official messaging from Anthropic. I don't buy the "~2% of new prosumer signups" thing, since everyone I've talked to is seeing the new pricing grid and the Internet Archive has already snapped a copy. Maybe he means that they'll only be running this version of the pricing grid for a limited time which somehow adds up to "2%" of signups?

I'm also amused to see Claude Cowork remain available on the $20/month plan, because Claude Cowork is effectively a rebranded version of Claude Code wearing a less threatening hat!

There are a whole bunch of things that are bad about this. If we assume this is indeed a test, and that test comes up negative and they decide not to go ahead with it, the damage has still been extensive:

- A whole lot of people got scared or angry or both that a service they relied on was about to be rug-pulled. There really is a significant difference between $20/month and $100/month for most people, especially outside of higher salary countries.
- The uncertainty is really bad! A tweet from an employee is not the way to make an announcement like this. I wasted a solid hour of my afternoon trying to figure out what had happened here.
- My trust in Anthropic's transparency around pricing - a crucial factor in how I understand their products - has been shaken. Strategically, should I be taking a bet on Claude Code if I know that they might 5x the minimum price of the product?
- More of a personal issue, but one I care deeply about myself: I invest a great deal of effort (that's 105 posts and counting) in teaching people how to use Claude Code. I don't want to invest that effort in a product that most people cannot afford to use.

Last month I ran a tutorial for journalists on "Coding agents for data analysis" at the annual NICAR data journalism conference. I'm not going to be teaching that audience a course that depends on a $100/month subscription!

This also doesn't make sense to me as a strategy for Anthropic. Claude Code defined the category of coding agents. It's responsible for billions of dollars in annual revenue for Anthropic already. It has a stellar reputation, but I'm not convinced that reputation is strong enough for it to lose the $20/month trial and jump people directly to a $100/month subscription.

OpenAI have been investing heavily in catching up to Claude Code with their Codex products. Anthropic just handed them this marketing opportunity on a plate - here's Codex engineering lead Thibault Sottiaux:

I don't know what they are doing over there, but Codex will continue to be available both in the FREE and PLUS ($20) plans. We have the compute and efficient models to support it. For important changes, we will engage with the community well ahead of making them. Transparency and trust are two principles we will not break, even if it means momentarily earning less. A reminder that you vote with your subscription for the values you want to see in this world.

I should note that I pay $200/month for Claude Max and I consider it well worth the money. I've had periods of free access in the past courtesy of Anthropic but I'm currently paying full price, and happy to do so. But I care about the accessibility of the tools that I work with and teach. If Codex has a free tier while Claude Code starts at $100/month I should obviously switch to Codex, because that way I can use the same tool as the people I want to teach how to use coding agents.

Here's what I think happened. I think Anthropic are trying to optimize revenue growth - obviously - and someone pitched making Claude Code only available for Max and higher. That's clearly a bad idea, but "testing" culture says that it's worth putting even bad ideas out to test just in case they surprise you. So they started a test, without taking into account the wailing and gnashing of teeth that would result when their test was noticed - or accounting for the longer-term brand damage that would be caused.

Or maybe they did account for that, and decided it was worth the risk. I don't think that calculation was worthwhile. They're going to have to make a very firm commitment along the lines of "we heard your feedback and we commit to keeping Claude Code available on our $20/month plan going forward" to regain my trust. As it stands, Codex is looking like a much safer bet for me to invest my time in learning and building educational materials around.

In the time I was typing this blog entry Anthropic appear to have reversed course - the claude.com/pricing page now has a checkbox back in the Pro column for Claude Code. I can't find any official communication about it though. Let's see if they can come up with an explanation/apology that's convincing enough to offset the trust bonfire from this afternoon!

Amol on Twitter: was a mistake that the logged-out landing page and docs were updated for this test [embedded self-tweet]

Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected. This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes.

So the experiment is still running, just not visible to the rest of the world?


The commodification of travel

I've noticed that travel has become, of late, an act of collecting places. I've literally heard people referring to visiting a place as doing that place, as in "Have you done Japan?", assuming that one can do an entire country, and once that country is done it remains as such. As if a place is a product to be consumed and checked off the list. Why bother returning to a place if you've already done it?

I received a gift many years ago which, while being well-intentioned, typifies this idea: a scratch-off map of the world. Each time you visit a country, you can scratch off the metallic coating and the country is now done, according to the map. The work trip I took to São Paulo a decade ago? Brazil: done. Bus tour through Europe? Germany? Check. France? Check. Spain? You get the idea.

This kind of mentality is typified in the question I've heard asked many times: "How many countries have you been to?" This is often followed by a debate on whether layovers count towards your tally if you don't leave the airport, as if stepping beyond the airport boundaries bestows doneness.

Like many things, I blame social media. It's changed travel from an exploration to social status signalling. I started thinking about this a few years ago while visiting some waterfalls in Indonesia. I love a good frolic in a waterfall, but all of them were just lines of people waiting to take their photo under the falls, and then they'd better get out of the way for the next photo-goers. No frolicking allowed! People need to do these waterfalls!

I spent this morning in a beautiful garden outside of Kyoto, which exemplifies the cultural ideals of appreciating nature and meditating on the beauty that surrounds us. It was lovely during the early morning, but then the rest of the world showed up and all they wanted to do was take photos and move on to the next spot to do. There was one moment, in perhaps one of the most heart-wrenchingly beautiful places I've ever visited, where I was surrounded by about 20 other people, all of them either in the process of taking a photo, or looking at what they had just taken on their phones. No one was looking at the amazing stuff they were doing!

That isn't to say taking photos is bad. They're a great way to share an experience with others and save a memory of a time and place—but I think the threshold of what is enough has been crossed in the age of Instagram where images and video are socially valuable. Now beautiful places are commodified. And I don't know if we'll ever go back.

I appreciate that many places in Japan limit photos and videos, such as on trains or in gyms, for the sake of not annoying those around you. Perhaps once sunglasses cameras take off and people can record their entire lives they can finally experience where they are, instead of trying to capture it perfectly for later.

All that being said, I don't want to gate-keep. If this is the form of travel that makes people happy, then they should do it to their heart's content. Similarly to how some people collect Magic cards while never playing the game, sometimes the fun is in the collection itself. But perhaps look up from your phone once in a while. The world is prettier in full resolution.


News: Anthropic Removes Claude Code From $20-A-Month "Pro" Subscription Plan For New Users (Developing)

In developing news, Anthropic appears to have removed access to AI coding tool Claude Code from its $20-a-month "Pro" accounts. This is likely another cost-cutting move that follows a recent change (per The Information) that forced enterprise users to pay a per-million-token rate rather than having rate limits that, based on researchers' findings, often allowed usage worth much more than the cost of the subscription.

Previously, users were able to access Claude Code with their Pro subscriptions via a command-line interface and both the web and desktop Claude apps. Instead of paying on a per-million-token basis, they could use their subscription to access Claude Code; they will now likely have to pay for API access.

Anthropic's Claude Code support documents (as recently as this April 10th archived page) previously read "Using Claude Code with your Pro or Max plan." The page now reads "Using Claude Code with your Max plan." Pricing on Anthropic's website reflects the removal of Claude Code on both mobile and desktop. Some Pro users report that they are still able to access Claude Code via the web app and command-line interface.

It is unclear at this time whether this change is retroactive or applies only to new Pro subscribers, or whether Anthropic intends to entirely remove access to Claude Code (without paying for API tokens) from every Pro customer. I have requested a comment from Anthropic, and will update this piece when I receive it, or if Anthropic confirms this move otherwise.

If you liked this news hit and want to support my independent reporting and analysis, why not subscribe to my premium newsletter? It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA, Anthropic and OpenAI’s finances, and the AI bubble writ large. I recently put out the timely and important Hater’s Guide To The SaaSpocalypse, another on How AI Isn't Too Big To Fail, a deep (17,500 word) Hater’s Guide To OpenAI, and just last week put out the massive Hater’s Guide To Private Credit. Subscribing to premium is both great value and makes it possible to write these large, deeply-researched free pieces every week.

In summary:

- Anthropic appears to have removed access to Claude Code for its $20-a-month "Pro" plans.
- Current Pro users appear to still have access via the Claude web app.
- Claude Code support documents exclusively refer to accessing Claude Code via "your Max Plan," after previously saying you could access it "with your Pro or Max Plan."


Where's the raccoon with the ham radio? (ChatGPT Images 2.0)

OpenAI released ChatGPT Images 2.0 today, their latest image generation model. On the livestream Sam Altman said that the leap from gpt-image-1 to gpt-image-2 was equivalent to jumping from GPT-3 to GPT-5. Here's how I put it to the test.

First, as a baseline, here's what I got from the older gpt-image-1 using ChatGPT directly:

I wasn't able to spot the raccoon - I quickly realized that testing image generation models on Where's Waldo style images (Where's Wally in the UK) can be pretty frustrating! I tried getting Claude Opus 4.7 with its new higher resolution inputs to solve it, but it was convinced there was a raccoon it couldn't find thanks to the instruction card at the top left of the image:

Yes — there's at least one raccoon in the picture, but it's very well hidden. In my careful sweep through zoomed-in sections, honestly, I couldn't definitively spot a raccoon holding a ham radio. [...]

Next I tried Google's Nano Banana 2, via Gemini:

That one was pretty obvious, the raccoon is in the "Amateur Radio Club" booth in the center of the image! Claude said:

Honestly, this one wasn't really hiding — he's the star of the booth. Feels like the illustrator took pity on us after that last impossible scene. The little "W6HAM" callsign pun on the booth sign is a nice touch too.

I also tried Nano Banana Pro in AI Studio and got this, by far the worst result from any model. Not sure what went wrong here!

With the baseline established, let's try out the new model. I used an updated version of my openai_image.py script, which is a thin wrapper around the OpenAI Python client library. Their client library hasn't yet been updated to include the new model, but thankfully it doesn't validate the model ID so you can use it anyway. Here's how I ran that (a sketch of the equivalent direct API call is at the end of this post):

Here's what I got back. I don't think there's a raccoon in there - I couldn't spot one, and neither could Claude.

The OpenAI image generation cookbook has been updated with notes on the new model, including the setting and available sizes. I tried the highest setting and the largest dimensions - I believe that's the maximum - and got this - a 17MB PNG which I converted to a 5MB WEBP:

That's pretty great! There's a raccoon with a ham radio in there (bottom left, quite easy to spot). The image used 13,342 output tokens, which are charged at $30/million, so a total cost of around 40 cents.

I think this new ChatGPT image generation model takes the crown from Gemini, at least for the moment. Where's Waldo style images are an infuriating and somewhat foolish way to test these models, but they do help illustrate how good they are getting at complex illustrations combining both text and details.

rizaco on Hacker News asked ChatGPT to draw a red circle around the raccoon in one of the images in which I had failed to find one. Here's an animated mix of their result and the original image:

Looks like we definitely can't trust these models to usefully solve their own puzzles!
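For anyone who wants to try this themselves, here's a minimal sketch of the equivalent call using the OpenAI Python client directly. To be clear: this is an illustration, not my actual openai_image.py script, and the prompt and size values are placeholders rather than the exact settings I used.

```python
# Illustrative sketch: calling the new image model via the OpenAI Python
# client. "gpt-image-2" is the model name discussed in this post; the
# prompt and size below are placeholders, not my exact values.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",  # the client library doesn't validate model IDs
    prompt=(
        "A dense Where's Waldo style illustration of a crowded street "
        "fair. Hidden somewhere in the scene: a raccoon operating a "
        "ham radio."
    ),
    size="1024x1024",  # placeholder; larger sizes are available
)

# The Images API returns base64-encoded image data.
with open("raccoon-fair.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```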


Writing an LLM from scratch, part 32m -- Interventions: conclusion

Last November, when I finished the main body of "Build a Large Language Model (from Scratch)", I set myself a number of follow-on goals. One was "training the full GPT-2 base model myself". I've reached the end of that journey, with a model that is almost -- if not quite -- as good as GPT-2 small, trained in 44 hours on my own machine, so I thought it would be worth summarising how it went.

In December, I trained my first model, taking two days, but was disappointed to see that it was worse in terms of loss, and in terms of how well it could be fine-tuned to follow instructions, than the original GPT-2 model. I expected that a chunk of that difference was likely to be due to the original model having been trained for longer, but also noticed that there were a number of changes -- interventions -- that I could make to the model and the training run, and I thought they might help.

In January, I got a DDP training system together that would allow me to iterate on those interventions without having to wait for two days for each result. In February, I got started by training a baseline model in the cloud, and I've since ground through all of the interventions, and come up with a set that lowered the loss nicely, both in the cloud, and locally.

Along the way, I've learned about, or refined my knowledge of, a bunch of ML concepts. In increasing order of how they helped with the loss (with the first two actually making it slightly worse):

- Weight tying, which I found made the loss worse, but it was interesting how simple it was to implement.
- PyTorch's Automated Mixed Precision, which also harmed the loss a tiny bit, but had the benefit of making training twice as fast, and 66% cheaper in the cloud -- well worth the loss penalty.
- Gradient clipping -- a cheap, but (somewhat to my surprise) not particularly effective intervention for this model.
- QKV bias -- that is, adding bias to the attention weight matrices -- which also helped a tiny bit, though I later felt that this might have been in the noise.
- Weight decay -- more effective, and something that's simple enough to understand with simple gradient descent. I still need to learn more about it in the context of optimisers, though -- particularly with AdamW.
- Dropout, which seems to be less than useful for single-epoch training: removing it helped the model quite a lot.
- The learning rate, which I built up quite a lot of new knowledge about, and by both increasing it and scheduling it, I got the biggest bang for the buck.

(I've sketched the three most mechanical of these -- weight decay, gradient clipping, and the learning rate schedule -- in code at the end of this post.)

I've also learned how to upload my custom models to Hugging Face, found out some interesting things about how random noise affects training, and come up with improvements in the setup I have for using an LLM as a judge for instruction fine-tuned models.

There was a bit of a mystery when I tried out the instruction fine-tuning tests, though. Although two of my models were very close to GPT-2 small in terms of loss, I found that while one of them had an instruction fine-tuning result that was likewise close to GPT-2 small, the other was much worse! A mystery to dig into later, I think.

But it was still very satisfying that my best model -- trained locally in 44 hours -- was almost as good as GPT-2 small, even if it did fall somewhat short. So on that positive note, I'm going to wrap up this "Interventions" series-within-a-series, and move on to the two other things I wanted to do before wrapping up the "LLM from scratch" series as a whole:

- Going through the appendices in the book to see if there's anything I want to highlight there.
- The final test as to whether I've really understood everything: building my own LLM from scratch without reference to the book. I want to do that in a different framework, not PyTorch, to minimise the risk of just regurgitating code -- I asked people on X/Twitter which one I should use, and the winner was JAX -- so it should be interesting to see how that goes!

The appendices first, I think -- I'll post about them shortly. But I think the big one will be the JAX implementation -- really looking forward to that.
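As promised, here's that sketch -- to be clear, this is not my actual training code; the model is a stand-in and every hyperparameter is a placeholder. It shows weight decay via AdamW, gradient clipping, and a warmup-plus-cosine learning rate schedule:

```python
import math

import torch
import torch.nn as nn

# Stand-in for a GPT-2-style model, just to make the sketch runnable.
model = nn.Linear(128, 128)

# Weight decay is a one-liner with AdamW (values are placeholders).
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

warmup_steps, total_steps = 100, 1_000  # placeholder schedule lengths

def lr_multiplier(step: int) -> float:
    # Linear warmup to the peak LR, then cosine decay down to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)

for step in range(total_steps):
    x = torch.randn(32, 128)
    loss = model(x).pow(2).mean()  # dummy loss for the sketch
    loss.backward()
    # Gradient clipping: rescale gradients if their norm exceeds 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

The LambdaLR multiplier scales the peak learning rate, so the schedule runs linearly from zero up to the peak and then back down along a cosine curve -- the "increasing it and scheduling it" combination that gave me the biggest bang for the buck.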


Four Horsemen of the AIpocalypse

If you liked this piece, please subscribe to my premium newsletter. It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA, Anthropic and OpenAI’s finances, and the AI bubble writ large. I recently put out the timely and important Hater’s Guide To The SaaSpocalypse, another on How AI Isn't Too Big To Fail, a deep (17,500 word) Hater’s Guide To OpenAI, and just last week put out the massive Hater’s Guide To Private Credit. Subscribing to premium is both great value and makes it possible to write these large, deeply-researched free pieces every week.

Soundtrack — Megadeth — Hangar 18 (Eb Tuning)

For the best part of four years I’ve been wrapped up in writing these massive, sprawling narratives about the AI bubble and the tech industry at large. I still intend to write them, but today I’m going to do what I do best — explaining all the odd shit that’s happening in the tech industry and why it’s concerning to me.

And because I love a good bit, I’m tying these stories to my pale horses of the AIpocalypse — signs that things are beginning to unwind in the most annoying bubble in history.

Anyway, considering that the newsletter and the podcast are now my main form of income, I’m going to be experimenting with the formats across the free and premium to keep things interesting and varied.

Let’s start with a fairly direct statement: Anthropic should stop taking on new customers until it works out its capacity issues.

So, generally any service — Netflix, for example — you use with any regularity has the “four nines” of availability, meaning that it’s up 99.99% of the time. Once a company grows beyond a certain scale, having four 9s is considered standard business practice…

… unless you’re Anthropic! As of writing this sentence, Anthropic’s Claude chatbot has had 98.79% uptime over the last 90 days, its platform/console 99.14%, its API 99.09%, and Claude Code 99.25%.

Let me put this into context. When you have 99.99% uptime, a service is only down for a minute (and 0.48 of a second) each week. If you’re hitting 98.79% uptime, as with the Claude chatbot, your downtime jumps to two hours, one minute, and 58 seconds.

Or, put another way, 98.79% uptime equates to nearly four-and-a-half days in a calendar year where the service is unavailable. (I’ve included a short script reproducing these conversions a little further down.) More astonishingly, Claude for Government sits at 99.91%. Government services are generally expected to be four 9s minimum, or five (99.999%) for more important systems underlying things like emergency services.

This is a company that recently raised $30 billion and gets talked about like somebody’s gifted child, yet Anthropic’s services seem to have constant uptime issues linked to a lack of capacity.

Per the Wall Street Journal:

Yet Anthropic’s problems go far further than simple downtime (as I discussed last week), leading to (deliberately or otherwise) severe performance issues with Opus 4.6:

While Anthropic claims that it doesn’t degrade models to better serve demand, that doesn’t really square with the many, many users complaining about the problem. Anthropic’s response has, for the most part, been to pretend like nothing is wrong, with a spokesperson waving off Carl Franzen of VentureBeat (who has a great article on the situation here) by pointing him to two different Twitter posts, neither of which actually explain what’s going on.
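As promised, here’s that downtime arithmetic as a quick script - a sketch of my own, using the 90-day uptime percentages quoted above - that converts an uptime percentage into downtime per week and per year:

```python
# Convert uptime percentages into downtime, per week and per year.
# The percentages below are the 90-day figures quoted above.
def downtime_hours(uptime_pct: float, period_hours: float) -> float:
    """Hours of downtime over a period, given an uptime percentage."""
    return period_hours * (1 - uptime_pct / 100)

WEEK_HOURS = 7 * 24
YEAR_HOURS = 365 * 24

for pct in (99.99, 99.25, 99.14, 99.09, 98.79):
    per_week_min = downtime_hours(pct, WEEK_HOURS) * 60
    per_year_days = downtime_hours(pct, YEAR_HOURS) / 24
    print(f"{pct}% uptime -> {per_week_min:6.1f} min/week, {per_year_days:4.2f} days/year")
```

Run it and 99.99% works out to roughly a minute of downtime a week, while 98.79% works out to about 122 minutes a week and 4.42 days a year, exactly the gap I’m describing.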
Things only got worse with last week’s launch of Opus 4.7, which appears to have worse performance and to burn more tokens. Per Business Insider:

I think it’s deeply bizarre that a huge company allegedly worth hundreds of billions of dollars A) can’t seem to keep its services online with any level of consistency, B) appears to be making its products worse, and C) refuses to actually address or discuss the problem. Users have been complaining about Claude models getting “dumber” going back as far as 2024, each time faced with a tepid gaslighting from a company with a CEO who loves to talk about his AI products wiping out half of white collar labor.

Some might frame this as Anthropic having “insatiable demand for its products,” but what I see is a terrible business with awful infrastructure run in an unethical way. It is blatantly, alarmingly obvious that Anthropic cannot afford to provide a stable and reliable service to its customers, and its plans to expand capacity appear to amount to signing deals with Broadcom that will come online “starting in 2027,” near-theoretical capacity with Hut8, which does not appear to have ever built an AI data center, and also with CoreWeave, a company that is yet to build the full capacity for its 2025 deals with OpenAI and only has around 850MW of “active power capacity” — so around 653MW of actual compute capacity — as of the end of 2025, up from 360MW of power at end of 2024.

Remember: data centers take forever to build, and there’s only a limited amount of global capacity, most of which is taken up by Microsoft, Google, Amazon, Meta and OpenAI, with the first three of those already providing capacity to both Anthropic and OpenAI. We’re likely hitting the absolute physical limits of available AI compute capacity, if we haven’t already done so, and even if other data centers are coming online, is the plan to just hand them over to OpenAI or Anthropic in perpetuity?

It’s also unclear what the goal of that additional capacity might be, as I discussed last week:

What’s the goal, exactly? Providing a better experience to its current customers? Securing enough capacity to keep adding customers? Securing enough capacity to support larger models like Mythos? When, exactly, does Anthropic hit equilibrium, and what does that look like?

There’s also the issue of cost. Anthropic is currently losing billions of dollars a year offering a service with amateurish availability and oscillating quality, and continues to accept new subscribers, meaning that capacity issues are not affecting its growth. As a result, adding more capacity simply makes the product work better for a much higher cost.

Anthropic’s growth story is a sham built on selling subscriptions that let users burn anywhere from $8 to $13.50 for every dollar of subscription revenue and providing a brittle, inconsistent service, made possible only through a near-infinite stream of venture capital money and infrastructure providers footing the bill for data center construction.

Put another way, Anthropic doesn’t have to play by the rules. Venture capital funding allows it to massively subsidize its services. The endless, breathless support from the media runs cover for the deterioration of its services. A lack of any true regulation of tech, let alone AI, means that it can rugpull its customers with varying rate limits whenever it feels like.

If Anthropic were forced to charge its actual costs — and no, I don’t believe its API is profitable no matter how many people misread Dario Amodei’s interview — its growth would quickly fall apart as customers faced the real costs of AI (which I’ll get to in a bit). If Anthropic were forced to provide a stable service, it would have to stop accepting new customers or massively increase its inference costs.

Anthropic is a con, and said con is only made possible through endless, specious hype. Everybody who blindly applauded everything this company did is a mark. Congratulations to all the current winners of the “Fell For It Again Award.” Per the Financial Times:

So, yeah, anyone in the media who bought the line of shit from Dario Amodei that this was “too dangerous to release” is a mark. Cal Newport has an excellent piece debunking the hype, but my general feeling is that if Mythos was so powerful, how did Claude Code’s source code leak? Did… Anthropic not bother to use its super-powerful Mythos model to check? Or did it not find anything? Either way, very embarrassing for all involved.

As I’ve discussed in the past, only 5GW of AI compute capacity is currently under construction worldwide (based on research from Sightline Climate), with “under construction” meaning everything from a scaffolding yard with a fence (as is the case with Nscale’s Loughton-based data center) to a building nearing handoff to the client. I reached out to Sightline to get some clarity, and they told me that of the 114GW of capacity due to come online by the end of 2028, only 15.2GW is under construction, including the 5GW due in 2026.

That’s…very bad. It gets worse when you realize that the majority of that construction is for two companies:

So, to summarize, at least 4.6GW of the 15.2GW of data center capacity under construction is for OpenAI, with at least another 4GW of that reserved for Anthropic through partners like Microsoft, Google and Amazon. In truth, the number could be much higher.

This is a fundamentally insane situation. OpenAI and Anthropic both burn billions of dollars a year, with The Information reporting that Anthropic expects to burn at least $11 billion and OpenAI $25 billion in 2026. The only way that these companies can continue to exist is by raising endless venture capital funding or, assuming they make it to IPO, endless debt offerings or at-the-market stock sales.

It’s also very concerning that only such a small percentage of announced compute capacity is being built, especially when you run the numbers against NVIDIA’s actual sales. Last year, Jerome Darling of TD Cowen estimated that it cost around $30 million per megawatt in critical IT (GPUs, servers, storage, and so on) and $12 million to $14 million per megawatt to build a data center, making critical IT around 68% (at the higher end of construction costs) of the total cost-per-megawatt.

Now, to be clear, those gigawatt and megawatt numbers for data centers refer to the power rather than critical IT, and if we take an average PUE (power usage effectiveness, a measurement of how efficiently a data center uses power) of 1.35, we get 11.2GW of critical IT hardware, with the majority (I’d say 90%) being GPUs, bringing us down to around 10.1GW of GPUs. If we then cut that up into GB200 or GB300 NVL72 racks with a power draw of around 140kW, that’s around 71,429 racks’ worth of hardware at an average of $4 million each, which gives us around $285.7 billion in revenue for NVIDIA.
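Here’s that chain of estimates as a quick script, so you can follow - or argue with - each step. Every input is one of the assumptions stated above, not a verified figure, and the output differs slightly from my rounded numbers:

```python
# Back-of-envelope: how much NVIDIA revenue would 15.2GW of data center
# capacity under construction actually absorb? All inputs are the
# assumptions stated in the text, not verified figures.
capacity_gw = 15.2        # GW under construction through end of 2028
pue = 1.35                # assumed power usage effectiveness
gpu_share = 0.90          # share of critical IT assumed to be GPUs
rack_kw = 140             # power draw of a GB200/GB300 NVL72 rack, in kW
rack_price_musd = 4.0     # assumed average price per rack, $M

critical_it_gw = capacity_gw / pue        # strip out the PUE overhead
gpu_gw = critical_it_gw * gpu_share       # the portion that's actually GPUs
racks = gpu_gw * 1_000_000 / rack_kw      # GW -> kW, divided into NVL72 racks
revenue_busd = racks * rack_price_musd / 1000

print(f"{critical_it_gw:.1f}GW critical IT, {gpu_gw:.1f}GW of GPUs")
print(f"~{racks:,.0f} racks -> ~${revenue_busd:.0f}B in NVIDIA revenue")
# Rounding the GPU figure down to 10GW, as I did above, gives the
# ~71,429 racks and ~$285.7 billion quoted in the text.
```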
NVIDIA claims it had a combined $500 billion in orders between 2025 and 2026, and $1 trillion of sales through 2027, and it’s unclear where any of those orders are meant to go other than a warehouse in Taiwan. At this point, I think it’s fair to ask why anyone is buying more GPUs, as there’s nowhere to fucking put them. Every beat-and-raise earnings from NVIDIA is now deeply suspicious.

Last week, a report from Goldman Sachs revealed that (and I quote) “...companies are overrunning their initial budgets for inference by orders of magnitude (we heard one industry datapoint on inference costs in engineering now approaching about 10% of headcount cost, but could be on track to be on par with headcounts costs in the next several quarters based on current trajectories.” To simplify, this means that some companies are spending as much as 10% of the cost of their employees on generative AI services, all without appearing to provide any stability, quality or efficiency gains, or (not that I want this) justification to lay people off.

The Information’s Laura Bratton also reported last week that Uber had managed to blow through its entire AI budget for the year a few months into 2026:

Uber’s CTO also added that about “...11% of real, live updates to the code in its backend systems are being written by AI agents primarily built with Claude Code, up from just a fraction of a percent three months ago.” Anyone who has ever used Uber’s app in the last year can see how well that’s going, especially if they’ve had to file any kind of support ticket.

Honestly, I find this all completely fucking insane. The whole sales pitch for generative AI is that it’s meant to be this magical, efficiency-driving panacea, yet whenever you ask somebody about it the answer is either “yeah, we’re writing all the code with it!” without any described benefits or “it costs so much fucking money, man.”

Let’s get practical about these economics, and use Spotify as an example because its CEO proudly said that its “top engineers” are barely writing code anymore, though to be clear, the Goldman Sachs example didn’t specifically name any one company. For the sake of argument, let’s say that the company has 3000 engineers — one of its sites claims it has 2700, but I’ve seen reports as high as 3500. Let’s also assume, based on the Spotify Blind (an anonymous social media site for tech workers), that these engineers make a median salary of $192,000 a year. In the event that Spotify spent 10% of its engineering headcount cost (around $576 million in total) on AI inference, it would be spending roughly $57.6 million, or approximately 4.1% of its $1.393 billion in Research and Development costs from its FY2025 annual report. Eager math-doers in the audience will note that 100% of headcount would be nearly half of the R&D budget, or around a quarter of its $2.2 billion in net income for the year. Now, to be clear, these numbers likely already include some AI inference spend, but I’m just trying to illustrate the sheer scale of the cost.

While this is great for Anthropic (and to a lesser extent OpenAI), I don’t see how it works out for any of its customers. A flat 10% bump on the cost of software engineering is the direct opposite of what AI was meant to do, and in the event that costs continue to rise, I’m not sure how anybody justifies the expense much further. And we’re going to find out fairly quickly, because the world of token subsidies is going away.
As I reported yesterday, internal documents have revealed that Microsoft plans to temporarily suspend individual account signups to its GitHub Copilot coding product, tighten rate limits across the board, remove Opus models from its $10-a-month Pro subscription, and transition from requests (single interactions with GitHub Copilot) towards token-based billing some time later this year, with Microsoft confirming some of these details (but not token-based billing) in a blog post. This is a significant move, driven by (per my own reporting) Microsoft’s week-over-week costs of running GitHub Copilot nearly doubling since January.

The move to token-based billing will see GitHub users charged based on their usage of the platform, and how many tokens their prompts consume — and thus, how much compute they use. It’s unclear at this time when this will begin, but it significantly changes the value of the product.

I’ll also say that the fact that Microsoft has stopped signing up new paid GitHub Copilot subscribers entirely is one of the most shocking moves in the history of software. I’ve literally never seen a company do this outside of products it intended to kill entirely, and that’s likely because — per my source — it intends to move paid customers over to token-based billing, though it’s unclear what these tiers would look like, as the $10-a-month and $39-a-month subscriptions are mostly differentiated by the number of requests you can use.

What’s remarkable about this story is that Microsoft is one of the few players capable of bankrolling AI in perpetuity, with over $20 billion a quarter in profits since the middle of 2023. Its decision to start cutting costs around AI suggests that said costs have become unbearable — The Information reported back in January that it was on pace to spend $500 million a year with Anthropic alone, and if that amount has doubled, it likely means that Microsoft is spending upwards of ten times its GitHub Copilot revenue, as I can report today that at the end of 2025, GitHub Copilot was at around $1.08 billion, with the majority of that revenue coming from its Copilot Business and Enterprise subscriptions.

The Information also reported a few weeks ago that GitHub had recently seen a surge of outages attributed to “spiking traffic as well as its effort to move its applications from its own servers to Microsoft’s Azure cloud”:

“Agents” in this case could refer to just about anything — OpenAI’s Codex, Anthropic’s Claude Code, or even people plugging the wasteful, questionably-useful OpenClaw into their GitHub Copilot account, and if that’s what happened, it’s very likely behind the move to token-based billing and rate limits. In any case, if Microsoft’s making this move, it means that CFO Amy Hood — the woman behind last year’s pullback on data center construction — has decided that the subsidy party is over. Though Microsoft is yet to formally announce the move to token-based billing, I imagine it’ll be sometime this week that it rips off the bandage.

Two weeks ago, Anthropic did the same with its enterprise customers, shifting them to a flat $20-a-seat fee and otherwise charging the per-token rate for whatever models they wanted to use. I’m making the call that by the end of 2026, a majority of AI services will move some or all of their customers to token-based billing as they reckon with the true costs of running AI models.

I kept things simple today both to give myself a bit of a break and because these were stories I felt needed telling.
Nevertheless, I do have to remark on how ridiculous everything has become. Everywhere you turn, somebody is talking about “agents” in a way that doesn’t remotely match with reality, like Aaron Levie’s epic screeds about how “AI agents make it so every other company on the planet starts to create software for bringing automation to their workflows in a way that would be either infeasible technically or unaffordable economically,” a statement that may as well be about fucking unicorns and manticores as far as its connection to reality goes.

I feel bad picking on Aaron, as he doesn’t seem like a bad guy. He is, however, increasingly indicative of the brainrot of executive AI hysteria, where the only way to discuss the industry is in vaguely futuristic-sounding terms about “agents” and “inference” and “tokens as a commodity,” all with the intent of obfuscating the ugly, simple truth: that generative AI is deeply unprofitable, doesn’t seem to provide tangible productivity benefits, and appears to only lose both the business and the customer money.

Though my arguments might be verbose, they’re ultimately pretty simple: AI does not provide even an iota of the benefits — economic or otherwise — to justify its ruinous costs. Every new story that runs about cost-cutting or horrible burn rates increasingly validates my position, and for the most part, boosters respond by saying “well LOOK at how BIG the REVENUES are.”

They aren’t! AI revenues are dogshit. They’re awful. They’re pathetic. The entire industry — including OpenAI and Anthropic’s theoretical revenues of $13.1 billion and $4.5 billion — hit around $65 billion last year, and that includes the revenues from providing compute generated by neoclouds like CoreWeave and hyperscalers like Microsoft.

I’m also just gonna come out and say it: I think the AI startups are misleading their investors and the general public about their revenues. My reporting from last year had OpenAI’s revenues at somewhere in the region of $4.3 billion in the first three quarters of 2025, and Anthropic CFO Krishna Rao said in an affidavit that the company had made revenue “exceeding” (sigh) $5 billion through March 9, 2026, which does not make sense when you add up all the annualized revenue figures reported about this company.

Cursor is also reportedly at $6 billion in annualized revenue (or around $500 million a month) and “gross margin positive” — which I also doubt, given that it had to raise over $3 billion last year and is apparently raising another $2 billion this year. Even if said numbers were real, the majority of OpenAI, Cursor and Anthropic’s revenues come from subsidized software subscriptions. Things have gotten so dire that even Deidre Bosa of CNBC agrees with me that AI demand is inflated by token-maxxing and subsidized services.

Otherwise, everybody else is making single or double-digit millions of dollars and losing hundreds of millions of dollars to get there. And per founder Scott Stevenson, overstating annualized revenues is extremely common, with AI startups booking “three-year-long” enterprise deals with the first year discounted and a twelve-month out:

While it’s hard to say how widespread this potential act of fraud might be, Stevenson estimates that more than 50% of enterprise AI startups are using “contracted ARR” to pump their valuations.
One (honest) founder responded to Stevenson saying that his company has $350,000 in contracted ARR but only $42,000 of ARR, adding that “next year is gonna be awesome though,” which I don’t think will be the case for what appears to be a chatbot for finding investors.

This industry’s future is predicated entirely on the existence of infinite resources, and most AI companies are effectively front-ends for models owned by Anthropic and OpenAI, two other companies that rely on infinite resources to run their services and fund their infrastructure. And at the top of the pile sits NVIDIA, the largest company on the stock market, which is selling more GPUs than can possibly be installed, and very few people seem to notice or care.

I’m talking about hundreds of billions of dollars of GPUs sitting in warehouses that aren’t being installed, with it taking six months to install a single quarter’s worth of GPU sales. The assumption, based on every financial publication I’ve read, appears to be “it will keep selling GPUs forever, and it will all be so great.” Where are you going to put them, Jensen? Where do the fucking GPUs go? There isn’t enough capacity under construction! If, in fact, NVIDIA is actually selling as many GPUs as it says, it’s likely taking liberties with “transfers of ownership,” where NVIDIA marks a product as “sold” to somebody who has yet to actually take delivery of it.

In any case, I keep coming back to the word “hysteria,” because it’s hard to find another word to describe this hype cycle. The way that the media, the markets, analysts, executives, and venture capitalists discuss AI is totally divorced from reality, discussing “agents” in terms that don’t match how the technology actually works and AI data centers in terms of “gigawatts” that are entirely fucking theoretical, all with a terrifying certainty that makes me wonder what it is I’m missing. But every sign points to me being right, and if I’m right at the scale I think I’m right, I think we’re about to have a legitimacy crisis in investing and mainstream media, because regular people are keenly aware that something isn’t right; in many cases, it’s because they’re able to count.

OpenAI’s Stargate data centers account for 4.6GW — with 1.2GW in Abilene, Texas; 1.4GW in Shackelford, Texas; 1GW in Dona Ana, New Mexico; and 1GW in Port Washington, Wisconsin. It’s safe to assume that, with big tech’s hundreds of billions of dollars of capex, its data centers will make up a large amount — as much as 6GW — with most of that likely going to Anthropic or OpenAI. An indeterminately-large chunk could be Amazon’s Project Rainier in Indiana, which will “eventually” (per CNBC) draw more than 2.2GW of electricity.

While Amazon says it’s “fully operational,” it’s fucking lying, as it also claims that it has “nearly half a million Trainium 2 chips,” with each chip being 500 watts, and 500,000 times 500 watts being around 250MW. Other reports said it would be up to 1 million Trainium 2 chips by the end of 2025, but that would still only amount to 500MW. Anthropic is apparently the primary tenant.

Anthropic also agreed to take 3.5GW of capacity of TPUs from Google Cloud, with the first 1GW coming online in 2027, and a gigawatt from Microsoft made up of “Vera Rubin and Grace Blackwell systems,” meaning that these are likely data centers that are currently under construction.
Anthropic and Google also announced in Q4 2025 that Anthropic would use 1 million TPUs as part of a new deal with Google Cloud, and that “well over” a gigawatt of capacity would come online in 2026. Microsoft is also taking the 900MW extension to Stargate Abilene, and considering that most of Microsoft’s GPU infrastructure already goes to OpenAI, I can only imagine that’s where it’s going.

I also will add that Satya Nadella claimed that Microsoft brought 2GW of capacity online in 2025, and that its Fairwater Data Center cluster was “going live ahead of schedule,” only to fail to clarify when that might happen or what said schedule was. Microsoft’s also been relatively vague about the actual capacity, but based on there being “hundreds of thousands of GB200 GPUs,” that would be (assuming 300,000 GPUs) about 583MW.


Web tools are cool

My website has a new page! The URL pathname is an “official” slash page. I’m only listing web tools I use for now. My default apps change too frequently. The list is an evolution of an old post I was secretly maintaining.

👉 Visit my /uses page!

Web tools are web-based tools, obviously. Used in a web browser without a download. Some may be installed as progressive web apps, but I prefer progressive browser tabs. Most tools do one thing and do it well. Some of my links are actually just text information, but “web tools and documentation and specifications” is a mouthful. All the tools I’m collating are related to web development.

I’d guess I use Squoosh the most, followed by Can I use or this specificity calculator (there are many; I like that one). Stu Robson’s ReliCSS is a new addition that looks very useful.

I realise now I’m not hosting any web tools myself. Many moons ago I maintained a few NPM packages that did stuff (grunted, mostly). Web tools are so much better. Less malware, for one. I once hosted a to-do app, but who hasn’t? That reminds me, sub-domains are the secret to hosting side projects. I must stop buying short-lived project domains.

Anyway, I should build a web tool. Any ideas? Do you have any favourite web tools I should be aware of? I’m not looking to list everything, just stuff I might personally use. P.S. This is definitely not another bookmarks project I’ll forget about.

Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.


Fragments: April 21

Last week Thoughtworks released the 34th volume of our Technology Radar. This radar is our twice-yearly survey of our experience of the technology scene, highlighting tools, techniques, platforms, and languages that we’ve used or that have otherwise caught our eye. This edition contains 118 blips, each briefly describing our impressions of one of these elements.

As we would expect, the radar is dominated by AI-oriented topics. Part of this is revisiting familiar ground with LLM-assisted eyes:

An interesting consequence of AI in software development is that it’s not only forcing us to look to the future; it’s also pushing us to revisit the foundations of our craft. While assembling this edition, we found ourselves returning to many established techniques, from pair programming to zero trust architecture, and from mutation testing to DORA metrics. We also revisited core principles of software craftsmanship, such as clean code, deliberate design, testability and accessibility as a first-class concern. This is not nostalgia, but a necessary counterweight to the speed at which AI tools can generate complexity.

We also observed a resurgence of the command line:

After years of abstracting it away in the name of usability, agentic tools are bringing developers back to the terminal as a primary interface.

I was especially happy to see my colleague Jim Gumbley added to the writing team; he’s been a regular source of security information for me over the years, including working on this site’s Threat Modeling Guide. Having a strong security presence on the radar team is especially important given the serious security concerns around using LLMs. One of the themes of the radar is securing “permission hungry” agents:

“Permission hungry” describes the bind at the heart of the current agent moment: the agents worth building are the ones that need access to everything. OpenClaw and Claude Cowork supervise real work tasks; Gas Town coordinates agent swarms across entire codebases. These agents require broad access to private data, external communication and real systems — each arguing that the payoff justifies it. However, like a skier who’s just learned to turn and confidently points themselves at the hardest black run, the safeguards haven’t caught up with that ambition. The appetite for access collides with unsolved problems. Prompt injection means models still can’t reliably distinguish trusted instructions from untrusted input.

Given all of this, many of this radar’s blips are about Harness Engineering; indeed, the radar meeting was a major source of ideas for Birgitta’s excellent article on the subject. The radar includes several blips suggesting the guides and sensors necessary for a well-fitting harness. I expect that when the next radar appears in six months’ time, that list will increase.

❄                ❄                ❄                ❄                ❄

Mike Mason looks at what happens when developers aren’t reading the code.

The Python codebase Claude produced was largely working. Unit tests passed, and a few hours of real-world testing showed it was successfully managing a fairly complex piece of my infrastructure. But somewhere around 100KB of total code I noticed something: the main file had grown to about 50KB (2,000 lines) and Claude Code, when it needed to make edits, had started reaching for sed to find and modify code within that file. When I saw that, it was a serious alarm bell.

As well as the experience of “a friend”, he ponders the 500,000 lines of Claude Code after the leak.

Both things are true: there is good architecture in Claude Code, and there is also an incomprehensible mess. That’s actually the point. You don’t get to know which is which without reading the code.

His conclusion is a rough framework. Throw-away analysis scripts are fine to vibe away. Tooling you need to maintain, and durable code, needs regular human review - even if it’s just a human asking a model to evaluate the code with some hints as to what good code looks like.

The moment you say “I’m getting uncomfortable with how big this is getting, can we do something better?” it does the right thing: sensible decomposition, new classes, sometimes even unit tests for the new thing. It knew, it just didn’t volunteer it.

He does recommend being serious with , I don’t know if he’s tried many of the patterns that Rahul Garg has recently posted to break the similar frustration loop that he saw.

❄                ❄                ❄                ❄                ❄

Dan Davies poses an annoying philosophy thought experiment for us to consider how we feel about LLMs indulging in ghost writing.

❄                ❄                ❄                ❄                ❄

DOGE dismantled many useful things during their brief period with the wood chipper. One of these was Direct File, a government program that supported people filing their taxes online. Don Moynihan has talked to many folks involved in Direct File, and has penned a worthwhile essay that isn’t just relevant to Direct File and other U.S. government technology projects, but indeed to any technology initiative in a large organization. Moynihan highlights:

a paradox of government reform: the simpler a potential change appears, the more likely that it has not been implemented because it features deceptive complexity that others have tried and failed to resolve.

I’ve heard that tale in many a large corporation too. One way government initiatives are different is that, at their best, they’re built on an attitude of public service. Many who worked on Direct File drew a sharp contrast with DOGE and their approach to building tech products. One point of distinction was DOGE’s seeming disinterest in public interest goals and of the public itself:

“if you do not think government has a responsibility to serve people, I think it draws into question how good are you going to be at making government work better for people if you just don’t believe in that underlying principle”

The tragedy for U.S. taxpayers like me is that we’ve lost an effective way to go through the annual hassle of taxes. In addition, the IRS is much weaker - it’s lost 25% of its staff and its budget is 40% below what it was in 2010. Much though we hate tax collectors, this isn’t a good thing. An efficient tax system is an important part of national security; many historians consider that the ability to raise taxes effectively was an important reason why Britain won its century-long struggle with France in the eighteenth century. A wonky tax system is also a major reason why the French monarchy, so powerful at the start of that century, fell to revolution. Indeed there is considerable evidence that increasing the budget of the IRS would more than pay for itself by increasing revenue.

0 views
DHH Yesterday

Celebrating computers at Omacon

Do you see the same truth? That's how C.S. Lewis defined the essence of friendship. And that's what we gathered 130 people in New York to honor for Omacon two weeks ago. Seeing the same truth: A love of computers. Bespoke computers. Malleable computers. Our computers. It's the kind of magic you can only really summon in person. We do our best online, but you instantly realize what an impoverished medium it is for creating real connections once you're all together in the same room. So that's what we did. We connected. We shared our work, our passion, and our opinions about all these new Linux vibes. It happened in an absolutely gorgeous venue, generously offered for the occasion by Tobi and his event team at Shopify. The space had an almost comedy-club intimacy, with chairs just a few inches from the podium. Thanks to the single-track format, we made the most of that warm atmosphere. I gave the keynote. I also got to meet Prime, TJ, Bjarne, Spencer, and Vaxry for the first time in person. Which is always a bit odd when you've been working together for a while over the internet. It feels so familiar, but like an unfinished agreement. And then, boom, it's signed with a handshake and a smile. Same with getting to meet and talk to a ton of other Omarchy users from all walks of life. Many were programmers, but plenty were not. Some came from other Linux distributions, but most from either Windows or Mac. Everyone shared a passion for computers, though. Not just as instruments of action, but as delightful environments for play, learning, and connection. It all added up to a massive recharge. I built Omarchy for myself, but sharing it makes it mean so much more. Seeing others enthusiastically embrace it as a starting point for their own Linux adventure is a real boost to the motivation needed to keep making it better. Because there's always more to do: more systems to cover with perfect compatibility, more corners to polish. So that's what we're going to do, together. Make this distro reach more kindred spirits. Entice those who would love a bespoke, kintsugi system, but don't know where to start. It's never going to be for everyone, but that's also why it works as a beacon for those who choose to share the quest.

0 views
Stratechery Yesterday

Tim Cook’s Impeccable Timing

Listen to this post : It’s the nature of business that the eulogy for a chief executive doesn’t happen when they die, but when they retire, or, in the case of Apple CEO Tim Cook, announce that they will step up to the role of Executive Chairman on September 1 . The one morbid exception is when a CEO dies on the job — or quits because they are dying — and the truth of the matter is that that is where any honest recounting of Cook’s incredibly successful tenure as Apple CEO, particularly from a financial perspective, has to begin. The numbers, to be clear, are extraordinary. Cook became CEO of Apple on August 24, 2011, and in the intervening 15 years revenue has increased 303%, profit 354%, and the value of Apple has gone from $297 billion to $4 trillion, a staggering 1,251% increase. The reason for Cook’s accession in 2011 became clear a mere six weeks later, when Steve Jobs passed away from cancer on October 5, 2011. Jobs’ death isn’t the reason Cook was chosen — Cook had already served as interim CEO while Jobs underwent treatment in 2009 — but I think the timing played a major role in making Cook arguably the greatest non-founder CEO of all time. Peter Thiel introduced the concept of Zero To One thusly: When we think about the future, we hope for a future of progress. That progress can take one of two forms. Horizontal or extensive progress means copying things that work — going from 1 to n. Horizontal progress is easy to imagine because we already know what it looks like. Vertical or intensive progress means doing new things — going from 0 to 1. Vertical progress is harder to imagine because it requires doing something nobody else has ever done. If you take one typewriter and build 100, you have made horizontal progress. If you have a typewriter and build a word processor, you have made vertical progress. Steve Jobs made 0 to 1 products, as he reminded the audience in the introduction to his most famous keynote : Every once in a while, a revolutionary product comes along that changes everything. First of all, one’s very fortunate if one gets to work on one of these in your career. Apple’s been very fortunate: it’s been able to introduce a few of these into the world. In 1984, we introduced the Macintosh. It didn’t just change Apple, it changed the whole computer industry. In 2001, we introduced the first iPod. It didn’t just change the way we all listen to music, it changed the entire music industry. Well, today we’re introducing three revolutionary products of this class. The first one: a widescreen iPod with touch controls. The second: a revolutionary mobile phone. And the third is a breakthrough Internet communications device. Three things…are you getting it? These are not three separate devices. This is one device, and we are calling it iPhone. Steve Jobs would, three years later, also introduce the iPad, which makes four distinct product categories if you’re counting. Perhaps the most important 0 to 1 product Jobs created, however, was Apple itself, which raises the question: what makes Apple Apple? “What Makes Apple Apple” isn’t a new question; it was the central question of Apple University, the internal training program the company launched in 2008. Apple University was hailed on the outside as a Steve Jobs creation, but while I’m sure he green lit the concept, it was clear to me as an intern on the Apple University team in 2010, that the program’s driving force was Tim Cook. 
The core of the program, at least when I was there, was what became known as The Cook Doctrine : We believe that we’re on the face of the Earth to make great products, and that’s not changing. We’re constantly focusing on innovating. We believe in the simple, not the complex. We believe that we need to own and control the primary technologies behind the products we make, and participate only in markets where we can make a significant contribution. We believe in saying no to thousands of projects so that we can really focus on the few that are truly important and meaningful to us. We believe in deep collaboration and cross-pollination of our groups, which allow us to innovate in a way that others cannot. And frankly, we don’t settle for anything less than excellence in every group in the company, and we have the self-honesty to admit when we’re wrong and the courage to change. And I think, regardless of who is in what job, those values are so embedded in this company that Apple will do extremely well. Cook explained this on Apple’s January 2009 earnings call , during Jobs’ first leave of absence, in response to a question about how Apple would fare without its founder. It’s a brilliant statement, but it is — as the last paragraph makes clear — ultimately about maintaining, nurturing, and growing what Jobs built. That is why I started this Article by highlighting the timing of Cook’s ascent to the CEO role. The challenge for CEOs following iconic founders is that the person who took the company from 0 to 1 usually sticks around for 2, 3, 4, etc.; by the time they step down the only way forward is often down. Jobs, however, by virtue of leaving the world too soon, left Apple only a few years after its most important 0 to 1 product ever, meaning it was Cook who was in charge of growing and expanding Apple’s most revolutionary device yet. Cook, to be clear, managed this brilliantly. Under his watch the iPhone not only got better every year, but expanded its market to every carrier in basically every country, and expanded the line from one model in two colors to five models in a plethora of colors sold at the scale of hundreds of millions of units a year. Cook was, without question, an operational genius. Moreover, this was clearly the case even before he scaled the iPhone to unimaginable scale. When Cook joined Apple in 1998 the company’s operations — centered on Apple’s own factories and warehouses — were a massive drag on the company; Cook methodically shut them down and shifted Apple’s manufacturing base to China, creating a just-in-time supply chain that year-after-year coordinated a worldwide network of suppliers to deliver Apple’s ever-expanding product line to customers’ doorsteps and a fleet of beautiful and brand-expanding stores. There was not, under Cook’s leadership, a single significant product issue or recall. Cook also oversaw the introduction of major new products, most notably AirPods and Apple Watch; the “Wearables, Home, and Accessories” category delivered $35.4 billion in revenue last year, which would rank 128 on the Fortune 500. Still, both products are derivative of the iPhone; Cook’s signature 0 to 1 product, the Apple Vision Pro, is more of a 0.5. Cook’s more momentous contribution to Apple’s top line was the elevation of Services. 
The Google search deal actually originated in 2002 with an agreement to make Google the default search service for Safari on the Mac, and was extended to the iPhone in 2007; Google’s motivation was to ensure that Apple never competed for their core business , and Cook was happy to take an ever increasing amount of pure profit. The App Store also predated Cook; Steve Jobs said during the App Store’s introduction that “we keep 30 [percent] to pay for running the App Store”, and called it “the best deal going to distribute applications to mobile platforms”. It’s important to note that, in 2008, this was true! The App Store really was a great deal. Three years later, in a July 28, 2011 email — less than a month before Cook officially became CEO — Phil Schiller wondered if Apple should lower its take once they were making $1 billion a year in profit from the App Store. John Gruber, writing on Daring Fireball in 2021 , wondered what might have been had Cook followed Schiller’s advice: In my imagination, a world where Apple had used Phil Schiller’s memo above as a game plan for the App Store over the last decade is a better place for everyone today: developers for sure, but also users, and, yes, Apple itself. I’ve often said that Apple’s priorities are consistent: Apple’s own needs first, users’ second, developers’ third. Apple, for obvious reasons, does not like to talk about the Apple-first part of those priorities, but Cook made explicit during his testimony during the Epic trial that when user and developer needs conflict, Apple sides with users. (Hence App Tracking Transparency, for example.) These priorities are as they should be. I’m not complaining about their order. But putting developer needs third doesn’t mean they should be neglected or overlooked. A large base of developers who are experts on developing and designing for Apple’s proprietary platforms is an incredible asset. Making those developers happy — happy enough to keep them wanting to work and focus on Apple’s platforms — is good for Apple itself. I want to agree with Gruber — I was criticizing Apple’s App Store policies within weeks of starting Stratechery , years before it became a major issue — but from a shareholder perspective, i.e. Cook’s ultimate bosses, it’s hard to argue with Apple’s uncompromising approach. Last year Apple Services generated 26% of Apple’s revenue and 41% of the company’s profit; more importantly, Services continues to grow year-over-year, even as iPhone growth has slowed from the go-go years. Another way to frame the Services question is to say that Gruber is concerned about the long-term importance of something that is somewhat ineffable — developer willingness and desire to support Apple’s platforms — which is, at least in Gruber’s mind, essential for Apple’s long-term health. Cook, in this critique, prioritized Apple’s financial results and shareholder returns over what was best for Apple in the long run. This isn’t the only part of Apple’s business where this critique has validity. Cook’s greatest triumph was, as I noted above, completely overhauling and subsequently scaling Apple’s operations, which first and foremost meant developing a heavy dependence on China. This dependence was not inevitable: Patrick McGee explained in Apple In China , which I consider one of the all-time great books about the tech industry, how Apple made China into the manufacturing behemoth it became. 
McGee added in a Stratechery Interview : Let me just refer back to something that you wrote I think a few months ago when you called the last 20, 25 years, like the golden age for companies like Apple and Silicon Valley focused on software and Chinese taking care of the hardware manufacturing. That is a perfect partnership, and if we were living in a simulation and it ended tomorrow, you’d give props to Apple for taking advantage of the situation better than anybody else. The problem is we’re probably not living in the simulation and things go on, and I’ve got this rather disquieting conclusion where, look, Apple’s still really good probably, they’re not as good as they once were under Jony Ive, but they’re still good at industrial design and product design, but they don’t do any operations in our own country. That’s all dependent on China. You’ve called this in fact the biggest violation of the Tim Cook doctrine to own and control your destiny, but the Chinese aren’t just doing the operations anymore, they also have industrial design, product design, manufacturing design. It really is ironic: Tim Cook built what is arguably Apple’s most important technology — its ability to build the world’s best personal computer products at astronomical scale — and did so in a way that leaves Apple more vulnerable than anyone to the deteriorating relationship between the United States and China. China was certainly good for the bottom line, but was it good for Apple’s long-run sustainability? This same critique — of favoring a financially optimal strategy over long-term sustainability — may also one day be levied on the biggest question Cook leaves his successor: what impact will AI have on Apple? Apple has, to date, avoided spending hundreds of billions of dollars on the AI buildout, and there is one potential future where the company profits from AI by selling the devices everyone uses to access commoditized models; there is another future where AI becomes the means by which Apple’s 50 Years of Integration is finally disrupted by companies that actually invested in the technology of the future. If Tim Cook’s timing was fortunate in terms of when in Apple’s lifecycle he took the reins, then I would call his timing in terms of when in Apple’s lifecycle he is stepping down as being prudent, both for his legacy and for Apple’s future. Apple is, in terms of its traditional business model, in a better place than it has ever been. The iPhone line is fantastic, and selling at a record pace; the Mac, meanwhile, is poised to massively expand its market share as Apple Silicon — another Jobs initiative, appropriately invested in and nurtured by Cook — makes the Mac the computer of choice for both the high end (thanks to Apple Silicon’s performance and unified memory architecture) and the low end (the iPhone chip-based MacBook Neo significantly expands Apple’s addressable market). Meanwhile, the Services business continues to grow. Cook is stepping down after Apple’s best-ever quarter, a milestone that very much captures his tenure, for better and for worse. At the same time, the AI question looms — and it suggests that Something Is Rotten in the State of Cupertino . The new Siri still hasn’t launched, and when it does, it will be with Google’s technology at the core. That was, as I wrote in an Update , a momentous decision for Apple’s future: Apple’s plans are a bit like the alcoholic who admits that they have a drinking problem, but promises to limit their intake to social occasions.
Namely, how exactly does Apple plan on replacing Gemini with its own models when (1) Google has more talent, (2) Google spends far more on infrastructure, and (3) Gemini will be continually improving from its current level, where it is far ahead of Apple’s efforts? Moreover, there is now a new factor working against Apple: if this white-labeling effort works, then the bar for “good enough” will be much higher than it is currently. Will Apple, after all of the trouble they are going through to fix Siri, actually be willing to tear out a model that works so that they can once again roll their own solution, particularly when that solution hasn’t faced the market pressure of actually working, while Gemini has? In short, I think Apple has made a good decision here for short term reasons, but I don’t think it’s a short-term decision: I strongly suspect that Apple, whether it has admitted it to itself or not, has just committed itself to depending on 3rd-parties for AI for the long run. As I noted above and in that Update, this decision may work out; if it doesn’t, however, the sting will be felt long after Cook is gone. To that end, I certainly hope that John Ternus, the new CEO, was heavily involved in the decision; truthfully, he should have made it. To that end, it’s right that Cook is stepping down now. Jobs might have been responsible for taking Apple from 0 to 1, but it was Cook that took Apple from 1 to $436 billion in revenue and $118 billion in profit last year. It’s a testament to his capabilities and execution that Apple didn’t suffer any sort of post-founder hangover; only time will tell if, along the way, Cook created the conditions for a crash out, by virtue of having himself forgotten The Cook Doctrine and what makes Apple Apple.

0 views
baby steps Yesterday

Symposium: community-oriented agentic development

I’m very excited to announce the first release of the Symposium project as well as its inclusion in the Rust Foundation’s Innovation Lab . Symposium’s goal is to let everyone in the Rust community participate in making agentic development better. The core idea is that crate authors should be able to vend skills, MCP servers, and other extensions, in addition to code. The Symposium tool then installs those extensions automatically based on your dependencies. After all, who knows how to use a crate better than the people who maintain it? If you want to read more details about how Symposium works, I refer you to the announcement post from Jack Huey on the main Symposium blog . This post is my companion post, and it is focused on something more personal – the reasons that I am working on Symposium. The short version is that I believe in extensibility everywhere . Right now, the Rust language does a decent job of being extensible: you can write Rust crates that offer new capabilities that feel built-in, thanks to proc-macros, traits, and ownership. But we’re just getting started at offering extensibility in other tools, and I want us to hurry up! I want crate authors to be able to supply custom diagnostics. I want them to be able to supply custom lints. I want them to be able to supply custom optimizations. I want them to be able to supply custom IDE refactorings. And, as soon as I started messing around with agentic development, I wanted extensibility there too. The goal of Symposium is to give crate authors, and the broader Rust community, the ability to directly influence the experience of people writing Rust code with agents. Rust is a really popular target language for agents because the type system provides strong guardrails and it generates efficient code – and I predict it’s only going to become more popular . Despite Rust’s popularity as an agentic coding target, the Rust community right now are basically bystanders when it comes to the experience of people writing Rust with agents; I want us to have a means of influencing it directly. Enter Symposium. With Symposium, crate authors can package up skills and other extensions, and then Symposium will automatically make them available for your agent. Symposium also takes care of bridging the small-but-very-real gaps between agents (e.g., each has its own hook format and its own configuration conventions). Let me give you an example. Consider the assert-struct crate, recently created by Carl Lerche. It lets you write convenient assertions that test the values of specific struct fields (a sketch follows below). This crate is neat, but of course, no models are going to know how to use it – it’s not part of their training set. They can figure it out by reading the docs, but that’s going to burn more tokens (expensive, slow, consumes carbon), so that’s not a great idea. In practice what people do today is to add skills to their project – for example, in his crate, Carl has a testing skill that also shows how to use assert-struct . But it seems silly for everybody who uses the crate to repeat that content. With Symposium, teaching your agent how to use your dependencies should not be necessary. Instead, your crates can publish their own skills or other extensions. The way this works is that the assert-struct crate defines the skill once, centrally, in its own repository 1 . Then there is a separate file in Symposium’s central recommendations repository with a pointer to the assert-struct repository.
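To make that concrete, here is roughly the kind of field-level assertion assert-struct enables. This is a minimal sketch, assuming assert-struct as a dev-dependency; the struct, its fields, and the values are hypothetical examples of mine, and the exact pattern syntax may differ from the crate's current API:

```rust
use assert_struct::assert_struct;

#[derive(Debug)]
struct Response {
    status: u32,
    message: String,
    request_id: u64, // incidental field the test doesn't care about
}

#[test]
fn returns_ok() {
    let response = Response {
        status: 200,
        message: "hello".to_string(),
        request_id: 42,
    };

    // Assert on just the fields that matter; `..` skips the rest,
    // and a failure points at the specific mismatched field.
    assert_struct!(response, Response {
        status: 200,
        message: "hello",
        ..
    });
}
```

The point isn't the macro's exact syntax; it's that this is precisely the kind of crate-specific knowledge a model won't have from training, and that a vendored skill can teach.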
Any time that the assert-struct repository updates that skill, the updates are automatically synchronized for you. Neat! (You can also embed skills directly in the rr repository, but then updating them requires a PR to that repo.) It’s easy! Check out the docs here: https://symposium.dev/crate-authors/supporting-your-crate.html What kinds of extensions are supported? Skills, hooks, and MCP servers, for now. Currently we allow skill content to be defined in a decentralized fashion, but we require that a plugin be added to our central recommendations repository . This is a temporary limitation. We eventually expect to allow crate authors to add skills and plugins in a fully decentralized fashion. We chose to limit ourselves to a centralized repository early on for three reasons: even when decentralized support exists, a centralized repository will be useful, since there will always be crates that choose not to provide that support; having a central list of plugins will make it easy to update people as we evolve Symposium; and having a centralized repository will help protect against malicious skills while we look for other mechanisms, since we can vet the crates that are added and easily scan their content. What if you don’t want to depend on that central repository? No problem, you can add a custom plugin source. Am I worried about the downsides of LLMs? I am, very much so. I feel like a lot of the uses of LLMs we see today are not great (e.g., chat bots that hijack conversational and social cues to earn trust that they don’t deserve , or that reconfirm peoples’ biases instead of challenging their ideas). And I’m worried about the environmental cost of data centers and the way companies have retreated from their climate goals . And I don’t like how centralized models concentrate economic power . 2 So yeah, I see all that. And I also see how LLMs enable people to build things that they couldn’t build before and help to make previously intractable problems soluble – and that includes more and more people who never thought of themselves as programmers 3 . My goal with Symposium and other projects is to be part of the solution, finding ways to leverage LLMs that are net positive: opening doors, not closing them. Fundamentally, the reason I am working on Symposium is that I believe everybody has something unique to offer . I see the appeal of strongly opinionated systems that reflect the brilliant vision of a particular person. But to me, the most beautiful systems are the ones that everybody gets to build together 4 . This is why I love open source. This is why I love emacs 5 . It’s why I love VSCode’s extension system, which has so many great gems 6 . To me, Symposium is a double win in terms of empowerment. First, it makes agents extensible, which is going to give crate authors more power to support their crates. But it also helps make agentic programming better, which I believe will ultimately open up programming to a lot more people . And that is what it’s all about. Actually as of this posting, the assert-struct skill is embedded directly in the recommendations repo . But I opened a PR to put it on assert-struct and I’ll port it over once it lands.  ↩︎ I’m very curious to do more with open models.  ↩︎ Within Amazon, it’s been amazing to watch how many people who never thought of themselves as software developers are starting to build software. Considering the challenges the software industry has with representation, I find this very encouraging. Diverse teams are stronger, better teams!  ↩︎ None of this is to say I don’t believe in good defaults; there’s a reason I use Zed and VSCode these days, and not emacs, much as I love it in concept.  ↩︎ OMG. One of my friends from college wrote this amazing essay some time back on emacs . Next time you’re doomscrolling on the toilet or whatever, pop over to this essay instead. Fair warning, it’s long, so it’ll take you a while to read, but I think it nails what people love about emacs.  ↩︎ These days I’m really enjoying Zed, but I have to say, I really miss kahole/edamagit !
Which of course is inspired by the magit emacs package .  ↩︎

0 views

Orbital

Six people—four astronauts and two cosmonauts—circle the Earth. They may be among the last to do so, as the space station they live in is due to be dismantled. While they circle and observe, watching sunrise after sunset, seeing typhoons and dust storms wash across the surface below, another crew of astronauts takes off for the moon, passing them by. But their gaze remains stubbornly down, not out; down into the water and land and lights, into their own memories and histories, the deaths and lives that keep them tethered as certainly as gravity prevents them from falling away. A moving love letter to our one and only planet. View this post on the web , subscribe to the newsletter , or reply via email .

0 views
HeyDingus Yesterday

Ternus will succeed Cook as Apple’s CEO

I’m happy to see the rumors were true that John Ternus will succeed Tim Cook as Apple’s CEO in September: Apple announced that Tim Cook will become executive chairman of Apple’s board of directors and John Ternus, senior vice president of Hardware Engineering, will become Apple’s next chief executive officer effective on September 1, 2026. The transition, which was approved unanimously by the Board of Directors, follows a thoughtful, long-term succession planning process. While no one can know right now how his leadership will differ from Cook’s, Ternus appears to be a worthy candidate: product-focused, likable, with a proven track record: Ternus’s work on Mac has helped the category become more powerful and more popular globally than at any time in its 40-year history. That includes the recent introduction of MacBook Neo, an all-new laptop that makes the Mac experience even more accessible to more people around the world. This past fall, his team’s efforts were on full display with the introduction of a redefined iPhone lineup, including the incredibly powerful iPhone 17 Pro and Pro Max, the radically thin and durable iPhone Air, and the iPhone 17, which has been an incredible upgrade for users. Under his leadership, his team also drove advancements in AirPods to make them the world’s best in-ear headphones, with unprecedented active noise cancellation, as well as the capability to become an all-in-one hearing health system that can serve as over-the-counter hearing aids. His personal quote in the press release is charming: “I am profoundly grateful for this opportunity to carry Apple’s mission forward,” said Ternus. “Having spent almost my entire career at Apple, I have been lucky to have worked under Steve Jobs and to have had Tim Cook as my mentor. It has been a privilege to help shape the products and experiences that have changed so much of how we interact with the world and with one another. I am filled with optimism about what we can achieve in the years to come, and I am so happy to know that the most talented people on earth are here at Apple, determined to be part of something bigger than any one of us. I am humbled to step into this role, and I promise to lead with the values and vision that have come to define this special place for half a century.” I’m on the record as being disappointed in Cook’s leadership of late, but he’s had a 15-year tenure — longer than any previous Apple CEO — with many ups and downs. His personal letter to the community ( archived ) is humanizing. I’m glad he wrote it: This is not goodbye. But at this moment of transition, I wanted to take the opportunity to say thank you. Not on behalf of the company, this time, though there is a wellspring of gratitude for you that overflows inside our walls. But simply on behalf of me. Tim. A person who grew up in a rural place in a different time and, for these magical moments, got to be the CEO of the greatest company in the world. Thank you for the confidence and kindness you’ve shown me. Thank you for saying hi to me on the street and in our stores. Thank you for cheering alongside me when we unveiled a new product or service. Thank you, most of all, for believing in me to lead the company that has always put you at the center of our work. Every day we get up and think about what we can do to make your life a little bit better. And every day, you’ve made mine the best I could have asked for.
And let’s not forget Johny Srouji, who I would presume was on the short list of candidates for CEO , but has ended up as Chief Hardware Officer — a brand-new title made just for him: Apple today announced that, effective immediately, Apple executive Johny Srouji will become chief hardware officer. Srouji, who most recently served as senior vice president of Hardware Technologies, will assume an expanded role leading Hardware Engineering, which John Ternus most recently oversaw, as well as the hardware technologies organization. I have been sitting on this title for years. https://512pixels.net/2026/04/cook-out/ gotta catch ’em all RE: https://techhub.social/@Techmeme/116438959705663471 …and now his watch has ended. Apple announces that John Ternus, senior VP of Hardware Engineering, will become Apple’s next CEO on September 1; Tim Cook will become executive chairman (Business Wire) https://www.businesswire.com/news/home/20260420318241/en/ http://www.techmeme.com/260420/p24#a260420p24 Beginning in September, Apple will have had two CEOs named John, still behind the three going by Michael or Mike. Tim Cooked. We need fewer “Cooked” puns and more “I’ll Ternus car right around” puns. Mom says it’s my Ternus the CEO https://www.apple.com/newsroom/2026/04/tim-cook-to-become-apple-executive-chairman-john-ternus-to-become-apple-ceo/ Assistant TO the regional CEO. HeyDingus is a blog by Jarrod Blundy about technology, the great outdoors, and other musings. If you like what you see — the blog posts , shortcuts , wallpapers , scripts , or anything — please consider leaving a tip , checking out my store , or just sharing my work. Your support is much appreciated! I’m always happy to hear from you on social , or by good ol' email .

0 views

April 2026 blend of links

If you read this via your RSS reader, you won’t notice the tiny design refresh of the site: nothing major, just a new header, a new icon/logo, an improved HTML structure, and an overall simplification and further reduction in weight to be, you guessed it, as light as possible . I’m very pleased with the new look, and I have a feeling that this one will stick for a while. Anyway, moving on to the links I found interesting in the last few weeks. Man Who Threw Molotov Cocktail At Sam Altman’s Home Claims He Was Following ChatGPT Recipe For Risotto – The Onion, still the best website on the internet. Architypes, by Anthony Nelzin-Santos – A lovely collection of photos of old storefronts. (via People and Blogs ) The Song of LinkedIn – “ The song confirms your existing beliefs with words that are brave and controversial because, man, they just don’t get it. They’re so dumb! Their whole business is lagging behind! Whose business? Behind whom? Who knows! The Song of LinkedIn doesn’t care. All it cares about is making you feel like insights are happening to you when really I’m just being a dick by making you feel like a dick for not being as big a dick as me. Synergy! ” I Verified My LinkedIn Identity. Here's What I Actually Handed Over. – Deleting my LinkedIn account a few years ago is still one of the best decisions I’ve made in my life, and today when I’m asked why I am not on LinkedIn, I tend to answer with “Why are you on LinkedIn?”, and the words “pile of garbage website” may come out too, if I’m being polite. (via 82MHz ) delphitools – The kind of website that absolutely deserves a spot on my “JavaScript allowed” list. Excellent. (via Rodrigo Ghedin ) Nobody Gets Promoted for Simplicity – “ The actual path to seniority isn’t learning more tools and patterns, but learning when not to use them. Anyone can add complexity. It takes experience and confidence to leave it out. ” (via Violet Pixel ) All my clients wanted a carousel, now it's an A.I. chatbot! – “ I've learned that when a client says simple, they don't mean easy to use. They mean not impressive enough. They mean what will people think. A lean, fast website doesn't look like it cost anything. It doesn't signal effort. It doesn't say: we take this seriously. ” Dept. of Enthusiasm – Not sure when the next issue of this newsletter will be released, but reading this extremely well-written entry got me sold. I’m also very envious of that prose. (via Meanwhile ) ‘He’d gaze at the stars and go: I’m gonna be up there one day’: Prince by those who knew him best, 10 years after his death – If you’re familiar with my list of favourite songs , you can have an idea of what Prince means to me. His sudden death in April 2016 still hurts, as if I had lost a part of me, or an old friend. Scottish comedian Robert Florence, then, shared a thought that I think about often: “ You'll always be the angel and the devil on my shoulders. ”

0 views
Giles's blog Yesterday

Writing an LLM from scratch, part 32l -- Interventions: updated instruction fine-tuning results

I've been working on a GPT-2-small-style LLM based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and have tried a bunch of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured in terms of loss on a held-back test dataset. After working through them, in my last post , I managed to train one that was almost (if not quite) there. Now, back before I started digging into these interventions, I was doing three evals for each model I built: a smoke test (to see if it could give a coherent completion to "Every effort moves you"), a test for that test set loss, and an instruction-following test that fine-tuned the model on the Alpaca dataset, got it to generate results for a test set of instructions, and then used an LLM as a judge to score them. The idea behind this was that the loss on the test set was an interesting technical measure of the quality of a model, but it didn't really tell us much about how useful it might be in reality. Unfortunately, in January, I realised that my methodology was bad ; because I was asking the LLM to score a model in isolation, the LLM's natural randomness would mean that results were not really comparable, at least for models that were reasonably close in quality. For example, if two models both replied to ...then one run of the instruction-following test might "find the judge LLM in a good mood" and get, say, 5% -- after all, the model tried to answer, and actually used a real person's name, even if the answer was totally wrong. But in another run, the judge might be in a "worse mood" and score it at 0%. My fix was to have two scripts:
- One that fine-tuned the model then got it to generate responses, then saved those responses in a file.
- One that took a bunch of files generated by the above, one for each of a set of different models, and presented them to the LLM together, so that it would (hopefully) be consistent in how it rated them relative to each other.
The details are here . Because doing it that way was significantly more work, I've not been doing these tests as part of the interventions mini-series. I felt it would make more sense to wait until I'd tried a bunch of interventions and got a number of models to try. Now I have those, so let's give it a go! At the end of the previous round of IFT tests, I had this table. It's sorted by the loss on the test set (shown to 3 decimal places), and has the score that the model got from an instruction fine-tuning run: There's a loose correlation where lower loss means a higher IFT score, with two weird exceptions: the two FineWeb-Edu training runs, where they got much higher results than you'd expect from the loss. My working hypothesis was that there were two components that led to a model getting a good score:
- Its raw intelligence: lower-loss models were smarter, so they were better at instruction-following after the fine-tune.
- Its knowledge. All of the models -- mine and OpenAI's -- apart from the FineWeb-Edu ones were trained on what amounted to minimally-curated data from the Internet. But FineWeb-Edu is meant to be "the most educational" subset of FineWeb, so it presumably is more dense in useful facts.
So in those terms, the OpenAI models and Cloud FineWeb, 8x A100 40 GiB might be smart but not know very much, and the FineWeb-Edu ones might be dumb but knowledgeable. The ones in between, by contrast, could be relatively dumb too, but also not know very much. There was one other oddity: the Cloud FineWeb, 8x A100 40 GiB model seemed surprisingly good on the IFT results when considering its loss -- but perhaps there was some kind of step function, where as soon as a model got better than (say) 3.7 on the loss, it suddenly became smart in whatever way mattered. All very hand-wavy, of course, but it was a hypothesis of sorts. Would the new models fit that pattern? It was time to find out. I didn't think it was worth adding all 14 models that I've trained in my intervention-testing to that table, so I decided to just add four of them:
- , the baseline cloud-trained model for all of the interventions.
- , the locally-trained version of the same -- the first model from this post.
- , the best model we managed to get in the cloud.
- , the best local model -- the second from this post.
Now, I already had files containing responses from fine-tuned versions of the other models, so I just needed to run the first of my two fine-tuning scripts against all four of the new models.
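(For a sense of what that second, comparative script might look like, here is a hypothetical reconstruction, not Giles's actual code. The file layout, prompt wording, and helper names are all assumptions; only the idea of scoring every model's responses side by side comes from the post.)

```python
# Hypothetical sketch of the comparative judging script: present every
# model's saved response to the same instruction in one prompt, so the
# judge rates the models relative to each other rather than in isolation.
import json
from pathlib import Path

def load_responses(results_dir: Path) -> dict[str, dict[str, str]]:
    # One JSON file per model, mapping instruction -> generated response.
    return {
        path.stem: json.loads(path.read_text())
        for path in results_dir.glob("*.json")
    }

def judge_prompt(instruction: str, answers: dict[str, str]) -> str:
    parts = [
        "Score each candidate's response to this instruction from 0-100.",
        f"Instruction: {instruction}",
    ]
    for model, answer in answers.items():
        parts.append(f"--- {model} ---\n{answer}")
    return "\n\n".join(parts)

responses = load_responses(Path("ift-results"))  # assumed directory name
if responses:
    models = sorted(responses)
    for instruction in responses[models[0]]:
        prompt = judge_prompt(
            instruction, {m: responses[m][instruction] for m in models}
        )
        # ...send `prompt` to the judge LLM, parse the per-model scores,
        # and aggregate them across the whole instruction test set...
```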
I did that, and then also tweaked the judge script so that instead of using GPT-5.1, it used GPT-5.4. If you run the script multiple times, each time will normally give you different scores anyway; hopefully the ranking will remain roughly the same. So given that I was going to have to re-run the script to get new aggregate results, and those would not really be comparable to the original ones anyway, this seemed like a reasonable price to pay for (hopefully) a smarter judge. I ran that once, and got some results that surprised me -- so much that I decided to do three runs and see if the results stood up. They did; here's the new table, with scores for each run, the average, and the rank that each one got based on the average. You can see that relative rankings are fairly consistent across the IFT runs. But while in general the lower-loss runs get better IFT results, now there are even more exceptions to that trend than there were before. Let's look down the "IFT rank" column, which is based on the IFT average:
- The first surprise is . It has the fourth-best loss, but it's the worst model out of all of them on the instruction fine-tuning test! It was trained on exactly the same data as all of the others apart from the OpenAI ones and the FineWeb-Edu ones. Even more perplexingly, it was as close a match to as I could make it, but got completely different results. You might remember from the post that those two runs started with the same weights and had exactly the same training config; the only difference was that they were trained on different architectures, and one used DDP with a real global batch size of 96, while the other used gradient accumulation to get the same batch size.
- also does much worse than you'd expect from its loss numbers; it's only a tiny bit worse than Cloud FineWeb, 8x A100 40 GiB in loss terms, but much worse on the IFT test. Again, this one is essentially a clone of another: , which was the same training run but using DDP rather than gradient accumulation. The same problem -- one of a pair of closely-matched models has worse results on the IFT test. But in this case, it's the gradient accumulation model that turned out bad.
That's a really odd situation. If the training runs using gradient accumulation rather than DDP had been consistently worse -- or vice versa -- then we could imagine some kind of connection. But in the first case, GA beat DDP, but in the second, it was the other way around. Apart from that, we do still see that the two FineWeb-Edu models are doing much better than the others. And the remaining models are all pretty close together, both in terms of loss and in terms of their ranking, apart from the Local FineWeb train, which is bad in both. It is, however, interesting that Local FineWeb-Edu extended train, which was trained on twice as much data as Local FineWeb-Edu train, is consistently worse in terms of the IFT numbers. That wasn't the case in my tests previously. All of this puzzled me. The "lots of knowledge makes a model better at this" idea seemed to be weakened by the relative ranks of the two FineWeb-Edu models (after all, if it was true, you'd expect the model trained on more data to be consistently better). And the "smart, low-loss models are better" side seemed to be contradicted by and 's bad results. What might be going on here? Looking at the training code, one thing stood out to me. The process was:
- Fine-tune the model for a maximum of 100 epochs over the training set.
- If loss on a held-back validation set went above the result for the previous epoch, we did an early exit and used the previous epoch's model for the generation of the responses.
In practice, the early-exit code always cut in pretty quickly. I'd noticed that during my original generation of the results for the new models: took 6 epochs until validation loss started rising. I decided to regenerate responses for all of the models, and then run the new responses past the LLM judge again. But this time I would keep a record of how many epochs of training we got before the exit: It was getting even harder to see any useful pattern! One thing that did stand out, though, was that the still oddly-high Cloud FineWeb, 8x A100 40 GiB model was being instruction-trained for seven epochs. It was also rather noticeable that the two FineWeb-Edu models had the same "advantage", if that's what it was. But the Local FineWeb train had seven epochs too and got a poor score; the OpenAI models only got two each, and led the pack; and got a pretty poor result given its six epochs of training. Still, what would happen if we got rid of that confounder? I did yet another set of runs; this time, I changed the fine-tuning/generation script to always do four epochs -- no early exit. I chose four because it was the modal number in the previous trains -- no strong reason for it beyond that. Here's what came out at the end: Still no obvious pattern.
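(As a reference point, the early-exit logic described above is standard early stopping on validation loss. A minimal sketch, assuming PyTorch-style state_dict handling; the two callables are hypothetical stand-ins for the post's actual training helpers.)

```python
# Early-stopping sketch: fine-tune for up to 100 epochs, and as soon as
# validation loss rises, roll back to the previous epoch's weights.
import copy

MAX_EPOCHS = 100

def fine_tune_with_early_exit(model, train_one_epoch, validation_loss):
    prev_loss = float("inf")
    prev_state = copy.deepcopy(model.state_dict())
    epochs_completed = 0
    for _ in range(MAX_EPOCHS):
        train_one_epoch(model)                 # one pass over the IFT training set
        val_loss = validation_loss(model)      # loss on the held-back validation set
        if val_loss > prev_loss:
            model.load_state_dict(prev_state)  # use the previous epoch's model
            break
        prev_loss = val_loss
        prev_state = copy.deepcopy(model.state_dict())
        epochs_completed += 1
    return epochs_completed  # worth recording per model, as the post goes on to do
```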
What if we try seven epochs of training for all of them, so that they all get as much "benefit" (if that's what it is) as the FineWeb-Edu models? Just as confused as ever... Here's a table with all of the ranks we got from these tests: It's hard to draw much sense out of this, but a few things are clear:
- Performance on this test is correlated with loss, but it's far from the only factor.
- The OpenAI weights consistently lead the pack.
- Of our own models, , Cloud FineWeb, 8x A100 40 GiB, and Local FineWeb-Edu train do pretty well.
- Strangely, Local FineWeb-Edu extended train, which is just Local FineWeb-Edu train that has been trained on a further 3B tokens of the FineWeb-Edu dataset, is consistently worse than the model it was based on.
- and are consistently bad.
- Cloud FineWeb, 8x A100 80 GiB is also not great.
On the one hand, training different models for different numbers of epochs feels wrong for an evaluation like this, as they're being "treated differently". On the other hand, if it's meant to be a good evaluation of model usefulness in the real world, then individual models would be fine-tuned for different amounts of time, depending on validation loss. So perhaps it is better? But the differing results are still quite a puzzle. I figured that a modern AI could easily build me a data exploration interface, specifically for the original results and seven-epoch ones, so I asked Claude and got this rather nice one . After poring over that, though, I couldn't find a smoking gun -- for example, some kind of systematic error that was always pulling its score down. I think that the best -- albeit hand-wavy and incomplete -- mental model that I have right now is something like this. If we consider the loss landscape that these models are all in, they've all been trained to try to get to a place with as low loss as we could manage. When we do the instruction fine-tune on them, we're changing the landscape -- the objective of "be better at following instructions" is different to "be better at minimising loss". Now, those two landscapes could be completely different! You can imagine a task that we might set instead of instruction-following that could be completely uncorrelated with loss minimisation, or even inversely correlated. But instruction-following is relatively close; it at least shares features like "generate coherent text". So when we do the instruction fine-tuning, what we're trying to do is to move from the place where the model ended up after its pre-training, to a place where performance on the new goal -- instruction-following -- is best. Here's where I'm going to get more than a bit hand-wavy. You can easily imagine that at some places where the loss was low, there might be downhill slopes pointing towards good locations in the new instruction-following landscape. With instruction fine-tuning, you'd be able to get a good IFT model. But other places with low loss might not have that advantage; maybe they're at or near a poor "local minimum" in the IFT landscape -- that is, a place where there is no downhill route to a better place. So simple fine-tuning like this might never get a good result! With this mindset, we might say that the OpenAI weights are pretty well-positioned, not just in the loss landscape but also in the IFT landscape. The FineWeb-Edu models happened to get lucky, and wind up in a place that (despite having poor loss) is well-positioned for the IFT objective. And by contrast, and were just unlucky: they got to a place where the loss landscape was not well-correlated with the IFT landscape. This seems plausible enough for me to use it as my working model for now, and see if I can work out some way to test it. Keeping track of the validation loss during the instruction fine-tuning process would certainly be a good start; unfortunately I only realised that after doing all of the tests above, and re-doing them would be quite a lot of work. One final thing is worth repeating. Our two "unlucky" models, and , each had a twin.
The former was the DDP-trained counterpart of the gradient-accumulated , while the latter was the gradient-accumulated counterpart of . So while something odd clearly happened, it doesn't look like DDP or gradient accumulation by themselves are the culprit. I think that at this point, it's best for me to draw a line under this -- I have a bunch of other things I'd like to get to, and this is a bit of a side quest. Still, I have one main takeaway from this: chasing lower loss is technically interesting but is not the only goal. In some cases, it seems likely that lower-loss models can be worse for actual use. Coming up next: I'm going to wrap up this "interventions" mini-series, and move on to the final steps in my LLM from scratch journey. See you then!

0 views

My Everyday Carry

Thought it would be fun to do a simple everyday carry post! Here's what I typically have in my backpack. This is a very recent addition and replaced the iPod Classic I was carrying. I'll be honest, I really like this little guy. It's an MP3/FLAC/AAC player themed like a tape player. It's made of metal, super lightweight and has a good battery life. While I love the iPod, I haven't had time to mod it beyond Rockbox, so it's on the original HDD and battery. The Echo gives me an SD card slot, USB C charging, Bluetooth and a fresh battery. The UI is more clunky than an iPod's, but it's good enough. I added these to the cart when buying the Echo, not expecting much with the low price, but they have blown me away. Seriously, these sound way better than they should. They do feel super cheap and flimsy, reminiscent of the headphones we had in the computer lab during grade school, but that does mean they are very lightweight. My current primary machine, recently replacing my MacBook Pro M1 as I've gone all in on Linux. I absolutely wouldn't recommend this laptop due to the horrible webcam/microphone/speakers, nearly unusable trackpad (hence the next item), USB C port that barely works and overall poor build quality. I did make it slightly more bearable by upgrading the screen . Specs below for those interested! Old mouse that I've had for ages but I love it. Feels great, battery lasts a while, perfect substitute for my awful trackpad! Got this wallet on Amazon, honestly don't know the brand (probably one of those popup brands that disappears in a week). It's decent enough quality, has a money clip and a feature where you push a lever and your cards fan out. The best feature though is the AirTag holder, no more searching the house for my wallet! I've had this guy for a while, but it's just recently dethroned my heavily modded GBA SP thanks to Allium OS . Allium provides a simple UI that gets out of the way. The killer feature for me is the "Guides" functionality. While playing a game, you can quickly pop up a walkthrough on screen. It remembers where you've scrolled between play sessions, making it perfect for RPGs. The battery and AUX jack are nice enhancements over my GBA SP as well (who knew adding a bright display and underglow to the SP would kill battery life??). I bought the Nomad on preorder before it released, and have loved it since. I use it to sketch designs, take notes and read books. I love the design (I have the crystal clear one) and the fact that it's repairable + offline + subscription free. I love my Palm Pilots, and the C is the one I most use as an everyday carry. The keyboard is absolutely a killer feature. I use this guy to track calories , store reminders, manage my calendar and write a journal. The OS is fast and just works . Man I wish they made modern Palm Pilots! It's overkill and honestly I probably didn't need it, but I love the bigger design of the Ultra. I use it for swimming, running and cycling. I tried switching to Android for a while with a Pixel 9 Pro and I just hated it. To me, Android feels like KDE: super powerful, full of features, customizable and ugly. I wish I could love it, but I care a lot about good/consistent design and GNOME+iOS are clear winners in this category (in my opinion). That's it for me, would love to know what everyone else deems worthy of carrying on their back!

0 views

Exclusive: Microsoft To Shift GitHub Copilot Users To Token-Based Billing, Tighten Rate Limits

Note: Microsoft has now confirmed some of these details in a blog post . Leaked internal documents viewed by Where’s Your Ed At reveal that Microsoft intends to pause new signups for the student and paid individual tiers of AI coding product GitHub Copilot, impose tighter rate limits, and eventually move users to “token-based billing,” charging them based on what their token burn actually costs. The document says that although token-based billing has been a top priority for Microsoft, it became more urgent in recent months, with the week-over-week cost of running GitHub Copilot nearly doubling since January.  The move to token-based billing will see GitHub users charged based on their usage of the platform, and how many tokens their prompts consume — and thus, how much compute they use. It’s unclear at this time when this will begin. This is a significant move, reflecting the significant cost of running models on any AI product. Much like Anthropic, OpenAI, Cursor, and every other AI company , Microsoft has been subsidizing the cost of compute, allowing users to burn way, way more in tokens than their subscriptions cost.  The party appears to be ending for subsidized AI products, with Microsoft’s upcoming move following Anthropic’s ( per The Information ) recent changes shifting enterprise users to token-based billing as a means of reducing its costs. GitHub Copilot currently has two tiers for individual developers — a $10-per-month package called GitHub Copilot Pro, and a $39-a-month subscription called GitHub Copilot Pro+.  According to the leaked documents, both of these tiers will be impacted by the shutdown, as will the GitHub Copilot Student product, which is included within the free GitHub Education package. According to the documents, Microsoft also intends to tighten rate limits on some Copilot Business and Enterprise plans, as well as on individual plans, where limits have already been squeezed, and plans to suspend trials of paid individual plans as it attempts to “fight abuse.” Although Microsoft has regularly tweaked the rate limits for individual GitHub Copilot accounts, most recently at the start of April, the document notes that these changes weren’t enough, and that more rate limit changes are to come in the next few weeks. As part of this cost-cutting exercise, Microsoft intends to remove Anthropic’s Opus family of AI models from the $10-per-month GitHub Copilot Pro package altogether.  Microsoft most recently retired Opus 4.6 Fast at the start of April for GitHub Copilot Pro+ users , although this decision was framed as a way to “further improve service reliability” and “[streamline] our model offerings and focusing resources on the models our users use the most.” Other Opus models — namely Opus 4.6 and Opus 4.5 — will be removed from the GitHub Copilot Pro+ tier in the coming weeks, as Microsoft transitions to Anthropic’s latest Opus 4.7 model .  The move towards Opus 4.7 will likely see GitHub Copilot Pro+ users reach their usage limits faster.  Microsoft is offering a 7.5x request multiplier until April 30 — although it’s unclear what the multiplier will be after this date. This might sound like a good thing, but it means that each request using Opus 4.7 actually counts as 7.5 premium requests. Redditors immediately worked that out and are a little bit worried . Premium request multipliers allow GitHub to reflect the cost of compute for different models.
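(To make the multiplier arithmetic concrete, here's a minimal sketch. The multipliers are the ones quoted in this piece; the monthly allowance figure is a made-up placeholder, since real allowances vary by plan.)

```python
# Premium request multipliers: each prompt to a model burns
# (1 request x that model's multiplier) from a monthly allowance.
# Multipliers below are the ones this article quotes; the allowance
# is a hypothetical placeholder, not a real plan limit.
MULTIPLIERS = {
    "GPT-5.4 Mini": 0.33,          # each prompt counts as a third of a request
    "Claude Opus 4.6": 3.0,
    "Claude Opus 4.6 Fast": 30.0,  # now retired
    "Claude Opus 4.7 (promo)": 7.5,
}

def prompts_available(allowance: float, model: str) -> float:
    # A 7.5x multiplier means each prompt consumes 7.5 premium
    # requests, so the allowance stretches 7.5x less far.
    return allowance / MULTIPLIERS[model]

allowance = 300  # hypothetical monthly premium-request allowance
for model in MULTIPLIERS:
    print(f"{model}: ~{prompts_available(allowance, model):.0f} prompts")
# Opus 4.7 at 7.5x vs. Opus 4.6 at 3x: each prompt costs 2.5x as much.
```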
LLMs that require the most compute will have higher premium request multipliers compared to those that are comparatively more lightweight.  For example, the GPT-5.4 Mini model has a premium request multiplier of 0.33 — meaning that every prompt is treated as one-third of a premium request — whereas the now-retired Claude Opus 4.6 Fast had a 30x multiplier, meaning each request was treated as thirty of them. The standard version of Claude Opus 4.6 has a premium request multiplier of three — meaning that, even with the promotional 7.5x pricing, each Claude Opus 4.7 request is 2.5 times as expensive to use.  The announcements for all of these changes are scheduled to take place throughout the week.  If you liked this news hit and want to support my independent reporting and analysis, why not subscribe to my premium newsletter? It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA , Anthropic and OpenAI’s finances , and the AI bubble writ large . I recently put out the timely and important Hater’s Guide To The SaaSpocalypse , another on How AI Isn't Too Big To Fail , a deep (17,500 word) Hater’s Guide To OpenAI , and just last week put out the massive Hater’s Guide To Private Credit . Subscribing to premium is both great value and makes it possible to write these large, deeply-researched free pieces every week.  Internal documents reveal that Microsoft plans to temporarily suspend individual account signups to its GitHub Copilot coding product, as it transitions from requests (single interactions with Copilot) towards token-based billing.  The documents reveal that the weekly cost of running GitHub Copilot has doubled since the start of the year.  Microsoft also intends to tighten the rate limits on its individual and business accounts, and to remove access to certain models for those with the cheapest subscriptions.

0 views
Kev Quirk Yesterday

My Best Sub £100 Purchase

I was recently listening to an episode of The Idea Roastery about personal life gamechangers and toward the end of the episode, Herman asked Jason: What is the best purchase you've ever made for less than £100? For Jason it was an egg poacher, and for Herman it was a coffee grinder. This discussion got me thinking about what mine was, and I really wasn't sure at first. But after some thought, it hit me. It's my dog, Tia! She's getting old now, at nearly 14 years of age. But my wife and I got her when she was 9 weeks old, after she was taken from the litter at just 6 weeks old by some scumbag who ended up dumping her. She cost us £80, and for that £80 we've had years of love, affection, and friendship from her. She's definitely my game-changer. She's pretty cool too... I absolutely love everything about this dog. She's my best friend in the world. She's kind. She's gentle. She's the best at spooning too. Seriously, the best . As I look back at a life well lived and she heads into her twilight years, we know we don't have long left with her, but my goodness the years we have had have been incredible. So yeah, Tia is by far the best sub £100 purchase I've ever made, and probably will ever make. Love you, T-bone. x Thanks for reading this post via RSS. RSS is ace, and so are you. ❤️ You can reply to this post by email , or leave a comment .

0 views