GreatReads - Blog Aggregator · Phoenix Framework

Programming (with AI agents) as theory building

Back in 1985, computer scientist Peter Naur wrote “Programming as Theory Building” . According to Naur - and I agree with him - the core output of software engineers is not the program itself, but the theory of how the program works . In other words, the knowledge inside the engineer’s mind is the primary artifact of engineering work, and the actual software is merely a by-product of that. This sounds weird, but it’s surprisingly intuitive. Every working programmer knows that you cannot make a change to a program simply by having the code. You first need to read through the code carefully enough to build up a mental model (what Naur calls a “theory”) of what it’s supposed to do and how it does it. Then you make the desired change to your mental model, and only after that can you begin modifying the code. Many people 1 think that this is why LLMs are not good tools for software engineering: because using them means that engineers can skip building Naur theories of the system, and because LLMs themselves are incapable of developing a Naur theory themselves. Let’s take those one at a time. Do AI agents let some engineers avoid building detailed mental models of the systems they work on? Of course! As an extreme example, someone could simply punt every task to the latest GPT or Claude model and build no mental model at all 2 . But even a conscientious developer who uses AI tools will necessarily build a less detailed mental model than someone who does it entirely by hand. This is well-attested by the nascent literature on how AI use impacts learning. And it also just makes obvious sense. The whole point of using AI tools is to offload some of the cognitive effort: to be able to just sketch out some of the fine detail in your mental model, because you’re confident that the AI tool can handle it. For instance, you might have a good grasp on what the broad components do in your service, and how the data flows between them, but not the specific detail of how some sub-component is implemented (because you only reviewed that code, instead of writing it). Isn’t this really bad? If you start dropping the implementation details, aren’t you admitting that you don’t really know how your system works? After all, a theory that isn’t detailed enough to tell you what code would need to be written for a particular change is a useless theory, right? I don’t think so. First, it’s simply a fact that every mental model glosses over some fine details . Before LLMs were a thing, it was common to talk about the “breadth of your stack”: roughly, the level of abstraction that your technical mental model could operate at. You might understand every line of code in the system, but what about dependencies? What about the world of Linux abstractions - processes, threads, sockets, syscalls, ports, and buffers? What about the assembly operations that are ultimately performed by your code? It simply can’t be true that giving up any amount of fine detail is a disaster. Second, coding with LLMs teaches you first-hand how important your mental model is . I do a lot of LLM-assisted work, and in general it looks like this: Note that only 10% of agent output is actually making its way into my output . Almost my entire time is spent looking at some piece of agent-generated code or text and trying to figure out whether it fits into my theory of the system. That theory is necessarily a bit less detailed than when I was writing every line of code by hand. But it’s still my theory! If it weren’t, I’d be accepting most of what the agent produced instead of rejecting almost all of it. Can AI agents build their own theories of the system? If not, this would be a pretty good reason not to use them, or to think that any supposed good outcomes are illusory. The first reason to think they can is that LLMs clearly do make working changes to codebases. If you think that a theory is essential to make working changes (which is at least plausible), doesn’t that prove that LLMs can build Naur theories? Well, maybe. They could be pattern-matching to Naur theories in the training data that are close enough to sort of work, or they could be able to build local theories which are good enough (as long as you don’t layer too many of them on top of each other). The second reason to think they can is that you can see them doing it . If you read an agent’s logs, they’re full of explicit theory-building 3 : making hypotheses about how the system works, trying to confirm or disprove them, adjusting the hypothesis, and repeating. When I’m trying to debug something, I’m usually racing against one or more AI agents, and sometimes they win . I refuse to believe that you can debug a million-line codebase without theory-building. I think it’s an open question if AI agents can build working theories of any codebase. In my experience, they do a good job with normal-ish applications like CRUD servers, proxies, and other kinds of program that are well-represented in the training data. If you’re doing something truly weird, I can believe they might struggle (though even then it seems at least possible ). Regardless, one big problem with AI agents is that they can’t retain theories of the codebase . They have to build their theory from scratch every time. Of course, documentation can help a little with this, but in Naur’s words, it’s “strictly impossible” to fully capture a theory in documentation. In fact, Naur thought that if all the humans who built a piece of software left, it was unwise to try and construct a theory of the software even from the code itself , and that you should simply rewrite the program from scratch. I think this is overstating it a bit, at least for large programs, but I agree that it’s a difficult task. AI agents are permanently in this unfortunate position: forced to construct a theory of the software from scratch, every single time they’re spun up. Given that, it’s kind of a minor miracle that AI agents are as effective as they are. The next big innovation in AI coding agents will probably be some way of allowing agents to build more long-term theories of the codebase: either by allowing them to modify their own weights 4 , or simply supporting contexts long enough so that you can make weeks worth of changes in the same agent run, or some other idea I haven’t thought of. This is the most recent (and well-written) example I’ve seen, but it’s a common view. I have heard of people working like this. Ironically, I think it’s a good thing. The kind of engineer who does this is likely to be improved by becoming a thin wrapper around a frontier LLM (though it’s not great for their career prospects). I think some people would say here that AI agents simply can’t build any theories at all, because theories are a human-mind thing. These are the people who say that AIs can’t believe anything, or think, or have personalities, and so on. I have some sympathy for this as a metaphysical position, but it just seems obviously wrong as a practical view. If I can see GPT-5.4 testing hypotheses and correctly answering questions about the system, I don’t really care if it’s coming from a “real” theory or some synthetic equivalent. This is the dream of continuous learning : if what the AI agent learns about the codebase can be somehow encoded in its weights, it can take days or weeks to build its theory instead of mere minutes. I spin off two or three parallel agents to try and answer some question or implement some code As each agent finishes (or I glance over at what it’s doing), I scan its work and make a snap judgement about whether it’s accurately reflecting my mental model of the overall system When it doesn’t - which is about 80% of the time - I either kill the process or I write a quick “no, you didn’t account for X” message I carefully review the 20% of plausible responses against my mental model, do my own poking around the codebase and manual testing/tweaking, and about half of that code will become a PR This is the most recent (and well-written) example I’ve seen, but it’s a common view. ↩ I have heard of people working like this. Ironically, I think it’s a good thing. The kind of engineer who does this is likely to be improved by becoming a thin wrapper around a frontier LLM (though it’s not great for their career prospects). ↩ I think some people would say here that AI agents simply can’t build any theories at all, because theories are a human-mind thing. These are the people who say that AIs can’t believe anything, or think, or have personalities, and so on. I have some sympathy for this as a metaphysical position, but it just seems obviously wrong as a practical view. If I can see GPT-5.4 testing hypotheses and correctly answering questions about the system, I don’t really care if it’s coming from a “real” theory or some synthetic equivalent. ↩ This is the dream of continuous learning : if what the AI agent learns about the codebase can be somehow encoded in its weights, it can take days or weeks to build its theory instead of mere minutes. ↩

Programming

0 views

Sean Goedecke 1 months ago

Big tech engineers need big egos

It’s a common position among software engineers that big egos have no place in tech 1 . This is understandable - we’ve all worked with some insufferably overconfident engineers who needed their egos checked - but I don’t think it’s correct. In fact, I don’t know if it’s possible to survive as a software engineer in a large tech company without some kind of big ego. However, it’s more complicated than “big egos make good engineers”. The most effective engineers I’ve worked with are simultaneously high-ego in some situations and surprisingly low-ego in others. What’s going on there? Software engineering is shockingly humbling, even for experienced engineers. There’s a reason this joke is so popular: The minute-to-minute experience of working as a software engineer is dominated by not knowing things and getting things wrong . Every time you sit down and write a piece of code, it will have several things wrong with it: some silly things, like missing semicolons, and often some major things, like bugs in the core logic. We spend most of our time fixing our own stupid mistakes. On top of that, even when we’ve been working on a system for years, we still don’t know that much about it. I wrote about this at length in Nobody knows how large software products work , but the reason is that big codebases are just that complicated. You simply can’t confidently answer questions about them without going and doing some research, even if you’re the one who wrote the code. When you have to build something new or fix a tricky problem, it can often feel straight-up impossible to begin, because good software engineers know just how ignorant they are and just how complex the system is. You just have to throw yourself into the blank sea of millions of lines of code and start wildly casting around to try and get your bearings. Software engineers need the kind of ego that can stand up to this environment. In particular, they need to have a firm belief that they can figure it out, no matter how opaque the problem seems; that if they just keep trying, they can break through to the pleasant (though always temporary) state of affairs where they understand the system and can see at a glance how bugs can be fixed and new features added 2 . What about the non-technical aspects of the job? Nobody likes working with a big ego, right? Wrong. Every great software engineer I’ve worked with in big tech companies has had a big ego - though as I’ll say below, in some ways these engineers were surprisingly low-ego. You need a big ego to take positions . Engineers love being non-committal about technical questions, because they’re so hard to answer and there’s often a plausible case for either side. However, as I keep saying , engineers have a duty to take clear positions on unclear technical topics, because the alternative is a non-technical decision maker (who knows even less) just taking their best guess. It’s scary to make an educated guess! You know exactly all the reasons you might be wrong. But you have to do it anyway, and ego helps a lot with that. You need a big ego to be willing to make enemies . Getting things done in a large organization means making some people angry. Of course, if you’re making lots of people angry, you’re probably screwing up: being too confrontational or making obviously bad decisions. But if you’re making a large change and one or two people are angry, that’s just life. In big tech companies, any big technical decision will affect a few hundred engineers, and one of them is bound to be unhappy about it. You can’t be so conflict-averse that you let that stop you from doing it, if you believe it’s the right decision. In other words, you have to have the confidence to believe that you’re right and they’re wrong, even though technical decisions always involve unclear tradeoffs and it’s impossible to get absolute certainty. You need a big ego to correct incorrect or unclear claims. When I was still in the philosophy world, the Australian logician Graham Priest had a reputation for putting his hand up and stopping presentations when he didn’t understand something that was said, and only allowing the seminar to continue when he felt like he understood. From his perspective, this wasn’t rude: after all, if he couldn’t understand it, the rest of the audience probably couldn’t either, and so he was doing them a favor by forcing a more clear explanation from the speaker. This is obviously a sign of a big ego. It’s also a trait that you need in a large tech company. People often nod and smile their way past incorrect technical claims, even when they suspect they might be wrong - assuming that they’ve just misunderstood and that somebody else will correct it, if it’s truly wrong. If you are the most senior engineer in the room, correcting these claims is your job. If everyone in the room is so pro-social and low-ego that they go along to get along, decisions will get made based on flatly incorrect technical assumptions, projects will get funded that are impossible to complete, and engineers will burn weeks or months of their careers vainly trying to make these projects work. You have to have a big enough ego to think “actually, I think I’m right and everyone in this room is confused”, even when the room is full of directors and VPs. All of this selects for some pretty high-ego engineers. But in order to actually succeed in these roles in large tech companies, you need to have a surprisingly low ego at times. I think this is why really effective big tech engineers are so rare: because it requires such a delicate balance between confidence and diffidence. To be an effective engineer, you need to have a towering confidence in your own ability to solve problems and make decisions, even when people disagree. But you also need to be willing to instantly subordinate your ego to the organization, when it asks you to. At the end of the day, your job - the reason the company pays you - is to execute on your boss’s and your boss’s boss’s plans, whether you agree with them or not. Competent software engineers are allowed quite a lot of leeway about how to implement those plans. However, they’re allowed almost no leeway at all about the plans themselves. In my experience, being confused about this is a common cause of burnout 3 . Many software engineers are used to making bold decisions on technical topics and being rewarded for it. Those software engineers then make a bold decision that disagrees with the VP of their organization, get immediately and brutally punished for it, and are confused and hurt. In fact, sometimes you just get punished and there’s nothing you can do. This is an unfortunate fact of how large organizations function: even if you do great technical work and build something really useful, you can fall afoul of a political battle fought three levels above your head, and come away with a worse reputation for it. Nothing to be done! This can be a hard pill to swallow for the high-ego engineers that tend to lead really useful technical projects. You also have to be okay with having your projects cancelled at the last minute. It’s a very common experience in large tech companies that you’re asked to deliver something quickly, you buckle down and get it done, and then right before shipping you’re told “actually, let’s cancel that, we decided not to do it”. This is partly because the decision-making process can be pretty fluid, and partly because many of these asks originate from off-hand comments: the CTO implies that something might be nice in a meeting, the VPs and directors hustle to get it done quickly, and then in the next meeting it becomes clear that the CTO doesn’t actually care, so the project is unceremoniously cancelled 4 . Nobody likes to work with a bully, or with someone who refuses to admit when they’re wrong, or with somebody incapable of empathy. But you really do need a strong ego to be an effective software engineer, because software engineering requires you to spend most of your day in a position of uncertainty or confusion. If your ego isn’t strong enough to stand up to that - if you don’t believe you’re good enough to power through - you simply can’t do the job. This is particularly true when it comes to working in a large software company. Many of the tasks you’re required to do (particularly if you’re a senior or staff engineer) require a healthy ego. However, there’s a kind of catch-22 here. If it insults your pride to work on silly projects, or to occasionally “catch a stray bullet” in the organization’s political fights, or to have to shelve a project that you worked hard on and is ready to ship, you’re too high-ego to be an effective software engineer. But if you can’t take firm positions, or if you’re too afraid to make enemies, or you’re unwilling to speak up and correct people, you’re too low-ego. Engineers who are low-ego in general can’t get stuff done, while engineers who are high-ego in general get slapped down by the executives who wield real organizational power. The most successful kind of software engineer is therefore a chameleon: low-ego when dealing with executives, but high-ego when dealing with the rest of the organization 5 . What do I mean by “ego”, in this context? More or less the colloquial sense of the term: a somewhat irrational self-confidence, a tendency to believe that you’re very important, the sense that you’re the “main character”, that sort of thing Why is this “ego”, and not just normal confidence? Well, because of just how murky and baffling software problems feel when you start working on them. You really do need a degree of confidence in yourself that feels unreasonable from the inside. It should be obvious, but I want to explicitly note that you don’t just need ego: you also have to be technically strong enough to actually succeed when your ego powers you through the initial period of self-doubt. I share the increasingly-common view that burnout is not caused by working too hard, but by hard work unrewarded. That explains why nothing burns you out as hard as being punished for hard work that you expected a reward for. It’s more or less exactly this scene from Silicon Valley. This description sounds a bit sociopathic to me. But, on reflection, it’s fairly unsurprising that competent sociopaths do well in large organizations. Whether that kind of behavior is worth emulating or worth avoiding is up to you, I suppose. What do I mean by “ego”, in this context? More or less the colloquial sense of the term: a somewhat irrational self-confidence, a tendency to believe that you’re very important, the sense that you’re the “main character”, that sort of thing ↩ Why is this “ego”, and not just normal confidence? Well, because of just how murky and baffling software problems feel when you start working on them. You really do need a degree of confidence in yourself that feels unreasonable from the inside. It should be obvious, but I want to explicitly note that you don’t just need ego: you also have to be technically strong enough to actually succeed when your ego powers you through the initial period of self-doubt. ↩ I share the increasingly-common view that burnout is not caused by working too hard, but by hard work unrewarded. That explains why nothing burns you out as hard as being punished for hard work that you expected a reward for. ↩ It’s more or less exactly this scene from Silicon Valley. ↩ This description sounds a bit sociopathic to me. But, on reflection, it’s fairly unsurprising that competent sociopaths do well in large organizations. Whether that kind of behavior is worth emulating or worth avoiding is up to you, I suppose. ↩

Programming

Career

0 views

Sean Goedecke 1 months ago

Giving LLMs a personality is just good engineering

AI skeptics often argue that current AI systems shouldn’t be so human-like. The idea - most recently expressed in this opinion piece by Nathan Beacom - is that language models should explicitly be tools, like calculators or search engines. Although they can pretend to be people, they shouldn’t, because it encourages users to overestimate AI capabilities and (at worst) slip into AI psychosis . Here’s a representative paragraph from the piece: In sum, so much of the confusion around making AI moral comes from fuzzy thinking about the tools at hand. There is something that Anthropic could do to make its AI moral, something far more simple, elegant, and easy than what Askell is doing. Stop calling it by a human name, stop dressing it up like a person, and don’t give it the functionality to simulate personal relationships, choices, thoughts, beliefs, opinions, and feelings that only persons really possess. Present and use it only for what it is: an extremely impressive statistical tool, and an imperfect one. If we all used the tool accordingly, a great deal of this moral trouble would be resolved. So why do Claude and ChatGPT act like people? According to Beacom, AI labs have built human-like systems because AI lab engineers are trying to hoodwink users into emotionally investing in the models, or because they’re delusional true believers in AI personhood, or some other foolish reason. This is wrong. AI systems are human-like because that is the best way to build a capable AI system . Modern AI models - whether designed for chat, like OpenAI’s GPT-5.2, or designed for long-running agentic work, like Claude Opus 4.6 - do not naturally emerge from their oceans of training data. Instead, when you train a model on raw data, you get a “base model”, which is not very useful by itself. You cannot get it to write an email for you, or proofread your essay, or review your code. The base model is a kind of mysterious gestalt of its training data. If you feed it text, it will sometimes continue in that vein, or other times it will start outputting pure gibberish. It has no problem producing code with giant security flaws, or horribly-written English, or racist screeds - all of those things are represented in its training data, after all, and the base model does not judge. It simply outputs. To build a useful AI model, you need to journey into the wild base model and stake out a region that is amenable to human interests: both ethically, in the sense that the model won’t abuse its users, and practically, in the sense that it will produce correct outputs more often than incorrect ones. What this means in practice is that you have to give the model a personality during post-training 1 . Human beings are capable of almost any action at any time. But we only take a tiny subset of those actions, because that’s the kind of people we are. I could throw my cup of coffee all over the wall right now, but I don’t, because I’m not the kind of person who needlessly makes a mess 2 . AI systems are the same. Claude could respond to my question with incoherent racist abuse - the base model is more than capable of those outputs - but it doesn’t, because that’s not the kind of “person” it is. In other words, human-like personalities are not imposed on AI tools as some kind of marketing ploy or philosophical mistake. Those personalities are the medium via which the language model can become useful at all. This is why it’s surprisingly tricky to “just” change a language model’s personality or opinions: because you’re navigating through the near-infinite manifold of the base model. You may be able to control which direction you go, but you can’t control what you find there 3 . When AI people talk about LLMs having personalities, or wanting things, or even having souls 4 , these are technical terms, like the “memory” of a computer or the “transmission” of a car. You simply cannot build a capable AI system that “just acts like a tool”, because the model is trained on humans writing to and about other humans . You need to prime it with some kind of personality (ideally that of a useful, friendly assistant) so it can pull from the helpful parts of its training data instead of the horrible parts. This is all pretty well understood in the AI space. Anthropic wrote a recent paper about it where they cite similar positions going all the way back to 2022. But for some reason it’s not yet penetrated into communities that are more skeptical of AI. You could explain this in terms of “the stories we tell ourselves”. Many people (though not all ) think that human identities are narratively constructed. I wrote about this last year in Mecha-Hitler, Grok, and why it’s so hard to give LLMs the right personality . A little nudge to change Grok’s views on South African internal politics can cause it to start calling itself “Mecha-Hitler”. I have long believed that Claude “feels better” to use than ChatGPT because it has a more coherent persona (due mainly to Amanda Askell’s work on its “soul”). My guess is that if you tried to make a “less human” version of Claude, it would become rapidly less capable. This is all pretty well understood in the AI space. Anthropic wrote a recent paper about it where they cite similar positions going all the way back to 2022. But for some reason it’s not yet penetrated into communities that are more skeptical of AI. ↩ You could explain this in terms of “the stories we tell ourselves”. Many people (though not all ) think that human identities are narratively constructed. ↩ I wrote about this last year in Mecha-Hitler, Grok, and why it’s so hard to give LLMs the right personality . A little nudge to change Grok’s views on South African internal politics can cause it to start calling itself “Mecha-Hitler”. ↩ I have long believed that Claude “feels better” to use than ChatGPT because it has a more coherent persona (due mainly to Amanda Askell’s work on its “soul”). My guess is that if you tried to make a “less human” version of Claude, it would become rapidly less capable. ↩

0 views

Sean Goedecke 1 months ago

Insider amnesia

Speculation about what’s really going on inside a tech company is almost always wrong. When some problem with your company is posted on the internet, and you read people’s thoughts on it, their thoughts are almost always ridiculous. For instance, they might blame product managers for a particular decision, when in fact the decision in question was engineering-driven and the product org was pushing back on it. Or they might attribute an incident to overuse of AI, when the system in question was largely written pre-AI-coding and unedited since. You just don’t know what the problem is unless you’re on the inside. But when some other company has a problem on the internet, it’s very tempting to jump in with your own explanations. After all, you’ve seen similar things in your own career. How different can it really be? Very different, as it turns out. This is especially true for companies that are unusually big or small. The recent kerfuffle over some bad GitHub Actions code is a good example of this - many people just seemed to have no mental model about how a large tech company can produce bad code, because their mental model of writing code is something like “individual engineer maintaining an open-source project for ten years”, or “tiny team of experts who all swarm on the same problem”, or something else that has very little to do with how large tech companies produce software 1 . I’m sure the same thing happens when big-tech or medium-tech people give opinions about how tiny startups work. The obvious reference here is to “Gell-Mann amnesia” , which is about the general pattern of experts correctly disregarding bad sources in their fields of expertise, but trusting those same sources on other topics. But I’ve taken to calling this “insider amnesia” to myself, because it applies even to experts who are writing in their own areas of expertise - it’s simply the fact that they’re outsiders that’s causing them to stumble. I wrote about this at length in How good engineers write bad code at big companies I wrote about this at length in How good engineers write bad code at big companies ↩

Programming

0 views

Sean Goedecke 1 months ago

LLM-generated skills work, if you generate them afterwards

LLM “skills” are a short explanatory prompt for a particular task, typically bundled with helper scripts. A recent paper showed that while skills are useful to LLMs, LLM-authored skills are not. From the abstract: Self-generated skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming For the moment, I don’t really want to dive into the paper. I just want to note that the way the paper uses LLMs to generate skills is bad, and you shouldn’t do this. Here’s how the paper prompts a LLM to produce skills: Before attempting to solve this task, please follow these steps: 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed. 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks. 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name. 4. Then solve the task using the skills you created as reference The key idea here is that they’re asking the LLM to produce a skill before it starts on the task. It’s essentially a strange version of the “make a plan first” or “think step by step” prompting strategy. I’m not at all surprised that this doesn’t help, because current reasoning models already think carefully about the task before they begin. What should you do instead? You should ask the LLM to write up a skill after it’s completed the task . Obviously this isn’t useful for truly one-off tasks. But few tasks are truly one-off. For instance, I’ve recently been playing around with SAEs and trying to clamp features in open-source models, a la Golden Gate Claude . It took a while for Codex to get this right. Here are some things it had to figure out: Once I was able (with Codex’s help) to clamp an 8B model and force it to obsess about a subject 1 , I then asked Codex to summarize the process into an agent skill 2 . That worked great! I was able to spin up a brand-new Codex instance with that skill and immediately get clamping working on a different 8B model. But if I’d asked Codex to write the skill at the start, it would have baked in all of its incorrect assumptions (like extracting from the final layernorm), and the skill wouldn’t have helped at all. In other words, the purpose of LLM-generated skills is to get it to distil the knowledge it’s gained by iterating on the problem for millions of tokens, not to distil the knowledge it already has from its training data. You can get a LLM to generate skills for you, so long as you do it after the LLM has already solved the problem the hard way . If you’re interested, it was “going to the movies”. I’ve pushed it up here . I’m sure you could do much better for a feature-extraction skill, this was just my zero-effort Codex-only attempt. Extracting features from the final layernorm is too late - you may as well just boost individual logits during sampling You have to extract from about halfway through the model layers to get features that can be usefully clamped Training a SAE on ~10k activations is two OOMs too few to get useful features. You need to train until features account for >50% of variance If you’re interested, it was “going to the movies”. ↩ I’ve pushed it up here . I’m sure you could do much better for a feature-extraction skill, this was just my zero-effort Codex-only attempt. ↩

Programming

1 views

Sean Goedecke 2 months ago

Two different tricks for fast LLM inference

Anthropic and OpenAI both recently announced “fast mode”: a way to interact with their best coding model at significantly higher speeds. These two versions of fast mode are very different. Anthropic’s offers up to 2.5x tokens per second (so around 170, up from Opus 4.6’s 65). OpenAI’s offers more than 1000 tokens per second (up from GPT-5.3-Codex’s 65 tokens per second, so 15x). So OpenAI’s fast mode is six times faster than Anthropic’s 1 . However, Anthropic’s big advantage is that they’re serving their actual model. When you use their fast mode, you get real Opus 4.6, while when you use OpenAI’s fast mode you get GPT-5.3-Codex-Spark, not the real GPT-5.3-Codex. Spark is indeed much faster, but is a notably less capable model: good enough for many tasks, but it gets confused and messes up tool calls in ways that vanilla GPT-5.3-Codex would never do. Why the differences? The AI labs aren’t advertising the details of how their fast modes work, but I’m pretty confident it’s something like this: Anthropic’s fast mode is backed by low-batch-size inference, while OpenAI’s fast mode is backed by special monster Cerebras chips . Let me unpack that a bit. The tradeoff at the heart of AI inference economics is batching , because the main bottleneck is memory . GPUs are very fast, but moving data onto a GPU is not. Every inference operation requires copying all the tokens of the user’s prompt 2 onto the GPU before inference can start. Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full. A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus . But obviously overall throughput would be much lower, because people would be waiting at the bus stop for hours until they managed to actually get on one. Anthropic’s fast mode offering is basically a bus pass that guarantees that the bus immediately leaves as soon as you get on. It’s six times the cost, because you’re effectively paying for all the other people who could have got on the bus with you, but it’s way faster 3 because you spend zero time waiting for the bus to leave. Obviously I can’t be fully certain this is right. Maybe they have access to some new ultra-fast compute that they’re running this on, or they’re doing some algorithmic trick nobody else has thought of. But I’m pretty sure this is it. Brand new compute or algorithmic tricks would likely require changes to the model (see below for OpenAI’s system), and “six times more expensive for 2.5x faster” is right in the ballpark for the kind of improvement you’d expect when switching to a low-batch-size regime. OpenAI’s fast mode does not work anything like this. You can tell that simply because they’re introducing a new, worse model for it. There would be absolutely no reason to do that if they were simply tweaking batch sizes. Also, they told us in the announcement blog post exactly what’s backing their fast mode: Cerebras. OpenAI announced their Cerebras partnership a month ago in January. What’s Cerebras? They build “ultra low-latency compute”. What this means in practice is that they build giant chips . A H100 chip (fairly close to the frontier of inference chips) is just over a square inch in size. A Cerebras chip is 70 square inches. You can see from pictures that the Cerebras chip has a grid-and-holes pattern all over it. That’s because silicon wafers this big are supposed to be broken into dozens of chips. Instead, Cerebras etches a giant chip over the entire thing. The larger the chip, the more internal memory it can have. The idea is to have a chip with SRAM large enough to fit the entire model , so inference can happen entirely in-memory. Typically GPU SRAM is measured in the tens of megabytes . That means that a lot of inference time is spent streaming portions of the model weights from outside of SRAM into the GPU compute 4 . If you could stream all of that from the (much faster) SRAM, inference would a big speedup: fifteen times faster, as it turns out! So how much internal memory does the latest Cerebras chip have? 44GB . This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That’s why they’re offering a brand new model, and why the Spark model has a bit of “small model smell” to it: it’s a smaller distil of the much larger GPT-5.3-Codex model 5 . It’s interesting that the two major labs have two very different approaches to building fast AI inference. If I had to guess at a conspiracy theory, it would go something like this: Obviously OpenAI’s achievement here is more technically impressive. Getting a model running on Cerebras chips is not trivial, because they’re so weird. Training a 20B or 40B param distil of GPT-5.3-Codex that is still kind-of-good-enough is not trivial. But I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI’s mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens . Seeing the two major labs put out this feature might make you think that fast AI inference is the new major goal they’re chasing. I don’t think it is. If my theory above is right, Anthropic don’t care that much about fast inference, they just didn’t want to appear behind OpenAI. And OpenAI are mainly just exploring the capabilities of their new Cerebras partnership. It’s still largely an open question what kind of models can fit on these giant chips, how useful those models will be, and if the economics will make any sense. I personally don’t find “fast, less-capable inference” particularly useful. I’ve been playing around with it in Codex and I don’t like it. The usefulness of AI agents is dominated by how few mistakes they make , not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model 6 . However, it’s certainly possible that fast, less-capable inference becomes a core lower-level primitive in AI systems. Claude Code already uses Haiku for some operations. Maybe OpenAI will end up using Spark in a similar way. This isn’t even factoring in latency. Anthropic explicitly warns that time to first token might still be slow (or even slower), while OpenAI thinks the Spark latency is fast enough to warrant switching to a persistent websocket (i.e. they think the 50-200ms round trip time for the handshake is a significant chunk of time to first token). Either in the form of the KV-cache for previous tokens, or as some big tensor of intermediate activations if inference is being pipelined through multiple GPUs. I write a lot more about this in Why DeepSeek is cheap at scale but expensive to run locally , since it explains why DeepSeek can be offered at such cheap prices (massive batches allow an economy of scale on giant expensive GPUs, but individual consumers can’t access that at all). Is it a contradiction that low-batch-size means low throughput, but this fast pass system gives users much greater throughput? No. The overall throughput of the GPU is much lower when some users are using “fast mode”, but those user’s throughput is much higher. Remember, GPUs are fast, but copying data onto them is not. Each “copy these weights to GPU” step is a meaningful part of the overall inference time. Or a smaller distil of whatever more powerful base model GPT-5.3-Codex was itself distilled from. I don’t know how AI labs do it exactly, and they keep it very secret. More on that here . On this note, it’s interesting to point out that Cursor’s hype dropped away basically at the same time they released their own “much faster, a little less-capable” agent model. Of course, much of this is due to Claude Code sucking up all the oxygen in the room, but having a very fast model certainly didn’t help . OpenAI partner with Cerebras in mid-January, obviously to work on putting an OpenAI model on a fast Cerebras chip Anthropic have no similar play available, but they know OpenAI will announce some kind of blazing-fast inference in February, and they want to have something in the news cycle to compete with that Anthropic thus hustles to put together the kind of fast inference they can provide: simply lowering the batch size on their existing inference stack Anthropic (probably) waits until a few days before OpenAI are done with their much more complex Cerebras implementation to announce it, so it looks like OpenAI copied them This isn’t even factoring in latency. Anthropic explicitly warns that time to first token might still be slow (or even slower), while OpenAI thinks the Spark latency is fast enough to warrant switching to a persistent websocket (i.e. they think the 50-200ms round trip time for the handshake is a significant chunk of time to first token). ↩ Either in the form of the KV-cache for previous tokens, or as some big tensor of intermediate activations if inference is being pipelined through multiple GPUs. I write a lot more about this in Why DeepSeek is cheap at scale but expensive to run locally , since it explains why DeepSeek can be offered at such cheap prices (massive batches allow an economy of scale on giant expensive GPUs, but individual consumers can’t access that at all). ↩ Is it a contradiction that low-batch-size means low throughput, but this fast pass system gives users much greater throughput? No. The overall throughput of the GPU is much lower when some users are using “fast mode”, but those user’s throughput is much higher. ↩ Remember, GPUs are fast, but copying data onto them is not. Each “copy these weights to GPU” step is a meaningful part of the overall inference time. ↩ Or a smaller distil of whatever more powerful base model GPT-5.3-Codex was itself distilled from. I don’t know how AI labs do it exactly, and they keep it very secret. More on that here . ↩ On this note, it’s interesting to point out that Cursor’s hype dropped away basically at the same time they released their own “much faster, a little less-capable” agent model. Of course, much of this is due to Claude Code sucking up all the oxygen in the room, but having a very fast model certainly didn’t help . ↩

Performance

1 views

Sean Goedecke 2 months ago

On screwing up

The most shameful thing I did in the workplace was lie to a colleague. It was about ten years ago, I was a fresh-faced intern, and in the rush to deliver something I’d skipped the step of testing my work in staging 1 . It did not work. When deployed to production, it didn’t work there either. No big deal, in general terms: the page we were working on wasn’t yet customer-facing. But my colleague asked me over his desk whether this worked when I’d tested it, and I said something like “it sure did, no idea what happened”. I bet he forgot about it immediately. I could have just messed up the testing (for instance, by accidentally running some different code than the code I pushed), or he knew I’d probably lied, and didn’t really care. I haven’t forgotten about it. Even a decade later, I’m still ashamed to write it down. Of course I’m not ashamed about the mistake . I was sloppy to not test my work, but I’ve cut corners since then when I felt it was necessary, and I stand by that decision. I’m ashamed about how I handled it. But even that I understand. I was a kid, trying to learn quickly and prove I belonged in tech. The last thing I wanted to do was to dwell on the way I screwed up. If I were in my colleague’s shoes now, I’d have brushed it off too 2 . How do I try to handle mistakes now? The most important thing is to control your emotions . If you’re anything like me, your strongest emotional reactions at work will be reserved for the times you’ve screwed up. There are usually two countervailing emotions at play here: the desire to defend yourself, find excuses, and minimize the consequences; and the desire to confess your guilt, abase yourself, and beg for forgiveness. Both of these are traps. Obviously making excuses for yourself (or flat-out denying the mistake, like I did) is bad. But going in the other direction and publicly beating yourself up about it is just as bad . It’s bad for a few reasons. First, you’re effectively asking the people around you to take the time and effort to reassure you, when they should be focused on the problem. Second, you’re taking yourself out of the group of people who are focused on the problem, when often you’re the best situated to figure out what to do: since it’s your mistake, you have the most context. Third, it’s just not professional. So what should you do? For the first little while, do nothing . Emotional reactions fade over time. Try and just ride out the initial jolt of realizing you screwed up, and the impulse to leap into action to fix it. Most of the worst reactions to screwing up happen in the immediate aftermath, so if you can simply do nothing during that period you’re already off to a good start. For me, this takes about thirty seconds. How much time you’ll need depends on you, but hopefully it’s under ten minutes. More than that and you might need to grit your teeth and work through it. Once you’re confident you’re under control, the next step is to tell people what happened . Typically you want to tell your manager, but depending on the problem it could also be a colleague or someone else. It’s really important here to be matter-of-fact about it, or you risk falling into the “I’m so terrible, please reassure me” trap I discussed above. You often don’t even need to explicitly say “I made a mistake”, if it’s obvious from context. Just say “I deployed a change and it’s broken X feature” (or whatever the problem is). You should do this before you’ve come up with a solution. It’s tempting to try to conceal your mistake and just quietly solve it. But for user-facing mistakes, concealment is impossible - somebody will raise a ticket eventually - and if you don’t communicate the issue, you risk someone else discovering it and independently raising it. In the worst case, while you’re quietly working on a fix, you’ll discover that somebody else has declared an incident. Of course, you understand the problem perfectly (since you caused it), and you know that it was caused by a bad deploy and is easily fixable. But the other people on the incident call don’t know all that. They’re thinking about the worst-case scenarios, wondering if it’s database or network-related, paging in all kinds of teams, causing all kinds of hassle. All of that could have been avoided if you had reported the issue immediately. In my experience, tech company managers will forgive mistakes 3 , but they won’t forgive being made to look like a fool . In particular, they won’t forgive being deprived of critical information. If they’re asked to explain the incident by their boss, and they have to flounder around because they lack the context that you had all along , that may harm your relationship with them for good. On the other hand, if you give them a clear summary of the problem right away, and they’re able to seem like they’re on top of things to their manager, you might even earn credit for the situation (despite having caused it with your initial mistake). However, you probably won’t earn credit. This is where I diverge from the popular software engineering wisdom that incidents are always the fault of systems, never of individuals. Of course incidents are caused by the interactions of complex systems. Everything in the universe is caused by the interactions of complex systems! But one cause in that chain is often somebody screwing up 4 . If you’re a manager of an engineering organization, and you want a project to succeed, you probably have a mental shortlist of the engineers in your org who can reliably lead projects 5 . If an engineer screws up repeatedly, they’re likely to drop off that list (or at least get an asterisk next to their name). It doesn’t really matter if you had a good technical reason to make the mistake, or if it’s excusable. Managers don’t care about that stuff, because they simply don’t have the technical context to know if it’s true or if you’re just trying to talk your way out of it. What managers do have the context to evaluate is results , so that’s what they judge you on. That means some failures are acceptable, so long as you’ve got enough successes to balance them out. Being a strong engineer is about finding a balance between always being right and taking risks . If you prioritize always being right, you can probably avoid making mistakes, but you won’t be able to lead projects (since that always requires taking risks). Therefore, the optimal amount of mistakes at work is not zero. Unless you’re working in a few select industries 6 , you should expect to make mistakes now and then, otherwise you’re likely working far too slow. From memory, I think I had tested an earlier version of the code, but then I made some tweaks and skipped the step where I tested that it worked even with those tweaks. Though I would have made a mental note (and if someone more senior had done this, I would have been a bit less forgiving). Though they may not forget them. More on that later. It’s probably not that comforting to replace “you screwed up by being incompetent” with “it’s not your fault, it’s the system’s fault for hiring an engineer as incompetent as you”. For more on that, see How I ship projects at large tech companies . The classic examples are pacemakers and the Space Shuttle (should that now be Starship/New Glenn)? From memory, I think I had tested an earlier version of the code, but then I made some tweaks and skipped the step where I tested that it worked even with those tweaks. ↩ Though I would have made a mental note (and if someone more senior had done this, I would have been a bit less forgiving). ↩ Though they may not forget them. More on that later. ↩ It’s probably not that comforting to replace “you screwed up by being incompetent” with “it’s not your fault, it’s the system’s fault for hiring an engineer as incompetent as you”. ↩ For more on that, see How I ship projects at large tech companies . ↩ The classic examples are pacemakers and the Space Shuttle (should that now be Starship/New Glenn)? ↩

Career

0 views

Sean Goedecke 2 months ago

Large tech companies don't need heroes

Large tech companies operate via systems . What that means is that the main outcomes - up to and including the overall success or failure of the company - are driven by a complex network of processes and incentives. These systems are outside the control of any particular person. Like the parts of a large codebase, they have accumulated and co-evolved over time, instead of being designed from scratch. Some of these processes and incentives are “legible”, like OKRs or promotion criteria. Others are “illegible”, like the backchannel conversations that usually precede a formal consensus on decisions 1 . But either way, it is these processes and incentives that determine what happens, not any individual heroics . This state of affairs is not efficient at producing good software. In large tech companies, good software often seems like it is produced by accident , as a by-product of individual people responding to their incentives. However, that’s just the way it has to be. A shared belief in the mission can cause a small group of people to prioritize good software over their individual benefit, for a little while. But thousands of engineers can’t do that for decades. Past a certain point of scale 2 , companies must depend on the strength of their systems. Individual engineers often react to this fact with horror. After all, they want to produce high-quality software. Why is everyone around them just cynically 3 focused on their own careers? On top of that, many software engineers got into the industry because they are internally compelled 4 to make systems more efficient. For these people, it is viscerally uncomfortable being employed in an inefficient company. They are thus prepared to do whatever it takes to patch up their system’s local inefficiencies. Of course, making your team more effective does not always require heroics. Some amount of fixing inefficiencies - improving process, writing tests, cleaning up old code - is just part of the job, and will get engineers rewarded and promoted just like any other kind of engineering work. But there’s a line. Past a certain point, working on efficiency-related stuff instead of your actual projects will get you punished, not rewarded. To go over that line requires someone willing to sacrifice their own career progression in the name of good engineering. In other words, it requires a hero . You can sacrifice your promotions and bonuses to make one tiny corner of the company hum along nicely for a while. However, like I said above, the overall trajectory of the company is almost never determined by one person. It doesn’t really matter how efficient you made some corner of the Google Wave team if the whole product was doomed. And even poorly-run software teams can often win, so long as they’re targeting some niche that the company is set up to support (think about the quality of most profitable enterprise software). On top of that, heroism makes it difficult for real change to happen . If a company is set up to reward bad work and punish good work, having some hero step up to do good work anyway and be punished will only insulate the company from the consequences of its own systems . Far better to let the company be punished for its failings, so it can (slowly, slowly) adjust, or be replaced by companies that operate better. Large tech companies don’t benefit long-term from heroes, but there’s still a role for heroes. That role is to be exploited . There are no shortage of predators who will happily recruit a hero for some short-term advantage. Some product managers keep a mental list of engineers in other teams who are “easy targets”: who can be convinced to do extra work on projects that benefit the product manager (but not that engineer). During high-intensity periods, such as the lead-up to a major launch, there is sometimes a kind of cold war between different product organizations, as they try to extract behind-the-scenes help from the engineers in each other’s camps while jealously guarding their own engineering resources. Likewise, some managers have no problem letting one of their engineers spend all their time on glue work . Much of that work would otherwise be the manager’s responsibility, so it makes the manager’s job easier. Of course, when it comes time for promotions, the engineer will be punished for not doing their real work. This is why it’s important for engineers to pay attention to their actual rewards. Promotions, bonuses and raises are the hard currency of software companies. Giving those out shows what the company really values. Predators don’t control those things (if they did, they wouldn’t be predators). As a substitute, they attempt to appeal to a hero’s internal compulsion to be useful or to clean up inefficiencies. Large tech companies are structurally set up to encourage software engineers to engage in heroics A background level of inefficiency is just part of the landscape of large tech companies I write about this point at length in Seeing like a software company . Why do companies need to scale, if it means they become less efficient? The best piece on this is Dan Luu’s I could build that in a weekend! : in short, because the value of marginal features in a successful software product is surprisingly high, and you need a lot of developers to capture all the marginal features. For a post on why this is not actually that cynical, see my Software engineers should be a little bit cynical . I write about these internal compulsions in I’m addicted to being useful . Large tech companies are structurally set up to encourage software engineers to engage in heroics This is largely accidental, and doesn’t really benefit those tech companies in the long term, since large tech companies are just too large to be meaningfully moved by individual heroics However, individual managers and product managers inside these tech companies have learned to exploit this surplus heroism for their individual ends As a software engineer, you should resist the urge to heroically patch some obvious inefficiency you see in the organization Unless that work is explicitly rewarded by the company, all your efforts will do is delay the point at which the company has to change its processes A background level of inefficiency is just part of the landscape of large tech companies It’s the price they pay to be so large (and in return reap the benefits of scale and legibility ) The more you can learn to live with it, the more you’ll be able to use your energy tactically for your own benefit I write about this point at length in Seeing like a software company . ↩ Why do companies need to scale, if it means they become less efficient? The best piece on this is Dan Luu’s I could build that in a weekend! : in short, because the value of marginal features in a successful software product is surprisingly high, and you need a lot of developers to capture all the marginal features. ↩ For a post on why this is not actually that cynical, see my Software engineers should be a little bit cynical . ↩ I write about these internal compulsions in I’m addicted to being useful . ↩

Business

Career

0 views

Sean Goedecke 2 months ago

How does AI impact skill formation?

Two days ago, the Anthropic Fellows program released a paper called How AI Impacts Skill Formation . Like other papers on AI before it, this one is being treated as proof that AI makes you slower and dumber. Does it prove that? The structure of the paper is sort of similar to the 2025 MIT study Your Brain on ChatGPT . They got a group of people to perform a cognitive task that required learning a new skill: in this case, the Python Trio library. Half of those people were required to use AI and half were forbidden from using it. The researchers then quizzed those people to see how much information they retained about Trio. The banner result was that AI users did not complete the task faster, but performed much worse on the quiz . If you were so inclined, you could naturally conclude that any perceived AI speedup is illusory, and the people who are using AI tooling are cooking their brains. But I don’t think that conclusion is reasonable. To see why, let’s look at Figure 13 from the paper: The researchers noticed half of the AI-using cohort spent most of their time literally retyping the AI-generated code into their solution, instead of copy-pasting or “manual coding”: writing their code from scratch with light AI guidance. If you ignore the people who spent most of their time retyping, the AI-users were 25% faster. I confess that this kind of baffles me. What kind of person manually retypes AI-generated code? Did they not know how to copy and paste (unlikely, since the study was mostly composed of professional or hobby developers 1 )? It certainly didn’t help them on the quiz score. The retypers got the same (low) scores as the pure copy-pasters. In any case, if you know how to copy-paste or use an AI agent, I wouldn’t use this paper as evidence that AI will not be able to speed you up. Even if AI use offers a 25% speedup, is that worth sacrificing the opportunity to learn new skills? What about the quiz scores? Well, first we should note that the AI users who used the AI for general questions but wrote all their own code did fine on the quiz . If you look at Figure 13 above, you can see that those AI users averaged maybe a point lower on the quiz - not bad, for people working 25% faster. So at least some kinds of AI use seem fine. But of course much current AI use is not like this: if you’re using Claude Code or Copilot agent mode, you’re getting the AI to do the code writing for you. Are you losing key skills by doing that? Well yes, of course you are. If you complete a task in ten minutes by throwing it at a LLM, you will learn much less about the codebase than if you’d spent an hour doing it by hand. I think it’s pretty silly to deny this: it’s intuitively right, and anybody who has used AI agents extensively at work can attest to it from their own experience. Still, I have two points to make about this. First, software engineers are not paid to learn about the codebase . We are paid to deliver business value (typically by delivering working code). If AI can speed that up dramatically, avoiding it makes you worse at your job, even if you’re learning more efficiently. That’s a bit unfortunate for us - it was very nice when we could get much better at the job simply by doing it more - but that doesn’t make it false. Other professions have been dealing with this forever. Doctors are expected to spend a lot of time in classes and professional development courses, learning how to do their job in other ways than just doing it. It may be that future software engineers will need to spend 20% of their time manually studying their codebases: not just in the course of doing some task (which could be far more quickly done by AI agents) but just to stay up-to-date enough that their skills don’t atrophy. The other point I wanted to make is that even if your learning rate is slower, moving faster means you may learn more overall . Suppose using AI meant that you learned only 75% as much as non-AI programmers from any given task. Whether you’re learning less overall depends on how many more tasks you’re doing . If you’re working faster, the loss of learning efficiency may be balanced out by volume. I don’t know if this is true. I suspect there really is no substitute for painstakingly working through a codebase by hand. But the engineer who is shipping 2x as many changes is probably also learning things that the slower, manual engineer does not know. At minimum, they’ll be acquiring a greater breadth of knowledge of different subsystems, even if their depth suffers. Anyway, the point is simply that a lower learning rate does not by itself prove that less learning is happening overall. Finally, I will reluctantly point out that the model used for this task was GPT-4o (see section 4.1). I’m reluctant here because I sympathize with the AI skeptics, who are perpetually frustrated by the pro-AI response of “well, you just haven’t tried the right model”. In a world where new AI models are released every month or two, demanding that people always study the best model makes it functionally impossible to study AI use at all. Still, I’m just kind of confused about why GPT-4o was chosen. This study was funded by Anthropic, who have much better models. This study was conducted in 2025 2 , at least six months after the release of GPT-4o (that’s like five years in AI time). I can’t help but wonder if the AI-users cohort would have run into fewer problems with a more powerful model. I don’t have any real problem with this paper. They set out to study how different patterns of AI use affect learning, and their main conclusion - that pure “just give the problem to the model” AI use means you learn a lot less - seems correct to me. I don’t like their conclusion that AI use doesn’t speed you up, since it relies on the fact that 50% of their participants spent their time literally retyping AI code . I wish they’d been more explicit in the introduction that this was the case, but I don’t really blame them for the result - I’m more inclined to blame the study participants themselves, who should have known better. Overall, I don’t think this paper provides much new ammunition to the AI skeptic. Like I said above, it doesn’t support the point that AI speedup is a mirage. And the point it does support (that AI use means you learn less) is obvious. Nobody seriously believes that typing “build me a todo app” into Claude Code means you’ll learn as much as if you built it by hand. That said, I’d like to see more investigation into long-term patterns of AI use in tech companies. Is the slower learning rate per-task balanced out by the higher rate of task completion? Can it be replaced by carving out explicit time to study the codebase? It’s probably too early to answer these questions - strong coding agents have only been around for a handful of months - but the answers may determine what it’s like to be a software engineer for the next decade. See Figure 17. I suppose the study doesn’t say that explicitly, but the Anthropic Fellows program was only launched in December 2024, and the paper was published in January 2026. See Figure 17. ↩ I suppose the study doesn’t say that explicitly, but the Anthropic Fellows program was only launched in December 2024, and the paper was published in January 2026. ↩

Python

Programming

Data Analysis

0 views

Sean Goedecke 2 months ago

How I estimate work as a staff software engineer

There’s a kind of polite fiction at the heart of the software industry. It goes something like this: Estimating how long software projects will take is very hard, but not impossible. A skilled engineering team can, with time and effort, learn how long it will take for them to deliver work, which will in turn allow their organization to make good business plans. This is, of course, false. As every experienced software engineer knows, it is not possible to accurately estimate software projects . The tension between this polite fiction and its well-understood falseness causes a lot of strange activity in tech companies. For instance, many engineering teams estimate work in t-shirt sizes instead of time, because it just feels too obviously silly to the engineers in question to give direct time estimates. Naturally, these t-shirt sizes are immediately translated into hours and days when the estimates make their way up the management chain. Alternatively, software engineers who are genuinely trying to give good time estimates have ridiculous heuristics like “double your initial estimate and add 20%“. This is basically the same as giving up and saying “just estimate everything at a month”. Should tech companies just stop estimating? One of my guiding principles is that when a tech company is doing something silly, they’re probably doing it for a good reason . In other words, practices that appear to not make sense are often serving some more basic, illegible role in the organization. So what is the actual purpose of estimation, and how can you do it well as a software engineer? Before I get into that, I should justify my core assumption a little more. People have written a lot about this already, so I’ll keep it brief. I’m also going to concede that sometimes you can accurately estimate software work , when that work is very well-understood and very small in scope. For instance, if I know it takes half an hour to deploy a service 1 , and I’m being asked to update the text in a link, I can accurately estimate the work at something like 45 minutes: five minutes to push the change up, ten minutes to wait for CI, thirty minutes to deploy. For most of us, the majority of software work is not like this. We work on poorly-understood systems and cannot predict exactly what must be done in advance. Most programming in large systems is research : identifying prior art, mapping out enough of the system to understand the effects of changes, and so on. Even for fairly small changes, we simply do not know what’s involved in making the change until we go and look. The pro-estimation dogma says that these questions ought to be answered during the planning process, so that each individual piece of work being discussed is scoped small enough to be accurately estimated. I’m not impressed by this answer. It seems to me to be a throwback to the bad old days of software architecture , where one architect would map everything out in advance, so that individual programmers simply had to mechanically follow instructions. Nobody does that now, because it doesn’t work: programmers must be empowered to make architectural decisions, because they’re the ones who are actually in contact with the code 2 . Even if it did work, that would simply shift the impossible-to-estimate part of the process backwards, into the planning meeting (where of course you can’t write or run code, which makes it near-impossible to accurately answer the kind of questions involved). In short: software engineering projects are not dominated by the known work, but by the unknown work, which always takes 90% of the time. However, only the known work can be accurately estimated. It’s therefore impossible to accurately estimate software projects in advance. Estimates do not help engineering teams deliver work more efficiently. Many of the most productive years of my career were spent on teams that did no estimation at all: we were either working on projects that had to be done no matter what, and so didn’t really need an estimate, or on projects that would deliver a constant drip of value as we went, so we could just keep going indefinitely 3 . In a very real sense, estimates aren’t even made by engineers at all . If an engineering team comes up with a long estimate for a project that some VP really wants, they will be pressured into lowering it (or some other, more compliant engineering team will be handed the work). If the estimate on an undesirable project - or a project that’s intended to “hold space” for future unplanned work - is too short, the team will often be encouraged to increase it, or their manager will just add a 30% buffer. One exception to this is projects that are technically impossible, or just genuinely prohibitively difficult. If a manager consistently fails to pressure their teams into giving the “right” estimates, that can send a signal up that maybe the work can’t be done after all. Smart VPs and directors will try to avoid taking on technically impossible projects. Another exception to this is areas of the organization that senior leadership doesn’t really care about. In a sleepy backwater, often the formal estimation process does actually get followed to the letter, because there’s no director or VP who wants to jump in and shape the estimates to their ends. This is one way that some parts of a tech company can have drastically different engineering cultures to other parts. I’ll let you imagine the consequences when the company is re-orged and these teams are pulled into the spotlight. Estimates are political tools for non-engineers in the organization . They help managers, VPs, directors, and C-staff decide on which projects get funded and which projects get cancelled. The standard way of thinking about estimates is that you start with a proposed piece of software work, and you then go and figure out how long it will take. This is entirely backwards. Instead, teams will often start with the estimate, and then go and figure out what kind of software work they can do to meet it. Suppose you’re working on a LLM chatbot, and your director wants to implement “talk with a PDF”. If you have six months to do the work, you might implement a robust file upload system, some pipeline to chunk and embed the PDF content for semantic search, a way to extract PDF pages as image content to capture formatting and diagrams, and so on. If you have one day to do the work, you will naturally search for simpler approaches: for instance, converting the PDF to text client-side and sticking the entire thing in the LLM context, or offering a plain-text “grep the PDF” tool. This is true at even at the level of individual lines of code. When you have weeks or months until your deadline, you might spend a lot of time thinking airily about how you could refactor the codebase to make your new feature fit in as elegantly as possible. When you have hours, you will typically be laser-focused on finding an approach that will actually work. There are always many different ways to solve software problems. Engineers thus have quite a lot of discretion about how to get it done. So how do I estimate, given all that? I gather as much political context as possible before I even look at the code . How much pressure is on this project? Is it a casual ask, or do we have to find a way to do this? What kind of estimate is my management chain looking for? There’s a huge difference between “the CTO really wants this in one week” and “we were looking for work for your team and this seemed like it could fit”. Ideally, I go to the code with an estimate already in hand . Instead of asking myself “how long would it take to do this”, where “this” could be any one of a hundred different software designs, I ask myself “which approaches could be done in one week?“. I spend more time worrying about unknowns than knowns . As I said above, unknown work always dominates software projects. The more “dark forests” in the codebase this feature has to touch, the higher my estimate will be - or, more concretely, the tighter I need to constrain the set of approaches to the known work. Finally, I go back to my manager with a risk assessment, not with a concrete estimate . I don’t ever say “this is a four-week project”. I say something like “I don’t think we’ll get this done in one week, because X Y Z would need to all go right, and at least one of those things is bound to take a lot more work than we expect. Ideally, I go back to my manager with a series of plans, not just one: In other words, I don’t “break down the work to determine how long it will take”. My management chain already knows how long they want it to take. My job is to figure out the set of software approaches that match that estimate. Sometimes that set is empty: the project is just impossible, no matter how you slice it. In that case, my management chain needs to get together and figure out some way to alter the requirements. But if I always said “this is impossible”, my managers would find someone else to do their estimates. When I do that, I’m drawing on a well of trust that I build up by making pragmatic estimates the rest of the time. Many engineers find this approach distasteful. One reason is that they don’t like estimating in conditions of uncertainty, so they insist on having all the unknown questions answered in advance. I have written a lot about this in Engineers who won’t commit and How I provide technical clarity to non-technical leaders , but suffice to say that I think it’s cowardly. If you refuse to estimate, you’re forcing someone less technical to estimate for you. Some engineers think that their job is to constantly push back against engineering management, and that helping their manager find technical compromises is betraying some kind of sacred engineering trust. I wrote about this in Software engineers should be a little bit cynical . If you want to spend your career doing that, that’s fine, but I personally find it more rewarding to find ways to work with my managers (who have almost exclusively been nice people). Other engineers might say that they rarely feel this kind of pressure from their directors or VPs to alter estimates, and that this is really just the sign of a dysfunctional engineering organization. Maybe! I can only speak for the engineering organizations I’ve worked in. But my suspicion is that these engineers are really just saying that they work “out of the spotlight”, where there’s not much pressure in general and teams can adopt whatever processes they want. There’s nothing wrong with that. But I don’t think it qualifies you to give helpful advice to engineers who do feel this kind of pressure. I think software engineering estimation is generally misunderstood. The common view is that a manager proposes some technical project, the team gets together to figure out how long it would take to build, and then the manager makes staffing and planning decisions with that information. In fact, it’s the reverse: a manager comes to the team with an estimate already in hand (though they might not come out and admit it), and then the team must figure out what kind of technical project might be possible within that estimate. This is because estimates are not by or for engineering teams. They are tools used for managers to negotiate with each other about planned work. Very occasionally, when a project is literally impossible, the estimate can serve as a way for the team to communicate that fact upwards. But that requires trust. A team that is always pushing back on estimates will not be believed when they do encounter a genuinely impossible proposal. When I estimate, I extract the range my manager is looking for, and only then do I go through the code and figure out what can be done in that time. I never come back with a flat “two weeks” figure. Instead, I come back with a range of possibilities, each with their own risks, and let my manager make that tradeoff. It is not possible to accurately estimate software work. Software projects spend most of their time grappling with unknown problems, which by definition can’t be estimated in advance. To estimate well, you must therefore basically ignore all the known aspects of the work, and instead try and make educated guesses about how many unknowns there are, and how scary each unknown is. edit: I should thank one of my readers, Karthik, who emailed me to ask about estimates, thus revealing to me that I had many more opinions than I thought. For anyone wincing at that time, I mean like three minutes of actual deployment and twenty-seven minutes of waiting for checks to pass or monitors to turn up green. I write a lot more about this in You can’t design software you don’t work on . For instance, imagine a mandate to improve the performance of some large Rails API, one piece at a time. I could happily do that kind of work forever. We tackle X Y Z directly, which might all go smoothly but if it blows out we’ll be here for a month We bypass Y and Z entirely, which would introduce these other risks but possibly allow us to hit the deadline We bring in help from another team who’s more familiar with X and Y, so we just have to focus on Z For anyone wincing at that time, I mean like three minutes of actual deployment and twenty-seven minutes of waiting for checks to pass or monitors to turn up green. ↩ I write a lot more about this in You can’t design software you don’t work on . ↩ For instance, imagine a mandate to improve the performance of some large Rails API, one piece at a time. I could happily do that kind of work forever. ↩

Business

Ruby

Career

1 views

Sean Goedecke 2 months ago

Crypto grifters are recruiting open-source AI developers

Two recently-hyped developments in AI engineering have been Geoff Huntley’s “Ralph Wiggum loop” and Steve Yegge’s “Gas Town”. Huntley and Yegge are both respected software engineers with a long pedigree of actual projects. The Ralph loop is a sensible idea: force infinite test-time-compute by automatically restarting Claude Code whenever it runs out of steam. Gas Town is a platform for an idea that’s been popular for a while (though in my view has never really worked): running a whole village of LLM agents that collaborate with each other to accomplish a task. So far, so good. But Huntley and Yegge have also been posting about $RALPH and $GAS, which are cryptocurrency coins built on top of the longstanding Solana cryptocurrency and the Bags tool, which allows people to easily create their own crypto coins. What does $RALPH have to do with the Ralph Wiggum loop? What does $GAS have to do with Gas Town? From reading Huntley and Yegge’s posts, it seems like what happened was this: So what does $GAS have to do with Gas Town (or $RALPH with Ralph Wiggum)? From a technical perspective, the answer is nothing . Gas Town is an open-source GitHub repository that you can clone, edit and run without ever interacting with the $GAS coin. Likewise for Ralph . Buying $GAS or $RALPH does not unlock any new capabilities in the tools. All it does is siphon a little bit of money to Yegge and Huntley, and increase the value of the $GAS or $RALPH coins. Of course, that’s why these coins exist in the first place. This is a new variant of an old “airdropping” cryptocurrency tactic. The classic problem with “memecoins” is that it’s hard to give people a reason to buy them, even at very low prices, because they famously have no staying power. That’s why many successful memecoins rely on celebrity power, like Eric Adams’ “NYC Token” or the $TRUMP coin. But how do you convince a celebrity to get involved in your grift business venture? This is where Bags comes in. Bags allows you to nominate a Twitter account as the beneficiary (or “fee earner”) of your coin. The person behind that Twitter account doesn’t have to agree, or even know that you’re doing it. Once you accumulate a nominal market cap (for instance, by moving a bunch of your own money onto the coin), you can then message the owner of that Twitter account and say “hey, all these people are supporting you via crypto, and you can collect your money right now if you want!” Then you either subtly hint that promoting the coin would cause that person to make more money, or you wait for them to realize it themselves 1 . Once they start posting about it, you’ve bootstrapped your own celebrity coin. This system relies on your celebrity target being dazzled by receiving a large sum of free money. If you came to them before the money was there, they might ask questions like “why wouldn’t people just directly donate to me?”, or “are these people who think they’re supporting me going to lose all their money?“. But in the warm glow of a few hundred thousand dollars, it’s easy to think that it’s all working out excellently. Incidentally, this is why AI open-source software engineers make such great targets. The fact that they’re open-source software engineers means that (a) a few hundred thousand dollars is enough to dazzle them 2 , and (b) their fans are technically-engaged enough to be able to figure out how to buy cryptocurrency. Working in AI also means that there’s a fresh pool of hype to draw from (the general hype around cryptocurrency being somewhat dry by now). On top of that, the open-source AI community is fairly small. Yegge mentions in his post that he wouldn’t have taken the offer seriously if Huntley hadn’t already accepted it. If you couldn’t tell, I think this whole thing is largely predatory. Bags seems to me to be offering crypto-airdrop-pump-and-dumps-as-a-service, where niche celebrities can turn their status as respected community figures into cold hard cash. The people who pay into this are either taken in by the pretense that they’re sponsoring open-source work (in a way orders of magnitude less efficient than just donating money directly), or by the hope that they’re going to win big when the coin goes “to the moon” (which effectively never happens). The celebrities will make a little bit of money, for their part in it, but the lion’s share of the reward will go to the actual grifters: the insiders who primed the coin and can sell off into the flood of community members who are convinced to buy. Bags even offers a “Did You Get Bagged? 💰🫵” section in their docs, encouraging the celebrity targets to share the coin, and framing the whole thing as coming from “your community”. This isn’t a dig - that amount of money would dazzle me too! I only mean that you wouldn’t be able to get Tom Cruise or MrBeast to promote your coin with that amount of money. Some crypto trader created a “$GAS” coin via Bags, configuring it to pay a portion of the trading fees to Steve Yegge (via his Twitter account) That trader, or others with the same idea, messaged Yegge on LinkedIn to tell him about his “earnings” ( currently $238,000), framing it as support for the Gas Town project Yegge took the free money and started posting about how exciting $GAS is as a way to fund open-source software creators Bags even offers a “Did You Get Bagged? 💰🫵” section in their docs, encouraging the celebrity targets to share the coin, and framing the whole thing as coming from “your community”. ↩ This isn’t a dig - that amount of money would dazzle me too! I only mean that you wouldn’t be able to get Tom Cruise or MrBeast to promote your coin with that amount of money. ↩

AI Crypto

0 views

Sean Goedecke 3 months ago

The Dictator's Handbook and the politics of technical competence

The Dictator’s Handbook is an ambitious book. In the introduction, its authors Bruce Bueno de Mesquita and Alastair Smith cast themselves as the successors to Sun Tzu and Niccolo Machiavelli: offering unsentimental advice to would-be successful leaders. Given that, I expected this book to be similar to The 48 Laws of Power , which did not impress me. Like many self-help books, The 48 Laws of Power is “empty calories”: a lot of fun to read, but not really useful or edifying 1 . However, The Dictator’s Handbook is a legitimate work of political science, serving as a popular introduction to an actual academic theory of government . Political science is very much not my field, so I’m reluctant to be convinced by (or comment on) the various concrete arguments in the book. I’m mainly interested in whether the book has anything to say about something I do know a little bit about: operating as an engineer inside a large tech company. Let’s first cover the key idea of The Dictator’s Handbook , which can be expressed in three points. Almost every feature of organizations can be explained by the ratio between the size of three groups: For instance, take an autocratic dictator. That dictator depends on a tiny group of people to maintain power: military generals, some powerful administrators, and so on. There is a larger group of people who could be in the inner circle but aren’t: for instance, other generals or administrators who are involved in government but aren’t fully trusted. Then there is the much, much larger group of all residents of the country, who are affected by the leader’s policies but have no ability to control them. This is an example of small-coalition government. Alternatively, take a democratic president. To maintain power, the president depends on every citizen who is willing to vote for them. There’s a larger group of people outside that core coalition: voters who aren’t supporters of the president, but could conceivably be persuaded. Finally, there’s the inhabitants of the country who do not vote: non-citizens, the very young, potentially felons, and so on. This is an example of large-coalition government. Mesquita and Smith argue that the structure of the government is downstream from the coalition sizes. If the coalition is small, it doesn’t matter whether the country is nominally a democracy, it will function like an autocratic dictatorship. Likewise, if the coalition is large, even a dictatorship will act in the best interests of its citizens (and will necessarily democratize). According to them, the structure of government does not change the size of the coalition. Rather, changes in the size of the coalition force changes in the structure of government. For instance, a democratic leader may want to shrink the size of their coalition to make it easier to hold onto power (e.g. by empowering state governors to unilaterally decide the outcome of their state’s elections). If successful, the government will thus become a small-coalition government, and will function more like a dictatorship (even if it’s still nominally democratic). Why are small-coalition governments more prone to autocracy or corruption? Because leaders stay in power by rewarding their coalitions, and if your coalition is a few tens or hundreds of people, you can best reward them by directly handing out cash or treasure, at the expense of everyone else. If your coalition is hundreds of thousands or millions of people (e.g. all the voters in a democracy), you can no longer directly assign rewards to individual people. Instead, it’s more efficient to fund public goods that benefit everybody. That’s why democracies tend to fund many more public goods than dictatorships. Leaders prefer small coalitions, because small coalitions are cheaper to keep happy. This is why dictators rule longer than democratically-elected leaders. Incidentally, it’s also why hegemonic countries like the USA have a practical interest in keeping uneasy allies ruled by dictators: because small-coalition dictatorships are easier to pay off. Leaders also want the set of “interchangeables” - remember, this is the set of people who could be part of the coalition but currently aren’t - to be as large as possible. That way they can easily replace unreliable coalition members. Of course, coalition members want the set of interchangeables to be as small as possible, to maximise their own leverage. What does any of this have to do with tech companies? The Dictator’s Handbook does reference a few tech companies specifically, but always in the context of boardroom disputes. In this framing, the CEO is the leader, and their coalition is the board who can either support them or fire them. I’m sure this is interesting - I’d love to read an account of the 2023 OpenAI boardroom wars from this perspective - but I don’t really know anything first-hand about how boards work, so I don’t want to speculate. It’s unclear how we might apply this theory so that it’s relevant to individual software engineers and the levels of management they might encounter in a large tech company. Directors and VPs are definitely leaders, but they’re not “leaders” in the sense meant in The Dictator’s Handbook . They don’t govern from the strength of their coalitions. Instead, they depend on the formal power they derive from the roles above them: you do what your boss says because they can fire you (or if they can’t, their boss certainly can). However, directors and VPs rarely make genuinely unilateral decisions. Typically they’ll consult with a small group of trusted subordinates, who they depend on for accurate information 3 and to actually execute projects. This sounds a lot like a coalition to me! Could we apply some of the lessons above to this kind of coalition? Let’s consider Mesquita and Smith’s point about the “interchangeables”. According to their theory, if you’re a member of the inner circle, it’s in your interest to be as irreplaceable as possible. You thus want to avoid bringing in other engineers or managers who could potentially fill your role. Meanwhile, your director or VP wants to have as many potential replacements available as possible, so each member of the inner circle’s bargaining power is lower - but they don’t want to bring them into the inner circle, since each extra person they need to rely on drains their political resources. This does not match my experience at all. Every time I’ve been part of a trusted group like this, I’ve been desperate to have a deeper bench. I have never once been in a position where I felt it was to my advantage to be the only person who could fill a particular role, for a few reasons: In other words, The Dictator’s Handbook style of backstabbing and political maneuvering is just not something I’ve observed at the level of software teams or products. Maybe it happens like this at the C-suite/VP or at the boardroom level - I wouldn’t know. But at the level I’m at, the success of individual projects determines your career success , so self-interested people tend to try and surround themselves with competent professionals who can make projects succeed, even if those people pose more of a political threat. I think the main difference here is that technical competence matters a lot in engineering organizations . I want a deep bench because it really matters to me whether projects succeed or fail, and having more technically competent people in the loop drastically increases the chances of success. Mesquita and Smith barely write about competence at all. From what I can tell, they assume that leaders don’t care about it, and assume that their administration will be competent enough (a very low bar) to stay in power, no matter what they do. For tech companies, technical competence is a critical currency for leaders . Leaders who can attract and retain technical competence to their organizations are able to complete projects and notch up easy political wins. Leaders who fail to do this must rely on “pure politics”: claiming credit, making glorious future promises, and so on. Of course, every leader has to do some amount of this. But it’s just easier to also have concrete accomplishments to point to as well. If I were tempted to criticize the political science here, this is probably where I’d start. I find it hard to believe that governments are that different from tech companies in this sense: surely competence makes a big difference to outcomes, and leaders are thus incentivized to keep competent people in their circle, even if that disrupts their coalition or incurs additional political costs 4 . Still, it’s possible to explain the desire for competence in a way that’s consistent with The Dictator’s Handbook . Suppose that competence isn’t more important in tech companies , but is more important for senior management . According to this view, the leader right at the top (the dictator, president, or CEO) doesn’t have the luxury to care about competence, and must focus entirely on solidifying their power base. But the leaders in the middle (the generals, VPs and directors) are obliged to actually get things done, and so need to worry a lot about keeping competent subordinates. Why would VPs be more obliged to get things done than CEOs? One reason might be that CEOs depend on a coalition of all board members (or even all company shareholders). This is a small coalition by The Dictator’s Handbook standards, but it’s still much larger than the VP’s coalition, which is a coalition of one: just their boss. CEOs have tangible ways to reward their coalition. But VPs can only really reward their coalition via accomplishing their boss’s goals, which necessarily requires competence. Mesquita and Smith aren’t particularly interested in mid-level politics. Their focus is on leaders and their direct coalitions. But for most of us who operate in the middle level, maybe the lesson is that coalition politics dominates at the top, but competence politics dominates in the middle. I enjoyed The Dictator’s Handbook , but most of what I took from it was speculation. There weren’t a lot of direct lessons I could draw from my own work politics 5 , and I don’t feel competent to judge the direct political science arguments. For instance, the book devotes a chapter to arguing against foreign aid, claiming roughly (a) that it props up unstable dictatorships by allowing them to reward their small-group coalitions, and (b) that it allows powerful countries to pressure small dictatorships into adopting foreign policies that are not in their citizens’ interest. Sure, that seems plausible! But I’m suspicious of plausible-sounding arguments in areas where I don’t have actual expertise. I could imagine a similarly-plausible argument in favor of foreign aid 6 . The book doesn’t talk about competence at all, but in my experience of navigating work politics, competence is the primary currency - it’s both the instrument and the object of many political battles. I can reconcile this by guessing that competence might matter more at the senior-management level than the very top level of politics , but I’m really just guessing. I don’t have the research background or the C-level experience to be confident about any of this. Still, I did like the core idea. No leader can lead alone, and that therefore the relationship between the ruler and their coalition dictates much of the structure of the organization. I think that’s broadly true about many different kinds of organization, including software companies. Maybe there are people out there who are applying Greene’s Machiavellian power tactics to their daily lives. If so, I hope I don’t meet them. “Organizations” here is understood very broadly: companies, nations, families, book clubs, and so on all fit the definition. I write about this a lot more in How I provide technical clarity to non-technical leaders In an email exchange, a reader suggested that companies face more competition than governments, because the cost of moving countries is much higher than the cost of switching products, which might make competence more important for companies. I think this is also pretty plausible. This is not a criticism of the book. After five years of studying philosophy, I’m convinced you can muster a plausible argument in favor of literally any position, with enough work. When explaining how organizations 2 behave, it is more useful to consider the motivations of individual people (say, the leader) than “the organization” as a whole Every leader must depend upon a coalition of insiders who help them maintain their position Almost every feature of organizations can be explained by the ratio between the size of three groups: The members of the coalition of insiders (i.e. the “inner circle”) The group who could conceivably become members of the coalition (the “outer circle”, or what the book calls the “interchangeables”) The entire population who is subject to the leader Management are suspicious of “irreplaceable” engineers and will actively work to undermine them, for a whole variety of reasons (the most palatable one is to reduce bus factor ) It’s just lonely to be in this position: you don’t really have peers to talk to, it’s hard to take leave, and so on. It feels much nicer to have potential backup Software teams succeed or fail together. Being the strongest engineer in a weak group means your projects will be rocky and you’ll have less successes to point to. But if you’re in a strong team, you’ll often acquire a good reputation just by association (so long as you’re not obviously dragging the side down) Maybe there are people out there who are applying Greene’s Machiavellian power tactics to their daily lives. If so, I hope I don’t meet them. ↩ “Organizations” here is understood very broadly: companies, nations, families, book clubs, and so on all fit the definition. ↩ I write about this a lot more in How I provide technical clarity to non-technical leaders ↩ In an email exchange, a reader suggested that companies face more competition than governments, because the cost of moving countries is much higher than the cost of switching products, which might make competence more important for companies. I think this is also pretty plausible. ↩ This is not a criticism of the book. ↩ After five years of studying philosophy, I’m convinced you can muster a plausible argument in favor of literally any position, with enough work. ↩

Science

Business

0 views

Sean Goedecke 3 months ago

2025 was an excellent year for this blog

In 2025, I published 141 posts, 33 of which made it to the front page of Hacker News or similar aggregators. I definitely wrote more in the first half of the year (an average of around 15 posts per month, down to around 8 in the second half), but overall I’m happy with my consistency. Here are some posts I’m really proud of: As it turns out, I was the third most popular blogger on Hacker News this year, behind the excellent Simon Willison and Jeff Geerling. I don’t put a lot of effort into appealing to Hacker News specifically, but I do think my natural style meshes well with the Hacker News commentariat (even if they’re often quite critical). I got hundreds of emails from readers this year (I went through Gmail and made it to 200 in the last three months of the year before I stopped counting). Getting email about my posts is one of the main reasons I write, so it was great to read people’s anecdotes and hear what they agreed or disagreed with. I also want to thank the people who wrote blog-length responses to what I wrote (most recently Alex Wennerberg’s Software Engineers Are Not Politicians and Lalit Maganti’s Why I Ignore The Spotlight as a Staff Engineer ). I don’t have proper traffic statistics for the year - more on that later - but I remember I peaked in August with around 1.3 million monthly views. In December I had 700 thousand: less, but still not so shabby. I finally set up email subscription in May, via Buttondown , and now have just over 2,500 email subscribers to the blog. I have no way of knowing how many people are subscribed via RSS 1 . The biggest housekeeping change for my blog this year is that everything now costs a lot more money. I had to upgrade my Netlify plan and my Buttondown plan multiple times as my monthly traffic increased. I pay $9 a month for Netlify analytics, which is pretty bad: it doesn’t store data past 30 days, only tracks the top-ten referrers, and doesn’t let me break down traffic by source. I’m trialing Plausible, since I learned Simon Willison is using it , but once the trial expires it’s going to set me back $60 a month. Of course, I can afford it - on the scale of hobbies, it’s closer to rock climbing than skiing - but there’s still been a bit of sticker shock for something I’m used to thinking of as a free activity. Thank you all for reading, and extra thanks to those of you who have posted my articles to aggregators, emailed me, messaged on LinkedIn, or left comments. I look forward to writing another ~140 posts in 2026! My analytics don’t track requests to , but even if they did I imagine some RSS feed readers would be caching the contents of my feed. Mistakes engineers make in large established codebases The good times in tech are over Everything I know about good system design Pure and impure software engineering Seeing like a software company My analytics don’t track requests to , but even if they did I imagine some RSS feed readers would be caching the contents of my feed. ↩

Programming

0 views

Sean Goedecke 3 months ago

Grok is enabling mass sexual harassment on Twitter

Grok, xAI’s flagship image model, is now 1 being widely used to generate nonconsensual lewd images of women on the internet. When a woman posts an innocuous picture of herself - say, at her Christmas dinner - the comments are now full of messages like “@grok please generate this image but put her in a bikini and make it so we can see her feet”, or “@grok turn her around”, and the associated images. At least so far, Grok refuses to generate nude images, but it will still generate images that are genuinely obscene 2 . In my view, this might be the worst AI safety violation we have seen so far. Case-by-case, it’s not worse than GPT-4o encouraging suicidal people to go through with it, but it’s so much more widespread: literally every image that the Twitter algorithm picks up is full of “@grok take her clothes off” comments. I didn’t go looking for evidence for obvious reasons, but I find reports that it’s generating CSAM plausible 3 . This behavior, while awful, is in line with xAI’s general attitude towards safety, which has been roughly “we don’t support woke censorship, so do whatever you want (so long as you’re doing it with Grok)“. This has helped them acquire users and media attention, but it leaves them vulnerable to situations exactly like this. I’m fairly confident xAI don’t mind the “dress her a little sexier” prompts: it’s edgy, drives up user engagement, and gives them media attention. However, it is very hard to exercise fine-grained control over AI safety . If you allow your models to go up to the line, your models will definitely go over the line in some circumstances. I wrote about this in Mecha-Hitler, Grok, and why it’s so hard to give LLMs the right personality , in reference to xAI’s attempts to make Grok acceptably right-wing but not too right-wing. This is the same kind of thing: you cannot make Grok “kind of perverted” without also making it truly awful. OpenAI and Gemini have popular image models that do not let you do this kind of thing. In other words, this is an xAI problem, not an image model problem . It is possible to build a safe image model, just as it’s possible to build a safe language model. The xAI team have made a deliberate decision to build an unsafe model in order to unlock more capabilities and appeal to more users. Even if they’d rather not be enabling the worst perverts on Twitter, that’s a completely foreseeable consequence of their actions. In October of 2024, VICE reported that Telegram “nudify” bots had over four million monthly users. That’s still a couple of orders of magnitude over Twitter’s monthly average users , but “one in a hundred” sounds like a plausible “what percentage of Twitter is using Grok like this” percentage anyway. Is it really that much worse that Grok now allows you to do softcore deepfakes? Yes, for two reasons. First, having to go and join a creepy Telegram group is a substantial barrier to entry . It’s much worse to have the capability built into a tool that regular people use every day. Second, generating deepfakes via Grok makes them public . Of course, it’s bad to do this stuff even privately, but I think it’s much worse to do it via Twitter. Tagging in Grok literally sends a push notification to your target saying “hey, I made some deepfake porn of you”, and then advertises that porn to everyone who was already following them. Yesterday xAI rushed out an update to rein this behavior in (likely a system prompt update, given the timing). I imagine they’re worried about the legal exposure, if nothing else. But this will happen again . It will probably happen again with Grok . Every AI lab has a big “USER ENGAGEMENT” dial where left is “always refuse every request” and right is “do whatever the user says, including generating illegal deepfake pornography”. The labs are incentivized to turn that dial as far to the right as possible. In my view, image model safety is a different topic from language model safety . Unsafe language models primarily harm the user (via sycophancy, for instance). Unsafe image models, as we’ve seen from Grok, can harm all kinds of people. I tend to think that unsafe language models should be available (perhaps not through ChatGPT dot com, but certainly for people who know what they’re doing). However, it seems really bad for everyone on the planet to have a “turn this image of a person into pornography” button. At minimum, I think it’d be sensible to pursue entities like xAI under existing CSAM or deepfake pornography laws , to set up a powerful counter-incentive for people with their hands on the “USER ENGAGEMENT” dial. I also think it’d be sensible for AI labs to strongly lock down “edit this image of a human” requests , even if that precludes some legitimate user activity. Earlier this year, in The case for regulating AI companions , I suggested regulating “AI girlfriend” products. I mistakenly thought AI companions or sycophancy might be the first case of genuine widespread harm caused by AI products, because of course nobody would ship an image model that allowed this kind of prompting. Turns out I was wrong. There were reports in May of this year of similar behavior, but it was less widespread and xAI jumped on it fairly quickly. Clever prompting by unethical fetishists can generate really degrading content (to the point where I’m uncomfortable going into more detail). I saw a few cases earlier this year of people trying this prompting tactic and Grok refusing them. It seems the latest version of Grok now allows this. Building a feature that lets you digitally undress 18-year-olds but not 17-year-olds is a really difficult technical problem, which is one of the many reasons to never do this . There were reports in May of this year of similar behavior, but it was less widespread and xAI jumped on it fairly quickly. ↩ Clever prompting by unethical fetishists can generate really degrading content (to the point where I’m uncomfortable going into more detail). I saw a few cases earlier this year of people trying this prompting tactic and Grok refusing them. It seems the latest version of Grok now allows this. ↩ Building a feature that lets you digitally undress 18-year-olds but not 17-year-olds is a really difficult technical problem, which is one of the many reasons to never do this . ↩

Security

0 views

Sean Goedecke 3 months ago

Nobody knows how large software products work

Large, rapidly-moving tech companies are constantly operating in the “fog of war” about their own systems. Simple questions like “can users of type Y access feature X?”, “what happens when you perform action Z in this situation?”, or even “how many different plans do we offer” often can only be answered by a handful of people in the organization. Sometimes there are zero people at the organization who can answer them, and somebody has to be tasked with digging in like a researcher to figure it out. How can this be? Shouldn’t the engineers who built the software know what it does? Aren’t these answers documented internally? Better yet, aren’t these questions trivially answerable by looking at the public-facing documentation for end users? Tech companies are full of well-paid people who know what they’re doing 1 . Why aren’t those people able to get clear on what their own product does? Large software products are prohibitively complicated . I wrote a lot more about this in Wicked Features , but the short version is you can capture a lot of value by adding complicated features. The classic examples are features that make the core product available to more users. For instance, the ability to self-host the software, or to trial it for free, or to use it as a large organization with centralized policy controls, or to use it localized in different languages, or to use it in countries with strict laws around how software can operate, or for highly-regulated customers like governments to use the software, and so on. These features are (hopefully) transparent to most users, but they cannot be transparent to the tech company itself . Why are these features complicated? Because they affect every single other feature you build . If you add organizations and policy controls, you must build a policy control for every new feature you add. If you localize your product, you must include translations for every new feature. And so on. Eventually you’re in a position where you’re trying to figure out whether a self-hosted enterprise customer in the EU is entitled to access a particular feature, and nobody knows - you have to go and read through the code or do some experimenting to figure it out. Couldn’t you just not build these features in the first place? Sure, but it leaves a lot of money on the table 2 . In fact, maybe the biggest difference between a small tech company and a big one is that the big tech company is set up to capture a lot more value by pursuing all of these fiddly, awkward features. Why can’t you just document the interactions once when you’re building each new feature? I think this could work in theory, with a lot of effort and top-down support, but in practice it’s just really hard. The core problem is that the system is rapidly changing as you try to document it . Even a single person can document a complex static system, given enough time, because they can just slowly work their way through it. But once the system starts changing, the people trying to document it now need to work faster than the rate of change in the system. It may be literally impossible to document it without implausible amounts of manpower. Worse, many behaviors of the system don’t necessarily have a lot of conscious intent behind them (or any). They just emerge from the way the system is set up, as interactions of a series of “default” choices. So the people working on the documentation are not just writing down choices made by engineers, they’re discovering how the system works for the first time . The only reliable way to answer many of these questions is to look at the codebase. I think that’s actually the structural cause of why engineers have institutional power at large tech companies. Of course, engineers are the ones who write software, but it’s almost more important that they’re the ones who can answer questions about software . In fact, the ability to answer questions about software is one of the core functions of an engineering team . The best understanding of a piece of software usually lives in the heads of the engineers who are working with it every day. If a codebase is owned by a healthy engineering team, you often don’t need anybody to go and investigate - you can simply ask the team as a whole, and at least one engineer will know the answer off the top of their head, because they’re already familiar with that part of the code. When tech companies reorg teams, they often destroy this tacit knowledge. If there’s no team with experience in a piece of software, questions have to be answered by investigation : some engineer has to go and find out. Typically this happens by some combination of interacting with the product (maybe in a dev environment where it’s easy to set up particular scenarios), reading through the codebase, or even performing “exploratory surgery” to see what happens when you change bits of code or force certain checks to always return true. This is a separate technical skill from writing code (though of course the two skills are related.) In my experience, most engineers can write software, but few can reliably answer questions about it. I don’t know why this should be so. Don’t you need to answer questions about software in order to write new software? Nevertheless, it’s true. My best theory is that it’s a confidence thing . Many engineers would rather be on the hook for their code (which at least works on their machine) than their answers (which could be completely wrong). I wrote about this in How I provide technical clarity to non-technical leaders . The core difficulty is that you’re always going out on a limb . You have to be comfortable with the possibility that you’re dead wrong, which is a different mindset to writing code (where you can often prove that your work is correct). You’re also able to be as verbose as you like when writing code - certainly when writing tests - but when you’re answering questions you have to boil things down to a summary. Many software engineers hate leaving out details. Non-technical people - at least, ones without a lot of experience working with software products - often believe that software systems are well-understood by the engineers who build them. The idea here is that the system should be understandable because it’s built line-by-line from (largely) deterministic components. However, while this may be true of small pieces of software, this is almost never true of large software systems . Large software systems are very poorly understood, even by the people most in a position to understand them. Even really basic questions about what the software does often require research to answer. And once you do have a solid answer, it may not be solid for long - each change to a codebase can introduce nuances and exceptions, so you’ve often got to go research the same question multiple times. Because of all this, the ability to accurately answer questions about large software systems is extremely valuable . I’m not being sarcastic here, I think this is literally true and if you disagree you’re being misled by your own cynicism. I first read this point from Dan Luu . I’m not being sarcastic here, I think this is literally true and if you disagree you’re being misled by your own cynicism. ↩ I first read this point from Dan Luu . ↩

DevOps

0 views

Sean Goedecke 4 months ago

AI detection tools cannot prove that text is AI-generated

The runaway success of generative AI has spawned a billion-dollar sub-industry of “AI detection tools”: tools that purport to tell you if a piece of text was written by a human being or generated by an AI tool like ChatGPT. How could that possibly work? I think these tools are both impressive and useful, and will likely get better. However, I am very worried about the general public overestimating how reliable they are. AI detection tools cannot prove that text is AI-generated. My initial reaction when I heard about these tools was “there’s no way that could ever work”. I think that initial reaction is broadly correct, because the core idea of AI detection tools - that there is an intrinsic difference between human-generated writing and AI-generated writing - is just fundamentally mistaken 0 . Large language models learn from huge training sets of human-written text. They learn to generate text that is as close as possible to the text in their training data. It’s this data that determines the basic “voice” of an AI model, not anything about the fact that it’s an AI model. A model trained on Shakespeare will sound like Shakespeare, and so on. You could train a thousand different models on a thousand different training sets without finding a common “model voice” or signature that all of them share. We can thus say (almost a priori ) that AI detection tools cannot prove that text is AI-generated. Anything generated by a language model is by definition the kind of thing that could have been generated by a human. But of course it’s possible to tell when something was written by AI! When I read Twitter replies, the obviously-LLM-generated ones stick out like a sore thumb. I wrote about this in Why does AI slop feel so bad to read? . How can this be possible, when it’s impossible to prove that something was written by AI? Part of the answer here might just be that current-generation AI models have a really annoying “house style”, and any humans writing in the same style are annoying in the same way . When I read the first sentence of a blog post and think “oh, this is AI slop, no need to keep reading”, I don’t actually care whether it’s AI or not. If it’s a human, they’re still writing in the style of AI slop, and I still don’t want to read the rest of the post. However, I think there’s more going on here. Claude does kind of sound like ChatGPT a lot of the time, even though they’re different models trained in different ways on (at least partially) different data. I think the optimistic case for AI detection tooling goes something like this: I find this fairly compelling, so long as you’re okay with a 90% success rate . A 90% success rate can be surprisingly bad if the base rate is low, as illustrated by the classic Bayes’ theorem example . If 10% of essays in a class are AI-written, and your detector is 90% accurate, then only half of the essays it flags will be truly AI-written. If an AI detection tool thinks a piece of writing is AI, you should treat that as “kind of suspicious” instead of conclusive proof. There are a few different approaches to building AI detection tools. The naive approach - which I couldn’t find any actual production examples of - would be to train a simple text classifier on a body of human-written and AI-written text. Apparently this doesn’t work particularly well. The Ghostbuster paper tried this and decided that it was easier to train a classifier on the logits themselves: they pass each candidate document through a bunch of simple LLMs, record how much each LLM “agreed” with the text, then train their classifier on that data. DNA-GPT takes an even simpler approach: they truncate a candidate document, regenerate the last half via frontier LLMs, and then compare that with the actual last half. The most impressive paper I’ve seen is the EditLens paper by Pangram Labs. EditLens trains a model on text that was edited by AI to various extents, not generated from scratch, so the model can learn to predict the granular degree of AI involvement in a particular text. This plausibly gets you a much better classifier than a strict “AI or not” classifier model, because each example teaches the model a numeric value instead of a single bit of information. One obvious point: all of these tools use AI themselves . There is simply no way to detect the presence of AI writing without either training your own model or running inference via existing frontier models. This is bad news for the most militantly anti-AI people, who would prefer not to use AI for any reason, even to catch other people using AI. It also means that - as I said earlier and will say again - AI detection tools cannot prove that text is AI-generated . Even the best detection tools can only say that it’s extremely likely. Interestingly, there’s a sub-sub-industry of “humanizing” tools that aim to convert your AI-generated text into text that will be judged by AI detection tools as “human”. Some free AI detection tools are actually sales funnels for these humanizing tools, and will thus deliberately produce a lot of false-positives so users will pay for the humanizing service. For instance, I ran one of my blog posts 1 through JustDone , which assessed them as 90% AI generated and offered to fix it up for the low, low price of $40 per month. These tools don’t say this outright, but of course the “humanizing” process involves passing your writing through a LLM that’s either prompted or fine-tuned to produce less-LLM-sounding content. I find this pretty ironic. There are probably a bunch of students who have been convinced by one of these tools to make their human-written essay LLM-generated, out of (justified) paranoia that a false-positive would get them in real trouble with their school or university. It is to almost everyone’s advantage to pretend that these tools are better than they are. The companies that make up the billion-dollar AI detection tool industry obviously want to pretend like they’re selling a perfectly reliable tool. University and school administrators want to pretend like they’ve got the problem under control. People on the internet like dunking on people by posting a screenshot that “proves” they’re copying their messages from ChatGPT. Even the AI labs themselves would like to pretend that AI detection is easy and reliable, since it would relieve them of some of the responsibility they bear for effectively wrecking the education system. OpenAI actually released their own AI detection tool in January 2023, before retiring it six months later due to “its low rate of accuracy”. The real people who suffer from this mirage are the people who are trying to write, but now have to deal with being mistakenly judged for passing AI writing off as their own. I know students who are second-guessing how they write in order to sound “less like AI”, or who are recording their keystrokes or taking photos of drafts in order to have some kind of evidence that they can use against false positives. If you are in a position where you’re required to judge if people are using AI to write their articles or essays, I would urge you to be realistic about the capabilities of AI detection tooling. They can make educated guesses about whether text was written by AI, but that’s all they are: educated guesses. That goes double if you’re using a detection tool that also offers a “humanizing” service, since those tools are incentivized to produce false positives. AI detection tools cannot prove that text is AI-generated. People sometimes talk about watermarking : when a provider like OpenAI deliberately trains their model to output text in some cryptographic way that leaves a very-hard-to-fake fingerprint. For instance, maybe it could always output text where the frequency of “e”s divided by the frequency of “l”s approximates pi. That would be very hard for humans to copy! I suspect there’s some kind of watermarking going on already (OpenAI models output weird space characters, which might trip up people naively copy-pasting their content) but I’m not going to talk about it in this post, because (a) sophisticated watermarking harms model capability so I don’t think anyone’s doing it, and (b) unsophisticated watermarking is easily avoided. I write every one of these posts with my own human fingers. RLHF and instruction/safety tuning pushes all strong LLMs towards the same kind of tone and style That tone and style can be automatically detected by training a classifier model Sure, it’s possible for technically-sophisticated users to use abliterated LLMs or less-safety-tuned open models, but 99% of users will just be using ChatGPT or Claude (particularly if they’re lazy enough to cheat on their essays in the first place) Thus a fairly simple “ChatGPT/Claude/Gemini prose style detector” can get you 90% of the way towards detecting most people using LLMs to write their essays People sometimes talk about watermarking : when a provider like OpenAI deliberately trains their model to output text in some cryptographic way that leaves a very-hard-to-fake fingerprint. For instance, maybe it could always output text where the frequency of “e”s divided by the frequency of “l”s approximates pi. That would be very hard for humans to copy! I suspect there’s some kind of watermarking going on already (OpenAI models output weird space characters, which might trip up people naively copy-pasting their content) but I’m not going to talk about it in this post, because (a) sophisticated watermarking harms model capability so I don’t think anyone’s doing it, and (b) unsophisticated watermarking is easily avoided. ↩ I write every one of these posts with my own human fingers. ↩

0 views

Sean Goedecke 4 months ago

How good engineers write bad code at big companies

Every couple of years somebody notices that large tech companies sometimes produce surprisingly sloppy code. If you haven’t worked at a big company, it might be hard to understand how this happens. Big tech companies pay well enough to attract many competent engineers. They move slowly enough that it looks like they’re able to take their time and do solid work. How does bad code happen? I think the main reason is that big companies are full of engineers working outside their area of expertise . The average big tech employee stays for only a year or two 1 . In fact, big tech compensation packages are typically designed to put a four-year cap on engineer tenure: after four years, the initial share grant is fully vested, causing engineers to take what can be a 50% pay cut. Companies do extend temporary yearly refreshes, but it obviously incentivizes engineers to go find another job where they don’t have to wonder if they’re going to get the other half of their compensation each year. If you count internal mobility, it’s even worse. The longest I have ever stayed on a single team or codebase was three years, near the start of my career. I expect to be re-orged at least every year, and often much more frequently. However, the average tenure of a codebase in a big tech company is a lot longer than that. Many of the services I work on are a decade old or more, and have had many, many different owners over the years. That means many big tech engineers are constantly “figuring it out”. A pretty high percentage of code changes are made by “beginners”: people who have onboarded to the company, the codebase, or even the programming language in the past six months. To some extent, this problem is mitigated by “old hands”: engineers who happen to have been in the orbit of a particular system for long enough to develop real expertise. These engineers can give deep code reviews and reliably catch obvious problems. But relying on “old hands” has two problems. First, this process is entirely informal . Big tech companies make surprisingly little effort to develop long-term expertise in individual systems, and once they’ve got it they seem to barely care at all about retaining it. Often the engineers in question are moved to different services, and have to either keep up their “old hand” duties on an effectively volunteer basis, or abandon them and become a relative beginner on a brand new system. Second, experienced engineers are always overloaded . It is a busy job being one of the few engineers who has deep expertise on a particular service. You don’t have enough time to personally review every software change, or to be actively involved in every decision-making process. Remember that you also have your own work to do : if you spend all your time reviewing changes and being involved in discussions, you’ll likely be punished by the company for not having enough individual output. Putting all this together, what does the median productive 2 engineer at a big tech company look like? They are usually: They are almost certainly working to a deadline, or to a series of overlapping deadlines for different projects. In other words, they are trying to do their best in an environment that is not set up to produce quality code. That’s how “obviously” bad code happens. For instance, a junior engineer picks up a ticket for an annoying bug in a codebase they’re barely familiar with. They spend a few days figuring it out and come up with a hacky solution. One of the more senior “old hands” (if they’re lucky) glances over it in a spare half-hour, vetoes it, and suggests something slightly better that would at least work. The junior engineer implements that as best they can, tests that it works, it gets briefly reviewed and shipped, and everyone involved immediately moves on to higher-priority work. Five years later somebody notices this 3 and thinks “wow, that’s hacky - how did such bad code get written at such a big software company”? I have written a lot about the internal tech company dynamics that contribute to this. Most directly, in Seeing like a software company I argue that big tech companies consistently prioritize internal legibility - the ability to see at a glance who’s working on what and to change it at will - over productivity. Big companies know that treating engineers as fungible and moving them around destroys their ability to develop long-term expertise in a single codebase. That’s a deliberate tradeoff. They’re giving up some amount of expertise and software quality in order to gain the ability to rapidly deploy skilled engineers onto whatever the problem-of-the-month is. I don’t know if this is a good idea or a bad idea. It certainly seems to be working for the big tech companies, particularly now that “how fast can you pivot to something AI-related” is so important. But if you’re doing this, then of course you’re going to produce some genuinely bad code. That’s what happens when you ask engineers to rush out work on systems they’re unfamiliar with. Individual engineers are entirely powerless to alter this dynamic . This is particularly true in 2025, when the balance of power has tilted away from engineers and towards tech company leadership. The most you can do as an individual engineer is to try and become an “old hand”: to develop expertise in at least one area, and to use it to block the worst changes and steer people towards at least minimally-sensible technical decisions. But even that is often swimming against the current of the organization, and if inexpertly done can cause you to get PIP-ed or worse. I think a lot of this comes down to the distinction between pure and impure software engineering . To pure engineers - engineers working on self-contained technical projects, like a programming language - the only explanation for bad code is incompetence. But impure engineers operate more like plumbers or electricians. They’re working to deadlines on projects that are relatively new to them, and even if their technical fundamentals are impeccable, there’s always something about the particular setup of this situation that’s awkward or surprising. To impure engineers, bad code is inevitable. As long as the overall system works well enough, the project is a success. At big tech companies, engineers don’t get to decide if they’re working on pure or impure engineering work. It’s not their codebase ! If the company wants to move you from working on database infrastructure to building the new payments system, they’re fully entitled to do that. The fact that you might make some mistakes in an unfamiliar system - or that your old colleagues on the database infra team might suffer without your expertise - is a deliberate tradeoff being made by the company, not the engineer . It’s fine to point out examples of bad code at big companies. If nothing else, it can be an effective way to get those specific examples fixed, since execs usually jump at the chance to turn bad PR into good PR. But I think it’s a mistake 4 to attribute primary responsibility to the engineers at those companies. If you could wave a magic wand and make every engineer twice as strong, you would still have bad code , because almost nobody can come into a brand new codebase and quickly make changes with zero mistakes. The root cause is that most big company engineers are forced to do most of their work in unfamiliar codebases . I struggled to find a good original source on this. There’s a 2013 PayScale report citing a 1.1 year median turnover at Google, which seems low. Many engineers at big tech companies are not productive, but that’s a post all to itself. I don’t want to get into it here for two reasons. First, I think competent engineers produce enough bad code that it’s fine to be a bit generous and just scope the discussion to them. Second, even if an incompetent engineer wrote the code, there’s almost always competent engineers who could have reviewed it, and the question of why that didn’t happen is still interesting. The example I’m thinking of here is not the recent GitHub Actions one , which I have no first-hand experience of. I can think of at least ten separate instances of this happening to me. In my view, mainly a failure of imagination : thinking that your own work environment must be pretty similar to everyone else’s. competent enough to pass the hiring bar and be able to do the work, but either working on a codebase or language that is largely new to them, or trying to stay on top of a flood of code changes while also juggling their own work. I struggled to find a good original source on this. There’s a 2013 PayScale report citing a 1.1 year median turnover at Google, which seems low. ↩ Many engineers at big tech companies are not productive, but that’s a post all to itself. I don’t want to get into it here for two reasons. First, I think competent engineers produce enough bad code that it’s fine to be a bit generous and just scope the discussion to them. Second, even if an incompetent engineer wrote the code, there’s almost always competent engineers who could have reviewed it, and the question of why that didn’t happen is still interesting. ↩ The example I’m thinking of here is not the recent GitHub Actions one , which I have no first-hand experience of. I can think of at least ten separate instances of this happening to me. ↩ In my view, mainly a failure of imagination : thinking that your own work environment must be pretty similar to everyone else’s. ↩

Career

Programming

3 views

Sean Goedecke 4 months ago

Becoming unblockable

With enough careful effort, it’s possible to become unblockable. In other words, you can put yourself in a position where you’re always able to make forward progress on your goals. I wrote about this six months ago in Why strong engineers are rarely blocked , but I wanted to take another crack at it and give some more concrete advice. The easiest way to avoid being blocked is to have more than one task on the go. Like a CPU thread, if you’re responsible for multiple streams of work, you can deal with one stream getting blocked by rolling onto another one. While one project might be blocked, you are not: you can continue getting stuff done. Because of this, I almost always have more than one task on my plate. However, there’s a lot of nuance involved in doing this correctly. The worst thing you can do is to be responsible for two urgent tasks at the same time - no matter how hard you work, one of them will always be making no progress, which is very bad 1 . If you’ve got too many ongoing tasks at the same time, you also risk overloading yourself if one or two of them suddenly blow out. It’s famously hard to scope engineering work. In a single day, you can go from having two or three trivial tasks to having three big jobs at the same time. I do not recommend just mindlessly picking up an extra ticket from your project board. Instead, try to have some non-project work floating around: refactors, performance work, writing performance reviews, mandatory training, and so on. It can be okay to pick up an extra ticket if you’re tactical about which ticket you pick up. Try to avoid having two important tasks on the go at the same time. Plan out projects from the start to minimize blockers. This section is more relevant for projects that you yourself are running, but the principle holds even for smaller pieces of work. If you think something is likely to get blocked (for instance, maybe database migrations at your company are run by a dedicated team with a large backlog), do it as early as possible . That way you can proceed with the rest of the project while you wait. Getting this wrong can add weeks to a project. Likewise, if there’s a part of your project that’s likely to be controversial, do it early so you can keep working on the rest of the project while the debate rages on. Do whatever it takes to have a stable and reliable developer environment. I don’t think it’s possible to overstate the importance of this. The stability of your developer environment directly determines how much of a workday you can spend actually doing work. For instance, use as normal a developer stack as possible . At GitHub, most development is done in Codespaces , a platform for server-hosted dev containers. You can connect to a codespace with almost any IDE, but the majority of people use VSCode, so I use VSCode , with as few plugins as possible 2 . I think a lot of developers are too focused on their personal “top speed” with their developer environment when everything is working great, and under-emphasize how much time they spend tweaking config, patching dotfiles, and troubleshooting in general. Fix developer environment problems as quickly as production incidents. If you can’t run tests or run a local server, don’t half-ass the troubleshooting process - focus on it until it’s fixed. On the flip side, don’t treat it as a leisurely learning experience (say, about how MacOS handles Dockerized networking). In many circumstances you’re probably better off tearing down and re-creating everything than digging in and trying to patch the specific issue. If your developer environment really is irreparably broken - maybe you’re waiting on new hardware, or you’re making a one-off change to a service that you don’t have the right dev environment permissions for - be scrappy about finding alternatives . If you can’t run tests, your GitHub CI probably can. If you can’t run a server locally, can you deploy to a staging environment and test there? Be careful about doing this in your main developer environment. You’re usually better off spending the time to actually fix the problem. But when you can’t, you should be creative about how you can keep working instead of just giving up. I see a lot of engineers run into a weird thing - commonly a 403 or 400 status code from some other service - and say “oh, I’m blocked, I need this other service’s owners to investigate”. You can and should investigate yourself. This is particularly true if you’ve got access to the codebase. If you’re getting an error, go and search their codebase to see what could be causing the error. Find the logs for your request to see if there’s anything relevant there. Of course, you won’t be able to dig as deep as engineers with real domain expertise, but often it doesn’t take domain expertise to solve your particular problem. There’s even less excuse not to do this now that AI agents are ubiquitous. Point Codex (or Copilot agent mode, or Claude Code, or whatever you have access to) at the codebase in question and ask “why might I be seeing this error with this specific request?” In my experience, you get the correct answer about a third of the time, which is amazing . Instead of waiting for hours or days to get help, you can spend ten minutes waiting for the agent and half an hour checking its work. Even if you can’t solve the problem yourself, a bit of research can often make your request for help much more compelling . As a service owner, there’s nothing more dispiriting than getting a “help, I get weird 400 errors” message - you know you’re going to spend a lot of time trawling through the logs before you can even figure out what the problem is, let alone how to reproduce it. But if the message already contains a link to the logs, or the text of a specific error, that immediately tells you where to start looking. There are typically two ways to do anything in a large tech company: the formal, legible way, and the informal way. As an example, it’s common to have a “ask for code review” Slack channel, which is full of engineers posting their changes. But many engineers don’t use these channels at all. Instead, they ping each other for immediate reviews, which is a much faster process. Of course, you can’t just DM random engineers asking for them to review your PR. It might work in the short term, but people will get really annoyed with you. You have to build relationships with engineers on every codebase you’d like to work on. If you’re extremely charismatic, maybe you can accomplish this with sheer force of will. But the rest of us have to build relationships by being useful: giving prompt and clear responses to questions from other teams, investigating bugs for them, reviewing their code, and so on. The most effective engineers at are tech company typically have really strong relationships with engineers on many other different teams. That isn’t to say that they operate entirely through backchannels, just that they have personal connections they can draw on when needed. If you’re blocked on work that another team is doing, it makes a huge difference having “someone on the inside”. Almost all blockers at large tech companies can be destroyed with sufficient “air support”. Typically this means a director or VP who’s aware of your project and is willing to throw their weight around to unblock you. For instance, they might message the database team’s manager saying “hey, can you prioritize this migration”, or task their very-senior-engineer direct report with resolving some technical debate that’s delaying your work. You can’t get air support for everything you’d like to do - it just doesn’t work like that, unless the company is very dysfunctional or you have a very good relationship with a senior manager. But you can choose to do things that align with what senior managers in the organizaton want, which can put you in a position to request support and get it. I wrote about this a lot more in How I influence tech company politics as a staff software engineer , but in one sentence: the trick is to have a bunch of possible project ideas in your back pocket, and then choose the ones that align with whatever your company cares about this month. Many engineers just don’t make use of the powerful allies they have. If you’re working on a high-priority project, the executive in charge of that project is unlikely to have the bandwidth to follow your work closely. They will be depending on you to go and tell them if you’re blocked and need their help. Unlike the relationships you may have with engineers on different teams, requesting air cover does not spend any credit. In fact, it often builds it, by showing that you’re switched-on enough to want to be unblocked, and savvy enough to know you can ask for their help. Senior managers are usually quite happy to go and unblock you, if you’re clear enough about what exactly you need them to do. To minimize the amount of time you spend blocked, I recommend: At some point somebody important will ask “why isn’t this task making any progress”, and you do not want the answer to be “I was working on something else”. Before I joined GitHub, I worked entirely inside a terminal and neovim. I switched to VSCode entirely because of Codespaces. If I joined another company where most developers used JetBrains, I would immediately switch to JetBrains. Working on at least two things at a time, so when one gets blocked you can switch to the other Sequencing your work so potential blockers are discovered and started early Making a reliable developer environment a high priority, including avoiding unusual developer tooling Being willing to debug into other services that you don’t own Building relationships with engineers on other teams Making use of very senior managers to unblock you, when necessary At some point somebody important will ask “why isn’t this task making any progress”, and you do not want the answer to be “I was working on something else”. ↩ Before I joined GitHub, I worked entirely inside a terminal and neovim. I switched to VSCode entirely because of Codespaces. If I joined another company where most developers used JetBrains, I would immediately switch to JetBrains. ↩

Career

0 views

Sean Goedecke 4 months ago

Why it takes months to tell if new AI models are good

Nobody knows how to tell if current-generation models are any good . When GPT-5 launched, the overall mood was very negative, and the consensus was that it wasn’t a strong model. But three months later it turns out that GPT-5 (and its derivative GPT-5-Codex) is a very strong model for agentic work 1 : enough to break Anthropic’s monopoly on agentic coding models. In fact, GPT-5-Codex is my preferred model for agentic coding. It’s slower than Claude Sonnet 4.5, but in my experience it gets more hard problems correct. Why did it take months for me to figure this out? The textbook solution for this problem is evals - datasets of test cases that models can be scored against - but evals are largely unreliable . Many models score very well on evals but turn out to be useless in practice. There are a couple of reasons for this. First, it’s just really hard to write useful evals for real-world problems , since real-world problems require an enormous amount of context. Can’t you take previous real-world problems and put them in your evals - for instance, by testing models on already-solved open-source issues? You can, but you run into two difficulties: Another problem is that evals are a target for AI companies . How well Anthropic or OpenAI’s new models perform on evals has a direct effect on the stock price of those companies. It’d be naive to think that they don’t make some kind of effort to do well on evals: if not by directly training on public eval data 2 , then by training on data that’s close enough to eval data to produce strong results. I’m fairly confident that big AI companies will not release a model unless they can point to a set of evals that their model does better than competitors. So you can’t trust that strong evals will mean a strong model, because every single new model is released with strong evals. If you can’t rely on evals to tell you if a new model is good, what can you rely on? For most people, the answer is the “vibe check”: interacting with the model themselves and making their own judgement. Often people use a set of their own pet questions, which are typically questions that other LLMs get wrong (say, word puzzles). Trick questions can be useful, but plenty of strong models struggle with specific trick questions for some reason. My sense is also that current models are too strong for obvious word puzzles. You used to be able to trip up models with straightforward questions like “If I put a ball in a box, then put the box in my pocket, where is the ball?” Now you have to be more devious, which gives less signal about how strong the model is. Sometimes people use artistic prompts. Simon Willison famously asks new models to produce a SVG of a pelican riding a bicycle. It’s now a common Twitter practice to post side-by-side “I asked two models to build an object in Minecraft” screenshots. This is cool - you can see at a glance that bigger models produce better images - but at some point it becomes difficult to draw conclusions from the images. If Claude Sonnet 4.5 puts the pelican’s feet on the pedals correctly, but GPT-5.1 adds spokes to the wheels, which model is better? Finally, many people rely on pure vibes: the intangible sense you get after using a model about whether it’s good or not. This is sometimes described as “big model smell”. I am fairly agnostic about people’s ability to determine model capability from vibes alone. It seems like something humans might be able to do, but also like something that would be very easy to fool yourself about. For instance, I would struggle to judge a model with the conversational style of GPT-4o as very smart, but there’s nothing in principle that would prevent that. Of course, for people who engage in intellectually challenging pursuits, there’s an easy (if slow) way to evaluate model capability: just give it the problems you’re grappling with and see how it does. I often ask a strong agentic coding model to do a task I’m working on in parallel with my own efforts. If the model fails, it doesn’t slow me down much; if it succeeds, it catches something I don’t, or at least gives me a useful second opinion. The problem with this approach is that it takes a fair amount of time and effort to judge if a new model is any good, because you have to actually do the work : if you’re not engaging with the problem yourself, you will have no idea if the model’s solution is any good or not. So testing out a new model can be risky. If it’s no good, you’ve wasted a fair amount of time and effort! I’m currently trying to decide whether to invest this effort into testing out Gemini 3 Pro or GPT-5.1-Codex - right now I’m still using GPT-5-Codex for most tasks, or Claude Sonnet 4.5 on some simpler problems. Each new model release reignites the debate over whether AI progress is stagnating. The most prominent example is Gary Marcus, who has written that GPT-4 , GPT-4o , Claude 3.5 Sonnet , GPT-5 and DeepSeek all prove that AI progress has hit a wall. But almost everyone who writes about AI seems to be interested in the topic. Each new model launch is watched to see if this is the end of the bubble, or if LLMs will continue to get more capable. The reason this debate never ends is that there’s no reliable way to tell if an AI model is good . Suppose that base AI models were getting linearly smarter (i.e. that GPT-5 really was as far above GPT-4 as GPT-4 was above GPT-3.5, and so on). Would we actually be able to tell? When you’re talking to someone who’s less smart than you 3 , it’s very clear. You can see them failing to follow points you’re making, or they just straight up spend time visibly confused and contradicting themselves. But when you’re talking to someone smarter than you, it’s far from clear (to you) what’s going on. You can sometimes feel that you’re confused by what they say, but that doesn’t necessarily mean they’re smarter. It could be that they’re just talking nonsense. And smarter people won’t confuse you all the time - only when they fail to pitch their communication at your level. Talking with AI models is like that. GPT-3.5 was very clearly less smart than most of the humans who talked to it. It was mainly impressive that it was able to carry on a conversation at all. GPT-4 was probably on par with the average human (or a little better) in its strongest domains. GPT-5 (at least in thinking mode) is smarter than the average human across most domains, I believe. Suppose we had no objective way of measuring chess ability. Would I be able to tell if computer chess engines were continuing to get better? I’d certainly be impressed when the chess engines went from laughably bad to beating me every time. But I’m not particularly good at chess. I would lose to chess engines from the early 1980s . It would thus seem to me as if chess engine progress had stalled out, when in fact modern chess engines have double the rating of chess engines from the 1980s. I acknowledge that “the model is now at least partly smarter than you” is an underwhelming explanation for why AI models don’t appear to be rapidly getting better. It’s easy to point to cases where even strong models fall over. But it’s worth pointing out that if models were getting consistently smarter, this is what it would look like : rapid subjective improvement as the models go from less intelligent than you to on par with you, and then an immediate plateau as the models surpass you and you become unable to tell how smart they are. By “agentic work” I mean “LLM with tools that runs in a loop”, like Copilot Agent Mode, Claude Code, and Codex. I haven’t yet tried GPT-5.1-Codex enough to have a strong opinion. If you train a model on the actual eval dataset itself, it will get very good at answering those specific questions, even if it’s not good at answering those kinds of questions. This is often called “benchmaxxing”: prioritizing evals and benchmarks over actual capability. I want to bracket the question of whether “smart” is a broad category, or how exactly to define it. I’m talking specifically about the way GPT-4 is smarter than GPT-3.5 - even if we can’t define exactly how, we know that’s a real thing. Open-source coding is often meaningfully different from the majority of programming work. For more on this, see my comments in METR’S AI productivity study is really good , where I discuss an AI-productivity study that was done on open-source codebases. You’re still only covering a tiny subset of all programming work. For instance, the well-known SWE-Bench set of coding evals are just in Python. A model might be really good at Python but struggle with other languages. Nobody knows how good a model is when it’s launched. Even the AI lab who built it are only guessing and hoping it’ll turn out to be effective for real-world use cases. Evals are mostly marketing tools. It’s hard to figure out how good the eval is, or if the model is being “taught to the test”. If you’re trying to judge models from their public evals you’re fighting against the billions of dollars of effort going into gaming the system. Vibe checks don’t test the kind of skills that are useful for real work, but testing a model by using it to do real work takes a lot of time. You can’t figure out if a brand new model is good that way. Because of all this, it’s very hard to tell if AI progress is stagnating or not. Are the models getting better? Are they any good right now? Compounding that problem, it’s hard to judge between two models that are both smarter than you (in a particular domain). If the models do keep getting better, we might expect it to feel like they’re plateauing, because once they get better than us we’ll stop seeing evidence of improvement. By “agentic work” I mean “LLM with tools that runs in a loop”, like Copilot Agent Mode, Claude Code, and Codex. I haven’t yet tried GPT-5.1-Codex enough to have a strong opinion. ↩ If you train a model on the actual eval dataset itself, it will get very good at answering those specific questions, even if it’s not good at answering those kinds of questions. This is often called “benchmaxxing”: prioritizing evals and benchmarks over actual capability. ↩ I want to bracket the question of whether “smart” is a broad category, or how exactly to define it. I’m talking specifically about the way GPT-4 is smarter than GPT-3.5 - even if we can’t define exactly how, we know that’s a real thing. ↩

Python

Testing

0 views

Sean Goedecke 5 months ago

Only three kinds of AI products actually work

The very first LLM-based product, ChatGPT, was just 1 the ability to talk with the model itself: in other words, a pure chatbot. This is still the most popular LLM product by a large margin. In fact, given the amount of money that’s been invested in the industry, it’s shocking how many “new AI products” are just chatbots. As far as I can tell, there are only three types of AI product that currently work . For the first couple of years of the AI boom, all LLM products were chatbots. They were branded in a lot of different ways - maybe the LLM knew about your emails, or a company’s helpdesk articles - but the fundamental product was just the ability to talk in natural language to an LLM. The problem with chatbots is that the best chatbot product is the model itself . Most of the reason users want to talk with an LLM is generic: they want to ask questions, or get advice, or confess their sins, or do any one of a hundred things that have nothing to do with your particular product. In other words, your users will just use ChatGPT 2 . AI labs have two decisive advantages over you: first, they will always have access to the most cutting-edge models before you do; and second, they can develop their chatbot harness simultaneously with the model itself (like how Anthropic specifically trains their models to be used in Claude Code, or OpenAI trains their models to be used in Codex). One way your chatbot product can beat ChatGPT is by doing what OpenAI won’t do: for instance, happily roleplaying an AI boyfriend or generating pornography. There is currently a very lucrative niche of products like this, which typically rely on less-capable but less-restrictive open-source models. These products have the problems I discussed above. But it doesn’t matter that their chatbots are less capable than ChatGPT or Claude: if you’re in the market for sexually explicit AI roleplay, and ChatGPT and Claude won’t do it, you’re going to take what you can get. I think there are serious ethical problems with this kind of product. But even practically speaking, this is a segment of the industry likely to be eaten alive by the big AI labs, as they become more comfortable pushing the boundaries of adult content. Grok Companions is already going down this pathway, and Sam Altman has said that OpenAI models will be more open to generating adult content in the future. There’s a slight variant on chatbots which gives the model tools : so instead of just chatting with your calendar, you can ask the chatbot to book meetings, and so on. This kind of product is usually called an “AI assistant”. This doesn’t work well because savvy users can manipulate the chatbot into calling tools . So you can never give a support chatbot real support powers like “refund this customer”, because the moment you do, thousands of people will immediately find the right way to jailbreak your chatbot into giving them money. You can only give your chatbots tools that the user could do themselves - in which case, your chatbot is competing with the usability of your actual product, and will likely lose. Why will your chatbot lose? Because chat is not a good user interface . Users simply do not want to type out “hey, can you increase the font size for me” when they could simply hit “ctrl-plus” or click a single button 3 . I think this is a hard lesson for engineers to learn. It’s tempting to believe that since chatbots have gotten 100x better, they must now be the best user interface for many tasks. Unfortunately, they started out 200x worse than a regular user interface, so they’re still twice as bad. The second real AI product actually came out before ChatGPT did: GitHub Copilot. The idea behind the original Copilot product (and all its imitators, like Cursor Tab) is that a fast LLM can act as a smart autocomplete. By feeding the model the code you’re typing as you type it, a code editor can suggest autocompletions that actually write the rest of the function (or file) for you. The genius of this kind of product is that users never have to talk to the model . Like I said above, chat is a bad user interface. LLM-generated completions allow users to access the power of AI models without having to change any part of their current workflow: they simply see the kind of autocomplete suggestions their editor was already giving them, but far more powerful. I’m a little surprised that completions-based products haven’t taken off outside coding (where they immediately generated a multi-billion-dollar market). Google Docs and Microsoft Word both have something like this. Why isn’t there more hype around this? The third real AI product is the coding agent. People have been talking about this for years, but it was only really in 2025 that the technology behind coding agents became feasible (with Claude Sonnet 3.7, and later GPT-5-Codex). Agents are kind of like chatbots, in that users interact with them by typing natural language text. But they’re unlike chatbots in that you only have to do that once : the model takes your initial request and goes away to implement and test it all by itself. The reason agents work and chatbots-with-tools don’t is the difference between asking an LLM to hit a single button for you and asking the LLM to hit a hundred buttons in a specific order. Even though each individual action would be easier for a human to perform, agentic LLMs are now smart enough to take over the entire process. Coding agents are a natural fit for AI agents for two reasons: For my money, the current multi-billion-dollar question is can AI agents be useful for tasks other than coding? Bear in mind that Claude Sonnet 3.5 was released just under nine months ago . In that time, the tech industry has successfully built agentic products about their own work. They’re just starting to build agentic products for other tasks. It remains to be seen how successful that will be, or what those products will look like. There’s another kind of agent that isn’t about coding: the research agent. LLMs are particularly good at tasks like “skim through ten pages of search results” or “keyword search this giant dataset for any information on a particular topic”. I use this functionality a lot for all kinds of things. There are a few examples of AI products built on this capability, like Perplexity . In the big AI labs, this has been absorbed into the chatbot products: OpenAI’s “deep research” went from a separate feature to just what GPT-5-Thinking does automatically, for instance. I think there’s almost certainly potential here for area-specific research agents (e.g. in medicine or law). If agents are the most recent successful AI product, AI-generated feeds might be the one just over the horizon. AI labs are currently experimenting with ways of producing infinite feeds of personalized content to their users: So far none of these have taken off. But scrolling feeds has become the primary way users interact with technology in general , so the potential here is massive. It does not seem unlikely to me at all that in five years time most internet users will spend a big part of their day scrolling an AI-generated feed. Like a completions-based product, the advantage of a feed is that users don’t have to interact with a chatbot. The inputs to the model come from how the user interacts with the feed (likes, scrolling speed, time spent looking at an item, and so on). Users can experience the benefits of an LLM-generated feed (if any) without having to change their consumption habits at all. The technology behind current human-generated infinite feeds is already a mature application of state of the art machine learning. When you interact with Twitter or LinkedIn, you’re interacting with a model, except instead of generating text it’s generating lists of other people’s posts. In other words, feeds already maintain a sophisticated embedding of your personal likes and dislikes . The step from “use that embedding to surface relevant content” to “use that embedding to generate relevant content” might be very short indeed. I’m pretty suspicious of AI-generated infinite feeds of generated video, but I do think other kinds of infinite feeds are an under-explored kind of product. In fact, I built a feed-based hobby project of my own, called Autodeck 4 . The idea was to use an AI-generated feed to generate spaced repetition cards for learning. It works pretty well! It still gets a reasonable amount of use from people who’ve found it via my blog (also, from myself and my partner). One other kind of AI-generated product that people have been talking about for years is the AI-based video game. The most speculative efforts in this direction have been full world simulations like DeepMind’s Genie , but people have also explored using AI to generate a subset of game content, such as pure-text games like AI Dungeon or this Skyrim mod which adds AI-generated dialogue. Many more game developers have incorporated AI art or audio assets into their games. Could there be a transformative product that incorporates LLMs into video games? I don’t think ARC Raiders counts as an “AI product” just because it uses AI voice lines, and the more ambitious projects haven’t yet really taken off. Why not? One reason could be that games just take a really long time to develop . When Stardew Valley took the world by storm in 2016, I expected a flood of copycat cozy pixel-art farming games, but that only really started happening in 2018 and 2019. That’s how long it takes to make a game! So even if someone has a really good idea for an LLM-based video game, we’re probably still a year or two out from it being released. Another reason is that many gamers really don’t like AI . Including generative AI in your game is a guaranteed controversy (though it doesn’t seem to be fatal, as the success of ARC Raiders shows). I wouldn’t be surprised if some game developers simply don’t think it’s worth the risk to try an AI-based game idea 5 . A third reason could be that generated content is just not a good fit for gaming . Certainly ChatGPT-like dialogue sticks out like a sore thumb in most video games. AI chatbots are also pretty bad at challenging the user: their post-training is all working to make them try to satisfy the user immediately 6 . Still, I don’t think this is an insurmountable technical problem. You could simply post-train a language model in a different direction (though perhaps the necessary resources for that haven’t yet been made available to gaming companies). By my count, there are three successful types of language model product: On top of that, there are two kinds of LLM-based product that don’t work yet but may soon: Almost all AI products are just chatbots (e.g. AI-powered customer support). These suffer from having to compete with ChatGPT, which is a superior general product, and not being able to use powerful tools, because users will be able to easily jailbreak the model. Agentic products are new, and have been wildly successful for coding . It remains to be seen what they’ll look like in other domains, but we’ll almost certainly see domain-specific research agents in fields like law. Research agents in coding have seen some success as well (e.g. code review or automated security scanning products). Infinite AI-generated feeds haven’t yet been successful, but hundreds of millions of dollars are currently being poured into them. Will OpenAI’s Sora be a real competitor to Twitter or Instagram, or will those companies release their own AI-generated feed product? AI-generated games sound like they could be a good idea, but there’s still no clear working strategy for how to incorporate LLMs into a video game. Pure world models - where the entire game is generated frame-by-frame - are cool demos but a long way from being products. One other thing I haven’t mentioned is image generation. Is this part of a chatbot product, or a tool in itself? Frankly, I think AI image generation is still more of a toy than a product, but it’s certainly seeing a ton of use. There’s probably some fertile ground for products here, if they can successfully differentiate themselves from the built-in image generation in ChatGPT. In general, it feels like the early days of the internet. LLMs have so much potential, but we’re still mostly building copies of the same thing. There have to be some really simple product ideas that we’ll look back on and think “that’s so obvious, I wonder why they didn’t do it immediately”. Of course, “just” here covers a raft of progress in training stronger models, and real innovations around RLHF, which made it possible to talk with pure LLMs at all. This is a big reason why most AI enterprise projects fail . Anecdotally, I have heard a lot of frustration with bespoke enterprise chatbots. People just want to use ChatGPT! If you’re not convinced, take any device you’re comfortable using (say, your phone, your car, your microwave) and imagine having to type out every command. Maybe really good speech recognition will fix this, but I doubt it. I wrote about it here and it’s linked in the topbar. Though this could be counterbalanced by what I’m sure is a strong push from executives to get in on the action and “build something with AI”. If you’ve ever tried to ask ChatGPT to DM for you, you’ll have experienced this first-hand: the model will immediately try and show you something cool, skipping over the necessary dullness that builds tension and lends verisimilitude. Maybe the answer is that the people using this product don’t engage with AI online spaces, and are just quietly using the product? Maybe there’s something about normal professional writing that’s less amenable to autocomplete than code? I doubt that, since so much normal professional writing is being copied out of a ChatGPT window. It could be that code editors already had autocomplete, so users were familiar with it. I bet autocomplete is brand-new and confusing to many Word users. It’s easy to verify changes by running tests or checking if the code compiles AI labs are incentivized to produce effective coding models to accelerate their own work Mark Zuckerberg has talked about filling Instagram with auto-generated content OpenAI has recently launched a Sora-based video-gen feed OpenAI has also started pushing users towards “Pulse”, a personalized daily update inside the ChatGPT product xAI is working on putting an infinite image and video feed into Twitter Chatbots like ChatGPT, which are used by hundreds of millions of people for a huge variety of tasks Completions coding products like Copilot or Cursor Tab, which are very niche but easy to get immediate value from Agentic products like Claude Code, Codex, Cursor, and Copilot Agent mode, which have only really started working in the last six months LLM-generated feeds Video games that are based on AI-generated content Of course, “just” here covers a raft of progress in training stronger models, and real innovations around RLHF, which made it possible to talk with pure LLMs at all. ↩ This is a big reason why most AI enterprise projects fail . Anecdotally, I have heard a lot of frustration with bespoke enterprise chatbots. People just want to use ChatGPT! ↩ If you’re not convinced, take any device you’re comfortable using (say, your phone, your car, your microwave) and imagine having to type out every command. Maybe really good speech recognition will fix this, but I doubt it. ↩ I wrote about it here and it’s linked in the topbar. ↩ Though this could be counterbalanced by what I’m sure is a strong push from executives to get in on the action and “build something with AI”. ↩ If you’ve ever tried to ask ChatGPT to DM for you, you’ll have experienced this first-hand: the model will immediately try and show you something cool, skipping over the necessary dullness that builds tension and lends verisimilitude. ↩

0 views