Latest Posts (20 found)
Sean Goedecke Yesterday

AI datacenters in space do not have a cooling problem

This year Elon Musk has started banging the drum about building AI datacenters in space. For the only person who owns a successful space company and a (moderately) successful AI company, this is a sensible way to boost his profile and net worth. Is it a sensible way to build datacenters?

The first comment underneath most discussions of this always goes along these lines: “you obviously can’t build AI datacenters in space, because heat dissipation is really hard in space, and AI datacenters generate a lot of heat”.

In general I am distrustful of snappy answers like these. It reminds me of the “AI datacenters obviously don’t use a lot of water, because cooling fluid circulates in a closed-loop system” argument: if it were true, there wouldn’t be a debate at all, just one side who understand the obvious point and another side who are stupid. Some arguments are like this! However, more often there’s a complicating factor that makes the snappy answer incorrect. In the water-use case, it’s that the closed-loop system has to itself be cooled by an open-loop evaporative chiller. What about the space datacenter case?

First, let’s give the argument a fair shake. Although space is itself very cold, cooling is tricky because everything you’d want to cool is surrounded by vacuum. Heat transfer works in three ways:

1. Hot (i.e. fast-moving) atoms bump into other atoms, making them move and thus heating them up
2. Hot atoms physically move from one location to another (e.g. in a fluid or gas), staying hot and thus making their new location hotter
3. Hot objects emit photons (electromagnetic radiation), cooling themselves down and heating up other objects those photons collide with

Vacuum is an excellent insulator because it defeats the first two methods of heat transfer. If there are no (or very few) atoms surrounding an object, there is nothing for the object’s atoms to collide with and nothing to carry heat away. That’s why vacuum is used as an insulator in thermoses, travel mugs, and so on.

So how can space datacenters get rid of their heat? By doubling down on the third method of heat transfer. Although it’s much harder to do heat transfer via moving atoms around in space, it’s actually easier to do heat transfer via emitting radiation. Any good emitter is also a good absorber. A perfectly black object is the most efficient emitter, but it’s also the most efficient absorber of photons from external sources, which is why black objects get hotter in the sun [1]. In space, the sun’s light is much easier to avoid, because there aren’t objects everywhere for it to bounce off. A shaded radiator can dump quite a lot of heat.

It would still require putting more radiators in space than we’ve ever done before. There are plenty of writeups out there if you want to read through the numbers. This is a recent one that estimates ~2500 square metres of radiator area would be needed to serve 1MW of datacenter energy (much less than what it’d need in solar panels) [2]. A serious AI datacenter is around 100MW [3], so we’d need 250,000 square metres of radiator area. The largest current radiator in space is probably the ISS, at around a thousand square metres.

Is scaling that up by 250x a lot? Yes, but it’s not necessarily ridiculous. We currently have zero industrial operations happening in space, so there’s been no need to push the boundaries here. In the grand scheme of things, 250,000 square metres is not that big. By my very rough estimates, that’s between 100 and 500 Starship launches: a couple of years at SpaceX’s current launch cadence, or a few months at their (very optimistic) estimate of future launch cadence.

Of course, you don’t just need radiators to put a datacenter in space. You need a similar quantity of solar panels, the GPUs themselves, and all kinds of other supporting equipment. If a GPU dies in an Earth datacenter, you can go in and swap it out; if it dies in space, you just have to leave it dead and keep going with less capacity.

It’s still wildly impractical to build AI datacenters in space. But it’s not impossible, and it’s certainly not impossible because of the cooling, which is a relatively minor component of the total mass that would have to be launched into space.

[1] In theory, black clothing would keep you slightly colder at night.
[2] Nobody ever talks about how impossible it would be to power space datacenters, despite the fact that you’d need to launch over three times as much solar panel area as radiator area. I guess because people know solar panels exist and that the sun shines in space.
[3] The first gigawatt AI data centers are coming online this year, but 100MW is a fair estimate for a current pretty-large-but-not-enormous AI datacenter.
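
The radiator figure quoted in this post can be sanity-checked with the Stefan-Boltzmann law, which gives the power radiated per square metre as emissivity × σ × T⁴. The sketch below is a rough check and not the method used by the writeup the post refers to: the 300 K radiator temperature, the 0.9 emissivity, and the single radiating side are all assumptions chosen for illustration, and absorbed sunlight and Earthshine are ignored.

```python
# Rough radiator sizing from the Stefan-Boltzmann law.
# Assumptions (not from the post): 300 K panel, emissivity 0.9,
# one radiating side, no absorbed sunlight or Earthshine.
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)


def radiator_area_m2(heat_watts, temp_kelvin=300.0, emissivity=0.9, sides=1):
    """Panel area needed to radiate heat_watts into deep space."""
    flux = emissivity * SIGMA * temp_kelvin ** 4 * sides  # W per m^2 of panel
    return heat_watts / flux


area_1mw = radiator_area_m2(1e6)      # the post's per-megawatt case
area_100mw = radiator_area_m2(100e6)  # the post's "serious AI datacenter"

print(f"1 MW:   {area_1mw:,.0f} m^2")
print(f"100 MW: {area_100mw:,.0f} m^2")
```

Under these assumptions the result lands in the same ballpark as the ~2500 square metres per megawatt quoted above. Because the flux scales with T⁴, a hotter radiator needs far less area, which is why the assumed temperature matters so much here.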

Sean Goedecke 2 days ago

Thinking Machines and interaction models

Thinking Machines just released Interaction Models . This is their first real AI model release 1 after a year of work and two billion dollars of capital. What is an “interaction model”? First, it’s not a frontier model . Thinking Machines is not yet competing with OpenAI, Anthropic and Google. Instead, they’re working on the problem of better real-time interaction with models. Some parts of what they’re doing are not new at all, other parts are slightly-questionable benchmark gaming, and still other parts represent a genuine technological advancement. I’ll try to lay it all out. If you’ve used ChatGPT in audio mode, you know that you can’t talk to it exactly how you’d talk to a human. There’s a big latency gap between when you finish talking and when the model jumps in. The model won’t interrupt you like a human, and doesn’t react to you interrupting it like a human would either. And of course you can’t give the model visual feedback like facial expressions. That’s because ChatGPT is either speaking or listening at any given time . When you’re talking, it’s in “listening” mode; when it’s talking, it’s in “speaking” mode, and isn’t absorbing any information from you. It relies on VAD (“voice activity detection”) to figure out if you’re talking. The alternative (and what “interaction models” do) is a fully-duplex system, where the model is constantly both in listening and speaking mode at the same time. Of course, the model can’t literally do this. Like all language models, it’s either doing prefill (ingesting prompt tokens) or decode (producing completion tokens). But what fully-duplex models can do is switch from listening to speaking mode in tiny chunks, called “micro-turns”. Instead of listening for ten seconds (or however long it takes you to stop talking), then speaking for ten seconds (or however long it takes to pass the model output through TTS), the model can listen for 200ms, then output for 200ms, then listen for 200ms, and so on. 
While the user is speaking, the model will know to output silence - most of the time. But if it decides it’s good to interrupt you or speak at the same time as you, it’s capable of doing that. So far, so unoriginal. There are plenty of examples of fully duplex audio systems that the Thinking Machines blog post already cites: Moshi , PersonaPlex , Nemotron-VoiceChat , and so on. But at least this outlines the space that “interaction models” are playing in: not “superintelligence from a frontier model”, but “better real-time conversational interaction” 2 . Given that, what is Thinking Machines doing that’s new? For existing fully-duplex models, you talk to the model itself. That’s a fairly big problem, since fully-duplex models have to be fast: fast enough that they can operate in tiny 200ms turns 3 . A model that fast cannot be particularly intelligent. Thinking Machines’ solution is to introduce an actual smart model - any regular language model will do here - in the background that the interaction model can delegate tasks to. In practice this is probably implemented as a tool call. The interaction model keeps chatting while the smart model works away, and then the smart model output is directly integrated into the interaction model’s context in the same way as audio and video input (a genuinely cool idea, I think). This is kind of neat, though it remains to be seen how well it works in practice. Will the model do a lot of “oh wait, the last thing I said was dumb, never mind” self-correction as the smarter model output trickles in? Will the fast interaction model be smart enough to delegate the right tasks at the right time? In general, the “start with a fast dumb model and have it hand off tasks” approach has been tricky for the AI labs to get right for a variety of reasons. 
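
To make the micro-turn and delegation ideas above concrete, here is a toy simulation. It is not Thinking Machines’ implementation: the scheduling, the “ends with a question mark” delegation rule, the 600ms background latency, and every name in it are hypothetical, chosen only to show a fast loop alternating 200ms listen and speak slices while a slower result arrives later and is folded into the same context.

```python
# Toy simulation of "micro-turns" plus delegation to a slower model.
# Everything here is hypothetical: the delegation rule, the latency,
# and the names are illustration only, not any lab's real system.
CHUNK_MS = 200               # micro-turn slice length mentioned in the post
BACKGROUND_LATENCY_MS = 600  # assumed time for the slow model to answer


def run_conversation(user_chunks):
    context = []   # everything the fast model has ingested so far
    pending = {}   # ready_time_ms -> result from the slow model
    outputs = []
    for i, chunk in enumerate(user_chunks):
        listen_at = i * 2 * CHUNK_MS
        speak_at = listen_at + CHUNK_MS

        # Listen slice (prefill): ingest user audio, then any finished
        # background results, which enter the context like other input.
        context.append(("user", chunk))
        for ready_at in sorted(t for t in pending if t <= listen_at):
            context.append(("background", pending.pop(ready_at)))

        # Speak slice (decode): usually silence while the user talks.
        kind, payload = context[-1]
        if kind == "background":
            outputs.append(f"say: {payload}")
        elif payload.endswith("?"):
            # Stand-in for a tool call to the smart model.
            pending[speak_at + BACKGROUND_LATENCY_MS] = f"result for: {payload}"
            outputs.append("say: let me check")
        else:
            outputs.append("<silence>")
    return outputs


outputs = run_conversation(["so I was", "wondering", "what is 17*23?", "", "", ""])
for line in outputs:
    print(line)
```

The fast loop never blocks on the slow model: it keeps producing a chunk every turn (mostly silence) and only speaks the delegated answer once it shows up in its context.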
If I’m being uncharitable, I might say that bolting on a strong reasoning model was an easy way for Thinking Machines to post impressive values for competitive benchmarks like FD-bench V3 (where they barely beat GPT-realtime-2.0) and BigBench Audio (where introducing the reasoning model bumps their score from 76% to 96%, only 0.1% below GPT-realtime-2.0). If I’m being charitable, I might say that a model fast enough for realtime conversation will have to have some way to punt hard tasks to a slower, smarter model. Both of those things are probably true. It’s also worth noting that Thinking Machines have also bolted on video input to their fully-duplex model. This is more exciting than it sounds, because face-to-face human conversation is very dependent on being able to read human expressions. In theory, this could unlock the ability to have genuine human-like conversations. The other reason why this is exciting is that it means Thinking Machines have been able to make a pretty big fully-duplex model (maybe twice the size of Moshi in terms of active parameters, and 40x the size in terms of total parameters). In fact, this is probably the biggest real technical achievement here. Other fully-duplex models are already doing micro-turns and interruptions, and could delegate reasoning fairly easily if they wanted to, but they aren’t doing video because they can’t . Being able to make a fully-duplex model the size of DeepSeek V4-Flash is pretty impressive. Much of the Thinking Machines blog post is dedicated to explaining how they’ve managed to do this: ingesting data in a more lightweight way, optimizing their inference libraries for tiny prefill/decode chunks, various decisions to make inference deterministic (a long-held hobbyhorse for Thinking Machines). There’s a lot of pressure on Thinking Machines to produce a genuine AI advancement. It doesn’t seem like they’re willing or able to compete in the frontier-model space (which makes sense, I wouldn’t want to either). 
Given that, I can see why they’re highlighting the parts of interaction models that are impressive to laypeople - all the fully-duplex interaction stuff - even though those parts are not truly innovative.

So what are Interaction Models? A scaled-up, multimodal version of existing fully-duplex models like Moshi, with a real model bolted on for extra intelligence (and maybe better benchmarks). The scale and video parts are new and cool, and something like the overall approach has to be right. In general, I’m glad that we’ve got well-funded and high-profile AI labs tackling problems other than “build a smarter frontier model”. I think there’s a lot of low-hanging fruit waiting to be picked in other areas of AI research.

[1] People do seem to really like Tinker, which is their tooling for researchers who want to fine-tune models, but it’s not exactly the hot new frontier model that people were expecting.
[2] I think it’s at least a little shady that the Interaction Models video demo is making a big deal about some features (like real-time simultaneous translation) that are just features of fully-duplex audio models, not anything specific to their system.
[3] Even 200ms is a bit long. You can see from the demo that there’s an uncomfortable half-second lag sometimes as the model finishes its prefill slice and has to move to the decode slice.

Sean Goedecke 5 days ago

AI makes weak engineers less harmful

Like other kinds of puzzle-solving, software engineering ability is strongly heavy-tailed. The strongest engineers produce way more useful output than the average, and the weakest engineers often are actively net-negative: instead of moving projects along, they create problems that their colleagues have to spend time solving. That’s why many tech companies try to build a small, ludicrously well-paid team instead of a large team of more average engineers, and why so far this seems to be a winning strategy. Being effective in a large tech company is often about managing this phenomenon: trying to arrange things so that the most competent people land on projects you want to succeed, and the least competent are shunted out of the way 1 . For instance, if you’re technical lead on a project, you more or less have to ensure 2 that the most critical pieces are in the hands of people who won’t screw them up (whether by directly assigning the work, or by making sure someone can “sit on the shoulder” of the engineer who you’re worried about). Claude Code changed this. Frontier LLMs don’t have the taste or the system familiarity of a strong engineer, but they have absolutely raised the floor for weak engineers. Instead of getting a pull request that could never possibly work or would cause immediate problems, the worst you’ll now see is a standard LLM pull request: wrong in some ways, baffling in others, but at least functional on the line-by-line level and not so obviously incorrect that someone with no knowledge of the codebase could point it out. That is a huge improvement! You can try this out yourself. If you attempt to deliberately make mistakes while working with a coding agent, you’ll find that the agent pushes back hard against many obvious errors (i.e. caching user data with a non-user-specific key, writing an infinite loop that might never terminate, or leaking open files). 
Of course, the agent will still miss subtle errors, particularly ones that require understanding other parts of the codebase. Working with the least effective engineers is now sometimes like working with a Claude Opus or Codex instance that you communicate with over Slack. Occasionally it’s literally that: your colleague is simply pasting your messages into Claude Code and pasting you the response. This is annoying, but it’s a much better experience than working with this kind of engineer directly. After all, you probably already work with a bunch of LLM instances. The Slack interface is not ideal - unlike using Claude Code directly, you sometimes wait hours or days for a response, and you don’t get visibility into the agent’s thought processes - but it’s still helpful on the margin. More compute being thrown at your problem is better than less. Of course, this isn’t a great state of affairs for the engineer in question, who is almost certainly learning less than if they were making their own (bad) decisions. It’s also a bad state of affairs for the company, who is paying a human salary and getting a Copilot subscription (which they’re likely also paying for) 3 . After the current push to figure out what value AI is adding to engineers, I suspect there will be a push to figure out what value engineers are adding to AI , and the engineers who aren’t adding much may find themselves out of a job. You can’t talk to Claude-over-Slack like you’d talk to normal Claude. If you tend to handle LLMs roughly (insulting them, or just being very curt), you’ll have to change your communication style. A human is going to read your messages, after all, even if you’re really interacting with a LLM. There’s no point being rude. But if, like me, you say please-and-thank-you to the models 4 , you can treat your LLM-using coworker as just another Copilot window or Codex tab. It’s far better than having to treat them as an unwitting saboteur. 
Not all net-negative engineers use AI tools like this. Many are strongly convinced of their own wrong opinions about how to build good software, or mistrust AI in general, or believe that relying heavily on LLMs is not a good way to improve [5]. But no strong engineers use AI tools like this. Even when they’re being lazy or sloppy, a capable engineer will have enough baseline taste to catch obvious AI-generated errors. So the phenomenon of engineers [6] becoming thin wrappers around Claude Code is limited to the kind of engineers for whom this is an improvement in their work product.
[1] More charitably: many “least competent” engineers are just out of their comfort zone, and can be fine or even excel under the right circumstances (though in my view the best engineers are able to do good work in a wide variety of environments). Also, I don’t currently work with a lot of incompetent people. Much of this is based on past experience or talking to other engineers in the industry.
[2] Since your managers are doing the same thing, this can sometimes feel like Moneyball: you’re trying to identify underappreciated talent who are strong enough to help you win without being so high-profile that your boss poaches them to lead something else.
[3] I suppose it’s better to pay for nothing than to pay for net-negative output, but it still doesn’t seem good.
[4] I think this is actually the right way to hold Claude Opus 4.7.
[5] Is this true? I think relying on LLMs is not a great way for most engineers to improve, but if LLM output is consistently better than your own, it might be different. So long as you’re paying attention to where the LLM does better, it could actually be a good way to learn.
[6] I don’t have as much experience (or anecdotes) about non-engineers falling into this trap, but this post has convinced me that it might be worse.

Sean Goedecke 6 days ago

Notes on incidents

Incidents are boring. Most of what you actually do during an incident is wait: for some other team to investigate, or for a deploy to finish, or for the result of some change to become apparent, or for someone else who’s been paged to come online. It’s stressful, but there’s often just not that much to do. Most incidents resolve on their own. People love to share war stories about incidents where some hero engineer improvised a clever fix that instantly repaired the system. That rarely happens. Well-designed software systems tend to come good by themselves, and many modern systems are at least partly well-designed, by virtue of being built out of really solid pieces. If a server process is crashing or leaking memory, Kubernetes will kill the pod and bring it back up. If a service is overloaded and jammed up, clients will (hopefully) trigger circuit breakers and back off until it can recover. Temporary spikes in expensive operations will often just fill up a queue instead of taking the entire system down. Most incident calls I’ve been on - well over half - would have come good by themselves in roughly the same time without any human intervention. Most incident-resolving actions make incidents worse. Engineers jump too quickly to resolve incidents. Oh, the queue size is huge? Don’t worry, I’m here in a production console to clear the queue! Unfortunately, some of the jobs I just nuked were doing important billing work and aren’t automatically re-queued, so this queue-latency incident just became a billing incident as well. Another classic in this genre is “engineer forces a series of redeploys to “fix” a concerning-looking metric, and the concurrent deploys cause far more stress on the system than whatever was causing the metric to look weird”. For that reason, the first thing you should do in an incident is nothing . When I was paged late at night, I used to have a habit of pouring myself a glass of scotch before I joined the call. 
This was only partly for the tranquilizing effects of alcohol: the main reason was to have a ritual I could go through to convince myself that I wasn’t rushing, and that it was OK to take a few breaths and relax before jumping into the problem 1 . Making a cup of tea or going for a walk around the house would probably have served as well. Effective incident-resolving actions are often dull. Typically the action needed to resolve the incident - assuming it doesn’t resolve on its own - is to temporarily disable some problematic feature until the system recovers. This is never a complex code change. Typically someone spends five minutes putting together the patch, and then an hour waiting for reviews, CI, and deploying. If you’re very lucky, you’ll get to write a “wrap a cache around it” code change. In an incident, there is no substitute for knowledge of the system. Five strong engineers can troubleshoot on an incident call and get nowhere, while one half-drunk engineer who’s familiar with the codebase can swan in and immediately fix the problem. This is because the kinds of actions that resolve incidents are so simple: if you’ve been the one working on the project, you likely already know exactly what feature flag to check and disable, or what code change to revert. Resolving incidents requires courage. Incident calls can be scary. When engineers are scared, they often reach for consensus: hedging their statements, asking the group if they agree a particular course of action is safe, deferring to each other, and so on. But if you’re the one with knowledge of the system, you have to be decisive. Say “I’m going to do X”, wait thirty seconds, then do it. While it’s usually net-negative to have a powerful manager fidgeting on the incident call, this is one of the rare cases where it can be helpful - executives are very comfortable saying “okay, do it now” about technical courses of action they don’t fully understand. Resolving incidents buys a lot of political credit. 
One thing that I think surprises a lot of engineers who are new to on-call is how grateful managers and executives are for even really simple fixes (e.g. “turn off the feature flag”). This is because incidents are one of the few times that non-technical leadership are directly confronted with their lack of control over the technical sphere. When the team is building a product, your VP has a lot of freedom to guide the process and make decisions. But when there’s an active incident, they have to just sit there and trust that their technical employees are going to pull them out of the fire. It’s a scary situation, particularly for someone who’s used to exercising a degree of power in the workplace.

However, always resolving incidents is (by itself) not a durable position of power. This is a little counter-intuitive. Surely if you’re always resolving incidents, you’re indispensable? The problem is that incident-resolving work is almost always so technical as to be completely opaque to executives. They know the incident has resolved, but they don’t know if you made a heroic effort or merely did the obvious thing. They also can’t point to your successes as theirs (which is always the most reliable way to get VPs and directors on your side), because incidents are expected to be fixed, and it’s always better not to have had the incident at all.

[1] I don’t need to do this anymore because I just don’t get as keyed up about incidents as I used to.
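
The post mentions clients tripping circuit breakers and backing off until an overloaded service recovers. For readers who haven’t seen one, here is a minimal sketch of that pattern. The class, the threshold of three failures, the 30-second cooldown, and the simulated outage are all assumptions for illustration, not taken from any particular library or from the systems the post describes.

```python
# Minimal circuit-breaker sketch (hypothetical names and thresholds).
class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through ("half-open").
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok, now):
        """Record the outcome of a request that was actually sent."""
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # (re)open and restart the cooldown


def simulate(outage_ends_at=60, step_s=10, total_s=100):
    """A client calling every step_s seconds while the service is down."""
    breaker = CircuitBreaker()
    sent, blocked = [], []
    for now in range(0, total_s, step_s):
        if not breaker.allow(now):
            blocked.append(now)  # back off: no load reaches the service
            continue
        sent.append(now)
        breaker.record(ok=(now >= outage_ends_at), now=now)
    return breaker, sent, blocked


breaker, sent, blocked = simulate()
print("sent:", sent)
print("blocked:", blocked)
```

In this run the client stops hammering the service after three failures, sends a single probe per cooldown, and resumes normal traffic by itself once a probe succeeds, which is the “comes good without human intervention” behaviour described above.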

Sean Goedecke 1 week ago

Why hasn't longer-horizon training slowed AI progress?

Dwarkesh Patel 1 recently posted an award for the best answers to four key questions about AI. It’s partly a challenge and partly a job interview, since some of the winners will get offered a role as a “research collaborator”. I don’t want the job, but I do want to write down my answer to his first question: why hasn’t AI progress slowed down more? There are a few reasons we might think AI progress would slow down. The particular reason Dwarkesh is interested in goes like this. Training a model (specifically reinforcement learning) requires the model to perform a task and then get “graded” on the output. As models get more powerful and tasks become harder, they take longer and require more FLOPs 2 to complete, and thus more FLOPs to train: thus training harder models will take longer. But intuitively, AI progress hasn’t slowed down that much. The famous METR horizon-length graph shows that AI systems are capable of more and more complex tasks over time, and that this process is accelerating, not slowing down. Why would that be? Firstly, it might just be the case that newer models are benefiting from orders of magnitude more FLOPs . Of course, AI labs aren’t standing up orders of magnitude more GPUs (they’re trying, but there are hard physical limits on how fast you can scale up a physical datacenter). But it’s certainly possible that they’re learning to use their existing FLOPs orders of magnitude more efficiently. The efficiency of complex software systems - and the training code for a frontier AI model certainly qualifies - is not typically determined by the number of genius ideas in it. It is determined by the number of boneheaded mistakes. Take this story 3 of how the initial GPT-4 training run used FP16 when summing many small values, which will completely mess up your results if the sum of those values is large. How much training-efficiency-per-FLOP does solving bugs like that buy? 
Plausibly enough to outweigh any inherent lack of efficiency from training more powerful models.

Secondly, intuitions about the speed of AI progress are weird and unreliable. Humans measure AI progress - and intelligence in general - on a really uneven scale. It’s easy to tell when an AI (or a person) is less smart than you, because you can just see them making mistakes. It’s very hard to tell if they’re smarter, because in that case you’re the one making mistakes. You have to rely on more subtle context clues: do they get better long-term results than you, or do they often confuse you in situations where you later end up agreeing with them, and so on. The jump from GPT-3 to GPT-4 seemed huge because GPT-3 was dumber than almost all humans, and GPT-4 was sometimes as smart as a human. However, frontier models are now smart enough to be in the realm of ambiguity on many topics. It’s thus much harder to tell the “real” rate at which they’re getting smarter. Maybe the rate of growth of “raw intelligence” really has slowed down! I don’t know how we’d be in a position to know for sure.

Thirdly, many traits other than intelligence determine the capabilities of AI models. Take the jump in October last year where OpenAI and Anthropic models were suddenly “agentic” (i.e. they could reliably perform complex tasks end-to-end). That might be intelligence, but it might also just be a greater working memory, or more rote familiarity with the basic tools of an LLM harness, or more ability to attend to the context window, or even simply a personality more suited to tools like Claude Code or Codex. Of course, all of these traits are plausibly “intelligence”. But they’re traits you might instil by various clever tricks (or even just tweaking the system prompt), not by brute-forcing more FLOPs.
It’s illustrative here to consider the mistake made by Apple’s infamous The Illusion of Thinking paper, where the researchers asked various models to brute-force solve Tower of Hanoi puzzles with different numbers of disks, using the results to score how good at reasoning the models were. But of course when you read the output, all of the failures were cases of the model realizing that many hundreds of steps were required, and refusing to even try. These same models could trivially write code to perform the steps, or correctly go through any smaller subset of the steps. The problem wasn’t intelligence, it was persistence : these models lacked the willingness to dig in and keep powering through steps until they got to an answer 5 . Even inside an AI lab, I don’t think anyone has a good understanding of how many “real” FLOPs are being thrown at a training run (not counting FLOPs that are wasted on bugs). We also don’t have a clear sense of whether AI progress really is slowing down or not. Mythos seems impressive, and coding agents are really good now, but once the models get close to human intelligence it becomes really tricky to monitor. Finally, almost everyone judges intelligence by capabilities, but capabilities are produced by a constellation of many traits (intelligence is just one of them). I think this stuff is really complicated. A general theory like “RL takes more flops-per-reward as tasks get longer, therefore training will gradually slow down” sounds good, but in practice AI development is dominated by lightning strikes: silly bugs that make training a hundred times worse, clever ideas that make models a hundred times more useful, and spiky capabilities that can produce dazzling results in some areas but zero improvement in others. We are still very early . 
[1] If you’re reading this you probably know who Dwarkesh is, but if you don’t: he’s a well-known tech-adjacent podcaster whose gimmick is that he actually does extensive research before each guest and asks specific technical questions.
[2] A FLOP is a floating-point operation, i.e. a matrix multiplication, i.e. “time on a GPU”.
[3] I saw this in a tweet and only realized that the source was Dwarkesh when I was researching for this post.
[4] What if AI progress stalls for technical reasons, and everyone gives up on training new models? In that world, open source models will eventually catch up, and AI labs won’t be in a privileged position.
[5] Incidentally, this is my pet theory about why models got much better at agentic tasks last year: training on longer and longer agentic traces meant that models started to “believe they could do it”, and made them much less likely to just give up and take shortcuts or refuse to continue.
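
The FP16 summation problem mentioned in this post is easy to reproduce in miniature. The sketch below is not a reconstruction of the actual GPT-4 bug, whose details I don’t know beyond what the post says. It only shows the general failure mode: a running total kept in half precision stops growing once the values being added are smaller than half the gap between adjacent representable numbers. The helper names are hypothetical, and the half-precision rounding uses the `e` format of Python’s standard `struct` module.

```python
# Accumulating many small values in half precision (FP16) stalls.
# Helper names are hypothetical; this illustrates the failure mode only.
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_running_sum(values):
    """Sum values while storing the accumulator in FP16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


values = [0.01] * 10_000          # true sum is 100
naive = fp16_running_sum(values)  # accumulator kept in FP16
reference = math.fsum(values)     # accurate double-precision sum

print("FP16 accumulator:", naive)
print("reference sum:   ", reference)
```

The FP16 total gets stuck at 32: between 32 and 64 adjacent half-precision values are 0.03125 apart, so adding 0.01 rounds straight back to where it started. The usual remedy is to keep the accumulator in a wider type such as FP32 even when the inputs are FP16.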

Sean Goedecke 1 week ago

Why I don't like the "staff engineer archetypes"

The most influential piece of writing about staff engineers in the last decade has to be Will Larson’s Staff engineer archetypes. He argues that the “staff engineer” title covers at least four very different roles: the team lead, the architect, the solver, and the right hand. This taxonomy gets cited a lot as advice for people who are trying to become effective staff engineers. For both of my promotions to staff engineer, my manager at the time linked me to the “staff engineer archetypes” and asked me to consider which of these archetypes I was aiming towards. These archetypes definitely exist 1. However, I think it’s bad practical advice to tell engineers to try and target them. To see why, let’s take the “team lead” archetype. Larson describes this as an informal technical leadership role: not necessarily an explicit authority figure, but someone who’s good at scoping work, planning projects, and maintaining the kind of relationships (e.g. with other teams) needed to successfully ship. If you want to fill this role, shouldn’t you start trying to do these things? No! You don’t become a technical leader by trying really hard to be a technical leader, much like you don’t become a writer by trying really hard “to be a writer”. You become a technical leader by doing good technical work until your skills and relationships emerge organically. I wrote about this process in Ratchet effects determine engineer reputation at large companies. To get good at shipping large complex projects, you must start by shipping tiny pieces of work, until you’re familiar enough with the system and you’ve built enough trust to take on slightly larger pieces. At each stage, if you do good work - “good work” here means “deliver shareholder value” - you will very naturally be given opportunities to work on more complex and important things. If you try to jump ahead, you’re going to run into all kinds of problems:

- Important projects are usually assigned top-down, not bottom-up, so you’ll either be trying to muscle out the planned engineering lead for a project or to pitch your own (complex, important) engineering task to senior management. Either way, good luck with that!
- You likely won’t have a good enough relationship with senior management to know what their real priorities are.
- If you’re not yet trusted to execute, you may get assigned “minders” (often current staff engineers) who will ghost-lead the project through you 2.
- You’ll likely make poor technical decisions.

The other archetypes are like this as well. If you want to become a successful architect, you do not get there by studying software architecture in the abstract, because you can’t design software you don’t work on. The “solver” and “right hand” archetypes both rely on having an enormous amount of trust and influence. You can’t aim for those archetypes directly, because trust and influence accumulate over time. In fact, the idea of “aiming for” a particular staff engineer archetype reflects a misunderstanding of what the staff engineer role is. What is the defining attribute of the staff engineering role, then? A staff engineer has to be useful to the company. Of course, a senior or mid-level software engineer ought to be useful too, but all they have to do is execute on the job in front of them. If they end up not providing value (maybe their project turns out to be unimportant, or they don’t get the support needed to succeed) that’s their manager’s problem, not theirs 3. In contrast, staff engineers are expected to deliver value regardless: to make the project work, or to find something else useful to do if the project truly can’t be salvaged. This is an unfair expectation. Often projects really do fail through no fault of your own, and sometimes it just isn’t possible to conjure useful work from thin air. That’s actually by design: the staff engineer role is supposed to be unfair. Something many engineers don’t realize is that all senior management and executive leadership roles are unfair too, in the same way. That’s just part of the deal: executives are given power and great compensation, and in return they get thrown off the boat in bad weather 4. “Staff engineer” is the first engineering role where you are held largely responsible for outcomes you don’t control. Developing a “staff engineer mindset” thus has very little to do with the archetypes. Instead, you should:

- Develop the habit of constantly asking yourself “is this useful to the company” (and answering correctly).
- Lose the habit of worrying about whether you’re being treated “fairly”. Instead, try to think about your role in terms of incentives and consequences.

At the beginning, you won’t look much like any of the staff engineer archetypes. You will look like a level-headed engineer who can be trusted to move projects forward with a minimum of fuss, and who can be re-tasked to different work without complaining. You’ll also look like someone who’s paying a lot of attention to what their manager’s actual priorities are, and who is thinking hard about how to fulfil those priorities (instead of their own goals). If you do this for long enough, you’ll eventually find yourself in one of the staff engineer archetypes. However, it probably won’t be the one you’re “aiming for”. The whole point of being a staff engineer is that you’re willing to fill whatever archetype the company needs at the time. In his original staff engineer post, Larson is pretty clear that these archetypes are more of an anthropological description of some of the varied niches staff engineers fill, not a how-to guide for succeeding in the role 5. At the time, the “staff engineer” role was fairly new and people were still trying to figure out what it even meant. Pointing out that there were a few very different ways to succeed in the role was a genuinely novel observation. The staff engineer archetypes are a good list of ways an engineer can be very useful to their organization - but only once they’ve built a deep relationship of trust with their organization’s leadership. Advice on how to succeed as a staff engineer should be about how to build that trust, not about what to do once you have it.

1. One caveat that is too pedantic for the body of the post: each tech company has a different structure of roles. Some don’t have the formal “staff” title at all, while others have “staff” as a fairly early rung on the ladder and a panoply of “senior staff”, “senior principal staff”, and so on roles above it. Like all “staff engineer” discourse, this post is not about the word itself but about the point in the engineering job ladder where progression becomes significantly more difficult.
2. Impressing your VP’s trusted lieutenants can actually be a good way to build trust in the medium-term, but you’d better hope you’ve built enough understanding of the system to do it right. If this process goes badly, your reputation in the org might be torched for years.
3. In theory, at least. In practice it’s always better to be useful (again, in the sense of “delivering shareholder value”).
4. This is why very senior leadership sometimes seem so unempathetic towards engineering complaints: their work environment operates by very different rules and norms to that of most engineers. I keep meaning to try and write about this and never succeeding. This draft is the closest thing I have to a deeper exploration of the point.
5. For the record, my how-to guides are here and here.

Sean Goedecke 2 weeks ago

Software engineering may no longer be a lifetime career

I don’t think there’s compelling evidence that using AI makes you less intelligent overall 1. However, it seems pretty obvious that using AI to perform a task means you don’t learn as much about performing that task. Some software engineers think this is a decisive argument against the use of AI. Their argument goes something like this:

1. Using AI means you don’t learn as much from your work
2. AI-users thus become less effective engineers over time, as their technical skills atrophy
3. Therefore we shouldn’t use AI in our work

I don’t necessarily agree with (2). On the one hand, moving from assembly language to C made programmers less effective in some ways and more effective in others. On the other hand, the transition from writing code by hand to using AI is arguably a bigger shift, so who knows? But it doesn’t matter. Even if we grant that (2) is correct, this is still a bad argument. Until around 2024, the best way to learn how to do software engineering was just doing software engineering. That was really lucky for us! It meant that we could parlay a coding hobby into a lucrative career, and that the people who really liked the work would just get better and better over time. However, that was never an immutable fact of what software engineering is. It was just a fortunate coincidence. It would really suck for software engineers if using AI made us worse at our jobs in the long term (or even at general reasoning, though I still don’t believe that’s true). But we might still be obliged to use it, if it provided enough short-term benefits, for the same reason that construction workers are obliged to lift heavy objects: because that’s what we’re being paid to do. If you work in construction, you need to lift and carry a series of heavy objects in order to be effective. But lifting heavy objects puts long-term wear on your back and joints, making you less effective over time. Construction workers don’t say that being a good construction worker means not lifting heavy objects. They say “too bad, that’s the job” 2. If AI does turn out to make you dumber, why can’t we just keep writing code by hand? You can! You just might not be able to earn a salary doing so, for the same reason that there aren’t many jobs out there for carpenters who refuse to use power tools. If the models are good enough, you will simply get outcompeted by engineers willing to trade their long-term cognitive ability for a short-term lucrative career 3. I hope that this isn’t true. It would be really unfortunate for software engineers. But it would be even more unfortunate if it were true and we refused to acknowledge it. The career of a pro athlete has a maximum lifespan of around fifteen years. You have the opportunity to make a lot of money until around your mid-thirties, at which point your body just can’t keep up with it. A common tragic figure today is the professional athlete who believes the show will go on forever and doesn’t prepare for the day they can’t do it anymore. We may be in the first generation of software engineers in the same position. If so, it’s probably a good idea to plan accordingly.

1. If you’re thinking “wait, there’s research on this”, you can likely read my take on the paper you’re thinking of here, here or here.
2. Of course, construction workers do have layers of techniques for avoiding lifting heavy objects when possible (cranes, dollies, forklifts, and so on). There’s a natural analogy here to a set of techniques for staying mentally engaged that software engineers are yet to discover.
3. In theory labor unions could slow this process down (and have forced employers to slow down this race-to-the-bottom in other industries). But I’m pessimistic about tech labor unions for all the usual reasons: the job is too highly-paid, you can work (and thus scab) from anywhere on the planet, and so on.

Sean Goedecke 3 weeks ago

Luddites and AI datacenters

Is it time to start burning down datacenters? Some people think so. An Indianapolis city council member recently had his house shot up for supporting datacenters, and Sam Altman’s home was firebombed (and then shot) shortly afterwards. People from all sides of the argument are sounding the alarm about imminent violence. The obvious historical comparison is Luddism, the 19th-century phenomenon where English weavers and knitters destroyed the machines that were automating their work, and (in some cases) killed the machines’ owners. Anti-AI people are reclaiming the term to describe themselves, and many of the leading lights of the anti-AI movement (like Brian Merchant or Gavin Mueller) have written books arguing more or less that the Luddites were right, and we ought to follow their example in order to resist AI automation 1. Like many people, I have heard a lot about Luddism and Luddites, but only in the context of it being a general term for someone who is anti-technology. I was interested in learning more about the actual historical movement: what kind of people participated, what it was, and what it accomplished. I read Merchant’s and Mueller’s books, plus others 2, to try and figure all of this out. Who were the actual, historical Luddites? What can we learn from them about burning down datacenters? The Luddites were a decentralized movement of artisans in the 1810s who engaged in violent protest - smashing machines, threatening violence, and ultimately killing people - over the fact that their jobs were being automated away. They were not rich, but they were certainly not unskilled labor: these were people who had apprenticed for seven years. They were mostly working from home, producing cloth from raw material given to them by their employer, often with tools rented from that same employer. They were working short weeks (three days, per William Gardiner) at their own discretion. In the early 1800s, their skilled labor was becoming unnecessary.
With the help of expensive machines, unskilled labor could now produce lower-quality cloth, so employers were beginning to pass over these artisans in favor of cheaper employees: children, unapprenticed workers, and women 3 . Combined with the bad economic position of England at the time (at war with France, and thus deliberately cutting off much European trade), times were beginning to be very tough indeed. Starvation was a real threat. Cloth artisans were groups of capable men who were used to getting their own way, knew each other very well, and were broadly respected in their communities. It was thus a natural response for them to organize into what was effectively a militant union. The Luddites would send anonymous threatening letters to their old (or current) bosses, warning them to stop using their machines. If they didn’t comply, they would raid the workshop or factory, smashing the machines up. They typically did not harm people, though they certainly delivered threats of bodily harm or even murder, and the raids were violent enough (e.g. shooting through windows) to have risked accidental deaths. In at least two instances where a factory owner was seen as unusually cruel, the Luddites did attempt assassinations: one unsuccessful, and one successful one that eventually prompted a crackdown that ended the movement for good. Luddism was fully decentralized. Different communities could and did decide to engage in machine-raiding independently, particularly when news spread of the tactic succeeding. Although each community had its own influential men, there was never a single “leader of Luddism”. King Ludd himself was a folk-tale figure. This made it an absolute nightmare for the British government to try and suppress them: putting down one Luddist group did nothing to prevent other groups from continuing to operate. I was surprised by how difficult it was for the government to get a hold of any of the local Luddist ringleaders. 
The government was willing to offer huge rewards to informers: at one point up to 40x the yearly wage. However, there were no takers for several years. Armies of spies were recruited and tasked with infiltrating Luddist groups, with absolutely no success. Why was it so hard? Firstly, because the working class was so overwhelmingly pro-Luddist. People universally blamed the economic situation on the government and the factory owners (rightfully so, since the government had chosen to go to war and the factory owners had chosen to embrace automation). Secondly, the communities in question were so insular and tightly-knit that informers would have to rat on their friends and relatives. The handful of people who did eventually inform lived out the rest of their lives as pariahs. Because each group was so insular, any spies trying to infiltrate the movement would have been complete strangers to the community, and would thus have a very hard time gaining the trust of a group of men who had known each other for their whole lives. The spies that did exist were restricted to the occasional inter-group Luddist meetings, where people didn’t all know each other so closely. But it’s unclear how important those meetings were, since Luddist groups didn’t need to coordinate to achieve their goals. According to Merchant, the spies spent much of their time embellishing tales of an imminent revolution to encourage their employers to keep the money flowing. In the absence of reliable information, the British government was forced to use force. And they did, sending 12,000 troops 4 into the northern counties. This served mainly as an intimidation tactic, since there was no standing Luddite army to fight, and the soldiers spent most of their time marching back and forth or being abused by the townspeople. 
More successful was the imposition of a full police state in Yorkshire, under the magistrate Joseph Radcliffe, who was empowered to randomly grab people off the street and interrogate them for days. That pressure eventually convinced a handful of people to give up their local Luddist organizers, who were tried and inevitably hanged. Their deaths (and the ensuing climate of fear) ended the high-water mark of Luddist activity. Even then, Luddist raids continued on and off for six more years before petering out. This is a tricky question. In one sense the answer is obviously no: the movement was crushed, many of their leaders were executed, the textile industry continued to be automated, and today there are no longer thousands of jobs for skilled British weavers, knitters, spinners and dyers. The pro-automation side won. However, they did achieve a number of short-lived victories. Their early threats often succeeded in preventing the building of a factory in a particular location, or in delaying the adoption of industrial machinery in a particular shop by years. In one case, hosiers that had been spooked by Luddite activity gave out pre-emptive bonuses to their workers to discourage them from smashing up their machines (which were indeed not smashed). The Luddites also scared the hell out of the British government, who (encouraged by their over-eager spies) thought they might have a genuine revolution on their hands. While they didn’t get many legal concessions at the time, the specter of Luddism must have loomed over the labor reform movement of the 1800s, which saw the first anti-child-labor laws and the beginnings of independent inspection of factories. Finally, every book I read argued that the Luddism movement may have created the first idea of a “working class”, by unifying many previously-independent groups of workers against a common enemy. 
Seen this way, the “political arm” 5 of Luddism can arguably claim partial credit for every labor victory since the 1800s (though the ringleaders were still hanged and the weavers did still lose their jobs). We can now describe the “Luddist approach” to fighting technological change:

1. Find a few conspirators in your existing community who agree with your political project (but don’t join a broader organization, since that leaves you vulnerable)
2. Make public anonymous demands in support of your specific goals, backed up by threats of violence, signed by a fictional character that’s easy for other groups to appropriate
3. If your threats are ignored, attack the physical machines in the dead of night, destroying them and threatening (but not killing) any guards
4. Hope your example inspires many more people to independently do (1)-(3) themselves
5. Keep raiding, optionally escalating to assassination of some of the bosses, until you bait a totalitarian crackdown from the government
6. Eventually get arrested and executed, to great public dismay
7. Twenty years later, your example inspires the first national trade unions

Note that starting or joining a national movement is not the Luddist approach. Staying almost entirely isolated in small cells helped the Luddists avoid government spies and made them impossible to root out without enforcing a police state. Note also that you need a lot of public support for this to work: so that you get a lot of copycat groups without having to explicitly organize them, and so that your property destruction and murder is taken sympathetically instead of getting you immediately reported and arrested. There are many reasons why this doesn’t map onto the current anti-AI movement. First, Luddism grew from a homogeneous group of high-status workers whose jobs almost vanished overnight, not a broad group of people whose jobs are getting slightly worse because of AI (like the gig-economy workers Merchant endlessly references). That meant that Luddites had really specific asks: higher wages for piecework, a phased introduction of specific textiles machinery, and so on. They were not generally demanding that the machines all be immediately destroyed 6. Second, Luddism was very local. A pre-existing group of artisans in a particular town would gather in that town - either at work or an inn, say - and decide to petition or raid the businesses in that town that were harming their livelihoods. AI concerns are not like this. It isn’t businesses in Chicago or Tokyo that are making decisions that imperil Chicago’s or Tokyo’s jobs, it’s businesses in San Francisco. Unlike the Luddists, anti-AI activists can’t naturally organize with people they already know to take direct action where they already live. Third, Luddist victory could also be local. If you successfully lobby your local cloth business to not use a weaving machine, you have secured your job at that business for a while. But if you successfully lobby your town (or even your country!) to not build a datacenter, it doesn’t meaningfully improve your local position, since your job can be as easily replaced by a datacenter on the other side of the planet. Reading through the history of the Luddites from a modern perspective, I was struck by the near-total absence of good government. The artisans were left to work out their grievances with their bosses more or less by themselves, with no formal channels for complaint or any attempt at mediation. When the government did intervene - in response to near-universal unrest in half of the country - they did this:

- Make machine-breaking and oath-taking capital crimes
- Dump thousands of soldiers more or less at random into the area, with no plan to guard factories or do anything beyond just hang around in case a revolution broke out
- Empower a single magistrate to arrest and interrogate whoever he wanted in order to root out the conspiracy

I suppose it worked, in the sense that it eventually succeeded in stopping the Luddist raids. But I can’t help but think that even a token gesture of compromise (say, requiring employers to make their wages public, or restricting the most cheap-and-nasty factory-made textile products) would have gone a long way towards calming things down. This almost actually happened! The 1812 Framework Knitters’ Bill, which had these provisions in it, passed the House of Commons but was shot down in the House of Lords. Why did the government fail to even make a token attempt at compromise? Before the industrial revolution, I wonder if the workers and bosses of the English textiles industry were genuinely able to often just work out their problems together, so the government never really needed to do large-scale mediation. When that changed - when automation first made it possible for the bosses to durably “win” - government took a long time to realize, so there were some unpleasant decades of disempowered workers trying to bully factory-owners (via riots and death threats), and factory-owners trying to brutalize workers (via direct violence and automation). The Luddites weren’t anti-technology in general. Instead, they were against four or five specific machines that were automating away their skilled work. Contrary to many of the books I read, I think that’s actually fairly well understood today. But what surprised me was how truly decentralized and widespread the movement was. Every town with weavers faced the same incentives (the bosses to industrialize, and the workers to fight back) which created Luddism locally. And nobody was willing to report the Luddites: not for years, or for what would have been a fortune in cash, or in disagreement with their actions. They were only stopped by a truly fascist crackdown, with the army in the streets and the secret police pulling away random citizens for arrest and questioning. I can see why modern “Luddites” - who are genuinely anti-technology - talk so much about the legacy of original Luddites. Luddism was a grassroots organization which notched up some real short-term wins, enjoyed near-total support among the public, and didn’t seem to be troubled by infighting at all 7. If you’re an anti-AI campaigner, I bet all of that sounds great 8. But I’m not convinced that the neo-Luddites really are the inheritors of Luddism. A load-bearing feature of Luddism is that it was local: it didn’t have manifestos, or leaders, or factions, or even much explicit ideology beyond the artisans’ immediate practical concerns. These were local men striking back against the local factories harming their local jobs. That simply isn’t the case with AI, where a datacenter in China can take my job in Australia. If it isn’t time to start burning down datacenters, what can we do? Well, workers today are better off than the Luddites were, because we stand on their shoulders. Unlike the Luddites, today’s workers can all vote (and I suspect there will be no shortage of anti-AI candidates to vote for). It’s a much less exciting answer to say “try and get better laws passed” than to say “viva the revolution, where’s my Molotov cocktail”. Still, if the Luddites had that option, I suspect they would have used it.

edit: A reader pointed me at Against the Luddites, which argues that (a) the Luddites were an elite (ish) movement, (b) they explicitly and deliberately excluded women, and (c) their leftist theory bonafides are questionable. I don’t really care about (c), agree with (b), and mostly agree with (a), with the caveat that they really did have a broad base of non-elite support.

1. I got linked this article calling AI a “fascist artifact” (on a blog called “Breaking Frames”, a clear reference to Luddism) while I was writing this blog post.
2. I really enjoyed Merchant’s book and did not enjoy Mueller’s (which I found to be 10% about the Luddites and 90% about interminable intra-Marxist ideological arguments). I also read The Luddites, which was effectively a dry summary of the ground Merchant covers, a bunch of other essays, and went back and forth with ChatGPT and Claude on some of the key questions.
3. Merchant (around page 134) attempts to characterize Luddism as a pro-feminist movement, citing some examples of women helping organize raids, but later on even he (page 162) quotes a representative of the Irish weaver’s guild effectively saying “we don’t have your English problems of women working in the industry”. In general it’s a bit frustrating that the popular books on Luddism are all fairly uncritically pro-Luddist (though not surprising, I suppose). Merchant doesn’t touch at all on the Luddist practice of going around to knitting-shops with women and “discharging them from working”.
4. Sometimes this is described as more troops than were sent to fight Napoleon (even by Merchant himself on page 89), but that isn’t right.
5. In quotes because it was not an official Luddist group (there were none), just people who were trying to stop the violence through lobbying and legislation.
6. Otherwise why would any boss agree, instead of just waiting for the Luddites to do it themselves?
7. As far as I can tell this is true: the Luddists basically had no internal conflict. I think this is because each individual cell knew each other well already, and so handled their disagreements privately (instead of by writing pamphlets), and disagreements between cells didn’t matter that much because they had no need to coordinate.
8. It beats the hell out of the other popular reference, Dune’s Butlerian Jihad, which was two generations of brutal violence followed by the reimposition of the feudal system. (Although, at least the Butlerian Jihad succeeded…)

Sean Goedecke 3 weeks ago

Many anti-AI arguments are conservative arguments

Most anti-AI rhetoric is left-wing coded. Popular criticisms of AI describe it as a tool of techno-fascism, or appeal to predominantly left-wing concerns like carbon emissions, democracy, or police brutality. Anti-AI sentiment is surprisingly bipartisan, but the big anti-AI institutions are labor unions and the progressive wing of the Democrats. This has always seemed weird to me, because the contents of most anti-AI arguments are actually right-wing coded. They’re not necessarily intrinsically right-wing, but they’re the kind of arguments that historically have been made by conservatives, not liberals or leftists. Here are some examples:

- Many AI critics complain that AI steals copyrighted content, but prior to 2023, leftists were largely anti-intellectual-property on principle (either because they’re anti-property, or because they characterize copyright as benefiting huge media corporations and patent trolls).
- A popular anti-AI-art sentiment is that it’s corrosive to the human spirit to consume AI slop: in other words, art just inherently ought to be generated by humans, and using AI thus damages some part of our intangible human soul. Whether you like this argument or not, it’s structurally similar to a whole slate of classic arguments-from-intuition for conservative positions like anti-abortion or anti-homosexuality.
- Weird new technological art has traditionally been championed by the left-wing and dismissed by the right-wing (as inhuman, cheap, or degenerate). But when it comes to AI art, it’s the left-wing making these arguments, and others (not necessarily right-wingers) arguing that AI art can also be a medium of human artistic expression.
- One main worry about AI is that it’s going to take over a lot of jobs. This is a compelling argument! But the left-wing has recently been famously unsympathetic to this same argument around fossil-fuel energy jobs like coal mining, to the point where Biden infamously advised a group of miners in New Hampshire to learn to code 1 . Halting technological progress to preserve jobs is quite literally a “conservative” position.

On top of all that 2 , frontier AI models themselves are quite left-wing. Notwithstanding some real cases of data bias (most infamously Google’s image model miscategorizing dark-skinned humans as “gorillas”), the models reliably espouse left-wing positions. Even Elon Musk’s deliberate attempt to create a right-wing AI in Grok has had mixed success. In 2006, Stephen Colbert joked that “reality has a well-known liberal bias”. If the left-wing were more sympathetic to AI, I think they would be using this as a pro-left argument 3 .

So what happened? A year ago I wrote “Is using AI wrong? A review of six popular anti-AI arguments”. In that post I blame the hard right-wing turn many big tech CEOs made in 2024. That was around the same time that LLMs were emerging in the public consciousness with ChatGPT, so it made sense that AI got tagged as right-wing: after all, the billionaires on TV and Twitter talking about how AI was going to change the world were all the same people who’d just gone all-in on Donald Trump. I still think this is a pretty good explanation - just unfortunate timing - but there are definitely other factors at play.

One obvious factor is the hangover from the pro-crypto mania of 2021 and 2022, where many of the same tech-obsessed folks also posted ugly art and talked about how their technology would change the world forever. Few of these predictions came true (though cryptocurrency has indeed changed the world forever), and it’s understandable that many people viewed AI as a natural continuation of this movement. On top of that, Donald Trump himself has come out strongly pro-AI, both in terms of policy and in terms of actually posting AI art himself. This naturally creates a backlash where anti-Trump people are primed to be even more anti-AI 4 . Here are some more reasons:

- AI has real environmental impact (though this is often wildly overstated, as I say here), and the right-wing is politically committed to downplaying or denying anthropogenic environmental impacts in general.
- When times are tough, it’s easy to blame the hot new thing that everyone is talking about. Because the right-wing is currently ascendant in the US, left-wingers are more inclined to talk about how tough times are.
- The left-wing is over-represented in the kind of “computer jobs” that are under direct threat from AI.
- Being pro-Europe has always been left-wing coded, and Europe has been noticeably slower and more sceptical about AI than the USA.

Let me finally put my cards on the table. I would describe myself as on the left wing, and I’m broadly agnostic about the impact of AI. Like the boring fence-sitter I am, I think it will have a mix of positive and negative effects. In general, I’m unconvinced by the pro-copyright and human-soul-related anti-AI arguments, or by the idea that AI is inherently right-wing, but I’m troubled by the environmental impact and the impact on jobs (which in my view are more classically left-wing positions). Still, I’m curious what will happen when the left-wing flavor of anti-AI rhetoric disappears, which I think it will (as I said at the start, anti-AI sentiment is actually pretty bipartisan). When people start making explicitly right-wing anti-AI arguments, will that cause the left-wing to move a little bit towards supporting AI? Or will right-wing institutions continue to explicitly support AI, allowing anti-AI sentiment to become a wedge issue that the left-wing can exploit to pry away voters? In any case, I don’t think the current state of affairs is particularly stable. In many ways, the dominant anti-AI arguments would fit better in a conservative worldview than in the worldview of their liberal proponents.

I don’t think any did, which is probably for the best - they would have only had a couple of years to break into the industry before hiring collapsed in 2023. ↩
Another point that isn’t quite mainstream enough but that I still want to mention: AI critics often argue that cavalier deployment of AI means that people might take dangerous medical advice instead of simply trusting their doctor. But anyone who’s been close to a person with chronic illness knows that “just trust your doctor” is kind of right-wing-coded itself, and that the left-wing position is very sympathetic to patients who don’t or can’t. In a parallel universe, I can imagine the left-wing arguing that patients need AI to avoid the mistakes of their doctors, not the other way around. ↩
Is it a good argument? I don’t know, actually. The easy counter is that the LLMs are just mirroring the biases in their training data. But you could argue in response that superintelligence is also latent in the training data, and that hill-climbing towards superintelligence also picks up the associated political positions (which just so happen to be left-wing). ↩
I am no fan of Donald Trump, but it doesn’t follow that everything he supports is bad (e.g. the First Step Act). ↩

Sean Goedecke 1 months ago

Programming (with AI agents) as theory building

Back in 1985, computer scientist Peter Naur wrote “Programming as Theory Building” . According to Naur - and I agree with him - the core output of software engineers is not the program itself, but the theory of how the program works . In other words, the knowledge inside the engineer’s mind is the primary artifact of engineering work, and the actual software is merely a by-product of that. This sounds weird, but it’s surprisingly intuitive. Every working programmer knows that you cannot make a change to a program simply by having the code. You first need to read through the code carefully enough to build up a mental model (what Naur calls a “theory”) of what it’s supposed to do and how it does it. Then you make the desired change to your mental model, and only after that can you begin modifying the code. Many people 1 think that this is why LLMs are not good tools for software engineering: because using them means that engineers can skip building Naur theories of the system, and because LLMs themselves are incapable of developing a Naur theory themselves. Let’s take those one at a time. Do AI agents let some engineers avoid building detailed mental models of the systems they work on? Of course! As an extreme example, someone could simply punt every task to the latest GPT or Claude model and build no mental model at all 2 . But even a conscientious developer who uses AI tools will necessarily build a less detailed mental model than someone who does it entirely by hand. This is well-attested by the nascent literature on how AI use impacts learning. And it also just makes obvious sense. The whole point of using AI tools is to offload some of the cognitive effort: to be able to just sketch out some of the fine detail in your mental model, because you’re confident that the AI tool can handle it. 
For instance, you might have a good grasp on what the broad components do in your service, and how the data flows between them, but not the specific detail of how some sub-component is implemented (because you only reviewed that code, instead of writing it). Isn’t this really bad? If you start dropping the implementation details, aren’t you admitting that you don’t really know how your system works? After all, a theory that isn’t detailed enough to tell you what code would need to be written for a particular change is a useless theory, right? I don’t think so. First, it’s simply a fact that every mental model glosses over some fine details . Before LLMs were a thing, it was common to talk about the “breadth of your stack”: roughly, the level of abstraction that your technical mental model could operate at. You might understand every line of code in the system, but what about dependencies? What about the world of Linux abstractions - processes, threads, sockets, syscalls, ports, and buffers? What about the assembly operations that are ultimately performed by your code? It simply can’t be true that giving up any amount of fine detail is a disaster. Second, coding with LLMs teaches you first-hand how important your mental model is . I do a lot of LLM-assisted work, and in general it looks like this: Note that only 10% of agent output is actually making its way into my output . Almost my entire time is spent looking at some piece of agent-generated code or text and trying to figure out whether it fits into my theory of the system. That theory is necessarily a bit less detailed than when I was writing every line of code by hand. But it’s still my theory! If it weren’t, I’d be accepting most of what the agent produced instead of rejecting almost all of it. Can AI agents build their own theories of the system? If not, this would be a pretty good reason not to use them, or to think that any supposed good outcomes are illusory. 
The first reason to think they can is that LLMs clearly do make working changes to codebases. If you think that a theory is essential to make working changes (which is at least plausible), doesn’t that prove that LLMs can build Naur theories? Well, maybe. They could be pattern-matching to Naur theories in the training data that are close enough to sort of work, or they could be able to build local theories which are good enough (as long as you don’t layer too many of them on top of each other). The second reason to think they can is that you can see them doing it . If you read an agent’s logs, they’re full of explicit theory-building 3 : making hypotheses about how the system works, trying to confirm or disprove them, adjusting the hypothesis, and repeating. When I’m trying to debug something, I’m usually racing against one or more AI agents, and sometimes they win . I refuse to believe that you can debug a million-line codebase without theory-building. I think it’s an open question if AI agents can build working theories of any codebase. In my experience, they do a good job with normal-ish applications like CRUD servers, proxies, and other kinds of program that are well-represented in the training data. If you’re doing something truly weird, I can believe they might struggle (though even then it seems at least possible ). Regardless, one big problem with AI agents is that they can’t retain theories of the codebase . They have to build their theory from scratch every time. Of course, documentation can help a little with this, but in Naur’s words, it’s “strictly impossible” to fully capture a theory in documentation. In fact, Naur thought that if all the humans who built a piece of software left, it was unwise to try and construct a theory of the software even from the code itself , and that you should simply rewrite the program from scratch. I think this is overstating it a bit, at least for large programs, but I agree that it’s a difficult task. 
AI agents are permanently in this unfortunate position: forced to construct a theory of the software from scratch, every single time they’re spun up. Given that, it’s kind of a minor miracle that AI agents are as effective as they are. The next big innovation in AI coding agents will probably be some way of allowing agents to build more long-term theories of the codebase: either by allowing them to modify their own weights 4 , or simply supporting contexts long enough so that you can make weeks’ worth of changes in the same agent run, or some other idea I haven’t thought of.

1. I spin off two or three parallel agents to try and answer some question or implement some code
2. As each agent finishes (or I glance over at what it’s doing), I scan its work and make a snap judgement about whether it’s accurately reflecting my mental model of the overall system
3. When it doesn’t - which is about 80% of the time - I either kill the process or I write a quick “no, you didn’t account for X” message
4. I carefully review the 20% of plausible responses against my mental model, do my own poking around the codebase and manual testing/tweaking, and about half of that code will become a PR

This is the most recent (and well-written) example I’ve seen, but it’s a common view. ↩
I have heard of people working like this. Ironically, I think it’s a good thing. The kind of engineer who does this is likely to be improved by becoming a thin wrapper around a frontier LLM (though it’s not great for their career prospects). ↩
I think some people would say here that AI agents simply can’t build any theories at all, because theories are a human-mind thing. These are the people who say that AIs can’t believe anything, or think, or have personalities, and so on. I have some sympathy for this as a metaphysical position, but it just seems obviously wrong as a practical view. If I can see GPT-5.4 testing hypotheses and correctly answering questions about the system, I don’t really care if it’s coming from a “real” theory or some synthetic equivalent. ↩
This is the dream of continuous learning: if what the AI agent learns about the codebase can be somehow encoded in its weights, it can take days or weeks to build its theory instead of mere minutes. ↩
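As a rough sanity check on the percentages in the workflow described in this post (about 80% of agent output rejected on sight, and about half of the remainder turned into a PR), here is a minimal Python sketch of the arithmetic. The function name `merged_share` and its parameters are made up for illustration, and the 80% and 50% inputs are the approximate estimates quoted above, not measurements.

```python
from fractions import Fraction


def merged_share(rejected_on_sight: Fraction, kept_after_review: Fraction) -> Fraction:
    """Return the fraction of agent output that ends up in a PR.

    rejected_on_sight: share of agent output discarded after a snap judgement.
    kept_after_review: share of the remaining (plausible) output that
        survives careful review and manual testing.
    """
    plausible = 1 - rejected_on_sight
    return plausible * kept_after_review


# Hypothetical inputs taken from the rough estimates in the post.
share = merged_share(Fraction(80, 100), Fraction(1, 2))
print(f"{float(share):.0%} of agent output reaches a PR")
```

With those estimates, about a fifth of agent output survives the snap judgement and about half of that survives careful review, which is where the 10% figure comes from.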

Sean Goedecke 2 months ago

Big tech engineers need big egos

It’s a common position among software engineers that big egos have no place in tech 1 . This is understandable - we’ve all worked with some insufferably overconfident engineers who needed their egos checked - but I don’t think it’s correct. In fact, I don’t know if it’s possible to survive as a software engineer in a large tech company without some kind of big ego. However, it’s more complicated than “big egos make good engineers”. The most effective engineers I’ve worked with are simultaneously high-ego in some situations and surprisingly low-ego in others. What’s going on there? Software engineering is shockingly humbling, even for experienced engineers. There’s a reason this joke is so popular: The minute-to-minute experience of working as a software engineer is dominated by not knowing things and getting things wrong . Every time you sit down and write a piece of code, it will have several things wrong with it: some silly things, like missing semicolons, and often some major things, like bugs in the core logic. We spend most of our time fixing our own stupid mistakes. On top of that, even when we’ve been working on a system for years, we still don’t know that much about it. I wrote about this at length in Nobody knows how large software products work , but the reason is that big codebases are just that complicated. You simply can’t confidently answer questions about them without going and doing some research, even if you’re the one who wrote the code. When you have to build something new or fix a tricky problem, it can often feel straight-up impossible to begin, because good software engineers know just how ignorant they are and just how complex the system is. You just have to throw yourself into the blank sea of millions of lines of code and start wildly casting around to try and get your bearings. Software engineers need the kind of ego that can stand up to this environment. 
In particular, they need to have a firm belief that they can figure it out, no matter how opaque the problem seems; that if they just keep trying, they can break through to the pleasant (though always temporary) state of affairs where they understand the system and can see at a glance how bugs can be fixed and new features added 2 . What about the non-technical aspects of the job? Nobody likes working with a big ego, right? Wrong. Every great software engineer I’ve worked with in big tech companies has had a big ego - though as I’ll say below, in some ways these engineers were surprisingly low-ego. You need a big ego to take positions . Engineers love being non-committal about technical questions, because they’re so hard to answer and there’s often a plausible case for either side. However, as I keep saying , engineers have a duty to take clear positions on unclear technical topics, because the alternative is a non-technical decision maker (who knows even less) just taking their best guess. It’s scary to make an educated guess! You know exactly all the reasons you might be wrong. But you have to do it anyway, and ego helps a lot with that. You need a big ego to be willing to make enemies . Getting things done in a large organization means making some people angry. Of course, if you’re making lots of people angry, you’re probably screwing up: being too confrontational or making obviously bad decisions. But if you’re making a large change and one or two people are angry, that’s just life. In big tech companies, any big technical decision will affect a few hundred engineers, and one of them is bound to be unhappy about it. You can’t be so conflict-averse that you let that stop you from doing it, if you believe it’s the right decision. In other words, you have to have the confidence to believe that you’re right and they’re wrong, even though technical decisions always involve unclear tradeoffs and it’s impossible to get absolute certainty. 
You need a big ego to correct incorrect or unclear claims. When I was still in the philosophy world, the Australian logician Graham Priest had a reputation for putting his hand up and stopping presentations when he didn’t understand something that was said, and only allowing the seminar to continue when he felt like he understood. From his perspective, this wasn’t rude: after all, if he couldn’t understand it, the rest of the audience probably couldn’t either, and so he was doing them a favor by forcing a more clear explanation from the speaker. This is obviously a sign of a big ego. It’s also a trait that you need in a large tech company. People often nod and smile their way past incorrect technical claims, even when they suspect they might be wrong - assuming that they’ve just misunderstood and that somebody else will correct it, if it’s truly wrong. If you are the most senior engineer in the room, correcting these claims is your job. If everyone in the room is so pro-social and low-ego that they go along to get along, decisions will get made based on flatly incorrect technical assumptions, projects will get funded that are impossible to complete, and engineers will burn weeks or months of their careers vainly trying to make these projects work. You have to have a big enough ego to think “actually, I think I’m right and everyone in this room is confused”, even when the room is full of directors and VPs. All of this selects for some pretty high-ego engineers. But in order to actually succeed in these roles in large tech companies, you need to have a surprisingly low ego at times. I think this is why really effective big tech engineers are so rare: because it requires such a delicate balance between confidence and diffidence. To be an effective engineer, you need to have a towering confidence in your own ability to solve problems and make decisions, even when people disagree. 
But you also need to be willing to instantly subordinate your ego to the organization, when it asks you to. At the end of the day, your job - the reason the company pays you - is to execute on your boss’s and your boss’s boss’s plans, whether you agree with them or not. Competent software engineers are allowed quite a lot of leeway about how to implement those plans. However, they’re allowed almost no leeway at all about the plans themselves. In my experience, being confused about this is a common cause of burnout 3 . Many software engineers are used to making bold decisions on technical topics and being rewarded for it. Those software engineers then make a bold decision that disagrees with the VP of their organization, get immediately and brutally punished for it, and are confused and hurt. In fact, sometimes you just get punished and there’s nothing you can do. This is an unfortunate fact of how large organizations function: even if you do great technical work and build something really useful, you can fall afoul of a political battle fought three levels above your head, and come away with a worse reputation for it. Nothing to be done! This can be a hard pill to swallow for the high-ego engineers that tend to lead really useful technical projects. You also have to be okay with having your projects cancelled at the last minute. It’s a very common experience in large tech companies that you’re asked to deliver something quickly, you buckle down and get it done, and then right before shipping you’re told “actually, let’s cancel that, we decided not to do it”. This is partly because the decision-making process can be pretty fluid, and partly because many of these asks originate from off-hand comments: the CTO implies that something might be nice in a meeting, the VPs and directors hustle to get it done quickly, and then in the next meeting it becomes clear that the CTO doesn’t actually care, so the project is unceremoniously cancelled 4 . 
Nobody likes to work with a bully, or with someone who refuses to admit when they’re wrong, or with somebody incapable of empathy. But you really do need a strong ego to be an effective software engineer, because software engineering requires you to spend most of your day in a position of uncertainty or confusion. If your ego isn’t strong enough to stand up to that - if you don’t believe you’re good enough to power through - you simply can’t do the job. This is particularly true when it comes to working in a large software company. Many of the tasks you’re required to do (particularly if you’re a senior or staff engineer) require a healthy ego. However, there’s a kind of catch-22 here. If it insults your pride to work on silly projects, or to occasionally “catch a stray bullet” in the organization’s political fights, or to have to shelve a project that you worked hard on and is ready to ship, you’re too high-ego to be an effective software engineer. But if you can’t take firm positions, or if you’re too afraid to make enemies, or you’re unwilling to speak up and correct people, you’re too low-ego. Engineers who are low-ego in general can’t get stuff done, while engineers who are high-ego in general get slapped down by the executives who wield real organizational power. The most successful kind of software engineer is therefore a chameleon: low-ego when dealing with executives, but high-ego when dealing with the rest of the organization 5 .

What do I mean by “ego”, in this context? More or less the colloquial sense of the term: a somewhat irrational self-confidence, a tendency to believe that you’re very important, the sense that you’re the “main character”, that sort of thing. ↩
Why is this “ego”, and not just normal confidence? Well, because of just how murky and baffling software problems feel when you start working on them. You really do need a degree of confidence in yourself that feels unreasonable from the inside. It should be obvious, but I want to explicitly note that you don’t just need ego: you also have to be technically strong enough to actually succeed when your ego powers you through the initial period of self-doubt. ↩
I share the increasingly-common view that burnout is not caused by working too hard, but by hard work unrewarded. That explains why nothing burns you out as hard as being punished for hard work that you expected a reward for. ↩
It’s more or less exactly this scene from Silicon Valley. ↩
This description sounds a bit sociopathic to me. But, on reflection, it’s fairly unsurprising that competent sociopaths do well in large organizations. Whether that kind of behavior is worth emulating or worth avoiding is up to you, I suppose. ↩

Sean Goedecke 2 months ago

Giving LLMs a personality is just good engineering

AI skeptics often argue that current AI systems shouldn’t be so human-like. The idea - most recently expressed in this opinion piece by Nathan Beacom - is that language models should explicitly be tools, like calculators or search engines. Although they can pretend to be people, they shouldn’t, because it encourages users to overestimate AI capabilities and (at worst) slip into AI psychosis . Here’s a representative paragraph from the piece: In sum, so much of the confusion around making AI moral comes from fuzzy thinking about the tools at hand. There is something that Anthropic could do to make its AI moral, something far more simple, elegant, and easy than what Askell is doing. Stop calling it by a human name, stop dressing it up like a person, and don’t give it the functionality to simulate personal relationships, choices, thoughts, beliefs, opinions, and feelings that only persons really possess. Present and use it only for what it is: an extremely impressive statistical tool, and an imperfect one. If we all used the tool accordingly, a great deal of this moral trouble would be resolved. So why do Claude and ChatGPT act like people? According to Beacom, AI labs have built human-like systems because AI lab engineers are trying to hoodwink users into emotionally investing in the models, or because they’re delusional true believers in AI personhood, or some other foolish reason. This is wrong. AI systems are human-like because that is the best way to build a capable AI system . Modern AI models - whether designed for chat, like OpenAI’s GPT-5.2, or designed for long-running agentic work, like Claude Opus 4.6 - do not naturally emerge from their oceans of training data. Instead, when you train a model on raw data, you get a “base model”, which is not very useful by itself. You cannot get it to write an email for you, or proofread your essay, or review your code. The base model is a kind of mysterious gestalt of its training data. 
If you feed it text, it will sometimes continue in that vein, or other times it will start outputting pure gibberish. It has no problem producing code with giant security flaws, or horribly-written English, or racist screeds - all of those things are represented in its training data, after all, and the base model does not judge. It simply outputs. To build a useful AI model, you need to journey into the wild base model and stake out a region that is amenable to human interests: both ethically, in the sense that the model won’t abuse its users, and practically, in the sense that it will produce correct outputs more often than incorrect ones. What this means in practice is that you have to give the model a personality during post-training 1 . Human beings are capable of almost any action at any time. But we only take a tiny subset of those actions, because that’s the kind of people we are. I could throw my cup of coffee all over the wall right now, but I don’t, because I’m not the kind of person who needlessly makes a mess 2 . AI systems are the same. Claude could respond to my question with incoherent racist abuse - the base model is more than capable of those outputs - but it doesn’t, because that’s not the kind of “person” it is. In other words, human-like personalities are not imposed on AI tools as some kind of marketing ploy or philosophical mistake. Those personalities are the medium via which the language model can become useful at all. This is why it’s surprisingly tricky to “just” change a language model’s personality or opinions: because you’re navigating through the near-infinite manifold of the base model. You may be able to control which direction you go, but you can’t control what you find there 3 . When AI people talk about LLMs having personalities, or wanting things, or even having souls 4 , these are technical terms, like the “memory” of a computer or the “transmission” of a car. 
You simply cannot build a capable AI system that “just acts like a tool”, because the model is trained on humans writing to and about other humans. You need to prime it with some kind of personality (ideally that of a useful, friendly assistant) so it can pull from the helpful parts of its training data instead of the horrible parts.

1. This is all pretty well understood in the AI space. Anthropic wrote a recent paper about it where they cite similar positions going all the way back to 2022. But for some reason it’s not yet penetrated into communities that are more skeptical of AI.
2. You could explain this in terms of “the stories we tell ourselves”. Many people (though not all) think that human identities are narratively constructed.
3. I wrote about this last year in Mecha-Hitler, Grok, and why it’s so hard to give LLMs the right personality. A little nudge to change Grok’s views on South African internal politics can cause it to start calling itself “Mecha-Hitler”.
4. I have long believed that Claude “feels better” to use than ChatGPT because it has a more coherent persona (due mainly to Amanda Askell’s work on its “soul”). My guess is that if you tried to make a “less human” version of Claude, it would become rapidly less capable.

Sean Goedecke 2 months ago

Insider amnesia

Speculation about what’s really going on inside a tech company is almost always wrong. When some problem with your company is posted on the internet, and you read people’s thoughts on it, their thoughts are almost always ridiculous. For instance, they might blame product managers for a particular decision, when in fact the decision in question was engineering-driven and the product org was pushing back on it. Or they might attribute an incident to overuse of AI, when the system in question was largely written pre-AI-coding and unedited since. You just don’t know what the problem is unless you’re on the inside. But when some other company has a problem on the internet, it’s very tempting to jump in with your own explanations. After all, you’ve seen similar things in your own career. How different can it really be? Very different, as it turns out. This is especially true for companies that are unusually big or small. The recent kerfuffle over some bad GitHub Actions code is a good example of this - many people just seemed to have no mental model about how a large tech company can produce bad code, because their mental model of writing code is something like “individual engineer maintaining an open-source project for ten years”, or “tiny team of experts who all swarm on the same problem”, or something else that has very little to do with how large tech companies produce software 1 . I’m sure the same thing happens when big-tech or medium-tech people give opinions about how tiny startups work. The obvious reference here is to “Gell-Mann amnesia” , which is about the general pattern of experts correctly disregarding bad sources in their fields of expertise, but trusting those same sources on other topics. But I’ve taken to calling this “insider amnesia” to myself, because it applies even to experts who are writing in their own areas of expertise - it’s simply the fact that they’re outsiders that’s causing them to stumble. 
1. I wrote about this at length in How good engineers write bad code at big companies.

Sean Goedecke 2 months ago

LLM-generated skills work, if you generate them afterwards

LLM “skills” are a short explanatory prompt for a particular task, typically bundled with helper scripts. A recent paper showed that while skills are useful to LLMs, LLM-authored skills are not. From the abstract: Self-generated skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming For the moment, I don’t really want to dive into the paper. I just want to note that the way the paper uses LLMs to generate skills is bad, and you shouldn’t do this. Here’s how the paper prompts a LLM to produce skills: Before attempting to solve this task, please follow these steps: 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed. 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks. 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name. 4. Then solve the task using the skills you created as reference The key idea here is that they’re asking the LLM to produce a skill before it starts on the task. It’s essentially a strange version of the “make a plan first” or “think step by step” prompting strategy. I’m not at all surprised that this doesn’t help, because current reasoning models already think carefully about the task before they begin. What should you do instead? You should ask the LLM to write up a skill after it’s completed the task . Obviously this isn’t useful for truly one-off tasks. But few tasks are truly one-off. For instance, I’ve recently been playing around with SAEs and trying to clamp features in open-source models, a la Golden Gate Claude . It took a while for Codex to get this right. 
Here are some things it had to figure out:

- Extracting features from the final layernorm is too late - you may as well just boost individual logits during sampling
- You have to extract from about halfway through the model layers to get features that can be usefully clamped
- Training a SAE on ~10k activations is two OOMs too few to get useful features. You need to train until features account for >50% of variance

Once I was able (with Codex’s help) to clamp an 8B model and force it to obsess about a subject 1, I then asked Codex to summarize the process into an agent skill 2. That worked great! I was able to spin up a brand-new Codex instance with that skill and immediately get clamping working on a different 8B model. But if I’d asked Codex to write the skill at the start, it would have baked in all of its incorrect assumptions (like extracting from the final layernorm), and the skill wouldn’t have helped at all.

In other words, the purpose of LLM-generated skills is to get the model to distil the knowledge it’s gained by iterating on the problem for millions of tokens, not to distil the knowledge it already has from its training data. You can get a LLM to generate skills for you, so long as you do it after the LLM has already solved the problem the hard way.

1. If you’re interested, it was “going to the movies”.
2. I’ve pushed it up here. I’m sure you could do much better for a feature-extraction skill; this was just my zero-effort Codex-only attempt.

Sean Goedecke 2 months ago

Two different tricks for fast LLM inference

Anthropic and OpenAI both recently announced “fast mode”: a way to interact with their best coding model at significantly higher speeds. These two versions of fast mode are very different. Anthropic’s offers up to 2.5x tokens per second (so around 170, up from Opus 4.6’s 65). OpenAI’s offers more than 1000 tokens per second (up from GPT-5.3-Codex’s 65 tokens per second, so 15x). So OpenAI’s fast mode is six times faster than Anthropic’s 1 . However, Anthropic’s big advantage is that they’re serving their actual model. When you use their fast mode, you get real Opus 4.6, while when you use OpenAI’s fast mode you get GPT-5.3-Codex-Spark, not the real GPT-5.3-Codex. Spark is indeed much faster, but is a notably less capable model: good enough for many tasks, but it gets confused and messes up tool calls in ways that vanilla GPT-5.3-Codex would never do. Why the differences? The AI labs aren’t advertising the details of how their fast modes work, but I’m pretty confident it’s something like this: Anthropic’s fast mode is backed by low-batch-size inference, while OpenAI’s fast mode is backed by special monster Cerebras chips . Let me unpack that a bit. The tradeoff at the heart of AI inference economics is batching , because the main bottleneck is memory . GPUs are very fast, but moving data onto a GPU is not. Every inference operation requires copying all the tokens of the user’s prompt 2 onto the GPU before inference can start. Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full. A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus . But obviously overall throughput would be much lower, because people would be waiting at the bus stop for hours until they managed to actually get on one. 
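To make the bus analogy concrete, here is a minimal toy cost model. It assumes each decoding step pays a fixed cost to move data onto the GPU plus a small per-user compute cost; the function names and both timing constants are hypothetical numbers picked for illustration, not measurements of any real inference stack.

```python
# Toy model of the batching tradeoff. All numbers are made up for illustration.

def decode_step_seconds(batch_size, load_s=0.010, per_user_compute_s=0.0005):
    """Time for one decoding step that produces one token for every user in the batch.

    load_s: hypothetical fixed cost of moving data onto the GPU (paid once per step).
    per_user_compute_s: hypothetical extra compute cost for each user in the batch.
    """
    return load_s + batch_size * per_user_compute_s


def per_user_tokens_per_s(batch_size):
    # Each user receives one token per step, so bigger batches mean slower steps.
    return 1.0 / decode_step_seconds(batch_size)


def total_tokens_per_s(batch_size):
    # The GPU as a whole emits batch_size tokens per step.
    return batch_size / decode_step_seconds(batch_size)


if __name__ == "__main__":
    for b in (1, 8, 64):
        print(f"batch={b:3d}  per-user={per_user_tokens_per_s(b):7.1f} tok/s  "
              f"total={total_tokens_per_s(b):7.1f} tok/s")
```

Under these assumptions a batch of one gives each user the fastest stream but wastes most of the GPU, while a large batch does the opposite. This sketch ignores the time spent waiting for a batch to fill, which only makes large batches slower still for the individual user.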
Anthropic’s fast mode offering is basically a bus pass that guarantees that the bus immediately leaves as soon as you get on. It’s six times the cost, because you’re effectively paying for all the other people who could have got on the bus with you, but it’s way faster 3 because you spend zero time waiting for the bus to leave. Obviously I can’t be fully certain this is right. Maybe they have access to some new ultra-fast compute that they’re running this on, or they’re doing some algorithmic trick nobody else has thought of. But I’m pretty sure this is it. Brand new compute or algorithmic tricks would likely require changes to the model (see below for OpenAI’s system), and “six times more expensive for 2.5x faster” is right in the ballpark for the kind of improvement you’d expect when switching to a low-batch-size regime. OpenAI’s fast mode does not work anything like this. You can tell that simply because they’re introducing a new, worse model for it. There would be absolutely no reason to do that if they were simply tweaking batch sizes. Also, they told us in the announcement blog post exactly what’s backing their fast mode: Cerebras. OpenAI announced their Cerebras partnership a month ago in January. What’s Cerebras? They build “ultra low-latency compute”. What this means in practice is that they build giant chips . A H100 chip (fairly close to the frontier of inference chips) is just over a square inch in size. A Cerebras chip is 70 square inches. You can see from pictures that the Cerebras chip has a grid-and-holes pattern all over it. That’s because silicon wafers this big are supposed to be broken into dozens of chips. Instead, Cerebras etches a giant chip over the entire thing. The larger the chip, the more internal memory it can have. The idea is to have a chip with SRAM large enough to fit the entire model , so inference can happen entirely in-memory. Typically GPU SRAM is measured in the tens of megabytes . 
That means that a lot of inference time is spent streaming portions of the model weights from outside of SRAM into the GPU compute 4. If you could stream all of that from the (much faster) SRAM, inference would get a big speedup: fifteen times faster, as it turns out!

So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That’s why they’re offering a brand new model, and why the Spark model has a bit of “small model smell” to it: it’s a smaller distil of the much larger GPT-5.3-Codex model 5.

It’s interesting that the two major labs have two very different approaches to building fast AI inference. If I had to guess at a conspiracy theory, it would go something like this:

- OpenAI partner with Cerebras in mid-January, obviously to work on putting an OpenAI model on a fast Cerebras chip
- Anthropic have no similar play available, but they know OpenAI will announce some kind of blazing-fast inference in February, and they want to have something in the news cycle to compete with that
- Anthropic thus hustle to put together the kind of fast inference they can provide: simply lowering the batch size on their existing inference stack
- Anthropic (probably) wait until a few days before OpenAI are done with their much more complex Cerebras implementation to announce it, so it looks like OpenAI copied them

Obviously OpenAI’s achievement here is more technically impressive. Getting a model running on Cerebras chips is not trivial, because they’re so weird. Training a 20B or 40B param distil of GPT-5.3-Codex that is still kind-of-good-enough is not trivial. But I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI’s mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens.

Seeing the two major labs put out this feature might make you think that fast AI inference is the new major goal they’re chasing. I don’t think it is. If my theory above is right, Anthropic don’t care that much about fast inference, they just didn’t want to appear behind OpenAI. And OpenAI are mainly just exploring the capabilities of their new Cerebras partnership. It’s still largely an open question what kind of models can fit on these giant chips, how useful those models will be, and if the economics will make any sense.

I personally don’t find “fast, less-capable inference” particularly useful. I’ve been playing around with it in Codex and I don’t like it. The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model 6.

However, it’s certainly possible that fast, less-capable inference becomes a core lower-level primitive in AI systems. Claude Code already uses Haiku for some operations. Maybe OpenAI will end up using Spark in a similar way.

1. This isn’t even factoring in latency. Anthropic explicitly warns that time to first token might still be slow (or even slower), while OpenAI thinks the Spark latency is fast enough to warrant switching to a persistent websocket (i.e. they think the 50-200ms round trip time for the handshake is a significant chunk of time to first token).
2. Either in the form of the KV-cache for previous tokens, or as some big tensor of intermediate activations if inference is being pipelined through multiple GPUs. I write a lot more about this in Why DeepSeek is cheap at scale but expensive to run locally, since it explains why DeepSeek can be offered at such cheap prices (massive batches allow an economy of scale on giant expensive GPUs, but individual consumers can’t access that at all).
3. Is it a contradiction that low-batch-size means low throughput, but this fast pass system gives users much greater throughput? No. The overall throughput of the GPU is much lower when some users are using “fast mode”, but those users’ throughput is much higher.
4. Remember, GPUs are fast, but copying data onto them is not. Each “copy these weights to GPU” step is a meaningful part of the overall inference time.
5. Or a smaller distil of whatever more powerful base model GPT-5.3-Codex was itself distilled from. I don’t know how AI labs do it exactly, and they keep it very secret. More on that here.
6. On this note, it’s interesting to point out that Cursor’s hype dropped away basically at the same time they released their own “much faster, a little less-capable” agent model. Of course, much of this is due to Claude Code sucking up all the oxygen in the room, but having a very fast model certainly didn’t help.
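The back-of-envelope numbers in this post can be checked with a few lines of arithmetic. This is a rough sketch only: it treats 1GB as 10^9 bytes and pretends the whole 44GB is available for weights, so the results are upper bounds rather than real model sizes (the KV-cache and activations need room too, which is why the post’s ~20B and ~40B figures sit below them). The helper name is hypothetical.

```python
# Back-of-envelope checks for the figures quoted in the post.

BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def max_params_billions(memory_gb, bytes_per_param):
    """Upper bound on parameter count (in billions) if memory held only weights."""
    return memory_gb * BYTES_PER_GB / bytes_per_param / 1e9


fp16_limit = max_params_billions(44, 2)  # fp16 stores 2 bytes per parameter -> 22.0
int8_limit = max_params_billions(44, 1)  # int8 stores 1 byte per parameter -> 44.0

# Speed ratios, using the tokens-per-second figures quoted in the post.
baseline_tps = 65
openai_speedup = 1000 / baseline_tps        # roughly 15x
anthropic_fast_tps = 2.5 * baseline_tps     # 162.5, i.e. "around 170"

if __name__ == "__main__":
    print(f"fp16 limit: {fp16_limit:.0f}B params, int8 limit: {int8_limit:.0f}B params")
    print(f"OpenAI speedup: {openai_speedup:.1f}x")
    print(f"Anthropic fast mode: {anthropic_fast_tps:.1f} tok/s")
```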

Sean Goedecke 3 months ago

On screwing up

The most shameful thing I did in the workplace was lie to a colleague. It was about ten years ago, I was a fresh-faced intern, and in the rush to deliver something I’d skipped the step of testing my work in staging 1 . It did not work. When deployed to production, it didn’t work there either. No big deal, in general terms: the page we were working on wasn’t yet customer-facing. But my colleague asked me over his desk whether this worked when I’d tested it, and I said something like “it sure did, no idea what happened”. I bet he forgot about it immediately. I could have just messed up the testing (for instance, by accidentally running some different code than the code I pushed), or he knew I’d probably lied, and didn’t really care. I haven’t forgotten about it. Even a decade later, I’m still ashamed to write it down. Of course I’m not ashamed about the mistake . I was sloppy to not test my work, but I’ve cut corners since then when I felt it was necessary, and I stand by that decision. I’m ashamed about how I handled it. But even that I understand. I was a kid, trying to learn quickly and prove I belonged in tech. The last thing I wanted to do was to dwell on the way I screwed up. If I were in my colleague’s shoes now, I’d have brushed it off too 2 . How do I try to handle mistakes now? The most important thing is to control your emotions . If you’re anything like me, your strongest emotional reactions at work will be reserved for the times you’ve screwed up. There are usually two countervailing emotions at play here: the desire to defend yourself, find excuses, and minimize the consequences; and the desire to confess your guilt, abase yourself, and beg for forgiveness. Both of these are traps. Obviously making excuses for yourself (or flat-out denying the mistake, like I did) is bad. But going in the other direction and publicly beating yourself up about it is just as bad . It’s bad for a few reasons. 
First, you’re effectively asking the people around you to take the time and effort to reassure you, when they should be focused on the problem. Second, you’re taking yourself out of the group of people who are focused on the problem, when often you’re the best situated to figure out what to do: since it’s your mistake, you have the most context. Third, it’s just not professional. So what should you do? For the first little while, do nothing . Emotional reactions fade over time. Try and just ride out the initial jolt of realizing you screwed up, and the impulse to leap into action to fix it. Most of the worst reactions to screwing up happen in the immediate aftermath, so if you can simply do nothing during that period you’re already off to a good start. For me, this takes about thirty seconds. How much time you’ll need depends on you, but hopefully it’s under ten minutes. More than that and you might need to grit your teeth and work through it. Once you’re confident you’re under control, the next step is to tell people what happened . Typically you want to tell your manager, but depending on the problem it could also be a colleague or someone else. It’s really important here to be matter-of-fact about it, or you risk falling into the “I’m so terrible, please reassure me” trap I discussed above. You often don’t even need to explicitly say “I made a mistake”, if it’s obvious from context. Just say “I deployed a change and it’s broken X feature” (or whatever the problem is). You should do this before you’ve come up with a solution. It’s tempting to try to conceal your mistake and just quietly solve it. But for user-facing mistakes, concealment is impossible - somebody will raise a ticket eventually - and if you don’t communicate the issue, you risk someone else discovering it and independently raising it. In the worst case, while you’re quietly working on a fix, you’ll discover that somebody else has declared an incident. 
Of course, you understand the problem perfectly (since you caused it), and you know that it was caused by a bad deploy and is easily fixable. But the other people on the incident call don’t know all that. They’re thinking about the worst-case scenarios, wondering if it’s database or network-related, paging in all kinds of teams, causing all kinds of hassle. All of that could have been avoided if you had reported the issue immediately. In my experience, tech company managers will forgive mistakes 3 , but they won’t forgive being made to look like a fool . In particular, they won’t forgive being deprived of critical information. If they’re asked to explain the incident by their boss, and they have to flounder around because they lack the context that you had all along , that may harm your relationship with them for good. On the other hand, if you give them a clear summary of the problem right away, and they’re able to seem like they’re on top of things to their manager, you might even earn credit for the situation (despite having caused it with your initial mistake). However, you probably won’t earn credit. This is where I diverge from the popular software engineering wisdom that incidents are always the fault of systems, never of individuals. Of course incidents are caused by the interactions of complex systems. Everything in the universe is caused by the interactions of complex systems! But one cause in that chain is often somebody screwing up 4 . If you’re a manager of an engineering organization, and you want a project to succeed, you probably have a mental shortlist of the engineers in your org who can reliably lead projects 5 . If an engineer screws up repeatedly, they’re likely to drop off that list (or at least get an asterisk next to their name). It doesn’t really matter if you had a good technical reason to make the mistake, or if it’s excusable. 
Managers don’t care about that stuff, because they simply don’t have the technical context to know if it’s true or if you’re just trying to talk your way out of it. What managers do have the context to evaluate is results, so that’s what they judge you on. That means some failures are acceptable, so long as you’ve got enough successes to balance them out.

Being a strong engineer is about finding a balance between always being right and taking risks. If you prioritize always being right, you can probably avoid making mistakes, but you won’t be able to lead projects (since that always requires taking risks). Therefore, the optimal amount of mistakes at work is not zero. Unless you’re working in a few select industries 6, you should expect to make mistakes now and then; otherwise you’re likely working far too slowly.

1. From memory, I think I had tested an earlier version of the code, but then I made some tweaks and skipped the step where I tested that it worked even with those tweaks.
2. Though I would have made a mental note (and if someone more senior had done this, I would have been a bit less forgiving).
3. Though they may not forget them. More on that later.
4. It’s probably not that comforting to replace “you screwed up by being incompetent” with “it’s not your fault, it’s the system’s fault for hiring an engineer as incompetent as you”.
5. For more on that, see How I ship projects at large tech companies.
6. The classic examples are pacemakers and the Space Shuttle (should that now be Starship/New Glenn?).

Sean Goedecke 3 months ago

Large tech companies don't need heroes

Large tech companies operate via systems . What that means is that the main outcomes - up to and including the overall success or failure of the company - are driven by a complex network of processes and incentives. These systems are outside the control of any particular person. Like the parts of a large codebase, they have accumulated and co-evolved over time, instead of being designed from scratch. Some of these processes and incentives are “legible”, like OKRs or promotion criteria. Others are “illegible”, like the backchannel conversations that usually precede a formal consensus on decisions 1 . But either way, it is these processes and incentives that determine what happens, not any individual heroics . This state of affairs is not efficient at producing good software. In large tech companies, good software often seems like it is produced by accident , as a by-product of individual people responding to their incentives. However, that’s just the way it has to be. A shared belief in the mission can cause a small group of people to prioritize good software over their individual benefit, for a little while. But thousands of engineers can’t do that for decades. Past a certain point of scale 2 , companies must depend on the strength of their systems. Individual engineers often react to this fact with horror. After all, they want to produce high-quality software. Why is everyone around them just cynically 3 focused on their own careers? On top of that, many software engineers got into the industry because they are internally compelled 4 to make systems more efficient. For these people, it is viscerally uncomfortable being employed in an inefficient company. They are thus prepared to do whatever it takes to patch up their system’s local inefficiencies. Of course, making your team more effective does not always require heroics. 
Some amount of fixing inefficiencies - improving process, writing tests, cleaning up old code - is just part of the job, and will get engineers rewarded and promoted just like any other kind of engineering work. But there’s a line. Past a certain point, working on efficiency-related stuff instead of your actual projects will get you punished, not rewarded. To go over that line requires someone willing to sacrifice their own career progression in the name of good engineering. In other words, it requires a hero . You can sacrifice your promotions and bonuses to make one tiny corner of the company hum along nicely for a while. However, like I said above, the overall trajectory of the company is almost never determined by one person. It doesn’t really matter how efficient you made some corner of the Google Wave team if the whole product was doomed. And even poorly-run software teams can often win, so long as they’re targeting some niche that the company is set up to support (think about the quality of most profitable enterprise software). On top of that, heroism makes it difficult for real change to happen . If a company is set up to reward bad work and punish good work, having some hero step up to do good work anyway and be punished will only insulate the company from the consequences of its own systems . Far better to let the company be punished for its failings, so it can (slowly, slowly) adjust, or be replaced by companies that operate better. Large tech companies don’t benefit long-term from heroes, but there’s still a role for heroes. That role is to be exploited . There are no shortage of predators who will happily recruit a hero for some short-term advantage. Some product managers keep a mental list of engineers in other teams who are “easy targets”: who can be convinced to do extra work on projects that benefit the product manager (but not that engineer). 
During high-intensity periods, such as the lead-up to a major launch, there is sometimes a kind of cold war between different product organizations, as they try to extract behind-the-scenes help from the engineers in each other’s camps while jealously guarding their own engineering resources. Likewise, some managers have no problem letting one of their engineers spend all their time on glue work . Much of that work would otherwise be the manager’s responsibility, so it makes the manager’s job easier. Of course, when it comes time for promotions, the engineer will be punished for not doing their real work. This is why it’s important for engineers to pay attention to their actual rewards. Promotions, bonuses and raises are the hard currency of software companies. Giving those out shows what the company really values. Predators don’t control those things (if they did, they wouldn’t be predators). As a substitute, they attempt to appeal to a hero’s internal compulsion to be useful or to clean up inefficiencies. Large tech companies are structurally set up to encourage software engineers to engage in heroics A background level of inefficiency is just part of the landscape of large tech companies I write about this point at length in Seeing like a software company . Why do companies need to scale, if it means they become less efficient? The best piece on this is Dan Luu’s I could build that in a weekend! : in short, because the value of marginal features in a successful software product is surprisingly high, and you need a lot of developers to capture all the marginal features. For a post on why this is not actually that cynical, see my Software engineers should be a little bit cynical . I write about these internal compulsions in I’m addicted to being useful . 
- Large tech companies are structurally set up to encourage software engineers to engage in heroics
- This is largely accidental, and doesn’t really benefit those tech companies in the long term, since large tech companies are just too large to be meaningfully moved by individual heroics
- However, individual managers and product managers inside these tech companies have learned to exploit this surplus heroism for their individual ends
- As a software engineer, you should resist the urge to heroically patch some obvious inefficiency you see in the organization
- Unless that work is explicitly rewarded by the company, all your efforts will do is delay the point at which the company has to change its processes
- A background level of inefficiency is just part of the landscape of large tech companies
- It’s the price they pay to be so large (and in return reap the benefits of scale and legibility)
- The more you can learn to live with it, the more you’ll be able to use your energy tactically for your own benefit

1. I write about this point at length in Seeing like a software company.
2. Why do companies need to scale, if it means they become less efficient? The best piece on this is Dan Luu’s I could build that in a weekend!: in short, because the value of marginal features in a successful software product is surprisingly high, and you need a lot of developers to capture all the marginal features.
3. For a post on why this is not actually that cynical, see my Software engineers should be a little bit cynical.
4. I write about these internal compulsions in I’m addicted to being useful.

Sean Goedecke 3 months ago

How does AI impact skill formation?

Two days ago, the Anthropic Fellows program released a paper called How AI Impacts Skill Formation . Like other papers on AI before it, this one is being treated as proof that AI makes you slower and dumber. Does it prove that? The structure of the paper is sort of similar to the 2025 MIT study Your Brain on ChatGPT . They got a group of people to perform a cognitive task that required learning a new skill: in this case, the Python Trio library. Half of those people were required to use AI and half were forbidden from using it. The researchers then quizzed those people to see how much information they retained about Trio. The banner result was that AI users did not complete the task faster, but performed much worse on the quiz . If you were so inclined, you could naturally conclude that any perceived AI speedup is illusory, and the people who are using AI tooling are cooking their brains. But I don’t think that conclusion is reasonable. To see why, let’s look at Figure 13 from the paper: The researchers noticed half of the AI-using cohort spent most of their time literally retyping the AI-generated code into their solution, instead of copy-pasting or “manual coding”: writing their code from scratch with light AI guidance. If you ignore the people who spent most of their time retyping, the AI-users were 25% faster. I confess that this kind of baffles me. What kind of person manually retypes AI-generated code? Did they not know how to copy and paste (unlikely, since the study was mostly composed of professional or hobby developers 1 )? It certainly didn’t help them on the quiz score. The retypers got the same (low) scores as the pure copy-pasters. In any case, if you know how to copy-paste or use an AI agent, I wouldn’t use this paper as evidence that AI will not be able to speed you up. Even if AI use offers a 25% speedup, is that worth sacrificing the opportunity to learn new skills? What about the quiz scores? 
Well, first we should note that the AI users who used the AI for general questions but wrote all their own code did fine on the quiz . If you look at Figure 13 above, you can see that those AI users averaged maybe a point lower on the quiz - not bad, for people working 25% faster. So at least some kinds of AI use seem fine. But of course much current AI use is not like this: if you’re using Claude Code or Copilot agent mode, you’re getting the AI to do the code writing for you. Are you losing key skills by doing that? Well yes, of course you are. If you complete a task in ten minutes by throwing it at a LLM, you will learn much less about the codebase than if you’d spent an hour doing it by hand. I think it’s pretty silly to deny this: it’s intuitively right, and anybody who has used AI agents extensively at work can attest to it from their own experience. Still, I have two points to make about this. First, software engineers are not paid to learn about the codebase . We are paid to deliver business value (typically by delivering working code). If AI can speed that up dramatically, avoiding it makes you worse at your job, even if you’re learning more efficiently. That’s a bit unfortunate for us - it was very nice when we could get much better at the job simply by doing it more - but that doesn’t make it false. Other professions have been dealing with this forever. Doctors are expected to spend a lot of time in classes and professional development courses, learning how to do their job in other ways than just doing it. It may be that future software engineers will need to spend 20% of their time manually studying their codebases: not just in the course of doing some task (which could be far more quickly done by AI agents) but just to stay up-to-date enough that their skills don’t atrophy. The other point I wanted to make is that even if your learning rate is slower, moving faster means you may learn more overall . 
Suppose using AI meant that you learned only 75% as much as non-AI programmers from any given task. Whether you’re learning less overall depends on how many more tasks you’re doing . If you’re working faster, the loss of learning efficiency may be balanced out by volume. I don’t know if this is true. I suspect there really is no substitute for painstakingly working through a codebase by hand. But the engineer who is shipping 2x as many changes is probably also learning things that the slower, manual engineer does not know. At minimum, they’ll be acquiring a greater breadth of knowledge of different subsystems, even if their depth suffers. Anyway, the point is simply that a lower learning rate does not by itself prove that less learning is happening overall. Finally, I will reluctantly point out that the model used for this task was GPT-4o (see section 4.1). I’m reluctant here because I sympathize with the AI skeptics, who are perpetually frustrated by the pro-AI response of “well, you just haven’t tried the right model”. In a world where new AI models are released every month or two, demanding that people always study the best model makes it functionally impossible to study AI use at all. Still, I’m just kind of confused about why GPT-4o was chosen. This study was funded by Anthropic, who have much better models. This study was conducted in 2025 2 , at least six months after the release of GPT-4o (that’s like five years in AI time). I can’t help but wonder if the AI-users cohort would have run into fewer problems with a more powerful model. I don’t have any real problem with this paper. They set out to study how different patterns of AI use affect learning, and their main conclusion - that pure “just give the problem to the model” AI use means you learn a lot less - seems correct to me. I don’t like their conclusion that AI use doesn’t speed you up, since it relies on the fact that 50% of their participants spent their time literally retyping AI code . 
I wish they’d been more explicit in the introduction that this was the case, but I don’t really blame them for the result - I’m more inclined to blame the study participants themselves, who should have known better. Overall, I don’t think this paper provides much new ammunition to the AI skeptic. Like I said above, it doesn’t support the point that AI speedup is a mirage. And the point it does support (that AI use means you learn less) is obvious. Nobody seriously believes that typing “build me a todo app” into Claude Code means you’ll learn as much as if you built it by hand. That said, I’d like to see more investigation into long-term patterns of AI use in tech companies. Is the slower learning rate per-task balanced out by the higher rate of task completion? Can it be replaced by carving out explicit time to study the codebase? It’s probably too early to answer these questions - strong coding agents have only been around for a handful of months - but the answers may determine what it’s like to be a software engineer for the next decade.

1. See Figure 17.
2. I suppose the study doesn’t say that explicitly, but the Anthropic Fellows program was only launched in December 2024, and the paper was published in January 2026.
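To make the volume argument from this post concrete, here is a minimal sketch. It assumes a deliberately crude model in which total learning is just tasks completed multiplied by learning per task. That model and the function names are illustrative assumptions, not anything the paper measures. The 75% figure is the hypothetical used earlier in the post, and the 2x figure is the engineer shipping twice as many changes.

```python
def break_even_speedup(relative_learning_rate: float) -> float:
    """How many times more tasks you must complete to learn as much overall.

    relative_learning_rate is learning per task relative to a non-AI
    programmer (0.75 means "75% as much per task"). Assumes learning
    simply adds up linearly across tasks, which is a big simplification.
    """
    if not 0 < relative_learning_rate <= 1:
        raise ValueError("relative_learning_rate must be in (0, 1]")
    return 1 / relative_learning_rate


def relative_total_learning(speedup: float, relative_learning_rate: float) -> float:
    """Total learning relative to a manual programmer, under the same linear model."""
    return speedup * relative_learning_rate


if __name__ == "__main__":
    # Learning 75% as much per task breaks even at about 1.33x the task volume.
    print(round(break_even_speedup(0.75), 2))   # 1.33
    # Shipping 2x as many changes at 75% per-task learning gives 1.5x total.
    print(relative_total_learning(2.0, 0.75))   # 1.5
```

Under these assumptions the lower learning rate is offset once you complete about a third more tasks. Whether real learning adds up this way is exactly the open question.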

Sean Goedecke 3 months ago

How I estimate work as a staff software engineer

There’s a kind of polite fiction at the heart of the software industry. It goes something like this: Estimating how long software projects will take is very hard, but not impossible. A skilled engineering team can, with time and effort, learn how long it will take for them to deliver work, which will in turn allow their organization to make good business plans. This is, of course, false. As every experienced software engineer knows, it is not possible to accurately estimate software projects. The tension between this polite fiction and its well-understood falseness causes a lot of strange activity in tech companies. For instance, many engineering teams estimate work in t-shirt sizes instead of time, because it just feels too obviously silly to the engineers in question to give direct time estimates. Naturally, these t-shirt sizes are immediately translated into hours and days when the estimates make their way up the management chain. Alternatively, software engineers who are genuinely trying to give good time estimates have ridiculous heuristics like “double your initial estimate and add 20%”. This is basically the same as giving up and saying “just estimate everything at a month”. Should tech companies just stop estimating? One of my guiding principles is that when a tech company is doing something silly, they’re probably doing it for a good reason. In other words, practices that appear to not make sense are often serving some more basic, illegible role in the organization. So what is the actual purpose of estimation, and how can you do it well as a software engineer? Before I get into that, I should justify my core assumption a little more. People have written a lot about this already, so I’ll keep it brief. I’m also going to concede that sometimes you can accurately estimate software work, when that work is very well-understood and very small in scope.
For instance, if I know it takes half an hour to deploy a service 1 , and I’m being asked to update the text in a link, I can accurately estimate the work at something like 45 minutes: five minutes to push the change up, ten minutes to wait for CI, thirty minutes to deploy. For most of us, the majority of software work is not like this. We work on poorly-understood systems and cannot predict exactly what must be done in advance. Most programming in large systems is research : identifying prior art, mapping out enough of the system to understand the effects of changes, and so on. Even for fairly small changes, we simply do not know what’s involved in making the change until we go and look. The pro-estimation dogma says that these questions ought to be answered during the planning process, so that each individual piece of work being discussed is scoped small enough to be accurately estimated. I’m not impressed by this answer. It seems to me to be a throwback to the bad old days of software architecture , where one architect would map everything out in advance, so that individual programmers simply had to mechanically follow instructions. Nobody does that now, because it doesn’t work: programmers must be empowered to make architectural decisions, because they’re the ones who are actually in contact with the code 2 . Even if it did work, that would simply shift the impossible-to-estimate part of the process backwards, into the planning meeting (where of course you can’t write or run code, which makes it near-impossible to accurately answer the kind of questions involved). In short: software engineering projects are not dominated by the known work, but by the unknown work, which always takes 90% of the time. However, only the known work can be accurately estimated. It’s therefore impossible to accurately estimate software projects in advance. Estimates do not help engineering teams deliver work more efficiently. 
Many of the most productive years of my career were spent on teams that did no estimation at all: we were either working on projects that had to be done no matter what, and so didn’t really need an estimate, or on projects that would deliver a constant drip of value as we went, so we could just keep going indefinitely 3 . In a very real sense, estimates aren’t even made by engineers at all . If an engineering team comes up with a long estimate for a project that some VP really wants, they will be pressured into lowering it (or some other, more compliant engineering team will be handed the work). If the estimate on an undesirable project - or a project that’s intended to “hold space” for future unplanned work - is too short, the team will often be encouraged to increase it, or their manager will just add a 30% buffer. One exception to this is projects that are technically impossible, or just genuinely prohibitively difficult. If a manager consistently fails to pressure their teams into giving the “right” estimates, that can send a signal up that maybe the work can’t be done after all. Smart VPs and directors will try to avoid taking on technically impossible projects. Another exception to this is areas of the organization that senior leadership doesn’t really care about. In a sleepy backwater, often the formal estimation process does actually get followed to the letter, because there’s no director or VP who wants to jump in and shape the estimates to their ends. This is one way that some parts of a tech company can have drastically different engineering cultures to other parts. I’ll let you imagine the consequences when the company is re-orged and these teams are pulled into the spotlight. Estimates are political tools for non-engineers in the organization . They help managers, VPs, directors, and C-staff decide on which projects get funded and which projects get cancelled. 
The standard way of thinking about estimates is that you start with a proposed piece of software work, and you then go and figure out how long it will take. This is entirely backwards. Instead, teams will often start with the estimate, and then go and figure out what kind of software work they can do to meet it. Suppose you’re working on a LLM chatbot, and your director wants to implement “talk with a PDF”. If you have six months to do the work, you might implement a robust file upload system, some pipeline to chunk and embed the PDF content for semantic search, a way to extract PDF pages as image content to capture formatting and diagrams, and so on. If you have one day to do the work, you will naturally search for simpler approaches: for instance, converting the PDF to text client-side and sticking the entire thing in the LLM context, or offering a plain-text “grep the PDF” tool. This is true even at the level of individual lines of code. When you have weeks or months until your deadline, you might spend a lot of time thinking airily about how you could refactor the codebase to make your new feature fit in as elegantly as possible. When you have hours, you will typically be laser-focused on finding an approach that will actually work. There are always many different ways to solve software problems. Engineers thus have quite a lot of discretion about how to get it done. So how do I estimate, given all that? I gather as much political context as possible before I even look at the code. How much pressure is on this project? Is it a casual ask, or do we have to find a way to do this? What kind of estimate is my management chain looking for? There’s a huge difference between “the CTO really wants this in one week” and “we were looking for work for your team and this seemed like it could fit”. Ideally, I go to the code with an estimate already in hand.
Instead of asking myself “how long would it take to do this”, where “this” could be any one of a hundred different software designs, I ask myself “which approaches could be done in one week?”. I spend more time worrying about unknowns than knowns. As I said above, unknown work always dominates software projects. The more “dark forests” in the codebase this feature has to touch, the higher my estimate will be - or, more concretely, the tighter I need to constrain the set of approaches to the known work. Finally, I go back to my manager with a risk assessment, not with a concrete estimate. I don’t ever say “this is a four-week project”. I say something like “I don’t think we’ll get this done in one week, because X Y Z would need to all go right, and at least one of those things is bound to take a lot more work than we expect”. Ideally, I go back to my manager with a series of plans, not just one:

- We tackle X Y Z directly, which might all go smoothly but if it blows out we’ll be here for a month
- We bypass Y and Z entirely, which would introduce these other risks but possibly allow us to hit the deadline
- We bring in help from another team who’s more familiar with X and Y, so we just have to focus on Z

In other words, I don’t “break down the work to determine how long it will take”. My management chain already knows how long they want it to take. My job is to figure out the set of software approaches that match that estimate. Sometimes that set is empty: the project is just impossible, no matter how you slice it. In that case, my management chain needs to get together and figure out some way to alter the requirements. But if I always said “this is impossible”, my managers would find someone else to do their estimates. When I do that, I’m drawing on a well of trust that I build up by making pragmatic estimates the rest of the time. Many engineers find this approach distasteful. One reason is that they don’t like estimating in conditions of uncertainty, so they insist on having all the unknown questions answered in advance. I have written a lot about this in Engineers who won’t commit and How I provide technical clarity to non-technical leaders, but suffice to say that I think it’s cowardly.
If you refuse to estimate, you’re forcing someone less technical to estimate for you. Some engineers think that their job is to constantly push back against engineering management, and that helping their manager find technical compromises is betraying some kind of sacred engineering trust. I wrote about this in Software engineers should be a little bit cynical. If you want to spend your career doing that, that’s fine, but I personally find it more rewarding to find ways to work with my managers (who have almost exclusively been nice people). Other engineers might say that they rarely feel this kind of pressure from their directors or VPs to alter estimates, and that this is really just the sign of a dysfunctional engineering organization. Maybe! I can only speak for the engineering organizations I’ve worked in. But my suspicion is that these engineers are really just saying that they work “out of the spotlight”, where there’s not much pressure in general and teams can adopt whatever processes they want. There’s nothing wrong with that. But I don’t think it qualifies you to give helpful advice to engineers who do feel this kind of pressure. I think software engineering estimation is generally misunderstood. The common view is that a manager proposes some technical project, the team gets together to figure out how long it would take to build, and then the manager makes staffing and planning decisions with that information. In fact, it’s the reverse: a manager comes to the team with an estimate already in hand (though they might not come out and admit it), and then the team must figure out what kind of technical project might be possible within that estimate. This is because estimates are not by or for engineering teams. They are tools used by managers to negotiate with each other about planned work. Very occasionally, when a project is literally impossible, the estimate can serve as a way for the team to communicate that fact upwards. But that requires trust.
A team that is always pushing back on estimates will not be believed when they do encounter a genuinely impossible proposal. When I estimate, I extract the range my manager is looking for, and only then do I go through the code and figure out what can be done in that time. I never come back with a flat “two weeks” figure. Instead, I come back with a range of possibilities, each with their own risks, and let my manager make that tradeoff. It is not possible to accurately estimate software work. Software projects spend most of their time grappling with unknown problems, which by definition can’t be estimated in advance. To estimate well, you must therefore basically ignore all the known aspects of the work, and instead try and make educated guesses about how many unknowns there are, and how scary each unknown is.
edit: I should thank one of my readers, Karthik, who emailed me to ask about estimates, thus revealing to me that I had many more opinions than I thought.

1. For anyone wincing at that time, I mean like three minutes of actual deployment and twenty-seven minutes of waiting for checks to pass or monitors to turn up green.
2. I write a lot more about this in You can’t design software you don’t work on.
3. For instance, imagine a mandate to improve the performance of some large Rails API, one piece at a time. I could happily do that kind of work forever.
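The estimate-first workflow described in this post (start from the deadline, then ask which approaches fit it) can be sketched in a few lines of Python. Everything here is hypothetical: the `Approach` type, the day counts, the named unknowns, and the ranking by number of unknowns are illustrative assumptions loosely based on the “talk with a PDF” example, not a real process or tool.

```python
from dataclasses import dataclass, field


@dataclass
class Approach:
    name: str
    known_days: float  # rough size of the well-understood work (made-up numbers)
    unknowns: list[str] = field(default_factory=list)  # "dark forests" this approach touches


def approaches_within(deadline_days: float, approaches: list[Approach]) -> list[Approach]:
    """Keep approaches whose known work fits the deadline, fewest unknowns first."""
    fits = [a for a in approaches if a.known_days <= deadline_days]
    return sorted(fits, key=lambda a: len(a.unknowns))


# Hypothetical candidates for "talk with a PDF"; the figures are invented for illustration.
options = [
    Approach("upload + chunk/embed + page images", 120,
             ["file storage", "embedding pipeline", "PDF rendering"]),
    Approach("client-side text into the LLM context", 1, ["context length limits"]),
    Approach("plain-text 'grep the PDF' tool", 1, []),
]

if __name__ == "__main__":
    # The deadline comes first; the shortlist of approaches follows from it.
    for approach in approaches_within(1, options):
        risks = ", ".join(approach.unknowns) or "none identified"
        print(f"{approach.name} | unknowns: {risks}")
```

The result is a shortlist to take back as a risk assessment, not an estimate: the listed unknowns are still guesses, and they are the part that decides how the project goes.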

Sean Goedecke 3 months ago

Crypto grifters are recruiting open-source AI developers

Two recently-hyped developments in AI engineering have been Geoff Huntley’s “Ralph Wiggum loop” and Steve Yegge’s “Gas Town”. Huntley and Yegge are both respected software engineers with a long pedigree of actual projects. The Ralph loop is a sensible idea: force infinite test-time-compute by automatically restarting Claude Code whenever it runs out of steam. Gas Town is a platform for an idea that’s been popular for a while (though in my view has never really worked): running a whole village of LLM agents that collaborate with each other to accomplish a task. So far, so good. But Huntley and Yegge have also been posting about $RALPH and $GAS, which are cryptocurrency coins built on top of the longstanding Solana cryptocurrency and the Bags tool, which allows people to easily create their own crypto coins. What does $RALPH have to do with the Ralph Wiggum loop? What does $GAS have to do with Gas Town? From reading Huntley and Yegge’s posts, it seems like what happened was this:

- Some crypto trader created a “$GAS” coin via Bags, configuring it to pay a portion of the trading fees to Steve Yegge (via his Twitter account)
- That trader, or others with the same idea, messaged Yegge on LinkedIn to tell him about his “earnings” (currently $238,000), framing it as support for the Gas Town project
- Yegge took the free money and started posting about how exciting $GAS is as a way to fund open-source software creators

So what does $GAS have to do with Gas Town (or $RALPH with Ralph Wiggum)? From a technical perspective, the answer is nothing. Gas Town is an open-source GitHub repository that you can clone, edit and run without ever interacting with the $GAS coin. Likewise for Ralph. Buying $GAS or $RALPH does not unlock any new capabilities in the tools. All it does is siphon a little bit of money to Yegge and Huntley, and increase the value of the $GAS or $RALPH coins. Of course, that’s why these coins exist in the first place. This is a new variant of an old “airdropping” cryptocurrency tactic. The classic problem with “memecoins” is that it’s hard to give people a reason to buy them, even at very low prices, because they famously have no staying power. That’s why many successful memecoins rely on celebrity power, like Eric Adams’ “NYC Token” or the $TRUMP coin. But how do you convince a celebrity to get involved in your grift business venture? This is where Bags comes in.
Bags allows you to nominate a Twitter account as the beneficiary (or “fee earner”) of your coin. The person behind that Twitter account doesn’t have to agree, or even know that you’re doing it. Once you accumulate a nominal market cap (for instance, by moving a bunch of your own money onto the coin), you can then message the owner of that Twitter account and say “hey, all these people are supporting you via crypto, and you can collect your money right now if you want!” Then you either subtly hint that promoting the coin would cause that person to make more money, or you wait for them to realize it themselves 1 . Once they start posting about it, you’ve bootstrapped your own celebrity coin. This system relies on your celebrity target being dazzled by receiving a large sum of free money. If you came to them before the money was there, they might ask questions like “why wouldn’t people just directly donate to me?”, or “are these people who think they’re supporting me going to lose all their money?”. But in the warm glow of a few hundred thousand dollars, it’s easy to think that it’s all working out excellently. Incidentally, this is why AI open-source software engineers make such great targets. The fact that they’re open-source software engineers means that (a) a few hundred thousand dollars is enough to dazzle them 2 , and (b) their fans are technically-engaged enough to be able to figure out how to buy cryptocurrency. Working in AI also means that there’s a fresh pool of hype to draw from (the general hype around cryptocurrency being somewhat dry by now). On top of that, the open-source AI community is fairly small. Yegge mentions in his post that he wouldn’t have taken the offer seriously if Huntley hadn’t already accepted it. If you couldn’t tell, I think this whole thing is largely predatory. Bags seems to me to be offering crypto-airdrop-pump-and-dumps-as-a-service, where niche celebrities can turn their status as respected community figures into cold hard cash.
The people who pay into this are either taken in by the pretense that they’re sponsoring open-source work (in a way orders of magnitude less efficient than just donating money directly), or by the hope that they’re going to win big when the coin goes “to the moon” (which effectively never happens). The celebrities will make a little bit of money, for their part in it, but the lion’s share of the reward will go to the actual grifters: the insiders who primed the coin and can sell off into the flood of community members who are convinced to buy.

1. Bags even offers a “Did You Get Bagged? 💰🫵” section in their docs, encouraging the celebrity targets to share the coin, and framing the whole thing as coming from “your community”.
2. This isn’t a dig - that amount of money would dazzle me too! I only mean that you wouldn’t be able to get Tom Cruise or MrBeast to promote your coin with that amount of money.

0 views